Review
Abstract
Background: Artificial intelligence (AI) has the potential to revolutionize health care by enhancing both clinical outcomes and operational efficiency. However, its clinical adoption has been slower than anticipated, largely due to the absence of comprehensive evaluation frameworks. Existing frameworks remain insufficient and tend to emphasize technical metrics such as accuracy and validation, while overlooking critical real-world factors such as clinical impact, integration, and economic sustainability. This narrow focus prevents AI tools from being effectively implemented, limiting their broader impact and long-term viability in clinical practice.
Objective: This study aimed to create a framework for assessing AI in health care, extending beyond technical metrics to incorporate social and organizational dimensions. The framework was developed by systematically reviewing, analyzing, and synthesizing the evaluation criteria necessary for successful implementation, focusing on the long-term real-world impact of AI in clinical practice.
Methods: A search was performed in July 2024 across the PubMed, Cochrane, Scopus, and IEEE Xplore databases to identify relevant studies published in English between January 2019 and mid-July 2024, yielding 3528 results, among which 44 studies met the inclusion criteria. The systematic review followed PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines and the Cochrane Handbook for Systematic Reviews. Data were analyzed using NVivo through thematic analysis and narrative synthesis to identify key emergent themes in the studies.
Results: By synthesizing the included studies, we developed a framework that goes beyond the traditional focus on technical metrics or study-level methodologies. It integrates clinical context and real-world implementation factors, offering a more comprehensive approach to evaluating AI tools. With our focus on assessing the long-term real-world impact of AI technologies in health care, we named the framework AI for IMPACTS. The criteria are organized into seven key clusters, each corresponding to a letter in the acronym: (1) I—integration, interoperability, and workflow; (2) M—monitoring, governance, and accountability; (3) P—performance and quality metrics; (4) A—acceptability, trust, and training; (5) C—cost and economic evaluation; (6) T—technological safety and transparency; and (7) S—scalability and impact. These are further broken down into 28 specific subcriteria.
Conclusions: The AI for IMPACTS framework offers a holistic approach to evaluate the long-term real-world impact of AI tools in the heterogeneous and challenging health care context and lays the groundwork for further validation through expert consensus and testing of the framework in real-world health care settings. It is important to emphasize that multidisciplinary expertise is essential for assessment, yet many assessors lack the necessary training. In addition, traditional evaluation methods struggle to keep pace with AI’s rapid development. To ensure successful AI integration, flexible, fast-tracked assessment processes and proper assessor training are needed to maintain rigorous standards while adapting to AI’s dynamic evolution.
Trial Registration: reviewregistry1859; https://tinyurl.com/ysn2d7sh
doi:10.2196/67485
Keywords
Introduction
Background
Artificial intelligence (AI) is profoundly transforming health care across a range of applications, enhancing both clinical outcomes and operational efficiency. In medical imaging, AI algorithms improve diagnostic accuracy by analyzing complex imaging data, such as from magnetic resonance imaging and computed tomography scans, for highly precise and rapid clinical diagnostics [
]. Decision support systems powered by AI assist clinicians in making evidence-based decisions by providing real-time data-driven insights and predictive analytics [ ]. Large language models are increasingly used for generating detailed medical reports and streamlining triage processes by analyzing and summarizing patient data quickly and accurately [ ]. In addition, innovative digital health technologies such as electronic skins use wearable sensor technologies and AI to offer continuous, real-time monitoring of various health indicators, further enhancing personalized care [ ]. These advancements have the potential to contribute to more efficient, accurate, responsive, and holistic health care, reshaping how patient care is delivered and managed.

Despite the growing body of literature on AI in health care, its implementation has lagged behind that of other industries [
, ]. Previous studies have highlighted substantial barriers to the successful adoption of AI in health care, including issues related to trust; potential risks of harm; accuracy and perceived usefulness; reproducibility; evidentiary standards; and ethical, legal, and societal concerns [ , ]. In addition, uncertainty surrounding postadoption outcomes further complicates the implementation process [ ].

A significant barrier identified by health care leaders worldwide is that despite the emergence of various new frameworks for assessing AI in health care, most focus primarily on the quality of study methodologies or technical aspects [
, ]. There remains a lack of a comprehensive, systematic framework that assesses the real-world impact of AI and offers guidance on clinical implementation, monitoring, procurement, and evaluation [ , ]. Most research overlooks the complex, multistep process required for successful AI integration, leaving critical gaps in understanding how to effectively implement and sustain AI tools in clinical practice [ , ]. As a result, the adoption of AI in clinical practice has fallen short of expectations, with only a few algorithms showing sustained clinical impact [ ]. This gap is often due to inadequate or incomplete evaluation and the lack of universally recognized standards for AI assessment. The limited understanding of AI’s true added value in health care highlights the need for a more comprehensive evaluation framework [ - ]. To ensure confidence in the added clinical value and successful integration of AI into health care workflows, a practical, comprehensive tool is needed to evaluate the translational readiness of AI systems. Current approaches to assessing AI in health care often focus on foundational technical metrics such as sensitivity and specificity, which fail to capture the full clinical impact [ , ]. A robust evaluation should encompass factors such as patient outcomes, effects on clinical decision-making, workflow efficiency, and the tangible benefits for patients to fully determine AI’s true contribution to and impact on health care [ , , ].

In the context outlined earlier, regulatory approval is an important milestone for demonstrating overall performance, although the scientific evidence supporting AI tools in health care remains limited compared to traditional medical standards [
, ]. In addition, new regulations are being introduced to keep pace with rapidly evolving AI technologies, such as the European Union (EU) AI Act, which aims to ensure the trustworthiness of high-risk AI tools, including those used in health care [ ]. Despite the potentially positive impact of regulatory frameworks on AI-related developments, a recent study revealed that nearly half of Food and Drug Administration (FDA)–authorized AI devices lacked clinical validation data, raising concerns about their safety and effectiveness [ ]. Without robust clinical validation, these technologies could pose significant risks to patient care. Despite efforts to create reporting guidelines for AI in health care, such as Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence (SPIRIT-AI) [ ], CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) [ ], the Standards for Reporting of Diagnostic Accuracy Studies–Artificial Intelligence [ ], the Checklist for Artificial Intelligence in Medical Imaging [ ], the Prediction Model Risk of Bias Assessment Tool–Artificial Intelligence [ ], and others, a unified international consensus on the evaluation of AI-based tools has yet to be established. While these guidelines address key methodological issues and share significant overlap, indicating the importance of certain assessment criteria, the absence of a standardized, universally accepted framework remains a significant challenge [ ]. This lack of consensus complicates the consistent evaluation and implementation of AI technologies in clinical practice.

Objectives
The goal of this study was to develop a comprehensive framework for assessing the impact of AI tools in health care. This involved synthesizing and consolidating the various evaluation criteria found in existing literature regarding the quality and impact of AI tools. On the basis of the outcomes of this study, we plan on validating the framework through expert consensus using the Delphi process. However, this validation effort will be addressed in the subsequent phase of the project and is beyond the scope of this foundational paper. This approach aims to create a rigorous, evidence-based structure for AI evaluation, ensuring its relevance and applicability in health care settings.
In doing so, we adopted the perspective of the World Health Organization (WHO) on AI in health care, defining it as “the ability of algorithms and software to analyze complex medical data and support health care providers by improving decision-making, predicting outcomes, and enhancing clinical efficiency” [
]. AI tools in health care span a broad spectrum of applications, such as (1) diagnostic support, (2) prognosis of disease course, (3) personalized treatment recommendations, (4) patient monitoring, and (5) overall health management, driving innovation across the health care landscape [ ].

To address this, a systematic review was conducted to offer a comprehensive and current analysis of the criteria used in existing research to evaluate the quality and impact of AI in health care from technological, social, and organizational perspectives. The review also explores the potential implications of AI implementation for key stakeholders and offers recommendations on how to effectively assess the clinical impact of AI-powered clinical tools under consideration. This study builds upon and extends the findings of a previously published research project, which examined the sociotechnical assessment criteria for patient-facing eHealth tools [
, ].

We believe the results of this review will provide valuable insights for clinicians, pharmaceutical leaders, insurance professionals, technology providers, and policy makers by presenting an up-to-date, thorough overview of the criteria used to assess AI-powered clinical tools. These insights will help stakeholders make informed decisions about which tools to implement, recommend to patients, invest in, partner with, or provide reimbursement for, based on their assessed quality and potential impact.
Methods
Overview
The methodology for this review was based on established best practices, specifically following the PRISMA (Preferred Reporting Items for Systematic reviews and Meta-Analyses) guidelines [
] and the Cochrane Handbook for Systematic Reviews of Interventions [ ]. These frameworks were chosen to ensure a rigorous and methodologically sound approach to the systematic literature review process. All review methods were predetermined and documented in advance, with the protocol being publicly registered in the research registry (reviewregistry1859) to enhance transparency and accountability [ ]. The primary research question guiding this systematic review was the following: What technical, social, and organizational criteria should be considered when assessing the quality and impact of AI-powered clinical tools? This question served as the foundation for the analysis and exploration of the criteria relevant to AI’s evaluation in clinical settings. The study remained highly consistent with the initial protocol from a methodological standpoint, adhering to the predefined review question; search strategy; databases; inclusion and exclusion criteria; participants, intervention, comparators, and outcomes (PICO) framework elements; data extraction strategy; quality assessment; and data synthesis approach as originally outlined. The only variation from the protocol was in the presentation of the findings: rather than merely listing the results as an inventory of criteria, we organized them into a cohesive framework. This structured approach enhances both the memorability and practical applicability of the results in real-world settings.

Search Strategy
A comprehensive search of the PubMed, Cochrane, Scopus, and IEEE Xplore databases was conducted in July 2024 to identify relevant studies. The review was limited to peer-reviewed papers published in English between January 2019 and mid-July 2024. We focused on this specific time frame and limited the search to the last 5 years to ensure the findings reflect the most recent advancements and challenges, particularly with the emergence of new generative AI technologies. Going back further would have added limited value, as older studies may not capture the rapid technological shifts and evolving complexities that are relevant today. Only fully published research articles were included, while other formats, such as editorials and study protocols, were excluded from the analysis. In accordance with the Cochrane Handbook for Systematic Reviews of Interventions, we chose not to include articles sourced through manual reference list searches, as “positive studies are more likely to be cited,” which could introduce bias [
].

This systematic review focused on AI-powered tools designed specifically for clinicians, excluding tools meant solely for patients or medical students as these will most likely not reflect the implementation aspects in real-world health care organizations. The search strategy targeted manuscripts with titles including the terms “AI” or “Artificial Intelligence,” reflecting the intervention focus on AI technologies. Outcomes of interest were assessment criteria, captured through titles containing the terms “assessment,” “assess,” “evaluation,” “evaluating,” “effectiveness,” “efficacy,” “quality,” “efficiency,” “usability,” or “usefulness,” as well as abstracts mentioning “criteria,” “framework,” “method,” “methodology,” “methodologies,” “measurement,” “toolkit,” “tool,” “tools,” “approach,” or “scorecard.” No condition-based restrictions were applied, aligning with a broad approach to capture all relevant studies on assessment methodologies for clinician-targeted AI tools.
illustrates the search string designed using the PICO framework. To ensure the relevance of the retrieved papers, the search was mostly restricted to manuscript titles, focusing on studies that addressed AI assessment criteria comprehensively rather than those evaluating specific tools or pilot studies. Because comparators were not relevant to this review, they were excluded from the search parameters.
Participants: clinicians
- Focus on artificial intelligence (AI)–powered tools for clinicians, excluding those designed solely for patients or medical students.
Intervention: AI-powered clinician tools
- Focus is on AI-powered clinician tools: the search targeted manuscript titles containing the terms (AI OR “Artificial Intelligence”).
Comparator: not applicable
- There were no restrictions on eligible conditions for inclusion.
Outcome: assessment criteria
- The search targeted manuscript titles also containing AND (assessment OR assess OR evaluation OR evaluating OR effectiveness OR efficacy OR quality OR efficiency OR usability OR usefulness), as well as manuscript titles and abstracts containing AND (criteria OR framework OR method OR methodology OR methodologies OR measurement OR toolkit OR tool OR tools OR approach OR scorecard).
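For illustration, the PICO elements above combine into a Boolean query of roughly the following shape. This is a reconstruction from the terms listed in this textbox, not the verbatim syntax submitted to each database, which varies by platform:

```
TITLE: ("AI" OR "Artificial Intelligence")
AND TITLE: (assessment OR assess OR evaluation OR evaluating OR effectiveness
    OR efficacy OR quality OR efficiency OR usability OR usefulness)
AND TITLE-OR-ABSTRACT: (criteria OR framework OR method OR methodology
    OR methodologies OR measurement OR toolkit OR tool OR tools
    OR approach OR scorecard)
```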
Study Selection
Two researchers (CJ and EL) participated in the screening, eligibility, and inclusion phases of the study. Any discrepancies during these stages were resolved through discussion among them. If consensus could not be reached, a third coauthor was consulted to make the final decision. The team used the open-source Rayyan app (Qatar Computing Research Institute) to streamline collaborative screening efforts [
]. The screening process took place between July and August 2024.

The inclusion and exclusion criteria, outlined in
, were developed following the PICO framework. Included studies centered on AI-powered tools in clinical settings, addressing criteria to assess the quality and impact of these tools. Eligible studies were peer-reviewed, published between January 2019 and mid-July 2024, and written in English. Exclusions were made for studies involving only patients or medical students, as they were not likely to reflect implementation factors; AI technologies outside clinical settings (eg, chatbots used by patients); studies assessing specific tools in isolation; and frameworks solely evaluating AI research methodology or clinical trials rather than the implementation of the tools in real-world settings. Editorials, study protocols, and non-English publications were also excluded.

Following the completion of the screening process and resolution of any conflicting views among the researchers, CJ and EL proceeded to assess the full texts of the selected studies for eligibility. Any remaining disagreements were addressed through consultation with a third coauthor. CJ evaluated the risk of bias using the Critical Appraisal Skills Program (CASP) checklist [
], which assesses key quality criteria in the included studies. These criteria include the following: the presence of a clear statement of the research aims, the appropriateness of the methodology for the research objectives, the suitability of the research design in addressing those aims, the relevance of the recruitment strategy, the adequacy of data collection methods in relation to the research question, the consideration given to the researchers’ roles, the evaluation of ethical issues, the rigor of data analysis, the clarity of the study’s findings, and whether the researchers discussed the study’s contribution to existing knowledge, such as its implications for current practice, policy, or relevant literature. The results of this appraisal are available in .

Inclusion criteria
- Participants: focused on clinicians
- Intervention: focused on artificial intelligence (AI)–powered clinician tools
- Comparators: does not apply
- Outcomes: addresses the different criteria used to assess the quality and impact of AI-powered clinician tools regardless of the condition
- Publication type: peer-reviewed and published papers
- Time frame: studies published between January 2019 and mid-July 2024
- Language: studies published in English
Exclusion criteria
- Participants: focused solely on patients or medical students
- Intervention: technologies used outside of clinical environments, such as chatbots used by patients to obtain health care information
- Comparators: does not apply
- Outcomes: individual assessments of pilot studies singling out specific tools, and assessment frameworks that focus on the reporting and methodological quality of AI research and clinical studies rather than evaluating the AI tool itself
- Publication type: editorials and study protocols
- Time frame: studies published before January 2019 or after mid-July 2024
- Language: studies published in languages other than English
Data Collection and Synthesis
The procedures and outcomes across the included studies were too diverse to support a quantitative analysis. As a result, a narrative synthesis was used following the sociotechnical approach, organized around the social, organizational, and technical criteria used to evaluate the quality and impact of AI-powered tools for clinicians. The authors were influenced by the sociotechnical theory, which emphasizes that the design and performance of innovations can only be fully understood when both social and technical aspects are considered as interdependent components of a larger system [
]. This approach aligns with recommendations from several scholars who advocate for moving beyond purely technology-focused frameworks to incorporate the broader context, including societal and implementation factors [ - ]. To facilitate this process, NVivo (version 1.7.2; Lumivero), qualitative data analysis software, was used.

Data coding began with a preliminary extraction grid, which was structured around themes derived from previous research and established technology acceptance frameworks. The initial codebook was informed by our prior work on factors influencing eHealth evaluation and adoption [
, , - ], with additional codes being incorporated as new themes emerged during the review. Thematic analysis, as outlined by Braun and Clarke [ ], was conducted to identify and extract themes based on the social, technical, and organizational assessment criteria relevant to the research question. This analysis followed 7 key phases: familiarizing with the data, generating initial codes, searching for themes, reviewing themes, defining and naming themes, linking themes to explanatory frameworks, and producing the final report.

In line with the approach of Braun and Clarke [
], we opted not to use interrater reliability as it aligns more closely with quantitative methods and standardized interpretation. Thematic analysis in a qualitative context prioritizes depth, subjectivity, and the unique insights each researcher brings to the data. Rather than using numerical reliability measures such as interrater reliability, reliability in this approach is often ensured through collaborative discussions that allow for consensus and a nuanced understanding of the themes. Accordingly, the first author, CJ, conducted the initial analysis and coding, and NB reviewed the coding. Any cases of disagreement were discussed and mutually agreed upon in conjunction with a third author. Using the sociotechnical framework as our guide, we developed our initial codebook and grouped the criteria accordingly. This approach ensures a holistic evaluation of each tool, capturing the complex interdependencies between technical capabilities, social contexts, and organizational fit and readiness. By doing so, we moved beyond a narrow technical focus or methodological evaluation at the study level, ensuring that the social and organizational dimensions are fully integrated into the analysis. As a result, this work prioritizes the often-overlooked social and organizational dimensions that are critical for the successful implementation of AI technologies. Unlike frameworks that focus solely on clinical study quality, our analysis and synthesis specifically emphasize social and organizational factors such as user trust, support and training, interoperability, and integration.

However, we intentionally did not apply any hierarchy or prioritization within this foundational framework, as the purpose here is to treat all criteria as equally significant. Prioritization and potential gap identification will occur in the next phase (beyond the scope of this paper), where the Delphi process will engage an expert panel to further refine and prioritize these criteria. The coding and analysis process was carried out from August to October 2024.
Results
Study Selection Flow and Characteristics of the Included Studies
presents the PRISMA flow diagram, illustrating the progression of study selection during the systematic review. It details the number of records identified, screened, included, and excluded, along with reasons for exclusion. After applying these criteria, 44 articles were selected for the qualitative synthesis.

outlines the characteristics of these studies, offering insights into their research methodologies, geographic distributions, and clinical focuses. This comprehensive overview highlights the diversity of approaches and topics addressed within the included studies.
Study characteristics | Studies, n (%) | References
Country of authors
Multiple | 21 (48) | [ - ]
United States | 5 (11) | [ - ]
France | 3 (7) | [ - ]
Netherlands | 3 (7) | [ - ]
Australia | 2 (5) | [ , ]
Canada | 2 (5) | [ , ]
Others
China | 1 (2) | [ ]
Denmark | 1 (2) | [ ]
Germany | 1 (2) | [ ]
Greece | 1 (2) | [ ]
India | 1 (2) | [ ]
Saudi Arabia | 1 (2) | [ ]
Sweden | 1 (2) | [ ]
United Kingdom | 1 (2) | [ ]
Focus (some papers encompassed multiple areas of focus)
No specific focus | 9 (21) | [ , , , , , , , , ]
Clinical focus
Cardiovascular | 3 (7) | [ , , ]
Dermatology | 2 (5) | [ , ]
ENTa | 1 (2) | [ ]
Medical imaging | 12 (27) | [ , , , , , , , , , , , ]
Nuclear medicine | 1 (2) | [ ]
Radiation oncology | 1 (2) | [ ]
Technology focus
ANNb | 2 (5) | [ ]
CDSSsc | 3 (7) | [ , , ]
DQMsd | 1 (2) | [ ]
LLMse | 3 (7) | [ , , ]
MLf | 2 (5) | [ , ]
Prediction models | 2 (5) | [ , ]
Thematic focus
EEsg | 4 (9) | [ , , , ]
Ethics and equity | 3 (7) | [ , , ]
Explainability | 1 (2) | [ ]
Regulatory and trust | 2 (5) | [ , ]
Paper type
Original research
Delphi process | 3 (7) | [ , , ]
Survey or questionnaire | 2 (5) | [ , ]
Expert consensus | 3 (7) | [ , , ]
Expert perspective or comment | 9 (21) | [ , , , , , - , ]
Guidelines or statements | 6 (14) | [ , , , , , ]
Policy brief | 1 (2) | [ ]
Review | 10 (23) | [ , , , , , , , , , ]
Scoping review | 6 (14) | [ , , , , , ]
Systematic review | 4 (9) | [ , , , ]
Publication year
2019 (from January) | 2 (5) | [ , ]
2020 | 2 (5) | [ , ]
2021 | 5 (11) | [ , , , , , ]
2022 | 10 (23) | [ , , , , , , , , ]
2023 | 12 (27) | [ , , , , , , , , , , , , , ]
2024 (until mid-July) | 13 (30) | [ , , , , , , , , , , ]
Frameworks resulting from the included studies
ABCDSh | 1 (2) | [ ]
CHEERS-AIi | 1 (2) | [ ]
CLEARj | 1 (2) | [ ]
DQMk | 1 (2) | [ ]
DRIM France AI gridl | 1 (2) | [ ]
ECLAIRm | 1 (2) | [ ]
HEALn | 1 (2) | [ ]
MAS-AIo | 1 (2) | [ ]
RADARp | 1 (2) | [ ]
RELAINCE guidelinesq | 1 (2) | [ ]
R‑AI‑DIOLOGY checklistr | 1 (2) | [ ]
TEHAIs | 1 (2) | [ ]
TREEt | 1 (2) | [ ]
Frameworks used in or referred to in the included studies
CHEERSu | 3 (7) | [ , , ]
CLAIMv | 4 (9) | [ , , , ]
CONSORT-AIw | 6 (14) | [ , , , , , ]
DECIDE-AIx | 2 (5) | [ , ]
FUTURE-AIy | 1 (2) | [ ]
GEP-HIz | 1 (2) | [ ]
HTAaa | 6 (14) | [ , , , , , ]
MASTab | 2 (5) | [ , ]
PROBAST-AIac | 4 (9) | [ , , , ]
QAMAIad | 1 (2) | [ ]
QMSae | 1 (2) | [ ]
RQSaf | 1 (2) | [ ]
SPIRIT-AIag | 6 (14) | [ , , , , , ]
STARD-AIah | 3 (7) | [ , , ]
STARE-HIai | 1 (2) | [ ]
TRIPOD-AIaj | 7 (16) | [ , , , , , , ]
aENT: ear, nose, and throat.
bANN: artificial neural network.
cCDSS: clinical decision support system.
dDQM: diagnostic quality model.
eLLM: large language model.
fML: machine learning.
gEE: economic evaluation.
hABCDS: Algorithm-Based Clinical Decision Support.
iCHEERS-AI: Consolidated Health Economic Evaluation Reporting Standards for Interventions That Use Artificial Intelligence.
jCLEAR: Derm Consensus Guidelines from the International Skin Imaging Collaboration Artificial Intelligence Working Group.
kDQM: Diagnostic Quality Model.
lDRIM France AI grid: French community grid for the evaluation of radiological artificial intelligence solutions.
mECLAIR: Evaluating Commercial Artificial Intelligence Solutions in Radiology.
nHEAL: Health Equity Assessment of Machine Learning Performance.
oMAS-AI: Model for Assessing the Value of Artificial Intelligence in Medical Imaging.
pRADAR: Radiology Artificial Intelligence Deployment and Assessment Rubric.
qRELAINCE guidelines: Recommendations for Evaluation of Artificial Intelligence for Nuclear Medicine.
rR‑AI‑DIOLOGY checklist: a practical checklist for evaluation of artificial intelligence tools in clinical neuroradiology.
sTEHAI: Translational Evaluation of Healthcare Artificial Intelligence.
tTREE: transparency, reproducibility, ethics, and effectiveness.
uCHEERS: Consolidated Health Economic Evaluation Reporting Standards.
vCLAIM: Checklist for Artificial Intelligence in Medical Imaging.
wCONSORT-AI: Consolidated Standards of Reporting Trials–Artificial Intelligence.
xDECIDE-AI: Reporting Guideline for the Developmental and Exploratory Clinical Investigations of Decision Support Systems Driven by Artificial Intelligence.
yFUTURE-AI: International consensus guideline for trustworthy and deployable artificial intelligence in health care.
zGEP-HI: Good Evaluation Practice in Health Informatics.
aaHTA: Health Technology Assessment.
abMAST: Model for Assessment of Telemedicine.
acPROBAST-AI: Prediction Model Risk of Bias Assessment Tool–Artificial Intelligence.
adQAMAI: Quality Analysis of Medical Artificial Intelligence.
aeQMS: Quality Management System.
afRQS: Radiomics Quality Score.
agSPIRIT-AI: Standard Protocol Items: Recommendations for Interventional Trials–Artificial Intelligence.
ahSTARD-AI: Standards for Reporting of Diagnostic Accuracy Studies–Artificial Intelligence.
aiSTARE-HI: Statement on Reporting of Evaluation Studies in Health Informatics.
ajTRIPOD-AI: Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis–Artificial Intelligence.
Critical Appraisal
We evaluated the quality of the included studies using the CASP checklist [
]. This tool was selected due to the variety of methodologies used in the studies and the narrative approach of our synthesis, which differed from meta-analyses and other quantitative methods. The CASP is widely recognized as the most frequently used tool for appraising the quality of qualitative evidence in health research, with endorsement from the Cochrane Qualitative and Implementation Methods Group [ ]. The studies included in our review used a range of methodologies (quantitative, qualitative, mixed methods, and systematic literature reviews), which meant that some questions on the checklist were not applicable to all study types. As per the checklist’s recommendations, we did not assign scores to the studies.

Following the critical appraisal of the 44 studies, several issues were identified. While all studies clearly stated their aims, presented well-defined findings, and provided valuable insights for health care stakeholders, 21 of the 44 studies (48%) lacked a dedicated methods section, making it difficult to assess the appropriateness and suitability of their approach. Similarly, the absence of clear methods in these studies hindered the evaluation of the research design and data collection techniques.
In addition, 25 of the 44 studies (57%) did not detail their analysis methods, making it challenging to gauge the rigor and reliability of their approach. Furthermore, 28 (64%) studies lacked validation of their findings, while 8 (18%) offered only partial validation (eg, expert consensus), highlighting the need for empirical validation in real-world clinical applications to ensure the findings’ robustness. The comprehensive quality assessment of the included studies can be found in .

Studies were not excluded based on the results of the quality assessment, as this was unlikely to significantly impact the definition of the assessment criteria or the development of the aggregated framework. However, the quality assessment offered valuable insight into the overall robustness of the development processes behind the existing frameworks, helping to gauge the strength and reliability of the evidence presented [
]. An in-depth exploration of this topic can be found in the Discussion section, where the challenges associated with current initiatives and frameworks are examined.Synthesized Assessment Criteria
We synthesized comparable measures from various papers, frameworks, and initiatives, ultimately identifying a set of unique criteria that reflected all relevant assessment methods referenced in the included studies. Notably, several criteria are closely interrelated and could fit into multiple categories; however, they were placed in the most appropriate category based on their significance and impact. For instance, while “user trust” and “model explainability” are inherently linked, because trust often correlates with the level of explainability provided by an AI system, we categorized trust under the cluster “acceptability, trust, and training,” which focuses on user-centric aspects, whereas “explainability” was assigned to the cluster evaluating model performance metrics, given its technical focus. In addition, we intentionally included assessment criteria applicable to high-risk tools, enabling us to compile a more comprehensive list. We recognized that not all criteria would apply to lower-risk AI-powered health care tools, such as patient safety assessments, which are more relevant to high-risk tools that pose potential safety concerns. We were guided by the National Institute for Health and Care Excellence’s Evidence Standards Framework for Digital Health Technologies to assess and understand the risk levels of health care technologies [
].

provides a visual overview of the aggregated criteria, organized into clusters and subclusters, while presents these criteria grouped into 7 primary clusters and their respective subcriteria, outlining their occurrences across the included studies, along with their definitions and corresponding references. A detailed exploration of each criteria cluster and its corresponding subcriteria is provided in the Discussion section.

Criteria | Definition | Studies, n (%) | Studies in which the criteria occurred
Integration
Infrastructure | | 15 (34) | [ , , , - , , , , , , , , , ]
Interoperability | | 19 (43) | [ , , , - , , , - , , - , ]
Workflow and organizational changes | | 22 (50) | [ , , , , - , , , , - , , , , , , ]
Monitoring, governance, and accountability
Accountability and liability | | 13 (30) | [ - , , , - , , , , , ]
Consent and data ownership | | 5 (11) | [ , , , , ]
Maintenance and updates | | 13 (30) | [ , , , , , , , , , , , , ]
Monitoring and governance | | 22 (50) | [ , - , , , , , , , , , , , , ]
Regulatory compliance | | 23 (52) | [ , - , , , - , - , , , , , , , ]
Security and privacy | | 26 (59) | [ , - , , , - , - , , , - ]
Performance quality metrics
Accuracy, sensitivity, and specificity (foundational metrics) | | 26 (59) | [ , , - , , , , , , , , , , , - ]
Explainability and interpretability (ethics and trustworthiness) | | 19 (43) | [ , - , , , , , , , , , - , , , , ]
Fairness (equity) | | 32 (73) | [ - , , - , , , - , , - , , , , , ]
Reliability, repeatability, and reproducibility (consistency and stability) | | 24 (55) | [ , , , , - , - , , , , , , , , , , ]
Robustness and generalizability (adaptability) | | 23 (52) | [ - , , , , - , , , , , , , , , , , ]
Imaging-focused | | 10 (23) | [ , - , , , , , ]
Large language model-focused | | 3 (7) | [ , , ]
Acceptability, trust, and training
Acceptance and adoption | | 18 (41) | [ , , - , , , , , - , , , , , , ]
Training and support | | 17 (39) | [ , , - , , , , , , , , - , ]
Trust | | 11 (25) | [ , , , , , , , , , , ]
Usability | | 18 (41) | [ , - , , , , , , , , - , , ]
User centricity (user, domain, and task type) | | 19 (43) | [ , , , , , , , , , , , , , , - ]
Cost and economic evaluation
Costs and economic evaluation in general | | 18 (41) | [ , , , , - , , - , , , , , , , ]
Cost-effectiveness analysis | | 12 (27) | [ , , , - , , - , ]
Cost-minimization analysis | | 5 (11) | [ , , , , ]
Cost-utility analysis | | 3 (7) | [ , , ]
Technological safety and transparency
Safety | | 26 (59) | [ , , - , , - , , , , , , , , ]
Transparency | | 27 (61) | [ , , , , , - , , - , - , , , , , - , , , ]
Ethical oversight, human in command | | 14 (32) | [ , , , , , , , , , , , , , ]
Scalability and impact
Clinical effectiveness | | 26 (59) | [ - , - , - , , - , - , - , , , , , ]
Clinical efficiency | | 8 (18) | [ , , , , , , , ]
Clinical utility | | 14 (32) | [ - , , , , , , , , , , , ]
Environmental impact | | 1 (2) | [ ]
aAI: artificial intelligence.
With our focus on assessing the long-term real-world impact of AI technologies in health care, we named the framework AI for IMPACTS. The criteria were organized into seven key clusters, each corresponding to a letter in the acronym: (1) I — integration, interoperability, and workflow; (2) M — monitoring, governance, and accountability; (3) P — performance and quality metrics; (4) A — acceptability, trust, and training; (5) C — cost and economic evaluation; (6) T — technological safety and transparency; and (7) S — scalability and impact.
Discussion
Principal Results
Through our systematic review of the literature, which culminated in the inclusion of 44 relevant papers, we conducted a narrative synthesis guided by the sociotechnical framework. This synthesis identified and categorized the key technical, social, and organizational criteria critical for the practical and effective implementation of AI technologies in health care. The results are organized into 7 main clusters, further divided into 28 specific subcriteria, providing a structured framework to address the multifaceted considerations highlighted in the reviewed literature.
By synthesizing and aggregating the assessment criteria from all included studies, we developed the AI for IMPACTS framework. This framework goes beyond focusing solely on technical metrics or methodological guidance at the study level. It integrates the clinical context and real-world implementation factors to ensure AI tools are evaluated holistically. Most criteria in our proposed framework can be aligned with existing frameworks, but none covers all relevant categories without extensions. For successful AI implementation in health care, it is essential to integrate these tools within the broader organizational context. Frameworks should account for the complexities of the sociotechnical environment, recognizing the interplay between technical, social, and organizational dimensions. Our consolidated framework achieves this by synthesizing and expanding existing frameworks for AI assessment in health care. It uses a sociotechnical approach to consider all contextual factors, their interactions, and the long-term real-world impact of these technologies in clinical practice.
The sociotechnical theory, which emphasizes the dynamic interplay between social, organizational, and technical aspects, provides a holistic approach to evaluating novel technologies [
]. This is critical in health care, where the successful implementation of novel technologies requires a balance of these factors to optimize both technology adoption and clinical outcomes [ ]. Each component of the AI for IMPACTS framework reflects this sociotechnical foundation, as described below.

- I: integration, interoperability, and workflow — sociotechnical theory stresses the need for alignment between technology and workflow. This criteria cluster ensures that AI tools integrate seamlessly within existing systems and workflows, minimizing disruptions and supporting health care professionals in their work.
- M: monitoring, governance, and accountability — governance structures are vital for ensuring AI applications adhere to clinical standards and ethical norms. The sociotechnical theory supports the need for oversight that considers not just technical capabilities but also social and organizational responsibilities, promoting accountability in decision-making.
- P: performance and quality metrics — effective AI assessment requires robust performance metrics that span technical and clinical outcomes. By applying sociotechnical principles, this criteria cluster ensures that quality standards are met in ways that resonate with both technical requirements and patient care priorities.
- A: acceptability, trust, and training — for AI to be widely adopted, it must be trusted and understood by users. The sociotechnical theory emphasizes the role of social factors such as trust and user training, which are essential for fostering acceptance among health care providers and patients.
- C: cost and economic evaluation — costs are a key concern in health care. The sociotechnical approach underscores the importance of evaluating not just technical implementation costs but also the economic implications for patients and health care systems, ensuring that AI tools are financially sustainable and valuable (see the illustrative cost-effectiveness sketch after this list).
- T: technological safety and transparency — safety and transparency are core to AI in health care, as they directly affect user trust and patient safety. The sociotechnical theory highlights that these technical attributes must be coupled with transparent communication and organizational processes that make AI’s functioning understandable and dependable.
- S: scalability and impact — sociotechnical principles stress adaptability within complex systems. This criteria cluster considers how AI can be scaled effectively across diverse health care settings, evaluating both technical scalability and the social and organizational impact for expansion.
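To make the economic evaluation criterion concrete, the following is a minimal, hypothetical worked example of an incremental cost-effectiveness ratio (ICER), one common cost-effectiveness measure. All figures are invented for illustration and are not drawn from the included studies:

```python
# Hypothetical incremental cost-effectiveness ratio (ICER) calculation.
# ICER = (cost_new - cost_usual) / (effect_new - effect_usual),
# with effects expressed in quality-adjusted life years (QALYs).
cost_ai, cost_usual = 1_150_000.0, 1_000_000.0  # annual pathway costs (invented)
qaly_ai, qaly_usual = 520.0, 500.0              # QALYs gained per year (invented)

icer = (cost_ai - cost_usual) / (qaly_ai - qaly_usual)
print(f"ICER = {icer:,.0f} per QALY gained")    # 7,500 per QALY in this example

# The tool would be judged cost-effective if the ICER falls below the payer's
# willingness-to-pay threshold, which varies widely across health systems.
```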
By leveraging the sociotechnical theory, the AI for IMPACTS framework ensures that each criterion is evaluated in a way that respects the complex interdependencies between technical capabilities, social context, and organizational readiness, providing a balanced and comprehensive approach to AI assessment in health care. We selected the acronym IMPACTS to underscore our emphasis on real-world outcomes over isolated, study-level evaluations. This highlights our commitment to assessing the broader, practical effects in health care settings.
depicts the 7 assessment clusters of the AI for IMPACTS framework. Each cluster contains multiple subcriteria, all of which are summarized in a comprehensive checklist presented in . The framework provides a systematic approach for evaluating AI’s holistic role and potential in health care applications. The following subsections provide a detailed analysis of each criteria cluster and their respective subcriteria, offering a comprehensive breakdown of how each factor contributes to the overall assessment.

Criteria | Assessment
Integration
Infrastructure | Does the deployment and scalability of the AIa tool require additional technological, hardware, or software infrastructure beyond what is already available in the current clinical setting?
Interoperability | Does the AI tool seamlessly integrate and exchange data with various health care platforms and devices, ensuring interoperability across different systems without requiring significant modifications?
Workflow and organizational changes | Does the AI tool integrate smoothly into existing clinical workflows and health care operations, minimizing disruption while enhancing efficiency, communication, and the overall delivery of care?
Monitoring, governance, and accountability
Accountability and liability | Is there clear attribution of responsibility for errors or outcomes, supported by well-defined legal and ethical frameworks that ensure accountability and proper recourse in the event of any issues?
Consent and data ownership | Does the AI tool have clear and robust processes for obtaining informed consent from patients, including transparent policies on data ownership, privacy, and control, ensuring patients fully understand how their data will be used?
Maintenance and updates | Does the AI tool have established processes for ongoing support, including regular updates and bug fixes, to ensure it remains effective, secure, and compliant with evolving medical standards and practices?
Monitoring and governance | Does the AI tool have systems in place for ongoing oversight of its performance, including regular assessments and audits to ensure ethical use, effectiveness, and adherence to relevant standards?
Regulatory compliance | Does the AI tool demonstrate adherence to established regulations throughout its entire life cycle, with systems in place for ongoing monitoring and reporting postdeployment to ensure continued safety, efficacy, and compliance with legal requirements?
Security and privacy | Does the AI tool have robust measures in place to protect sensitive patient data from unauthorized access and breaches, while ensuring full compliance with relevant privacy regulations?
Performance quality metrics
Foundational metrics | These are application-specific metrics to ensure each tool is assessed appropriately based on its function:
Explainability (ethics and trustworthiness) | Is the AI tool able to clearly show how it reached a specific decision or prediction in a way that clinicians can understand?
Interpretability | Is it easy for clinicians to understand the relationship between the input data and the AI tool’s outputs, without needing detailed technical explanations?
Fairness (equity) | Does the AI tool ensure fairness by avoiding systematic discrimination against any specific group, such as race, gender, or socioeconomic status, and promoting equitable outcomes in diagnoses and treatments?
Reliability, repeatability, and reproducibility (consistency and stability) | Does the AI tool demonstrate reliability, repeatability, and reproducibility by consistently delivering the same results over time, under similar conditions, and when applied to different data sets or used by different teams?
Robustness and generalizability (adaptability) | Does the AI tool demonstrate both robustness and generalizability by maintaining strong performance despite variations or noise in input data, and by performing well on new, unseen data from different hospitals or regions compared to its training data?
Acceptability, trust, and training
Acceptance and adoption | Does the AI tool demonstrate strong acceptance by health care professionals and patients, including their willingness to adopt and integrate it into routine clinical practice?
Training and support | Does the AI tool provide comprehensive and readily available resources for users, ensuring they have the necessary guidance, training, and assistance to successfully implement and operate it in clinical practice?
Trust | Does the AI tool inspire trust among health care professionals and patients in terms of its reliability, accuracy, and ethical considerations, thereby positively influencing their willingness to use it?
Usability | Does the AI tool offer an intuitive and user-friendly interface that allows health care professionals and patients to interact with it easily and effectively, ensuring it enhances the user experience and integrates smoothly into clinical workflows?
User centricity (user, domain, and task type) | Does the AI tool effectively meet the specific needs, preferences, and contexts of its users, while addressing domain-specific requirements and supporting the relevant tasks for which it is intended?
Cost and economic evaluation
Costs and economic evaluation | Does the AI tool provide financial value by enhancing care without imposing excessive costs on health care systems or patients, ensuring that its implementation is economically sustainable? This can be measured using one or more of the following methods:
Technological safety and transparency
Safety | Does the tool reliably adhere to clinical standards, consistently mitigate potential risks, and demonstrate the ability to avoid causing harm to patients through reliable operation and risk management?
Transparency | Does the AI tool provider ensure transparency by making its processes, decision-making logic, and data sources understandable and accessible to all relevant stakeholders?
Ethical oversight, human in command | Does the AI tool incorporate ethical oversight by ensuring that it supports human decision-making, allowing clinicians to maintain control and override AI-generated decisions, when necessary, thereby complementing rather than replacing human judgment?
Scalability and impact
Clinical effectiveness | Does the AI tool demonstrate clinical effectiveness by consistently achieving the desired clinical outcomes in real-world practice, across diverse patient populations and health care settings?
Clinical efficiency | Does the AI tool demonstrate clinical efficiency by optimizing the use of resources, including time, staff, and costs, to effectively deliver care without compromising quality?
Clinical utility | Does the AI tool demonstrate clinical utility by offering practical benefits that improve patient care, such as guiding clinical decision-making or reducing risks during treatment?
Environmental impact | Does the AI tool minimize its environmental impact by considering sustainability in its development, deployment, and operation, including factors such as energy consumption and carbon footprint?
aAI: artificial intelligence.
Integration
This criteria cluster focuses on evaluating how effectively the AI tool integrates into existing clinical workflows and health care systems.
Infrastructure plays a crucial role in the successful implementation of AI tools in health care settings. Adequate computational power, specialized hardware, and robust IT infrastructure are often necessary to support the processing of large datasets and the operational demands of AI technologies [
, ]. This may include advanced components such as graphics processing units, which are not always standard in health care systems [ ]. In addition, integrating these tools might require significant investment in new hardware or upgrades [ , ]. For cloud-based AI solutions, attention must be paid to network security and performance [ ]. Ensuring infrastructure compatibility is essential for the smooth deployment and optimal functionality of AI in health care [ , ].

Interoperability ensures seamless integration with existing systems, such as electronic health records and imaging software. It allows AI tools to operate within current workflows without disrupting established clinical processes, enhancing data exchange across platforms [
, ]. It also ensures that AI tools adhere to industry standards, facilitating communication between different health care technologies and minimizing issues such as data misinterpretation or workflow inefficiencies [ ]. Proper integration can reduce the resource burden on health care facilities and improve the overall usability and effectiveness of AI systems in diverse clinical settings [ ].

Understanding the impact on clinical workflows and organizational structures is essential. AI tools must be seamlessly integrated into workflows to avoid disrupting clinical processes [
, ]. Evaluating how AI affects the redistribution of tasks among health care professionals and identifying necessary organizational changes are essential [ , ]. Poor integration or failure to align with clinical routines can negatively impact efficiency, increase cognitive burdens, and require significant resources to adapt systems [ , ].
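To ground the interoperability criterion above, the following is a minimal sketch of standards-based data exchange: reading a coded laboratory Observation from an HL7 FHIR R4 server. The endpoint and resource ID are placeholders, and error handling is reduced to the essentials:

```python
# Minimal sketch: retrieving a FHIR R4 Observation so an AI tool consumes
# coded, vendor-neutral data instead of proprietary EHR exports.
# The base URL and resource ID below are placeholders, not a real server.
import requests

FHIR_BASE = "https://ehr.example.org/fhir"  # placeholder FHIR endpoint

resp = requests.get(
    f"{FHIR_BASE}/Observation/12345",
    headers={"Accept": "application/fhir+json"},
    timeout=10,
)
resp.raise_for_status()
obs = resp.json()

# Standards-conformant tools read coded fields (eg, a LOINC code) rather than
# free text, so the same logic works across different vendors' systems.
coding = obs["code"]["coding"][0]
quantity = obs.get("valueQuantity", {})
print(coding.get("system"), coding.get("code"),
      quantity.get("value"), quantity.get("unit"))
```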
Monitoring, Governance, and Accountability

This criteria cluster focuses on evaluating how effectively the AI tool is monitored throughout its life cycle, addressing critical aspects such as model drift, data governance, and adherence to ethical standards.
Clarity on accountability and liability is essential when assessing AI tools in health care due to the potential risks involved in their implementation [
, ]. AI systems can make errors or offer recommendations that may not be followed by clinicians, raising complex questions about who is responsible when mistakes occur [ , ]. The lack of clear guidelines on whether liability lies with the developer, the health care institution, or the clinician using the tool poses significant legal and ethical concerns [ , ]. Proper assessment frameworks must ensure that accountability is well-defined, including clear roles for all stakeholders involved (eg, clinicians, developers, and institutions), particularly in cases of adverse events or errors [ , , ].

Data security, privacy, informed consent, and data ownership are vital criteria for assessing AI tools in health care. These tools often require large amounts of sensitive patient data, which must be protected from unauthorized access, breaches, or misuse [
, ]. Ensuring compliance with relevant regulations, such as the General Data Protection Regulation or the Health Insurance Portability and Accountability Act, is essential to safeguard patient privacy [ , , ]. In addition, clear processes for obtaining informed consent are critical, ensuring that patients understand how their data will be used [ , ]. Proper data ownership policies must also be in place, ensuring transparency around who controls the data and how it can be accessed or shared [ , ]. These measures are crucial for building trust and ensuring ethical AI deployment in health care settings [ , ].

Regulatory compliance and certification are essential but insufficient assessment criteria for AI tools in health care [
]. Although regulatory mechanisms such as FDA authorization in the United States and CE marking in the EU set minimum safety and efficacy standards, there are significant gaps between legal certification and real-world clinical validation, workflow integration, and ongoing use [ , ]. For instance, FDA clearance does not always assure users that an AI tool will meet their expectations for effective performance in all clinical settings, leading to skepticism among health care professionals [ , ]. Similarly, in the EU, AI tools with CE marking are often assumed to be clinically validated, but many lack sufficient validation for real-world clinical use, such as in dementia diagnosis via magnetic resonance imaging [ , ]. These gaps highlight the need for stronger regulatory frameworks and postmarket surveillance to ensure AI tools are not only certified but also thoroughly validated and integrated into health care workflows for effective and safe use [ , , ].

Monitoring and governance mechanisms, including feedback loops, are critical for ensuring the continued safety, effectiveness, performance, and reliability of AI tools in health care [
]. It is essential that the responsibility for monitoring these tools is shared between the developer, regulator, and the health care organization deploying the tool [ ]. Developers are responsible for ongoing performance evaluations, including regular updates to address issues such as data drift or algorithmic failure [ , ]. Regulators must ensure compliance with postmarket surveillance requirements and set clear guidelines for monitoring practices [ , ]. Health care organizations must implement local oversight systems, ensuring that the AI tool continues to meet clinical needs without causing disruption or harm [ , , , , ]. By assigning responsibility to all 3 entities, health care systems can ensure comprehensive, multi-layered oversight that addresses technical, clinical, and regulatory concerns [ ].

The maintenance and updating of AI tools are critical to ensuring their continued effectiveness and safety in health care [
]. Regular updates, including adjustments to algorithms and reference datasets, are essential to avoid performance degradation and ensure accurate results [ , ]. Without proper maintenance, different software versions could introduce biases or inconsistencies, which might affect clinical outcomes [ , ]. Establishing clear protocols for updates, including version control and procedures for managing software changes, ensures that AI tools remain reliable and aligned with current medical standards, safeguarding patient care [ ].
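As one concrete illustration of the monitoring criteria above, a deployment team might routinely compare the live distribution of a model input against its training-time reference to flag possible data drift. The following is a minimal sketch using a two-sample Kolmogorov-Smirnov test; the feature, the synthetic data, and the alert threshold are all invented for illustration:

```python
# Illustrative post-deployment drift check: compare recent production inputs
# against the training-time reference distribution for one feature.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
reference = rng.normal(loc=120, scale=15, size=5000)  # eg, systolic BP at training time
live = rng.normal(loc=128, scale=15, size=500)        # recent production inputs (shifted)

stat, p_value = ks_2samp(reference, live)
if p_value < 0.01:  # alert threshold chosen purely for illustration
    print(f"Possible data drift (KS={stat:.3f}, P={p_value:.4f}); "
          "escalate per the monitoring and governance plan.")
```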
Performance Quality Metrics

This criteria cluster focuses on evaluating the performance and quality of the AI tool by assessing key metrics such as foundational performance metrics, fairness, explainability, reliability, and robustness.
Foundational performance metrics play a crucial role in assessing the effectiveness of AI tools. The systematic review revealed that 59% (26/44) of studies primarily focused on accuracy, sensitivity, and specificity as key metrics. However, it is essential to consider application-specific metrics when evaluating AI performance, as different AI tools require tailored measures depending on their intended use. For example, diagnosis and prediction tools encompass applications like classification (eg, disease diagnosis), regression (eg, predicting disease progression), anomaly detection, and recommendation systems. These tools can be assessed through metrics such as accuracy, sensitivity, specificity, and the area under the curve for classification tasks [ , , ] and mean absolute error and root mean square error for regression tasks [ , ]. Image and pattern analysis covers tasks such as image segmentation and reinforcement learning, using metrics like the Dice coefficient and Jaccard index for segmentation accuracy [ , ], and cumulative reward for evaluating reinforcement learning performance [ ]. On the other hand, text and language processing applications, such as natural language processing and large language models, are assessed using metrics like relevance, engagement, empathy, token limits, hallucination rates, memory efficiency, and floating-point operation count [ , , ]. These metrics ensure the AI tool is properly evaluated based on its intended use and technology type.
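As a concrete illustration, the sketch below computes several of these metrics with scikit-learn; the arrays are hypothetical placeholders, and the 0.5 decision threshold is an assumption rather than a recommendation.

```python
# Illustrative sketch of application-specific metrics (hypothetical data).
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    jaccard_score,
    mean_absolute_error,
    mean_squared_error,
    roc_auc_score,
)

# Classification (eg, disease diagnosis): accuracy, sensitivity, specificity, AUC
y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])                   # ground-truth labels
y_score = np.array([0.9, 0.2, 0.4, 0.6, 0.7, 0.1, 0.8, 0.3])  # model probabilities
y_pred = (y_score >= 0.5).astype(int)                         # thresholded labels
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"accuracy={accuracy_score(y_true, y_pred):.2f}",
      f"sensitivity={tp / (tp + fn):.2f}",
      f"specificity={tn / (tn + fp):.2f}",
      f"AUC={roc_auc_score(y_true, y_score):.2f}")

# Regression (eg, predicting disease progression): MAE and RMSE
y_true_r = np.array([3.1, 2.4, 5.0, 4.2])
y_pred_r = np.array([2.9, 2.8, 4.6, 4.5])
print(f"MAE={mean_absolute_error(y_true_r, y_pred_r):.2f}",
      f"RMSE={np.sqrt(mean_squared_error(y_true_r, y_pred_r)):.2f}")

# Segmentation (eg, lesion masks): Jaccard index, and Dice via 2J / (1 + J)
mask_true = np.array([1, 1, 0, 1, 0, 0])
mask_pred = np.array([1, 0, 0, 1, 1, 0])
j = jaccard_score(mask_true, mask_pred)
print(f"Jaccard={j:.2f}", f"Dice={2 * j / (1 + j):.2f}")
```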
Explainability and interpretability are essential for ensuring the ethical and trustworthy use of AI tools in health care. These criteria allow health care professionals to understand how AI models arrive at their conclusions, fostering trust in their recommendations [ , ]. Explainability helps to demystify the AI’s decision-making process, making it transparent and accessible to users [ , ]. This, in turn, improves adoption, as clinicians are more likely to trust and rely on AI tools that are interpretable [ , ]. Ultimately, clear explainability supports ethical deployment, reducing risks associated with “black box” systems [ , ].

Fairness or equity ensures that AI models provide unbiased, consistent performance across diverse demographic groups, including those defined by race, gender, age, or socioeconomic status [ , , ]. This criterion addresses the risk of bias in training data, including sample size and representativeness, which can lead to unequal treatment or outcomes for underrepresented populations [ , , ]. By focusing on fairness, AI tools can avoid perpetuating disparities and contribute to more equitable health care delivery for all patients [ , , ].
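One simple operationalization of this criterion is to compare core performance metrics across subgroups; the sketch below contrasts subgroup sensitivities on hypothetical data, in the spirit of an equal-opportunity check (the data and grouping variable are illustrative assumptions).

```python
# Minimal subgroup-parity sketch (hypothetical labels, predictions, and groups).
import numpy as np

y_true = np.array([1, 0, 1, 1, 0, 1, 0, 1])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])
group = np.array(["A", "A", "A", "A", "B", "B", "B", "B"])  # eg, demographic label

for g in np.unique(group):
    idx = group == g
    tp = np.sum((y_true[idx] == 1) & (y_pred[idx] == 1))
    fn = np.sum((y_true[idx] == 1) & (y_pred[idx] == 0))
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    print(f"group {g}: sensitivity={sens:.2f}")
# A large gap between subgroup sensitivities flags potential bias for review.
```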
Reliability, repeatability, and reproducibility ensure that the AI tool can produce consistent outputs when presented with similar inputs, is repeatable under identical conditions, and is reproducible in diverse environments, including different institutions or patient populations [ , , , ]. Maintaining consistency and stability is essential for the tool’s trustworthiness and its broader applicability in real-world health care scenarios [ , ].

Robustness and generalizability are essential criteria for assessing the adaptability of AI tools in health care [ , ]. Robustness ensures the tool can maintain high performance even when exposed to slight variations in input data or operational environments [ , ]. Generalizability, on the other hand, evaluates whether the AI tool can effectively perform across different populations, clinical settings, or geographic regions beyond the environment in which it was trained [ , ]. These criteria ensure that AI tools remain reliable and effective when scaled or applied to diverse health care contexts [ , ].
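Robustness of this kind is often approximated by re-scoring slightly perturbed inputs and measuring how stable the outputs remain; the sketch below does this for a stand-in model (the model, data, and noise level are all illustrative assumptions).

```python
# Minimal robustness probe (hypothetical model and data; noise level assumed).
import numpy as np

def predict(x):                      # stand-in for a trained model
    return (x.sum(axis=1) > 0).astype(int)

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 5))
baseline = predict(X)

noisy = X + rng.normal(scale=0.1, size=X.shape)   # slight input perturbation
agreement = np.mean(predict(noisy) == baseline)
print(f"Prediction agreement under perturbation: {agreement:.1%}")
```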
Acceptability, Trust, and Training
This criteria cluster evaluates user-centric aspects of the AI tool, focusing on its acceptance, trustworthiness, and the adequacy of user training and support.
User acceptance and adoption are crucial for the successful implementation and translation of AI-powered health tools in real-life settings [
, , ]. Key challenges include fostering trust and confidence among health care professionals, ensuring ease of use, and integrating these tools seamlessly into clinical workflows [ ]. User acceptance depends significantly on the perceived benefits, transparency, and safety of the AI systems [ , , ]. Moreover, ethical concerns, the potential for bias, and the need for comprehensive testing also impact adoption [ ]. Clinicians are more likely to embrace these tools when they complement human expertise and are introduced with adequate training and support, ensuring they enhance patient outcomes without compromising safety [ ]. User acceptance and adoption of technology are typically measured through surveys (eg, the Technology Acceptance Model and the Unified Theory of Acceptance and Use of Technology) assessing factors like perceived usefulness and ease of use, as well as use metrics such as adoption rates, frequency, and retention.

Trust is built through factors such as validation, transparency, safety, privacy, and interpretability of the AI tool [ , ].
]. Both health care professionals and patients must trust that the AI tool is reliable, safe, and effective in clinical practice [ , ]. Validating AI performance using local data is essential to build clinician confidence, while demonstrating that the tool adheres to rigorous standards helps address concerns about its real-world application [ , ]. Trust also influences adoption, making it vital for the successful implementation of AI tools in health care [ ]. User trust in technology is commonly assessed through surveys and trust scales, such as the Technology Trust Index, which evaluate key dimensions like reliability, competence, transparency, and security. Behavioral metrics, including use patterns and reliance during critical tasks, offer additional insights into how trust manifests in practice.User centricity emphasizes the need for a clear understanding of the intended users, domain, and specific tasks the AI tool is designed to support [
AI tools must be tailored to meet the unique requirements of their end users, whether clinicians, nurses, or patients, and address the particular medical conditions they aim to diagnose, monitor, or treat [ , ]. Clarity in defining the tool’s intended use, the health care domain it serves, and the tasks it performs ensures that it delivers meaningful value in its practical application [ , ].

Usability ensures that the tool is user-friendly and intuitive for both health care professionals and patients [ , ].
An AI tool’s ease of use and minimal training requirements are essential for successful adoption [ , ]. Usability also impacts user satisfaction, influencing acceptance and trust in the system [ , ]. Proper design should minimize cognitive load, provide relevant information in context, and allow customization by users [ ]. Evaluating usability ensures that AI tools can be effectively deployed in real-world clinical environments, enhancing rather than hindering care delivery [ , ].

Adequate training ensures that clinicians and other end users can effectively use AI tools, minimizing user error and maximizing the tool’s potential to improve patient outcomes [ , , ].
Training programs should cover how to interact with the AI interface, interpret its outputs, and understand the tool’s limitations [ , ]. Continuous education is also crucial, and end users should not only be trained on interpreting the algorithm’s output but also be made aware of the factors that can affect its performance [ ]. Moreover, accessible and responsive technical support is necessary to address user concerns, provide ongoing assistance, and maintain confidence in the AI tool’s reliability and safety over time [ ]. Without proper training and support, the integration of AI tools into clinical practice may face significant barriers, limiting their overall effectiveness [ , ].

Cost and Economic Evaluation
This criteria cluster evaluates the economic implications of the AI tool to determine its financial viability and long-term sustainability.
Economic evaluation and cost considerations are crucial in assessing AI tools in health care. AI interventions must demonstrate not only clinical value but also health economic impact to ensure their long-term sustainability [
, ]. This includes evaluating both direct costs, such as acquisition, maintenance, and implementation, and indirect costs, such as staff training or workflow disruptions [ , ]. Transparent and comprehensive economic evaluations help health care organizations determine the financial viability of AI tools, guiding decision-making on investments, reimbursement, and long-term sustainability [ , , , ]. Incomplete or unclear cost assessments can hinder AI adoption and create financial risks [ , ].

The choice of an economic evaluation method for an AI tool in health care depends on its intended use and desired outcomes. Cost-effectiveness analysis is useful when comparing costs with health outcomes like life years saved [ , , , ].
Cost-utility analysis is ideal when focusing on both life expectancy and quality of life improvements, measured in quality-adjusted life years or disability-adjusted life years [ , , ]. Cost-minimization analysis is appropriate when the AI tool achieves similar outcomes as alternatives but aims to reduce costs [ , , ]. The method chosen should align with the tool’s specific goals and intended health care impact.
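For orientation, cost-effectiveness and cost-utility results are commonly summarized as an incremental cost-effectiveness ratio (ICER), the extra cost per additional unit of health outcome gained; the figures in the worked example below are purely hypothetical.

```latex
\[
\mathrm{ICER}
  = \frac{C_{\text{AI}} - C_{\text{comparator}}}{E_{\text{AI}} - E_{\text{comparator}}}
  \quad\text{eg,}\quad
  \frac{\$500\,000 - \$450\,000}{310 - 300~\text{QALYs}}
  = \$5000~\text{per QALY gained.}
\]
```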
Technological Safety and Transparency
This criteria cluster focuses on evaluating the technological safety and transparency of the AI tool by assessing the safeguards in place to ensure safe and ethical operation.
Safety ensures that AI systems operate reliably and securely in clinical environments beyond laboratory settings and clinical trials [
, ]. This includes compliance with safety regulations, minimizing the risks of harmful outcomes, and maintaining high standards for long-term safety and patient protection [ , , ]. Safety also encompasses the reliability of the AI model after its implementation, ensuring it consistently avoids errors and unintended consequences [ , , ]. Ongoing monitoring, risk management, and thorough clinical validation are necessary to ensure that AI tools remain safe and effective in diverse health care settings and that frequent updates remain safe over the long term [ , , , ].

Transparency is a critical assessment criterion for AI tools in health care, ensuring clarity in data processing, coding standards, and the overall functioning of AI systems [ , , ].
Transparent models allow health care professionals to understand how decisions are made, promoting trust and enabling accurate assessments of the AI’s performance [ , , , ]. Clear documentation and disclosure of data processing methods, coding protocols, and the AI’s decision-making processes ensure accountability and reproducibility [ , , ]. A recent review of 692 FDA-approved AI-enabled medical devices highlighted major gaps in transparency and safety reporting [ ]. Key data, such as ethnicity (reported in only 3.6% of approvals), socioeconomic information (absent in 99.1%), and study participants’ age (missing in 81.6%), were often underreported [ ]. In addition, only 46.1% of devices provided detailed performance results, and only 1.9% were linked to scientific publications on safety and efficacy [ ]. These findings underscore the urgent need for improved transparency and more comprehensive safety reporting to reduce algorithmic bias and ensure equitable health care outcomes.

Ethical oversight and human in command ensure human control and responsibility in AI decision-making processes [ , ].
This criterion emphasizes that humans must retain ultimate authority over AI-generated decisions, particularly in critical health care scenarios [ , ]. Human in command ensures that clinicians can review, intervene, or override AI decisions, maintaining ethical standards and safeguarding patient outcomes [ , ]. This oversight protects against overreliance on automated systems and ensures that AI tools support, rather than replace, human judgment in clinical practice [ , , ].

Scalability and Impact
This criteria cluster focuses on evaluating scalability and impact by determining the AI tool’s clinical utility and effectiveness and examining its broader impact.
Clinical effectiveness focuses on the tool’s ability to positively impact patient outcomes [
, , ]. This involves evaluating whether the AI tool contributes to better therapeutic results or patient-reported outcomes [ , ]. The assessment examines how well the AI tool integrates into real-world clinical settings and measures its tangible benefits in terms of patient health and health care quality [ , ]. Clinical effectiveness ensures that AI tools do more than function technically; they must provide meaningful improvements in patient care [ , ].

Clinical utility focuses on how effectively the tool supports clinical tasks and decision-making, including its ability to assist with diagnoses, treatment recommendations, and overall health care delivery [ , ].
Ensuring clinical utility means the AI tool must provide tangible benefits that align with clinical needs and enhance health care practices [ , ]. Clinical efficiency focuses on the tool’s ability to optimize resource use while maintaining or improving care quality [ ]. This includes evaluating how well it improves productivity, reduces time spent on routine tasks, and streamlines workflows for health care professionals [ , , ].

Environmental impact is an important, yet often overlooked, criterion for assessing AI tools in health care; only 1 of the 44 included studies addressed it. The energy consumption and resource use associated with developing, deploying, and maintaining AI systems, such as data centers, computational power, and device infrastructures, can lead to significant environmental harm, including e-waste and greenhouse gas emissions [ ].
Implementing eco-responsible practices, such as energy-efficient computing and sustainable data storage, is essential to minimizing the ecological footprint of AI tools [ ].
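As a rough illustration of what such an assessment might quantify, the sketch below estimates annual inference-related emissions; every figure in it is a hypothetical assumption, as real values depend on hardware, workload, and grid mix.

```python
# Back-of-envelope CO2e estimate for model inference (all figures hypothetical).
inferences_per_year = 1_000_000
kwh_per_inference = 0.002          # assumed energy per inference
grid_kg_co2e_per_kwh = 0.4         # assumed grid carbon intensity

annual_kg_co2e = inferences_per_year * kwh_per_inference * grid_kg_co2e_per_kwh
print(f"Estimated {annual_kg_co2e:,.0f} kg CO2e per year")  # 800 kg in this example
```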
Practical Implications and Persisting Challenges
The wide array of frameworks and initiatives focused on AI assessment in health care identified in this systematic review highlights the significant lack of standardization in this field, creating additional challenges for stakeholders [ , , ].
Faced with a growing number of assessment tools, they often struggle to determine which approach is most appropriate or how to apply it effectively [ ]. This diversity in assessment methods can lead to confusion and hinder comparability [ , , , ]. Variations in data collection and evaluation methods, ranging from self-reported to objective measures and from qualitative to quantitative assessments, only add to the complexity, further complicating the establishment of clear, universal guidelines for AI evaluation in health care [ ].

Most frameworks included in this analysis were driven by the recognition that many existing methods for assessing AI tools in health care were not specifically tailored to AI-based medical devices or health care applications [ , , ].
Traditional technology assessments often lack a critical focus on the unique, dynamic challenges and opportunities AI presents [ ]. This underscores the need for health care–specific frameworks that account for the evolving nature and complexities of AI systems in clinical environments [ ]. Moreover, existing frameworks tend to prioritize technical metrics such as algorithm accuracy, precision, and validation [ , ]. While these factors are undeniably important, this narrow focus often overlooks broader considerations, including clinical relevance, practical application, and long-term impact on patient outcomes [ , , ]. Consequently, these frameworks can fall short in delivering a holistic evaluation of AI tools, which is essential for ensuring their safe, effective, and seamless integration into real-world health care settings [ , ].

This study builds upon and advances the ongoing discussion on AI assessment in health care, aiming to address the recognized gaps by developing the AI for IMPACTS framework. This proposed framework integrates technical, social, and organizational dimensions, ensuring that the adaptive nature of AI and the complexity of the health care ecosystem are fully considered. By encompassing these critical aspects, the framework provides a more comprehensive and nuanced approach to evaluating AI tools, helping shape the field and offering a robust method for assessing AI’s real-world impact in health care settings.
However, numerous challenges remain. These challenges extend beyond setting the assessment criteria to include practical difficulties in implementing, validating, and standardizing these criteria across diverse health care environments. A key challenge in assessing AI tools in health care is the variation across different contexts and settings [ , ].
Most available evidence focuses on high-income countries, limiting the generalizability of findings to diverse health care environments, particularly in low- and middle-income countries [ , ]. Recent studies underscore the importance of collaborative efforts and context-sensitive solutions to effectively address the unique health care challenges faced in these regions [ ]. Another challenge is the need for a multidisciplinary team of assessors. Effective evaluation requires collaboration among professionals from various fields, such as medicine, IT, and the social sciences, to ensure a comprehensive assessment [ , ]. This diversity of expertise is necessary to address the complexities of AI, from technical and ethical considerations to clinical relevance and real-world impact [ , , ].

It is crucial to emphasize the importance of adequate training in assessment methods [ , ].
Many assessors may lack the specific expertise required to thoroughly evaluate AI-based tools [ ]. Proper training in the complexities of AI technology and appropriate evaluation techniques is essential for conducting accurate and meaningful assessments [ ]. Without this, the assessment process may be compromised, potentially leading to inaccurate or incomplete evaluations of an AI tool’s safety and effectiveness, which could undermine its implementation in health care settings [ ]. Furthermore, the rapid pace of AI development, with AI-based medical devices having shorter product life cycles compared to traditional medical devices, underscores the need for more adaptive and fast-tracked health technology assessment processes [ , ]. Conventional health technology assessments are often too time-consuming, taking about a year to complete, which is incompatible with the fast-evolving nature of AI technologies [ ]. Balancing the need for robust evidence with the dynamic nature of AI development is essential to ensure timely, informed decision-making and avoid delays in implementation and potential reimbursement [ , ].

Limitations and Future Research
This study enhances the understanding of various criteria for assessing the quality and impact of AI tools in health care, but several limitations must be acknowledged. Relevant studies may have been missed due to language restrictions or limited database searches, and the exclusion of gray literature may have omitted valuable insights. In addition, no follow-up was conducted with the study authors to validate the findings, and manual reference searches were avoided to minimize citation bias. As a result, some relevant frameworks or assessment criteria may not have been captured in this review. Future research could expand to include studies in other languages, offering a more comprehensive understanding of potential interregional or intercultural differences in the assessment of AI tools in health care.
The critical appraisal of the frameworks included in this review highlighted that many papers discussing AI tool assessment in health care lacked rigorous validation, with some omitting the methods section entirely. To address this gap, we propose rigorously validating the AI for IMPACTS framework through a Delphi process. The Delphi method was selected because it is specifically designed to achieve reliable expert consensus, particularly in addressing complex issues [ , ].
This method is widely recognized across various fields of medicine, especially for developing best practice guidance and clinical guidelines, where expert agreement is critical [ , ]. This approach will involve key stakeholders to critically apply, reflect on, and refine the framework, ensuring it is relevant, comprehensive, and user-friendly. The goal is to cocreate practical, accessible tools with industry experts that can support the effective evaluation of AI tools in real-world health care settings.

It is also important to highlight that new frameworks were published after the cutoff date of this systematic review, including the Organizational Perspective Checklist for Artificial Intelligence Adoption [ ],
Stanford’s framework for evaluating Fair, Useful, and Reliable Artificial Intelligence Models in Health Care Systems [ ], and the Transparent Reporting of Ethics for Generative Artificial Intelligence checklist [ ]. While an initial review shows that their assessment dimensions align with this work, a deeper integration will be undertaken before the validation study. This will ensure that the foundation for the Delphi process is as comprehensive and up-to-date as possible.

Conclusions
AI has the potential to transform health care by improving clinical outcomes and operational efficiency. However, its adoption has progressed more slowly than anticipated, partly due to the absence of robust and comprehensive evaluation frameworks. Existing frameworks often focus too narrowly on technical metrics, such as accuracy and validation, neglecting real-world factors like clinical impact, workflow integration, and economic viability. Furthermore, the variety of frameworks and initiatives focused on AI assessment in health care, as highlighted in this systematic review, underscores a significant lack of standardization in the field, creating additional challenges for stakeholders and making it difficult to compare and implement AI tools effectively.
This study builds on and advances the ongoing discussion surrounding AI assessment in health care by developing the AI for IMPACTS framework. It aims to address key gaps identified in existing evaluation approaches, offering a comprehensive model that incorporates technical, social, and organizational dimensions. It is organized around 7 key criteria clusters: I—integration, interoperability, and workflow; M—monitoring, governance, and accountability; P—performance and quality metrics; A—acceptability, trust, and training; C—cost and economic evaluation; T—technological safety and transparency; and S—scalability and impact.
While the framework provides a more holistic approach, significant challenges persist. The diverse contexts and settings in health care make it difficult to apply a one-size-fits-all framework. Multidisciplinary teams are necessary to evaluate AI tools thoroughly, as expertise from fields such as medicine, IT, and social sciences is required to address the complexities of AI. In addition, many assessors lack the specific training needed to evaluate these tools accurately. The rapid pace of AI development further complicates the assessment process, as conventional evaluation methods are often too slow to keep up with AI’s short product life cycles. To ensure successful AI integration in health care, adaptive and fast-tracked assessment processes are essential, allowing for timely decision-making and implementation while maintaining the necessary rigor.
Data Availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Conflicts of Interest
CJ is an editorial board member of JMIR Human Factors at the time of this publication.
Multimedia Appendix 1
Critical Appraisal Skills Program appraisal of the included studies.
XLSX File (Microsoft Excel File), 70 KB

Multimedia Appendix 2
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) checklist.
DOCX File, 33 KB

References
- Esteva A, Robicquet A, Ramsundar B, Kuleshov V, DePristo M, Chou K, et al. A guide to deep learning in healthcare. Nat Med. Jan 7, 2019;25(1):24-29. [CrossRef] [Medline]
- Sutton RT, Pincock D, Baumgart DC, Sadowski DC, Fedorak RN, Kroeker KI. An overview of clinical decision support systems: benefits, risks, and strategies for success. NPJ Digit Med. 2020;3:17. [FREE Full text] [CrossRef] [Medline]
- Pressman SM, Borna S, Gomez-Cabello CA, Haider SA, Haider CR, Forte AJ. Clinical and surgical applications of large language models: a systematic review. J Clin Med. May 22, 2024;13(11):3041. [FREE Full text] [CrossRef] [Medline]
- Xu C, Solomon SA, Gao W. Artificial intelligence-powered electronic skin. Nat Mach Intell. Dec 18, 2023;5(12):1344-1355. [FREE Full text] [CrossRef] [Medline]
- Muehlematter UJ, Daniore P, Vokinger KN. Approval of artificial intelligence and machine learning-based medical devices in the USA and Europe (2015-20): a comparative analysis. Lancet Digit Health. Mar 2021;3(3):e195-e203. [FREE Full text] [CrossRef] [Medline]
- Zhang J, Whebell S, Gallifant J, Budhdeo S, Mattie H, Lertvittayakumjorn P, et al. An interactive dashboard to track themes, development maturity, and global equity in clinical artificial intelligence research. Lancet Digit Health. Apr 2022;4(4):e212-e213. [FREE Full text] [CrossRef] [Medline]
- Goh S, Goh RS, Chong B, Ng QX, Koh GC, Ngiam KY, et al. Challenges in implementing artificial intelligence in breast cancer screening programs: a systematic review and framework for safe adoption. J Med Internet Res (Forthcoming). 2022. [FREE Full text] [CrossRef]
- Hassan M, Kushniruk A, Borycki E. Barriers to and facilitators of artificial intelligence adoption in health care: scoping review. JMIR Hum Factors. Aug 29, 2024;11:e48633. [FREE Full text] [CrossRef] [Medline]
- Wu E, Wu K, Daneshjou R, Ouyang D, Ho DE, Zou J. How medical AI devices are evaluated: limitations and recommendations from an analysis of FDA approvals. Nat Med. Apr 2021;27(4):582-584. [CrossRef] [Medline]
- European Society of Radiology (ESR). Value-based radiology: what is the ESR doing, and what should we do in the future? Insights Imaging. Jul 27, 2021;12(1):108-152. [FREE Full text] [CrossRef] [Medline]
- Petersson L, Larsson I, Nygren JM, Nilsen P, Neher M, Reed JE, et al. Challenges to implementing artificial intelligence in healthcare: a qualitative interview study with healthcare leaders in Sweden. BMC Health Serv Res. Jul 01, 2022;22(1):850. [FREE Full text] [CrossRef] [Medline]
- Artificial intelligence in health care: benefits and challenges of technologies to augment patient care. U.S. Government Accountability Office. URL: https://www.gao.gov/products/gao-21-7sp [accessed 2024-04-29]
- Cruz Rivera S, Liu X, Chan AW, Denniston AK, Calvert MJ, SPIRIT-AICONSORT-AI Working Group. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI Extension. BMJ. Sep 09, 2020;370:m3210. [FREE Full text] [CrossRef] [Medline]
- Liu X, Cruz Rivera S, Moher D, Calvert MJ, Denniston AK, SPIRIT-AICONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Lancet Digit Health. Oct 09, 2020;2(10):e537-e548. [FREE Full text] [CrossRef] [Medline]
- Nsoesie EO. Evaluating artificial intelligence applications in clinical settings. JAMA Netw Open. Sep 07, 2018;1(5):e182658. [FREE Full text] [CrossRef] [Medline]
- Collins GS, Reitsma JB, Altman DG, Moons K. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): the TRIPOD Statement. BMC Med. Jan 06, 2015;13(1):1. [FREE Full text] [CrossRef] [Medline]
- Geis JR, Brady AP, Wu CC, Spencer J, Ranschaert E, Jaremko JL, et al. Ethics of artificial intelligence in radiology: summary of the joint European and north American multisociety statement. Can Assoc Radiol J. Nov 29, 2019;70(4):329-334. [CrossRef] [Medline]
- Bluemke DA, Moy L, Bredella MA, Ertl-Wagner BB, Fowler KJ, Goh VJ, et al. Assessing radiology research on artificial intelligence: a brief guide for authors, reviewers, and readers-from the editorial board. Radiology. Mar 2020;294(3):487-489. [CrossRef] [Medline]
- van Leeuwen KG, Schalekamp S, Rutten MJ, van Ginneken B, de Rooij M. Artificial intelligence in radiology: 100 commercially available products and their scientific evidence. Eur Radiol. Jun 15, 2021;31(6):3797-3804. [FREE Full text] [CrossRef] [Medline]
- Stettinger G, Weissensteiner P, Khastgir S. Trustworthiness assurance assessment for high-risk AI-based systems. IEEE Access. 2024;12:22718-22745. [CrossRef]
- Chouffani El Fassi S, Abdullah A, Fang Y, Natarajan S, Masroor AB, Kayali N, et al. Not all AI health tools with regulatory authorization are clinically validated. Nat Med. Oct 26, 2024;30(10):2718-2720. [CrossRef] [Medline]
- Sounderajah V, Ashrafian H, Golub RM, Shetty S, De Fauw J, Hooft L, et al. STARD-AI Steering Committee. Developing a reporting guideline for artificial intelligence-centred diagnostic test accuracy studies: the STARD-AI protocol. BMJ Open. Jun 28, 2021;11(6):e047709. [FREE Full text] [CrossRef] [Medline]
- Mongan J, Moy L, Kahn CE. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. Mar 01, 2020;2(2):e200029. [FREE Full text] [CrossRef] [Medline]
- Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. Jul 09, 2021;11(7):e048008. [CrossRef] [Medline]
- Ethics and governance of artificial intelligence for health. World Health Organization. URL: https://www.who.int/publications/i/item/9789240029200 [accessed 2024-10-02]
- Jacob C, Lindeque J, Klein A, Ivory C, Heuss S, Peter MK. Assessing the quality and impact of eHealth tools: systematic literature review and narrative synthesis. JMIR Hum Factors. Mar 23, 2023;10:e45143. [FREE Full text] [CrossRef] [Medline]
- Jacob C, Lindeque J, Müller R, Klein A, Metcalfe T, Connolly SL, et al. A sociotechnical framework to assess patient-facing eHealth tools: results of a modified Delphi process. NPJ Digit Med. Dec 15, 2023;6(1):232. [FREE Full text] [CrossRef] [Medline]
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. Mar 29, 2021;372:n71. [FREE Full text] [CrossRef] [Medline]
- Higgins J, Thomas J, Chandler J, Cumpston M, Li T, Page M, et al. Cochrane Handbook for Systematic Reviews of Interventions. 2nd edition. Hoboken, NJ. Wiley-Blackwell; 2019.
- Jacob C. AI-powered clinician tools assessment criteria: protocol of a systematic review of the literature. Research Registry. URL: https://www.researchregistry.com/browse-the-registry#registryofsystematicreviewsmeta-analyses/registryofsystematicreviewsmeta-analysesdetails/669754107bdc5b002704adbe/ [accessed 2024-04-29]
- Ouzzani M, Hammady H, Fedorowicz Z, Elmagarmid A. Rayyan-a web and mobile app for systematic reviews. Syst Rev. Dec 05, 2016;5(1):210. [FREE Full text] [CrossRef] [Medline]
- Critical appraisal skills programme checklists. Critical Appraisal Skills Programme. URL: https://casp-uk.net/casp-tools-checklists/ [accessed 2024-04-29]
- Leonardi PM. Methodological guidelines for the study of materiality and affordances. In: Mir R, Jain S, editors. The Routledge Companion to Qualitative Research in Organization Studies. New York, NY. Routledge; 2017:279-290.
- Ammenwerth E. Technology acceptance models in health informatics: TAM and UTAUT. Stud Health Technol Inform. Jul 30, 2019;263:64-71. [CrossRef] [Medline]
- Shachak A, Kuziemsky C, Petersen C. Beyond TAM and UTAUT: future directions for HIT implementation research. J Biomed Inform. Dec 2019;100:103315. [FREE Full text] [CrossRef] [Medline]
- Jacob C, Sanchez-Vazquez A, Ivory C. Understanding clinicians' adoption of mobile health tools: a qualitative review of the most used frameworks. JMIR Mhealth Uhealth. Jul 06, 2020;8(7):e18072. [FREE Full text] [CrossRef] [Medline]
- Jacob C, Sanchez-Vazquez A, Ivory C. Social, organizational, and technological factors impacting clinicians' adoption of mobile health tools: systematic literature review. JMIR Mhealth Uhealth. Feb 20, 2020;8(2):e15935. [FREE Full text] [CrossRef] [Medline]
- Jacob C, Sezgin E, Sanchez-Vazquez A, Ivory C. Sociotechnical factors affecting patients' adoption of mobile health tools: systematic literature review and narrative synthesis. JMIR Mhealth Uhealth. May 05, 2022;10(5):e36284. [FREE Full text] [CrossRef] [Medline]
- Braun V, Clarke V. Successful Qualitative Research: A Practical Guide for Beginners. Thousand Oaks, CA. Sage Publications; 2013.
- Braun V, Clarke V. Using thematic analysis in psychology. Qual Res Psychol. Jan 2006;3(2):77-101. [CrossRef]
- Boverhof BJ, Redekop WK, Bos D, Starmans MP, Birch J, Rockall A, et al. Radiology AI deployment and assessment rubric (RADAR) to bring value-based AI into radiological practice. Insights Imaging. Feb 05, 2024;15(1):34. [FREE Full text] [CrossRef] [Medline]
- Daneshjou R, Barata C, Betz-Stablein B, Celebi ME, Codella N, Combalia M, et al. Checklist for evaluation of image-based artificial intelligence reports in dermatology: CLEAR derm consensus guidelines from the international skin imaging collaboration artificial intelligence working group. JAMA Dermatol. Jan 01, 2022;158(1):90-96. [FREE Full text] [CrossRef] [Medline]
- Di Bidino R, Piaggio D, Andellini M, Merino-Barbancho B, Lopez-Perez L, Zhu T, et al. Scoping meta-review of methods used to assess artificial intelligence-based medical devices for heart failure. Bioengineering (Basel). Sep 22, 2023;10(10):1109. [FREE Full text] [CrossRef] [Medline]
- Elvidge J, Hawksworth C, Avşar TS, Zemplenyi A, Chalkidou A, Petrou S, et al. CHEERS-AI Steering Group. Consolidated health economic evaluation reporting standards for interventions that use artificial intelligence (CHEERS-AI). Value Health. Sep 2024;27(9):1196-1205. [FREE Full text] [CrossRef] [Medline]
- Haller S, Van Cauter S, Federau C, Hedderich DM, Edjlali M. The R-AI-DIOLOGY checklist: a practical checklist for evaluation of artificial intelligence tools in clinical neuroradiology. Neuroradiology. May 31, 2022;64(5):851-864. [CrossRef] [Medline]
- Handelman GS, Kok HK, Chandra RV, Razavi AH, Huang S, Brooks M, et al. Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. AJR Am J Roentgenol. Jan 2019;212(1):38-43. [CrossRef] [Medline]
- Jackson GP, Vergis R. Evaluation of artificial intelligence in radiation oncology. In: Mun SK, Dieterich S, editors. Artificial Intelligence in Radiation Oncology. New York, NY. World Scientific Publishing; 2023:359-368.
- Jha AK, Bradshaw TJ, Buvat I, Hatt M, Kc P, Liu C, et al. Nuclear medicine and artificial intelligence: best practices for evaluation (the RELAINCE guidelines). J Nucl Med. Sep 26, 2022;63(9):1288-1299. [FREE Full text] [CrossRef] [Medline]
- Khan SD, Hoodbhoy Z, Raja MH, Kim JY, Hogg HD, Manji AA, et al. Frameworks for procurement, integration, monitoring, and evaluation of artificial intelligence tools in clinical settings: a systematic review. PLOS Digit Health. May 29, 2024;3(5):e0000514. [FREE Full text] [CrossRef] [Medline]
- Larson DB, Harvey H, Rubin DL, Irani N, Tse JR, Langlotz CP. Regulatory frameworks for development and evaluation of artificial intelligence-based diagnostic imaging algorithms: summary and recommendations. J Am Coll Radiol. Mar 2021;18(3 Pt A):413-424. [FREE Full text] [CrossRef] [Medline]
- Lennerz JK, Salgado R, Kim GE, Sirintrapun SJ, Thierauf JC, Singh A, et al. Diagnostic quality model (DQM): an integrated framework for the assessment of diagnostic quality when using AI/ML. Clin Chem Lab Med. Mar 28, 2023;61(4):544-557. [FREE Full text] [CrossRef] [Medline]
- Magrabi F, Ammenwerth E, McNair JB, De Keizer NF, Hyppönen H, Nykänen P, et al. Artificial intelligence in clinical decision support: challenges for evaluating AI and practical implications. Yearb Med Inform. Aug 25, 2019;28(1):128-134. [FREE Full text] [CrossRef] [Medline]
- Mahadevaiah G, Rv P, Bermejo I, Jaffray D, Dekker A, Wee L. Artificial intelligence-based clinical decision support in modern medical physics: selection, acceptance, commissioning, and quality assurance. Med Phys. Jun 17, 2020;47(5):e228-e235. [FREE Full text] [CrossRef] [Medline]
- Mahmood U, Shukla-Dave A, Chan HP, Drukker K, Samala RK, Chen Q, et al. Artificial intelligence in medicine: mitigating risks and maximizing benefits via quality assurance, quality control, and acceptance testing. BJR Artif Intell. Jan 2024;1(1):ubae003. [FREE Full text] [CrossRef] [Medline]
- Omoumi P, Ducarouge A, Tournier A, Harvey H, Kahn CE, Louvet-de Verchère F, et al. To buy or not to buy-evaluating commercial AI solutions in radiology (the ECLAIR guidelines). Eur Radiol. Jun 05, 2021;31(6):3786-3796. [FREE Full text] [CrossRef] [Medline]
- Reddy S, Rogers W, Makinen VP, Coiera E, Brown P, Wenzel M, et al. Evaluation framework to guide implementation of AI systems into healthcare settings. BMJ Health Care Inform. Oct 12, 2021;28(1):e100444. [FREE Full text] [CrossRef] [Medline]
- Vaira LA, Lechien JR, Abbate V, Allevi F, Audino G, Beltramini GA, et al. Validation of the Quality Analysis of Medical Artificial Intelligence (QAMAI) tool: a new tool to assess the quality of health information provided by AI platforms. Eur Arch Otorhinolaryngol. Nov 04, 2024;281(11):6123-6131. [FREE Full text] [CrossRef] [Medline]
- van Royen FS, Asselbergs FW, Alfonso F, Vardas P, van Smeden M. Five critical quality criteria for artificial intelligence-based prediction models. Eur Heart J. Dec 07, 2023;44(46):4831-4834. [FREE Full text] [CrossRef] [Medline]
- Vervoort D, Tam DY, Wijeysundera HC. Health technology assessment for cardiovascular digital health technologies and artificial intelligence: why is it different? Can J Cardiol. Feb 2022;38(2):259-266. [CrossRef] [Medline]
- Vollmer S, Mateen BA, Bohner G, Király FJ, Ghani R, Jonsson P, et al. Machine learning and artificial intelligence research for patient benefit: 20 critical questions on transparency, replicability, ethics, and effectiveness. BMJ. Mar 20, 2020;368:l6927. [FREE Full text] [CrossRef] [Medline]
- Vithlani J, Hawksworth C, Elvidge J, Ayiku L, Dawoud D. Economic evaluations of artificial intelligence-based healthcare interventions: a systematic literature review of best practices in their conduct and reporting. Front Pharmacol. Aug 8, 2023;14:1220950. [FREE Full text] [CrossRef] [Medline]
- Abbasian M, Khatibi E, Azimi I, Oniani D, Shakeri Hossein Abad Z, Thieme A, et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digit Med. Mar 29, 2024;7(1):82. [FREE Full text] [CrossRef] [Medline]
- Economou-Zavlanos NJ, Bessias S, Cary MP, Bedoya AD, Goldstein BA, Jelovsek JE, et al. Translating ethical and quality principles for the effective, safe and fair development, deployment and use of artificial intelligence technologies in healthcare. J Am Med Inform Assoc. Feb 16, 2024;31(3):705-713. [CrossRef] [Medline]
- Larson DB, Doo FX, Allen B, Mongan J, Flanders AE, Wald C. Proceedings from the 2022 ACR-RSNA workshop on safety, effectiveness, reliability, and transparency in AI. J Am Coll Radiol. Jul 2024;21(7):1119-1129. [CrossRef] [Medline]
- Overgaard SM, Graham MG, Brereton T, Pencina MJ, Halamka JD, Vidal DE, et al. Implementing quality management systems to close the AI translation gap and facilitate safe, ethical, and effective health AI solutions. NPJ Digit Med. Nov 25, 2023;6(1):218. [FREE Full text] [CrossRef] [Medline]
- Schaekermann M, Spitz T, Pyles M, Cole-Lewis H, Wulczyn E, Pfohl SR, et al. Health equity assessment of machine learning performance (HEAL): a framework and dermatology AI model case study. EClinicalMedicine. Apr 2024;70:102479. [FREE Full text] [CrossRef] [Medline]
- Farah L, Davaze-Schneider J, Martin T, Nguyen P, Borget I, Martelli N. Are current clinical studies on artificial intelligence-based medical devices comprehensive enough to support a full health technology assessment? A systematic review. Artif Intell Med. Jun 2023;140:102547. [FREE Full text] [CrossRef] [Medline]
- Farah L, Borget I, Martelli N, Vallee A. Suitability of the current health technology assessment of innovative artificial intelligence-based medical devices: scoping literature review. J Med Internet Res. May 13, 2024;26:e51514. [FREE Full text] [CrossRef] [Medline]
- Guenoun D, Zins M, Champsaur P, Thomassin-Naggara I, DRIM France AI Study Group. French community grid for the evaluation of radiological artificial intelligence solutions (DRIM France Artificial Intelligence initiative). Diagn Interv Imaging. Feb 2024;105(2):74-81. [CrossRef] [Medline]
- Bimczok SP, Godynyuk EA, Pierey J, Roppel MS, Scholz ML. How are excellence and trust for using artificial intelligence ensured? Evaluation of its current use in EU healthcare. South East Eur J Public Health. Jan 24, 2023:239. [FREE Full text] [CrossRef]
- de Hond AA, Leeuwenberg AM, Hooft L, Kant IM, Nijman SW, van Os HJ, et al. Guidelines and quality criteria for artificial intelligence-based prediction models in healthcare: a scoping review. NPJ Digit Med. Jan 10, 2022;5(1):2. [FREE Full text] [CrossRef] [Medline]
- Voets MM, Veltman J, Slump CH, Siesling S, Koffijberg H. Systematic review of health economic evaluations focused on artificial intelligence in healthcare: the tortoise and the cheetah. Value Health. Mar 2022;25(3):340-349. [FREE Full text] [CrossRef] [Medline]
- Ding H, Simmich J, Vaezipour A, Andrews N, Russell T. Evaluation framework for conversational agents with artificial intelligence in health interventions: a systematic scoping review. J Am Med Inform Assoc. Feb 16, 2024;31(3):746-761. [CrossRef] [Medline]
- Goergen SK, Frazer HM, Reddy S. Quality use of artificial intelligence in medical imaging: what do radiologists need to know? J Med Imaging Radiat Oncol. Mar 03, 2022;66(2):225-232. [CrossRef] [Medline]
- Lehoux P, Rocha de Oliveira R, Rivard L, Silva HP, Alami H, Mörch CM, et al. A comprehensive, valid, and reliable tool to assess the degree of responsibility of digital health solutions that operate with or without artificial intelligence: 3-phase mixed methods study. J Med Internet Res. Aug 28, 2023;25:e48496. [FREE Full text] [CrossRef] [Medline]
- Tanguay W, Acar P, Fine B, Abdolell M, Gong B, Cadrin-Chênevert A, et al. Assessment of radiology artificial intelligence software: a validation and evaluation framework. Can Assoc Radiol J. May 06, 2023;74(2):326-333. [CrossRef] [Medline]
- Ji M, Genchev GZ, Huang H, Xu T, Lu H, Yu G. Evaluation framework for successful artificial intelligence–enabled clinical decision support systems: mixed methods study. J Med Internet Res. Jun 2, 2021;23(6):e25929. [CrossRef]
- Fasterholdt I, Naghavi-Behzad M, Rasmussen BS, Kjølhede T, Skjøth MM, Hildebrandt MG, et al. Value assessment of artificial intelligence in medical imaging: a scoping review. BMC Med Imaging. Oct 31, 2022;22(1):187. [FREE Full text] [CrossRef] [Medline]
- Gomez Rossi J, Feldberg B, Krois J, Schwendicke F. Evaluation of the clinical, technical, and financial aspects of cost-effectiveness analysis of artificial intelligence in medicine: scoping review and framework of analysis. JMIR Med Inform. Aug 12, 2022;10(8):e33703. [FREE Full text] [CrossRef] [Medline]
- Panagoulias DP, Virvou M, Tsihrintzis GA. Applying DOI theory to assess the required level of explainability in artificial intelligence-empowered medical applications. In: Proceedings of the 14th International Conference on Information, Intelligence, Systems & Applications. 2023. Presented at: IISA '23; July 10-12, 2023:1-7; Volos, Greece. URL: https://ieeexplore.ieee.org/document/10345846 [CrossRef]
- Bhatnagar S. Checklist for medical imaging using artificial intelligence by evaluation of machine learning models. In: Proceedings of the 5th International Conference on Inventive Research in Computing Applications. 2023. Presented at: ICIRCA '23; August 3-5, 2023:865-871; Coimbatore, India. URL: https://ieeexplore.ieee.org/document/10220939 [CrossRef]
- Alshehri S, Alahmari KA, Alasiry A. A comprehensive evaluation of AI-assisted diagnostic tools in ENT medicine: insights and perspectives from healthcare professionals. J Pers Med. Mar 28, 2024;14(4):354. [FREE Full text] [CrossRef] [Medline]
- Lundström C, Lindvall M. Mapping the landscape of care providers' quality assurance approaches for AI in diagnostic imaging. J Digit Imaging. Apr 09, 2023;36(2):379-387. [FREE Full text] [CrossRef] [Medline]
- Ross J, Hammouche S, Chen Y, Rockall A, Royal College of Radiologists AI Working Group. Beyond regulatory compliance: evaluating radiology artificial intelligence applications in deployment. Clin Radiol. May 2024;79(5):338-345. [FREE Full text] [CrossRef] [Medline]
- Long HA, French DP, Brooks JM. Optimising the value of the critical appraisal skills programme (CASP) tool for quality appraisal in qualitative evidence synthesis. Res Meth Med Health Sci. Aug 06, 2020;1(1):31-42. [CrossRef]
- Unsworth H, Dillon B, Collinson L, Powell H, Salmon M, Oladapo T, et al. The NICE evidence standards framework for digital health and care technologies - developing and maintaining an innovative evidence framework with global impact. Digit Health. Jun 24, 2021;7:20552076211018617. [FREE Full text] [CrossRef] [Medline]
- Sarwar N, Irshad A, Naith QH, D Alsufiani K, Almalki FA. Skin lesion segmentation using deep learning algorithm with ant colony optimization. BMC Med Inform Decis Mak. Sep 27, 2024;24(1):265. [FREE Full text] [CrossRef] [Medline]
- Zubair M, Owais M, Mahmood T, Iqbal S, Usman SM, Hussain I. Enhanced gastric cancer classification and quantification interpretable framework using digital histopathology images. Sci Rep. Sep 28, 2024;14(1):22533. [FREE Full text] [CrossRef] [Medline]
- Bourdillon AT, Garg A, Wang H, Woo YJ, Pavone M, Boyd J. Integration of reinforcement learning in a virtual robotic surgical simulation. Surg Innov. Feb 03, 2023;30(1):94-102. [CrossRef] [Medline]
- Muralidharan V, Adewale BA, Huang CJ, Nta MT, Ademiju PO, Pathmarajah P, et al. A scoping review of reporting gaps in FDA-approved AI medical devices. NPJ Digit Med. Oct 03, 2024;7(1):273. [FREE Full text] [CrossRef] [Medline]
- Yang J, Dung NT, Thach PN, Phong NT, Phu VD, Phu KD, et al. Generalizability assessment of AI models across hospitals in a low-middle and high income country. Nat Commun. Sep 27, 2024;15(1):8270. [FREE Full text] [CrossRef] [Medline]
- Barrett D, Heale R. What are Delphi studies? Evid Based Nurs. Jul 19, 2020;23(3):68-69. [CrossRef] [Medline]
- Grime MM, Wright G. Delphi method. In: Grime MM, editor. Wiley StatsRef: Statistics Reference Online. Hoboken, NJ. John Wiley & Sons; 2016:1-6.
- Nasa P, Jain R, Juneja D. Delphi methodology in healthcare research: how to decide its appropriateness. World J Methodol. Jul 20, 2021;11(4):116-129. [FREE Full text] [CrossRef] [Medline]
- Jünger S, Payne SA, Brine J, Radbruch L, Brearley SG. Guidance on conducting and REporting DElphi studies (CREDES) in palliative care: recommendations based on a methodological systematic review. Palliat Med. Sep 13, 2017;31(8):684-706. [CrossRef] [Medline]
- Dagan N, Devons-Sberro S, Paz Z, Zoller L, Sommer A, Shaham G, et al. Evaluation of AI solutions in health care organizations — the OPTICA tool. NEJM AI. Aug 22, 2024;1(9):65. [CrossRef]
- Callahan A, McElfresh D, Banda JM, Bunney G, Char D, Chen J, et al. Standing on FURM ground: a framework for evaluating fair, useful, and reliable AI models in health care systems. NEJM Catal. 2024;5(10):131. [CrossRef]
- Ning Y, Teixayavong S, Shang Y, Savulescu J, Nagaraj V, Miao D, et al. Generative artificial intelligence and ethical considerations in health care: a scoping review and ethics checklist. Lancet Digit Health. Nov 2024;6(11):e848-e856. [FREE Full text] [CrossRef] [Medline]
Abbreviations
AI: artificial intelligence
CASP: Critical Appraisal Skills Program
CONSORT-AI: Consolidated Standards of Reporting Trials–Artificial Intelligence
EU: European Union
PICO: participants, intervention, comparators, and outcomes
PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses |
Edited by A Mavragani; submitted 14.10.24; peer-reviewed by D Vogel, Q Ng; comments to author 04.11.24; revised version received 14.11.24; accepted 30.12.24; published 05.02.25.
Copyright©Christine Jacob, Noé Brasier, Emanuele Laurenzi, Sabina Heuss, Stavroula-Georgia Mougiakakou, Arzu Cöltekin, Marc K Peter. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 05.02.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.