Published on 24.Sep.2025 in Vol 27 (2025)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/81769.
Critical Limitations in Systematic Reviews of Large Language Models in Health Care

Authors of this article:

Zvi Weizman

Faculty of Health Sciences, Ben-Gurion University, 8 Balfour Street, Tel-Aviv, Israel

Corresponding Author:

Zvi Weizman, MD, Prof Dr Med




I read with interest the study by Li et al [1] on the implementation of large language models (LLMs) in health care, which provides clinicians with guidance for selecting appropriate models for specific tasks. Although it provides a comprehensive overview, several limitations undermine its utility for clinical decision-making.


The authors excluded journals below a citation threshold of 13,000, which introduces a publication bias. This criterion omits innovative research from emerging or specialized journals, a concern documented in the methodology literature, and it is particularly problematic in a rapidly evolving field where important innovations may first appear in newer venues. Although the authors note that only 8.9% (24/270) of studies reported negative results, which could skew the overall perception of the clinical effectiveness of LLMs, they do not adequately account for this publication bias.
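
As a quick arithmetic check of the proportion quoted above, the following minimal sketch recomputes 24/270 and attaches a Wilson 95% interval to it. The interval is an illustration added here, not a figure reported by Li et al [1], and the function name `wilson_interval` is a hypothetical helper rather than part of any cited method.

```python
import math

def wilson_interval(successes, total, z=1.96):
    """Return the (low, high) Wilson score interval for a binomial proportion.

    z=1.96 corresponds to an approximate 95% confidence level (an assumption
    chosen for illustration).
    """
    p = successes / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return center - half_width, center + half_width

# Counts taken from the text: 24 of 270 studies reported negative results.
negative, included = 24, 270
proportion = negative / included
low, high = wilson_interval(negative, included)

print(f"Proportion with negative results: {proportion:.1%}")
print(f"Approximate 95% Wilson interval: {low:.1%} to {high:.1%}")
```

Even at the upper end of such an interval, studies with negative results would remain a small minority of the included literature, which is consistent with the concern about publication bias raised above.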


The definition of “best performance” is problematic. The authors acknowledge that performance in one context does not guarantee similar performance in another and therefore state that the frequency of “best performance” should not be interpreted as a metric for comparing models. This acknowledgment undermines their own quantitative analysis. The heterogeneity in evaluation metrics, datasets, and contexts across studies renders their performance comparisons essentially meaningless, a problem well documented in the AI literature [2].


The review lacks a quality assessment of the included studies. Reporting guidance in medical AI has emphasized the importance of evaluating study design, validation approaches, and statistical rigor [3]. The authors’ approach of simply counting “best performance” instances without considering study quality, sample sizes, or validation rigor represents a significant methodological weakness.


The 5-stage linear workflow model, while organizationally useful, oversimplifies the complex and iterative nature of clinical decision-making. Modern health care delivery involves parallel processes, feedback loops, and multidisciplinary coordination that this model fails to capture, thereby limiting the practical utility of its recommendations [4].


The authors inadequately address the critical gap between research performance and clinical validation. As noted in recent systematic reviews of AI in health care, models trained and validated on research datasets face substantial deployment challenges in medical institutions because laboratory and clinical settings differ significantly. Although the authors mention this limitation, they do not adequately weigh it in their analysis.


Although the authors discuss ethical concerns, their analysis of patient safety remains superficial. Recent literature emphasizes the critical importance of comprehensive risk assessment in implementing medical AI, including analysis of failure modes, error propagation, and impacts on clinical decision-making [5].


The review lacks a comprehensive economic evaluation of LLM implementation, including cost-effectiveness analyses, resource allocation considerations, and return-on-investment assessments.


Taken together, these limitations significantly limit the review’s clinical applicability and highlight the need for more rigorous methodological approaches in evaluating AI in health care.

Conflicts of Interest

None declared.

References

  1. Li H, Fu JF, Python A. Implementing large language models in health care: clinician-focused review with interactive guideline. J Med Internet Res. Jul 11, 2025;27:e71916.
  2. Chang Y, Yin JM, Li JM, Liu C, Cao LY, Lin SY. Applications and future prospects of medical LLMs: a survey based on the M-KAT conceptual framework. J Med Syst. Dec 27, 2024;48(1):112.
  3. Liu X, Cruz Rivera S, Moher D, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med. Sep 2020;26(9):1364-1374.
  4. Sittig DF, Singh H. A new sociotechnical model for studying health information technology in complex adaptive healthcare systems. Qual Saf Health Care. Oct 2010;19 Suppl 3(Suppl 3):i68-i74.
  5. Sendak MP, Ratliff W, Sarro D, et al. Real-world integration of a sepsis deep learning technology into routine clinical care: implementation study. JMIR Med Inform. Jul 15, 2020;8(7):e15182.


Abbreviations

LLM: large language model


Edited by Tiffany Leung. This is a non–peer-reviewed article. Submitted 03.Aug.2025; accepted 29.Aug.2025; published 24.Sep.2025.

Copyright

© Zvi Weizman. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 24.Sep.2025.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.