%0 Journal Article
%@ 1438-8871
%I JMIR Publications
%V 27
%N
%P e64486
%T Accuracy of Large Language Models When Answering Clinical Research Questions: Systematic Review and Network Meta-Analysis
%A Wang,Ling
%A Li,Jinglin
%A Zhuang,Boyang
%A Huang,Shasha
%A Fang,Meilin
%A Wang,Cunze
%A Li,Wen
%A Zhang,Mohan
%A Gong,Shurong
%+ The Third Department of Critical Care Medicine, Fuzhou University Affiliated Provincial Hospital, Shengli Clinical Medical College, Fujian Medical University, No.134 Dongjie Road, Fuzhou, Fujian, 350001, China, 86 15060677447, shurong_gong@fjmu.edu.cn
%K large language models
%K LLM
%K clinical research questions
%K accuracy
%K network meta-analysis
%K PRISMA
%D 2025
%7 30.4.2025
%9 Review
%J J Med Internet Res
%G English
%X Background: Large language models (LLMs) have flourished and gradually become an important research and application direction in the medical field. However, because medicine is highly specialized, complex, and specific, and therefore demands extremely high accuracy, controversy remains about whether LLMs can be used in the medical field. A growing number of studies have evaluated the performance of various LLMs in medicine, but their conclusions are inconsistent. Objective: This study uses a network meta-analysis (NMA) to assess the accuracy of LLMs when answering clinical research questions and to provide high-level evidence for their future development and application in the medical field. Methods: In this systematic review and NMA, we searched PubMed, Embase, Web of Science, and Scopus from inception until October 14, 2024. Studies on the accuracy of LLMs when answering clinical research questions were screened and included by reading the published reports. The systematic review and NMA compared the accuracy of different LLMs when answering clinical research questions, including objective questions, open-ended questions, top 1 diagnosis, top 3 diagnosis, top 5 diagnosis, and triage and classification. The NMA was performed using Bayesian frequency theory methods. Indirect intercomparisons between programs were performed using a grading scale. A larger surface under the cumulative ranking curve (SUCRA) value indicates a higher accuracy ranking for the corresponding LLM. Results: The systematic review and NMA examined 168 articles encompassing 35,896 questions and 3063 clinical cases. Of the 168 studies, 40 (23.8%) were considered to have a low risk of bias, 128 (76.2%) had a moderate risk, and none were rated as having a high risk. ChatGPT-4o (SUCRA=0.9207) ranked highest in accuracy for objective questions, followed by Aeyeconsult (SUCRA=0.9187) and ChatGPT-4 (SUCRA=0.8087). ChatGPT-4 (SUCRA=0.8708) excelled at answering open-ended questions. For the accuracy of the top 1 and top 3 diagnoses of clinical cases, human experts (SUCRA=0.9001 and SUCRA=0.7126, respectively) ranked highest, whereas Claude 3 Opus (SUCRA=0.9672) performed best at the top 5 diagnosis. Gemini (SUCRA=0.9649) had the highest SUCRA value for accuracy in triage and classification. Conclusions: Our study indicates that ChatGPT-4o has an advantage when answering objective questions. For open-ended questions, ChatGPT-4 may be more credible. Humans are more accurate at the top 1 and top 3 diagnoses. Claude 3 Opus performs better at the top 5 diagnosis, while for triage and classification, Gemini is more advantageous.
This analysis offers valuable insights for clinicians and medical practitioners, empowering them to leverage LLMs effectively for improved decision-making in learning, diagnosis, and management across various clinical scenarios. Trial Registration: PROSPERO CRD42024558245; https://www.crd.york.ac.uk/PROSPERO/view/CRD42024558245
%M 40305085
%R 10.2196/64486
%U https://www.jmir.org/2025/1/e64486
%U https://doi.org/10.2196/64486
%U http://www.ncbi.nlm.nih.gov/pubmed/40305085