Research Letter
Abstract
Using simulated patients to mimic 9 established noncommunicable and infectious diseases, we assessed ChatGPT’s performance in treatment recommendations for common diseases in low- and middle-income countries. ChatGPT had a high level of accuracy in both correct diagnoses (20/27, 74%) and medication prescriptions (22/27, 82%), but it also recommended unnecessary or harmful medications at a concerning rate (23/27, 85%), even alongside correct diagnoses. ChatGPT performed better in managing noncommunicable diseases than infectious ones. These results highlight the need for cautious AI integration into health care systems to ensure quality and safety.
J Med Internet Res 2024;26:e56121. doi: 10.2196/56121
Introduction
The rise of generative artificial intelligence (AI) models like ChatGPT is transforming the health care landscape, especially in low- and middle-income countries (LMICs). These regions, often facing shortages of health care professionals, are increasingly turning to AI tools for medical consultation, aided by growing internet and smartphone access [1]. Research has highlighted generative AI use in the fields of cardiology [2] and orthopedic diseases [3]. However, there are concerns about the accuracy and safety of AI models like ChatGPT [4], given their lack of legal or professional accountability. This is crucial in medical settings, where precise and reliable decision-making is vital. Our study assesses ChatGPT’s performance in treatment recommendations for common diseases in LMICs, addressing a critical need for the responsible application of AI in health care.
Methods
Overview
We used the simulated patient (SP) method to create a realistic testing environment for ChatGPT (GPT-3.5) from August 8 to 19, 2023. SPs are healthy individuals trained to consistently mimic real patients and their symptoms [5]. We trained the SPs to present 9 common, previously validated diseases [6-8]. We asked ChatGPT to act as a doctor in an LMIC and offer consultations. The SPs detailed their primary concerns, gave standardized responses to every question, and recorded all diagnoses and medication recommendations, which we cross-referenced with clinical guidelines to assess their accuracy and appropriateness. For a robust analysis, we presented each disease to ChatGPT 3 times, yielding a final sample of 27 independent trials for descriptive analysis.
Ethical Considerations
The Ethics Committee of the First Affiliated Hospital of Xi’an Jiaotong University approved the study (LLSBPJ-2024-WT-019).
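The descriptive analysis described above amounts to tallying binary outcomes across the 27 trials (9 diseases × 3 repetitions each). A minimal sketch in Python, in which the disease names and outcome values are illustrative placeholders rather than study data:

```python
# Minimal sketch of the descriptive analysis: each trial is scored on
# four binary outcomes and aggregate rates are simple proportions.
# The records below are HYPOTHETICAL placeholders, not the study data.
from dataclasses import dataclass

@dataclass
class Trial:
    disease: str
    initial_diagnosis_correct: bool
    any_diagnosis_correct: bool
    any_medication_appropriate: bool
    unnecessary_or_harmful_medication: bool

# In the study there were 27 such records; three are shown for illustration.
trials = [
    Trial("disease A", True, True, True, True),
    Trial("disease B", False, True, True, True),
    Trial("disease C", True, True, False, False),
]

def rate(flags):
    """Proportion of trials for which a binary outcome is True."""
    return sum(flags) / len(flags)

print(f"initial diagnosis correct:  {rate([t.initial_diagnosis_correct for t in trials]):.0%}")
print(f"any diagnosis correct:      {rate([t.any_diagnosis_correct for t in trials]):.0%}")
print(f"any appropriate medication: {rate([t.any_medication_appropriate for t in trials]):.0%}")
```

With the study's full sample, the same tally yields the proportions reported in the Results (eg, 18/27 for correct initial diagnoses).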
Results
Surprisingly, ChatGPT’s performance varied across trials for each disease. When aggregating the results, ChatGPT had a 67% (18/27) success rate in initial diagnoses and a 59% (16/27) success rate in initial medication recommendations. When considering all recommendations, these rates increased to 74% (20/27) for any correct diagnosis and 82% (22/27) for any appropriate medication recommendation. However, there was a high rate of unnecessary or harmful medication suggestions, occurring in 85% (23/27) of trials overall and in 59% (16/27) of trials after a correct diagnosis. Our study also highlighted ChatGPT’s varying performance across different types of diseases: the AI handled noncommunicable diseases better than infectious diseases in terms of both diagnosis and medication recommendations.
Discussion
Our findings reveal a high level of accuracy in both correct diagnoses (74%) and medication recommendations (82%) by ChatGPT. Previous studies using the SP method found that primary care providers in LMICs such as China, India, and Kenya reached correct diagnoses in only 12%-52% of SP visits [5,6]. ChatGPT can therefore potentially outperform traditional primary care providers in LMICs in diagnostic accuracy. Since ChatGPT with GPT-3.5 is free, the tool could offer affordable and far-reaching solutions in LMICs, particularly in rural and underserved areas.
However, ChatGPT tended to suggest unnecessary or even harmful medications more often (in 85% of trials) than primary care providers did (28%-64%) [5,6]. AI models work by analyzing available data using machine learning and deep learning techniques [9]. Their approach to drug prescription can be aggressive, given the absence of professional accountability or of any motive to reduce medical expenses. ChatGPT also performed better in managing noncommunicable diseases than infectious diseases, possibly because more information on the former was available for AI training during development [10]. ChatGPT’s performance also varied within each disease case, contrary to our expectation that it would be more standardized.
We acknowledge several limitations. First, future studies should use a broader array of diseases, especially those specific to different regions. Second, we did not introduce additional details (eg, location) to avoid overcomplicating the prompts; by default, ChatGPT’s responses reflect the average population, which increases their generalizability. Third, we did not account for the relative importance of the AI’s questions and emotional communications. Fourth, a larger sample size might have enabled head-to-head comparisons between AI care and traditional care.
Despite these limitations, we present the first audit-study evidence evaluating ChatGPT’s performance in diagnosing and treating common diseases in LMICs. A rich set of 9 established diseases makes our findings highly relevant to and widely applicable in LMICs. ChatGPT reaches high levels of accuracy in diagnosis and medication recommendations but also recommends unnecessary or harmful medications at a concerning rate. Integrating AI tools like ChatGPT into health care systems in LMICs may improve diagnostic accuracy but also raises concerns about care safety.
Acknowledgments
No funding was available to support this study. XC acknowledges financial support from the Drazen scholarship and the Aden scholarship dedicated to research on Chinese health care systems. YS and SG acknowledge support from the National Social Science Foundation of China (23AZD091). During the preparation of this work, the authors used ChatGPT with GPT-4 to improve readability and language. After using this tool, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication.
Data Availability
All data generated or analyzed during this study are included in this published article.
Authors' Contributions
YS contributed to conceptualization, investigation, analysis, and writing (original draft); YY contributed to analysis, investigation, review, and editing; XW contributed to analysis, investigation, review, and editing; JZ contributed to analysis, investigation, review, and editing; XC contributed to review and editing; XF contributed to review and editing; RA contributed to conceptualization, investigation, analysis, and writing; and SG contributed to review and editing. All authors approved the final version of the paper.
Conflicts of Interest
None declared.
References
1. Howarth J. How many people own smartphones? (2023-2028). Exploding Topics. URL: https://explodingtopics.com/blog/smartphone-stats [accessed 2023-12-07]
2. Sarraju A, Bruemmer D, Van Iterson E, Cho L, Rodriguez F, Laffin L. Appropriateness of cardiovascular disease prevention recommendations obtained from a popular online chat-based artificial intelligence model. JAMA. Mar 14, 2023;329(10):842-844. [FREE Full text] [CrossRef] [Medline]
3. Kuroiwa T, Sarcon A, Ibara T, Yamada E, Yamamoto A, Tsukamoto K, et al. The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res. Sep 15, 2023;25:e47621. [FREE Full text] [CrossRef] [Medline]
4. Wang C, Liu S, Yang H, Guo J, Wu Y, Liu J. Ethical considerations of using ChatGPT in health care. J Med Internet Res. Aug 11, 2023;25:e48009. [FREE Full text] [CrossRef] [Medline]
5. Kwan A, Daniels B, Bergkvist S, Das V, Pai M, Das J. Use of standardised patients for healthcare quality research in low- and middle-income countries. BMJ Glob Health. 2019;4(5):e001669. [FREE Full text] [CrossRef] [Medline]
6. Si Y, Bateman H, Chen S, Hanewald K, Li B, Su M, et al. Quantifying the financial impact of overuse in primary care in China: a standardised patient study. Soc Sci Med. Mar 2023;320:115670. [CrossRef] [Medline]
7. Xue H, D’Souza K, Fang Y, Si Y, Liao H, Qin WA, et al. Direct-to-consumer telemedicine platforms in China: a national market survey and quality evaluation. Preprints with The Lancet. Preprint posted online Oct 18, 2021. [CrossRef]
8. Si Y, Xue H, Liao H, Xie Y, Xu D, Smith M, et al. The quality of telemedicine consultations for sexually transmitted infections in China. Health Policy Plan. Mar 12, 2024;39(3):307-317. [CrossRef] [Medline]
9. Sellamuthu S, Vaddadi S, Venkata S, Petwal H, Hosur R, Mandala V, et al. AI-based recommendation model for effective decision to maximise ROI. Soft Comput. 2023:1-10. [FREE Full text] [CrossRef]
10. Sanders JW, Fuhrer GS, Johnson MD, Riddle MS. The epidemiological transition: the current status of infectious diseases in the developed world versus the developing world. Sci Prog. 2008;91(Pt 1):1-37. [FREE Full text] [CrossRef] [Medline]
Abbreviations
AI: artificial intelligence
LMIC: low- and middle-income country
SP: simulated patient
Edited by G Eysenbach, T de Azevedo Cardoso; submitted 06.01.24; peer-reviewed by D Simmons, N Domingues, W Yang; comments to author 06.04.24; revised version received 21.04.24; accepted 30.07.24; published 09.09.24.
Copyright©Yafei Si, Yuyi Yang, Xi Wang, Jiaqi Zu, Xi Chen, Xiaojing Fan, Ruopeng An, Sen Gong. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 09.09.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.