TY - JOUR
AU - Lahat, Adi
AU - Sharif, Kassem
AU - Zoabi, Narmin
AU - Shneor Patt, Yonatan
AU - Sharif, Yousra
AU - Fisher, Lior
AU - Shani, Uria
AU - Arow, Mohamad
AU - Levin, Roni
AU - Klang, Eyal
PY - 2024
DA - 2024/6/27
TI - Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4
JO - J Med Internet Res
SP - e54571
VL - 26
KW - ChatGPT
KW - chat-GPT
KW - chatbot
KW - chatbots
KW - chat-bot
KW - chat-bots
KW - natural language processing
KW - NLP
KW - artificial intelligence
KW - AI
KW - machine learning
KW - ML
KW - algorithm
KW - algorithms
KW - predictive model
KW - predictive models
KW - predictive analytics
KW - predictive system
KW - practical model
KW - practical models
KW - internal medicine
KW - ethics
KW - ethical
KW - ethical dilemma
KW - ethical dilemmas
KW - bioethics
KW - emergency medicine
KW - EM medicine
KW - ED physician
KW - emergency physician
KW - emergency doctor
AB - Background: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. Objective: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors’ and residents’ ratings, and specific question types. Methods: A total of 4 specialized physicians formulated 176 real-world clinical questions. A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. Results: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5’s accuracy, beneficial, and completeness dimensions. Conclusions: ChatGPT’s potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments.
SN - 1438-8871
UR - https://www.jmir.org/2024/1/e54571
UR - https://doi.org/10.2196/54571
DO - 10.2196/54571
ID - info:doi/10.2196/54571
ER -