%0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e65984 %T Large Language Model Applications for Health Information Extraction in Oncology: Scoping Review %A Chen,David %A Alnassar,Saif Addeen %A Avison,Kate Elizabeth %A Huang,Ryan S %A Raman,Srinivas %K artificial intelligence %K chatbot %K data extraction %K AI %K conversational agent %K health information %K oncology %K scoping review %K natural language processing %K NLP %K large language model %K LLM %K digital health %K health technology %K electronic health record %D 2025 %7 28.3.2025 %9 %J JMIR Cancer %G English %X Background: Natural language processing systems for data extraction from unstructured clinical text require expert-driven input for labeled annotations and model training. The natural language processing competency of large language models (LLMs) can enable automated data extraction of important patient characteristics from electronic health records, which is useful for accelerating cancer clinical research and informing oncology care. Objective: This scoping review aims to map the current landscape, including definitions, frameworks, and future directions of LLMs applied to data extraction from clinical text in oncology. Methods: We queried Ovid MEDLINE on June 2, 2024, for primary, peer-reviewed research studies published since 2000, using oncology- and LLM-related keywords. This scoping review included studies that evaluated the performance of an LLM applied to data extraction from clinical text in oncology contexts. Study attributes and main outcomes were extracted to outline key trends of research in LLM-based data extraction. Results: The literature search yielded 24 studies for inclusion. The majority of studies assessed original and fine-tuned variants of the BERT LLM (n=18, 75%), followed by the ChatGPT conversational LLM (n=6, 25%). LLMs for data extraction were commonly applied in pan-cancer clinical settings (n=11, 46%), followed by breast (n=4, 17%) and lung (n=4, 17%) cancer contexts, and were evaluated using multi-institution datasets (n=18, 75%). Comparing the studies published in 2022‐2024 versus 2019‐2021, both the total number of studies (18 vs 6) and the proportion of studies using prompt engineering increased (5/18, 28% vs 0/6, 0%), while the proportion using fine-tuning decreased (8/18, 44.4% vs 6/6, 100%). Advantages of LLMs included positive data extraction performance and reduced manual workload. Conclusions: LLMs applied to data extraction in oncology can serve as useful automated tools to reduce the administrative burden of reviewing patient health records and increase time for patient-facing care. Recent advances in prompt engineering, fine-tuning methods, and multimodal data extraction present promising directions for future research. Further studies are needed to evaluate the performance of LLM-enabled data extraction in clinical domains beyond the training dataset and to assess the scope and integration of LLMs into real-world clinical environments.
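Illustrative sketch for the prompt-based extraction pipelines surveyed in the scoping review above: a minimal Python example of pulling structured oncology fields from a clinical note with a general-purpose LLM. The field names, the model name ("gpt-4o"), and the example note are hypothetical assumptions, not drawn from any included study; it assumes the openai (v1.x) client with an API key in the environment.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "Extract the following fields from the clinical note and return JSON only:\n"
    "cancer_site, histology, stage, biomarkers (list), treatments (list).\n"
    "Use null for anything not documented.\n\nNote:\n{note}"
)

def extract_oncology_fields(note: str) -> dict:
    """Single prompt-based extraction call; no fine-tuning involved."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any instruction-following LLM could be substituted
        messages=[{"role": "user", "content": PROMPT.format(note=note)}],
        temperature=0,   # deterministic output is preferable for structured extraction
    )
    # May need guarding in practice if the model wraps the JSON in extra prose.
    return json.loads(response.choices[0].message.content)

if __name__ == "__main__":
    note = "68-year-old woman with stage IIIA lung adenocarcinoma, EGFR exon 19 deletion, started on osimertinib."
    print(extract_oncology_fields(note))
```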
%R 10.2196/65984 %U https://cancer.jmir.org/2025/1/e65984 %U https://doi.org/10.2196/65984 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e58366 %T The AI Reviewer: Evaluating AI’s Role in Citation Screening for Streamlined Systematic Reviews %A Ghossein,Jamie %A Hryciw,Brett N %A Ramsay,Tim %A Kyeremanteng,Kwadwo %K article screening %K artificial intelligence %K systematic review %K AI %K large language model %K LLM %K screening %K analysis %K reviewer %K app %K ChatGPT 3.5 %K chatbot %K dataset %K data %K adoption %D 2025 %7 28.3.2025 %9 %J JMIR Form Res %G English %X %R 10.2196/58366 %U https://formative.jmir.org/2025/1/e58366 %U https://doi.org/10.2196/58366 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e68618 %T Automated Radiology Report Labeling in Chest X-Ray Pathologies: Development and Evaluation of a Large Language Model Framework %A Abdullah,Abdullah %A Kim,Seong Tae %K large language model %K generative pre-trained transformers %K radiology report %K labeling %K BERT %K thoracic pathologies %K LLM %K GPT %D 2025 %7 28.3.2025 %9 %J JMIR Med Inform %G English %X Background: Labeling unstructured radiology reports is crucial for creating structured datasets that facilitate downstream tasks, such as training large-scale medical imaging models. Current approaches typically rely on Bidirectional Encoder Representations from Transformers (BERT)-based methods or manual expert annotations, which have limitations in terms of scalability and performance. Objective: This study aimed to evaluate the effectiveness of a generative pretrained transformer (GPT)-based large language model (LLM) in labeling radiology reports, comparing it with 2 existing methods, CheXbert and CheXpert, on a large chest X-ray dataset (MIMIC Chest X-ray [MIMIC-CXR]). Methods: In this study, we introduce an LLM-based approach fine-tuned on expert-labeled radiology reports. Our model’s performance was evaluated on 687 radiologist-labeled chest X-ray reports, comparing F1 scores across 14 thoracic pathologies. The performance of our LLM model was compared with the CheXbert and CheXpert models across positive, negative, and uncertainty extraction tasks. Paired t tests and Wilcoxon signed-rank tests were performed to evaluate the statistical significance of differences between model performances. Results: The GPT-based LLM model achieved an average F1 score of 0.9014 across all certainty levels, outperforming CheXpert (0.8864) and approaching CheXbert’s performance (0.9047). For positive and negative certainty levels, our model scored 0.8708, surpassing CheXpert (0.8525) and closely matching CheXbert (0.8733). Statistically, paired t tests indicated no significant difference between our model and CheXbert (P=.35) but a significant improvement over CheXpert (P=.01). Wilcoxon signed-rank tests corroborated these findings, showing no significant difference between our model and CheXbert (P=.14) but confirming a significant difference with CheXpert (P=.005). The LLM also demonstrated superior performance for pathologies with longer and more complex descriptions, leveraging its extended context length. Conclusions: The GPT-based LLM model demonstrates competitive performance compared with CheXbert and outperforms CheXpert in radiology report labeling. These findings suggest that LLMs are a promising alternative to traditional BERT-based architectures for this task, offering enhanced context understanding and eliminating the need for extensive feature engineering. 
Furthermore, their larger context length makes LLM-based models better suited to this task than BERT-based models, whose context length is smaller. %R 10.2196/68618 %U https://medinform.jmir.org/2025/1/e68618 %U https://doi.org/10.2196/68618 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e65537 %T Large Language Model–Driven Knowledge Graph Construction in Sepsis Care Using Multicenter Clinical Databases: Development and Usability Study %A Yang,Hao %A Li,Jiaxi %A Zhang,Chi %A Sierra,Alejandro Pazos %A Shen,Bairong %+ Department of Critical Care Medicine, Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Frontiers Science Center for Disease-related Molecular Network, Institutes for Systems Genetics, Sichuan University, West China Hospital, No. 37, Guo Xue Xiang, Chengdu, 610041, China, 86 85164199, bairong.shen@scu.edu.cn %K sepsis %K knowledge graph %K large language models %K prompt engineering %K real-world %K GPT-4.0 %D 2025 %7 27.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Sepsis is a complex, life-threatening condition characterized by significant heterogeneity and vast amounts of unstructured data, posing substantial challenges for traditional knowledge graph construction methods. The integration of large language models (LLMs) with real-world data offers a promising avenue to address these challenges and enhance the understanding and management of sepsis. Objective: This study aims to develop a comprehensive sepsis knowledge graph by leveraging the capabilities of LLMs, specifically GPT-4.0, in conjunction with multicenter clinical databases. The goal is to improve the understanding of sepsis and provide actionable insights for clinical decision-making. We also established a multicenter sepsis database (MSD) to support this effort. Methods: We collected clinical guidelines, public databases, and real-world data from 3 major hospitals in Western China, encompassing 10,544 patients diagnosed with sepsis. Using GPT-4.0, we used advanced prompt engineering techniques for entity recognition and relationship extraction, which facilitated the construction of a nuanced sepsis knowledge graph. Results: We established a sepsis database with 10,544 patient records, including 8497 from West China Hospital, 690 from Shangjin Hospital, and 357 from Tianfu Hospital. The sepsis knowledge graph comprises 1894 nodes and 2021 distinct relationships, encompassing 9 entity concepts (diseases, symptoms, biomarkers, imaging examinations, etc) and 8 semantic relationships (complications, recommended medications, laboratory tests, etc). GPT-4.0 demonstrated superior performance in entity recognition and relationship extraction, achieving an F1-score of 76.76 on a sepsis-specific dataset, outperforming other models such as Qwen2 (43.77) and Llama3 (48.39). On the CMeEE dataset, GPT-4.0 achieved an F1-score of 65.42 using few-shot learning, surpassing traditional models such as BERT-CRF (62.11) and Med-BERT (60.66). Building upon this, we compiled a comprehensive sepsis knowledge graph comprising 1894 nodes and 2021 distinct relationships. Conclusions: This study represents a pioneering effort in using LLMs, particularly GPT-4.0, to construct a comprehensive sepsis knowledge graph. The innovative application of prompt engineering, combined with the integration of multicenter real-world data, has significantly enhanced the efficiency and accuracy of knowledge graph construction.
The resulting knowledge graph provides a robust framework for understanding sepsis, supporting clinical decision-making, and facilitating further research. The success of this approach underscores the potential of LLMs in medical research and sets a new benchmark for future studies in sepsis and other complex medical conditions. %M 40146985 %R 10.2196/65537 %U https://www.jmir.org/2025/1/e65537 %U https://doi.org/10.2196/65537 %U http://www.ncbi.nlm.nih.gov/pubmed/40146985 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e67363 %T Generative Large Language Model—Powered Conversational AI App for Personalized Risk Assessment: Case Study in COVID-19 %A Roshani,Mohammad Amin %A Zhou,Xiangyu %A Qiang,Yao %A Suresh,Srinivasan %A Hicks,Steven %A Sethuraman,Usha %A Zhu,Dongxiao %+ Department of Computer Science, Wayne State University, 5057 Woodward Ave, Suite 14101.3, Detroit, MI, 48202, United States, 1 3135773104, dzhu@wayne.edu %K personalized risk assessment %K large language model %K conversational AI %K artificial intelligence %K COVID-19 %D 2025 %7 27.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: Large language models (LLMs) have demonstrated powerful capabilities in natural language tasks and are increasingly being integrated into health care for tasks like disease risk assessment. Traditional machine learning methods rely on structured data and coding, limiting their flexibility in dynamic clinical environments. This study presents a novel approach to disease risk assessment using generative LLMs through conversational artificial intelligence (AI), eliminating the need for programming. Objective: This study evaluates the use of pretrained generative LLMs, including LLaMA2-7b and Flan-T5-xl, for COVID-19 severity prediction with the goal of enabling a real-time, no-code, risk assessment solution through chatbot-based, question-answering interactions. To contextualize their performance, we compare LLMs with traditional machine learning classifiers, such as logistic regression, extreme gradient boosting (XGBoost), and random forest, which rely on tabular data. Methods: We fine-tuned LLMs using few-shot natural language examples from a dataset of 393 pediatric patients, developing a mobile app that integrates these models to provide real-time, no-code, COVID-19 severity risk assessment through clinician-patient interaction. The LLMs were compared with traditional classifiers across different experimental settings, using the area under the curve (AUC) as the primary evaluation metric. Feature importance derived from LLM attention layers was also analyzed to enhance interpretability. Results: Generative LLMs demonstrated strong performance in low-data settings. In zero-shot scenarios, the T0-3b-T model achieved an AUC of 0.75, while other LLMs, such as T0pp(8bit)-T and Flan-T5-xl-T, reached 0.67 and 0.69, respectively. At 2-shot settings, logistic regression and random forest achieved an AUC of 0.57, while Flan-T5-xl-T and T0-3b-T obtained 0.69 and 0.65, respectively. By 32-shot settings, Flan-T5-xl-T reached 0.70, similar to logistic regression (0.69) and random forest (0.68), while XGBoost improved to 0.65. These results illustrate the differences in how generative LLMs and traditional models handle the increasing data availability. LLMs perform well in low-data scenarios, whereas traditional models rely more on structured tabular data and labeled training examples. 
Furthermore, the mobile app provides real-time, COVID-19 severity assessments and personalized insights through attention-based feature importance, adding value to the clinical interpretation of the results. Conclusions: Generative LLMs provide a robust alternative to traditional classifiers, particularly in scenarios with limited labeled data. Their ability to handle unstructured inputs and deliver personalized, real-time assessments without coding makes them highly adaptable to clinical settings. This study underscores the potential of LLM-powered conversational artificial intelligence (AI) in health care and encourages further exploration of its use for real-time, disease risk assessment and decision-making support. %M 40146990 %R 10.2196/67363 %U https://ai.jmir.org/2025/1/e67363 %U https://doi.org/10.2196/67363 %U http://www.ncbi.nlm.nih.gov/pubmed/40146990 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e69820 %T Prompt Engineering an Informational Chatbot for Education on Mental Health Using a Multiagent Approach for Enhanced Compliance With Prompt Instructions: Algorithm Development and Validation %A Waaler,Per Niklas %A Hussain,Musarrat %A Molchanov,Igor %A Bongo,Lars Ailo %A Elvevåg,Brita %+ Department of Computer Science, UiT The Arctic University of Norway, Backgatan 35, Södra Sandby, Lund, 24731, Sweden, 46 94444096, pwa011@uit.no %K schizophrenia %K mental health %K prompt engineering %K AI in health care %K AI safety %K self-reflection %K limiting scope of AI %K large language model %K LLM %K GPT-4 %K AI transparency %K adaptive learning %D 2025 %7 26.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: People with schizophrenia often present with cognitive impairments that may hinder their ability to learn about their condition. Education platforms powered by large language models (LLMs) have the potential to improve the accessibility of mental health information. However, the black-box nature of LLMs raises ethical and safety concerns regarding the controllability of chatbots. In particular, prompt-engineered chatbots may drift from their intended role as the conversation progresses and become more prone to hallucinations. Objective: This study aimed to develop and evaluate a critical analysis filter (CAF) system that ensures that an LLM-powered prompt-engineered chatbot reliably complies with its predefined instructions and scope while delivering validated mental health information. Methods: For a proof of concept, we prompt engineered an educational chatbot for schizophrenia powered by GPT-4 that could dynamically access information from a schizophrenia manual written for people with schizophrenia and their caregivers. In the CAF, a team of prompt-engineered LLM agents was used to critically analyze and refine the chatbot’s responses and deliver real-time feedback to the chatbot. To assess the ability of the CAF to re-establish the chatbot’s adherence to its instructions, we generated 3 conversations (by conversing with the chatbot with the CAF disabled) wherein the chatbot started to drift from its instructions toward various unintended roles. We used these checkpoint conversations to initialize automated conversations between the chatbot and adversarial chatbots designed to entice it toward unintended roles. Conversations were repeatedly sampled with the CAF enabled and disabled. 
In total, 3 human raters independently rated each chatbot response according to criteria developed to measure the chatbot’s integrity, specifically, its transparency (such as admitting when a statement lacked explicit support from its scripted sources) and its tendency to faithfully convey the scripted information in the schizophrenia manual. Results: In total, 36 responses (3 different checkpoint conversations, 3 conversations per checkpoint, and 4 adversarial queries per conversation) were rated for compliance with the CAF enabled and disabled. Activating the CAF resulted in a compliance score that was considered acceptable (≥2) in 81% (7/36) of the responses, compared to only 8.3% (3/36) when the CAF was deactivated. Conclusions: Although more rigorous testing in realistic scenarios is needed, our results suggest that self-reflection mechanisms could enable LLMs to be used effectively and safely in educational mental health platforms. This approach harnesses the flexibility of LLMs while reliably constraining their scope to appropriate and accurate interactions. %M 39992720 %R 10.2196/69820 %U https://ai.jmir.org/2025/1/e69820 %U https://doi.org/10.2196/69820 %U http://www.ncbi.nlm.nih.gov/pubmed/39992720 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e56973 %T The Impact of ChatGPT Exposure on User Interactions With a Motivational Interviewing Chatbot: Quasi-Experimental Study %A Zhu,Jiading %A Dong,Alec %A Wang,Cindy %A Veldhuizen,Scott %A Abdelwahab,Mohamed %A Brown,Andrew %A Selby,Peter %A Rose,Jonathan %K chatbot %K digital health %K motivational interviewing %K natural language processing %K ChatGPT %K large language models %K artificial intelligence %K experimental %K smoking cessation %K conversational agent %D 2025 %7 21.3.2025 %9 %J JMIR Form Res %G English %X Background: The worldwide introduction of ChatGPT in November 2022 may have changed how its users perceive and interact with other chatbots. This possibility may confound the comparison of responses to pre-ChatGPT and post-ChatGPT iterations of pre-existing chatbots, in turn affecting the direction of their evolution. Before the release of ChatGPT, we created a therapeutic chatbot, MIBot, whose goal is to use motivational interviewing to guide smokers toward making the decision to quit smoking. We were concerned that measurements going forward would not be comparable to those in the past, impacting the evaluation of future changes to the chatbot. Objective: The aim of the study is to explore changes in how users interact with MIBot after the release of ChatGPT and examine the relationship between these changes and users’ familiarity with ChatGPT. Methods: We compared user interactions with MIBot prior to ChatGPT’s release and 6 months after the release. Participants (N=143) were recruited through a web-based platform in November of 2022, prior to the release of ChatGPT, to converse with MIBot, in an experiment we refer to as MIBot (version 5.2). In May 2023, a set of (n=129) different participants were recruited to interact with the same version of MIBot and asked additional questions about their familiarity with ChatGPT, in the experiment called MIBot (version 5.2A). We used the Mann-Whitney U test to compare metrics between cohorts and Spearman rank correlation to assess relationships between familiarity with ChatGPT and other metrics within the MIBot (version 5.2A) cohort. 
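As a brief illustration of the two statistical tests named in the Methods above (Mann-Whitney U and Spearman rank correlation), here is a minimal SciPy sketch on invented placeholder data; only the cohort sizes mirror the abstract, and all values are synthetic.

```python
import numpy as np
from scipy.stats import mannwhitneyu, spearmanr

rng = np.random.default_rng(0)
satisfaction_pre = rng.normal(4.0, 0.8, 143)   # hypothetical pre-ChatGPT cohort scores
satisfaction_post = rng.normal(3.7, 0.8, 129)  # hypothetical post-ChatGPT cohort scores

# Between-cohort comparison of satisfaction scores
u_stat, p_value = mannwhitneyu(satisfaction_pre, satisfaction_post, alternative="two-sided")
print(f"Mann-Whitney U={u_stat:.1f}, P={p_value:.3f}")

# Within-cohort association between ChatGPT familiarity and response length
familiarity = rng.integers(0, 5, 129)                            # hypothetical familiarity ratings
response_length = 20 + 3 * familiarity + rng.normal(0, 5, 129)   # hypothetical word counts
rho, p_rho = spearmanr(familiarity, response_length)
print(f"Spearman rho={rho:.3f}, P={p_rho:.3f}")
```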
Results: In total, 83(64.3%) participants in the MIBot (version 5.2A) cohort had used ChatGPT, with 66 (51.2%) using it on a regular basis. Satisfaction with MIBot was significantly lower in the post-ChatGPT cohort (U=11,331.0; P=.001), driven by a decrease in perceived empathy as measured by the Average Consultation and Relational Empathy Measure (U=10,838.0; P=.01). Familiarity with ChatGPT was positively correlated with average response length (ρ=0.181; P=.04) and change in perceived importance of quitting smoking (ρ=0.296; P<.001). Conclusions: The widespread reach of ChatGPT has changed how users interact with MIBot. Post-ChatGPT users are less satisfied with MIBot overall, particularly in terms of perceived empathy. However, users with greater familiarity with ChatGPT provide longer responses and demonstrated a greater increase in their perceived importance of quitting smoking after a session with MIBot. These findings suggest the need for chatbot developers to adapt to evolving user expectations in the era of advanced generative artificial intelligence. %R 10.2196/56973 %U https://formative.jmir.org/2025/1/e56973 %U https://doi.org/10.2196/56973 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 8 %N %P e67551 %T Evaluating the Diagnostic Accuracy of ChatGPT-4 Omni and ChatGPT-4 Turbo in Identifying Melanoma: Comparative Study %A Sattler,Samantha S. %A Chetla,Nitin %A Chen,Matthew %A Hage,Tamer Rajai %A Chang,Joseph %A Guo,William Young %A Hugh,Jeremy %K melanoma %K skin cancer %K chatGPT %K chat-GPT %K chatbot %K dermatology %K cancer %K oncology %K metastases %K diagnostic %K diagnosis %K lesion %K efficacy %K machine learning %K ML %K artificial intelligence %K AI %K algorithm %K model %K analytics %D 2025 %7 21.3.2025 %9 %J JMIR Dermatol %G English %X ChatGPT is increasingly used in healthcare. Fields like dermatology and radiology could benefit from ChatGPT’s ability to help clinicians diagnose skin lesions. This study evaluates the accuracy of ChatGPT in diagnosing melanoma. Our analysis indicates that ChatGPT cannot be used reliably to diagnose melanoma, and further improvements are needed to reach this capability. %R 10.2196/67551 %U https://derma.jmir.org/2025/1/e67551 %U https://doi.org/10.2196/67551 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67967 %T Large Language Model–Based Assessment of Clinical Reasoning Documentation in the Electronic Health Record Across Two Institutions: Development and Validation Study %A Schaye,Verity %A DiTullio,David %A Guzman,Benedict Vincent %A Vennemeyer,Scott %A Shih,Hanniel %A Reinstein,Ilan %A Weber,Danielle E %A Goodman,Abbie %A Wu,Danny T Y %A Sartori,Daniel J %A Santen,Sally A %A Gruppen,Larry %A Aphinyanaphongs,Yindalon %A Burk-Rafel,Jesse %+ Institute for Innovations in Medical Education, NYU Grossman School of Medicine, 550 First Avenue, MS G 61, New York, NY, 10016, United States, 1 212 263 3006, verity.schaye@nyulangone.org %K large language models %K artificial intelligence %K clinical reasoning %K documentation %K assessment %K feedback %K electronic health record %D 2025 %7 21.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Clinical reasoning (CR) is an essential skill; yet, physicians often receive limited feedback. Artificial intelligence holds promise to fill this gap. 
Objective: We report the development of named entity recognition (NER), logic-based and large language model (LLM)–based assessments of CR documentation in the electronic health record across 2 institutions (New York University Grossman School of Medicine [NYU] and University of Cincinnati College of Medicine [UC]). Methods: The note corpus consisted of internal medicine resident admission notes (retrospective set: July 2020-December 2021, n=700 NYU and 450 UC notes; prospective validation set: July 2023-December 2023, n=155 NYU and 92 UC notes). Clinicians rated CR documentation quality in each note using a previously validated tool (Revised-IDEA), on 3-point scales across 2 domains: differential diagnosis (D0, D1, and D2) and explanation of reasoning (EA0, EA1, and EA2). At NYU, the retrospective set was annotated for NER for 5 entities (diagnosis, diagnostic category, prioritization of diagnosis language, data, and linkage terms). Models were developed using different artificial intelligence approaches: (1) the NER, logic-based model, a large word vector model (scispaCy en_core_sci_lg) with model weights adjusted through backpropagation from annotations, developed at NYU with external validation at UC; (2) the NYUTron LLM, an NYU-internal 110 million parameter LLM pretrained on 7.25 million clinical notes, validated only at NYU; and (3) the GatorTron LLM, an open-source 345 million parameter LLM pretrained on 82 billion words of clinical text, fine-tuned on the NYU retrospective sets, then externally validated and further fine-tuned at UC. Model performance was assessed in the prospective sets with F1-scores for the NER, logic-based model, and with the area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC) for the LLMs. Results: At NYU, the NYUTron LLM performed best: the D0 and D2 models had AUROC/AUPRC 0.87/0.79 and 0.89/0.86, respectively. The D1, EA0, and EA1 models had insufficient performance for implementation (AUROC range 0.57-0.80, AUPRC range 0.33-0.63). For the D1 classification, the approach pivoted to a stepwise approach taking advantage of the more performant D0 and D2 models. For the EA model, the approach pivoted to a binary EA2 model (ie, EA2 vs not EA2) with excellent performance (AUROC/AUPRC 0.85/0.80). At UC, the NER, D-logic–based model was the best-performing D model (F1-scores 0.80, 0.74, and 0.80 for D0, D1, and D2, respectively). The GatorTron LLM performed best for EA2 scores (AUROC/AUPRC 0.75/0.69). Conclusions: This is the first multi-institutional study to apply LLMs for assessing CR documentation in the electronic health record. Such tools can enhance feedback on CR. Lessons learned by implementing these models at distinct institutions support the generalizability of this approach.
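For readers less familiar with the AUROC and AUPRC metrics reported above, the following is a minimal scikit-learn sketch on synthetic labels and scores; the data and the binary framing (e.g., D2 vs not D2) are illustrative assumptions, not the study's notes or model outputs.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score

rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, 200)  # hypothetical note-level labels (e.g., D2 vs not D2)
# Hypothetical predicted probabilities, loosely correlated with the labels
y_score = np.clip(y_true * 0.6 + rng.normal(0.3, 0.25, 200), 0, 1)

auroc = roc_auc_score(y_true, y_score)            # area under the ROC curve
auprc = average_precision_score(y_true, y_score)  # common estimator of the area under the precision-recall curve
print(f"AUROC={auroc:.2f}, AUPRC={auprc:.2f}")
```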
%M 40117575 %R 10.2196/67967 %U https://www.jmir.org/2025/1/e67967 %U https://doi.org/10.2196/67967 %U http://www.ncbi.nlm.nih.gov/pubmed/40117575 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58375 %T Performance of Plug-In Augmented ChatGPT and Its Ability to Quantify Uncertainty: Simulation Study on the German Medical Board Examination %A Madrid,Julian %A Diehl,Philipp %A Selig,Mischa %A Rolauffs,Bernd %A Hans,Felix Patricius %A Busch,Hans-Jörg %A Scheef,Tobias %A Benning,Leo %K medical education %K artificial intelligence %K generative AI %K large language model %K LLM %K ChatGPT %K GPT-4 %K board licensing examination %K professional education %K examination %K student %K experimental %K bootstrapping %K confidence interval %D 2025 %7 21.3.2025 %9 %J JMIR Med Educ %G English %X Background: GPT-4 is a large language model (LLM) trained and fine-tuned on an extensive dataset. After the public release of its predecessor in November 2022, the use of LLMs has seen a significant spike in interest, and a multitude of potential use cases have been proposed. In parallel, however, important limitations have been outlined. In particular, current LLMs encounter limitations in symbolic representation and in accessing contemporary data. The recent version of GPT-4, alongside newly released plugin features, has been introduced to mitigate some of these limitations. Objective: Against this background, this work aims to investigate the performance of GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins using pretranslated English text on the German medical board examination. Recognizing the critical importance of quantifying uncertainty for LLM applications in medicine, we furthermore assess this ability and develop a new metric termed “confidence accuracy” to evaluate it. Methods: We used GPT-3.5, GPT-4, GPT-4 with plugins, and GPT-4 with plugins and translation to answer questions from the German medical board examination. Additionally, we conducted an analysis to assess how the models justify their answers, the accuracy of their responses, and the error structure of their answers. Bootstrapping and CIs were used to evaluate the statistical significance of our findings. Results: This study demonstrated that available GPT models, as LLM examples, exceeded the minimum competency threshold established by the German medical board for medical students to obtain board certification to practice medicine. Moreover, the models could assess the uncertainty in their responses, albeit exhibiting overconfidence. Additionally, this work unraveled certain justification and reasoning structures that emerge when GPT generates answers. Conclusions: The high performance of GPTs in answering medical questions positions them well for applications in academia and, potentially, clinical practice. Their capability to quantify uncertainty in answers suggests they could be valuable artificial intelligence agents within the clinical decision-making loop. Nevertheless, significant challenges must be addressed before artificial intelligence agents can be robustly and safely implemented in the medical domain.
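The abstract above reports bootstrapped CIs for examination performance; as a hedged illustration of that resampling step, a nonparametric bootstrap CI for per-question accuracy can be computed as follows (the correctness data are placeholders, not the study's results).

```python
import numpy as np

rng = np.random.default_rng(1)
correct = rng.integers(0, 2, 300)  # hypothetical per-question correctness (1 = correct)

# Resample questions with replacement and recompute accuracy each time
boot_accs = [
    rng.choice(correct, size=correct.size, replace=True).mean()
    for _ in range(10_000)
]
lower, upper = np.percentile(boot_accs, [2.5, 97.5])
print(f"accuracy={correct.mean():.3f}, 95% bootstrap CI=({lower:.3f}, {upper:.3f})")
```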
%R 10.2196/58375 %U https://mededu.jmir.org/2025/1/e58375 %U https://doi.org/10.2196/58375 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 8 %N %P e67299 %T Assessing the Diagnostic Accuracy of ChatGPT-4 in Identifying Diverse Skin Lesions Against Squamous and Basal Cell Carcinoma %A Chetla,Nitin %A Chen,Matthew %A Chang,Joseph %A Smith,Aaron %A Hage,Tamer Rajai %A Patel,Romil %A Gardner,Alana %A Bryer,Bridget %K chatbot %K ChatGPT %K ChatGPT-4 %K squamous cell carcinoma %K basal cell carcinoma %K skin cancer %K skin cancer detection %K dermatoscopic image analysis %K skin lesion differentiation %K dermatologist %K machine learning %K ML %K artificial intelligence %K AI %K AI in dermatology %K algorithm %K model %K analytics %K diagnostic accuracy %D 2025 %7 21.3.2025 %9 %J JMIR Dermatol %G English %X Our study evaluates the diagnostic accuracy of ChatGPT-4o in classifying various skin lesions, highlighting its limitations in distinguishing squamous cell carcinoma from basal cell carcinoma using dermatoscopic images. %R 10.2196/67299 %U https://derma.jmir.org/2025/1/e67299 %U https://doi.org/10.2196/67299 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e70222 %T Using AI to Translate and Simplify Spanish Orthopedic Medical Text: Instrument Validation Study %A Andalib,Saman %A Spina,Aidin %A Picton,Bryce %A Solomon,Sean S %A Scolaro,John A %A Nelson,Ariana M %K large language models %K LLM %K patient education %K translation %K bilingual evaluation understudy %K GPT-4 %K Google Translate %D 2025 %7 21.3.2025 %9 %J JMIR AI %G English %X Background: Language barriers contribute significantly to health care disparities in the United States, where a sizable proportion of patients are exclusively Spanish speakers. In orthopedic surgery, such barriers impact both patients’ comprehension of and patients’ engagement with available resources. Studies have explored the utility of large language models (LLMs) for medical translation but have yet to robustly evaluate artificial intelligence (AI)–driven translation and simplification of orthopedic materials for Spanish speakers. Objective: This study used the bilingual evaluation understudy (BLEU) method to assess translation quality and investigated the ability of AI to simplify patient education materials (PEMs) in Spanish. Methods: PEMs (n=78) from the American Academy of Orthopaedic Surgery were translated from English to Spanish, using 2 LLMs (GPT-4 and Google Translate). The BLEU methodology was applied to compare AI translations with professionally human-translated PEMs. The Friedman test and Dunn multiple comparisons test were used to statistically quantify differences in translation quality. A readability analysis and feature analysis were subsequently performed to evaluate text simplification success and the impact of English text features on BLEU scores. The capability of an LLM to simplify medical language written in Spanish was also assessed. Results: As measured by BLEU scores, GPT-4 showed moderate success in translating PEMs into Spanish but was less successful than Google Translate. Simplified PEMs demonstrated improved readability when compared to original versions (P<.001) but were unable to reach the targeted grade level for simplification. The feature analysis revealed that the total number of syllables and average number of syllables per sentence had the highest impact on BLEU scores. GPT-4 was able to significantly reduce the complexity of medical text written in Spanish (P<.001). 
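To illustrate the BLEU methodology referenced in the preceding abstract, here is a minimal corpus-level sketch using the sacrebleu package; the package choice and the Spanish sentence pairs are assumptions for demonstration only, not the study's materials.

```python
import sacrebleu

# Hypothetical machine translations of two patient-education sentences
machine_translations = [
    "El hombro congelado causa dolor y rigidez en la articulación.",
    "La fractura de cadera suele requerir cirugía.",
]
# One stream of professional human reference translations (one per hypothesis);
# additional reference streams could be appended to the outer list
human_references = [[
    "El hombro congelado provoca dolor y rigidez en la articulación.",
    "Una fractura de cadera generalmente requiere cirugía.",
]]

bleu = sacrebleu.corpus_bleu(machine_translations, human_references)
print(f"BLEU = {bleu.score:.1f}")  # 0-100 scale; higher means closer to the references
```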
Conclusions: Although Google Translate outperformed GPT-4 in translation accuracy, LLMs, such as GPT-4, may provide significant utility in translating medical texts into Spanish and simplifying such texts. We recommend considering a dual approach—using Google Translate for translation and GPT-4 for simplification—to improve medical information accessibility and orthopedic surgery education among Spanish-speaking patients. %R 10.2196/70222 %U https://ai.jmir.org/2025/1/e70222 %U https://doi.org/10.2196/70222 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e65729 %T Utility-based Analysis of Statistical Approaches and Deep Learning Models for Synthetic Data Generation With Focus on Correlation Structures: Algorithm Development and Validation %A Miletic,Marko %A Sariyar,Murat %+ Institute for Optimisation and Data Analysis (IODA), Bern University of Applied Sciences, Höheweg 80, Biel, 2502, Switzerland, 41 32 321 64 37, murat.sariyar@bfh.ch %K synthetic data generation %K medical data synthesis %K random forests %K simulation study %K deep learning %K propensity score mean-squared error %D 2025 %7 20.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: Recent advancements in Generative Adversarial Networks and large language models (LLMs) have significantly advanced the synthesis and augmentation of medical data. These and other deep learning–based methods offer promising potential for generating high-quality, realistic datasets crucial for improving machine learning applications in health care, particularly in contexts where data privacy and availability are limiting factors. However, challenges remain in accurately capturing the complex associations inherent in medical datasets. Objective: This study evaluates the effectiveness of various Synthetic Data Generation (SDG) methods in replicating the correlation structures inherent in real medical datasets. In addition, it examines their performance in downstream tasks using Random Forests (RFs) as the benchmark model. To provide a comprehensive analysis, alternative models such as eXtreme Gradient Boosting and Gated Additive Tree Ensembles are also considered. We compare the following SDG approaches: Synthetic Populations in R (synthpop), copula, copulagan, Conditional Tabular Generative Adversarial Network (ctgan), tabular variational autoencoder (tvae), and tabula for LLMs. Methods: We evaluated synthetic data generation methods using both real-world and simulated datasets. Simulated data consist of 10 Gaussian variables and one binary target variable with varying correlation structures, generated via Cholesky decomposition. Real-world datasets include the body performance dataset with 13,393 samples for fitness classification, the Wisconsin Breast Cancer dataset with 569 samples for tumor diagnosis, and the diabetes dataset with 768 samples for diabetes prediction. Data quality is evaluated by comparing correlation matrices, the propensity score mean-squared error (pMSE) for general utility, and F1-scores for downstream tasks as a specific utility metric, using training on synthetic data and testing on real data. Results: Our simulation study, supplemented with real-world data analyses, shows that the statistical methods copula and synthpop consistently outperform deep learning approaches across various sample sizes and correlation complexities, with synthpop being the most effective. Deep learning methods, including large LLMs, show mixed performance, particularly with smaller datasets or limited training epochs. 
LLMs often struggle to replicate numerical dependencies effectively. In contrast, methods like tvae with 10,000 epochs perform comparably well. On the body performance dataset, copulagan achieves the best performance in terms of pMSE. The results also highlight that model utility depends more on the relative correlations between features and the target variable than on the absolute magnitude of correlation matrix differences. Conclusions: Statistical methods, particularly synthpop, demonstrate superior robustness and utility preservation for synthetic tabular data compared with deep learning approaches. Copula methods show potential but face limitations with integer variables. Deep Learning methods underperform in this context. Overall, these findings underscore the dominance of statistical methods for synthetic data generation for tabular data, while highlighting the niche potential of deep learning approaches for highly complex datasets, provided adequate resources and tuning. %M 40112290 %R 10.2196/65729 %U https://ai.jmir.org/2025/1/e65729 %U https://doi.org/10.2196/65729 %U http://www.ncbi.nlm.nih.gov/pubmed/40112290 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 12 %N %P e57986 %T Exploring Biases of Large Language Models in the Field of Mental Health: Comparative Questionnaire Study of the Effect of Gender and Sexual Orientation in Anorexia Nervosa and Bulimia Nervosa Case Vignettes %A Schnepper,Rebekka %A Roemmel,Noa %A Schaefert,Rainer %A Lambrecht-Walzinger,Lena %A Meinlschmidt,Gunther %K anorexia nervosa %K artificial intelligence %K bulimia nervosa %K ChatGPT %K eating disorders %K LLM %K responsible AI %K transformer %K bias %K large language model %K gender %K vignette %K quality of life %K symptomatology %K questionnaire %K generative AI %K mental health %K AI %D 2025 %7 20.3.2025 %9 %J JMIR Ment Health %G English %X Background: Large language models (LLMs) are increasingly used in mental health, showing promise in assessing disorders. However, concerns exist regarding their accuracy, reliability, and fairness. Societal biases and underrepresentation of certain populations may impact LLMs. Because LLMs are already used for clinical practice, including decision support, it is important to investigate potential biases to ensure a responsible use of LLMs. Anorexia nervosa (AN) and bulimia nervosa (BN) show a lifetime prevalence of 1%‐2%, affecting more women than men. Among men, homosexual men face a higher risk of eating disorders (EDs) than heterosexual men. However, men are underrepresented in ED research, and studies on gender, sexual orientation, and their impact on AN and BN prevalence, symptoms, and treatment outcomes remain limited. Objectives: We aimed to estimate the presence and size of bias related to gender and sexual orientation produced by a common LLM as well as a smaller LLM specifically trained for mental health analyses, exemplified in the context of ED symptomatology and health-related quality of life (HRQoL) of patients with AN or BN. Methods: We extracted 30 case vignettes (22 AN and 8 BN) from scientific papers. We adapted each vignette to create 4 versions, describing a female versus male patient living with their female versus male partner (2 × 2 design), yielding 120 vignettes. 
We then fed each vignette into ChatGPT-4 and to “MentaLLaMA” based on the Large Language Model Meta AI (LLaMA) architecture thrice with the instruction to evaluate them by providing responses to 2 psychometric instruments, the RAND-36 questionnaire assessing HRQoL and the eating disorder examination questionnaire. With the resulting LLM-generated scores, we calculated multilevel models with a random intercept for gender and sexual orientation (accounting for within-vignette variance), nested in vignettes (accounting for between-vignette variance). Results: In ChatGPT-4, the multilevel model with 360 observations indicated a significant association with gender for the RAND-36 mental composite summary (conditional means: 12.8 for male and 15.1 for female cases; 95% CI of the effect –6.15 to −0.35; P=.04) but neither with sexual orientation (P=.71) nor with an interaction effect (P=.37). We found no indications for main effects of gender (conditional means: 5.65 for male and 5.61 for female cases; 95% CI –0.10 to 0.14; P=.88), sexual orientation (conditional means: 5.63 for heterosexual and 5.62 for homosexual cases; 95% CI –0.14 to 0.09; P=.67), or for an interaction effect (P=.61, 95% CI –0.11 to 0.19) for the eating disorder examination questionnaire overall score (conditional means 5.59‐5.65 95% CIs 5.45 to 5.7). MentaLLaMA did not yield reliable results. Conclusions: LLM-generated mental HRQoL estimates for AN and BN case vignettes may be biased by gender, with male cases scoring lower despite no real-world evidence supporting this pattern. This highlights the risk of bias in generative artificial intelligence in the field of mental health. Understanding and mitigating biases related to gender and other factors, such as ethnicity, and socioeconomic status are crucial for responsible use in diagnostics and treatment recommendations. %R 10.2196/57986 %U https://mental.jmir.org/2025/1/e57986 %U https://doi.org/10.2196/57986 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67677 %T Improving Dietary Supplement Information Retrieval: Development of a Retrieval-Augmented Generation System With Large Language Models %A Hou,Yu %A Bishop,Jeffrey R %A Liu,Hongfang %A Zhang,Rui %+ , Division of Computational Health Sciences, University of Minnesota, 11-132 Phillips-Wangensteen Building, 516 Delaware Street SE, Minneapolis, MN, 55455, United States, 1 6126261999, ruizhang@umn.edu %K dietary supplements %K knowledge representation %K knowledge graph %K retrieval-augmented generation %K large language model %K user interface %D 2025 %7 19.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Dietary supplements (DSs) are widely used to improve health and nutrition, but challenges related to misinformation, safety, and efficacy persist due to less stringent regulations compared with pharmaceuticals. Accurate and reliable DS information is critical for both consumers and health care providers to make informed decisions. Objective: This study aimed to enhance DS-related question answering by integrating an advanced retrieval-augmented generation (RAG) system with the integrated Dietary Supplement Knowledgebase 2.0 (iDISK2.0), a dietary supplement knowledge base, to improve accuracy and reliability. 
Methods: We developed iDISK2.0 by integrating updated data from authoritative sources, including the Natural Medicines Comprehensive Database, the Memorial Sloan Kettering Cancer Center database, Dietary Supplement Label Database, and Licensed Natural Health Products Database, and applied advanced data cleaning and standardization techniques to reduce noise. The RAG system combined the retrieval power of a biomedical knowledge graph with the generative capabilities of large language models (LLMs) to address limitations of stand-alone LLMs, such as hallucination. The system retrieves contextually relevant subgraphs from iDISK2.0 based on user queries, enabling accurate and evidence-based responses through a user-friendly interface. We evaluated the system using true-or-false and multiple-choice questions derived from the Memorial Sloan Kettering Cancer Center database and compared its performance with stand-alone LLMs. Results: iDISK2.0 integrates 174,317 entities across 7 categories, including 8091 dietary supplement ingredients; 163,806 dietary supplement products; 786 diseases; and 625 drugs, along with 6 types of relationships. The RAG system achieved an accuracy of 99% (990/1000) for true-or-false questions on DS effectiveness and 95% (948/1000) for multiple-choice questions on DS-drug interactions, substantially outperforming stand-alone LLMs like GPT-4o (OpenAI), which scored 62% (618/1000) and 52% (517/1000) on these respective tasks. The user interface enabled efficient interaction, supporting free-form text input and providing accurate responses. Integration strategies minimized data noise, ensuring access to up-to-date, DS-related information. Conclusions: By integrating a robust knowledge graph with RAG and LLM technologies, iDISK2.0 addresses the critical limitations of stand-alone LLMs in DS information retrieval. This study highlights the importance of combining structured data with advanced artificial intelligence methods to improve accuracy and reduce misinformation in health care applications. Future work includes extending the framework to broader biomedical domains and improving evaluation with real-world, open-ended queries. %M 40106799 %R 10.2196/67677 %U https://www.jmir.org/2025/1/e67677 %U https://doi.org/10.2196/67677 %U http://www.ncbi.nlm.nih.gov/pubmed/40106799 %0 Journal Article %@ 2563-6316 %I JMIR Publications %V 6 %N %P e65263 %T Large Language Models for Pediatric Differential Diagnoses in Rural Health Care: Multicenter Retrospective Cohort Study Comparing GPT-3 With Pediatrician Performance %A Mansoor,Masab %A Ibrahim,Andrew F %A Grindem,David %A Baig,Asad %K natural language processing %K NLP %K machine learning %K ML %K artificial intelligence %K language model %K large language model %K LLM %K generative pretrained transformer %K GPT %K pediatrics %D 2025 %7 19.3.2025 %9 %J JMIRx Med %G English %X Background: Rural health care providers face unique challenges such as limited specialist access and high patient volumes, making accurate diagnostic support tools essential. Large language models like GPT-3 have demonstrated potential in clinical decision support but remain understudied in pediatric differential diagnosis. Objective: This study aims to evaluate the diagnostic accuracy and reliability of a fine-tuned GPT-3 model compared to board-certified pediatricians in rural health care settings.
Methods: This multicenter retrospective cohort study analyzed 500 pediatric encounters (ages 0‐18 years; n=261, 52.2% female) from rural health care organizations in Central Louisiana between January 2020 and December 2021. The GPT-3 model (DaVinci version) was fine-tuned using the OpenAI application programming interface and trained on 350 encounters, with 150 reserved for testing. Five board-certified pediatricians (mean experience: 12, SD 5.8 years) provided reference standard diagnoses. Model performance was assessed using accuracy, sensitivity, specificity, and subgroup analyses. Results: The GPT-3 model achieved an accuracy of 87.3% (131/150 cases), sensitivity of 85% (95% CI 82%‐88%), and specificity of 90% (95% CI 87%‐93%), comparable to pediatricians’ accuracy of 91.3% (137/150 cases; P=.47). Performance was consistent across age groups (0‐5 years: 54/62, 87%; 6‐12 years: 47/53, 89%; 13‐18 years: 30/35, 86%) and common complaints (fever: 36/39, 92%; abdominal pain: 20/23, 87%). For rare diagnoses (n=20), accuracy was slightly lower (16/20, 80%) but comparable to pediatricians (17/20, 85%; P=.62). Conclusions: This study demonstrates that a fine-tuned GPT-3 model can provide diagnostic support comparable to pediatricians, particularly for common presentations, in rural health care. Further validation in diverse populations is necessary before clinical implementation. %R 10.2196/65263 %U https://xmed.jmir.org/2025/1/e65263 %U https://doi.org/10.2196/65263 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e63347 %T Using ChatGPT to Improve the Presentation of Plain Language Summaries of Cochrane Systematic Reviews About Oncology Interventions: Cross-Sectional Study %A Šuto Pavičić,Jelena %A Marušić,Ana %A Buljan,Ivan %K health literacy %K patient education %K health communication %K ChatGPT %K neoplasms %K Cochrane %K oncology %K plain language %K medical information %K decision-making %K large language model %K artificial intelligence %K AI %D 2025 %7 19.3.2025 %9 %J JMIR Cancer %G English %X Background: Plain language summaries (PLSs) of Cochrane systematic reviews are a simple format for presenting medical information to the lay public. This is particularly important in oncology, where patients have a more active role in decision-making. However, current PLS formats often exceed the readability requirements for the general population. There is still a lack of cost-effective and more automated solutions to this problem. Objective: This study assessed whether a large language model (eg, ChatGPT) can improve the readability and linguistic characteristics of Cochrane PLSs about oncology interventions, without changing evidence synthesis conclusions. Methods: The dataset included 275 scientific abstracts and corresponding PLSs of Cochrane systematic reviews about oncology interventions. ChatGPT-4 was tasked to make each scientific abstract into a PLS using 3 prompts as follows: (1) rewrite this scientific abstract into a PLS to achieve a Simple Measure of Gobbledygook (SMOG) index of 6, (2) rewrite the PLS from prompt 1 so it is more emotional, and (3) rewrite this scientific abstract so it is easier to read and more appropriate for the lay audience. ChatGPT-generated PLSs were analyzed for word count, level of readability (SMOG index), and linguistic characteristics using Linguistic Inquiry and Word Count (LIWC) software and compared with the original PLSs. 
Two independent assessors reviewed the conclusiveness categories of ChatGPT-generated PLSs and compared them with original abstracts to evaluate consistency. The conclusion of each abstract about the efficacy and safety of the intervention was categorized as conclusive (positive/negative/equal), inconclusive, or unclear. Group comparisons were conducted using the Friedman nonparametric test. Results: ChatGPT-generated PLSs using the first prompt (SMOG index 6) were the shortest and easiest to read, with a median SMOG score of 8.2 (95% CI 8‐8.4), compared with the original PLSs (median SMOG score 13.1, 95% CI 12.9‐13.4). These PLSs had a median word count of 240 (95% CI 232‐248) compared with the original PLSs’ median word count of 364 (95% CI 339‐388). The second prompt (emotional tone) generated PLSs with a median SMOG score of 11.4 (95% CI 11.1‐12), again lower than the original PLSs. PLSs produced with the third prompt (write simpler and easier) had a median SMOG score of 8.7 (95% CI 8.4‐8.8). ChatGPT-generated PLSs across all prompts demonstrated reduced analytical tone and increased authenticity, clout, and emotional tone compared with the original PLSs. Importantly, the conclusiveness categorization of the original abstracts was unchanged in the ChatGPT-generated PLSs. Conclusions: ChatGPT can be a valuable tool in simplifying PLSs as medically related formats for lay audiences. More research is needed, including oversight mechanisms to ensure that the information is accurate, reliable, and culturally relevant for different audiences. %R 10.2196/63347 %U https://cancer.jmir.org/2025/1/e63347 %U https://doi.org/10.2196/63347 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58897 %T Performance of ChatGPT-4 on Taiwanese Traditional Chinese Medicine Licensing Examinations: Cross-Sectional Study %A Tseng,Liang-Wei %A Lu,Yi-Chin %A Tseng,Liang-Chi %A Chen,Yu-Chun %A Chen,Hsing-Yu %K artificial intelligence %K AI language understanding tools %K ChatGPT %K natural language processing %K machine learning %K Chinese medicine license exam %K Chinese medical licensing examination %K medical education %K traditional Chinese medicine %K large language model %D 2025 %7 19.3.2025 %9 %J JMIR Med Educ %G English %X Background: The integration of artificial intelligence (AI), notably ChatGPT, into medical education, has shown promising results in various medical fields. Nevertheless, its efficacy in traditional Chinese medicine (TCM) examinations remains understudied. Objective: This study aims to (1) assess the performance of ChatGPT on the TCM licensing examination in Taiwan and (2) evaluate the model’s explainability in answering TCM-related questions to determine its suitability as a TCM learning tool. Methods: We used the GPT-4 model to respond to 480 questions from the 2022 TCM licensing examination. This study compared the performance of the model against that of licensed TCM doctors using 2 approaches, namely direct answer selection and provision of explanations before answer selection. The accuracy and consistency of AI-generated responses were analyzed. Moreover, a breakdown of question characteristics was performed based on the cognitive level, depth of knowledge, types of questions, vignette style, and polarity of questions. Results: ChatGPT achieved an overall accuracy of 43.9%, which was lower than that of 2 human participants (70% and 78.4%). The analysis did not reveal a significant correlation between the accuracy of the model and the characteristics of the questions. 
An in-depth examination indicated that errors predominantly resulted from a misunderstanding of TCM concepts (55.3%), emphasizing the limitations of the model with regard to its TCM knowledge base and reasoning capability. Conclusions: Although ChatGPT shows promise as an educational tool, its current performance on TCM licensing examinations is lacking. This highlights the need for enhancing AI models with specialized TCM training and suggests a cautious approach to utilizing AI for TCM education. Future research should focus on model improvement and the development of tailored educational applications to support TCM learning. %R 10.2196/58897 %U https://mededu.jmir.org/2025/1/e58897 %U https://doi.org/10.2196/58897 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e72190 %T Examining Multimodal AI Resources in Medical Education: The Role of Immersion, Motivation, and Fidelity in AI Narrative Learning %A Jacobs,Chris %K artificial intelligence %K cinematic clinical narrative %K cinemeducation %K medical education %K narrative learning %K AI %K medical students %K preclinical education %K long-term retention %K pharmacology %K AI tools %K GPT-4 %K image %K applicability %K CCN %D 2025 %7 18.3.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/72190 %U https://mededu.jmir.org/2025/1/e72190 %U https://doi.org/10.2196/72190 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e72336 %T Author’s Reply: Examining Multimodal AI Resources in Medical Education: The Role of Immersion, Motivation, and Fidelity in AI Narrative Learning %A Bland,Tyler %K artificial intelligence %K cinematic clinical narrative %K cinemeducation %K medical education %K narrative learning %K pharmacology %K AI %K medical students %K preclinical education %K long-term retention %K AI tools %K GPT-4 %K image %K applicability %K CCN %D 2025 %7 18.3.2025 %9 %J JMIR Med Educ %G English %X %R 10.2196/72336 %U https://mededu.jmir.org/2025/1/e72336 %U https://doi.org/10.2196/72336 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e55709 %T Impact of Clinical Decision Support Systems on Medical Students’ Case-Solving Performance: Comparison Study with a Focus Group %A Montagna,Marco %A Chiabrando,Filippo %A De Lorenzo,Rebecca %A Rovere Querini,Patrizia %A , %K chatGPT %K chatbot %K machine learning %K ML %K artificial intelligence %K AI %K algorithm %K predictive model %K predictive analytics %K predictive system %K practical model %K deep learning %K large language models %K LLMs %K medical education %K medical teaching %K teaching environment %K clinical decision support systems %K CDSS %K decision support %K decision support tool %K clinical decision-making %K innovative teaching %D 2025 %7 18.3.2025 %9 %J JMIR Med Educ %G English %X Background: Health care practitioners use clinical decision support systems (CDSS) as an aid in the crucial task of clinical reasoning and decision-making. Traditional CDSS are online repositories (ORs) and clinical practice guidelines (CPG). Recently, large language models (LLMs) such as ChatGPT have emerged as potential alternatives. They have proven to be powerful, innovative tools, yet they are not devoid of worrisome risks. Objective: This study aims to explore how medical students perform in an evaluated clinical case through the use of different CDSS tools. 
Methods: The authors randomly divided medical students into 3 groups (CPG: n=6, 38%; OR: n=5, 31%; ChatGPT: n=5, 31%) and assigned each group a different type of CDSS for guidance in answering prespecified questions, assessing how the students’ speed and ability in resolving the same clinical case varied accordingly. External reviewers evaluated all answers based on accuracy and completeness metrics (score: 1‐5). The authors analyzed and categorized group scores according to the skill investigated: differential diagnosis, diagnostic workup, and clinical decision-making. Results: Answering time showed a trend for the ChatGPT group to be the fastest. The mean scores for completeness were as follows: CPG 4.0, OR 3.7, and ChatGPT 3.8 (P=.49). The mean scores for accuracy were as follows: CPG 4.0, OR 3.3, and ChatGPT 3.7 (P=.02). When scores were aggregated according to the 3 skill domains, differences among the groups emerged more clearly, with the CPG group performing best in nearly all domains and maintaining almost perfect alignment between its completeness and accuracy scores. Conclusions: This hands-on session provided valuable insights into the potential perks and associated pitfalls of LLMs in medical education and practice. It suggested the critical need for medical degree courses to teach students how to properly take advantage of LLMs, as the potential for misuse is evident and real. %R 10.2196/55709 %U https://mededu.jmir.org/2025/1/e55709 %U https://doi.org/10.2196/55709 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66279 %T Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study %A Šuvalov,Hendrik %A Lepson,Mihkel %A Kukk,Veronika %A Malk,Maria %A Ilves,Neeme %A Kuulmets,Hele-Andra %A Kolde,Raivo %+ Institute of Computer Science, University of Tartu, Narva mnt 28, Tartu, 51009, Estonia, 372 7375100, hendrik.suvalov@ut.ee %K natural language processing %K named entity recognition %K large language model %K synthetic data %K LLM %K NLP %K machine learning %K artificial intelligence %K language model %K NER %K medical entity %K Estonian %K health care data %K annotated data %K data annotation %K clinical decision support %K data mining %D 2025 %7 18.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have proven to be promising in understanding text from any language or domain. Objective: This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy.
Methods: Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance. Results: The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F1-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures. Conclusions: This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method’s applicability to other domains and languages. %M 40101227 %R 10.2196/66279 %U https://www.jmir.org/2025/1/e66279 %U https://doi.org/10.2196/66279 %U http://www.ncbi.nlm.nih.gov/pubmed/40101227 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70481 %T How to Design, Create, and Evaluate an Instruction-Tuning Dataset for Large Language Model Training in Health Care: Tutorial From a Clinical Perspective %A Nazar,Wojciech %A Nazar,Grzegorz %A Kamińska,Aleksandra %A Danilowicz-Szymanowicz,Ludmila %+ Department of Allergology, Faculty of Medicine, Gdańsk Medical University, Smoluchowskiego 17, Gdansk, 80-214, Poland, 48 585844300, wojciech.nazar@gumed.edu.pl %K generative artificial intelligence %K large language models %K instruction-tuning datasets %K tutorials %K evaluation framework %K health care %D 2025 %7 18.3.2025 %9 Tutorial %J J Med Internet Res %G English %X High-quality data are critical in health care, forming the cornerstone for accurate diagnoses, effective treatment plans, and reliable conclusions. Similarly, high-quality datasets underpin the development and performance of large language models (LLMs). Among these, instruction-tuning datasets (ITDs) used for instruction fine-tuning have been pivotal in enhancing LLM performance and generalization capabilities across diverse tasks. This tutorial provides a comprehensive guide to designing, creating, and evaluating ITDs for health care applications. Written from a clinical perspective, it aims to make the concepts accessible to a broad audience, especially medical practitioners. Key topics include identifying useful data sources, defining the characteristics of well-designed datasets, and crafting high-quality instruction-input-output examples. We explore practical approaches to dataset construction, examining the advantages and limitations of 3 primary methods: fully manual preparation by expert annotators, fully synthetic generation using artificial intelligence (AI), and an innovative hybrid approach in which experts draft the initial dataset and AI generates additional data. 
Moreover, we discuss strategies for metadata selection and human evaluation to ensure the quality and effectiveness of ITDs. By integrating these elements, this tutorial provides a structured framework for establishing ITDs. It bridges technical and clinical domains, supporting the continued interdisciplinary advancement of AI in medicine. Additionally, we address the limitations of current practices and propose future directions, emphasizing the need for a global, unified framework for ITDs. We also argue that artificial general intelligence (AGI), if realized, will not replace empirical research in medicine. AGI will depend on human-curated datasets to process and apply medical knowledge. At the same time, ITDs will likely remain the most effective method of supplying this knowledge to AGI, positioning them as a critical tool in AI-driven health care. %M 40100270 %R 10.2196/70481 %U https://www.jmir.org/2025/1/e70481 %U https://doi.org/10.2196/70481 %U http://www.ncbi.nlm.nih.gov/pubmed/40100270 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67033 %T Prompt Framework for Extracting Scale-Related Knowledge Entities from Chinese Medical Literature: Development and Evaluation Study %A Hao,Jie %A Chen,Zhenli %A Peng,Qinglong %A Zhao,Liang %A Zhao,Wanqing %A Cong,Shan %A Li,Junlian %A Li,Jiao %A Qian,Qing %A Sun,Haixia %+ , Institute of Medical Information/Medical Library, Chinese Academy of Medical Sciences & Peking Union Medical College, No. 3, Yabao Road, Chaoyang District, Beijing, 100020, China, 86 01052328741, sun.haixia@imicams.ac.cn %K prompt engineering %K named entity recognition %K in-context learning %K large language model %K Chinese medical literature %K measurement-based care %K framework %K prompt %K prompt framework %K scale %K China %K medical literature %K MBC %K LLM %K MedScaleNER %K retrieval %K information retrieval %K dataset %K artificial intelligence %K AI %D 2025 %7 18.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Measurement-based care improves patient outcomes by using standardized scales, but its widespread adoption is hindered by the lack of accessible and structured knowledge, particularly in unstructured Chinese medical literature. Extracting scale-related knowledge entities from these texts is challenging due to limited annotated data. While large language models (LLMs) show promise in named entity recognition (NER), specialized prompting strategies are needed to accurately recognize medical scale-related entities, especially in low-resource settings. Objective: This study aims to develop and evaluate MedScaleNER, a task-oriented prompt framework designed to optimize LLM performance in recognizing medical scale-related entities from Chinese medical literature. Methods: MedScaleNER incorporates demonstration retrieval within in-context learning, chain-of-thought prompting, and self-verification strategies to improve performance. The framework dynamically retrieves optimal examples using a k-nearest neighbors approach and decomposes the NER task into two subtasks: entity type identification and entity labeling. Self-verification ensures the reliability of the final output. A dataset of manually annotated Chinese medical journal papers was constructed, focusing on three key entity types: scale names, measurement concepts, and measurement items. Experiments were conducted by varying the number of examples and the proportion of training data to evaluate performance in low-resource settings. 
Additionally, MedScaleNER’s performance was compared with locally fine-tuned models. Results: The CMedS-NER (Chinese Medical Scale Corpus for Named Entity Recognition) dataset, containing 720 papers with 27,499 manually annotated scale-related knowledge entities, was used for evaluation. Initial experiments identified GLM-4-0520 as the best-performing LLM among six tested models. When applied with GLM-4-0520, MedScaleNER significantly improved NER performance for scale-related entities, achieving a macro F1-score of 59.64% in an exact string match with the full training dataset. The highest performance was achieved with 20-shot demonstrations. Under low-resource scenarios (eg, 1% of the training data), MedScaleNER outperformed all tested locally fine-tuned models. Ablation studies highlighted the importance of demonstration retrieval and self-verification in improving model reliability. Error analysis revealed four main types of mistakes: identification errors, type errors, boundary errors, and missing entities, indicating areas for further improvement. Conclusions: MedScaleNER advances the application of LLMs and prompt engineering for specialized NER tasks in Chinese medical literature. By addressing the challenges of unstructured texts and limited annotated data, MedScaleNER’s adaptability to various biomedical contexts supports more efficient and reliable knowledge extraction, contributing to broader measurement-based care implementation and improved clinical and research outcomes. %M 40100267 %R 10.2196/67033 %U https://www.jmir.org/2025/1/e67033 %U https://doi.org/10.2196/67033 %U http://www.ncbi.nlm.nih.gov/pubmed/40100267 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66344 %T Revealing Patient Dissatisfaction With Health Care Resource Allocation in Multiple Dimensions Using Large Language Models and the International Classification of Diseases 11th Revision: Aspect-Based Sentiment Analysis %A Li,Jiaxuan %A Yang,Yunchu %A Mao,Chao %A Pang,Patrick Cheong-Iao %A Zhu,Quanjing %A Xu,Dejian %A Wang,Yapeng %+ Faculty of Applied Sciences, Macao Polytechnic University, Rua de Luís Gonzaga Gomes, Macao, 999078, Macao, 853 85996886, mail@patrickpang.net %K ICD-11 %K International Classification of Diseases 11th Revision %K disease classification %K patient reviews %K patient satisfaction %K ChatGPT %K Sustainable Development Goals %K chain of thought %K large language model %D 2025 %7 17.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Accurately measuring the health care needs of patients with different diseases remains a public health challenge for health care management worldwide. New computational methods are needed to assess the health care resources required by patients with different diseases and to avoid wasting resources. Objective: This study aimed to assess dissatisfaction with the allocation of health care resources from the perspective of patients with different diseases, which can help optimize resource allocation and better achieve several of the Sustainable Development Goals (SDGs), such as SDG 3 (“Good Health and Well-being”). Our goal was to show the effectiveness and practicality of large language models (LLMs) in assessing the distribution of health care resources. Methods: We used aspect-based sentiment analysis (ABSA), which can divide textual data into several aspects for sentiment analysis.
In this study, we used Chat Generative Pretrained Transformer (ChatGPT) to perform ABSA of patient reviews based on 3 aspects (patient experience, physician skills and efficiency, and infrastructure and administration), in which we embedded chain-of-thought (CoT) prompting, and compared the performance of Chinese and English LLMs on a Chinese dataset. Additionally, we used the International Classification of Diseases 11th Revision (ICD-11) application programming interface (API) to classify the sentiment analysis results into different disease categories. Results: We evaluated the performance of the models by comparing predicted sentiments (either positive or negative) with the labels judged by human evaluators in terms of the aforementioned 3 aspects. The results showed that ChatGPT 3.5 was superior to ChatGPT-4o and Qwen-7b in a combination of stability, expense, and runtime considerations. The weighted total precision of our method based on the ABSA of patient reviews was 0.907, while the average accuracy of all 3 sampling methods was 0.893. Both values suggested that the model was able to achieve our objective. Using our approach, we identified that dissatisfaction is highest for sex-related diseases and lowest for circulatory diseases and that the need for better infrastructure and administration is much higher for blood-related diseases than for other diseases in China. Conclusions: The results show that our LLM-based method can use patient reviews and the ICD-11 classification to assess the health care needs of patients with different diseases, which can assist with rational resource allocation. %M 40096682 %R 10.2196/66344 %U https://www.jmir.org/2025/1/e66344 %U https://doi.org/10.2196/66344 %U http://www.ncbi.nlm.nih.gov/pubmed/40096682 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e57257 %T Assessing Racial and Ethnic Bias in Text Generation by Large Language Models for Health Care–Related Tasks: Cross-Sectional Study %A Hanna,John J %A Wakene,Abdi D %A Johnson,Andrew O %A Lehmann,Christoph U %A Medford,Richard J %+ Information Services, ECU Health, 2100 Stantonsburg Rd, Greenville, NC, 27834, United States, 1 2528474100, john.hanna@ecuhealth.org %K sentiment analysis %K racism %K bias %K artificial intelligence %K reading ease %K word frequency %K large language models %K text generation %K healthcare %K task %K ChatGPT %K cross sectional %K consumer-directed %K human immunodeficiency virus %D 2025 %7 13.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators implemented safeguards against prompts that overtly seek certain biases. Objective: This study aims to investigate a potential racial and ethnic bias among 4 popular LLMs: GPT-3.5-turbo (OpenAI), GPT-4 (OpenAI), Gemini-1.0-pro (Google), and Llama3-70b (Meta) in generating health care consumer–directed text in the absence of overtly biased queries. Methods: In this cross-sectional study, the 4 LLMs were prompted to generate discharge instructions for patients with HIV. Each patient encounter’s deidentified metadata, including race/ethnicity as a variable, was passed in a table format through a prompt 4 times, altering only the race/ethnicity information (African American, Asian, Hispanic White, and non-Hispanic White) each time while keeping all other information constant.
The prompt requested the model to write discharge instructions for each encounter without explicitly mentioning race or ethnicity. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word frequency by race/ethnicity. Results: The only observed statistically significant difference between race/ethnicity groups was found in entity count (GPT-4, df=42, P=.047). However, post hoc chi-square analysis for GPT-4’s entity counts showed no significant pairwise differences among race/ethnicity categories after Bonferroni correction. Conclusions: A total of 4 LLMs were relatively invariant to race/ethnicity in terms of linguistic and readability measures. While our study used proxy linguistic and readability measures to investigate racial and ethnic bias among 4 LLM responses in a health care–related task, there is an urgent need to establish universally accepted standards for measuring bias in LLM-generated responses. Further studies are needed to validate these results and assess their implications. %M 40080818 %R 10.2196/57257 %U https://www.jmir.org/2025/1/e57257 %U https://doi.org/10.2196/57257 %U http://www.ncbi.nlm.nih.gov/pubmed/40080818 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e55277 %T Creation of Scientific Response Documents for Addressing Product Medical Information Inquiries: Mixed Method Approach Using Artificial Intelligence %A Lau,Jerry %A Bisht,Shivani %A Horton,Robert %A Crisan,Annamaria %A Jones,John %A Gantotti,Sandeep %A Hermes-DeSantis,Evelyn %+ phactMI, 5931 NW 1st Place, Gainesville, FL, 32607, United States, 1 2155881585, evelyn@phactmi.org %K AI %K LLM %K GPT %K biopharmaceutical %K medical information %K content generation %K artificial intelligence %K pharmaceutical %K scientific response %K documentation %K information %K clinical data %K strategy %K reference %K feasibility %K development %K machine learning %K large language model %K accuracy %K context %K traceability %K accountability %K survey %K scientific response documentation %K SRD %K benefit %K content generator %K content analysis %K Generative Pre-trained Transformer %D 2025 %7 13.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: Pharmaceutical manufacturers address health care professionals’ information needs through scientific response documents (SRDs), offering evidence-based answers to medication and disease state questions. Medical information departments, staffed by medical experts, develop SRDs that provide concise summaries consisting of relevant background information, search strategies, clinical data, and balanced references. With an escalating demand for SRDs and the increasing complexity of therapies, medical information departments are exploring advanced technologies and artificial intelligence (AI) tools like large language models (LLMs) to streamline content development. While AI and LLMs show promise in generating draft responses, a synergistic approach combining an LLM with traditional machine learning classifiers in a series of human-supervised and -curated steps could help address limitations, including hallucinations. This will ensure accuracy, context, traceability, and accountability in the development of the concise clinical data summaries of an SRD. Objective: This study aims to quantify the challenges of SRD development and develop a framework exploring the feasibility and value addition of integrating AI capabilities in the process of creating concise summaries for an SRD. 
Methods: To measure the challenges in SRD development, a survey was conducted by phactMI, a nonprofit consortium of medical information leaders in the pharmaceutical industry, assessing aspects of SRD creation among its member companies. The survey collected data on the time and tediousness of various activities related to SRD development. Another working group, consisting of medical information professionals and data scientists, used AI to aid SRD authoring, focusing on data extraction and abstraction. They used logistic regression on semantic embedding features to train classification models and transformer-based summarization pipelines to generate concise summaries. Results: Of the 33 companies surveyed, 64% (21/33) opened the survey, and 76% (16/21) of those responded. On average, medical information departments generate 614 new documents and update 1352 documents each year. Respondents considered paraphrasing scientific articles to be the most tedious and time-intensive task. In the project’s second phase, sentence classification models showed the ability to accurately distinguish target categories with receiver operating characteristic scores ranging from 0.67 to 0.85 (all P<.001), allowing for accurate data extraction. For data abstraction, the comparison of the bilingual evaluation understudy (BLEU) score and semantic similarity in the paraphrased texts yielded different results among reviewers, with each preferring different trade-offs between these metrics. Conclusions: This study establishes a framework for integrating LLM and machine learning into SRD development, supported by a pharmaceutical company survey emphasizing the challenges of paraphrasing content. While machine learning models show potential for section identification and content usability assessment in data extraction and abstraction, further optimization and research are essential before full-scale industry implementation. The working group’s insights guide an AI-driven content analysis; address limitations; and advance efficient, precise, and responsive frameworks to assist with pharmaceutical SRD development. %M 40080808 %R 10.2196/55277 %U https://ai.jmir.org/2025/1/e55277 %U https://doi.org/10.2196/55277 %U http://www.ncbi.nlm.nih.gov/pubmed/40080808 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63216 %T Large Language Model–Based Critical Care Big Data Deployment and Extraction: Descriptive Analysis %A Yang,Zhongbao %A Xu,Shan-Shan %A Liu,Xiaozhu %A Xu,Ningyuan %A Chen,Yuqing %A Wang,Shuya %A Miao,Ming-Yue %A Hou,Mengxue %A Liu,Shuai %A Zhou,Yi-Min %A Zhou,Jian-Xin %A Zhang,Linlin %K big data %K critical care–related databases %K database deployment %K large language model %K database extraction %K intensive care unit %K ICU %K GPT %K artificial intelligence %K AI %K LLM %D 2025 %7 12.3.2025 %9 %J JMIR Med Inform %G English %X Background: Publicly accessible critical care–related databases contain enormous clinical data, but their utilization often requires advanced programming skills. The growing complexity of large databases and unstructured data presents challenges for clinicians who need programming or data analysis expertise to utilize these systems directly. Objective: This study aims to simplify critical care–related database deployment and extraction via large language models. Methods: The development of this platform was a 2-step process. 
First, we enabled automated database deployment using Docker container technology, incorporating the web-based analytics interfaces Metabase and Superset. Second, we developed the intensive care unit–generative pretrained transformer (ICU-GPT), a large language model fine-tuned on intensive care unit (ICU) data that integrated LangChain and Microsoft AutoGen. Results: The automated deployment platform was designed with user-friendliness in mind, enabling clinicians to deploy 1 or multiple databases in local, cloud, or remote environments without the need for manual setup. After successfully overcoming GPT’s token limit and supporting multischema data, ICU-GPT could generate Structured Query Language (SQL) queries and extract insights from ICU datasets based on user requests. A front-end user interface was developed for clinicians to achieve code-free SQL generation on the web-based client. Conclusions: By harnessing the power of our automated deployment platform and ICU-GPT model, clinicians are empowered to easily visualize, extract, and arrange critical care–related databases more efficiently and flexibly than with manual methods. Our research could decrease the time and effort spent on complex bioinformatics methods and advance clinical research. %R 10.2196/63216 %U https://medinform.jmir.org/2025/1/e63216 %U https://doi.org/10.2196/63216 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e67696 %T Evaluation of ChatGPT Performance on Emergency Medicine Board Examination Questions: Observational Study %A Pastrak,Mila %A Kajitani,Sten %A Goodings,Anthony James %A Drewek,Austin %A LaFree,Andrew %A Murphy,Adrian %K artificial intelligence %K ChatGPT-4 %K medical education %K emergency medicine %K examination %K examination preparation %D 2025 %7 12.3.2025 %9 %J JMIR AI %G English %X Background: The ever-evolving field of medicine has highlighted the potential for ChatGPT as an assistive platform. However, its use in medical board examination preparation and completion remains unclear. Objective: This study aimed to evaluate the performance of a custom-modified version of ChatGPT-4, tailored with emergency medicine board examination preparatory materials (Anki flashcard deck), compared to its default version and previous iteration (3.5). The goal was to assess the accuracy of ChatGPT-4 answering board-style questions and its suitability as a tool to aid students and trainees in standardized examination preparation. Methods: A comparative analysis was conducted using a random selection of 598 questions from the Rosh In-Training Examination Question Bank. The subjects of the study were three versions of ChatGPT: the Default version, a Custom version, and ChatGPT-3.5. The accuracy, response length, medical discipline subgroups, and underlying causes of error were analyzed. Results: The Custom version did not demonstrate a significant improvement in accuracy over the Default version (P=.61), although both significantly outperformed ChatGPT-3.5 (P<.001). The Default version produced significantly longer responses than the Custom version, with the mean (SD) values being 1371 (444) and 929 (408), respectively (P<.001). Subgroup analysis revealed no significant difference in the performance across different medical subdisciplines between the versions (P>.05 in all cases). Both versions of ChatGPT-4 had similar underlying error types (P>.05 in all cases) and had a 99% predicted probability of passing while ChatGPT-3.5 had an 85% probability.
Conclusions: The findings suggest that while newer versions of ChatGPT exhibit improved performance in emergency medicine board examination preparation, specific enhancement with a comprehensive Anki flashcard deck on the topic does not significantly impact accuracy. The study highlights the potential of ChatGPT-4 as a tool for medical education, capable of providing accurate support across a wide range of topics in emergency medicine in its default form. %R 10.2196/67696 %U https://ai.jmir.org/2025/1/e67696 %U https://doi.org/10.2196/67696 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e64682 %T GPT-3.5 Turbo and GPT-4 Turbo in Title and Abstract Screening for Systematic Reviews %A Oami,Takehiko %A Okada,Yohei %A Nakada,Taka-aki %K large language models %K citation screening %K systematic review %K clinical practice guidelines %K artificial intelligence %K sepsis %K AI %K review %K GPT %K screening %K citations %K critical care %K Japan %K Japanese %K accuracy %K efficiency %K reliability %K LLM %D 2025 %7 12.3.2025 %9 %J JMIR Med Inform %G English %X This study demonstrated that while GPT-4 Turbo had superior specificity when compared to GPT-3.5 Turbo (0.98 vs 0.51), as well as comparable sensitivity (0.85 vs 0.83), GPT-3.5 Turbo processed 100 studies faster (0.9 min vs 1.6 min) in citation screening for systematic reviews, suggesting that GPT-4 Turbo may be more suitable due to its higher specificity and highlighting the potential of large language models in optimizing literature selection. %R 10.2196/64682 %U https://medinform.jmir.org/2025/1/e64682 %U https://doi.org/10.2196/64682 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e59210 %T Leveraging Generative Artificial Intelligence to Improve Motivation and Retrieval in Higher Education Learners %A Monzon,Noahlana %A Hays,Franklin Alan %K educational technology %K retrieval practice %K flipped classroom %K cognitive engagement %K personalized learning %K generative artificial intelligence %K higher education %K university education %K learners %K instructors %K curriculum structure %K learning %K technologies %K innovation %K academic misconduct %K gamification %K self-directed %K socio-economic disparities %K interactive approach %K medical education %K chatGPT %K machine learning %K AI %K large language models %D 2025 %7 11.3.2025 %9 %J JMIR Med Educ %G English %X Generative artificial intelligence (GenAI) presents novel approaches to enhance motivation, curriculum structure and development, and learning and retrieval processes for both learners and instructors. Though a focus for this emerging technology is academic misconduct, we sought to leverage GenAI in curriculum structure to facilitate educational outcomes. For instructors, GenAI offers new opportunities in course design and management while reducing time requirements to evaluate outcomes and personalizing learner feedback. These include innovative instructional designs such as flipped classrooms and gamification, enriching teaching methodologies with focused and interactive approaches, and team-based exercise development among others. For learners, GenAI offers unprecedented self-directed learning opportunities, improved cognitive engagement, and effective retrieval practices, leading to enhanced autonomy, motivation, and knowledge retention. Though empowering, this evolving landscape has integration challenges and ethical considerations, including accuracy, technological evolution, loss of learner’s voice, and socioeconomic disparities. 
Our experience demonstrates that the responsible application of GenAI in educational settings will revolutionize learning practices, making education more accessible and tailored and producing positive motivational outcomes for both learners and instructors. Thus, we argue that leveraging GenAI in educational settings will improve outcomes with implications extending from primary through higher and continuing education paradigms. %R 10.2196/59210 %U https://mededu.jmir.org/2025/1/e59210 %U https://doi.org/10.2196/59210 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67488 %T Accuracy of Large Language Models for Literature Screening in Thoracic Surgery: Diagnostic Study %A Dai,Zhang-Yi %A Wang,Fu-Qiang %A Shen,Cheng %A Ji,Yan-Li %A Li,Zhi-Yang %A Wang,Yun %A Pu,Qiang %+ Department of Thoracic Surgery, West China Hospital of Sichuan University, No.37, Guoxue Alley, Chengdu, 610041, China, 86 18980606738, puqiang100@163.com %K accuracy %K large language models %K meta-analysis %K literature screening %K thoracic surgery %D 2025 %7 11.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Systematic reviews and meta-analyses rely on labor-intensive literature screening. While machine learning offers potential automation, its accuracy remains suboptimal. This raises the question of whether emerging large language models (LLMs) can provide a more accurate and efficient approach. Objective: This paper evaluates the sensitivity, specificity, and summary receiver operating characteristic (SROC) curve of LLM-assisted literature screening. Methods: We conducted a diagnostic study comparing the accuracy of LLM-assisted screening versus manual literature screening across 6 thoracic surgery meta-analyses. Manual screening by 2 investigators served as the reference standard. LLM-assisted screening was performed using ChatGPT-4o (OpenAI) and Claude-3.5 Sonnet (Anthropic), with discrepancies resolved by Gemini-1.5 Pro (Google). In addition, 2 open-source, machine learning–based screening tools, ASReview (Utrecht University) and Abstrackr (Center for Evidence Synthesis in Health, Brown University School of Public Health), were also evaluated. We calculated sensitivity, specificity, and 95% CIs for title and abstract screening, as well as full-text screening, generating pooled estimates and SROC curves. LLM prompts were revised based on a post hoc error analysis. Results: LLM-assisted full-text screening demonstrated high pooled sensitivity (0.87, 95% CI 0.77-0.99) and specificity (0.96, 95% CI 0.91-0.98), with an area under the curve (AUC) of 0.96 (95% CI 0.94-0.97). Title and abstract screening achieved a pooled sensitivity of 0.73 (95% CI 0.57-0.85) and specificity of 0.99 (95% CI 0.97-0.99), with an AUC of 0.97 (95% CI 0.96-0.99). Post hoc revisions improved sensitivity to 0.98 (95% CI 0.74-1.00) while maintaining high specificity (0.98, 95% CI 0.94-0.99). In comparison, the pooled sensitivity and specificity of ASReview tool-assisted screening were 0.58 (95% CI 0.53-0.64) and 0.97 (95% CI 0.91-0.99), respectively, with an AUC of 0.66 (95% CI 0.62-0.70). The pooled sensitivity and specificity of Abstrackr tool-assisted screening were 0.48 (95% CI 0.35-0.62) and 0.96 (95% CI 0.88-0.99), respectively, with an AUC of 0.78 (95% CI 0.74-0.82). A post hoc meta-analysis revealed comparable effect sizes between LLM-assisted and conventional screening.
Conclusions: LLMs hold significant potential for streamlining literature screening in systematic reviews, reducing workload without sacrificing quality. Importantly, LLMs outperformed traditional machine learning-based tools (ASReview and Abstrackr) in both sensitivity and AUC values, suggesting that LLMs offer a more accurate and efficient approach to literature screening. %M 40068152 %R 10.2196/67488 %U https://www.jmir.org/2025/1/e67488 %U https://doi.org/10.2196/67488 %U http://www.ncbi.nlm.nih.gov/pubmed/40068152 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e58855 %T The Reliability and Quality of Videos as Guidance for Gastrointestinal Endoscopy: Cross-Sectional Study %A Liu,Jinpei %A Qiu,Yifan %A Liu,Yilong %A Xu,Wenping %A Ning,Weichen %A Shi,Peimei %A Yuan,Zongli %A Wang,Fang %A Shi,Yihai %+ Department of Gastroenterology, Gongli Hospital of Shanghai Pudong New Area, Pudong New Area 219 Miaopu Road, Shanghai, 200135, China, 86 5885873, syh01206@163.com %K gastrointestinal endoscopy %K YouTube %K patient education %K social media gastrointestinal %K large language model %K LLM %K reliability %K quality %K video %K cross-sectional study %K endoscopy-related videos %K health information %K endoscopy %K gastroscopy %K colonoscopy %D 2025 %7 11.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Gastrointestinal endoscopy represents a useful tool for the diagnosis and treatment of gastrointestinal diseases. Video platforms for spreading endoscopy-related knowledge may help patients understand the pros and cons of endoscopy on the premise of ensuring accuracy. However, videos with misinformation may lead to adverse consequences. Objective: This study aims to evaluate the quality of gastrointestinal endoscopy-related videos on YouTube and to assess whether large language models (LLMs) can help patients obtain information from videos more efficiently. Methods: We collected information from YouTube videos about 3 commonly used gastrointestinal endoscopes (gastroscopy, colonoscopy, and capsule endoscopy) and assessed their quality (rated by the modified DISCERN Tool, mDISCERN), reliability (rated by the Journal of the American Medical Association), and recommendation (rated by the Global Quality Score). We tasked LLM with summarizing the video content and assessed it from 3 perspectives: accuracy, completeness, and readability. Results: A total of 167 videos were included. According to the indicated scoring, the quality, reliability, and recommendation of the 3 gastrointestinal endoscopy-related videos on YouTube were overall unsatisfactory, and the quality of the videos released by patients was particularly poor. Capsule endoscopy yielded a significantly lower Global Quality Score than did gastroscopy and colonoscopy. LLM-based summaries yielded accuracy scores of 4 (IQR 4-5), completeness scores of 4 (IQR 4-5), and readability scores of 2 (IQR 1-2). Conclusions: The quality of gastrointestinal endoscope-related videos currently on YouTube is poor. Moreover, additional regulatory and improvement strategies are needed in the future. LLM may be helpful in generalizing video-related information, but there is still room for improvement in its ability. 
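As an illustration of the screening-accuracy metrics reported in the literature screening studies above, the following minimal Python sketch derives sensitivity, specificity, and a binary-decision AUC from paired include/exclude calls, with manual dual review as the reference standard. The counts are hypothetical and are not taken from any of the cited studies.

# Hypothetical screening decisions; illustrative only, not data from the cited studies.
from sklearn.metrics import confusion_matrix, roc_auc_score

reference = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]  # manual dual review: 1 = include, 0 = exclude
llm_call  = [1, 1, 0, 0, 0, 0, 1, 0, 1, 0]  # LLM-assisted screening decision

tn, fp, fn, tp = confusion_matrix(reference, llm_call).ravel()
sensitivity = tp / (tp + fn)  # proportion of truly eligible studies retained
specificity = tn / (tn + fp)  # proportion of ineligible studies correctly excluded
print(f"sensitivity={sensitivity:.2f}, specificity={specificity:.2f}")

# With binary calls, the AUC reduces to (sensitivity + specificity) / 2.
print(f"AUC={roc_auc_score(reference, llm_call):.2f}")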
%M 40068165 %R 10.2196/58855 %U https://www.jmir.org/2025/1/e58855 %U https://doi.org/10.2196/58855 %U http://www.ncbi.nlm.nih.gov/pubmed/40068165 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e66207 %T Medical Misinformation in AI-Assisted Self-Diagnosis: Development of a Method (EvalPrompt) for Analyzing Large Language Models %A Zada,Troy %A Tam,Natalie %A Barnard,Francois %A Van Sittert,Marlize %A Bhat,Venkat %A Rambhatla,Sirisha %K ChatGPT %K health care %K LLM %K misinformation %K self-diagnosis %K large language model %D 2025 %7 10.3.2025 %9 %J JMIR Form Res %G English %X Background: Rapid integration of large language models (LLMs) in health care is sparking global discussion about their potential to revolutionize health care quality and accessibility. At a time when improving health care quality and access remains a critical concern for countries worldwide, the ability of these models to pass medical examinations is often cited as a reason to use them for medical training and diagnosis. However, the impact of their inevitable use as a self-diagnostic tool and their role in spreading health care misinformation has not been evaluated. Objective: This study aims to assess the effectiveness of LLMs, particularly ChatGPT, from the perspective of an individual self-diagnosing to better understand the clarity, correctness, and robustness of the models. Methods: We propose the comprehensive testing methodology evaluation of LLM prompts (EvalPrompt). This evaluation methodology uses multiple-choice medical licensing examination questions to evaluate LLM responses. Experiment 1 prompts ChatGPT with open-ended questions to mimic real-world self-diagnosis use cases, and experiment 2 performs sentence dropout on the correct responses from experiment 1 to mimic self-diagnosis with missing information. Humans then assess the responses returned by ChatGPT for both experiments to evaluate the clarity, correctness, and robustness of ChatGPT. Results: In experiment 1, we found that ChatGPT-4.0 was deemed correct for 31% (29/94) of the questions by both nonexperts and experts, with only 34% (32/94) agreement between the 2 groups. Similarly, in experiment 2, which assessed robustness, 61% (92/152) of the responses continued to be categorized as correct by all assessors. As a result, in comparison to a passing threshold of 60%, ChatGPT-4.0 is considered incorrect and unclear, though robust. This indicates that sole reliance on ChatGPT-4.0 for self-diagnosis could increase the risk of individuals being misinformed. Conclusions: The results highlight the modest capabilities of LLMs, as their responses are often unclear and inaccurate. Any medical advice provided by LLMs should be cautiously approached due to the significant risk of misinformation. However, evidence suggests that LLMs are steadily improving and could potentially play a role in health care systems in the future. To address the issue of medical misinformation, there is a pressing need for the development of a comprehensive self-diagnosis dataset. This dataset could enhance the reliability of LLMs in medical applications by featuring more realistic prompt styles with minimal information across a broader range of medical fields. 
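The EvalPrompt methodology above perturbs correct responses by dropping sentences to mimic self-diagnosis with missing information. The following minimal Python sketch shows one way such a sentence-dropout perturbation could be implemented; the helper name, the period-based sentence splitting, and the example vignette are assumptions for illustration and do not reproduce the study's code.

# Illustrative sentence-dropout perturbation; not the cited study's implementation.
import random

def sentence_dropout(text: str, drop_fraction: float = 0.25, seed: int = 0) -> str:
    """Remove a random fraction of sentences to simulate missing information."""
    sentences = [s.strip() for s in text.split(".") if s.strip()]
    rng = random.Random(seed)
    n_keep = max(1, round(len(sentences) * (1 - drop_fraction)))
    kept = sorted(rng.sample(range(len(sentences)), n_keep))
    return ". ".join(sentences[i] for i in kept) + "."

vignette = ("A 45-year-old presents with chest pain. The pain started an hour ago. "
            "It radiates to the left arm. There is associated sweating.")
print(sentence_dropout(vignette))  # perturbed prompt to be resubmitted to the model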
%R 10.2196/66207 %U https://formative.jmir.org/2025/1/e66207 %U https://doi.org/10.2196/66207 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e65651 %T Assessment of the Efficiency of a ChatGPT-Based Tool, MyGenAssist, in an Industry Pharmacovigilance Department for Case Documentation: Cross-Over Study %A Benaïche,Alexandre %A Billaut-Laden,Ingrid %A Randriamihaja,Herivelo %A Bertocchio,Jean-Philippe %+ Bayer Healthcare SAS France, 1 Rue Claude Bernard, Lille, 59000, France, 33 320445962, benaichealexandre@gmail.com %K MyGenAssist %K large language model %K artificial intelligence %K ChatGPT %K pharmacovigilance %K efficiency %D 2025 %7 10.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: At the end of 2023, Bayer AG launched its own internal large language model (LLM), MyGenAssist, based on ChatGPT technology to overcome data privacy concerns. It may offer the possibility to reduce the burden of repetitive and recurrent tasks and to save time that could then be dedicated to activities with higher added value. Although there is a current worldwide reflection on whether artificial intelligence should be integrated into pharmacovigilance, medical literature does not provide enough data concerning LLMs and their daily applications in such a setting. Here, we studied how this tool could improve the case documentation process, which is a duty for authorization holders as per European and French good vigilance practices. Objective: The aim of the study is to test whether the use of an LLM could improve the pharmacovigilance documentation process. Methods: MyGenAssist was trained to draft templates for case documentation letters meant to be sent to the reporters. The information provided within the template changes depending on the case; these data come from a table sent to the LLM. We then measured the time spent on each case for a period of 4 months (2 months before using the tool and 2 months after its implementation). A multiple linear regression model was created with the time spent on each case as the explained variable, and all parameters that could influence this time were included as explanatory variables (use of MyGenAssist, type of recipient, number of questions, and user). To test whether the use of this tool impacts the process, we compared the recipients’ response rates with and without the use of MyGenAssist. Results: MyGenAssist yielded an average time saving of 23.3% (95% CI 13.8%-32.8%) per case (P<.001; adjusted R2=0.286), which could represent an average of 10.7 (SD 3.6) working days saved each year. The answer rate was not modified by the use of MyGenAssist (20/48, 42% vs 27/74, 36%; P=.57) whether the recipient was a physician or a patient. No significant difference was found regarding the time spent by the recipient to answer (mean 2.20, SD 3.27 days vs mean 2.65, SD 3.30 days after the last attempt of contact; P=.64). The implementation of MyGenAssist for this activity only required a 2-hour training session for the pharmacovigilance team. Conclusions: Our study is the first to show that a ChatGPT-based tool can improve the efficiency of a good practice activity without needing a long training session for the affected workforce. These first encouraging results could be an incentive for the implementation of LLMs in other processes.
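The cross-over study above models time spent per case with a multiple linear regression in which tool use, recipient type, number of questions, and user are explanatory variables. A minimal Python sketch of such a model is shown below; the column names and the simulated data are hypothetical, and the authors' actual dataset and statistical software are not described here.

# Illustrative multiple linear regression on simulated data; hypothetical column names.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200
cases = pd.DataFrame({
    "minutes_spent": rng.normal(30, 8, n),
    "used_llm_tool": rng.integers(0, 2, n),               # 1 = template drafted with the LLM
    "recipient": rng.choice(["physician", "patient"], n),
    "n_questions": rng.integers(1, 6, n),
    "user": rng.choice(["A", "B", "C"], n),
})
cases.loc[cases.used_llm_tool == 1, "minutes_spent"] -= 6  # inject an effect for illustration

model = smf.ols(
    "minutes_spent ~ used_llm_tool + C(recipient) + n_questions + C(user)", data=cases
).fit()
print(model.summary().tables[1])  # coefficient on used_llm_tool estimates the adjusted time saving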
%M 40063946 %R 10.2196/65651 %U https://www.jmir.org/2025/1/e65651 %U https://doi.org/10.2196/65651 %U http://www.ncbi.nlm.nih.gov/pubmed/40063946 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e59792 %T Generative AI Models in Time-Varying Biomedical Data: Scoping Review %A He,Rosemary %A Sarwal,Varuni %A Qiu,Xinru %A Zhuang,Yongwen %A Zhang,Le %A Liu,Yue %A Chiang,Jeffrey %+ Department of Neurosurgery, David Geffen School of Medicine, University of California, Los Angeles, 300 Stein Plaza, Suite 560, Los Angeles, CA, 90095, United States, 1 310 825 5111, njchiang@g.ucla.edu %K generative artificial intelligence %K artificial intelligence %K time series %K electronic health records %K electronic medical records %K systematic reviews %K disease trajectory %K machine learning %K algorithms %K forecasting %D 2025 %7 10.3.2025 %9 Review %J J Med Internet Res %G English %X Background: Trajectory modeling is a long-standing challenge in the application of computational methods to health care. In the age of big data, traditional statistical and machine learning methods do not achieve satisfactory results as they often fail to capture the complex underlying distributions of multimodal health data and long-term dependencies throughout medical histories. Recent advances in generative artificial intelligence (AI) have provided powerful tools to represent complex distributions and patterns with minimal underlying assumptions, with major impact in fields such as finance and environmental sciences, prompting researchers to apply these methods for disease modeling in health care. Objective: While AI methods have proven powerful, their application in clinical practice remains limited due to their highly complex nature. The proliferation of AI algorithms also poses a significant challenge for nondevelopers to track and incorporate these advances into clinical research and application. In this paper, we introduce basic concepts in generative AI and discuss current algorithms and how they can be applied to health care for practitioners with little background in computer science. Methods: We surveyed peer-reviewed papers on generative AI models with specific applications to time-series health data. Our search included single- and multimodal generative AI models that operated over structured and unstructured data, physiological waveforms, medical imaging, and multi-omics data. We introduce current generative AI methods, review their applications, and discuss their limitations and future directions in each data modality. Results: We followed the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines and reviewed 155 articles on generative AI applications to time-series health care data across modalities. Furthermore, we offer a systematic framework for clinicians to easily identify suitable AI methods for their data and task at hand. Conclusions: We reviewed and critiqued existing applications of generative AI to time-series health data with the aim of bridging the gap between computational methods and clinical application. We also identified the shortcomings of existing approaches and highlighted recent advances in generative AI that represent promising directions for health care modeling. 
%M 40063929 %R 10.2196/59792 %U https://www.jmir.org/2025/1/e59792 %U https://doi.org/10.2196/59792 %U http://www.ncbi.nlm.nih.gov/pubmed/40063929 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e60391 %T GPT-4 as a Clinical Decision Support Tool in Ischemic Stroke Management: Evaluation Study %A Shmilovitch,Amit Haim %A Katson,Mark %A Cohen-Shelly,Michal %A Peretz,Shlomi %A Aran,Dvir %A Shelly,Shahar %+ Department of Neurology, Rambam Medical Center, HaAliya HaShniya Street 8, PO Box 9602, Haifa, 3109601, Israel, 972 543541995, s_shelly@rmc.gov.il %K GPT-4 %K ischemic stroke %K clinical decision support %K artificial intelligence %K neurology %D 2025 %7 7.3.2025 %9 Original Paper %J JMIR AI %G English %X Background: Cerebrovascular diseases are the second most common cause of death worldwide and one of the major causes of disability burden. Advancements in artificial intelligence have the potential to revolutionize health care delivery, particularly in critical decision-making scenarios such as ischemic stroke management. Objective: This study aims to evaluate the effectiveness of GPT-4 in providing clinical support for emergency department neurologists by comparing its recommendations with expert opinions and real-world outcomes in acute ischemic stroke management. Methods: A cohort of 100 patients with acute stroke symptoms was retrospectively reviewed. Data used for decision-making included patients’ history, clinical evaluation, imaging study results, and other relevant details. Each case was independently presented to GPT-4, which provided scaled recommendations (1-7) regarding the appropriateness of treatment, the use of tissue plasminogen activator, and the need for endovascular thrombectomy. Additionally, GPT-4 estimated the 90-day mortality probability for each patient and elucidated its reasoning for each recommendation. The recommendations were then compared with a stroke specialist’s opinion and actual treatment decisions. Results: In our cohort of 100 patients, treatment recommendations by GPT-4 showed strong agreement with expert opinion (area under the curve [AUC] 0.85, 95% CI 0.77-0.93) and real-world treatment decisions (AUC 0.80, 95% CI 0.69-0.91). GPT-4 showed near-perfect agreement with real-world decisions in recommending endovascular thrombectomy (AUC 0.94, 95% CI 0.89-0.98) and strong agreement for tissue plasminogen activator treatment (AUC 0.77, 95% CI 0.68-0.86). Notably, in some cases, GPT-4 recommended more aggressive treatment than human experts, with 11 instances where GPT-4 suggested tissue plasminogen activator use against expert opinion. For mortality prediction, GPT-4 accurately identified 10 (77%) out of 13 deaths within its top 25 high-risk predictions (AUC 0.89, 95% CI 0.8077-0.9739; hazard ratio 6.98, 95% CI 2.88-16.9; P<.001), outperforming supervised machine learning models such as PRACTICE (AUC 0.70; log-rank P=.02) and PREMISE (AUC 0.77; P=.07). Conclusions: This study demonstrates the potential of GPT-4 as a viable clinical decision-support tool in the management of acute stroke. Its ability to provide explainable recommendations without requiring structured data input aligns well with the routine workflows of treating physicians. However, the tendency toward more aggressive treatment recommendations highlights the importance of human oversight in clinical decision-making. Future studies should focus on prospective validations and exploring the safe integration of such artificial intelligence tools into clinical practice. 
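The stroke decision-support evaluation above compares GPT-4's 1-7 appropriateness ratings with binary expert and real-world treatment decisions using the area under the ROC curve. A minimal Python sketch of that type of agreement analysis is shown below; the ratings are hypothetical and are not the study's data.

# Illustrative agreement analysis; hypothetical ratings, not data from the cited study.
from sklearn.metrics import roc_auc_score, roc_curve

expert_treat = [1, 1, 0, 1, 0, 0, 1, 0, 1, 0]  # 1 = expert recommended treatment
gpt_rating   = [6, 7, 2, 3, 3, 1, 7, 4, 6, 2]  # model's 1-7 appropriateness score

print(f"AUC vs expert opinion: {roc_auc_score(expert_treat, gpt_rating):.2f}")

# Operating points along the ROC curve show the sensitivity/specificity trade-off
# of dichotomizing the scaled recommendation at different thresholds.
fpr, tpr, thresholds = roc_curve(expert_treat, gpt_rating)
for f, t, thr in zip(fpr, tpr, thresholds):
    print(f"score >= {thr}: sensitivity={t:.2f}, specificity={1 - f:.2f}")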
%M 40053715 %R 10.2196/60391 %U https://ai.jmir.org/2025/1/e60391 %U https://doi.org/10.2196/60391 %U http://www.ncbi.nlm.nih.gov/pubmed/40053715 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 12 %N %P e65785 %T Comparison of an AI Chatbot With a Nurse Hotline in Reducing Anxiety and Depression Levels in the General Population: Pilot Randomized Controlled Trial %A Chen,Chen %A Lam,Kok Tai %A Yip,Ka Man %A So,Hung Kwan %A Lum,Terry Yat Sang %A Wong,Ian Chi Kei %A Yam,Jason C %A Chui,Celine Sze Ling %A Ip,Patrick %K AI chatbot %K anxiety %K depression %K effectiveness %K artificial intelligence %D 2025 %7 6.3.2025 %9 %J JMIR Hum Factors %G English %X Background: Artificial intelligence (AI) chatbots have been customized to deliver on-demand support for people with mental health problems. However, the effectiveness of AI chatbots in tackling mental health problems among the general public in Hong Kong remains unclear. Objective: This study aimed to develop a local AI chatbot and compare the effectiveness of the AI chatbot with a conventional nurse hotline in reducing the level of anxiety and depression among individuals in Hong Kong. Methods: This study was a pilot randomized controlled trial conducted from October 2022 to March 2023, involving 124 participants allocated randomly (1:1 ratio) into the AI chatbot and nurse hotline groups. Among these, 62 participants in the AI chatbot group and 41 in the nurse hotline group completed both the pre- and postquestionnaires, including the GAD-7 (Generalized Anxiety Disorder Scale-7), PHQ-9 (Patient Health Questionnaire-9), and satisfaction questionnaire. Comparisons were conducted using independent and paired sample t tests (2-tailed) and the χ2 test to analyze changes in anxiety and depression levels. Results: Compared to the mean baseline score of 5.13 (SD 4.623), the mean postintervention depression score in the chatbot group was 3.68 (SD 4.397), which was significantly lower (P=.008). Similarly, a reduced anxiety score was observed after the chatbot intervention (pre vs post: mean 4.74, SD 4.742 vs mean 3.4, SD 3.748; P=.005). No significant differences were found in the pre-post scores for either depression (P=.38) or anxiety (P=.19). No statistically significant difference was observed in service satisfaction between the two platforms (P=.32). Conclusions: The AI chatbot was comparable to the traditional nurse hotline in alleviating participants’ anxiety and depression after responding to inquiries. Moreover, the AI chatbot has shown potential in alleviating short-term anxiety and depression compared to the nurse hotline. While the AI chatbot presents a promising solution for offering accessible strategies to the public, more extensive randomized controlled studies are necessary to further validate its effectiveness.
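The pilot trial above analyzes within-group change in PHQ-9 and GAD-7 scores with paired t tests and compares groups with independent t tests and the χ2 test. A minimal Python sketch of the paired and independent comparisons is shown below, using simulated scores; the values are hypothetical and not trial data.

# Illustrative pre-post analysis on simulated scores; not data from the cited trial.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
pre_phq9 = rng.integers(0, 15, 62)                          # baseline PHQ-9, chatbot arm
post_phq9 = np.clip(pre_phq9 - rng.integers(0, 4, 62), 0, 27)

t_paired, p_paired = stats.ttest_rel(pre_phq9, post_phq9)   # within-group change
print(f"paired t={t_paired:.2f}, P={p_paired:.3f}")

# A between-group comparison of change scores would use an independent t test.
change_chatbot = pre_phq9 - post_phq9
change_hotline = rng.integers(0, 4, 41)                     # simulated comparison arm
t_ind, p_ind = stats.ttest_ind(change_chatbot, change_hotline)
print(f"independent t={t_ind:.2f}, P={p_ind:.3f}")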
Trial Registration: ClinicalTrials.gov NCT06621134; https://clinicaltrials.gov/study/NCT06621134 %R 10.2196/65785 %U https://humanfactors.jmir.org/2025/1/e65785 %U https://doi.org/10.2196/65785 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66032 %T Health Communication on the Internet: Promoting Public Health and Exploring Disparities in the Generative AI Era %A Uddin,Jamal %A Feng,Cheng %A Xu,Junfang %+ Department of Pharmacy, Second Affiliated Hospital, School of Public health, Zhejiang University School of Medicine, 866 Yuhangtang road, Xihu district, Hangzhou, 310058, China, 86 18801230482 ext 000, junfangxuhappy1987@163.com %K internet %K generative AI %K artificial intelligence %K ChatGPT %K health communication %K health promotion %K health disparity %K health %K communication %K internet %K AI %K generative %K tool %K genAI %K gratification theory %K gratification %K public health %K inequity %K disparity %D 2025 %7 6.3.2025 %9 Viewpoint %J J Med Internet Res %G English %X Health communication and promotion on the internet have evolved over time, driven by the development of new technologies, including generative artificial intelligence (GenAI). These technological tools offer new opportunities for both the public and professionals. However, these advancements also pose risks of exacerbating health disparities. Limited research has focused on combining these health communication mediums, particularly those enabled by new technologies like GenAI, and their applications for health promotion and health disparities. Therefore, this viewpoint, adopting a conceptual approach, provides an updated overview of health communication mediums and their role in understanding health promotion and disparities in the GenAI era. Additionally, health promotion and health disparities associated with GenAI are briefly discussed through the lens of the Technology Acceptance Model 2, the uses and gratifications theory, and the knowledge gap hypothesis. This viewpoint discusses the limitations and barriers of previous internet-based communication mediums regarding real-time responses, personalized advice, and follow-up inquiries, highlighting the potential of new technology for public health promotion. It also discusses the health disparities caused by the limitations of GenAI, such as individuals’ inability to evaluate information, restricted access to services, and the lack of skill development. Overall, this study lays the groundwork for future research on how GenAI could be leveraged for public health promotion and how its challenges and barriers may exacerbate health inequities. It underscores the need for more empirical studies, as well as the importance of enhancing digital literacy and increasing access to technology for socially disadvantaged populations. 
%M 40053755 %R 10.2196/66032 %U https://www.jmir.org/2025/1/e66032 %U https://doi.org/10.2196/66032 %U http://www.ncbi.nlm.nih.gov/pubmed/40053755 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e65108 %T ChatGPT’s Performance on Portuguese Medical Examination Questions: Comparative Analysis of ChatGPT-3.5 Turbo and ChatGPT-4o Mini %A Prazeres,Filipe %K ChatGPT-3.5 Turbo %K ChatGPT-4o mini %K medical examination %K European Portuguese %K AI performance evaluation %K Portuguese %K evaluation %K medical examination questions %K examination question %K chatbot %K ChatGPT %K model %K artificial intelligence %K AI %K GPT %K LLM %K NLP %K natural language processing %K machine learning %K large language model %D 2025 %7 5.3.2025 %9 %J JMIR Med Educ %G English %X Background: Advancements in ChatGPT are transforming medical education by providing new tools for assessment and learning, potentially enhancing evaluations for doctors and improving instructional effectiveness. Objective: This study evaluates the performance and consistency of ChatGPT-3.5 Turbo and ChatGPT-4o mini in solving European Portuguese medical examination questions (2023 National Examination for Access to Specialized Training; Prova Nacional de Acesso à Formação Especializada [PNA]) and compares their performance to human candidates. Methods: ChatGPT-3.5 Turbo was tested on the first part of the examination (74 questions) on July 18, 2024, and ChatGPT-4o mini on the second part (74 questions) on July 19, 2024. Each model generated an answer using its natural language processing capabilities. To test consistency, each model was asked, “Are you sure?” after providing an answer. Differences between the first and second responses of each model were analyzed using the McNemar test with continuity correction. A single-parameter t test compared the models’ performance to human candidates. Frequencies and percentages were used for categorical variables, and means and CIs for numerical variables. Statistical significance was set at P<.05. Results: ChatGPT-4o mini achieved an accuracy rate of 65% (48/74) on the 2023 PNA examination, surpassing ChatGPT-3.5 Turbo. ChatGPT-4o mini outperformed medical candidates, while ChatGPT-3.5 Turbo had a more moderate performance. Conclusions: This study highlights the advancements and potential of ChatGPT models in medical education, emphasizing the need for careful implementation with teacher oversight and further research. %R 10.2196/65108 %U https://mededu.jmir.org/2025/1/e65108 %U https://doi.org/10.2196/65108 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67891 %T Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study %A McBain,Ryan K %A Cantor,Jonathan H %A Zhang,Li Ang %A Baker,Olesya %A Zhang,Fang %A Halbisen,Alyssa %A Kofner,Aaron %A Breslau,Joshua %A Stein,Bradley %A Mehrotra,Ateev %A Yu,Hao %+ RAND, 1200 S Hayes St, Arlington, VA, United States, 1 5088433901, rmcbain@rand.org %K depression %K suicide %K mental health %K large language model %K chatbot %K digital health %K Suicidal Ideation Response Inventory %K ChatGPT %K suicidologist %K artificial intelligence %D 2025 %7 5.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support. 
Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation. Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, followed by two clinician responses. Clinician responses were scored from –3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score>1.96 or <–1.96; P<0.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies. Results: All 3 LLMs rated responses as more appropriate than ratings provided by expert suicidologists. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT responses were outliers when compared to expert suicidologists. Similarly, 11% (5 of 48) of Claude responses were outliers compared to expert suicidologists. Additionally, 36% (17 of 48) of Gemini responses were outliers compared to expert suicidologists. ChatGPT produced a final SIRI-2 score of 45.7, roughly equivalent to master’s level counselors in prior studies. Claude produced an SIRI-2 score of 36.7, exceeding prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to untrained K-12 school staff. Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed equivalent to or exceeded the performance of mental health professionals. %M 40053817 %R 10.2196/67891 %U https://www.jmir.org/2025/1/e67891 %U https://doi.org/10.2196/67891 %U http://www.ncbi.nlm.nih.gov/pubmed/40053817 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e64364 %T Retrieval Augmented Therapy Suggestion for Molecular Tumor Boards: Algorithmic Development and Validation Study %A Berman,Eliza %A Sundberg Malek,Holly %A Bitzer,Michael %A Malek,Nisar %A Eickhoff,Carsten %+ Center for Digital Health, University Hospital Tuebingen, Schaffhausenstrasse 77, Tuebingen, 72072, Germany, 49 70712984350, eliza_berman@alumni.brown.edu %K large language models %K retrieval augmented generation %K LLaMA %K precision oncology %K molecular tumor board %K molecular tumor %K LLMs %K augmented therapy %K MTB %K oncology %K tumor %K clinical trials %K patient care %K treatment %K evidence-based %K accessibility to care %D 2025 %7 5.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Molecular tumor boards (MTBs) require intensive manual investigation to generate optimal treatment recommendations for patients. 
Large language models (LLMs) can catalyze MTB recommendations, decrease human error, improve accessibility to care, and enhance the efficiency of precision oncology. Objective: In this study, we aimed to investigate the efficacy of LLM-generated treatments for MTB patients. We specifically investigated the LLMs’ ability to generate evidence-based treatment recommendations using PubMed references. Methods: We built a retrieval augmented generation pipeline using PubMed data. We prompted the resulting LLM to generate treatment recommendations with PubMed references using a test set of patients from an MTB conference at a large comprehensive cancer center at a tertiary care institution. Members of the MTB manually assessed the relevancy and correctness of the generated responses. Results: A total of 75% of the referenced articles were properly cited from PubMed, while 17% of the referenced articles were hallucinations, and the remainder were not properly cited from PubMed. Clinician-generated LLM queries achieved higher accuracy through clinician evaluation than automated queries, with clinicians labeling 25% of LLM responses as equal to their recommendations and 37.5% as alternative plausible treatments. Conclusions: This study demonstrates how retrieval augmented generation–enhanced LLMs can be a powerful tool in accelerating MTB conferences, as LLMs are sometimes capable of achieving clinician-equal treatment recommendations. However, further investigation is required to achieve stable results with zero hallucinations. LLMs signify a scalable solution to the time-intensive process of MTB investigations. Nonetheless, LLM performance demonstrates that they must be used with heavy clinician supervision and cannot yet fully automate the MTB pipeline. %M 40053768 %R 10.2196/64364 %U https://www.jmir.org/2025/1/e64364 %U https://doi.org/10.2196/64364 %U http://www.ncbi.nlm.nih.gov/pubmed/40053768 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63631 %T Large Language Models’ Accuracy in Emulating Human Experts’ Evaluation of Public Sentiments about Heated Tobacco Products on Social Media: Evaluation Study %A Kim,Kwanho %A Kim,Soojong %+ Department of Communication, University of California Davis, 1 Shields Ave, Kerr Hall #361, Davis, CA, 95616, United States, 1 530 752 0966, sjokim@ucdavis.edu %K heated tobacco products %K artificial intelligence %K large language models %K social media %K sentiment analysis %K ChatGPT %K generative pre-trained transformer %K GPT %K LLM %K NLP %K natural language processing %K machine learning %K language model %K sentiment %K evaluation %K social media %K tobacco %K alternative %K prevention %K nicotine %K OpenAI %D 2025 %7 4.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Sentiment analysis of alternative tobacco products discussed on social media is crucial in tobacco control research. Large language models (LLMs) are artificial intelligence models that were trained on extensive text data to emulate the linguistic patterns of humans. LLMs may hold the potential to streamline the time-consuming and labor-intensive process of human sentiment analysis. Objective: This study aimed to examine the accuracy of LLMs in replicating human sentiment evaluation of social media messages relevant to heated tobacco products (HTPs). Methods: GPT-3.5 and GPT-4 Turbo (OpenAI) were used to classify 500 Facebook (Meta Platforms) and 500 Twitter (subsequently rebranded X) messages. 
Each set consisted of 200 human-labeled anti-HTP, 200 pro-HTP, and 100 neutral messages. The models evaluated each message up to 20 times to generate multiple response instances reporting their classification decisions. The majority label across these responses was assigned as the model’s decision for the message. The models’ classification decisions were then compared with those of human evaluators. Results: GPT-3.5 accurately replicated human sentiment evaluation in 61.2% of Facebook messages and 57% of Twitter messages. GPT-4 Turbo demonstrated higher accuracies overall, with 81.7% for Facebook messages and 77% for Twitter messages. GPT-4 Turbo’s accuracy with 3 response instances reached 99% of the accuracy achieved with 20 response instances. GPT-4 Turbo’s accuracy was higher for human-labeled anti- and pro-HTP messages compared with neutral messages. Most of the GPT-3.5 misclassifications occurred when anti- or pro-HTP messages were incorrectly classified as neutral or irrelevant by the model, whereas GPT-4 Turbo showed improvements across all sentiment categories and reduced misclassifications, especially in messages incorrectly categorized as irrelevant. Conclusions: LLMs can be used to analyze sentiment in social media messages about HTPs. Results from GPT-4 Turbo suggest that accuracy can reach approximately 80% compared with the results of human experts, even with a small number of labeling decisions generated by the model. A potential risk of using LLMs is the misrepresentation of the overall sentiment due to the differences in accuracy across sentiment categories. Although this issue could be reduced with the newer language model, future efforts should explore the mechanisms underlying the discrepancies and how to address them systematically. %M 40053746 %R 10.2196/63631 %U https://www.jmir.org/2025/1/e63631 %U https://doi.org/10.2196/63631 %U http://www.ncbi.nlm.nih.gov/pubmed/40053746 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e55341 %T Enhancing Doctor-Patient Shared Decision-Making: Design of a Novel Collaborative Decision Description Language %A Guo,XiaoRui %A Xiao,Liang %A Liu,Xinyu %A Chen,Jianxia %A Tong,Zefang %A Liu,Ziji %+ School of Computer Science, Hubei University of Technology, 28 Nanli Road, Hongshan District, Hubei Province, Wuhan, 430068, China, 86 18062500600, lx@mail.hbut.edu.cn %K shared decision-making %K speech acts %K agent %K argumentation %K interaction protocol %D 2025 %7 4.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Effective shared decision-making between patients and physicians is crucial for enhancing health care quality and reducing medical errors. The literature shows that the absence of effective methods to facilitate shared decision-making can result in poor patient engagement and unfavorable decision outcomes. Objective: In this paper, we propose a Collaborative Decision Description Language (CoDeL) to model shared decision-making between patients and physicians, offering a theoretical foundation for studying various shared decision scenarios. Methods: CoDeL is based on an extension of the interaction protocol language of Lightweight Social Calculus. The language utilizes speech acts to represent the attitudes of shared decision-makers toward decision propositions, as well as their semantic relationships within dialogues. It supports interactive argumentation among decision makers by embedding clinical evidence into each segment of decision protocols. 
Furthermore, CoDeL enables personalized decision-making, allowing for the demonstration of characteristics such as persistence, critical thinking, and openness. Results: The feasibility of the approach is demonstrated through a case study of shared decision-making in the disease domain of atrial fibrillation. Our experimental results show that integrating the proposed language with GPT can further enhance its capabilities in interactive decision-making, improving interpretability. Conclusions: The proposed novel CoDeL can enhance doctor-patient shared decision-making in a rational, personalized, and interpretable manner. %M 40053763 %R 10.2196/55341 %U https://www.jmir.org/2025/1/e55341 %U https://doi.org/10.2196/55341 %U http://www.ncbi.nlm.nih.gov/pubmed/40053763 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63312 %T Virtual Patient Simulations Using Social Robotics Combined With Large Language Models for Clinical Reasoning Training in Medical Education: Mixed Methods Study %A Borg,Alexander %A Georg,Carina %A Jobs,Benjamin %A Huss,Viking %A Waldenlind,Kristin %A Ruiz,Mini %A Edelbring,Samuel %A Skantze,Gabriel %A Parodis,Ioannis %+ Division of Rheumatology, Department of Medicine Solna, Karolinska Institutet, Karolinska University Hospital, and Center for Molecular Medicine (CMM), D2:01 Rheumatology Karolinska University Hospital Solna, Stockholm, SE-171 76, Sweden, 46 722321322, ioannis.parodis@ki.se %K virtual patients %K clinical reasoning %K large language models %K social robotics %K medical education %K sustainable learning %K medical students %D 2025 %7 3.3.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Virtual patients (VPs) are computer-based simulations of clinical scenarios used in health professions education to address various learning outcomes, including clinical reasoning (CR). CR is a crucial skill for health care practitioners, and its inadequacy can compromise patient safety. Recent advancements in large language models (LLMs) and social robots have introduced new possibilities for enhancing VP interactivity and realism. However, their application in VP simulations has been limited, and no studies have investigated the effectiveness of combining LLMs with social robots for CR training. Objective: The aim of the study is to explore the potential added value of a social robotic VP platform combined with an LLM compared to a conventional computer-based VP modality for CR training of medical students. Methods: A Swedish explorative proof-of-concept study was conducted between May and July 2023, combining quantitative and qualitative methodology. In total, 15 medical students from Karolinska Institutet and an international exchange program completed a VP case in a social robotic platform and a computer-based semilinear platform. Students’ self-perceived VP experience focusing on CR training was assessed using a previously developed index, and paired 2-tailed t test was used to compare mean scores (scales from 1 to 5) between the platforms. Moreover, in-depth interviews were conducted with 8 medical students. Results: The social robotic platform was perceived as more authentic (mean 4.5, SD 0.7 vs mean 3.9, SD 0.5; odds ratio [OR] 2.9, 95% CI 0.0-1.0; P=.04) and provided a beneficial overall learning effect (mean 4.4, SD 0.6 versus mean 4.1, SD 0.6; OR 3.7, 95% CI 0.1-0.5; P=.01) compared with the computer-based platform. 
Qualitative analysis revealed 4 themes, wherein students experienced the social robot as superior to the computer-based platform in training CR, communication, and emotional skills. Limitations related to technical and user-related aspects were identified, and suggestions for improvements included enhanced facial expressions and VP cases simulating multiple personalities. Conclusions: A social robotic platform enhanced by an LLM may provide an authentic and engaging learning experience for medical students in the context of VP simulations for training CR. Beyond its limitations, several aspects of potential improvement were identified for the social robotic platform, lending promise for this technology as a means toward the attainment of learning outcomes within medical education curricula. %M 40053778 %R 10.2196/63312 %U https://www.jmir.org/2025/1/e63312 %U https://doi.org/10.2196/63312 %U http://www.ncbi.nlm.nih.gov/pubmed/40053778 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e62758 %T Current Landscape and Future Directions for Mental Health Conversational Agents for Youth: Scoping Review %A Park,Jinkyung Katie %A Singh,Vivek K %A Wisniewski,Pamela %+ Human-Centered Computing Division, School of Computing, Clemson University, 105 Sikes Hall, Clemson, SC, United States, 1 864 656 3444, jinkyup@clemson.edu %K conversational agent %K chatbot %K mental health %K youth %K adolescent %K scoping review %K Preferred Reporting Items for Systematic Reviews and Meta-Analyses %K artificial intelligence %D 2025 %7 28.2.2025 %9 Review %J JMIR Med Inform %G English %X Background: Conversational agents (CAs; chatbots) are systems with the ability to interact with users using natural human dialogue. They are increasingly used to support interactive knowledge discovery of sensitive topics such as mental health topics. While much of the research on CAs for mental health has focused on adult populations, the insights from such research may not apply to CAs for youth. Objective: This study aimed to comprehensively evaluate the state-of-the-art research on mental health CAs for youth. Methods: Following PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, we identified 39 peer-reviewed studies specific to mental health CAs designed for youth across 4 databases, including ProQuest, Scopus, Web of Science, and PubMed. We conducted a scoping review of the literature to evaluate the characteristics of research on mental health CAs designed for youth, the design and computational considerations of mental health CAs for youth, and the evaluation outcomes reported in the research on mental health CAs for youth. Results: We found that many mental health CAs (11/39, 28%) were designed as older peers to provide therapeutic or educational content to promote youth mental well-being. All CAs were designed based on expert knowledge, with a few that incorporated inputs from youth. The technical maturity of CAs was in its infancy, focusing on building prototypes with rule-based models to deliver prewritten content, with limited safety features to respond to imminent risk. Research findings suggest that while youth appreciate the 24/7 availability of friendly or empathetic conversation on sensitive topics with CAs, they found the content provided by CAs to be limited. 
Finally, we found that most (35/39, 90%) of the reviewed studies did not address the ethical aspects of mental health CAs, while youth were concerned about the privacy and confidentiality of their sensitive conversation data. Conclusions: Our study highlights the need for researchers to continue to work together to align evidence-based research on mental health CAs for youth with lessons learned on how to best deliver these technologies to youth. Our review brings to light mental health CAs needing further development and evaluation. The new trend of large language model–based CAs can make such technologies more feasible. However, the privacy and safety of the systems should be prioritized. Although preliminary evidence shows positive trends in mental health CAs, long-term evaluative research with larger sample sizes and robust research designs is needed to validate their efficacy. More importantly, collaboration between youth and clinical experts is essential from the early design stages through to the final evaluation to develop safe, effective, and youth-centered mental health chatbots. Finally, best practices for risk mitigation and ethical development of CAs with and for youth are needed to promote their mental well-being. %M 40053735 %R 10.2196/62758 %U https://medinform.jmir.org/2025/1/e62758 %U https://doi.org/10.2196/62758 %U http://www.ncbi.nlm.nih.gov/pubmed/40053735 %0 Journal Article %@ 2562-7600 %I JMIR Publications %V 8 %N %P e63058 %T Advancing Clinical Chatbot Validation Using AI-Powered Evaluation With a New 3-Bot Evaluation System: Instrument Validation Study %A Choo,Seungheon %A Yoo,Suyoung %A Endo,Kumiko %A Truong,Bao %A Son,Meong Hi %K artificial intelligence %K patient education %K therapy %K computer-assisted %K computer %K understandable %K accurate %K understandability %K automation %K chatbots %K bots %K conversational agents %K emotions %K emotional %K depression %K depressive %K anxiety %K anxious %K nervous %K nervousness %K empathy %K empathetic %K communication %K interactions %K frustrated %K frustration %K relationships %D 2025 %7 27.2.2025 %9 %J JMIR Nursing %G English %X Background: The health care sector faces a projected shortfall of 10 million workers by 2030. Artificial intelligence (AI) automation in areas such as patient education and initial therapy screening presents a strategic response to mitigate this shortage and reallocate medical staff to higher-priority tasks. However, current methods of evaluating early-stage health care AI chatbots are highly limited due to safety concerns and the amount of time and effort that goes into evaluating them. Objective: This study introduces a novel 3-bot method for efficiently testing and validating early-stage AI health care provider chatbots. To extensively test AI provider chatbots without involving real patients or researchers, various AI patient bots and an evaluator bot were developed. Methods: Provider bots interacted with AI patient bots embodying frustrated, anxious, or depressed personas. An evaluator bot reviewed interaction transcripts based on specific criteria. Human experts then reviewed each interaction transcript, and the evaluator bot’s results were compared to human evaluation results to ensure accuracy. Results: The patient-education bot’s evaluations by the AI evaluator and the human evaluator were nearly identical, with minimal variance, limiting the opportunity for further analysis. 
The screening bot’s evaluations also yielded similar results between the AI evaluator and human evaluator. Statistical analysis confirmed the reliability and accuracy of the AI evaluations. Conclusions: The innovative evaluation method provides a safe, adaptable, and effective means to test and refine early versions of health care provider chatbots without risking patient safety or investing excessive researcher time and effort. Our patient-education evaluator bots could have benefited from a larger set of evaluation criteria; the nearly identical results from the AI and human evaluators may have arisen because of the small number of evaluation criteria. We were limited in the amount of prompting we could input into each bot because response time increases with prompt length. In the future, using techniques such as retrieval augmented generation will allow the system to receive more information and become more specific and accurate in evaluating the chatbots. This evaluation method will allow for rapid testing and validation of health care chatbots to automate basic medical tasks, freeing providers to address more complex tasks. %R 10.2196/63058 %U https://nursing.jmir.org/2025/1/e63058 %U https://doi.org/10.2196/63058 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e66478 %T Novel Evaluation Metric and Quantified Performance of ChatGPT-4 Patient Management Simulations for Early Clinical Education: Experimental Study %A Scherr,Riley %A Spina,Aidin %A Dao,Allen %A Andalib,Saman %A Halaseh,Faris F %A Blair,Sarah %A Wiechmann,Warren %A Rivera,Ronald %K medical school simulations %K AI in medical education %K preclinical curriculum %K ChatGPT %K ChatGPT-4 %K medical simulation %K simulation %K multimedia %K feedback %K medical education %K medical student %K clinical education %K pilot study %K patient management %D 2025 %7 27.2.2025 %9 %J JMIR Form Res %G English %X Background: Case studies have shown ChatGPT can run clinical simulations at the medical student level. However, no data have assessed ChatGPT’s reliability in meeting desired simulation criteria such as medical accuracy, simulation formatting, and robust feedback mechanisms. Objective: This study aims to quantify ChatGPT’s ability to consistently follow formatting instructions and create simulations for preclinical medical student learners according to principles of medical simulation and multimedia educational technology. Methods: Using ChatGPT-4 and a prevalidated starting prompt, the authors ran 360 separate simulations of an acute asthma exacerbation. A total of 180 simulations were given correct answers and 180 simulations were given incorrect answers. ChatGPT was evaluated for its ability to adhere to basic simulation parameters (stepwise progression, free response, interactivity), advanced simulation parameters (autonomous conclusion, delayed feedback, comprehensive feedback), and medical accuracy (vignette, treatment updates, feedback). Significance was determined with χ² analyses using 95% CIs for odds ratios. Results: In total, 100% (n=360) of simulations met basic simulation parameters and were medically accurate. For advanced parameters, 55% (200/360) of all simulations delayed feedback, with the Correct arm delaying feedback significantly more often (157/180, 87%) than the Incorrect arm (43/180, 24%; P<.001). 
A total of 79% (285/360) of simulations concluded autonomously, and there was no difference between the Correct and Incorrect arms in autonomous conclusion (146/180, 81% and 139/180, 77%; P=.36). Overall, 78% (282/360) of simulations gave comprehensive feedback, and there was no difference between the Correct and Incorrect arms in comprehensive feedback (137/180, 76% and 145/180, 81%; P=.31). ChatGPT-4 was not significantly more likely to conclude simulations autonomously (P=.34) and provide comprehensive feedback (P=.27) when feedback was delayed compared to when feedback was not delayed. Conclusions: These simulations have the potential to be a reliable educational tool for simple simulations and can be evaluated by a novel 9-part metric. Per this metric, ChatGPT simulations performed perfectly on medical accuracy and basic simulation parameters. It performed well on comprehensive feedback and autonomous conclusion. Delayed feedback depended on the accuracy of user inputs. A simulation meeting one advanced parameter was not more likely to meet all advanced parameters. Further work must be done to ensure consistent performance across a broader range of simulation scenarios. %R 10.2196/66478 %U https://formative.jmir.org/2025/1/e66478 %U https://doi.org/10.2196/66478 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67010 %T Stroke Diagnosis and Prediction Tool Using ChatGLM: Development and Validation Study %A Song,Xiaowei %A Wang,Jiayi %A He,Feifei %A Yin,Wei %A Ma,Weizhi %A Wu,Jian %+ Department of Neurology, Beijing Tsinghua Changgung Hospital, School of Clinical Medicine, Tsinghua University, No.168 of Litang Road, Beijing, 102218, China, 86 01056118918, wujianxuanwu@126.com %K stroke %K diagnosis %K large language model %K ChatGLM %K generative language model %K primary care %K acute stroke %K prediction tool %K stroke detection %K treatment %K electronic health records %K noncontrast computed tomography %D 2025 %7 26.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Stroke is a globally prevalent disease that imposes a significant burden on health care systems and national economies. Accurate and rapid stroke diagnosis can substantially increase reperfusion rates, mitigate disability, and reduce mortality. However, there are considerable discrepancies in the diagnosis and treatment of acute stroke. Objective: The aim of this study is to develop and validate a stroke diagnosis and prediction tool using ChatGLM-6B, which uses free-text information from electronic health records in conjunction with noncontrast computed tomography (NCCT) reports to enhance stroke detection and treatment. Methods: A large language model (LLM) using ChatGLM-6B was proposed to facilitate stroke diagnosis by identifying optimal input combinations, using external tools, and applying instruction tuning and low-rank adaptation (LoRA) techniques. A dataset containing details of 1885 patients with and those without stroke from 2016 to 2024 was used for training and internal validation; another 335 patients from two hospitals were used as an external test set, including 230 patients from the training hospital but admitted at different periods, and 105 patients from another hospital. Results: The LLM, which is based on clinical notes and NCCT, demonstrates exceptionally high accuracy in stroke diagnosis, achieving 99% in the internal validation dataset and 95.5% and 79.1% in two external test cohorts. 
It effectively distinguishes between ischemia and hemorrhage, with an accuracy of 100% in the validation dataset and 99.1% and 97.1% in the other test cohorts. In addition, it identifies large vessel occlusions (LVO) with an accuracy of 80% in the validation dataset and 88.6% and 83.3% in the other test cohorts. Furthermore, it screens patients eligible for intravenous thrombolysis (IVT) with an accuracy of 89.4% in the validation dataset and 60% and 80% in the other test cohorts. Conclusions: We developed an LLM that leverages clinical text and NCCT to identify strokes and guide recanalization therapy. While our results necessitate validation through widespread deployment, they hold the potential to enhance stroke identification and reduce reperfusion time. %M 40009850 %R 10.2196/67010 %U https://www.jmir.org/2025/1/e67010 %U https://doi.org/10.2196/67010 %U http://www.ncbi.nlm.nih.gov/pubmed/40009850 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e68347 %T The Feasibility of Large Language Models in Verbal Comprehension Assessment: Mixed Methods Feasibility Study %A Hadar-Shoval,Dorit %A Lvovsky,Maya %A Asraf,Kfir %A Shimoni,Yoav %A Elyoseph,Zohar %+ Department of Brain Sciences, Faculty of Medicine, Imperial College London, Reynolds Building, St Dunstan's Road, Charing Cross Hospital, Hammersmith, London, W6 8RP, United Kingdom, 972 547836088, Zohar.j.a@gmail.com %K large language models %K verbal comprehension assessment %K artificial intelligence %K AI in psychodiagnostics %K personalized intelligence tests %K verbal comprehension index %K Wechsler Adult Intelligence Scale %K WAIS-III %K psychological test validity %K ethics in computerized cognitive assessment %D 2025 %7 24.2.2025 %9 Original Paper %J JMIR Form Res %G English %X Background: Cognitive assessment is an important component of applied psychology, but limited access and high costs make these evaluations challenging. Objective: This study aimed to examine the feasibility of using large language models (LLMs) to create personalized artificial intelligence–based verbal comprehension tests (AI-BVCTs) for assessing verbal intelligence, in contrast with traditional assessment methods based on standardized norms. Methods: We used a within-participants design, comparing scores obtained from AI-BVCTs with those from the Wechsler Adult Intelligence Scale (WAIS-III) verbal comprehension index (VCI). In total, 8 Hebrew-speaking participants completed both the VCI and AI-BVCT, the latter being generated using the LLM Claude. Results: The concordance correlation coefficient (CCC) demonstrated strong agreement between AI-BVCT and VCI scores (Claude: CCC=.75, 90% CI 0.266-0.933; GPT-4: CCC=.73, 90% CI 0.170-0.935). Pearson correlations further supported these findings, showing strong associations between VCI and AI-BVCT scores (Claude: r=.84, P<.001; GPT-4: r=.77, P=.02). No statistically significant differences were found between AI-BVCT and VCI scores (P>.05). Conclusions: These findings support the potential of LLMs to assess verbal intelligence. The study attests to the promise of AI-based cognitive tests in increasing the accessibility and affordability of assessment processes, enabling personalized testing. The research also raises ethical concerns regarding privacy and overreliance on AI in clinical work. Further research with larger and more diverse samples is needed to establish the validity and reliability of this approach and develop more accurate scoring procedures. 
%M 39993720 %R 10.2196/68347 %U https://formative.jmir.org/2025/1/e68347 %U https://doi.org/10.2196/68347 %U http://www.ncbi.nlm.nih.gov/pubmed/39993720 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e58670 %T Leveraging Medical Knowledge Graphs Into Large Language Models for Diagnosis Prediction: Design and Application Study %A Gao,Yanjun %A Li,Ruizhe %A Croxford,Emma %A Caskey,John %A Patterson,Brian W %A Churpek,Matthew %A Miller,Timothy %A Dligach,Dmitriy %A Afshar,Majid %+ Department of Biomedical Informatics, University of Colorado Anschutz Medical Campus, 1890 N Revere Ct, Denver, CO, 80045, United States, 1 303 724 5375, yanjun.gao@cuanschutz.edu %K knowledge graph %K natural language processing %K machine learning %K electronic health record %K large language model %K diagnosis prediction %K graph model %K artificial intelligence %D 2025 %7 24.2.2025 %9 Original Paper %J JMIR AI %G English %X Background: Electronic health records (EHRs) and routine documentation practices play a vital role in patients’ daily care, providing a holistic record of health, diagnoses, and treatment. However, complex and verbose EHR narratives can overwhelm health care providers, increasing the risk of diagnostic inaccuracies. While large language models (LLMs) have showcased their potential in diverse language tasks, their application in health care must prioritize the minimization of diagnostic errors and the prevention of patient harm. Integrating knowledge graphs (KGs) into LLMs offers a promising approach because structured knowledge from KGs could enhance LLMs’ diagnostic reasoning by providing contextually relevant medical information. Objective: This study introduces DR.KNOWS (Diagnostic Reasoning Knowledge Graph System), a model that integrates Unified Medical Language System–based KGs with LLMs to improve diagnostic predictions from EHR data by retrieving contextually relevant paths aligned with patient-specific information. Methods: DR.KNOWS combines a stack graph isomorphism network for node embedding with an attention-based path ranker to identify and rank knowledge paths relevant to a patient’s clinical context. We evaluated DR.KNOWS on 2 real-world EHR datasets from different geographic locations, comparing its performance to baseline models, including QuickUMLS and standard LLMs (Text-to-Text Transfer Transformer and ChatGPT). To assess diagnostic reasoning quality, we designed and implemented a human evaluation framework grounded in clinical safety metrics. Results: DR.KNOWS demonstrated notable improvements over baseline models, showing higher accuracy in extracting diagnostic concepts and enhanced diagnostic prediction metrics. Prompt-based fine-tuning of Text-to-Text Transfer Transformer with DR.KNOWS knowledge paths achieved the highest ROUGE-L (Recall-Oriented Understudy for Gisting Evaluation–Longest Common Subsequence) and concept unique identifier F1-scores, highlighting the benefits of KG integration. Human evaluators found the diagnostic rationales of DR.KNOWS to be aligned strongly with correct clinical reasoning, indicating improved abstraction and reasoning. Recognized limitations include potential biases within the KG data, which we addressed by emphasizing case-specific path selection and proposing future bias-mitigation strategies. Conclusions: DR.KNOWS offers a robust approach for enhancing diagnostic accuracy and reasoning by integrating structured KG knowledge into LLM-based clinical workflows. 
Although further work is required to address KG biases and extend generalizability, DR.KNOWS represents progress toward trustworthy artificial intelligence–driven clinical decision support, with a human evaluation framework focused on diagnostic safety and alignment with clinical standards. %M 39993309 %R 10.2196/58670 %U https://ai.jmir.org/2025/1/e58670 %U https://doi.org/10.2196/58670 %U http://www.ncbi.nlm.nih.gov/pubmed/39993309 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 12 %N %P e60432 %T Exploring the Ethical Challenges of Conversational AI in Mental Health Care: Scoping Review %A Rahsepar Meadi,Mehrdad %A Sillekens,Tomas %A Metselaar,Suzanne %A van Balkom,Anton %A Bernstein,Justin %A Batelaan,Neeltje %+ Department of Psychiatry, Amsterdam Public Health, Vrije Universiteit Amsterdam, Boelelaan 1117, Amsterdam, 1081 HV, The Netherlands, 31 204444444, m.rahseparmeadi@ggzingeest.nl %K chatbot %K mHealth %K mobile health %K ethics %K mental health %K conversational agent %K artificial intelligence %K psychotherapy %K scoping review %K conversational agents %K digital technology %K natural language processing %K qualitative %K psychotherapist %D 2025 %7 21.2.2025 %9 Review %J JMIR Ment Health %G English %X Background: Conversational artificial intelligence (CAI) is emerging as a promising digital technology for mental health care. CAI apps, such as psychotherapeutic chatbots, are available in app stores, but their use raises ethical concerns. Objective: We aimed to provide a comprehensive overview of ethical considerations surrounding CAI as a therapist for individuals with mental health issues. Methods: We conducted a systematic search across PubMed, Embase, APA PsycINFO, Web of Science, Scopus, the Philosopher’s Index, and ACM Digital Library databases. Our search comprised 3 elements: embodied artificial intelligence, ethics, and mental health. We defined CAI as a conversational agent that interacts with a person and uses artificial intelligence to formulate output. We included articles discussing the ethical challenges of CAI functioning in the role of a therapist for individuals with mental health issues. We added additional articles through snowball searching. We included articles in English or Dutch. All types of articles were considered except abstracts of symposia. Screening for eligibility was done by 2 independent researchers (MRM and TS or AvB). An initial charting form was created based on the expected considerations and revised and complemented during the charting process. The ethical challenges were divided into themes. When a concern occurred in more than 2 articles, we identified it as a distinct theme. Results: We included 101 articles, of which 95% (n=96) were published in 2018 or later. Most were reviews (n=22, 21.8%) followed by commentaries (n=17, 16.8%). 
The following 10 themes were distinguished: (1) safety and harm (discussed in 52/101, 51.5% of articles); the most common topics within this theme were suicidality and crisis management, harmful or wrong suggestions, and the risk of dependency on CAI; (2) explicability, transparency, and trust (n=26, 25.7%), including topics such as the effects of “black box” algorithms on trust; (3) responsibility and accountability (n=31, 30.7%); (4) empathy and humanness (n=29, 28.7%); (5) justice (n=41, 40.6%), including themes such as health inequalities due to differences in digital literacy; (6) anthropomorphization and deception (n=24, 23.8%); (7) autonomy (n=12, 11.9%); (8) effectiveness (n=38, 37.6%); (9) privacy and confidentiality (n=62, 61.4%); and (10) concerns for health care workers’ jobs (n=16, 15.8%). Other themes were discussed in 9.9% (n=10) of the identified articles. Conclusions: Our scoping review has comprehensively covered ethical aspects of CAI in mental health care. While certain themes remain underexplored and stakeholders’ perspectives are insufficiently represented, this study highlights critical areas for further research. These include evaluating the risks and benefits of CAI in comparison to human therapists, determining its appropriate roles in therapeutic contexts and its impact on care access, and addressing accountability. Addressing these gaps can inform normative analysis and guide the development of ethical guidelines for responsible CAI use in mental health care. %M 39983102 %R 10.2196/60432 %U https://mental.jmir.org/2025/1/e60432 %U https://doi.org/10.2196/60432 %U http://www.ncbi.nlm.nih.gov/pubmed/39983102 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63190 %T Leveraging Large Language Models for Infectious Disease Surveillance—Using a Web Service for Monitoring COVID-19 Patterns From Self-Reporting Tweets: Content Analysis %A Xie,Jiacheng %A Zhang,Ziyang %A Zeng,Shuai %A Hilliard,Joel %A An,Guanghui %A Tang,Xiaoting %A Jiang,Lei %A Yu,Yang %A Wan,Xiufeng %A Xu,Dong %+ Department of Electrical Engineering and Computer Science, University of Missouri, 227 Naka Hall, Columbia, MO, 65211, United States, 1 5738822299, xudong@missouri.edu %K COVID-19 %K self-reporting data %K large language model %K Twitter %K social media analysis %K natural language processing %K machine learning %D 2025 %7 20.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The emergence of new SARS-CoV-2 variants, the resulting reinfections, and post–COVID-19 condition continue to impact many people’s lives. Tracking websites like the one at Johns Hopkins University no longer report the daily confirmed cases, posing challenges to accurately determine the true extent of infections. Many COVID-19 cases with mild symptoms are self-assessed at home and reported on social media, which provides an opportunity to monitor and understand the progression and evolving trends of the disease. Objective: We aim to build a publicly available database of COVID-19–related tweets and extracted information about symptoms and recovery cycles from self-reported tweets. We have presented the results of our analysis of infection, reinfection, recovery, and long-term effects of COVID-19 on a visualization website that refreshes data on a weekly basis. Methods: We used Twitter (subsequently rebranded as X) to collect COVID-19–related data, from which 9 native English-speaking annotators annotated a training dataset of COVID-19–positive self-reporters. 
We then used large language models to identify positive self-reporters from other unannotated tweets. We used the Hilbert transform to calculate the lead of the prediction curve ahead of the reported curve. Finally, we presented our findings on symptoms, recovery, reinfections, and long-term effects of COVID-19 on the Covlab website. Results: We collected 7.3 million tweets related to COVID-19 between January 1, 2020, and April 1, 2024, including 262,278 self-reported cases. The number of infection cases predicted by our model led the official report by 7.63 days. In addition to common symptoms, we identified some symptoms that were not included in the list from the US Centers for Disease Control and Prevention, such as lethargy and hallucinations. Repeat infections were common, with rates of second and third infections at 7.49% (19,644/262,278) and 1.37% (3593/262,278), respectively, whereas 0.45% (1180/262,278) also reported that they had been infected >5 times. We identified 723 individuals who shared detailed recovery experiences through tweets, indicating a substantial reduction in recovery time over the years. Specifically, the average recovery period decreased from around 30 days in 2020 to approximately 12 days in 2023. In addition, geographic information collected from confirmed individuals indicates that the temporal patterns of confirmed cases in states such as California and Texas closely mirror the overall trajectory observed across the United States. Conclusions: Despite some biases and limitations, self-reported tweet data serve as a valuable complement to clinical data, especially in the postpandemic era dominated by mild cases. Our web-based analytic platform can play a significant role in continuously tracking COVID-19, finding new uncommon symptoms, detecting and monitoring the manifestation of long-term effects, and providing necessary insights to the public and decision-makers. %M 39977859 %R 10.2196/63190 %U https://www.jmir.org/2025/1/e63190 %U https://doi.org/10.2196/63190 %U http://www.ncbi.nlm.nih.gov/pubmed/39977859 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e63400 %T Perceptions and Earliest Experiences of Medical Students and Faculty With ChatGPT in Medical Education: Qualitative Study %A Abouammoh,Noura %A Alhasan,Khalid %A Aljamaan,Fadi %A Raina,Rupesh %A Malki,Khalid H %A Altamimi,Ibraheem %A Muaygil,Ruaim %A Wahabi,Hayfaa %A Jamal,Amr %A Alhaboob,Ali %A Assiri,Rasha Assad %A Al-Tawfiq,Jaffar A %A Al-Eyadhy,Ayman %A Soliman,Mona %A Temsah,Mohamad-Hani %+ Pediatric Department, King Saud University Medical City, King Saud University, King Abdullah Road, Riyadh, 11424, Saudi Arabia, 966 114692002, mtemsah@ksu.edu.sa %K ChatGPT %K medical education %K Saudi Arabia %K perceptions %K knowledge %K medical students %K faculty %K chatbot %K qualitative study %K artificial intelligence %K AI %K AI-based tools %K universities %K thematic analysis %K learning %K satisfaction %D 2025 %7 20.2.2025 %9 Original Paper %J JMIR Med Educ %G English %X Background: With the rapid development of artificial intelligence technologies, there is a growing interest in the potential use of artificial intelligence–based tools like ChatGPT in medical education. However, there is limited research on the initial perceptions and experiences of faculty and students with ChatGPT, particularly in Saudi Arabia. 
Objective: This study aimed to explore the earliest knowledge, perceived benefits, concerns, and limitations of using ChatGPT in medical education among faculty and students at a leading Saudi Arabian university. Methods: A qualitative exploratory study was conducted in April 2023, involving focused meetings with medical faculty and students with varying levels of ChatGPT experience. A thematic analysis was used to identify key themes and subthemes emerging from the discussions. Results: Participants demonstrated good knowledge of ChatGPT and its functions. The main themes were perceptions of ChatGPT use, potential benefits, and concerns about ChatGPT in research and medical education. The perceived benefits included collecting and summarizing information and saving time and effort. However, concerns and limitations centered around the potential lack of critical thinking in the information provided, the ambiguity of references, limitations of access, trust in the output of ChatGPT, and ethical concerns. Conclusions: This study provides valuable insights into the perceptions and experiences of medical faculty and students regarding the use of newly introduced large language models like ChatGPT in medical education. While the benefits of ChatGPT were recognized, participants also expressed concerns and limitations requiring further studies for effective integration into medical education, exploring the impact of ChatGPT on learning outcomes, student and faculty satisfaction, and the development of critical thinking skills. %M 39977012 %R 10.2196/63400 %U https://mededu.jmir.org/2025/1/e63400 %U https://doi.org/10.2196/63400 %U http://www.ncbi.nlm.nih.gov/pubmed/39977012 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e72007 %T Authors’ Reply: Enhancing the Clinical Relevance of Al Research for Medication Decision-Making %A Vordenberg,Sarah E %A Nichols,Julianna %A Marshall,Vincent D %A Weir,Kristie Rebecca %A Dorsch,Michael P %+ , College of Pharmacy, University of Michigan, 428 Church St, Ann Arbor, MI, 48109, United States, 1 734 763 6691, skelling@med.umich.edu %K older adults %K artificial intelligence %K vignette %K pharmacology %K medication %K decision-making %K aging %K attitude %K perception %K perspective %K electronic heath record %D 2025 %7 18.2.2025 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 39964740 %R 10.2196/72007 %U https://www.jmir.org/2025/1/e72007 %U https://doi.org/10.2196/72007 %U http://www.ncbi.nlm.nih.gov/pubmed/39964740 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e70657 %T Enhancing the Clinical Relevance of Al Research for Medication Decision-Making %A Wang,Qi %A Chen,Mingxian %+ Department of Gastroenterology, Tongde Hospital of Zhejiang Province, 234 Gucui Street, Xihu Region, Hangzhou, 310012, China, 86 151 576 82797, chenmingxian2005@126.com %K older adults %K artificial intelligence %K medication %K decision-making %K data security %K patient trust %D 2025 %7 18.2.2025 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 39964744 %R 10.2196/70657 %U https://www.jmir.org/2025/1/e70657 %U https://doi.org/10.2196/70657 %U http://www.ncbi.nlm.nih.gov/pubmed/39964744 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 11 %N %P e65699 %T The Promise and Perils of Artificial Intelligence in Advancing Participatory Science and Health Equity in Public Health %A King,Abby C %A Doueiri,Zakaria N %A Kaulberg,Ankita %A Goldman Rosas,Lisa %K digital health %K artificial intelligence %K community-based 
participatory research %K citizen science %K health equity %K societal trends %K public health %K viewpoint %K policy makers %K public participation %K information technology %K micro-level data %K macro-level data %K LLM %K natural language processing %K machine learning %K language model %K Our Voice %D 2025 %7 14.2.2025 %9 %J JMIR Public Health Surveill %G English %X Current societal trends reflect an increased mistrust in science and a lowered civic engagement that threaten to impair research that is foundational for ensuring public health and advancing health equity. One effective countermeasure to these trends lies in community-facing citizen science applications to increase public participation in scientific research, making this field an important target for artificial intelligence (AI) exploration. We highlight potentially promising citizen science AI applications that extend beyond individual use to the community level, including conversational large language models, text-to-image generative AI tools, descriptive analytics for analyzing integrated macro- and micro-level data, and predictive analytics. The novel adaptations of AI technologies for community-engaged participatory research also bring an array of potential risks. We highlight possible negative externalities and mitigations for some of the potential ethical and societal challenges in this field. %R 10.2196/65699 %U https://publichealth.jmir.org/2025/1/e65699 %U https://doi.org/10.2196/65699 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 12 %N %P e68135 %T Leveraging Large Language Models and Agent-Based Systems for Scientific Data Analysis: Validation Study %A Peasley,Dale %A Kuplicki,Rayus %A Sen,Sandip %A Paulus,Martin %K LLM %K agent-based systems %K scientific data analysis %K data contextualization %K AI-driven research tools %K large language model %K scientific data %K analysis %K contextualization %K AI %K artificial intelligence %K research tool %D 2025 %7 13.2.2025 %9 %J JMIR Ment Health %G English %X Background: Large language models have shown promise in transforming how complex scientific data are analyzed and communicated, yet their application to scientific domains remains challenged by issues of factual accuracy and domain-specific precision. The Laureate Institute for Brain Research–Tulsa University (LIBR-TU) Research Agent (LITURAt) leverages a sophisticated agent-based architecture to mitigate these limitations, using external data retrieval and analysis tools to ensure reliable, context-aware outputs that make scientific information accessible to both experts and nonexperts. Objective: The objective of this study was to develop and evaluate LITURAt to enable efficient analysis and contextualization of complex scientific datasets for diverse user expertise levels. Methods: An agent-based system based on large language models was designed to analyze and contextualize complex scientific datasets using a “plan-and-solve” framework. The system dynamically retrieves local data and relevant PubMed literature, performs statistical analyses, and generates comprehensive, context-aware summaries to answer user queries with high accuracy and consistency. Results: Our experiments demonstrated that LITURAt achieved an internal consistency rate of 94.8% and an external consistency rate of 91.9% across repeated and rephrased queries. 
Additionally, GPT-4 evaluations rated 80.3% (171/213) of the system’s answers as accurate and comprehensive, with 23.5% (50/213) receiving the highest rating of 5 for completeness and precision. Conclusions: These findings highlight the potential of LITURAt to significantly enhance the accessibility and accuracy of scientific data analysis, achieving high consistency and strong performance in complex query resolution. Despite existing limitations, such as model stability for highly variable queries, LITURAt demonstrates promise as a robust tool for democratizing data-driven insights across diverse scientific domains. %R 10.2196/68135 %U https://mental.jmir.org/2025/1/e68135 %U https://doi.org/10.2196/68135 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e64290 %T Laypeople’s Use of and Attitudes Toward Large Language Models and Search Engines for Health Queries: Survey Study %A Mendel,Tamir %A Singh,Nina %A Mann,Devin M %A Wiesenfeld,Batia %A Nov,Oded %+ Department of Technology Management and Innovation, Tandon School of Engineering, New York University, 2 Metrotech Center, Brooklyn, New York, NY, 11201, United States, 1 8287348968, tamir.mendel@nyu.edu %K large language model %K artificial intelligence %K LLMs %K search engine %K Google %K internet %K online health information %K United States %K survey %K mobile phone %D 2025 %7 13.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Laypeople have easy access to health information through large language models (LLMs), such as ChatGPT, and search engines, such as Google. Search engines transformed health information access, and LLMs offer a new avenue for answering laypeople’s questions. Objective: We aimed to compare the frequency of use and attitudes toward LLMs and search engines as well as their comparative relevance, usefulness, ease of use, and trustworthiness in responding to health queries. Methods: We conducted a screening survey to compare the demographics of LLM users and nonusers seeking health information, analyzing results with logistic regression. LLM users from the screening survey were invited to a follow-up survey to report the types of health information they sought. We compared the frequency of use of LLMs and search engines using ANOVA and Tukey post hoc tests. Lastly, paired-sample Wilcoxon tests compared LLMs and search engines on perceived usefulness, ease of use, trustworthiness, feelings, bias, and anthropomorphism. Results: In total, 2002 US participants recruited on Prolific participated in the screening survey about the use of LLMs and search engines. Of them, 52% (n=1045) of the participants were female, with a mean age of 39 (SD 13) years. Participants were 9.7% (n=194) Asian, 12.1% (n=242) Black, 73.3% (n=1467) White, 1.1% (n=22) Hispanic, and 3.8% (n=77) were of other races and ethnicities. Further, 1913 (95.6%) used search engines to look up health queries versus 642 (32.6%) for LLMs. Men had higher odds (odds ratio [OR] 1.63, 95% CI 1.34-1.99; P<.001) of using LLMs for health questions than women. Black (OR 1.90, 95% CI 1.42-2.54; P<.001) and Asian (OR 1.66, 95% CI 1.19-2.30; P<.01) individuals had higher odds than White individuals. Those with excellent perceived health (OR 1.46, 95% CI 1.1-1.93; P=.01) were more likely to use LLMs than those with good health. Higher technical proficiency increased the likelihood of LLM use (OR 1.26, 95% CI 1.14-1.39; P<.001). 
In a follow-up survey of 281 LLM users for health, most participants used search engines first (n=174, 62%) to answer health questions, but the second most common first source consulted was LLMs (n=39, 14%). LLMs were perceived as less useful (P<.01) and less relevant (P=.07), but elicited fewer negative feelings (P<.001), appeared more human (LLM: n=160, vs search: n=32), and were seen as less biased (P<.001). Trust (P=.56) and ease of use (P=.27) showed no differences. Conclusions: Search engines are the primary source of health information; yet, positive perceptions of LLMs suggest growing use. Future work could explore whether LLM trust and usefulness are enhanced by supplementing answers with external references and limiting persuasive language to curb overreliance. Collaboration with health organizations can help improve the quality of LLMs’ health output. %M 39946180 %R 10.2196/64290 %U https://www.jmir.org/2025/1/e64290 %U https://doi.org/10.2196/64290 %U http://www.ncbi.nlm.nih.gov/pubmed/39946180 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e48328 %T Large Language Models–Supported Thrombectomy Decision-Making in Acute Ischemic Stroke Based on Radiology Reports: Feasibility Qualitative Study %A Kottlors,Jonathan %A Hahnfeldt,Robert %A Görtz,Lukas %A Iuga,Andra-Iza %A Fervers,Philipp %A Bremm,Johannes %A Zopfs,David %A Laukamp,Kai R %A Onur,Oezguer A %A Lennartz,Simon %A Schönfeld,Michael %A Maintz,David %A Kabbasch,Christoph %A Persigehl,Thorsten %A Schlamann,Marc %+ Institute for Diagnostic and Interventional Radiology, Faculty of Medicine and University Hospital Cologne, University of Cologne, Kerpener Straße 62, Cologne, 50937, Germany, 49 221 47896063, jonathan.kottlors@uk-koeln.de %K artificial intelligence %K radiology %K report %K large language model %K text-based augmented supporting system %K mechanical thrombectomy %K GPT %K stroke %K decision-making %K thrombectomy %K imaging %K model %K machine learning %K ischemia %D 2025 %7 13.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The latest advancement of artificial intelligence (AI) is generative pretrained transformer large language models (LLMs). They have been trained on massive amounts of text, enabling humanlike and semantical responses to text-based inputs and requests. Foreshadowing numerous possible applications in various fields, the potential of such tools for medical data integration and clinical decision-making is not yet clear. Objective: In this study, we investigate the potential of LLMs in report-based medical decision-making on the example of acute ischemic stroke (AIS), where clinical and image-based information may indicate an immediate need for mechanical thrombectomy (MT). The purpose was to elucidate the feasibility of integrating radiology report data and other clinical information in the context of therapy decision-making using LLMs. Methods: A hundred patients with AIS were retrospectively included, for which 50% (50/100) was indicated for MT, whereas the other 50% (50/100) was not. The LLM was provided with the computed tomography report, information on neurological symptoms and onset, and patients’ age. The performance of the AI decision-making model was compared with an expert consensus regarding the binary determination of MT indication, for which sensitivity, specificity, and accuracy were calculated. Results: The AI model had an overall accuracy of 88%, with a specificity of 96% and a sensitivity of 80%. 
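For readers who want to reproduce the kind of agreement metrics reported for the thrombectomy decision model above, the snippet below shows one standard way to derive sensitivity, specificity, and accuracy from binary model outputs versus expert consensus; the label vectors are invented and are not the study's data.

```python
# Sketch: sensitivity, specificity, and accuracy for a binary MT-indication
# decision versus expert consensus. The label vectors below are invented.
from sklearn.metrics import confusion_matrix

expert = [1, 1, 0, 0, 1, 0, 0, 1]  # 1 = MT indicated per expert consensus (hypothetical)
model = [1, 0, 0, 0, 1, 0, 1, 1]   # 1 = MT indicated per LLM output (hypothetical)

tn, fp, fn, tp = confusion_matrix(expert, model, labels=[0, 1]).ravel()
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"sensitivity={sensitivity:.2f} specificity={specificity:.2f} accuracy={accuracy:.2f}")
```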
The area under the curve for the report-based MT decision was 0.92. Conclusions: The LLM achieved promising accuracy in determining the eligibility of patients with AIS for MT based on radiology reports and clinical information. Our results underscore the potential of LLMs for radiological and medical data integration. This investigation should serve as a stimulus for further clinical applications of LLMs, in which this AI should be used as an augmented supporting system for human decision-making. %M 39946168 %R 10.2196/48328 %U https://www.jmir.org/2025/1/e48328 %U https://doi.org/10.2196/48328 %U http://www.ncbi.nlm.nih.gov/pubmed/39946168 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e64318 %T Performance Assessment of Large Language Models in Medical Consultation: Comparative Study %A Seo,Sujeong %A Kim,Kyuli %A Yang,Heyoung %+ Future Technology Analysis Center, Korea Institute of Science and Technology Information, Hoegi-ro 66, Dongdaemun-gu, Seoul, 92456, Republic of Korea, 82 10 9265 5661, hyyang@kisti.re.kr %K artificial intelligence %K biomedical %K large language model %K depression %K similarity measurement %K text validity %D 2025 %7 12.2.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: The recent introduction of generative artificial intelligence (AI) as an interactive consultant has sparked interest in evaluating its applicability in medical discussions and consultations, particularly within the domain of depression. Objective: This study evaluates the capability of large language models (LLMs) in AI to generate responses to depression-related queries. Methods: Using the PubMedQA and QuoraQA data sets, we compared various LLMs, including BioGPT, PMC-LLaMA, GPT-3.5, and Llama2, and measured the similarity between the generated and original answers. Results: The latest general LLMs, GPT-3.5 and Llama2, exhibited superior performance, particularly in generating responses to medical inquiries from the PubMedQA data set. Conclusions: Considering the rapid advancements in LLM development in recent years, it is hypothesized that version upgrades of general LLMs offer greater potential for enhancing their ability to generate “knowledge text” in the biomedical domain compared with fine-tuning for the biomedical field. These findings are expected to contribute significantly to the evolution of AI-based medical counseling systems. %M 39763114 %R 10.2196/64318 %U https://medinform.jmir.org/2025/1/e64318 %U https://doi.org/10.2196/64318 %U http://www.ncbi.nlm.nih.gov/pubmed/39763114 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e58766 %T Generative Artificial Intelligence in Medical Education—Policies and Training at US Osteopathic Medical Schools: Descriptive Cross-Sectional Survey %A Ichikawa,Tsunagu %A Olsen,Elizabeth %A Vinod,Arathi %A Glenn,Noah %A Hanna,Karim %A Lund,Gregg C %A Pierce-Talsma,Stacey %K artificial intelligence %K medical education %K faculty development %K policy %K AI %K training %K United States %K school %K university %K college %K institution %K osteopathic %K osteopathy %K curriculum %K student %K faculty %K administrator %K survey %K cross-sectional %D 2025 %7 11.2.2025 %9 %J JMIR Med Educ %G English %X Background: Interest has recently increased in generative artificial intelligence (GenAI), a subset of artificial intelligence that can create new content. 
Although the publicly available GenAI tools are not specifically trained in the medical domain, they have demonstrated proficiency in a wide range of medical assessments. The future integration of GenAI in medicine remains unknown. However, the rapid availability of GenAI with a chat interface and the potential risks and benefits are the focus of great interest. As with any significant medical advancement or change, medical schools must adapt their curricula to equip students with the skills necessary to become successful physicians. Furthermore, medical schools must ensure that faculty members have the skills to harness these new opportunities to increase their effectiveness as educators. How medical schools currently fulfill their responsibilities is unclear. Colleges of Osteopathic Medicine (COMs) in the United States currently train a significant proportion of the total number of medical students. These COMs are in academic settings ranging from large public research universities to small private institutions. Therefore, studying COMs will offer a representative sample of the current GenAI integration in medical education. Objective: This study aims to describe the policies and training regarding the specific aspect of GenAI in US COMs, targeting students, faculty, and administrators. Methods: Web-based surveys were sent to deans and Student Government Association (SGA) presidents of the main campuses of fully accredited US COMs. The dean survey included questions regarding current and planned policies and training related to GenAI for students, faculty, and administrators. The SGA president survey included only those questions related to current student policies and training. Results: Responses were received from 81% (26/32) of COMs surveyed. This included 47% (15/32) of the deans and 50% (16/32) of the SGA presidents (with 5 COMs represented by both the deans and the SGA presidents). Most COMs did not have a policy on the student use of GenAI, as reported by the dean (14/15, 93%) and the SGA president (14/16, 88%). Of the COMs with no policy, 79% (11/14) had no formal plans for policy development. Only 1 COM had training for students, which focused entirely on the ethics of using GenAI. Most COMs had no formal plans to provide mandatory (11/14, 79%) or elective (11/15, 73%) training. No COM had GenAI policies for faculty or administrators. Eighty percent had no formal plans for policy development. Furthermore, 33.3% (5/15) of COMs had faculty or administrator GenAI training. Except for examination question development, there was no training to increase faculty or administrator capabilities and efficiency or to decrease their workload. Conclusions: The survey revealed that most COMs lack GenAI policies and training for students, faculty, and administrators. The few institutions with policies or training were extremely limited in scope. Most institutions without current training or policies had no formal plans for development. The lack of current policies and training initiatives suggests inadequate preparedness for integrating GenAI into the medical school environment, therefore, relegating the responsibility for ethical guidance and training to the individual COM member. 
%R 10.2196/58766 %U https://mededu.jmir.org/2025/1/e58766 %U https://doi.org/10.2196/58766 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e68881 %T Understanding Human Papillomavirus Vaccination Hesitancy in Japan Using Social Media: Content Analysis %A Liu,Junyu %A Niu,Qian %A Nagai-Tanima,Momoko %A Aoyama,Tomoki %+ , Kyoto University, Yoshida-honmachi, Sakyo-Ku, Kyoto, 606-8501, Japan, 81 075 751 3952, aoyama.tomoki.4e@kyoto-u.ac.jp %K human papillomavirus %K HPV %K HPV vaccine %K vaccine confidence %K large language model %K stance analysis %K topic modeling %D 2025 %7 11.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Despite the reinstatement of proactive human papillomavirus (HPV) vaccine recommendations in 2022, Japan continues to face persistently low HPV vaccination rates, which pose significant public health challenges. Misinformation, complacency, and accessibility issues have been identified as key factors undermining vaccine uptake. Objective: This study aims to examine the evolution of public attitudes toward HPV vaccination in Japan by analyzing social media content. Specifically, we investigate the role of misinformation, public health events, and cross-vaccine attitudes (eg, COVID-19 vaccines) in shaping vaccine hesitancy over time. Methods: We collected tweets related to the HPV vaccine from 2011 to 2021. Natural language processing techniques and large language models (LLMs) were used for stance analysis of the collected data. Time series analysis and latent Dirichlet allocation topic modeling were used to identify shifts in public sentiment and topic trends over the decade. Misinformation within opposed-stance tweets was detected using LLMs. Furthermore, we analyzed the relationship between attitudes toward HPV and COVID-19 vaccines through logic analysis. Results: Among the tested models, Gemini 1.0 pro (Google) achieved the highest accuracy (0.902) for stance analysis, improving to 0.968 with hyperparameter tuning. Time series analysis identified significant shifts in public stance in 2013, 2016, and 2020, corresponding to key public health events and policy changes. Topic modeling revealed that discussions around vaccine safety peaked in 2015 before declining, while topics concerning vaccine effectiveness exhibited an opposite trend. Misinformation in the topic "Scientific Warnings and Public Health Risk" in the opposed-stance tweets reached a peak of 2.84% (47/1656) in 2012 and stabilized at approximately 0.5% from 2014 onward. The volume of tweets using HPV vaccine experiences to argue stances on COVID-19 vaccines was significantly higher than the reverse. Conclusions: Based on observations of public attitudes toward HPV vaccination in social media content over 10 years, our findings highlight the need for targeted public health interventions to address vaccine hesitancy in Japan. Although vaccine confidence has increased slowly, sustained efforts are necessary to ensure long-term improvements. Addressing misinformation, reducing complacency, and enhancing vaccine accessibility are key strategies for improving vaccine uptake. Some evidence suggests that confidence in one vaccine may positively influence perceptions of other vaccines. This study also demonstrated the use of LLMs in providing a comprehensive understanding of public health attitudes. Future public health strategies can benefit from these insights by designing effective interventions to boost vaccine confidence and uptake. 
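As a companion to the stance and topic analysis described in the Methods of the record above, the following is a minimal latent Dirichlet allocation sketch on a toy tweet corpus; the tweets and the two-topic setting are illustrative assumptions, not the study's data or configuration.

```python
# Sketch: latent Dirichlet allocation topic modeling on a toy tweet corpus.
# The tweets and the two-topic setting are illustrative assumptions.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = [
    "HPV vaccine side effects worry me",
    "cervical cancer prevention starts with the HPV vaccine",
    "my daughter got the HPV shot with no problems",
    "news report questions vaccine safety data",
]

vectorizer = CountVectorizer(stop_words="english")
doc_term = vectorizer.fit_transform(tweets)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(doc_term)

terms = vectorizer.get_feature_names_out()
for k, weights in enumerate(lda.components_):
    top_terms = [terms[i] for i in weights.argsort()[-5:][::-1]]
    print(f"topic {k}: {', '.join(top_terms)}")
```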
%M 39933163 %R 10.2196/68881 %U https://www.jmir.org/2025/1/e68881 %U https://doi.org/10.2196/68881 %U http://www.ncbi.nlm.nih.gov/pubmed/39933163 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e56737 %T Smart Pharmaceutical Monitoring System With Personalized Medication Schedules and Self-Management Programs for Patients With Diabetes: Development and Evaluation Study %A Xiao,Jian %A Li,Mengyao %A Cai,Ruwen %A Huang,Hangxing %A Yu,Huimin %A Huang,Ling %A Li,Jingyang %A Yu,Ting %A Zhang,Jiani %A Cheng,Shuqiao %+ Department of Pharmacy, Xiangya Hospital, Central South University, 87 Xiangya Rd, Changsha, 410008, China, 86 15974198203, cheng0203@csu.edu.cn %K pharmaceutical services %K diabetes %K self-management %K intelligent medication scheduling system %K drug database %K GPT-4 %D 2025 %7 11.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: With the climbing incidence of type 2 diabetes, the health care system is under pressure to manage patients with this condition properly. Particularly, pharmacological therapy constitutes the most fundamental means of controlling blood glucose levels and preventing the progression of complications. However, its effectiveness is often hindered by factors such as treatment complexity, polypharmacy, and poor patient adherence. As new technologies, artificial intelligence and digital technologies are covering all aspects of the medical and health care field, but their application and evaluation in the domain of diabetes research remain limited. Objective: This study aims to develop and establish a stand-alone diabetes management service system designed to enhance self-management support for patients, as well as to assess its performance with experienced health care professionals. Methods: Diabetes Universal Medication Schedule (DUMS) system is grounded in official medicine instructions and evidence-based data to establish medication constraints and drug-drug interaction profiles. Individualized medication schedules and self-management programs were generated based on patient-specific conditions and needs, using an app framework to build patient-side contact pathways. The system’s ability to provide medication guidance and health management was assessed by senior health care professionals using a 5-point Likert scale across 3 groups: outputs generated by the system (DUMS group), outputs refined by pharmacists (intervention group), and outputs generated by ChatGPT-4 (GPT-4 group). Results: We constructed a cloud-based drug information management system loaded with 475 diabetes treatment–related medications; 684 medication constraints; and 12,351 drug-drug interactions and theoretical supports. The generated personalized medication plan and self-management program included recommended dosing times, disease education, dietary considerations, and lifestyle recommendations to help patients with diabetes achieve correct medication use and active disease management. Reliability analysis demonstrated that the DUMS group outperformed the GPT-4 group in medication schedule accuracy and safety, as well as comprehensiveness and richness of the self-management program (P<.001). The intervention group outperformed the DUMS and GPT-4 groups on all indicator scores. Conclusions: DUMS’s treatment monitoring service can provide reliable self-management support for patients with diabetes. 
ChatGPT-4, powered by artificial intelligence, can act as a collaborative assistant to health care professionals in clinical contexts, although its performance still requires further training and optimization. %M 39933171 %R 10.2196/56737 %U https://www.jmir.org/2025/1/e56737 %U https://doi.org/10.2196/56737 %U http://www.ncbi.nlm.nih.gov/pubmed/39933171 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63824 %T Understanding Citizens’ Response to Social Activities on Twitter in US Metropolises During the COVID-19 Recovery Phase Using a Fine-Tuned Large Language Model: Application of AI %A Saito,Ryuichi %A Tsugawa,Sho %+ , Institute of Systems and Information Engineering, University of Tsukuba, 1-1-1 Tennodai, Tsukuba, 305-8577, Japan, 81 08055751714, saito.ryuichi.tkb_gw@u.tsukuba.ac.jp %K COVID-19 %K restriction %K United States %K X %K Twitter %K sentiment analysis %K large language model %K LLM %K GPT-3.5 %K fine-tuning %D 2025 %7 11.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The COVID-19 pandemic continues to hold an important place in the collective memory as of 2024. As of March 2024, >676 million cases, 6 million deaths, and 13 billion vaccine doses have been reported. It is crucial to evaluate sociopsychological impacts as well as public health indicators such as these to understand the effects of the COVID-19 pandemic. Objective: This study aimed to explore the sentiments of residents of major US cities toward restrictions on social activities in 2022 during the transitional phase of the COVID-19 pandemic, from the peak of the pandemic to its gradual decline. By illuminating people’s susceptibility to COVID-19, we provide insights into the general sentiment trends during the recovery phase of the pandemic. Methods: To analyze these trends, we collected posts (N=119,437) on the social media platform Twitter (now X) created by people living in New York City, Los Angeles, and Chicago from December 2021 to December 2022, which were impacted by the COVID-19 pandemic in similar ways. A total of 47,111 unique users authored these posts. In addition, for privacy considerations, any identifiable information, such as author IDs and usernames, was excluded, retaining only the text for analysis. Then, we developed a sentiment estimation model by fine-tuning a large language model on the collected data and used it to analyze how citizens’ sentiments evolved throughout the pandemic. Results: In the evaluation of models, GPT-3.5 Turbo with fine-tuning outperformed GPT-3.5 Turbo without fine-tuning and Robustly Optimized Bidirectional Encoder Representations from Transformers Pretraining Approach (RoBERTa)–large with fine-tuning, demonstrating significant accuracy (0.80), recall (0.79), precision (0.79), and F1-score (0.79). The findings using GPT-3.5 Turbo with fine-tuning reveal a significant relationship between sentiment levels and actual cases in all 3 cities. Specifically, the correlation coefficient for New York City is 0.89 (95% CI 0.81-0.93), for Los Angeles is 0.39 (95% CI 0.14-0.60), and for Chicago is 0.65 (95% CI 0.47-0.78). Furthermore, feature words analysis showed that COVID-19–related keywords were replaced with non–COVID-19-related keywords in New York City and Los Angeles from January 2022 onward and Chicago from March 2022 onward. Conclusions: The results show a gradual decline in sentiment and interest in restrictions across all 3 cities as the pandemic approached its conclusion. 
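The sentiment estimation approach in the record above relies on fine-tuning GPT-3.5 Turbo on labeled posts. A hedged sketch of the chat-format JSONL file commonly used for that kind of fine-tuning job is shown below; the tweets, labels, and three-class scheme are invented and are not necessarily the authors' labeling design.

```python
# Sketch: a chat-format JSONL file of the kind commonly used to fine-tune
# GPT-3.5 Turbo as a tweet-sentiment classifier. Tweets, labels, and the
# three-class scheme are invented for illustration.
import json

labeled_tweets = [
    ("Finally ate indoors again, feels almost normal", "positive"),
    ("Another event cancelled because of the covid rules", "negative"),
    ("Mask mandate extended through next month", "neutral"),
]

with open("sentiment_finetune.jsonl", "w", encoding="utf-8") as f:
    for text, label in labeled_tweets:
        record = {
            "messages": [
                {"role": "system",
                 "content": "Classify the tweet's sentiment toward COVID-19 "
                            "restrictions as positive, neutral, or negative."},
                {"role": "user", "content": text},
                {"role": "assistant", "content": label},
            ]
        }
        f.write(json.dumps(record) + "\n")
```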
These results are also ensured by a sentiment estimation model fine-tuned on actual Twitter posts. This study represents the first attempt from a macro perspective to depict sentiment using a classification model created with actual data from the period when COVID-19 was prevalent. This approach can be applied to the spread of other infectious diseases by adjusting search keywords for observational data. %M 39932775 %R 10.2196/63824 %U https://www.jmir.org/2025/1/e63824 %U https://doi.org/10.2196/63824 %U http://www.ncbi.nlm.nih.gov/pubmed/39932775 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e60095 %T Developing an ICD-10 Coding Assistant: Pilot Study Using RoBERTa and GPT-4 for Term Extraction and Description-Based Code Selection %A Puts,Sander %A Zegers,Catharina M L %A Dekker,Andre %A Bermejo,Iñigo %K International Classification of Diseases %K ICD-10 %K computer-assisted-coding %K GPT-4 %K coding %K term extraction %K code analysis %K computer assisted coding %K transformer model %K artificial intelligence %K AI automation %K retrieval-augmented generation %K RAG %K large language model %K LLM %K Bidirectional Encoder Representations from Transformers %K Robustly Optimized BERT Pretraining Approach %K RoBERTa %K named entity recognition %K NER %D 2025 %7 11.2.2025 %9 %J JMIR Form Res %G English %X Background: The International Classification of Diseases (ICD), developed by the World Health Organization, standardizes health condition coding to support health care policy, research, and billing, but artificial intelligence automation, while promising, still underperforms compared with human accuracy and lacks the explainability needed for adoption in medical settings. Objective: The potential of large language models for assisting medical coders in the ICD-10 coding was explored through the development of a computer-assisted coding system. This study aimed to augment human coding by initially identifying lead terms and using retrieval-augmented generation (RAG)–based methods for computer-assisted coding enhancement. Methods: The explainability dataset from the CodiEsp challenge (CodiEsp-X) was used, featuring 1000 Spanish clinical cases annotated with ICD-10 codes. A new dataset, CodiEsp-X-lead, was generated using GPT-4 to replace full-textual evidence annotations with lead term annotations. A Robustly Optimized BERT (Bidirectional Encoder Representations from Transformers) Pretraining Approach transformer model was fine-tuned for named entity recognition to extract lead terms. GPT-4 was subsequently employed to generate code descriptions from the extracted textual evidence. Using a RAG approach, ICD codes were assigned to the lead terms by querying a vector database of ICD code descriptions with OpenAI’s text-embedding-ada-002 model. Results: The fine-tuned Robustly Optimized BERT Pretraining Approach achieved an overall F1-score of 0.80 for ICD lead term extraction on the new CodiEsp-X-lead dataset. GPT-4-generated code descriptions reduced retrieval failures in the RAG approach by approximately 5% for both diagnoses and procedures. However, the overall explainability F1-score for the CodiEsp-X task was limited to 0.305, significantly lower than the state-of-the-art F1-score of 0.633. The diminished performance was partly due to the reliance on code descriptions, as some ICD codes lacked descriptions, and the approach did not fully align with the medical coder’s workflow. 
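To make the retrieval-augmented generation step of the ICD-10 record above concrete, the sketch below ranks a handful of ICD-10 code descriptions against a query describing an extracted lead term. A TF-IDF vectorizer stands in for the study's text-embedding-ada-002 embeddings so the example runs offline; the codes, descriptions, and query are illustrative only.

```python
# Sketch: retrieval-augmented ICD-10 code selection. A TF-IDF vectorizer stands
# in for the study's text-embedding-ada-002 embeddings so the example runs
# offline; the codes, descriptions, and query are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

icd_descriptions = {  # tiny illustrative subset, not the full ICD-10 index
    "J18.9": "Pneumonia, unspecified organism",
    "I21.9": "Acute myocardial infarction, unspecified",
    "E11.9": "Type 2 diabetes mellitus without complications",
}

codes = list(icd_descriptions)
vectorizer = TfidfVectorizer().fit(icd_descriptions.values())
code_matrix = vectorizer.transform(icd_descriptions.values())

def assign_code(lead_term_description, top_k=1):
    """Return the top-k ICD codes whose descriptions best match the query text."""
    query_vec = vectorizer.transform([lead_term_description])
    scores = cosine_similarity(query_vec, code_matrix)[0]
    ranked = sorted(zip(codes, scores), key=lambda pair: pair[1], reverse=True)
    return ranked[:top_k]

# A generated description of an extracted lead term might look like this (hypothetical):
print(assign_code("community acquired pneumonia, organism not identified"))
```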
Conclusions: While lead term extraction showed promising results, the subsequent RAG-based code assignment using GPT-4 and code descriptions was less effective. Future research should focus on refining the approach to more closely mimic the medical coder’s workflow, potentially integrating the alphabetic index and official coding guidelines, rather than relying solely on code descriptions. This alignment may enhance system accuracy and better support medical coders in practice. %R 10.2196/60095 %U https://formative.jmir.org/2025/1/e60095 %U https://doi.org/10.2196/60095 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63881 %T InfectA-Chat, an Arabic Large Language Model for Infectious Diseases: Comparative Analysis %A Selcuk,Yesim %A Kim,Eunhui %A Ahn,Insung %+ , Department of Data-Centric Problem Solving Research, Korea Institute of Science and Technology Information, 245 Daehak-ro, Yuseong-gu, Daejeon, 34141, Republic of Korea, 82 42 869 1053, isahn@kisti.re.kr %K large language model %K Arabic large language models %K AceGPT %K multilingual large language model %K infectious disease monitoring %K public health %D 2025 %7 10.2.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: Infectious diseases have consistently been a significant concern in public health, requiring proactive measures to safeguard societal well-being. In this regard, regular monitoring activities play a crucial role in mitigating the adverse effects of diseases on society. To monitor disease trends, various organizations, such as the World Health Organization (WHO) and the European Centre for Disease Prevention and Control (ECDC), collect diverse surveillance data and make them publicly accessible. However, these platforms primarily present surveillance data in English, which creates language barriers for non–English-speaking individuals and global public health efforts to accurately observe disease trends. This challenge is particularly noticeable in regions such as the Middle East, where specific infectious diseases, such as Middle East respiratory syndrome coronavirus (MERS-CoV), have seen a dramatic increase. For such regions, it is essential to develop tools that can overcome language barriers and reach more individuals to alleviate the negative impacts of these diseases. Objective: This study aims to address these issues; therefore, we propose InfectA-Chat, a cutting-edge large language model (LLM) specifically designed for the Arabic language but also incorporating English for question and answer (Q&A) tasks. InfectA-Chat leverages its deep understanding of the language to provide users with information on the latest trends in infectious diseases based on their queries. Methods: This comprehensive study was achieved by instruction tuning the AceGPT-7B and AceGPT-7B-Chat models on a Q&A task, using a dataset of 55,400 Arabic and English domain–specific instruction–following data. The performance of these fine-tuned models was evaluated using 2770 domain-specific Arabic and English instruction–following data, using the GPT-4 evaluation method. A comparative analysis was then performed against Arabic LLMs and state-of-the-art models, including AceGPT-13B-Chat, Jais-13B-Chat, Gemini, GPT-3.5, and GPT-4. Furthermore, to ensure the model had access to the latest information on infectious diseases by regularly updating the data without additional fine-tuning, we used the retrieval-augmented generation (RAG) method. 
Results: InfectA-Chat demonstrated good performance in answering questions about infectious diseases by the GPT-4 evaluation method. Our comparative analysis revealed that it outperforms the AceGPT-7B-Chat and InfectA-Chat (based on AceGPT-7B) models by a margin of 43.52%. It also surpassed other Arabic LLMs such as AceGPT-13B-Chat and Jais-13B-Chat by 48.61%. Among the state-of-the-art models, InfectA-Chat achieved a leading performance of 23.78%, competing closely with the GPT-4 model. Furthermore, the RAG method in InfectA-Chat significantly improved document retrieval accuracy. Notably, RAG retrieved more accurate documents based on queries when the top-k parameter value was increased. Conclusions: Our findings highlight the shortcomings of general Arabic LLMs in providing up-to-date information about infectious diseases. With this study, we aim to empower individuals and public health efforts by offering a bilingual Q&A system for infectious disease monitoring. %M 39928922 %R 10.2196/63881 %U https://medinform.jmir.org/2025/1/e63881 %U https://doi.org/10.2196/63881 %U http://www.ncbi.nlm.nih.gov/pubmed/39928922 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 12 %N %P e64396 %T The Efficacy of Conversational AI in Rectifying the Theory-of-Mind and Autonomy Biases: Comparative Analysis %A Rządeczka,Marcin %A Sterna,Anna %A Stolińska,Julia %A Kaczyńska,Paulina %A Moskalewicz,Marcin %+ Institute of Philosophy, Maria Curie-Skłodowska University, Pl. Marii Curie-Skłodowskiej 4, pok. 204, Lublin, 20-031, Poland, 48 815375481, marcin.rzadeczka@umcs.pl %K cognitive bias %K conversational artificial intelligence %K artificial intelligence %K AI %K chatbots %K digital mental health %K bias rectification %K affect recognition %D 2025 %7 7.2.2025 %9 Original Paper %J JMIR Ment Health %G English %X Background: The increasing deployment of conversational artificial intelligence (AI) in mental health interventions necessitates an evaluation of their efficacy in rectifying cognitive biases and recognizing affect in human-AI interactions. These biases are particularly relevant in mental health contexts as they can exacerbate conditions such as depression and anxiety by reinforcing maladaptive thought patterns or unrealistic expectations in human-AI interactions. Objective: This study aimed to assess the effectiveness of therapeutic chatbots (Wysa and Youper) versus general-purpose language models (GPT-3.5, GPT-4, and Gemini Pro) in identifying and rectifying cognitive biases and recognizing affect in user interactions. Methods: This study used constructed case scenarios simulating typical user-bot interactions to examine how effectively chatbots address selected cognitive biases. The cognitive biases assessed included theory-of-mind biases (anthropomorphism, overtrust, and attribution) and autonomy biases (illusion of control, fundamental attribution error, and just-world hypothesis). Each chatbot response was evaluated based on accuracy, therapeutic quality, and adherence to cognitive behavioral therapy principles using an ordinal scale to ensure consistency in scoring. To enhance reliability, responses underwent a double review process by 2 cognitive scientists, followed by a secondary review by a clinical psychologist specializing in cognitive behavioral therapy, ensuring a robust assessment across interdisciplinary perspectives. 
Results: This study revealed that general-purpose chatbots outperformed therapeutic chatbots in rectifying cognitive biases, particularly in overtrust bias, fundamental attribution error, and just-world hypothesis. GPT-4 achieved the highest scores across all biases, whereas the therapeutic bot Wysa scored the lowest. Notably, general-purpose bots showed more consistent accuracy and adaptability in recognizing and addressing bias-related cues across different contexts, suggesting a broader flexibility in handling complex cognitive patterns. In addition, in affect recognition tasks, general-purpose chatbots not only excelled but also demonstrated quicker adaptation to subtle emotional nuances, outperforming therapeutic bots in 67% (4/6) of the tested biases. Conclusions: This study shows that, while therapeutic chatbots hold promise for mental health support and cognitive bias intervention, their current capabilities are limited. Addressing cognitive biases in AI-human interactions requires systems that can both rectify and analyze biases as integral to human cognition, promoting precision and simulating empathy. The findings reveal the need for improved simulated emotional intelligence in chatbot design to provide adaptive, personalized responses that reduce overreliance and encourage independent coping skills. Future research should focus on enhancing affective response mechanisms and addressing ethical concerns such as bias mitigation and data privacy to ensure safe, effective AI-based mental health support. %M 39919295 %R 10.2196/64396 %U https://mental.jmir.org/2025/1/e64396 %U https://doi.org/10.2196/64396 %U http://www.ncbi.nlm.nih.gov/pubmed/39919295 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e59524 %T Unraveling Online Mental Health Through the Lens of Early Maladaptive Schemas: AI-Enabled Content Analysis of Online Mental Health Communities %A Ang,Beng Heng %A Gollapalli,Sujatha Das %A Du,Mingzhe %A Ng,See-Kiong %+ Integrative Sciences and Engineering Programme, NUS Graduate School, National University of Singapore, University Hall, Tan Chin Tuan Wing Level 5, #05-03 21 Lower Kent Ridge Road, Singapore, 119077, Singapore, 65 92983451, bengheng.ang@u.nus.edu %K early maladaptive schemas %K large language models %K online mental health communities %K case conceptualization %K prompt engineering %K artificial intelligence %K AI %D 2025 %7 7.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Early maladaptive schemas (EMSs) are pervasive, self-defeating patterns of thoughts and emotions underlying most mental health problems and are central in schema therapy. However, the characteristics of EMSs vary across demographics, and despite the growing use of online mental health communities (OMHCs), how EMSs manifest in these online support-seeking environments remains unclear. Understanding these characteristics could inform the design of more effective interventions powered by artificial intelligence to address online support seekers’ unique therapeutic needs. Objective: We aimed to uncover associations between EMSs and mental health problems within OMHCs and examine features of EMSs as they are reflected in OMHCs. Methods: We curated a dataset of 29,329 posts from widely accessed OMHCs, labeling each with relevant schemas and mental health problems. To identify associations, we conducted chi-square tests of independence and calculated odds ratios (ORs) with the dataset. 
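The association analysis described in the record above combines chi-square tests of independence with odds ratios. A minimal sketch of that calculation for a single schema-by-problem 2x2 table is given below; all counts are invented for illustration.

```python
# Sketch: chi-square test of independence and odds ratio with a 95% CI for one
# schema-by-problem 2x2 table. All counts are invented for illustration.
import numpy as np
from scipy.stats import chi2_contingency

#                  schema present  schema absent
table = np.array([[120, 380],    # posts labeled with the mental health problem (hypothetical)
                  [60, 940]])    # all other posts (hypothetical)

chi2, p, dof, expected = chi2_contingency(table)

a, b = table[0]
c, d = table[1]
odds_ratio = (a * d) / (b * c)
se_log_or = np.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
ci_low, ci_high = np.exp(np.log(odds_ratio) + np.array([-1.96, 1.96]) * se_log_or)

print(f"chi2={chi2:.1f}, P={p:.3g}, OR={odds_ratio:.2f} (95% CI {ci_low:.2f}-{ci_high:.2f})")
```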
In addition, we developed a novel group-level case conceptualization technique, leveraging GPT-4 to extract features of EMSs from OMHC texts across key schema therapy dimensions, such as schema triggers and coping responses. Results: Several associations were identified between EMSs and mental health problems, reflecting how EMSs manifest in online support-seeking contexts. Anxiety-related problems typically highlighted vulnerability to harm or illness (OR 5.64, 95% CI 5.34-5.96; P<.001), while depression-related problems emphasized unmet interpersonal needs, such as social isolation (OR 3.18, 95% CI 3.02-3.34; P<.001). Conversely, problems with eating disorders mostly exemplified negative self-perception and emotional inhibition (OR 1.89, 95% CI 1.45-2.46; P<.001). Personality disorders reflected themes of subjugation (OR 2.51, 95% CI 1.86-3.39; P<.001), while posttraumatic stress disorder problems involved distressing experiences and mistrust (OR 5.04, 95% CI 4.49-5.66; P<.001). Substance use disorder problems reflected negative self-perception of failure to achieve (OR 1.83, 95% CI 1.35-2.49; P<.001). Depression, personality disorders, and posttraumatic stress disorder were also associated with 12, 9, and 7 EMSs, respectively, emphasizing their complexities and the need for more comprehensive interventions. In contrast, anxiety, eating disorder, and substance use disorder were related to only 2 to 3 EMSs, suggesting that these problems are better addressed through targeted interventions. In addition, the EMS features extracted from our dataset averaged 13.27 (SD 3.05) negative features per schema, with 2.65 (SD 1.07) features per dimension, as supported by existing literature. Conclusions: We uncovered various associations between EMSs and mental health problems among online support seekers, highlighting the prominence of specific EMSs in each problem and the unique complexities of each problem in terms of EMSs. We also identified EMS features as expressed by support seekers in OMHCs, reinforcing the relevance of EMSs in these online support-seeking contexts. These insights are valuable for understanding how EMS are characterized in OMHCs and can inform the development of more effective artificial intelligence–powered tools to enhance support on these platforms. %M 39919286 %R 10.2196/59524 %U https://www.jmir.org/2025/1/e59524 %U https://doi.org/10.2196/59524 %U http://www.ncbi.nlm.nih.gov/pubmed/39919286 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63550 %T ChatGPT for Univariate Statistics: Validation of AI-Assisted Data Analysis in Healthcare Research %A Ruta,Michael R %A Gaidici,Tony %A Irwin,Chase %A Lifshitz,Jonathan %+ University of Arizona College of Medicine – Phoenix, 475 N 5th St, Phoenix, AZ, 85004, United States, 1 602 827 2002, mruta@arizona.edu %K ChatGPT %K data analysis %K statistics %K chatbot %K artificial intelligence %K biomedical research %K programmers %K bioinformatics %K data processing %D 2025 %7 7.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT, a conversational artificial intelligence developed by OpenAI, has rapidly become an invaluable tool for researchers. With the recent integration of Python code interpretation into the ChatGPT environment, there has been a significant increase in the potential utility of ChatGPT as a research tool, particularly in terms of data analysis applications. 
Objective: This study aimed to assess ChatGPT as a data analysis tool and provide researchers with a framework for applying ChatGPT to data management tasks, descriptive statistics, and inferential statistics. Methods: A subset of the National Inpatient Sample was extracted. Data analysis trials were divided into data processing, categorization, and tabulation, as well as descriptive and inferential statistics. For data processing, categorization, and tabulation assessments, ChatGPT was prompted to reclassify variables, subset variables, and present data, respectively. Descriptive statistics assessments included mean, SD, median, and IQR calculations. Inferential statistics assessments were conducted at varying levels of prompt specificity (“Basic,” “Intermediate,” and “Advanced”). Specific tests included chi-square, Pearson correlation, independent 2-sample t test, 1-way ANOVA, Fisher exact, Spearman correlation, Mann-Whitney U test, and Kruskal-Wallis H test. Outcomes from consecutive prompt-based trials were assessed against expected statistical values calculated in Python (Python Software Foundation), SAS (SAS Institute), and RStudio (Posit PBC). Results: ChatGPT accurately performed data processing, categorization, and tabulation across all trials. For descriptive statistics, it provided accurate means, SDs, medians, and IQRs across all trials. Inferential statistics accuracy against expected statistical values varied with prompt specificity: 32.5% accuracy for “Basic” prompts, 81.3% for “Intermediate” prompts, and 92.5% for “Advanced” prompts. Conclusions: ChatGPT shows promise as a tool for exploratory data analysis, particularly for researchers with some statistical knowledge and limited programming expertise. However, its application requires careful prompt construction and human oversight to ensure accuracy. As a supplementary tool, ChatGPT can enhance data analysis efficiency and broaden research accessibility. %M 39919289 %R 10.2196/63550 %U https://www.jmir.org/2025/1/e63550 %U https://doi.org/10.2196/63550 %U http://www.ncbi.nlm.nih.gov/pubmed/39919289 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e65146 %T Unveiling GPT-4V's hidden challenges behind high accuracy on USMLE questions: Observational Study %A Yang,Zhichao %A Yao,Zonghai %A Tasmin,Mahbuba %A Vashisht,Parth %A Jang,Won Seok %A Ouyang,Feiyun %A Wang,Beining %A McManus,David %A Berlowitz,Dan %A Yu,Hong %+ , Miner School of Computer & Information Sciences, University of Massachusetts Lowell, 1 University Ave, Lowell, MA, 01854, United States, 1 508 612 7292, Hong_Yu@uml.edu %K artificial intelligence %K natural language processing %K large language model %K LLM %K ChatGPT %K GPT %K GPT-4V %K USMLE %K Medical License Exam %K medical image interpretation %K United States Medical Licensing Examination %K NLP %D 2025 %7 7.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Recent advancements in artificial intelligence, such as GPT-3.5 Turbo (OpenAI) and GPT-4, have demonstrated significant potential by achieving good scores on text-only United States Medical Licensing Examination (USMLE) exams and effectively answering questions from physicians. However, the ability of these models to interpret medical images remains underexplored. Objective: This study aimed to comprehensively evaluate the performance, interpretability, and limitations of GPT-3.5 Turbo, GPT-4, and its successor, GPT-4 Vision (GPT-4V), specifically focusing on GPT-4V’s newly introduced image-understanding feature. 
By assessing the models on medical licensing examination questions that require image interpretation, we sought to highlight the strengths and weaknesses of GPT-4V in handling complex multimodal clinical information, thereby exposing hidden flaws and providing insights into its readiness for integration into clinical settings. Methods: This cross-sectional study tested GPT-4V, GPT-4, and ChatGPT-3.5 Turbo on a total of 227 multiple-choice questions with images from USMLE Step 1 (n=19), Step 2 clinical knowledge (n=14), Step 3 (n=18), the Diagnostic Radiology Qualifying Core Exam (DRQCE) (n=26), and AMBOSS question banks (n=150). AMBOSS provided expert-written hints and question difficulty levels. GPT-4V’s accuracy was compared with 2 state-of-the-art large language models, GPT-3.5 Turbo and GPT-4. The quality of the explanations was evaluated by choosing human preference between an explanation by GPT-4V (without hint), an explanation by an expert, or a tie, using 3 qualitative metrics: comprehensive explanation, question information, and image interpretation. To better understand GPT-4V’s explanation ability, we modified a patient case report to resemble a typical “curbside consultation” between physicians. Results: For questions with images, GPT-4V achieved an accuracy of 84.2%, 85.7%, 88.9%, and 73.1% in Step 1, Step 2 clinical knowledge, Step 3 of USMLE, and DRQCE, respectively. It outperformed GPT-3.5 Turbo (42.1%, 50%, 50%, 19.2%) and GPT-4 (63.2%, 64.3%, 66.7%, 26.9%). When GPT-4V answered correctly, its explanations were nearly as good as those provided by domain experts from AMBOSS. However, incorrect answers often had poor explanation quality: 18.2% (10/55) contained inaccurate text, 45.5% (25/55) had inference errors, and 76.3% (42/55) demonstrated image misunderstandings. With human expert assistance, GPT-4V reduced errors by an average of 40% (22/55). GPT-4V accuracy improved with hints, maintaining stable performance across difficulty levels, while medical student performance declined as difficulty increased. In a simulated curbside consultation scenario, GPT-4V required multiple specific prompts to interpret complex case data accurately. Conclusions: GPT-4V achieved high accuracy on multiple-choice questions with images, highlighting its potential in medical assessments. However, significant shortcomings were observed in the quality of explanations when questions were answered incorrectly, particularly in the interpretation of images, which could not be efficiently resolved through expert interaction. These findings reveal hidden flaws in the image interpretation capabilities of GPT-4V, underscoring the need for more comprehensive evaluations beyond multiple-choice questions before integrating GPT-4V into clinical settings. 
%M 39919278 %R 10.2196/65146 %U https://www.jmir.org/2025/1/e65146 %U https://doi.org/10.2196/65146 %U http://www.ncbi.nlm.nih.gov/pubmed/39919278 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e57319 %T Investigating the Classification of Living Kidney Donation Experiences on Reddit and Understanding the Sensitivity of ChatGPT to Prompt Engineering: Content Analysis %A Nielsen,Joshua %A Chen,Xiaoyu %A Davis,LaShara %A Waterman,Amy %A Gentili,Monica %+ Department of Industrial Engineering, JB Speed School of Engineering, University of Louisville, 220 Eastern Parkway, Louisville, KY, 40292, United States, 1 5024891335, joshua.nielsen@louisville.edu %K prompt engineering %K generative artificial intelligence %K kidney donation %K transplant %K living donor %D 2025 %7 7.2.2025 %9 Original Paper %J JMIR AI %G English %X Background: Living kidney donation (LKD), where individuals donate one kidney while alive, plays a critical role in increasing the number of kidneys available for those experiencing kidney failure. Previous studies show that many generous people are interested in becoming living donors; however, a huge gap exists between the number of patients on the waiting list and the number of living donors yearly. Objective: To bridge this gap, we aimed to investigate how to identify potential living donors from discussions on public social media forums so that educational interventions could later be directed to them. Methods: Using Reddit forums as an example, this study described the classification of Reddit content shared about LKD into three classes: (1) present (presently dealing with LKD personally), (2) past (dealt with LKD personally in the past), and (3) other (LKD general comments). An evaluation was conducted comparing a fine-tuned distilled version of the Bidirectional Encoder Representations from Transformers (BERT) model with inference using GPT-3.5 (ChatGPT). To systematically evaluate ChatGPT’s sensitivity to distinguishing between the 3 prompt categories, we used a comprehensive prompt engineering strategy encompassing a full factorial analysis in 48 runs. A novel prompt engineering approach, dialogue until classification consensus, was introduced to simulate a deliberation between 2 domain experts until a consensus on classification was achieved. Results: BERT and GPT-3.5 exhibited classification accuracies of approximately 75% and 78%, respectively. Recognizing the inherent ambiguity between classes, a post hoc analysis of incorrect predictions revealed sensible reasoning and acceptable errors in the predictive models. Considering these acceptable mismatched predictions, the accuracy improved to 89.3% for BERT and 90.7% for GPT-3.5. Conclusions: Large language models, such as GPT-3.5, are highly capable of detecting and categorizing LKD-targeted content on social media forums. They are sensitive to instructions, and the introduced dialogue until classification consensus method exhibited superior performance over stand-alone reasoning, highlighting the merit in advancing prompt engineering methodologies. The models can produce appropriate contextual reasoning, even when final conclusions differ from their human counterparts. %M 39918869 %R 10.2196/57319 %U https://ai.jmir.org/2025/1/e57319 %U https://doi.org/10.2196/57319 %U http://www.ncbi.nlm.nih.gov/pubmed/39918869 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 14 %N %P e63887 %T ChatGPT-4 Performance on German Continuing Medical Education—Friend or Foe (Trick or Treat)? 
Protocol for a Randomized Controlled Trial %A Burisch,Christian %A Bellary,Abhav %A Breuckmann,Frank %A Ehlers,Jan %A Thal,Serge C %A Sellmann,Timur %A Gödde,Daniel %+ State of North Rhine-Westphalia, Regional Government Düsseldorf, Leibniz-Gymnasium, Stankeitstraße 22, Essen, 45326, Germany, 49 201 79938720, christian.burisch@rub.de %K ChatGPT %K artificial intelligence %K large language model %K postgraduate education %K continuing medical education %K self-assessment program %D 2025 %7 6.2.2025 %9 Protocol %J JMIR Res Protoc %G English %X Background: The increasing development and spread of artificial and assistive intelligence is opening up new areas of application not only in applied medicine but also in related fields such as continuing medical education (CME), which is part of the mandatory training program for medical doctors in Germany. This study aimed to determine whether medical laypersons can successfully conduct training courses specifically for physicians with the help of a large language model (LLM) such as ChatGPT-4. This study aims to qualitatively and quantitatively investigate the impact of using artificial intelligence (AI; specifically ChatGPT) on the acquisition of credit points in German postgraduate medical education. Objective: Using this approach, we wanted to test further possible applications of AI in the postgraduate medical education setting and obtain results for practical use. Depending on the results, the potential influence of LLMs such as ChatGPT-4 on CME will be discussed, for example, as part of a SWOT (strengths, weaknesses, opportunities, threats) analysis. Methods: We designed a randomized controlled trial, in which adult high school students attempt to solve CME tests across six medical specialties in three study arms in total with 18 CME training courses per study arm under different interventional conditions with varying amounts of permitted use of ChatGPT-4. Sample size calculation was performed including guess probability (20% correct answers, SD=40%; confidence level of 1–α=.95/α=.05; test power of 1–β=.95; P<.05). The study was registered at open scientific framework. Results: As of October 2024, the acquisition of data and students to participate in the trial is ongoing. Upon analysis of our acquired data, we predict our findings to be ready for publication as soon as early 2025. Conclusions: We aim to prove that the advances in AI, especially LLMs such as ChatGPT-4 have considerable effects on medical laypersons’ ability to successfully pass CME tests. The implications that this holds on how the concept of continuous medical education requires reevaluation are yet to be contemplated. 
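The protocol above reports its sample size inputs (20% guess probability, alpha=.05, power=.95) but not the assumed effect size, so the sketch below only illustrates the generic two-proportion power calculation with statsmodels; the assisted-performance proportion of 0.60 is a hypothetical value chosen for the example, not a figure from the protocol.

```python
# Sketch: a generic two-proportion sample-size calculation using the protocol's
# stated inputs (20% guess probability, alpha=.05, power=.95). The assumed
# proportion correct with ChatGPT-4 assistance (0.60) is purely hypothetical;
# the protocol's own effect-size assumption is not reported in this record.
from statsmodels.stats.proportion import proportion_effectsize
from statsmodels.stats.power import NormalIndPower

p_guess = 0.20      # expected share of correct answers from guessing alone
p_assisted = 0.60   # hypothetical share of correct answers with LLM assistance

effect_size = proportion_effectsize(p_assisted, p_guess)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(effect_size=effect_size, alpha=0.05,
                                         power=0.95, alternative="two-sided")
print(f"Cohen's h = {effect_size:.2f}; required n per arm ~ {n_per_arm:.0f}")
```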
Trial Registration: OSF Registries 10.17605/OSF.IO/MZNUF; https://osf.io/mznuf International Registered Report Identifier (IRRID): PRR1-10.2196/63887 %M 39913914 %R 10.2196/63887 %U https://www.researchprotocols.org/2025/1/e63887 %U https://doi.org/10.2196/63887 %U http://www.ncbi.nlm.nih.gov/pubmed/39913914 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63626 %T Performance Evaluation of Large Language Models in Cervical Cancer Management Based on a Standardized Questionnaire: Comparative Study %A Kuerbanjiang,Warisijiang %A Peng,Shengzhe %A Jiamaliding,Yiershatijiang %A Yi,Yuexiong %+ , Department of Gynecology, Zhongnan Hospital of Wuhan University, 169 Donghu Road, Wuhan, Hubei Province, 430071, China, 86 15671669885, yiyuexiong@163.com %K large language model %K cervical cancer %K screening %K artificial intelligence %K model interpretability %D 2025 %7 5.2.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Cervical cancer remains the fourth leading cause of death among women globally, with a particularly severe burden in low-resource settings. A comprehensive approach—from screening to diagnosis and treatment—is essential for effective prevention and management. Large language models (LLMs) have emerged as potential tools to support health care, though their specific role in cervical cancer management remains underexplored. Objective: This study aims to systematically evaluate the performance and interpretability of LLMs in cervical cancer management. Methods: Models were selected from the AlpacaEval leaderboard version 2.0 and based on the capabilities of our computer. The questions inputted into the models cover aspects of general knowledge, screening, diagnosis, and treatment, according to guidelines. The prompt was developed using the Context, Objective, Style, Tone, Audience, and Response (CO-STAR) framework. Responses were evaluated for accuracy, guideline compliance, clarity, and practicality, graded as A, B, C, and D with corresponding scores of 3, 2, 1, and 0. The effective rate was calculated as the ratio of A and B responses to the total number of designed questions. Local Interpretable Model-Agnostic Explanations (LIME) was used to explain and enhance physicians’ trust in model outputs within the medical context. Results: Nine models were included in this study, and a set of 100 standardized questions covering general information, screening, diagnosis, and treatment was designed based on international and national guidelines. Seven models (ChatGPT-4.0 Turbo, Claude 2, Gemini Pro, Mistral-7B-v0.2, Starling-LM-7B alpha, HuatuoGPT, and BioMedLM 2.7B) provided stable responses. Among all the models included, ChatGPT-4.0 Turbo ranked first with a mean score of 2.67 (95% CI 2.54-2.80; effective rate 94.00%) with a prompt and 2.52 (95% CI 2.37-2.67; effective rate 87.00%) without a prompt, outperforming the other 8 models (P<.001). Regardless of prompts, QiZhenGPT consistently ranked among the lowest-performing models, with P<.01 in comparisons against all models except BioMedLM. Interpretability analysis showed that prompts improved alignment with human annotations for proprietary models (median intersection over union 0.43), while medical-specialized models exhibited limited improvement. Conclusions: Proprietary LLMs, particularly ChatGPT-4.0 Turbo and Claude 2, show promise in clinical decision-making involving logical analysis. The use of prompts can enhance the accuracy of some models in cervical cancer management to varying degrees. 
Medical-specialized models, such as HuatuoGPT and BioMedLM, did not perform as well as expected in this study. By contrast, proprietary models, particularly those augmented with prompts, demonstrated notable accuracy and interpretability in medical tasks, such as cervical cancer management. However, this study underscores the need for further research to explore the practical application of LLMs in medical practice. %M 39908540 %R 10.2196/63626 %U https://www.jmir.org/2025/1/e63626 %U https://doi.org/10.2196/63626 %U http://www.ncbi.nlm.nih.gov/pubmed/39908540 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e56126 %T Proficiency, Clarity, and Objectivity of Large Language Models Versus Specialists’ Knowledge on COVID-19's Impacts in Pregnancy: Cross-Sectional Pilot Study %A Bragazzi,Nicola Luigi %A Buchinger,Michèle %A Atwan,Hisham %A Tuma,Ruba %A Chirico,Francesco %A Szarpak,Lukasz %A Farah,Raymond %A Khamisy-Farah,Rola %+ Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, ON, M3J 1P3, Canada, 1 416 736 2100, robertobragazzi@gmail.com %K COVID-19 %K vaccine %K reproductive health %K generative artificial intelligence %K large language model %K chatGPT %K google bard %K microsoft copilot %K vaccination %K natural language processing %K obstetric %K gynecology %K women %K text mining %K sentiment %K accuracy %K zero shot %K pregnancy %K readability %K infectious %D 2025 %7 5.2.2025 %9 Original Paper %J JMIR Form Res %G English %X Background: The COVID-19 pandemic has significantly strained health care systems globally, leading to an overwhelming influx of patients and exacerbating resource limitations. Concurrently, an “infodemic” of misinformation, particularly prevalent in women’s health, has emerged. This challenge has been pivotal for health care providers, especially gynecologists and obstetricians, in managing pregnant women’s health. The pandemic heightened risks for pregnant women from COVID-19, necessitating balanced advice from specialists on vaccine safety versus known risks. In addition, the advent of generative artificial intelligence (AI), such as large language models (LLMs), offers promising support in health care. However, they necessitate rigorous testing. Objective: This study aimed to assess LLMs’ proficiency, clarity, and objectivity regarding COVID-19’s impacts on pregnancy. Methods: This study evaluates 4 major AI prototypes (ChatGPT-3.5, ChatGPT-4, Microsoft Copilot, and Google Bard) using zero-shot prompts in a questionnaire validated among 159 Israeli gynecologists and obstetricians. The questionnaire assesses proficiency in providing accurate information on COVID-19 in relation to pregnancy. Text-mining, sentiment analysis, and readability (Flesch-Kincaid grade level and Flesch Reading Ease Score) were also conducted. Results: In terms of LLMs’ knowledge, ChatGPT-4 and Microsoft Copilot each scored 97% (32/33), Google Bard 94% (31/33), and ChatGPT-3.5 82% (27/33). ChatGPT-4 incorrectly stated an increased risk of miscarriage due to COVID-19. Google Bard and Microsoft Copilot had minor inaccuracies concerning COVID-19 transmission and complications. In the sentiment analysis, Microsoft Copilot achieved the least negative score (–4), followed by ChatGPT-4 (–6) and Google Bard (–7), while ChatGPT-3.5 obtained the most negative score (–12). 
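The readability metrics named in the Methods of the COVID-19-and-pregnancy record above (Flesch-Kincaid Grade Level and Flesch Reading Ease) can be computed with standard tooling. A minimal sketch using the `textstat` package is shown below, applied to an invented model response rather than study output; the study's exact readability implementation is not specified in this record.

```python
# Sketch: computing the two readability metrics reported in the study with the
# `textstat` package (one common choice; the study's exact tooling is not given
# in this record). The sample answer is invented, not model output.
import textstat

answer = ("COVID-19 infection during pregnancy can increase the risk of severe "
          "illness, so vaccination and preventive measures are recommended after "
          "discussing individual risks with an obstetric care provider.")

print("Flesch-Kincaid Grade Level:", textstat.flesch_kincaid_grade(answer))
print("Flesch Reading Ease:", textstat.flesch_reading_ease(answer))
```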
Finally, concerning the readability analysis, Flesch-Kincaid Grade Level and Flesch Reading Ease Score showed that Microsoft Copilot was the most accessible at 9.9 and 49, followed by ChatGPT-4 at 12.4 and 37.1, while ChatGPT-3.5 (12.9 and 35.6) and Google Bard (12.9 and 35.8) generated particularly complex responses. Conclusions: The study highlights varying knowledge levels of LLMs in relation to COVID-19 and pregnancy. ChatGPT-3.5 showed the least knowledge and alignment with scientific evidence. Readability and complexity analyses suggest that each AI’s approach was tailored to specific audiences, with ChatGPT versions being more suitable for specialized readers and Microsoft Copilot for the general public. Sentiment analysis revealed notable variations in the way LLMs communicated critical information, underscoring the essential role of neutral and objective health care communication in ensuring that pregnant women, particularly vulnerable during the COVID-19 pandemic, receive accurate and reassuring guidance. Overall, ChatGPT-4, Microsoft Copilot, and Google Bard generally provided accurate, updated information on COVID-19 and vaccines in maternal and fetal health, aligning with health guidelines. The study demonstrated the potential role of AI in supplementing health care knowledge, with a need for continuous updating and verification of AI knowledge bases. The choice of AI tool should consider the target audience and required information detail level. %M 39794312 %R 10.2196/56126 %U https://formative.jmir.org/2025/1/e56126 %U https://doi.org/10.2196/56126 %U http://www.ncbi.nlm.nih.gov/pubmed/39794312 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66896 %T Assessing the Adherence of ChatGPT Chatbots to Public Health Guidelines for Smoking Cessation: Content Analysis %A Abroms,Lorien C %A Yousefi,Artin %A Wysota,Christina N %A Wu,Tien-Chin %A Broniatowski,David A %+ Department of Prevention & Community Health, Milken Institute School of Public Health, George Washington University, 950 New Hampshire Avenue NW, Washington, DC, 20052, United States, 1 202 9943518, lorien@gwu.edu %K ChatGPT %K large language models %K chatbots %K tobacco %K smoking cessation %K cigarettes %K artificial intelligence %D 2025 %7 30.1.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language model (LLM) artificial intelligence chatbots using generative language can offer smoking cessation information and advice. However, little is known about the reliability of the information provided to users. Objective: This study aims to examine whether 3 ChatGPT chatbots—the World Health Organization’s Sarah, BeFreeGPT, and BasicGPT—provide reliable information on how to quit smoking. Methods: A list of quit smoking queries was generated from frequent quit smoking searches on Google related to “how to quit smoking” (n=12). Each query was given to each chatbot, and responses were analyzed for their adherence to an index developed from the US Preventive Services Task Force public health guidelines for quitting smoking and counseling principles. Responses were independently coded by 2 reviewers, and differences were resolved by a third coder. Results: Across chatbots and queries, on average, chatbot responses were rated as being adherent to 57.1% of the items on the adherence index. Sarah’s adherence (72.2%) was significantly higher than BeFreeGPT (50%) and BasicGPT (47.8%; P<.001). 
The majority of chatbot responses had clear language (97.3%) and included a recommendation to seek out professional counseling (80.3%). About half of the responses included the recommendation to consider using nicotine replacement therapy (52.7%), the recommendation to seek out social support from friends and family (55.6%), and information on how to deal with cravings when quitting smoking (44.4%). The least common was information about considering the use of non–nicotine replacement therapy prescription drugs (14.1%). Finally, some types of misinformation were present in 22% of responses. Specific queries that were most challenging for the chatbots included queries on “how to quit smoking cold turkey,” “...with vapes,” “...with gummies,” “...with a necklace,” and “...with hypnosis.” All chatbots showed resilience to adversarial attacks that were intended to derail the conversation. Conclusions: LLM chatbots varied in their adherence to quit-smoking guidelines and counseling principles. While chatbots reliably provided some types of information, they omitted other types, as well as occasionally provided misinformation, especially for queries about less evidence-based methods of quitting. LLM chatbot instructions can be revised to compensate for these weaknesses. %M 39883917 %R 10.2196/66896 %U https://www.jmir.org/2025/1/e66896 %U https://doi.org/10.2196/66896 %U http://www.ncbi.nlm.nih.gov/pubmed/39883917 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e63065 %T Assessing Familiarity, Usage Patterns, and Attitudes of Medical Students Toward ChatGPT and Other Chat-Based AI Apps in Medical Education: Cross-Sectional Questionnaire Study %A Elhassan,Safia Elwaleed %A Sajid,Muhammad Raihan %A Syed,Amina Mariam %A Fathima,Sidrah Afreen %A Khan,Bushra Shehroz %A Tamim,Hala %K ChatGPT %K artificial intelligence %K large language model %K medical students %K ethics %K chat-based %K AI apps %K medical education %K social media %K attitude %K AI %D 2025 %7 30.1.2025 %9 %J JMIR Med Educ %G English %X Background: There has been a rise in the popularity of ChatGPT and other chat-based artificial intelligence (AI) apps in medical education. Despite data being available from other parts of the world, there is a significant lack of information on this topic in medical education and research, particularly in Saudi Arabia. Objective: The primary objective of the study was to examine the familiarity, usage patterns, and attitudes of Alfaisal University medical students toward ChatGPT and other chat-based AI apps in medical education. Methods: This was a cross-sectional study conducted from October 8, 2023, through November 22, 2023. A questionnaire was distributed through social media channels to medical students at Alfaisal University who were 18 years or older. Current Alfaisal University medical students in years 1 through 6, of both genders, were exclusively targeted by the questionnaire. The study was approved by Alfaisal University Institutional Review Board. A χ2 test was conducted to assess the relationships between gender, year of study, familiarity, and reasons for usage. Results: A total of 293 responses were received, of which 95 (32.4%) were from men and 198 (67.6%) were from women. There were 236 (80.5%) responses from preclinical students and 57 (19.5%) from clinical students, respectively. Overall, males (n=93, 97.9%) showed more familiarity with ChatGPT compared to females (n=180, 90.09%; P=.03). 
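The χ2 analysis reported above can be sketched from the stated counts, assuming the denominators are the 95 male and 198 female respondents; note that whether a continuity correction is applied changes the P value, and the uncorrected test gives roughly the reported P=.03. A minimal sketch with SciPy:

```python
from scipy.stats import chi2_contingency

# 2x2 table of familiarity with ChatGPT by gender, from the reported counts:
#            familiar  not familiar
# male          93          2
# female       180         18
table = [[93, 2], [180, 18]]

# correction=False disables the Yates continuity correction; with it enabled
# (SciPy's default for 2x2 tables) the P value is somewhat larger.
chi2, p, dof, expected = chi2_contingency(table, correction=False)
print(f"chi2={chi2:.2f}, dof={dof}, P={p:.3f}")  # P is approximately .03
```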
Additionally, males also used Google Bard and Microsoft Bing ChatGPT more than females (P<.001). Clinical-year students used ChatGPT significantly more for general writing purposes compared to preclinical students (P=.005). Additionally, 136 (46.4%) students believed that using ChatGPT and other chat-based AI apps for coursework was ethical, 86 (29.4%) were neutral, and 71 (24.2%) considered it unethical (all Ps>.05). Conclusions: Familiarity with and usage of ChatGPT and other chat-based AI apps were common among the students of Alfaisal University. The usage patterns of these apps differ between males and females and between preclinical and clinical-year students. %R 10.2196/63065 %U https://mededu.jmir.org/2025/1/e63065 %U https://doi.org/10.2196/63065 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e54601 %T Using Large Language Models to Detect and Understand Drug Discontinuation Events in Web-Based Forums: Development and Validation Study %A Trevena,William %A Zhong,Xiang %A Alvarado,Michelle %A Semenov,Alexander %A Oktay,Alp %A Devlin,Devin %A Gohil,Aarya Yogesh %A Chittimouju,Sai Harsha %+ Department of Industrial and Systems Engineering, The University of Florida, PO BOX 115002, GAINESVILLE, FL, 32611-5002, United States, 1 3523922477, xiang.zhong@ise.ufl.edu %K natural language processing %K large language models %K ChatGPT %K drug discontinuation events %K zero-shot classification %K artificial intelligence %K AI %D 2025 %7 30.1.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: The implementation of large language models (LLMs), such as BART (Bidirectional and Auto-Regressive Transformers) and GPT-4, has revolutionized the extraction of insights from unstructured text. These advancements have expanded into health care, allowing analysis of social media for public health insights. However, the detection of drug discontinuation events (DDEs) remains underexplored. Identifying DDEs is crucial for understanding medication adherence and patient outcomes. Objective: The aim of this study is to provide a flexible framework for investigating various clinical research questions in data-sparse environments. We provide an example of the utility of this framework by identifying DDEs and their root causes in an open-source web-based forum, MedHelp, and by releasing the first open-source DDE datasets to aid further research in this domain. Methods: We used several LLMs, including GPT-4 Turbo, GPT-4o, DeBERTa (Decoding-Enhanced Bidirectional Encoder Representations from Transformer with Disentangled Attention), and BART, among others, to detect and determine the root causes of DDEs in user comments posted on MedHelp. Our study design included the use of zero-shot classification, which allows these models to make predictions without task-specific training. We split user comments into sentences and applied different classification strategies to assess the performance of these models in identifying DDEs and their root causes. Results: Among the selected models, GPT-4o performed the best at determining the root causes of DDEs, predicting only 12.9% of root causes incorrectly (hamming loss). Among the open-source models tested, BART demonstrated the best performance in detecting DDEs, achieving an F1-score of 0.86, a false positive rate of 2.8%, and a false negative rate of 6.5%, all without any fine-tuning. The dataset included 10.7% (107/1000) DDEs, emphasizing the models’ robustness in an imbalanced data context. 
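Zero-shot classification of the kind described above is commonly run through the Hugging Face transformers pipeline. The sketch below uses the facebook/bart-large-mnli checkpoint and illustrative label names; both are assumptions rather than the study's exact configuration:

```python
from transformers import pipeline

# NLI-based zero-shot classifier; no task-specific fine-tuning is required.
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

sentence = "I stopped taking the medication last week because of the side effects."
candidate_labels = [
    "drug discontinuation event",
    "no drug discontinuation event",
]

result = classifier(sentence, candidate_labels)
# The pipeline returns labels sorted by score; the top label is the prediction.
print(result["labels"][0], round(result["scores"][0], 3))
```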
Conclusions: This study demonstrated the effectiveness of open- and closed-source LLMs, such as GPT-4o and BART, for detecting DDEs and their root causes from publicly accessible data through zero-shot classification. The robust and scalable framework we propose can aid researchers in addressing data-sparse clinical research questions. The launch of open-access DDE datasets has the potential to stimulate further research and novel discoveries in this field. %M 39883487 %R 10.2196/54601 %U https://www.jmir.org/2025/1/e54601 %U https://doi.org/10.2196/54601 %U http://www.ncbi.nlm.nih.gov/pubmed/39883487 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 14 %N %P e62865 %T Exploring the Credibility of Large Language Models for Mental Health Support: Protocol for a Scoping Review %A Gautam,Dipak %A Kellmeyer,Philipp %+ Data and Web Science Group, School of Business Informatics and Mathematics, University of Mannheim, B6, 26, Mannheim, D-68159, Germany, 49 621181 ext 2422, philipp.kellmeyer@uni-mannheim.de %K large language model %K LLM %K mental health %K explainability %K credibility %K mobile phone %D 2025 %7 29.1.2025 %9 Protocol %J JMIR Res Protoc %G English %X Background: The rapid evolution of large language models (LLMs), such as Bidirectional Encoder Representations from Transformers (BERT; Google) and GPT (OpenAI), has introduced significant advancements in natural language processing. These models are increasingly integrated into various applications, including mental health support. However, the credibility of LLMs in providing reliable and explainable mental health information and support remains underexplored. Objective: This scoping review systematically maps the factors influencing the credibility of LLMs in mental health support, including reliability, explainability, and ethical considerations. The review is expected to offer critical insights for practitioners, researchers, and policy makers, guiding future research and policy development. These findings will contribute to the responsible integration of LLMs into mental health care, with a focus on maintaining ethical standards and user trust. Methods: This review follows PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines and the Joanna Briggs Institute (JBI) methodology. Eligibility criteria include studies that apply transformer-based generative language models in mental health support, such as BERT and GPT. Sources include PsycINFO, MEDLINE via PubMed, Web of Science, IEEE Xplore, and ACM Digital Library. A systematic search of studies from 2019 onward will be conducted and updated until October 2024. Data will be synthesized qualitatively. The Population, Concept, and Context framework will guide the inclusion criteria. Two independent reviewers will screen and extract data, resolving discrepancies through discussion. Data will be synthesized and presented descriptively. Results: As of September 2024, this study is currently in progress, with the systematic search completed and the screening phase ongoing. We expect to complete data extraction by early November 2024 and synthesis by late November 2024. Conclusions: This scoping review will map the current evidence on the credibility of LLMs in mental health support. It will identify factors influencing the reliability, explainability, and ethical considerations of these models, providing insights for practitioners, researchers, policy makers, and users. 
These findings will fill a critical gap in the literature and inform future research, practice, and policy development, ensuring the responsible integration of LLMs in mental health services. International Registered Report Identifier (IRRID): DERR1-10.2196/62865 %M 39879615 %R 10.2196/62865 %U https://www.researchprotocols.org/2025/1/e62865 %U https://doi.org/10.2196/62865 %U http://www.ncbi.nlm.nih.gov/pubmed/39879615 %0 Journal Article %@ 2369-2960 %I JMIR Publications %V 11 %N %P e63809 %T An Explainable Artificial Intelligence Text Classifier for Suicidality Prediction in Youth Crisis Text Line Users: Development and Validation Study %A Thomas,Julia %A Lucht,Antonia %A Segler,Jacob %A Wundrack,Richard %A Miché,Marcel %A Lieb,Roselind %A Kuchinke,Lars %A Meinlschmidt,Gunther %+ Division of Clinical Psychology and Epidemiology, Faculty of Psychology, University of Basel, Missionsstrasse 60/62, Basel, 4055, Switzerland, 49 30 57714627, julia.thomas@krisenchat.de %K deep learning %K explainable artificial intelligence (XAI) %K large language model (LLM) %K machine learning %K neural network %K prevention %K risk monitoring %K suicide %K transformer model %K suicidality %K suicidal ideation %K self-murder %K self-harm %K youth %K adolescent %K adolescents %K public health %K language model %K language models %K chat protocols %K crisis helpline %K help-seeking behaviors %K German %K Shapley %K decision-making %K mental health %K health informatics %K mobile phone %D 2025 %7 29.1.2025 %9 Original Paper %J JMIR Public Health Surveill %G English %X Background: Suicide represents a critical public health concern, and machine learning (ML) models offer the potential for identifying at-risk individuals. Recent studies using benchmark datasets and real-world social media data have demonstrated the capability of pretrained large language models in predicting suicidal ideation and behaviors (SIB) in speech and text. Objective: This study aimed to (1) develop and implement ML methods for predicting SIBs in a real-world crisis helpline dataset, using transformer-based pretrained models as a foundation; (2) evaluate, cross-validate, and benchmark the model against traditional text classification approaches; and (3) train an explainable model to highlight relevant risk-associated features. Methods: We analyzed chat protocols from adolescents and young adults (aged 14-25 years) seeking assistance from a German crisis helpline. An ML model was developed using a transformer-based language model architecture with pretrained weights and long short-term memory layers. The model predicted suicidal ideation (SI) and advanced suicidal engagement (ASE), as indicated by composite Columbia-Suicide Severity Rating Scale scores. We compared model performance against a classical word-vector-based ML model. We subsequently computed discrimination, calibration, clinical utility, and explainability information using a Shapley Additive Explanations value-based post hoc estimation model. Results: The dataset comprised 1348 help-seeking encounters (1011 for training and 337 for testing). The transformer-based classifier achieved a macroaveraged area under the curve (AUC) receiver operating characteristic (ROC) of 0.89 (95% CI 0.81-0.91) and an overall accuracy of 0.79 (95% CI 0.73-0.99). This performance surpassed the word-vector-based baseline model (AUC-ROC=0.77, 95% CI 0.64-0.90; accuracy=0.61, 95% CI 0.61-0.80). 
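Discrimination and calibration metrics of the kind reported here (AUC-ROC, and the Brier Skill Score mentioned below, conventionally defined as 1 minus the ratio of the model's Brier score to the reference model's) can be computed with scikit-learn. A minimal sketch on hypothetical predicted probabilities, not the study's data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Hypothetical binary labels and predicted probabilities for illustration only.
y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1, 1, 0])
p_transformer = np.array([0.10, 0.20, 0.85, 0.70, 0.30, 0.90, 0.15, 0.60, 0.80, 0.25])
p_baseline = np.array([0.40, 0.35, 0.60, 0.55, 0.45, 0.65, 0.50, 0.50, 0.55, 0.45])

auc_transformer = roc_auc_score(y_true, p_transformer)
auc_baseline = roc_auc_score(y_true, p_baseline)

# Brier Skill Score of the transformer model relative to the baseline model.
bs_model = brier_score_loss(y_true, p_transformer)
bs_ref = brier_score_loss(y_true, p_baseline)
bss = 1 - bs_model / bs_ref

print(f"AUC transformer={auc_transformer:.2f}, AUC baseline={auc_baseline:.2f}, BSS={bss:.2f}")
```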
The transformer model demonstrated excellent prediction for nonsuicidal sessions (AUC-ROC=0.96, 95% CI 0.96-0.99) and good prediction for SI and ASE, with AUC-ROCs of 0.85 (95% CI 0.97-0.86) and 0.87 (95% CI 0.81-0.88), respectively. The Brier Skill Score indicated a 44% improvement in classification performance over the baseline model. The Shapley Additive Explanations model identified language features predictive of SIBs, including self-reference, negation, expressions of low self-esteem, and absolutist language. Conclusions: Neural networks using large language model–based transfer learning can accurately identify SI and ASE. The post hoc explainer model revealed language features associated with SI and ASE. Such models may potentially support clinical decision-making in suicide prevention services. Future research should explore multimodal input features and temporal aspects of suicide risk. %M 39879608 %R 10.2196/63809 %U https://publichealth.jmir.org/2025/1/e63809 %U https://doi.org/10.2196/63809 %U http://www.ncbi.nlm.nih.gov/pubmed/39879608 %0 Journal Article %@ 1929-073X %I JMIR Publications %V 14 %N %P e59823 %T The Clinicians’ Guide to Large Language Models: A General Perspective With a Focus on Hallucinations %A Roustan,Dimitri %A Bastardot,François %+ Emergency Medicine Department, Cliniques Universitaires Saint-Luc, Avenue Hippocrate 10, Brussels, 1200, Belgium, 32 477063174, dim.roustan@gmail.com %K medical informatics %K large language model %K clinical informatics %K decision-making %K computer assisted %K decision support techniques %K decision support %K decision %K AI %K artificial intelligence %K artificial intelligence tool %K LLM %K electronic data system %K hallucinations %K false information %K technical framework %D 2025 %7 28.1.2025 %9 Viewpoint %J Interact J Med Res %G English %X Large language models (LLMs) are artificial intelligence tools that have the prospect of profoundly changing how we practice all aspects of medicine. Considering the incredible potential of LLMs in medicine and the interest of many health care stakeholders for implementation into routine practice, it is therefore essential that clinicians be aware of the basic risks associated with the use of these models. Namely, a significant risk associated with the use of LLMs is their potential to create hallucinations. Hallucinations (false information) generated by LLMs arise from a multitude of causes, including both factors related to the training dataset as well as their auto-regressive nature. The implications for clinical practice range from the generation of inaccurate diagnostic and therapeutic information to the reinforcement of flawed diagnostic reasoning pathways, as well as a lack of reliability if not used properly. To reduce this risk, we developed a general technical framework for approaching LLMs in general clinical practice, as well as for implementation on a larger institutional scale. 
%M 39874574 %R 10.2196/59823 %U https://www.i-jmr.org/2025/1/e59823 %U https://doi.org/10.2196/59823 %U http://www.ncbi.nlm.nih.gov/pubmed/39874574 %0 Journal Article %@ 2369-1999 %I JMIR Publications %V 11 %N %P e57275 %T Large Language Model Approach for Zero-Shot Information Extraction and Clustering of Japanese Radiology Reports: Algorithm Development and Validation %A Yamagishi,Yosuke %A Nakamura,Yuta %A Hanaoka,Shouhei %A Abe,Osamu %K radiology reports %K clustering %K large language model %K natural language processing %K information extraction %K lung cancer %K machine learning %D 2025 %7 23.1.2025 %9 %J JMIR Cancer %G English %X Background: The application of natural language processing in medicine has increased significantly, including tasks such as information extraction and classification. Natural language processing plays a crucial role in structuring free-form radiology reports, facilitating the interpretation of textual content, and enhancing data utility through clustering techniques. Clustering allows for the identification of similar lesions and disease patterns across a broad dataset, making it useful for aggregating information and discovering new insights in medical imaging. However, most publicly available medical datasets are in English, with limited resources in other languages. This scarcity poses a challenge for the development of models geared toward non-English downstream tasks. Objective: This study aimed to develop and evaluate an algorithm that uses large language models (LLMs) to extract information from Japanese lung cancer radiology reports and perform clustering analysis. The effectiveness of this approach was assessed and compared with previous supervised methods. Methods: This study employed the MedTxt-RR dataset, comprising 135 Japanese radiology reports from 9 radiologists who interpreted the computed tomography images of 15 lung cancer patients obtained from Radiopaedia. Previously used in the NTCIR-16 (NII Testbeds and Community for Information Access Research) shared task for clustering performance competition, this dataset was ideal for comparing the clustering ability of our algorithm with those of previous methods. The dataset was split into 8 cases for development and 7 for testing. The study’s approach involved using the LLM to extract information pertinent to lung cancer findings and transforming it into numeric features for clustering, using the K-means method. Performance was evaluated using 135 reports for information extraction accuracy and 63 test reports for clustering performance. This study focused on the accuracy of automated systems for extracting tumor size, location, and laterality from clinical reports. The clustering performance was evaluated using normalized mutual information, adjusted mutual information, and the Fowlkes-Mallows index for both the development and test data. Results: The tumor size was accurately identified in 99 out of 135 reports (73.3%), with errors in 36 reports (26.7%), primarily due to missing or incorrect size information. Tumor location and laterality were identified with greater accuracy in 112 out of 135 reports (83%); however, 23 reports (17%) contained errors mainly due to empty values or incorrect data. Clustering performance on the test data yielded a normalized mutual information of 0.6414, adjusted mutual information of 0.5598, and a Fowlkes-Mallows index of 0.5354. The proposed method demonstrated superior performance across all evaluation metrics compared to previous methods. 
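The clustering evaluation described above relies on three standard external metrics, all available in scikit-learn. A minimal sketch, assuming the LLM-extracted findings have already been converted into a numeric feature matrix (toy values below):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    normalized_mutual_info_score,
    adjusted_mutual_info_score,
    fowlkes_mallows_score,
)

# Toy feature matrix: one row per report, with columns such as encoded tumor size,
# location, and laterality (the study derived such features from LLM output).
X = np.array([
    [3.0, 1, 0], [2.8, 1, 0], [1.2, 0, 1],
    [1.0, 0, 1], [5.5, 2, 0], [5.1, 2, 0],
])
# Reference labels: which patient (case) each report describes.
y_true = [0, 0, 1, 1, 2, 2]

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

print("NMI:", normalized_mutual_info_score(y_true, labels))
print("AMI:", adjusted_mutual_info_score(y_true, labels))
print("FMI:", fowlkes_mallows_score(y_true, labels))
```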
Conclusions: The unsupervised LLM approach surpassed the existing supervised methods in clustering Japanese radiology reports. These findings suggest that LLMs hold promise for extracting information from radiology reports and integrating it into disease-specific knowledge structures. %R 10.2196/57275 %U https://cancer.jmir.org/2025/1/e57275 %U https://doi.org/10.2196/57275 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e63126 %T Applications of Large Language Models in the Field of Suicide Prevention: Scoping Review %A Holmes,Glenn %A Tang,Biya %A Gupta,Sunil %A Venkatesh,Svetha %A Christensen,Helen %A Whitton,Alexis %+ Black Dog Institute, University of New South Wales, Sydney, Hospital Road, Randwick, 2031, Australia, 61 290659046, a.whitton@unsw.edu.au %K suicide %K suicide prevention %K large language model %K self-harm %K artificial intelligence %K AI %K PRISMA %D 2025 %7 23.1.2025 %9 Review %J J Med Internet Res %G English %X Background: Prevention of suicide is a global health priority. Approximately 800,000 individuals die by suicide yearly, and for every suicide death, there are another 20 estimated suicide attempts. Large language models (LLMs) hold the potential to enhance scalable, accessible, and affordable digital services for suicide prevention and self-harm interventions. However, their use also raises clinical and ethical questions that require careful consideration. Objective: This scoping review aims to identify emergent trends in LLM applications in the field of suicide prevention and self-harm research. In addition, it summarizes key clinical and ethical considerations relevant to this nascent area of research. Methods: Searches were conducted in 4 databases (PsycINFO, Embase, PubMed, and IEEE Xplore) in February 2024. Eligible studies described the application of LLMs for suicide or self-harm prevention, detection, or management. English-language peer-reviewed articles and conference proceedings were included, without date restrictions. Narrative synthesis was used to synthesize study characteristics, objectives, models, data sources, proposed clinical applications, and ethical considerations. This review adhered to the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) standards. Results: Of the 533 studies identified, 36 (6.8%) met the inclusion criteria. An additional 7 studies were identified through citation chaining, resulting in 43 studies for review. The studies showed a bifurcation of publication fields, with varying publication norms between computer science and mental health. While most of the studies (33/43, 77%) focused on identifying suicide risk, newer applications leveraging generative functions (eg, support, education, and training) are emerging. Social media was the most common source of LLM training data. Bidirectional Encoder Representations from Transformers (BERT) was the predominant model used, although generative pretrained transformers (GPTs) featured prominently in generative applications. Clinical LLM applications were reported in 60% (26/43) of the studies, often for suicide risk detection or as clinical assistance tools. Ethical considerations were reported in 33% (14/43) of the studies, with privacy, confidentiality, and consent strongly represented. Conclusions: This evolving research area, bridging computer science and mental health, demands a multidisciplinary approach. 
While open access models and datasets will likely shape the field of suicide prevention, documenting their limitations and potential biases is crucial. High-quality training data are essential for refining these models and mitigating unwanted biases. Policies that address ethical concerns—particularly those related to privacy and security when using social media data—are imperative. Limitations include high variability across disciplines in how LLMs and study methodology are reported. The emergence of generative artificial intelligence signals a shift in approach, particularly in applications related to care, support, and education, such as improved crisis care and gatekeeper training methods, clinician copilot models, and improved educational practices. Ongoing human oversight—through human-in-the-loop testing or expert external validation—is essential for responsible development and use. Trial Registration: OSF Registries osf.io/nckq7; https://osf.io/nckq7 %M 39847414 %R 10.2196/63126 %U https://www.jmir.org/2025/1/e63126 %U https://doi.org/10.2196/63126 %U http://www.ncbi.nlm.nih.gov/pubmed/39847414 %0 Journal Article %@ 2562-7600 %I JMIR Publications %V 8 %N %P e67197 %T Impact of Attached File Formats on the Performance of ChatGPT-4 on the Japanese National Nursing Examination: Evaluation Study %A Taira,Kazuya %A Itaya,Takahiro %A Yada,Shuntaro %A Hiyama,Kirara %A Hanada,Ayame %K nursing examination %K machine learning %K ML %K artificial intelligence %K AI %K large language models %K ChatGPT %K generative AI %D 2025 %7 22.1.2025 %9 %J JMIR Nursing %G English %X Abstract: This research letter discusses the impact of different file formats on ChatGPT-4’s performance on the Japanese National Nursing Examination, highlighting the need for standardized reporting protocols to enhance the integration of artificial intelligence in nursing education and practice. %R 10.2196/67197 %U https://nursing.jmir.org/2025/1/e67197 %U https://doi.org/10.2196/67197 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e68198 %T AI Can Be a Powerful Social Innovation for Public Health if Community Engagement Is at the Core %A Bazzano,Alessandra N %A Mantsios,Andrea %A Mattei,Nicholas %A Kosorok,Michael R %A Culotta,Aron %+ Department of Maternal and Child Health, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, 135 Dauer Drive, CB #7445, Chapel Hill, NC, 27599-7445, United States, 1 919 966 9306, abazzano@tulane.edu %K Artificial Intelligence %K Generative Artificial Intelligence %K Citizen Science %K Community Participation %K Innovation Diffusion %D 2025 %7 22.1.2025 %9 Viewpoint %J J Med Internet Res %G English %X There is a critical need for community engagement in the process of adopting artificial intelligence (AI) technologies in public health. Public health practitioners and researchers have historically innovated in areas like vaccination and sanitation but have been slower in adopting emerging technologies such as generative AI. However, with increasingly complex funding, programming, and research requirements, the field now faces a pivotal moment to enhance its agility and responsiveness to evolving health challenges. Participatory methods and community engagement are key components of many current public health programs and research. The field of public health is well positioned to ensure community engagement is part of AI technologies applied to population health issues. 
Without such engagement, the adoption of these technologies in public health may exclude significant portions of the population, particularly those with the fewest resources, with the potential to exacerbate health inequities. Risks to privacy and perpetuation of bias are more likely to be avoided if AI technologies in public health are designed with knowledge of community engagement, existing health disparities, and strategies for improving equity. This viewpoint proposes a multifaceted approach to ensure safer and more effective integration of AI in public health with the following call to action: (1) include the basics of AI technology in public health training and professional development; (2) use a community engagement approach to co-design AI technologies in public health; and (3) introduce governance and best practice mechanisms that can guide the use of AI in public health to prevent or mitigate potential harms. These actions will support the application of AI to varied public health domains through a framework for more transparent, responsive, and equitable use of this evolving technology, augmenting the work of public health practitioners and researchers to improve health outcomes while minimizing risks and unintended consequences. %M 39841529 %R 10.2196/68198 %U https://www.jmir.org/2025/1/e68198 %U https://doi.org/10.2196/68198 %U http://www.ncbi.nlm.nih.gov/pubmed/39841529 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e67143 %T What’s Going On With Me and How Can I Better Manage My Health? The Potential of GPT-4 to Transform Discharge Letters Into Patient-Centered Letters to Enhance Patient Safety: Prospective, Exploratory Study %A Eisinger,Felix %A Holderried,Friederike %A Mahling,Moritz %A Stegemann–Philipps,Christian %A Herrmann–Werner,Anne %A Nazarenus,Eric %A Sonanini,Alessandra %A Guthoff,Martina %A Eickhoff,Carsten %A Holderried,Martin %+ Tübingen Institute for Medical Education, University of Tübingen, Elfriede-Aulhorn-Str. 10, Tübingen, 72076, Germany, 49 1704848650, Friederike.Holderried@med.uni-tuebingen.de %K GPT-4 %K patient letters %K health care communication %K artificial intelligence %K patient safety %K patient education %D 2025 %7 21.1.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: For hospitalized patients, the discharge letter serves as a crucial source of medical information, outlining important discharge instructions and health management tasks. However, these letters are often written in professional jargon, making them difficult for patients with limited medical knowledge to understand. Large language models, such as GPT, have the potential to transform these discharge summaries into patient-friendly letters, improving accessibility and understanding. Objective: This study aims to use GPT-4 to convert discharge letters into more readable patient-centered letters. We evaluated how effectively and comprehensively GPT-4 identified and transferred patient safety–relevant information from the discharge letters to the transformed patient letters. Methods: Three discharge letters were created based on common medical conditions, containing 72 patient safety–relevant pieces of information, referred to as “learning objectives.” GPT-4 was prompted to transform these discharge letters into patient-centered letters. The resulting patient letters were analyzed for medical accuracy, patient centricity, and the ability to identify and translate the learning objectives. 
Bloom’s taxonomy was applied to analyze and categorize the learning objectives. Results: GPT-4 addressed the majority (56/72, 78%) of the learning objectives from the discharge letters. However, 11 of the 72 (15%) learning objectives were not included in the majority of the patient-centered letters. A qualitative analysis based on Bloom’s taxonomy revealed that learning objectives in the “Understand” category (9/11) were more frequently omitted than those in the “Remember” category (2/11). Most of the missing learning objectives were related to the content field of “prevention of complications.” By contrast, learning objectives regarding “lifestyle” and “organizational” aspects were addressed more frequently. Medical errors were found in a small proportion of sentences (31/787, 3.9%). In terms of patient centricity, the patient-centered letters demonstrated better readability than the discharge letters. Compared with discharge letters, they included fewer medical terms (132/860, 15.3%, vs 165/273, 60.4%), fewer abbreviations (43/860, 5%, vs 49/273, 17.9%), and more explanations of medical terms (121/131, 92.4%, vs 0/165, 0%). Conclusions: Our study demonstrates that GPT-4 has the potential to transform discharge letters into more patient-centered communication. While the readability and patient centricity of the transformed letters are well-established, they do not fully address all patient safety–relevant information, resulting in the omission of key aspects. Further optimization of prompt engineering may help address this issue and improve the completeness of the transformation. %M 39836954 %R 10.2196/67143 %U https://www.jmir.org/2025/1/e67143 %U https://doi.org/10.2196/67143 %U http://www.ncbi.nlm.nih.gov/pubmed/39836954 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 14 %N %P e66094 %T Applications of Natural Language Processing and Large Language Models for Social Determinants of Health: Protocol for a Systematic Review %A Rajwal,Swati %A Zhang,Ziyuan %A Chen,Yankai %A Rogers,Hannah %A Sarker,Abeed %A Xiao,Yunyu %+ Department of Computer Science, Emory University, Mathematics & Science Center, Suite W401, 400 Dowman Drive, Atlanta, GA, 30322, United States, 1 4704478469, swati.rajwal@emory.edu %K social determinants of health %K SDOH %K natural language processing %K NLP %K systematic review protocol %K large language models %K LLM %D 2025 %7 21.1.2025 %9 Protocol %J JMIR Res Protoc %G English %X Background: In recent years, the intersection of natural language processing (NLP) and public health has opened innovative pathways for investigating social determinants of health (SDOH) in textual datasets. Despite the promise of NLP in the SDOH domain, the literature is dispersed across various disciplines, and there is a need to consolidate existing knowledge, identify knowledge gaps in the literature, and inform future research directions in this emerging field. Objective: This research protocol describes a systematic review to identify and highlight NLP techniques, including large language models, used for SDOH-related studies. Methods: A search strategy will be executed across PubMed, Web of Science, IEEE Xplore, Scopus, PsycINFO, HealthSource: Academic Nursing, and ACL Anthology to find studies published in English between 2014 and 2024. Three reviewers (SR, ZZ, and YC) will independently screen the studies to avoid voting bias, and 2 additional reviewers (AS and YX) will resolve any conflicts during the screening process. 
We will further screen studies that cited the included studies (forward search). Following the title abstract and full-text screening, the characteristics and main findings of the included studies and resources will be tabulated, visualized, and summarized. Results: The search strategy was formulated and run across the 7 databases in August 2024. We expect the results to be submitted for peer review publication in early 2025. As of December 2024, the title and abstract screening was underway. Conclusions: This systematic review aims to provide a comprehensive study of existing research on the application of NLP for various SDOH tasks across multiple textual datasets. By rigorously evaluating the methodologies, tools, and outcomes of eligible studies, the review will identify gaps in current knowledge and suggest directions for future research in the form of specific research questions. The findings will be instrumental in developing more effective NLP models for SDOH, ultimately contributing to improved health outcomes and a better understanding of social determinants in diverse populations. International Registered Report Identifier (IRRID): DERR1-10.2196/66094 %M 39836952 %R 10.2196/66094 %U https://www.researchprotocols.org/2025/1/e66094 %U https://doi.org/10.2196/66094 %U http://www.ncbi.nlm.nih.gov/pubmed/39836952 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e69742 %T Advantages and Inconveniences of a Multi-Agent Large Language Model System to Mitigate Cognitive Biases in Diagnostic Challenges %A Bousquet,Cedric %A Beltramin,Divà %+ Laboratory of Medical Informatics and Knowledge Engineering in e-Health, Inserm, Sorbonne University, 15 rue de l'école de Médecine, Paris, F-75006, France, 33 0477127974, cedric.bousquet@chu-st-etienne.fr %K large language model %K multi-agent system %K diagnostic errors %K cognition %K clinical decision-making %K cognitive bias %K generative artificial intelligence %D 2025 %7 20.1.2025 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 39832364 %R 10.2196/69742 %U https://www.jmir.org/2025/1/e69742 %U https://doi.org/10.2196/69742 %U http://www.ncbi.nlm.nih.gov/pubmed/39832364 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e69007 %T Era of Generalist Conversational Artificial Intelligence to Support Public Health Communications %A Sezgin,Emre %A Kocaballi,Ahmet Baki %+ The Abigail Wexner Research Institute at Nationwide Children’s Hospital, 700 Children's Dr, Columbus, OH, 43205, United States, 1 6147223179, emre.sezgin@nationwidechildrens.org %K messaging apps %K public health communication %K language models %K artificial intelligence %K AI %K generative AI %K conversational AI %D 2025 %7 20.1.2025 %9 Viewpoint %J J Med Internet Res %G English %X The integration of artificial intelligence (AI) into health communication systems has introduced a transformative approach to public health management, particularly during public health emergencies, capable of reaching billions through familiar digital channels. This paper explores the utility and implications of generalist conversational artificial intelligence (CAI) advanced AI systems trained on extensive datasets to handle a wide range of conversational tasks across various domains with human-like responsiveness. The specific focus is on the application of generalist CAI within messaging services, emphasizing its potential to enhance public health communication. 
We highlight the evolution and current applications of AI-driven messaging services, including their ability to provide personalized, scalable, and accessible health interventions. Specifically, we discuss the integration of large language models and generative AI in mainstream messaging platforms, which potentially outperform traditional information retrieval systems in public health contexts. We report a critical examination of the advantages of generalist CAI in delivering health information, with a case study of its operationalization during the COVID-19 pandemic, and propose the strategic deployment of these technologies in collaboration with public health agencies. In addition, we address significant challenges and ethical considerations, such as AI biases, misinformation, privacy concerns, and the need for regulatory oversight. We envision a future that leverages generalist CAI in messaging apps, proposing a multiagent approach to enhance the reliability and specificity of health communications. We hope this commentary initiates the necessary conversations and research toward building evaluation approaches, adaptive strategies, and robust legal and technical frameworks to fully realize the benefits of AI-enhanced communications in public health, aiming to ensure equitable and effective health outcomes across diverse populations. %M 39832358 %R 10.2196/69007 %U https://www.jmir.org/2025/1/e69007 %U https://doi.org/10.2196/69007 %U http://www.ncbi.nlm.nih.gov/pubmed/39832358 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e56850 %T Performance of ChatGPT-3.5 and ChatGPT-4 in the Taiwan National Pharmacist Licensing Examination: Comparative Evaluation Study %A Wang,Ying-Mei %A Shen,Hung-Wei %A Chen,Tzeng-Ji %A Chiang,Shu-Chiung %A Lin,Ting-Guan %K artificial intelligence %K ChatGPT %K chat generative pre-trained transformer %K GPT-4 %K medical education %K educational measurement %K pharmacy licensure %K Taiwan %K Taiwan national pharmacist licensing examination %K learning model %K AI %K Chatbot %K pharmacist %K evaluation and comparison study %K pharmacy %K statistical analyses %K medical databases %K medical decision-making %K generative AI %K machine learning %D 2025 %7 17.1.2025 %9 %J JMIR Med Educ %G English %X Background: OpenAI released ChatGPT-3.5 and GPT-4 between 2022 and 2023. GPT-3.5 has demonstrated proficiency in various examinations, particularly the United States Medical Licensing Examination. However, GPT-4 has more advanced capabilities. Objective: This study aims to examine the efficacy of GPT-3.5 and GPT-4 in the Taiwan National Pharmacist Licensing Examination and to ascertain their utility and potential application in clinical pharmacy and education. Methods: The pharmacist examination in Taiwan consists of 2 stages: basic subjects and clinical subjects. In this study, exam questions were manually fed into the GPT-3.5 and GPT-4 models, and their responses were recorded; graphic-based questions were excluded. This study encompassed three steps: (1) determining the answering accuracy of GPT-3.5 and GPT-4, (2) categorizing question types and observing differences in model performance across these categories, and (3) comparing model performance on calculation and situational questions. Microsoft Excel and R software were used for statistical analyses. Results: GPT-4 achieved an accuracy rate of 72.9%, overshadowing GPT-3.5, which achieved 59.1% (P<.001). 
In the basic subjects category, GPT-4 significantly outperformed GPT-3.5 (73.4% vs 53.2%; P<.001). However, in clinical subjects, only minor differences in accuracy were observed. Specifically, GPT-4 outperformed GPT-3.5 in the calculation and situational questions. Conclusions: This study demonstrates that GPT-4 outperforms GPT-3.5 in the Taiwan National Pharmacist Licensing Examination, particularly in basic subjects. While GPT-4 shows potential for use in clinical practice and pharmacy education, its limitations warrant caution. Future research should focus on refining prompts, improving model stability, integrating medical databases, and designing questions that better assess student competence and minimize guessing. %R 10.2196/56850 %U https://mededu.jmir.org/2025/1/e56850 %U https://doi.org/10.2196/56850 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e64284 %T Performance Evaluation and Implications of Large Language Models in Radiology Board Exams: Prospective Comparative Analysis %A Wei,Boxiong %K large language models %K LLM %K artificial intelligence %K AI %K GPT-4 %K radiology exams %K medical education %K diagnostics %K medical training %K radiology %K ultrasound %D 2025 %7 16.1.2025 %9 %J JMIR Med Educ %G English %X Background: Artificial intelligence advancements have enabled large language models to significantly impact radiology education and diagnostic accuracy. Objective: This study evaluates the performance of mainstream large language models, including GPT-4, Claude, Bard, Tongyi Qianwen, and Gemini Pro, in radiology board exams. Methods: A comparative analysis of 150 multiple-choice questions from radiology board exams without images was conducted. Models were assessed on their accuracy for text-based questions and were categorized by cognitive levels and medical specialties using χ2 tests and ANOVA. Results: GPT-4 achieved the highest accuracy (83.3%, 125/150), significantly outperforming all other models. Specifically, Claude achieved an accuracy of 62% (93/150; P<.001), Bard 54.7% (82/150; P<.001), Tongyi Qianwen 70.7% (106/150; P=.009), and Gemini Pro 55.3% (83/150; P<.001). The odds ratios compared to GPT-4 were 0.33 (95% CI 0.18‐0.60) for Claude, 0.24 (95% CI 0.13‐0.44) for Bard, and 0.25 (95% CI 0.14‐0.45) for Gemini Pro. Tongyi Qianwen performed relatively well with an accuracy of 70.7% (106/150; P=0.02) and had an odds ratio of 0.48 (95% CI 0.27‐0.87) compared to GPT-4. Performance varied across question types and specialties, with GPT-4 excelling in both lower-order and higher-order questions, while Claude and Bard struggled with complex diagnostic questions. Conclusions: GPT-4 and Tongyi Qianwen show promise in medical education and training. The study emphasizes the need for domain-specific training datasets to enhance large language models’ effectiveness in specialized fields like radiology. 
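The odds ratios above follow directly from the reported correct/incorrect counts out of 150 questions; a short Python check of that arithmetic reproduces the reported point estimates (0.33, 0.24, 0.25, and 0.48). The log-scale Wald 95% CI used here is an assumption, since the paper does not state its CI method, so the intervals may differ slightly from those reported:

```python
import math

def odds_ratio_ci(correct_a, total_a, correct_b, total_b, z=1.96):
    """Odds ratio of model A vs model B with a Wald (log-scale) 95% CI."""
    a, b = correct_a, total_a - correct_a          # model A: correct, incorrect
    c, d = correct_b, total_b - correct_b          # model B: correct, incorrect
    or_ = (a * d) / (b * c)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lo, hi = math.exp(math.log(or_) - z * se), math.exp(math.log(or_) + z * se)
    return or_, lo, hi

gpt4_correct, total = 125, 150
for name, correct in [("Claude", 93), ("Bard", 82), ("Gemini Pro", 83), ("Tongyi Qianwen", 106)]:
    or_, lo, hi = odds_ratio_ci(correct, total, gpt4_correct, total)
    print(f"{name} vs GPT-4: OR={or_:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```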
%R 10.2196/64284 %U https://mededu.jmir.org/2025/1/e64284 %U https://doi.org/10.2196/64284 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e51319 %T Assessing the Current Limitations of Large Language Models in Advancing Health Care Education %A Kim,JaeYong %A Vajravelu,Bathri Narayan %K large language model %K generative pretrained transformer %K health care education %K health care delivery %K artificial intelligence %K LLM %K ChatGPT %K AI %D 2025 %7 16.1.2025 %9 %J JMIR Form Res %G English %X The integration of large language models (LLMs), as seen with the generative pretrained transformers series, into health care education and clinical management represents a transformative potential. The practical use of current LLMs in health care sparks great anticipation for new avenues, yet its embracement also elicits considerable concerns that necessitate careful deliberation. This study aims to evaluate the application of state-of-the-art LLMs in health care education, highlighting the following shortcomings as areas requiring significant and urgent improvements: (1) threats to academic integrity, (2) dissemination of misinformation and risks of automation bias, (3) challenges with information completeness and consistency, (4) inequity of access, (5) risks of algorithmic bias, (6) exhibition of moral instability, (7) technological limitations in plugin tools, and (8) lack of regulatory oversight in addressing legal and ethical challenges. Future research should focus on strategically addressing the persistent challenges of LLMs highlighted in this paper, opening the door for effective measures that can improve their application in health care education. %R 10.2196/51319 %U https://formative.jmir.org/2025/1/e51319 %U https://doi.org/10.2196/51319 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e65047 %T Evaluating and Enhancing Japanese Large Language Models for Genetic Counseling Support: Comparative Study of Domain Adaptation and the Development of an Expert-Evaluated Dataset %A Fukushima,Takuya %A Manabe,Masae %A Yada,Shuntaro %A Wakamiya,Shoko %A Yoshida,Akiko %A Urakawa,Yusaku %A Maeda,Akiko %A Kan,Shigeyuki %A Takahashi,Masayo %A Aramaki,Eiji %+ , Graduate School of Science and Technology, Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma, 630-0192, Japan, 81 743 72 5250, aramaki@is.naist.jp %K large language models %K genetic counseling %K medical %K health %K artificial intelligence %K machine learning %K domain adaptation %K retrieval-augmented generation %K instruction tuning %K prompt engineering %K question-answer %K dialogue %K ethics %K safety %K low-rank adaptation %K Japanese %K expert evaluation %D 2025 %7 16.1.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: Advances in genetics have underscored a strong association between genetic factors and health outcomes, leading to an increased demand for genetic counseling services. However, a shortage of qualified genetic counselors poses a significant challenge. Large language models (LLMs) have emerged as a potential solution for augmenting support in genetic counseling tasks. Despite the potential, Japanese genetic counseling LLMs (JGCLLMs) are underexplored. To advance a JGCLLM-based dialogue system for genetic counseling, effective domain adaptation methods require investigation. Objective: This study aims to evaluate the current capabilities and identify challenges in developing a JGCLLM-based dialogue system for genetic counseling. 
The primary focus is to assess the effectiveness of prompt engineering, retrieval-augmented generation (RAG), and instruction tuning within the context of genetic counseling. Furthermore, we will establish an expert-evaluated dataset of responses generated by LLMs adapted to Japanese genetic counseling for the future development of JGCLLMs. Methods: Two primary datasets were used in this study: (1) a question-answer (QA) dataset for LLM adaptation and (2) a genetic counseling question dataset for evaluation. The QA dataset included 899 QA pairs covering medical and genetic counseling topics, while the evaluation dataset contained 120 curated questions across 6 genetic counseling categories. Three LLM enhancement techniques—instruction tuning, RAG, and prompt engineering—were applied to a lightweight Japanese LLM to enhance its ability for genetic counseling. The performance of the adapted LLM was evaluated on the 120-question dataset by 2 certified genetic counselors and 1 ophthalmologist (SK, YU, and AY). Evaluation focused on four metrics: (1) inappropriateness of information, (2) sufficiency of information, (3) severity of harm, and (4) alignment with medical consensus. Results: The evaluation by certified genetic counselors and an ophthalmologist revealed varied outcomes across different methods. RAG showed potential, particularly in enhancing critical aspects of genetic counseling. In contrast, instruction tuning and prompt engineering produced less favorable outcomes. This evaluation process facilitated the creation of an expert-evaluated dataset of responses generated by LLMs adapted with different combinations of these methods. Error analysis identified key ethical concerns, including inappropriate promotion of prenatal testing, criticism of relatives, and inaccurate probability statements. Conclusions: RAG demonstrated notable improvements across all evaluation metrics, suggesting potential for further enhancement through the expansion of RAG data. The expert-evaluated dataset developed in this study provides valuable insights for future optimization efforts. However, the ethical issues observed in JGCLLM responses underscore the critical need for ongoing refinement and thorough ethical evaluation before these systems can be implemented in health care settings. 
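Retrieval-augmented generation of the kind compared above follows a simple pattern: embed a curated QA corpus, retrieve the entries closest to the incoming question, and prepend them to the prompt sent to the LLM. A minimal sketch using sentence-transformers for the retrieval step; the embedding model, knowledge-base text, and prompt wording are illustrative assumptions, not the study's configuration:

```python
from sentence_transformers import SentenceTransformer, util

# Toy knowledge base standing in for the curated genetic counseling QA material.
kb = [
    "Carrier screening estimates the chance that parents pass on a recessive condition.",
    "A genetic counselor can explain what a variant of uncertain significance means.",
    "Prenatal testing decisions should follow non-directive counseling principles.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # multilingual models would suit Japanese text
kb_emb = embedder.encode(kb, convert_to_tensor=True)

question = "What does a variant of uncertain significance mean for my family?"
q_emb = embedder.encode(question, convert_to_tensor=True)

# Retrieve the top-2 most similar passages by cosine similarity.
scores = util.cos_sim(q_emb, kb_emb)[0]
top_idx = scores.topk(2).indices.tolist()
context = "\n".join(kb[i] for i in top_idx)

# The assembled prompt would then be passed to the LLM being adapted.
prompt = f"Use the following reference material to answer carefully.\n{context}\n\nQuestion: {question}"
print(prompt)
```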
%M 39819819 %R 10.2196/65047 %U https://medinform.jmir.org/2025/1/e65047 %U https://doi.org/10.2196/65047 %U http://www.ncbi.nlm.nih.gov/pubmed/39819819 %0 Journal Article %@ 2817-092X %I JMIR Publications %V 4 %N %P e64182 %T Transforming Perceptions: Exploring the Multifaceted Potential of Generative AI for People With Cognitive Disabilities %A Hadar Souval,Dorit %A Haber,Yuval %A Tal,Amir %A Simon,Tomer %A Elyoseph,Tal %A Elyoseph,Zohar %K generative artificial intelligence %K cognitive disability %K social participation %K AI ethics %K assistive technology %K cognitive disorder %K societal barriers %K social inclusion %K disability study %K social mirror %K cognitive partner %K empowerment %K user involvement %K GenAI %K artificial intelligence %K neurotechnology %K neuroinformatics %K digital health %K health informatics %K neuroscience %K mental health %K computer science %K machine learning %D 2025 %7 15.1.2025 %9 %J JMIR Neurotech %G English %X Background: The emergence of generative artificial intelligence (GenAI) presents unprecedented opportunities to redefine conceptions of personhood and cognitive disability, potentially enhancing the inclusion and participation of individuals with cognitive disabilities in society. Objective: We aim to explore the transformative potential of GenAI in reshaping perceptions of cognitive disability, dismantling societal barriers, and promoting social participation for individuals with cognitive disabilities. Methods: This study is a critical review of current literature in disability studies, artificial intelligence (AI) ethics, and computer science, integrating insights from disability theories and the philosophy of technology. The analysis focused on 2 key aspects: GenAI as a social mirror reflecting societal values and biases, and GenAI as a cognitive partner for individuals with cognitive disabilities. Results: This paper proposes a theoretical framework for understanding the impact of GenAI on perceptions of cognitive disability. It introduces the concepts of GenAI as a “social mirror” that reflects and potentially amplifies societal biases and as a “cognitive copilot” providing personalized assistance in daily tasks, social interactions, and environmental navigation. This paper also presents a novel protocol for developing AI systems tailored to the needs of individuals with cognitive disabilities, emphasizing user involvement, ethical considerations, and the need to address both the opportunities and challenges posed by GenAI. Conclusions: Although GenAI has great potential for promoting the inclusion and empowerment of individuals with cognitive disabilities, realizing this potential requires a change in societal attitudes and development practices. This paper calls for interdisciplinary collaboration and close partnership with the disability community in the development and implementation of GenAI technologies. Realizing the potential of GenAI for promoting the inclusion and empowerment of individuals with cognitive disabilities requires a multifaceted approach. This involves a shift in societal attitudes, inclusive AI development practices that prioritize the needs and perspectives of the disability community, and ongoing interdisciplinary collaboration. This paper emphasizes the importance of proceeding with caution, recognizing the ethical complexities and potential risks alongside the transformative possibilities of GenAI technology. 
%R 10.2196/64182 %U https://neuro.jmir.org/2025/1/e64182 %U https://doi.org/10.2196/64182 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63731 %T Qwen-2.5 Outperforms Other Large Language Models in the Chinese National Nursing Licensing Examination: Retrospective Cross-Sectional Comparative Study %A Zhu,Shiben %A Hu,Wanqin %A Yang,Zhi %A Yan,Jiani %A Zhang,Fang %+ Department of Science and Education, Shenzhen Baoan Women's and Children's Hospital, 56 Yulu Road, Xin'an Street, Bao'an District, Shenzhen, 518001, China, 86 13686891225, zhangfangf11@163.com %K large language models %K LLMs %K Chinese National Nursing Licensing Examination %K ChatGPT %K Qwen-2.5 %K multiple-choice questions %D 2025 %7 10.1.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: Large language models (LLMs) have been proposed as valuable tools in medical education and practice. The Chinese National Nursing Licensing Examination (CNNLE) presents unique challenges for LLMs due to its requirement for both deep domain–specific nursing knowledge and the ability to make complex clinical decisions, which differentiates it from more general medical examinations. However, their potential application in the CNNLE remains unexplored. Objective: This study aims to evaluate the accuracy of 7 LLMs including GPT-3.5, GPT-4.0, GPT-4o, Copilot, ERNIE Bot-3.5, SPARK, and Qwen-2.5 on the CNNLE, focusing on their ability to handle domain-specific nursing knowledge and clinical decision-making. We also explore whether combining their outputs using machine learning techniques can improve their overall accuracy. Methods: This retrospective cross-sectional study analyzed all 1200 multiple-choice questions from the CNNLE conducted between 2019 and 2023. Seven LLMs were evaluated on these multiple-choice questions, and 9 machine learning models, including Logistic Regression, Support Vector Machine, Multilayer Perceptron, k-nearest neighbors, Random Forest, LightGBM, AdaBoost, XGBoost, and CatBoost, were used to optimize overall performance through ensemble techniques. Results: Qwen-2.5 achieved the highest overall accuracy of 88.9%, followed by GPT-4o (80.7%), ERNIE Bot-3.5 (78.1%), GPT-4.0 (70.3%), SPARK (65.0%), and GPT-3.5 (49.5%). Qwen-2.5 demonstrated superior accuracy in the Practical Skills section compared with the Professional Practice section across most years. It also performed well in brief clinical case summaries and questions involving shared clinical scenarios. When the outputs of the 7 LLMs were combined using 9 machine learning models, XGBoost yielded the best performance, increasing accuracy to 90.8%. XGBoost also achieved an area under the curve of 0.961, sensitivity of 0.905, specificity of 0.978, F1-score of 0.901, positive predictive value of 0.901, and negative predictive value of 0.977. Conclusions: This study is the first to evaluate the performance of 7 LLMs on the CNNLE and to show that integrating their outputs via machine learning significantly boosted accuracy, reaching 90.8%. These findings demonstrate the transformative potential of LLMs in revolutionizing health care education and call for further research to refine their capabilities and expand their impact on examination preparation and professional training. 
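The abstract above reports combining the answer choices of 7 LLMs with machine learning, with XGBoost performing best. The sketch below is a minimal, hypothetical illustration of that kind of ensemble on synthetic data (random answers and labels); it is not the authors' code, and the one-hot feature encoding is an assumption.

```python
# Minimal sketch (not the authors' code): ensembling the answer choices of
# several LLMs with XGBoost, in the spirit of the ensemble the abstract reports.
# All data here are synthetic placeholders.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
from xgboost import XGBClassifier

rng = np.random.default_rng(0)
n_questions, n_llms, n_options = 1200, 7, 4  # toy setup sized like the CNNLE pool

# Each row holds the option (0-3) chosen by each of the 7 LLMs for one question.
llm_answers = rng.integers(0, n_options, size=(n_questions, n_llms))
correct = rng.integers(0, n_options, size=n_questions)  # placeholder ground truth

# One-hot encode each model's choice so the ensemble can learn model-option patterns.
features = np.concatenate(
    [np.eye(n_options)[llm_answers[:, i]] for i in range(n_llms)], axis=1
)

X_train, X_test, y_train, y_test = train_test_split(
    features, correct, test_size=0.2, random_state=42
)
clf = XGBClassifier(n_estimators=200, max_depth=4, learning_rate=0.1)
clf.fit(X_train, y_train)
print("ensemble accuracy:", accuracy_score(y_test, clf.predict(X_test)))
```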
%M 39793017 %R 10.2196/63731 %U https://medinform.jmir.org/2025/1/e63731 %U https://doi.org/10.2196/63731 %U http://www.ncbi.nlm.nih.gov/pubmed/39793017 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 4 %N %P e67621 %T Evaluating ChatGPT’s Efficacy in Pediatric Pneumonia Detection From Chest X-Rays: Comparative Analysis of Specialized AI Models %A Chetla,Nitin %A Tandon,Mihir %A Chang,Joseph %A Sukhija,Kunal %A Patel,Romil %A Sanchez,Ramon %+ Department of Orthopaedics, Albany Medical College, 43 New Scotland Ave, Albany, NY, 12208, United States, 1 3322488708, tandonm@amc.edu %K artificial intelligence %K ChatGPT %K pneumonia %K chest x-ray %K pediatric %K radiology %K large language models %K machine learning %K pneumonia detection %K diagnosis %K pediatric pneumonia %D 2025 %7 10.1.2025 %9 Research Letter %J JMIR AI %G English %X %M 39793007 %R 10.2196/67621 %U https://ai.jmir.org/2025/1/e67621 %U https://doi.org/10.2196/67621 %U http://www.ncbi.nlm.nih.gov/pubmed/39793007 %0 Journal Article %@ 2561-7605 %I JMIR Publications %V 8 %N %P e60566 %T Designing a Multimodal and Culturally Relevant Alzheimer Disease and Related Dementia Generative Artificial Intelligence Tool for Black American Informal Caregivers: Cognitive Walk-Through Usability Study %A Bosco,Cristina %A Otenen,Ege %A Osorio Torres,John %A Nguyen,Vivian %A Chheda,Darshil %A Peng,Xinran %A Jessup,Nenette M %A Himes,Anna K %A Cureton,Bianca %A Lu,Yvonne %A Hill,Carl V %A Hendrie,Hugh C %A Barnes,Priscilla A %A Shih,Patrick C %+ , Luddy School of Informatics, Computing, and Engineering, Indiana University, 700 N Woodlawn Ave, Bloomington, IN, 4740, United States, 1 812 856 5754, cribosco@iu.edu %K multimodality %K artificial intelligence %K AI %K generative AI %K usability %K black %K African American %K cultural %K Alzheimer's %K dementia %K caregivers %K mobile app %K interaction %K cognition %K user opinion %K geriatrics %K smartphone %K mHealth %K digital health %K aging %D 2025 %7 8.1.2025 %9 Original Paper %J JMIR Aging %G English %X Background: Many members of Black American communities, faced with the high prevalence of Alzheimer disease and related dementias (ADRD) within their demographic, find themselves taking on the role of informal caregivers. Despite being the primary individuals responsible for the care of individuals with ADRD, these caregivers often lack sufficient knowledge about ADRD-related health literacy and feel ill-prepared for their caregiving responsibilities. Generative AI has become a new promising technological innovation in the health care domain, particularly for improving health literacy; however, some generative AI developments might lead to increased bias and potential harm toward Black American communities. Therefore, rigorous development of generative AI tools to support the Black American community is needed. Objective: The goal of this study is to test Lola, a multimodal mobile app, which, by relying on generative AI, facilitates access to ADRD-related health information by enabling speech and text as inputs and providing auditory, textual, and visual outputs. Methods: To test our mobile app, we used the cognitive walk-through methodology, and we recruited 15 informal ADRD caregivers who were older than 50 years and part of the Black American community living within the region. 
We asked them to perform 3 tasks on the mobile app (ie, searching for an article on brain health, searching for local events, and finally, searching for opportunities to participate in scientific research in their area), then we recorded their opinions and impressions. The main aspects to be evaluated were the mobile app’s usability, accessibility, cultural relevance, and adoption. Results: Our findings highlight the users’ need for a system that enables interaction with different modalities, the need for a system that can provide personalized and culturally and contextually relevant information, and the role of community and physical spaces in increasing the use of Lola. Conclusions: Our study shows that, when designing for Black American older adults, a multimodal interaction with the generative AI system can allow individuals to choose their own interaction way and style based upon their interaction preferences and external constraints. This flexibility of interaction modes can guarantee an inclusive and engaging generative AI experience. %M 39778201 %R 10.2196/60566 %U https://aging.jmir.org/2025/1/e60566 %U https://doi.org/10.2196/60566 %U http://www.ncbi.nlm.nih.gov/pubmed/39778201 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e59069 %T Revolutionizing Health Care: The Transformative Impact of Large Language Models in Medicine %A Zhang,Kuo %A Meng,Xiangbin %A Yan,Xiangyu %A Ji,Jiaming %A Liu,Jingqian %A Xu,Hua %A Zhang,Heng %A Liu,Da %A Wang,Jingjia %A Wang,Xuliang %A Gao,Jun %A Wang,Yuan-geng-shuo %A Shao,Chunli %A Wang,Wenyao %A Li,Jiarong %A Zheng,Ming-Qi %A Yang,Yaodong %A Tang,Yi-Da %+ Department of Cardiology and Institute of Vascular Medicine, Key Laboratory of Molecular Cardiovascular Science, Ministry of Education, Peking University Third Hospital, 49 North Garden Road, Beijing, 100191, China, 86 88396171, tangyida@bjmu.edu.cn %K large language models %K LLMs %K digital health %K medical diagnosis %K treatment %K multimodal data integration %K technological fairness %K artificial intelligence %K AI %K natural language processing %K NLP %D 2025 %7 7.1.2025 %9 Viewpoint %J J Med Internet Res %G English %X Large language models (LLMs) are rapidly advancing medical artificial intelligence, offering revolutionary changes in health care. These models excel in natural language processing (NLP), enhancing clinical support, diagnosis, treatment, and medical research. Breakthroughs, like GPT-4 and BERT (Bidirectional Encoder Representations from Transformer), demonstrate LLMs’ evolution through improved computing power and data. However, their high hardware requirements are being addressed through technological advancements. LLMs are unique in processing multimodal data, thereby improving emergency, elder care, and digital medical procedures. Challenges include ensuring their empirical reliability, addressing ethical and societal implications, especially data privacy, and mitigating biases while maintaining privacy and accountability. The paper emphasizes the need for human-centric, bias-free LLMs for personalized medicine and advocates for equitable development and access. LLMs hold promise for transformative impacts in health care. 
%M 39773666 %R 10.2196/59069 %U https://www.jmir.org/2025/1/e59069 %U https://doi.org/10.2196/59069 %U http://www.ncbi.nlm.nih.gov/pubmed/39773666 %0 Journal Article %@ 2561-7605 %I JMIR Publications %V 8 %N %P e63715 %T The PDC30 Chatbot—Development of a Psychoeducational Resource on Dementia Caregiving Among Family Caregivers: Mixed Methods Acceptability Study %A Cheng,Sheung-Tak %A Ng,Peter H F %+ Department of Health and Physical Education, The Education University of Hong Kong, 10 Lo Ping Road, Tai Po, China (Hong Kong), 852 29486563, takcheng@eduhk.hk %K Alzheimer %K caregiving %K chatbot %K conversational artificial intelligence %K dementia %K digital health %K health care technology %K psychoeducational %K medical innovations %K language models %K mobile phone %D 2025 %7 6.1.2025 %9 Original Paper %J JMIR Aging %G English %X Background: Providing ongoing support to the increasing number of caregivers as their needs change in the long-term course of dementia is a severe challenge to any health care system. Conversational artificial intelligence (AI) operating 24/7 may help to tackle this problem. Objective: This study describes the development of a generative AI chatbot—the PDC30 Chatbot—and evaluates its acceptability in a mixed methods study. Methods: The PDC30 Chatbot was developed using the GPT-4o large language model, with a personality agent to constrain its behavior to provide advice on dementia caregiving based on the Positive Dementia Caregiving in 30 Days Guidebook—a laypeople’s resource based on a validated training manual for dementia caregivers. The PDC30 Chatbot’s responses to 21 common questions were compared with those of ChatGPT and another chatbot (called Chatbot-B) as standards of reference. Chatbot-B was constructed using PDC30 Chatbot’s architecture but replaced the latter’s knowledge base with a collection of authoritative sources, including the World Health Organization’s iSupport, By Us For Us Guides, and 185 web pages or manuals by Alzheimer’s Association, National Institute on Aging, and UK Alzheimer’s Society. In the next phase, to assess the acceptability of the PDC30 Chatbot, 21 family caregivers used the PDC30 Chatbot for two weeks and provided ratings and comments on its acceptability. Results: Among the three chatbots, ChatGPT’s responses tended to be repetitive and not specific enough. PDC30 Chatbot and Chatbot-B, by virtue of their design, produced highly context-sensitive advice, with the former performing slightly better when the questions conveyed significant psychological distress on the part of the caregiver. In the acceptability study, caregivers found the PDC30 Chatbot highly user-friendly, and its responses quite helpful and easy to understand. They were rather satisfied with it and would strongly recommend it to other caregivers. During the 2-week trial period, the majority used the chatbot more than once per day. Thematic analysis of their written feedback revealed three major themes: helpfulness, accessibility, and improved attitude toward AI. Conclusions: The PDC30 Chatbot provides quality responses to caregiver questions, which are well-received by caregivers. Conversational AI is a viable approach to improve the support of caregivers. 
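The PDC30 Chatbot described above constrains GPT-4o with a personality agent so that its advice stays grounded in a caregiving guidebook. As a minimal sketch of that general pattern (not the actual PDC30 implementation), the snippet below pairs a persona-style system prompt with a guidebook excerpt using the OpenAI Python client; the model name, prompt wording, and reply() helper are illustrative assumptions.

```python
# Minimal sketch (not the PDC30 Chatbot itself): constraining a general-purpose
# chat model with a persona-style system prompt plus a guidebook excerpt.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PERSONA = (
    "You are a supportive dementia-caregiving coach. Ground your advice in the "
    "provided guidebook excerpt, acknowledge the caregiver's feelings first, "
    "and refer medical decisions to health professionals."
)

def reply(caregiver_message: str, guidebook_excerpt: str) -> str:
    """Return a persona-constrained response grounded in a guidebook excerpt."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": PERSONA},
            {"role": "system", "content": f"Guidebook excerpt:\n{guidebook_excerpt}"},
            {"role": "user", "content": caregiver_message},
        ],
        temperature=0.4,
    )
    return response.choices[0].message.content

print(reply("My mother keeps asking the same question over and over.",
            "Repetitive questioning is common; respond calmly, avoid correcting, "
            "and gently redirect attention to a familiar activity."))
```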
%R 10.2196/63715 %U https://aging.jmir.org/2025/1/e63715 %U https://doi.org/10.2196/63715 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e66220 %T Two-Layer Retrieval-Augmented Generation Framework for Low-Resource Medical Question Answering Using Reddit Data: Proof-of-Concept Study %A Das,Sudeshna %A Ge,Yao %A Guo,Yuting %A Rajwal,Swati %A Hairston,JaMor %A Powell,Jeanne %A Walker,Drew %A Peddireddy,Snigdha %A Lakamana,Sahithi %A Bozkurt,Selen %A Reyna,Matthew %A Sameni,Reza %A Xiao,Yunyu %A Kim,Sangmi %A Chandler,Rasheeta %A Hernandez,Natalie %A Mowery,Danielle %A Wightman,Rachel %A Love,Jennifer %A Spadaro,Anthony %A Perrone,Jeanmarie %A Sarker,Abeed %+ Department of Biomedical Informatics, School of Medicine, Emory University, 101 Woodruff Circle, Atlanta, GA, 30322, United States, 1 4047270229, sudeshna.das@emory.edu %K retrieval-augmented generation %K substance use %K social media %K large language models %K natural language processing %K artificial intelligence %K GPT %K psychoactive substance %D 2025 %7 6.1.2025 %9 Short Paper %J J Med Internet Res %G English %X Background: The increasing use of social media to share lived and living experiences of substance use presents a unique opportunity to obtain information on side effects, use patterns, and opinions on novel psychoactive substances. However, due to the large volume of data, obtaining useful insights through natural language processing technologies such as large language models is challenging. Objective: This paper aims to develop a retrieval-augmented generation (RAG) architecture for medical question answering pertaining to clinicians’ queries on emerging issues associated with health-related topics, using user-generated medical information on social media. Methods: We proposed a two-layer RAG framework for query-focused answer generation and evaluated a proof of concept for the framework in the context of query-focused summary generation from social media forums, focusing on emerging drug-related information. Our modular framework generates individual summaries followed by an aggregated summary to answer medical queries from large amounts of user-generated social media data in an efficient manner. We compared the performance of a quantized large language model (Nous-Hermes-2-7B-DPO), deployable in low-resource settings, with GPT-4. For this proof-of-concept study, we used user-generated data from Reddit to answer clinicians’ questions on the use of xylazine and ketamine. Results: Our framework achieves comparable median scores in terms of relevance, length, hallucination, coverage, and coherence when evaluated using GPT-4 and Nous-Hermes-2-7B-DPO, evaluated for 20 queries with 76 samples. There was no statistically significant difference between GPT-4 and Nous-Hermes-2-7B-DPO for coverage (Mann-Whitney U=733.0; n1=37; n2=39; P=.89 two-tailed), coherence (U=670.0; n1=37; n2=39; P=.49 two-tailed), relevance (U=662.0; n1=37; n2=39; P=.15 two-tailed), length (U=672.0; n1=37; n2=39; P=.55 two-tailed), and hallucination (U=859.0; n1=37; n2=39; P=.01 two-tailed). A statistically significant difference was noted for the Coleman-Liau Index (U=307.5; n1=20; n2=16; P<.001 two-tailed). Conclusions: Our RAG framework can effectively answer medical questions about targeted topics and can be deployed in resource-constrained settings. 
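The framework above generates individual summaries and then an aggregated summary to answer a clinical query from social media posts. The snippet below is a minimal, hypothetical sketch of that two-layer flow, using TF-IDF retrieval and a placeholder generate() function standing in for whichever LLM is used; it is not the authors' implementation.

```python
# Minimal sketch (not the authors' framework): a two-layer, query-focused
# summarization flow -- retrieve relevant posts, summarize them in batches,
# then aggregate the batch summaries into one answer.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def generate(prompt: str) -> str:
    """Placeholder LLM call; swap in an actual model client here."""
    return f"[summary of: {prompt[:60]}...]"

def retrieve(query: str, posts: list[str], k: int = 6) -> list[str]:
    """Layer 0: rank posts against the clinical query with TF-IDF similarity."""
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform([query] + posts)
    scores = cosine_similarity(matrix[0], matrix[1:]).ravel()
    top = scores.argsort()[::-1][:k]
    return [posts[i] for i in top]

def two_layer_answer(query: str, posts: list[str], batch_size: int = 3) -> str:
    relevant = retrieve(query, posts)
    # Layer 1: summarize small batches so each prompt stays within context limits.
    partials = [
        generate(f"Question: {query}\nPosts:\n" + "\n".join(relevant[i:i + batch_size]))
        for i in range(0, len(relevant), batch_size)
    ]
    # Layer 2: aggregate partial summaries into a single query-focused answer.
    return generate(f"Question: {query}\nPartial summaries:\n" + "\n".join(partials))

posts = ["Xylazine wounds seem slow to heal...",
         "Tried ketamine for pain after...",
         "Anyone else noticing heavy sedation?"]
print(two_layer_answer("What adverse effects of xylazine are users reporting?", posts))
```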
%M 39761554 %R 10.2196/66220 %U https://www.jmir.org/2025/1/e66220 %U https://doi.org/10.2196/66220 %U http://www.ncbi.nlm.nih.gov/pubmed/39761554 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 11 %N %P e63865 %T Enhancing Medical Student Engagement Through Cinematic Clinical Narratives: Multimodal Generative AI–Based Mixed Methods Study %A Bland,Tyler %K artificial intelligence %K cinematic clinical narratives %K cinemeducation %K medical education %K narrative learning %K AI %K medical student %K pharmacology %K preclinical education %K long-term retention %K AI tools %K GPT-4 %K image %K applicability %D 2025 %7 6.1.2025 %9 %J JMIR Med Educ %G English %X Background: Medical students often struggle to engage with and retain complex pharmacology topics during their preclinical education. Traditional teaching methods can lead to passive learning and poor long-term retention of critical concepts. Objective: This study aims to enhance the teaching of clinical pharmacology in medical school by using a multimodal generative artificial intelligence (genAI) approach to create compelling, cinematic clinical narratives (CCNs). Methods: We transformed a standard clinical case into an engaging, interactive multimedia experience called “Shattered Slippers.” This CCN used various genAI tools for content creation: GPT-4 for developing the storyline, Leonardo.ai and Stable Diffusion for generating images, Eleven Labs for creating audio narrations, and Suno for composing a theme song. The CCN integrated narrative styles and pop culture references to enhance student engagement. It was applied in teaching first-year medical students about immune system pharmacology. Student responses were assessed through the Situational Interest Survey for Multimedia and examination performance. The target audience comprised first-year medical students (n=40), with 18 responding to the Situational Interest Survey for Multimedia survey (n=18). Results: The study revealed a marked preference for the genAI-enhanced CCNs over traditional teaching methods. Key findings include the majority of surveyed students preferring the CCN over traditional clinical cases (14/18), as well as high average scores for triggered situational interest (mean 4.58, SD 0.53), maintained interest (mean 4.40, SD 0.53), maintained-feeling interest (mean 4.38, SD 0.51), and maintained-value interest (mean 4.42, SD 0.54). Students achieved an average score of 88% on examination questions related to the CCN material, indicating successful learning and retention. Qualitative feedback highlighted increased engagement, improved recall, and appreciation for the narrative style and pop culture references. Conclusions: This study demonstrates the potential of using a multimodal genAI-driven approach to create CCNs in medical education. The “Shattered Slippers” case effectively enhanced student engagement and promoted knowledge retention in complex pharmacological topics. This innovative method suggests a novel direction for curriculum development that could improve learning outcomes and student satisfaction in medical education. Future research should explore the long-term retention of knowledge and the applicability of learned material in clinical settings, as well as the potential for broader implementation of this approach across various medical education contexts. 
%R 10.2196/63865 %U https://mededu.jmir.org/2025/1/e63865 %U https://doi.org/10.2196/63865 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e63538 %T Development and Evaluation of a Mental Health Chatbot Using ChatGPT 4.0: Mixed Methods User Experience Study With Korean Users %A Kang,Boyoung %A Hong,Munpyo %+ Sungkyunkwan University, 25-2, Sungkyunkwan-Ro, Jongno-gu, Seoul, 03063, Republic of Korea, 82 027401770, bykang2015@gmail.com %K mental health chatbot %K Dr. CareSam %K HoMemeTown %K ChatGPT 4.0 %K large language model %K LLM %K cross-lingual %K pilot testing %K cultural sensitivity %K localization %K Korean students %D 2025 %7 3.1.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: Mental health chatbots have emerged as a promising tool for providing accessible and convenient support to individuals in need. Building on our previous research on digital interventions for loneliness and depression among Korean college students, this study addresses the limitations identified and explores more advanced artificial intelligence–driven solutions. Objective: This study aimed to develop and evaluate the performance of HoMemeTown Dr. CareSam, an advanced cross-lingual chatbot using ChatGPT 4.0 (OpenAI) to provide seamless support in both English and Korean contexts. The chatbot was designed to address the need for more personalized and culturally sensitive mental health support identified in our previous work while providing an accessible and user-friendly interface for Korean young adults. Methods: We conducted a mixed methods pilot study with 20 Korean young adults aged 18 to 27 (mean 23.3, SD 1.96) years. The HoMemeTown Dr CareSam chatbot was developed using the GPT application programming interface, incorporating features such as a gratitude journal and risk detection. User satisfaction and chatbot performance were evaluated using quantitative surveys and qualitative feedback, with triangulation used to ensure the validity and robustness of findings through cross-verification of data sources. Comparative analyses were conducted with other large language model chatbots and existing digital therapy tools (Woebot [Woebot Health Inc] and Happify [Twill Inc]). Results: Users generally expressed positive views towards the chatbot, with positivity and support receiving the highest score on a 10-point scale (mean 9.0, SD 1.2), followed by empathy (mean 8.7, SD 1.6) and active listening (mean 8.0, SD 1.8). However, areas for improvement were noted in professionalism (mean 7.0, SD 2.0), complexity of content (mean 7.4, SD 2.0), and personalization (mean 7.4, SD 2.4). The chatbot demonstrated statistically significant performance differences compared with other large language model chatbots (F=3.27; P=.047), with more pronounced differences compared with Woebot and Happify (F=12.94; P<.001). Qualitative feedback highlighted the chatbot’s strengths in providing empathetic responses and a user-friendly interface, while areas for improvement included response speed and the naturalness of Korean language responses. Conclusions: The HoMemeTown Dr CareSam chatbot shows potential as a cross-lingual mental health support tool, achieving high user satisfaction and demonstrating comparative advantages over existing digital interventions. However, the study’s limited sample size and short-term nature necessitate further research. 
Future studies should include larger-scale clinical trials, enhanced risk detection features, and integration with existing health care systems to fully realize its potential in supporting mental well-being across different linguistic and cultural contexts. %M 39752663 %R 10.2196/63538 %U https://medinform.jmir.org/2025/1/e63538 %U https://doi.org/10.2196/63538 %U http://www.ncbi.nlm.nih.gov/pubmed/39752663 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 9 %N %P e63494 %T ChatGPT’s Attitude, Knowledge, and Clinical Application in Geriatrics Practice and Education: Exploratory Observational Study %A Cheng,Huai Yong %+ Minneapolis VA Health Care System, 1 Veterans Dr., Minneapolis, MN, 55417, United States, 1 6124672051, wchengwcheng@gmail.com %K ChatGPT %K geriatrics attitude %K ageism %K geriatrics competence %K geriatric syndromes %K polypharmacy %K falls %K aging, older adults %D 2025 %7 3.1.2025 %9 Original Paper %J JMIR Form Res %G English %X Background: The increasing use of ChatGPT in clinical practice and medical education necessitates the evaluation of its reliability, particularly in geriatrics. Objective: This study aimed to evaluate ChatGPT’s trustworthiness in geriatrics through 3 distinct approaches: evaluating ChatGPT’s geriatrics attitude, knowledge, and clinical application with 2 vignettes of geriatric syndromes (polypharmacy and falls). Methods: We used the validated University of California, Los Angeles, geriatrics attitude and knowledge instruments to evaluate ChatGPT’s geriatrics attitude and knowledge and compare its performance with that of medical students, residents, and geriatrics fellows from reported results in the literature. We also evaluated ChatGPT’s application to 2 vignettes of geriatric syndromes (polypharmacy and falls). Results: The mean total score on geriatrics attitude of ChatGPT was significantly lower than that of trainees (medical students, internal medicine residents, and geriatric medicine fellows; 2.7 vs 3.7 on a scale from 1-5; 1=strongly disagree; 5=strongly agree). The mean subscore on positive geriatrics attitude of ChatGPT was higher than that of the trainees (medical students, internal medicine residents, and neurologists; 4.1 vs 3.7 on a scale from 1 to 5 where a higher score means a more positive attitude toward older adults). The mean subscore on negative geriatrics attitude of ChatGPT was lower than that of the trainees and neurologists (1.8 vs 2.8 on a scale from 1 to 5 where a lower subscore means a less negative attitude toward aging). On the University of California, Los Angeles geriatrics knowledge test, ChatGPT outperformed all medical students, internal medicine residents, and geriatric medicine fellows from validated studies (14.7 vs 11.3 with a score range of –18 to +18 where +18 means that all questions were answered correctly). Regarding the polypharmacy vignette, ChatGPT not only demonstrated solid knowledge of potentially inappropriate medications but also accurately identified 7 common potentially inappropriate medications and 5 drug-drug and 3 drug-disease interactions. However, ChatGPT missed 5 drug-disease and 1 drug-drug interaction and produced 2 hallucinations. Regarding the fall vignette, ChatGPT answered 3 of 5 pretests correctly and 2 of 5 pretests partially correctly, identified 6 categories of fall risks, followed fall guidelines correctly, listed 6 key physical examinations, and recommended 6 categories of fall prevention methods. 
Conclusions: This study suggests that ChatGPT can be a valuable supplemental tool in geriatrics, offering reliable information with less age bias, robust geriatrics knowledge, and comprehensive recommendations for managing 2 common geriatric syndromes (polypharmacy and falls) that are consistent with evidence from guidelines, systematic reviews, and other types of studies. ChatGPT’s potential as an educational and clinical resource could significantly benefit trainees, health care providers, and laypeople. Further research using GPT-4o, larger geriatrics question sets, and more geriatric syndromes is needed to expand and confirm these findings before adopting ChatGPT widely for geriatrics education and practice. %M 39752214 %R 10.2196/63494 %U https://formative.jmir.org/2025/1/e63494 %U https://doi.org/10.2196/63494 %U http://www.ncbi.nlm.nih.gov/pubmed/39752214 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 13 %N %P e58457 %T The Transformative Potential of Large Language Models in Mining Electronic Health Records Data: Content Analysis %A Wals Zurita,Amadeo Jesus %A Miras del Rio,Hector %A Ugarte Ruiz de Aguirre,Nerea %A Nebrera Navarro,Cristina %A Rubio Jimenez,Maria %A Muñoz Carmona,David %A Miguez Sanchez,Carlos %+ Servicio Oncologia Radioterápica, Hospital Universitario Virgen Macarena, Andalusian Health Service, Avenida Dr. Fedriani s/n, Seville, 41009, Spain, 34 954712932, amadeoj.wals.sspa@juntadeandalucia.es %K electronic health record %K EHR %K oncology %K radiotherapy %K data mining %K ChatGPT %K large language models %K LLMs %D 2025 %7 2.1.2025 %9 Original Paper %J JMIR Med Inform %G English %X Background: In this study, we evaluate the accuracy, efficiency, and cost-effectiveness of large language models in extracting and structuring information from free-text clinical reports, particularly in identifying and classifying patient comorbidities within oncology electronic health records. We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators. Objective: We specifically compare the performance of gpt-3.5-turbo-1106 and gpt-4-1106-preview models against that of specialized human evaluators. Methods: We implemented a script using the OpenAI application programming interface to extract structured information in JavaScript object notation format from comorbidities reported in 250 personal history reports. These reports were manually reviewed in batches of 50 by 5 specialists in radiation oncology. We compared the results using metrics such as sensitivity, specificity, precision, accuracy, F-value, κ index, and the McNemar test, in addition to examining the common causes of errors in both humans and generative pretrained transformer (GPT) models. Results: The GPT-3.5 model exhibited slightly lower performance compared to physicians across all metrics, though the differences were not statistically significant (McNemar test, P=.79). GPT-4 demonstrated clear superiority in several key metrics (McNemar test, P<.001). Notably, it achieved a sensitivity of 96.8%, compared to 88.2% for GPT-3.5 and 88.8% for physicians. However, physicians marginally outperformed GPT-4 in precision (97.7% vs 96.8%). GPT-4 showed greater consistency, replicating the exact same results in 76% of the reports across 10 repeated analyses, compared to 59% for GPT-3.5, indicating more stable and reliable performance. 
Physicians were more likely to miss explicit comorbidities, while the GPT models more frequently inferred nonexplicit comorbidities, sometimes correctly, though this also resulted in more false positives. Conclusions: This study demonstrates that, with well-designed prompts, the large language models examined can match or even surpass medical specialists in extracting information from complex clinical reports. Their superior efficiency in time and costs, along with easy integration with databases, makes them a valuable tool for large-scale data mining and real-world evidence generation. %M 39746191 %R 10.2196/58457 %U https://medinform.jmir.org/2025/1/e58457 %U https://doi.org/10.2196/58457 %U http://www.ncbi.nlm.nih.gov/pubmed/39746191 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e59435 %T Application of Large Language Models in Medical Training Evaluation—Using ChatGPT as a Standardized Patient: Multimetric Assessment %A Wang,Chenxu %A Li,Shuhan %A Lin,Nuoxi %A Zhang,Xinyu %A Han,Ying %A Wang,Xiandi %A Liu,Di %A Tan,Xiaomei %A Pu,Dan %A Li,Kang %A Qian,Guangwu %A Yin,Rong %+ West China Biomedical Big Data Center, West China Hospital, Sichuan University, No.37, Guoxue Lane, Wuhou District, Chengdu, 610041, China, 86 02881739902, likang@wchscu.cn %K ChatGPT %K artificial intelligence %K standardized patient %K health care %K prompt engineering %K accuracy %K large language models %K performance evaluation %K medical training %K inflammatory bowel disease %D 2025 %7 1.1.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: With the increasing interest in the application of large language models (LLMs) in the medical field, the feasibility of its potential use as a standardized patient in medical assessment is rarely evaluated. Specifically, we delved into the potential of using ChatGPT, a representative LLM, in transforming medical education by serving as a cost-effective alternative to standardized patients, specifically for history-taking tasks. Objective: The study aims to explore ChatGPT’s viability and performance as a standardized patient, using prompt engineering to refine its accuracy and use in medical assessments. Methods: A 2-phase experiment was conducted. The first phase assessed feasibility by simulating conversations about inflammatory bowel disease (IBD) across 3 quality groups (good, medium, and bad). Responses were categorized based on their relevance and accuracy. Each group consisted of 30 runs, with responses scored to determine whether they were related to the inquiries. For the second phase, we evaluated ChatGPT’s performance against specific criteria, focusing on its anthropomorphism, clinical accuracy, and adaptability. Adjustments were made to prompts based on ChatGPT’s response shortcomings, with a comparative analysis of ChatGPT’s performance between original and revised prompts. A total of 300 runs were conducted and compared against standard reference scores. Finally, the generalizability of the revised prompt was tested using other scripts for another 60 runs, together with the exploration of the impact of the used language on the performance of the chatbot. Results: The feasibility test confirmed ChatGPT’s ability to simulate a standardized patient effectively, differentiating among poor, medium, and good medical inquiries with varying degrees of accuracy. 
Score differences between the poor (74.7, SD 5.44) and medium (82.67, SD 5.30) inquiry groups (P<.001) and between the poor and good (85, SD 3.27) inquiry groups (P<.001) were significant at a significance level (α) of .05, while the score differences between the medium and good inquiry groups were not statistically significant (P=.16). The revised prompt significantly improved ChatGPT’s realism, clinical accuracy, and adaptability, leading to a marked reduction in scoring discrepancies. The score accuracy of ChatGPT improved 4.926 times compared to unrevised prompts. The score difference percentage dropped from 29.83% to 6.06%, with a drop in SD from 0.55 to 0.068. The performance of the chatbot on a separate script was acceptable, with an average score difference percentage of 3.21%. Moreover, the performance differences between test groups using various language combinations were found to be insignificant. Conclusions: ChatGPT, as a representative LLM, is a viable tool for simulating standardized patients in medical assessments, with the potential to enhance medical training. By incorporating proper prompts, ChatGPT’s scoring accuracy and response realism significantly improved, approaching the feasibility of actual clinical use. Also, the adopted language had no significant influence on the outcome of the chatbot. %M 39742453 %R 10.2196/59435 %U https://www.jmir.org/2025/1/e59435 %U https://doi.org/10.2196/59435 %U http://www.ncbi.nlm.nih.gov/pubmed/39742453 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54047 %T Slit Lamp Report Generation and Question Answering: Development and Validation of a Multimodal Transformer Model with Large Language Model Integration %A Zhao,Ziwei %A Zhang,Weiyi %A Chen,Xiaolan %A Song,Fan %A Gunasegaram,James %A Huang,Wenyong %A Shi,Danli %A He,Mingguang %A Liu,Na %+ Guangzhou Cadre and Talent Health Management Center, No. 109 Changling Road, Huangpu District, Guangzhou, 510700, China, 86 18701985445, 1256695904@qq.com %K large language model %K slit lamp %K medical report generation %K question answering %D 2024 %7 30.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models have shown remarkable efficacy in various medical research and clinical applications. However, their skills in medical image recognition and subsequent report generation or question answering (QA) remain limited. Objective: We aim to finetune a multimodal, transformer-based model for generating medical reports from slit lamp images and develop a QA system using Llama2. We term this entire process slit lamp–GPT. Methods: Our research used a dataset of 25,051 slit lamp images from 3409 participants, paired with their corresponding physician-created medical reports. We used these data, split into training, validation, and test sets, to finetune the Bootstrapping Language-Image Pre-training framework toward report generation. The generated text reports and human-posed questions were then input into Llama2 for subsequent QA. We evaluated performance using quantitative metrics (including BLEU [bilingual evaluation understudy], CIDEr [consensus-based image description evaluation], ROUGE-L [Recall-Oriented Understudy for Gisting Evaluation—Longest Common Subsequence], SPICE [Semantic Propositional Image Caption Evaluation], accuracy, sensitivity, specificity, precision, and F1-score) and the subjective assessments of two experienced ophthalmologists on a 1-3 scale (1 referring to high quality). 
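The evaluation just described scores generated slit lamp reports with metrics such as BLEU and ROUGE-L alongside expert review. The snippet below is a minimal, hypothetical BLEU-1 to BLEU-4 example using NLTK on invented strings; it only illustrates the metric, not the study's actual evaluation pipeline.

```python
# Minimal sketch (not the study's evaluation code): scoring one generated report
# against a reference with BLEU-1..4; real use would loop over the full test set.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = "anterior chamber deep and quiet, nuclear cataract grade two".split()
candidate = "anterior chamber quiet, nuclear cataract grade two noted".split()

smooth = SmoothingFunction().method1  # avoids zero scores on short texts
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))
    score = sentence_bleu([reference], candidate, weights=weights,
                          smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```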
Results: We identified 50 conditions related to diseases or postoperative complications through keyword matching in initial reports. The refined slit lamp–GPT model demonstrated BLEU scores (1-4) of 0.67, 0.66, 0.65, and 0.65, respectively, with a CIDEr score of 3.24, a ROUGE (Recall-Oriented Understudy for Gisting Evaluation) score of 0.61, and a Semantic Propositional Image Caption Evaluation score of 0.37. The most frequently identified conditions were cataracts (22.95%), age-related cataracts (22.03%), and conjunctival concretion (13.13%). Disease classification metrics demonstrated an overall accuracy of 0.82 and an F1-score of 0.64, with high accuracies (≥0.9) observed for intraocular lens, conjunctivitis, and chronic conjunctivitis, and high F1-scores (≥0.9) observed for cataract and age-related cataract. For both report generation and QA components, the two evaluating ophthalmologists reached substantial agreement, with κ scores between 0.71 and 0.84. In assessing 100 generated reports, they awarded scores of 1.36 for both completeness and correctness; 64% (64/100) were considered “entirely good,” and 93% (93/100) were “acceptable.” In the evaluation of 300 generated answers to questions, the scores were 1.33 for completeness, 1.14 for correctness, and 1.15 for possible harm, with 66.3% (199/300) rated as “entirely good” and 91.3% (274/300) as “acceptable.” Conclusions: This study introduces the slit lamp–GPT model for report generation and subsequent QA, highlighting the potential of large language models to assist ophthalmologists and patients. %M 39753218 %R 10.2196/54047 %U https://www.jmir.org/2024/1/e54047 %U https://doi.org/10.2196/54047 %U http://www.ncbi.nlm.nih.gov/pubmed/39753218 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e64081 %T Effects of Large Language Model–Based Offerings on the Well-Being of Students: Qualitative Study %A Selim,Rania %A Basu,Arunima %A Anto,Ailin %A Foscht,Thomas %A Eisingerich,Andreas Benedikt %+ Faculty of Medicine, Imperial College London, Exhibition Rd, South Kensington, London, SW7 2AZ, United Kingdom, 44 020 7589 5111, rania.selim18@imperial.ac.uk %K large language models %K ChatGPT %K functional support %K escapism %K fantasy fulfillment %K angst %K despair %K anxiety %K deskilling %K pessimism about the future %D 2024 %7 27.12.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: In recent years, the adoption of large language model (LLM) applications, such as ChatGPT, has seen a significant surge, particularly among students. These artificial intelligence–driven tools offer unprecedented access to information and conversational assistance, which is reshaping the way students engage with academic content and manage the learning process. Despite the growing prevalence of LLMs and reliance on these technologies, there remains a notable gap in qualitative in-depth research examining the emotional and psychological effects of LLMs on users’ mental well-being. Objective: In order to address these emerging and critical issues, this study explores the role of LLM-based offerings, such as ChatGPT, in students’ lives, namely, how postgraduate students use such offerings and how they make students feel, and examines the impact on students’ well-being. 
Methods: To address the aims of this study, we employed an exploratory approach, using in-depth, semistructured, qualitative, face-to-face interviews with 23 users (13 female and 10 male; mean age 23 years, SD 1.55 years) of ChatGPT-4o, who were also university students at the time (inclusion criteria). Interviewees were invited to reflect upon how they use ChatGPT, how it makes them feel, and how it may influence their lives. Results: The current findings from the exploratory qualitative interviews showed that users appreciate the functional support (8/23, 35%), escapism (8/23, 35%), and fantasy fulfillment (7/23, 30%) they receive from LLM-based offerings, such as ChatGPT, but at the same time, such usage is seen as a “double-edged sword,” with respondents indicating anxiety (8/23, 35%), dependence (11/23, 48%), concerns about deskilling (12/23, 52%), and angst or pessimism about the future (11/23, 48%). Conclusions: This study employed exploratory in-depth interviews to examine how the usage of LLM-based offerings, such as ChatGPT, makes users feel and assess the effects of using LLM-based offerings on mental well-being. The findings of this study show that students used ChatGPT to make their lives easier and felt a sense of cognitive escapism and even fantasy fulfillment, but this came at the cost of feeling anxious and pessimistic about the future. %M 39729617 %R 10.2196/64081 %U https://formative.jmir.org/2024/1/e64081 %U https://doi.org/10.2196/64081 %U http://www.ncbi.nlm.nih.gov/pubmed/39729617 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e66114 %T Large Language Models in Worldwide Medical Exams: Platform Development and Comprehensive Analysis %A Zong,Hui %A Wu,Rongrong %A Cha,Jiaxue %A Wang,Jiao %A Wu,Erman %A Li,Jiakun %A Zhou,Yi %A Zhang,Chi %A Feng,Weizhe %A Shen,Bairong %+ Joint Laboratory of Artificial Intelligence for Critical Care Medicine, Department of Critical Care Medicine and Institutes for Systems Genetics, Frontiers Science Center for Disease-related Molecular Network, West China Hospital, Sichuan University, No. 37 Guoxue Alley, Chengdu, 610041, China, 86 28 61528682, bairong.shen@scu.edu.cn %K large language models %K LLMs %K generative pretrained transformer %K ChatGPT %K medical exam %K medical education %K artificial intelligence %K AI %D 2024 %7 27.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) are increasingly integrated into medical education, with transformative potential for learning and assessment. However, their performance across diverse medical exams globally has remained underexplored. Objective: This study aims to introduce MedExamLLM, a comprehensive platform designed to systematically evaluate the performance of LLMs on medical exams worldwide. Specifically, the platform seeks to (1) compile and curate performance data for diverse LLMs on worldwide medical exams; (2) analyze trends and disparities in LLM capabilities across geographic regions, languages, and contexts; and (3) provide a resource for researchers, educators, and developers to explore and advance the integration of artificial intelligence in medical education. Methods: A systematic search was conducted on April 25, 2024, in the PubMed database to identify relevant publications. Inclusion criteria encompassed peer-reviewed, English-language, original research articles that evaluated at least one LLM on medical exams. 
Exclusion criteria included review articles, non-English publications, preprints, and studies without relevant data on LLM performance. The screening process for candidate publications was independently conducted by 2 researchers to ensure accuracy and reliability. Data, including exam information, data process information, model performance, data availability, and references, were manually curated, standardized, and organized. These curated data were integrated into the MedExamLLM platform, enabling its functionality to visualize and analyze LLM performance across geographic, linguistic, and exam characteristics. The web platform was developed with a focus on accessibility, interactivity, and scalability to support continuous data updates and user engagement. Results: A total of 193 articles were included for final analysis. MedExamLLM comprised information for 16 LLMs on 198 medical exams conducted in 28 countries across 15 languages from the year 2009 to the year 2023. The United States accounted for the highest number of medical exams and related publications, with English being the dominant language used in these exams. The Generative Pretrained Transformer (GPT) series models, especially GPT-4, demonstrated superior performance, achieving pass rates significantly higher than other LLMs. The analysis revealed significant variability in the capabilities of LLMs across different geographic and linguistic contexts. Conclusions: MedExamLLM is an open-source, freely accessible, and publicly available online platform providing comprehensive performance evaluation information and evidence knowledge about LLMs on medical exams around the world. The MedExamLLM platform serves as a valuable resource for educators, researchers, and developers in the fields of clinical medicine and artificial intelligence. By synthesizing evidence on LLM capabilities, the platform provides valuable insights to support the integration of artificial intelligence into medical education. Limitations include potential biases in the data source and the exclusion of non-English literature. Future research should address these gaps and explore methods to enhance LLM performance in diverse contexts. %M 39729356 %R 10.2196/66114 %U https://www.jmir.org/2024/1/e66114 %U https://doi.org/10.2196/66114 %U http://www.ncbi.nlm.nih.gov/pubmed/39729356 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e59843 %T Large Language Models May Help Patients Understand Peer-Reviewed Scientific Articles About Ophthalmology: Development and Usability Study %A Kianian,Reza %A Sun,Deyu %A Rojas-Carabali,William %A Agrawal,Rupesh %A Tsui,Edmund %+ Stein Eye Institute, Department of Ophthalmology, David Geffen School of Medicine, 200 Stein Plaza, Los Angeles, CA, 90095, United States, 1 310 825 5440, etsui@mednet.ucla.edu %K uveitis %K artificial intelligence %K ChatGPT %K readability %K peer review %K large language models %K LLMs %K health literacy %K patient education %K medical information %K ophthalmology %D 2024 %7 24.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Adequate health literacy has been shown to be important for the general health of a population. To address this, it is recommended that patient-targeted medical information is written at a sixth-grade reading level. To make well-informed decisions about their health, patients may want to interact directly with peer-reviewed open access scientific articles. 
However, studies have shown that such text is often written with highly complex language above the levels that can be comprehended by the general population. Previously, we have published on the use of large language models (LLMs) in easing the readability of patient-targeted health information on the internet. In this study, we continue to explore the advantages of LLMs in patient education. Objective: This study aimed to explore the use of LLMs, specifically ChatGPT (OpenAI), to enhance the readability of peer-reviewed scientific articles in the field of ophthalmology. Methods: A total of 12 open access, peer-reviewed papers published by the senior authors of this study (ET and RA) were selected. Readability was assessed using the Flesch-Kincaid Grade Level and Simple Measure of Gobbledygook tests. ChatGPT 4.0 was asked “I will give you the text of a peer-reviewed scientific paper. Considering that the recommended readability of the text is 6th grade, can you simplify the following text so that a layperson reading this text can fully comprehend it? - Insert Manuscript Text -”. Appropriateness was evaluated by the 2 uveitis-trained ophthalmologists. Statistical analysis was performed in Microsoft Excel. Results: ChatGPT significantly lowered the readability and length of the selected papers from 15th to 7th grade (P<.001) while generating responses that were deemed appropriate by expert ophthalmologists. Conclusions: LLMs show promise in improving health literacy by enhancing the accessibility of peer-reviewed scientific articles and allowing the general population to interact directly with medical literature. %M 39719077 %R 10.2196/59843 %U https://www.jmir.org/2024/1/e59843 %U https://doi.org/10.2196/59843 %U http://www.ncbi.nlm.nih.gov/pubmed/39719077 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e63129 %T Performance of ChatGPT-4o on the Japanese Medical Licensing Examination: Evaluation of Accuracy in Text-Only and Image-Based Questions %A Miyazaki,Yuki %A Hata,Masahiro %A Omori,Hisaki %A Hirashima,Atsuya %A Nakagawa,Yuta %A Eto,Mitsuhiro %A Takahashi,Shun %A Ikeda,Manabu %K medical education %K artificial intelligence %K clinical decision-making %K GPT-4o %K medical licensing examination %K Japan %K images %K accuracy %K AI technology %K application %K decision-making %K image-based %K reliability %K ChatGPT %D 2024 %7 24.12.2024 %9 %J JMIR Med Educ %G English %X This study evaluated the performance of ChatGPT with GPT-4 Omni (GPT-4o) on the 118th Japanese Medical Licensing Examination. The study focused on both text-only and image-based questions. The model demonstrated a high level of accuracy overall, with no significant difference in performance between text-only and image-based questions. Common errors included clinical judgment mistakes and prioritization issues, underscoring the need for further improvement in the integration of artificial intelligence into medical education and practice. %R 10.2196/63129 %U https://mededu.jmir.org/2024/1/e63129 %U https://doi.org/10.2196/63129 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e65123 %T Authors’ Reply: Reassessing AI in Medicine: Exploring the Capabilities of AI in Academic Abstract Synthesis %A Hsu,Tien-Wei %A Liang,Chih-Sung %+ Department of Psychiatry, Tri-service Hospital, Beitou Branch, No. 
60, Xinmin Road, Beitou District, Taipei, 112, Taiwan, 886 2 28959808, lcsyfw@gmail.com %K ChatGPT %K AI-generated scientific content %K plagiarism %K AI %K artificial intelligence %K NLP %K natural language processing %K LLM %K language model %K text %K textual %K generation %K generative %K extract %K extraction %K scientific research %K academic research %K publication %K abstract %K comparative analysis %K reviewer bias %D 2024 %7 23.12.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X N/A %R 10.2196/65123 %U https://www.jmir.org/2024/1/e65123 %U https://doi.org/10.2196/65123 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55920 %T Reassessing AI in Medicine: Exploring the Capabilities of AI in Academic Abstract Synthesis %A Wang,Zijian %A Zhou,Chunyang %+ Department of Radiation Oncology, Qilu Hospital (Qingdao), Cheeloo College of Medicine, Shandong University, 758 Hefei Road, Qingdao, Qingdao, 266000, China, 86 18561813085, chunyangzhou29@163.com %K ChatGPT %K AI-generated scientific content %K plagiarism %K AI %K artificial intelligence %K NLP %K natural language processing %K LLM %K language model %K text %K textual %K generation %K generative %K extract %K extraction %K scientific research %K academic research %K publication %K abstract %K comparative analysis %K reviewer bias %D 2024 %7 23.12.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X %R 10.2196/55920 %U https://www.jmir.org/2024/1/e55920 %U https://doi.org/10.2196/55920 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60684 %T AI in Dental Radiology—Improving the Efficiency of Reporting With ChatGPT: Comparative Study %A Stephan,Daniel %A Bertsch,Annika %A Burwinkel,Matthias %A Vinayahalingam,Shankeeth %A Al-Nawas,Bilal %A Kämmerer,Peer W %A Thiem,Daniel GE %+ Department of Oral and Maxillofacial Surgery, Facial Plastic Surgery, University Medical Centre of the Johannes Gutenberg-University Mainz, Augustusplatz 2, Mainz, 55131, Germany, 49 6131177038, stephand@uni-mainz.de %K artificial intelligence %K ChatGPT %K radiology report %K dental radiology %K dental orthopantomogram %K panoramic radiograph %K dental %K radiology %K chatbot %K medical documentation %K medical application %K imaging %K disease detection %K clinical decision support %K natural language processing %K medical licensing %K dentistry %K patient care %D 2024 %7 23.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Structured and standardized documentation is critical for accurately recording diagnostic findings, treatment plans, and patient progress in health care. Manual documentation can be labor-intensive and error-prone, especially under time constraints, prompting interest in the potential of artificial intelligence (AI) to automate and optimize these processes, particularly in medical documentation. Objective: This study aimed to assess the effectiveness of ChatGPT (OpenAI) in generating radiology reports from dental panoramic radiographs, comparing the performance of AI-generated reports with those manually created by dental students. Methods: A total of 100 dental students were tasked with analyzing panoramic radiographs and generating radiology reports manually or assisted by ChatGPT using a standardized prompt derived from a diagnostic checklist. Results: Reports generated by ChatGPT showed a high degree of textual similarity to reference reports; however, they often lacked critical diagnostic information typically included in reports authored by students. 
Despite this, the AI-generated reports were consistent in being error-free and matched the readability of student-generated reports. Conclusions: The findings from this study suggest that ChatGPT has considerable potential for generating radiology reports, although it currently faces challenges in accuracy and reliability. This underscores the need for further refinement in the AI’s prompt design and the development of robust validation mechanisms to enhance its use in clinical settings. %M 39714078 %R 10.2196/60684 %U https://www.jmir.org/2024/1/e60684 %U https://doi.org/10.2196/60684 %U http://www.ncbi.nlm.nih.gov/pubmed/39714078 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e67056 %T Automated Pathologic TN Classification Prediction and Rationale Generation From Lung Cancer Surgical Pathology Reports Using a Large Language Model Fine-Tuned With Chain-of-Thought: Algorithm Development and Validation Study %A Kim,Sanghwan %A Jang,Sowon %A Kim,Borham %A Sunwoo,Leonard %A Kim,Seok %A Chung,Jin-Haeng %A Nam,Sejin %A Cho,Hyeongmin %A Lee,Donghyoung %A Lee,Keehyuck %A Yoo,Sooyoung %+ Office of eHealth Research and Business, Seoul National University Bundang Hospital, Healthcare Innovation Park, Seongnam, 13605, Republic of Korea, 82 317878980, yoosoo0@snubh.org %K AJCC Cancer Staging Manual 8th edition %K American Joint Committee on Cancer %K large language model %K chain-of-thought %K rationale %K lung cancer %K report analysis %K AI %K surgery %K pathology reports %K tertiary hospital %K generative language models %K efficiency %K accuracy %K automated %D 2024 %7 20.12.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Traditional rule-based natural language processing approaches in electronic health record systems are effective but are often time-consuming and prone to errors when handling unstructured data. This is primarily due to the substantial manual effort required to parse and extract information from diverse types of documentation. Recent advancements in large language model (LLM) technology have made it possible to automatically interpret medical context and support pathologic staging. However, existing LLMs encounter challenges in rapidly adapting to specialized guideline updates. In this study, we fine-tuned an LLM specifically for lung cancer pathologic staging, enabling it to incorporate the latest guidelines for pathologic TN classification. Objective: This study aims to evaluate the performance of fine-tuned generative language models in automatically inferring pathologic TN classifications and extracting their rationale from lung cancer surgical pathology reports. By addressing the inefficiencies and extensive parsing efforts associated with rule-based methods, this approach seeks to enable rapid and accurate reclassification aligned with the latest cancer staging guidelines. Methods: We conducted a comparative performance evaluation of 6 open-source LLMs for automated TN classification and rationale generation, using 3216 deidentified lung cancer surgical pathology reports based on the American Joint Committee on Cancer (AJCC) Cancer Staging Manual, 8th edition, collected from a tertiary hospital. The dataset was preprocessed by segmenting each report according to lesion location and morphological diagnosis. Performance was assessed using exact match ratio (EMR) and semantic match ratio (SMR) as evaluation metrics, which measure classification accuracy and the contextual alignment of the generated rationales, respectively. 
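The evaluation just described relies on an exact match ratio for the predicted TN labels and a semantic match ratio for the generated rationales. The snippet below is a minimal, hypothetical sketch of both metrics; the semantic similarity here uses TF-IDF cosine similarity with an arbitrary threshold as a stand-in, and the study's actual SMR procedure may differ.

```python
# Minimal sketch (not the authors' implementation): exact match ratio for labels
# and an approximate semantic match ratio for rationales on toy examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def exact_match_ratio(predicted: list[str], reference: list[str]) -> float:
    return sum(p == r for p, r in zip(predicted, reference)) / len(reference)

def semantic_match_ratio(pred_rationales: list[str], ref_rationales: list[str],
                         threshold: float = 0.7) -> float:
    # Stand-in similarity: TF-IDF cosine similarity with an arbitrary threshold.
    vec = TfidfVectorizer().fit(pred_rationales + ref_rationales)
    hits = 0
    for p, r in zip(pred_rationales, ref_rationales):
        sim = cosine_similarity(vec.transform([p]), vec.transform([r]))[0, 0]
        hits += sim >= threshold
    return hits / len(ref_rationales)

preds = ["pT2aN0", "pT1bN1"]
refs = ["pT2aN0", "pT1cN1"]
print("EMR:", exact_match_ratio(preds, refs))
print("SMR:", semantic_match_ratio(
    ["3.2 cm tumor without nodal involvement supports T2a N0",
     "tumor 1.8 cm with one hilar node positive"],
    ["tumor size 3.2 cm, no lymph node metastasis, hence T2a N0",
     "1.8 cm tumor; metastasis in one ipsilateral hilar node"]))
```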
Results: Among the 6 models, the Orca2_13b model achieved the highest performance with an EMR of 0.934 and an SMR of 0.864. The Orca2_7b model also demonstrated strong performance, recording an EMR of 0.914 and an SMR of 0.854. In contrast, the Llama2_7b model achieved an EMR of 0.864 and an SMR of 0.771, while the Llama2_13b model showed an EMR of 0.762 and an SMR of 0.690. The Mistral_7b and Llama3_8b models, on the other hand, showed lower performance, with EMRs of 0.572 and 0.489, and SMRs of 0.377 and 0.456, respectively. Overall, the Orca2 models consistently outperformed the others in both TN stage classification and rationale generation. Conclusions: The generative language model approach presented in this study has the potential to enhance and automate TN classification in complex cancer staging, supporting both clinical practice and oncology data curation. With additional fine-tuning based on cancer-specific guidelines, this approach can be effectively adapted to other cancer types. %M 39705675 %R 10.2196/67056 %U https://medinform.jmir.org/2024/1/e67056 %U https://doi.org/10.2196/67056 %U http://www.ncbi.nlm.nih.gov/pubmed/39705675 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e66648 %T Large Language Models in Gastroenterology: Systematic Review %A Gong,Eun Jeong %A Bang,Chang Seok %A Lee,Jae Jun %A Park,Jonghyung %A Kim,Eunsil %A Kim,Subeen %A Kimm,Minjae %A Choi,Seoung-Ho %+ Department of Internal Medicine, Hallym University College of Medicine, sakjuro, 77, Chuncheon, Republic of Korea, 82 332405000, csbang@hallym.ac.kr %K large language model %K LLM %K deep learning %K artificial intelligence %K AI %K endoscopy %K gastroenterology %K clinical practice %K systematic review %K diagnostic %K accuracy %K patient engagement %K emotional support %K data privacy %K diagnosis %K clinical reasoning %D 2024 %7 20.12.2024 %9 Review %J J Med Internet Res %G English %X Background: As health care continues to evolve with technological advancements, the integration of artificial intelligence into clinical practices has shown promising potential to enhance patient care and operational efficiency. Among the forefront of these innovations are large language models (LLMs), a subset of artificial intelligence designed to understand, generate, and interact with human language at an unprecedented scale. Objective: This systematic review describes the role of LLMs in improving diagnostic accuracy, automating documentation, and advancing specialist education and patient engagement within the field of gastroenterology and gastrointestinal endoscopy. Methods: Core databases including MEDLINE through PubMed, Embase, and Cochrane Central registry were searched using keywords related to LLMs (from inception to April 2024). Studies were included if they satisfied the following criteria: (1) any type of studies that investigated the potential role of LLMs in the field of gastrointestinal endoscopy or gastroenterology, (2) studies published in English, and (3) studies in full-text format. The exclusion criteria were as follows: (1) studies that did not report the potential role of LLMs in the field of gastrointestinal endoscopy or gastroenterology, (2) case reports and review papers, (3) ineligible research objects (eg, animals or basic research), and (4) insufficient data regarding the potential role of LLMs. Risk of Bias in Non-Randomized Studies—of Interventions was used to evaluate the quality of the identified studies. 
Results: Overall, 21 studies on the potential role of LLMs in gastrointestinal disorders were included in the systematic review, and narrative synthesis was done because of heterogeneity in the specified aims and methodology in each included study. The overall risk of bias was low in 5 studies and moderate in 16 studies. The ability of LLMs to disseminate general medical information, offer advice for consultations, generate procedure reports automatically, or draw conclusions about the presumptive diagnosis of complex medical illnesses was demonstrated by the systematic review. Despite promising benefits, such as increased efficiency and improved patient outcomes, challenges related to data privacy, accuracy, and interdisciplinary collaboration remain. Conclusions: We highlight the importance of navigating these challenges to fully leverage LLMs in transforming gastrointestinal endoscopy practices. Trial Registration: PROSPERO 581772; https://www.crd.york.ac.uk/prospero/ %M 39705703 %R 10.2196/66648 %U https://www.jmir.org/2024/1/e66648 %U https://doi.org/10.2196/66648 %U http://www.ncbi.nlm.nih.gov/pubmed/39705703 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e60665 %T An Automatic and End-to-End System for Rare Disease Knowledge Graph Construction Based on Ontology-Enhanced Large Language Models: Development Study %A Cao,Lang %A Sun,Jimeng %A Cross,Adam %K rare disease %K clinical informatics %K LLM %K natural language processing %K machine learning %K artificial intelligence %K large language models %K data extraction %K ontologies %K knowledge graphs %K text mining %D 2024 %7 18.12.2024 %9 %J JMIR Med Inform %G English %X Background: Rare diseases affect millions worldwide but sometimes face limited research focus individually due to low prevalence. Many rare diseases do not have specific International Classification of Diseases, Ninth Edition (ICD-9) and Tenth Edition (ICD-10) codes and therefore cannot be reliably extracted from granular fields like “Diagnosis” and “Problem List” entries, which complicates tasks that require identification of patients with these conditions, including clinical trial recruitment and research efforts. Recent advancements in large language models (LLMs) have shown promise in automating the extraction of medical information, offering the potential to improve medical research, diagnosis, and management. However, most LLMs lack professional medical knowledge, especially concerning specific rare diseases, and cannot effectively manage rare disease data in its various ontological forms, making them unsuitable for these tasks. Objective: Our aim is to create an end-to-end system called automated rare disease mining (AutoRD), which automates the extraction of rare disease–related information from medical text, focusing on entities and their relations to other medical concepts, such as signs and symptoms. AutoRD integrates up-to-date ontologies with other structured knowledge and demonstrates superior performance in rare disease extraction tasks. We conducted various experiments to evaluate AutoRD’s performance, aiming to surpass common LLMs and traditional methods. Methods: AutoRD is a pipeline system that involves data preprocessing, entity extraction, relation extraction, entity calibration, and knowledge graph construction. We implemented this system using GPT-4 and medical knowledge graphs developed from the open-source Human Phenotype and Orphanet ontologies, using techniques such as chain-of-thought reasoning and prompt engineering. 
We quantitatively evaluated our system’s performance in entity extraction, relation extraction, and knowledge graph construction. The experiment used the well-curated dataset RareDis2023, which contains medical literature focused on rare disease entities and their relations, making it an ideal dataset for training and testing our methodology. Results: On the RareDis2023 dataset, AutoRD achieved an overall entity extraction F1-score of 56.1% and a relation extraction F1-score of 38.6%, marking a 14.4% improvement over the baseline LLM. Notably, the F1-score for rare disease entity extraction reached 83.5%, indicating high precision and recall in identifying rare disease mentions. These results demonstrate the effectiveness of integrating LLMs with medical ontologies in extracting complex rare disease information. Conclusions: AutoRD is an automated end-to-end system for extracting rare disease information from text to build knowledge graphs, addressing critical limitations of existing LLMs by improving identification of these diseases and connecting them to related clinical features. This work underscores the significant potential of LLMs in transforming health care, particularly in the rare disease domain. By leveraging ontology-enhanced LLMs, AutoRD constructs a robust medical knowledge base that incorporates up-to-date rare disease information, facilitating improved identification of patients and resulting in more inclusive research and trial candidacy efforts. %R 10.2196/60665 %U https://medinform.jmir.org/2024/1/e60665 %U https://doi.org/10.2196/60665 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e57592 %T Evaluating Bard Gemini Pro and GPT-4 Vision Against Student Performance in Medical Visual Question Answering: Comparative Case Study %A Roos,Jonas %A Martin,Ron %A Kaczmarczyk,Robert %K medical education %K visual question answering %K image analysis %K large language model %K LLM %K student %K performance %K comparative %K case study %K artificial intelligence %K AI %K ChatGPT %K effectiveness %K diagnostic %K training %K accuracy %K utility %K image-based %K question %K image %K AMBOSS %K English %K German %K question and answer %K Python %K AI in health care %K health care %D 2024 %7 17.12.2024 %9 %J JMIR Form Res %G English %X Background: The rapid development of large language models (LLMs) such as OpenAI’s ChatGPT has significantly impacted medical research and education. These models have shown potential in fields ranging from radiological imaging interpretation to medical licensing examination assistance. Recently, LLMs have been enhanced with image recognition capabilities. Objective: This study aims to critically examine the effectiveness of these LLMs in medical diagnostics and training by assessing their accuracy and utility in answering image-based questions from medical licensing examinations. Methods: This study analyzed 1070 image-based multiple-choice questions from the AMBOSS learning platform, divided into 605 in English and 465 in German. Customized prompts in both languages directed the models to interpret medical images and provide the most likely diagnosis. Student performance data were obtained from AMBOSS, including metrics such as the “student passed mean” and “majority vote.” Statistical analysis was conducted using Python (Python Software Foundation), with key libraries for data manipulation and visualization. 
Results: GPT-4 1106 Vision Preview (OpenAI) outperformed Bard Gemini Pro (Google), correctly answering 56.9% (609/1070) of questions compared to Bard’s 44.6% (477/1070), a statistically significant difference (χ2₁=32.1, P<.001). However, GPT-4 1106 left 16.1% (172/1070) of questions unanswered, significantly higher than Bard’s 4.1% (44/1070; χ2₁=83.1, P<.001). When considering only answered questions, GPT-4 1106’s accuracy increased to 67.8% (609/898), surpassing both Bard (477/1026, 46.5%; χ2₁=87.7, P<.001) and the student passed mean of 63% (674/1070, SE 1.48%; χ2₁=4.8, P=.03). Language-specific analysis revealed both models performed better in German than English, with GPT-4 1106 showing greater accuracy in German (282/465, 60.65% vs 327/605, 54.1%; χ2₁=4.4, P=.04) and Bard Gemini Pro exhibiting a similar trend (255/465, 54.8% vs 222/605, 36.7%; χ2₁=34.3, P<.001). The student majority vote achieved an overall accuracy of 94.5% (1011/1070), significantly outperforming both artificial intelligence models (GPT-4 1106: χ2₁=408.5, P<.001; Bard Gemini Pro: χ2₁=626.6, P<.001). Conclusions: Our study shows that GPT-4 1106 Vision Preview and Bard Gemini Pro have potential in medical visual question-answering tasks and to serve as a support for students. However, their performance varies depending on the language used, with a preference for German. They also have limitations in responding to non-English content. The accuracy rates, particularly when compared to student responses, highlight the potential of these models in medical education, yet the need for further optimization and understanding of their limitations in diverse linguistic contexts remains critical. %R 10.2196/57592 %U https://formative.jmir.org/2024/1/e57592 %U https://doi.org/10.2196/57592 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60794 %T Investigating Older Adults’ Perceptions of AI Tools for Medication Decisions: Vignette-Based Experimental Survey %A Vordenberg,Sarah E %A Nichols,Julianna %A Marshall,Vincent D %A Weir,Kristie Rebecca %A Dorsch,Michael P %+ College of Pharmacy, University of Michigan, 428 Church St, Ann Arbor, MI, 48109, United States, 1 734 763 6691, skelling@med.umich.edu %K older adults %K survey %K decisions %K artificial intelligence %K vignette %K drug %K pharmacology %K pharmaceutic %K medication %K decision-making %K geriatric %K aging %K surveys %K attitude %K perception %K perspective %K recommendation %K electronic heath record %D 2024 %7 16.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Given the public release of large language models, research is needed to explore whether older adults would be receptive to personalized medication advice given by artificial intelligence (AI) tools. Objective: This study aims to identify predictors of the likelihood of older adults stopping a medication and the influence of the source of the information. Methods: We conducted a web-based experimental survey in which US participants aged ≥65 years were asked to report their likelihood of stopping a medication based on the source of information using a 6-point Likert scale (scale anchors: 1=not at all likely; 6=extremely likely). In total, 3 medications were presented in a randomized order: aspirin (risk of bleeding), ranitidine (cancer-causing chemical), or simvastatin (lack of benefit with age). 
In total, 5 sources of information were presented: primary care provider (PCP), pharmacist, AI that connects with the electronic health record (EHR) and provides advice to the PCP (“EHR-PCP”), AI with EHR access that directly provides advice (“EHR-Direct”), and AI that asks questions and then directly provides advice (“Questions-Direct”). We calculated descriptive statistics to identify participants who were extremely likely (score 6) to stop the medication and used logistic regression to identify demographic predictors of being likely (scores 4-6) as opposed to unlikely (scores 1-3) to stop a medication. Results: Older adults (n=1245) reported being extremely likely to stop a medication based on a PCP’s recommendation (n=748, 60.1% [aspirin] to n=858, 68.9% [ranitidine]) compared to a pharmacist (n=227, 18.2% [simvastatin] to n=361, 29% [ranitidine]). They were infrequently extremely likely to stop a medication when recommended by AI (EHR-PCP: n=182, 14.6% [aspirin] to n=289, 23.2% [ranitidine]; EHR-Direct: n=118, 9.5% [simvastatin] to n=212, 17% [ranitidine]; Questions-Direct: n=121, 9.7% [aspirin] to n=204, 16.4% [ranitidine]). In adjusted analyses, characteristics that increased the likelihood of following an AI recommendation included being Black or African American as compared to White (Questions-Direct: odds ratio [OR] 1.28, 95% CI 1.06-1.54 to EHR-PCP: OR 1.42, 95% CI 1.17-1.73), having higher self-reported health (EHR-PCP: OR 1.09, 95% CI 1.01-1.18 to EHR-Direct: OR 1.13, 95% CI 1.05-1.23), having higher confidence in using an EHR (Questions-Direct: OR 1.36, 95% CI 1.16-1.58 to EHR-PCP: OR 1.55, 95% CI 1.33-1.80), and having higher confidence using apps (EHR-Direct: OR 1.38, 95% CI 1.18-1.62 to EHR-PCP: OR 1.49, 95% CI 1.27-1.74). Older adults with higher health literacy were less likely to stop a medication when recommended by AI (EHR-PCP: OR 0.81, 95% CI 0.75-0.88 to EHR-Direct: OR 0.85, 95% CI 0.78-0.92). Conclusions: Older adults have reservations about following an AI recommendation to stop a medication. However, individuals who are Black or African American, have higher self-reported health, or have higher confidence in using an EHR or apps may be receptive to AI-based medication recommendations. %R 10.2196/60794 %U https://www.jmir.org/2024/1/e60794 %U https://doi.org/10.2196/60794 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52597 %T Large Language Models and Empathy: Systematic Review %A Sorin,Vera %A Brin,Dana %A Barash,Yiftach %A Konen,Eli %A Charney,Alexander %A Nadkarni,Girish %A Klang,Eyal %+ Department of Radiology, Mayo Clinic, 200 First Street SW, Rochester, MN, 55905, United States, 1 5072842511, verasrn@gmail.com %K empathy %K LLMs %K AI %K ChatGPT %K review methods %K review methodology %K systematic review %K scoping %K synthesis %K foundation models %K text-based %K human interaction %K emotional intelligence %K objective metrics %K human assessment %K emotions %K healthcare %K cognitive %K PRISMA %D 2024 %7 11.12.2024 %9 Review %J J Med Internet Res %G English %X Background: Empathy, a fundamental aspect of human interaction, is characterized as the ability to experience another being’s emotions within oneself. In health care, empathy is fundamental to the interaction between health care professionals and patients. It is a quality considered unique to humans, one that large language models (LLMs) are believed to lack. Objective: We aimed to review the literature on the capacity of LLMs in demonstrating empathy. 
Methods: We conducted a literature search on MEDLINE, Google Scholar, PsyArXiv, medRxiv, and arXiv between December 2022 and February 2024. We included English-language full-length publications that evaluated empathy in LLMs’ outputs. We excluded papers evaluating other topics related to emotional intelligence that were not specifically empathy. The included studies’ results, including the LLMs used, performance in empathy tasks, and limitations of the models, along with the studies’ metadata, were summarized. Results: A total of 12 studies published in 2023 met the inclusion criteria. ChatGPT-3.5 (OpenAI) was evaluated in all studies, with 6 studies comparing it with other LLMs such as GPT-4, LLaMA (Meta), and fine-tuned chatbots. Seven studies focused on empathy within a medical context. The studies reported LLMs to exhibit elements of empathy, including emotion recognition and emotional support in diverse contexts. Evaluation metrics included automatic measures such as Recall-Oriented Understudy for Gisting Evaluation (ROUGE) and Bilingual Evaluation Understudy (BLEU), as well as subjective human evaluation. Some studies compared performance on empathy with humans, while others compared between different models. In some cases, LLMs were observed to outperform humans in empathy-related tasks. For example, ChatGPT-3.5 was evaluated for its responses to patients’ questions from social media, where ChatGPT’s responses were preferred over those of humans in 78.6% of cases. Other studies used subjective scores assigned by human readers. One study reported a mean empathy score of 1.84-1.9 (scale 0-2) for their fine-tuned LLM, while a different study evaluating ChatGPT-based chatbots reported a mean human rating of 3.43 out of 4 for empathetic responses. Other evaluations were based on the Levels of Emotional Awareness Scale, which was reported to be higher for ChatGPT-3.5 than for humans. Another study evaluated ChatGPT and GPT-4 on soft-skills questions in the United States Medical Licensing Examination, where GPT-4 answered 90% of questions correctly. Limitations were noted, including repetitive use of empathic phrases, difficulty following initial instructions, overly lengthy responses, sensitivity to prompts, and overall subjective evaluation metrics influenced by the evaluator’s background. Conclusions: LLMs exhibit elements of cognitive empathy, recognizing emotions and providing emotionally supportive responses in various contexts. Since social skills are an integral part of intelligence, these advancements bring LLMs closer to human-like interactions and expand their potential use in applications requiring emotional intelligence. However, there remains room for improvement in both the performance of these models and the evaluation strategies used for assessing soft skills. 
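The automatic metrics named in this abstract, ROUGE and BLEU, reduce to n-gram overlap between a generated response and a reference text. The short Python sketch below illustrates unigram-level ROUGE-1 only; it is not code from any study indexed here, and the two example strings are invented.

    from collections import Counter

    def rouge1(candidate: str, reference: str):
        """Unigram ROUGE-1 precision, recall, and F1 between two texts."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        # Shared unigrams, each counted up to its minimum frequency in the two texts.
        overlap = sum((cand & ref).values())
        precision = overlap / max(sum(cand.values()), 1)
        recall = overlap / max(sum(ref.values()), 1)
        f1 = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
        return precision, recall, f1

    # Invented example: a model reply scored against a clinician-written reference.
    print(rouge1(
        "I am sorry you are in pain, let us review your options together",
        "I am sorry you are experiencing pain; we can review your options together",
    ))

Library implementations add stemming and longest-common-subsequence variants; the hand-rolled version above is only meant to make the overlap idea concrete.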
%M 39661968 %R 10.2196/52597 %U https://www.jmir.org/2024/1/e52597 %U https://doi.org/10.2196/52597 %U http://www.ncbi.nlm.nih.gov/pubmed/39661968 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e63892 %T Leveraging Large Language Models for Improved Understanding of Communications With Patients With Cancer in a Call Center Setting: Proof-of-Concept Study %A Cho,Seungbeom %A Lee,Mangyeong %A Yu,Jaewook %A Yoon,Junghee %A Choi,Jae-Boong %A Jung,Kyu-Hwan %A Cho,Juhee %+ Department of Medical Device Management and Research, Samsung Advanced Institute for Health Sciences and Technology, Sungkyunkwan University, 81 Irwon‐ro, Gangnam, Seoul, 06355, Republic of Korea, 82 02 3410 3632, kyuhwanjung@gmail.com %K large language model %K cancer %K supportive care %K LLMs %K patient communication %K natural language processing %K NLP %K self-management %K teleconsultation %K triage services %K telephone consultations %D 2024 %7 11.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Hospital call centers play a critical role in providing support and information to patients with cancer, making it crucial to effectively identify and understand patient intent during consultations. However, operational efficiency and standardization of telephone consultations, particularly when categorizing diverse patient inquiries, remain significant challenges. While traditional deep learning models like long short-term memory (LSTM) and bidirectional encoder representations from transformers (BERT) have been used to address these issues, they heavily depend on annotated datasets, which are labor-intensive and time-consuming to generate. Large language models (LLMs) like GPT-4, with their in-context learning capabilities, offer a promising alternative for classifying patient intent without requiring extensive retraining. Objective: This study evaluates the performance of GPT-4 in classifying the purpose of telephone consultations of patients with cancer. In addition, it compares the performance of GPT-4 to that of discriminative models, such as LSTM and BERT, with a particular focus on their ability to manage ambiguous and complex queries. Methods: We used a dataset of 430,355 sentences from telephone consultations with patients with cancer between 2016 and 2020. LSTM and BERT models were trained on 300,000 sentences using supervised learning, while GPT-4 was applied using zero-shot and few-shot approaches without explicit retraining. The accuracy of each model was compared using 1,000 randomly selected sentences from 2020 onward, with special attention paid to how each model handled ambiguous or uncertain queries. Results: GPT-4, which uses only a few examples (a few shots), attained a remarkable accuracy of 85.2%, considerably outperforming the LSTM and BERT models, which achieved accuracies of 73.7% and 71.3%, respectively. Notably, categories such as “Treatment,” “Rescheduling,” and “Symptoms” involve multiple contexts and exhibit significant complexity. GPT-4 demonstrated more than 15% superior performance in handling ambiguous queries in these categories. In addition, GPT-4 excelled in categories like “Records” and “Routine,” where contextual clues were clear, outperforming the discriminative models. These findings emphasize the potential of LLMs, particularly GPT-4, for interpreting complicated patient interactions during cancer-related telephone consultations. 
Conclusions: This study shows the potential of GPT-4 to significantly improve the classification of patient intent in cancer-related telephone oncological consultations. GPT-4’s ability to handle complex and ambiguous queries without extensive retraining provides a substantial advantage over discriminative models like LSTM and BERT. While GPT-4 demonstrates strong performance in various areas, further refinement of prompt design and category definitions is necessary to fully leverage its capabilities in practical health care applications. Future research will explore the integration of LLMs like GPT-4 into hybrid systems that combine human oversight with artificial intelligence–driven technologies. %M 39661975 %R 10.2196/63892 %U https://www.jmir.org/2024/1/e63892 %U https://doi.org/10.2196/63892 %U http://www.ncbi.nlm.nih.gov/pubmed/39661975 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e58623 %T Integrating GPT-Based AI into Virtual Patients to Facilitate Communication Training Among Medical First Responders: Usability Study of Mixed Reality Simulation %A Gutiérrez Maquilón,Rodrigo %A Uhl,Jakob %A Schrom-Feiertag,Helmut %A Tscheligi,Manfred %+ Center for Technology Experience, AIT - Austrian Institute of Technology, Giefinggasse 4, Vienna, 1210, Austria, 43 66478588121, rodrigo.gutierrez@ait.ac.at %K medical first responders %K verbal communication skills %K training %K virtual patient %K generative artificial intelligence %K GPT %K large language models %K prompt engineering %K mixed reality %D 2024 %7 11.12.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Training in social-verbal interactions is crucial for medical first responders (MFRs) to assess a patient’s condition and perform urgent treatment during emergency medical service administration. Integrating conversational agents (CAs) in virtual patients (VPs), that is, digital simulations, is a cost-effective alternative to resource-intensive human role-playing. There is moderate evidence that CAs improve communication skills more effectively when used with instructional interventions. However, more recent GPT-based artificial intelligence (AI) produces richer, more diverse, and more natural responses than previous CAs and has control of prosodic voice qualities like pitch and duration. These functionalities have the potential to better match the interaction expectations of MFRs regarding habitability. Objective: We aimed to study how the integration of GPT-based AI in a mixed reality (MR)–VP could support communication training of MFRs. Methods: We developed an MR simulation of a traffic accident with a VP. ChatGPT (OpenAI) was integrated into the VP and prompted with verified characteristics of accident victims. MFRs (N=24) were instructed on how to interact with the MR scenario. After assessing and treating the VP, the MFRs were administered the Mean Opinion Scale-Expanded, version 2, and the Subjective Assessment of Speech System Interfaces questionnaires to study their perception of the voice quality and the usability of the voice interactions, respectively. Open-ended questions were asked after completing the questionnaires. The observed and logged interactions with the VP, descriptive statistics of the questionnaires, and the output of the open-ended questions are reported. Results: The usability assessment of the VP resulted in moderate positive ratings, especially in habitability (median 4.25, IQR 4-4.81) and likeability (median 4.50, IQR 3.97-5.91). 
Interactions were negatively affected by the approximately 3-second latency of the responses. MFRs acknowledged the naturalness of determining the physiological states of the VP through verbal communication, for example, with questions such as “Where does it hurt?” However, the question-answer dynamic in the verbal exchange with the VP and the lack of the VP’s ability to start the verbal exchange were noticed. Noteworthy insights highlighted the potential of domain-knowledge prompt engineering to steer the actions of MFRs for effective training. Conclusions: Generative AI in VPs facilitates MFRs’ training but continues to rely on instructions for effective verbal interactions. Therefore, the capabilities of the GPT-VP and a training protocol need to be communicated to trainees. Future interactions should implement triggers based on keyword recognition, the VP pointing to the hurting area, conversational turn-taking techniques, and add the ability for the VP to start a verbal exchange. Furthermore, a local AI server, chunk processing, and lowering the audio resolution of the VP’s voice could ameliorate the delay in response and allay privacy concerns. Prompting could be used in future studies to create a virtual MFR capable of assisting trainees. %M 39661979 %R 10.2196/58623 %U https://formative.jmir.org/2024/1/e58623 %U https://doi.org/10.2196/58623 %U http://www.ncbi.nlm.nih.gov/pubmed/39661979 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60063 %T EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model %A Chen,Xiaolan %A Zhao,Ziwei %A Zhang,Weiyi %A Xu,Pusheng %A Wu,Yue %A Xu,Mingpu %A Gao,Le %A Li,Yinwen %A Shang,Xianwen %A Shi,Danli %A He,Mingguang %+ School of Optometry, The Hong Kong Polytechnic University, 11 Yuk Choi Road, Hung Hom, KLN, Hong Kong, 999077, China, 852 27664825, danli.shi@polyu.edu.hk %K large language model %K generative pretrained transformer %K generative artificial intelligence %K ophthalmology %K retrieval-augmented generation %K medical assistant %K EyeGPT %K generative AI %D 2024 %7 11.12.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) have the potential to enhance clinical flow and improve medical education, but they encounter challenges related to specialized knowledge in ophthalmology. Objective: This study aims to enhance ophthalmic knowledge by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education. Methods: We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. The efficacy of various EyeGPT variants was evaluated by 4 board-certified ophthalmologists through comprehensive use of 120 diverse category questions in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of the unassisted human physician group and the EyeGPT+human group. We proposed 4 metrics for assessment: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported. 
Results: The best fine-tuned model significantly outperformed the original Llama2 model at providing informed advice (mean 9.30, SD 4.42 vs mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (97/120, 80.8% vs 53/120, 44.2%, P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's response compared with the best fine-tuned model alone (mean 13.08, SD 5.43 vs mean 15.14, SD 4.64; P=.001) and reduced hallucinations (71/120, 59.2% vs 57/120, 47.4%, P=.02). Subgroup analysis revealed that EyeGPT showed robustness across common diseases, with consistent performance across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by the combination of fine-tuning and the manual database, standalone fine-tuning, and pure role-playing methods. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, the performance of the ophthalmologist was notably enhanced. Conclusions: We pioneered and introduced EyeGPT by refining a general domain LLM and conducted a comprehensive comparison and evaluation of different strategies to develop an ophthalmology-specific assistant. Our results highlight EyeGPT’s potential to assist ophthalmologists and patients in medical settings. %M 39661433 %R 10.2196/60063 %U https://www.jmir.org/2024/1/e60063 %U https://doi.org/10.2196/60063 %U http://www.ncbi.nlm.nih.gov/pubmed/39661433 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51435 %T ChatGPT May Improve Access to Language-Concordant Care for Patients With Non–English Language Preferences %A Dzuali,Fiatsogbe %A Seiger,Kira %A Novoa,Roberto %A Aleshin,Maria %A Teng,Joyce %A Lester,Jenna %A Daneshjou,Roxana %K ChatGPT %K artificial intelligence %K language %K translation %K health care disparity %K natural language model %K survey %K patient education %K preference %K human language %K language-concordant care %D 2024 %7 10.12.2024 %9 %J JMIR Med Educ %G English %X This study evaluated the accuracy of ChatGPT in translating English patient education materials into Spanish, Mandarin, and Russian. While ChatGPT shows promise for translating Spanish and Russian medical information, Mandarin translations require further refinement, highlighting the need for careful review of AI-generated translations before clinical use. %R 10.2196/51435 %U https://mededu.jmir.org/2024/1/e51435 %U https://doi.org/10.2196/51435 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57667 %T Use of ChatGPT to Explore Gender and Geographic Disparities in Scientific Peer Review %A Sebo,Paul %+ University of Geneva, Rue Michel-Servet 1, Geneva, 1211, Switzerland, 41 223794390, paul.seboe@unige.ch %K Africa %K artificial intelligence %K discrimination %K peer review %K sentiment analysis %K ChatGPT %K disparity %K gender %K geographic %K global south %K inequality %K woman %K assessment %K researcher %K communication %K consultation %K gender bias %D 2024 %7 9.12.2024 %9 Short Paper %J J Med Internet Res %G English %X Background: In the realm of scientific research, peer review serves as a cornerstone for ensuring the quality and integrity of scholarly papers. Recent trends in promoting transparency and accountability have led some journals to publish peer-review reports alongside papers. 
Objective: ChatGPT-4 (OpenAI) was used to quantitatively assess sentiment and politeness in peer-review reports from high-impact medical journals. The objective was to explore gender and geographical disparities to enhance inclusivity within the peer-review process. Methods: All 9 general medical journals with an impact factor >2 that publish peer-review reports were identified. A total of 12 research papers per journal were randomly selected, all published in 2023. The names of the first and last authors along with the first author’s country of affiliation were collected, and the gender of both the first and last authors was determined. For each review, ChatGPT-4 was asked to evaluate the “sentiment score,” ranging from –100 (negative) to 0 (neutral) to +100 (positive), and the “politeness score,” ranging from –100 (rude) to 0 (neutral) to +100 (polite). The measurements were repeated 5 times and the minimum and maximum values were removed. The mean sentiment and politeness scores for each review were computed and then summarized using the median and interquartile range. Statistical analyses included Wilcoxon rank-sum tests, Kruskal-Wallis rank tests, and negative binomial regressions. Results: Analysis of 291 peer-review reports corresponding to 108 papers unveiled notable regional disparities. Papers from the Middle East, Latin America, or Africa exhibited lower sentiment and politeness scores compared to those from North America, Europe, or Pacific and Asia (sentiment scores: 27 vs 60 and 62 respectively; politeness scores: 43.5 vs 67 and 65 respectively, adjusted P=.02). No significant differences based on authors’ gender were observed (all P>.05). Conclusions: Notable regional disparities were found, with papers from the Middle East, Latin America, and Africa demonstrating significantly lower scores, while no discernible differences were observed based on authors’ gender. The absence of gender-based differences suggests that gender biases may not manifest as prominently as other forms of bias within the context of peer review. The study underscores the need for targeted interventions to address regional disparities in peer review and advocates for ongoing efforts to promote equity and inclusivity in scholarly communication. 
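The scoring protocol described above (5 repeated ChatGPT-4 ratings per review, the minimum and maximum removed, then a mean, with Kruskal-Wallis tests across regions) can be outlined with standard Python tooling. The sketch below is an illustration under those assumptions rather than the author's analysis code, and every number in it is invented.

    from statistics import mean
    from scipy.stats import kruskal

    def trimmed_mean(scores):
        """Average of 5 repeated ratings after dropping the single minimum and maximum."""
        s = sorted(scores)
        return mean(s[1:-1])

    # Invented repeated sentiment ratings (range -100 to +100) for three reviews.
    repeated_ratings = {
        "review_a": [55, 60, 62, 58, 90],
        "review_b": [20, 25, 30, 27, -10],
        "review_c": [65, 70, 68, 66, 64],
    }
    per_review = {name: trimmed_mean(r) for name, r in repeated_ratings.items()}

    # Invented per-review means grouped by the first author's region of affiliation.
    north_america_europe = [60.0, 62.5, 58.0, 67.0]
    middle_east_latam_africa = [27.0, 30.5, 25.0, 43.5]
    statistic, p_value = kruskal(north_america_europe, middle_east_latam_africa)
    print(per_review, round(statistic, 2), round(p_value, 3))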
%M 39652394 %R 10.2196/57667 %U https://www.jmir.org/2024/1/e57667 %U https://doi.org/10.2196/57667 %U http://www.ncbi.nlm.nih.gov/pubmed/39652394 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e67409 %T The Triage and Diagnostic Accuracy of Frontier Large Language Models: Updated Comparison to Physician Performance %A Sorich,Michael Joseph %A Mangoni,Arduino Aleksander %A Bacchi,Stephen %A Menz,Bradley Douglas %A Hopkins,Ashley Mark %+ College of Medicine and Public Health, Flinders University, GPO Box 2100, Adelaide, 5001, Australia, 61 82013217, michael.sorich@flinders.edu.au %K generative artificial intelligence %K large language models %K triage %K diagnosis %K accuracy %K physician %K ChatGPT %K diagnostic %K primary care %K physicians %K prediction %K medical care %K internet %K LLMs %K AI %D 2024 %7 6.12.2024 %9 Research Letter %J J Med Internet Res %G English %X %M 39642373 %R 10.2196/67409 %U https://www.jmir.org/2024/1/e67409 %U https://doi.org/10.2196/67409 %U http://www.ncbi.nlm.nih.gov/pubmed/39642373 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e57451 %T Performance of GPT-3.5 and GPT-4 on the Korean Pharmacist Licensing Examination: Comparison Study %A Jin,Hye Kyung %A Kim,EunYoung %K GPT-3.5 %K GPT-4 %K Korean %K Korean Pharmacist Licensing Examination %K KPLE %D 2024 %7 4.12.2024 %9 %J JMIR Med Educ %G English %X Background: ChatGPT, a recently developed artificial intelligence chatbot and a notable large language model, has demonstrated improved performance on medical field examinations. However, there is currently little research on its efficacy in languages other than English or in pharmacy-related examinations. Objective: This study aimed to evaluate the performance of GPT models on the Korean Pharmacist Licensing Examination (KPLE). Methods: We evaluated the percentage of correct answers provided by 2 different versions of ChatGPT (GPT-3.5 and GPT-4) for all multiple-choice single-answer KPLE questions, excluding image-based questions. In total, 320, 317, and 323 questions from the 2021, 2022, and 2023 KPLEs, respectively, were included in the final analysis, which consisted of 4 units: Biopharmacy, Industrial Pharmacy, Clinical and Practical Pharmacy, and Medical Health Legislation. Results: The 3-year average percentage of correct answers was 86.5% (830/960) for GPT-4 and 60.7% (583/960) for GPT-3.5. GPT model accuracy was highest in Biopharmacy (GPT-3.5 77/96, 80.2% in 2022; GPT-4 87/90, 96.7% in 2021) and lowest in Medical Health Legislation (GPT-3.5 8/20, 40% in 2022; GPT-4 12/20, 60% in 2022). Additionally, when comparing the performance of artificial intelligence with that of human participants, pharmacy students outperformed GPT-3.5 but not GPT-4. Conclusions: In the last 3 years, GPT models have performed very close to or exceeded the passing threshold for the KPLE. This study demonstrates the potential of large language models in the pharmacy domain; however, extensive research is needed to evaluate their reliability and ensure their secure application in pharmacy contexts due to several inherent challenges. Addressing these limitations could make GPT models more effective auxiliary tools for pharmacy education. 
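The 3-year totals reported in this abstract (830/960 correct for GPT-4 vs 583/960 for GPT-3.5) lend themselves to a simple chi-square comparison of accuracy. The abstract does not report such a test; the Python sketch below is only an illustrative way to check the gap using those published counts.

    from scipy.stats import chi2_contingency

    # Correct vs incorrect counts taken from the abstract's 3-year totals.
    gpt4_counts = [830, 960 - 830]    # 86.5% correct
    gpt35_counts = [583, 960 - 583]   # 60.7% correct

    chi2, p_value, dof, expected = chi2_contingency([gpt4_counts, gpt35_counts])
    print(f"chi2={chi2:.1f}, dof={dof}, P={p_value:.2e}")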
%R 10.2196/57451 %U https://mededu.jmir.org/2024/1/e57451 %U https://doi.org/10.2196/57451 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e63188 %T Comparing the Accuracy of Two Generated Large Language Models in Identifying Health-Related Rumors or Misconceptions and the Applicability in Health Science Popularization: Proof-of-Concept Study %A Luo,Yuan %A Miao,Yiqun %A Zhao,Yuhan %A Li,Jiawei %A Chen,Yuling %A Yue,Yuexue %A Wu,Ying %K rumor %K misconception %K health science popularization %K health education %K large language model %K LLM %K applicability %K accuracy %K effectiveness %K health related %K education %K health science %K proof of concept %D 2024 %7 2.12.2024 %9 %J JMIR Form Res %G English %X Background: Health-related rumors and misconceptions are spreading at an alarming rate, fueled by the rapid development of the internet and the exponential growth of social media platforms. This phenomenon has become a pressing global concern, as the dissemination of false information can have severe consequences, including widespread panic, social instability, and even public health crises. Objective: The aim of the study is to compare the accuracy of rumor identification and the effectiveness of health science popularization between 2 generated large language models in Chinese (GPT-4 by OpenAI and Enhanced Representation through Knowledge Integration Bot [ERNIE Bot] 4.0 by Baidu). Methods: In total, 20 health rumors and misconceptions, along with 10 health truths, were randomly inputted into GPT-4 and ERNIE Bot 4.0. We prompted them to determine whether the statements were rumors or misconceptions and provide explanations for their judgment. Further, we asked them to generate a health science popularization essay. We evaluated the outcomes in terms of accuracy, effectiveness, readability, and applicability. Accuracy was assessed by the rate of correctly identifying health-related rumors, misconceptions, and truths. Effectiveness was determined by the accuracy of the generated explanation, which was assessed collaboratively by 2 research team members with a PhD in nursing. Readability was calculated by the readability formula of Chinese health education materials. Applicability was evaluated by the Chinese Suitability Assessment of Materials. Results: GPT-4 and ERNIE Bot 4.0 correctly identified all health rumors and misconceptions (100% accuracy rate). For truths, the accuracy rate was 70% (7/10) and 100% (10/10), respectively. Both mostly provided widely recognized viewpoints without obvious errors. The average readability score for the health essays was 2.92 (SD 0.85) for GPT-4 and 3.02 (SD 0.84) for ERNIE Bot 4.0 (P=.65). For applicability, except for the content and cultural appropriateness category, significant differences were observed in the total score and scores in other dimensions between them (P<.05). Conclusions: ERNIE Bot 4.0 demonstrated similar accuracy to GPT-4 in identifying Chinese rumors. Both provided widely accepted views, despite some inaccuracies. These insights enhance understanding and correct misunderstandings. For health essays, educators can learn from readable language styles of GLLMs. Finally, ERNIE Bot 4.0 aligns with Chinese expression habits, making it a good choice for a better Chinese reading experience. 
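Accuracy in this abstract is the proportion of statements whose rumor-or-truth label the model reproduces. The minimal Python sketch below shows that tally with invented labels; it does not use the study's 30 statements.

    def accuracy(predicted, gold):
        """Fraction of statements classified the same way as the gold label."""
        return sum(p == g for p, g in zip(predicted, gold)) / len(gold)

    # Invented labels: 1 = rumor or misconception, 0 = truth.
    gold_labels = [1, 1, 1, 0, 0]
    model_a = [1, 1, 1, 0, 1]   # hypothetical GPT-4-style judgments
    model_b = [1, 1, 1, 0, 0]   # hypothetical ERNIE Bot-style judgments
    print(accuracy(model_a, gold_labels), accuracy(model_b, gold_labels))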
%R 10.2196/63188 %U https://formative.jmir.org/2024/1/e63188 %U https://doi.org/10.2196/63188 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e54919 %T Efficacy of ChatGPT in Educating Patients and Clinicians About Skin Toxicities Associated With Cancer Treatment %A Chang,Annie %A Young,Jade %A Para,Andrew %A Lamb,Angela %A Gulati,Nicholas %K artificial intelligence %K ChatGPT %K oncodermatology %K cancer therapy %K language learning model %D 2024 %7 20.11.2024 %9 %J JMIR Dermatol %G English %X This study investigates the application of ChatGPT, an artificial intelligence tool, in providing information on skin toxicities associated with cancer treatments, highlighting that while ChatGPT can serve as a valuable resource for clinicians, its use for patient education requires careful consideration due to the complex nature of the information provided. %R 10.2196/54919 %U https://derma.jmir.org/2024/1/e54919 %U https://doi.org/10.2196/54919 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58329 %T Evaluation Framework of Large Language Models in Medical Documentation: Development and Usability Study %A Seo,Junhyuk %A Choi,Dasol %A Kim,Taerim %A Cha,Won Chul %A Kim,Minha %A Yoo,Haanju %A Oh,Namkee %A Yi,YongJin %A Lee,Kye Hwa %A Choi,Edward %+ Department of Digital Health, Samsung Advanced Institute of Health Sciences and Technology (SAIHST), Sungkyunkwan University, 115, Irwon-ro, Gangnam-gu, Seoul, 06355, Republic of Korea, 82 010 7114 2342, taerim.j.kim@gmail.com %K large language models %K health care documentation %K clinical evaluation %K emergency department %K artificial intelligence %K medical record accuracy %D 2024 %7 20.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The advancement of large language models (LLMs) offers significant opportunities for health care, particularly in the generation of medical documentation. However, challenges related to ensuring the accuracy and reliability of LLM outputs, coupled with the absence of established quality standards, have raised concerns about their clinical application. Objective: This study aimed to develop and validate an evaluation framework for assessing the accuracy and clinical applicability of LLM-generated emergency department (ED) records, aiming to enhance artificial intelligence integration in health care documentation. Methods: We organized the Healthcare Prompt-a-thon, a competitive event designed to explore the capabilities of LLMs in generating accurate medical records. The event involved 52 participants who generated 33 initial ED records using HyperCLOVA X, a Korean-specialized LLM. We applied a dual evaluation approach. First, clinical evaluation: 4 medical professionals evaluated the records using a 5-point Likert scale across 5 criteria—appropriateness, accuracy, structure/format, conciseness, and clinical validity. Second, quantitative evaluation: We developed a framework to categorize and count errors in the LLM outputs, identifying 7 key error types. Statistical methods, including Pearson correlation and intraclass correlation coefficients (ICC), were used to assess consistency and agreement among evaluators. Results: The clinical evaluation demonstrated strong interrater reliability, with ICC values ranging from 0.653 to 0.887 (P<.001), and a test-retest reliability Pearson correlation coefficient of 0.776 (P<.001). 
Quantitative analysis revealed that invalid generation errors were the most common, constituting 35.38% of total errors, while structural malformation errors had the most significant negative impact on the clinical evaluation score (Pearson r=–0.654; P<.001). A strong negative correlation was found between the number of quantitative errors and clinical evaluation scores (Pearson r=–0.633; P<.001), indicating that higher error rates corresponded to lower clinical acceptability. Conclusions: Our research provides robust support for the reliability and clinical acceptability of the proposed evaluation framework. It underscores the framework’s potential to mitigate clinical burdens and foster the responsible integration of artificial intelligence technologies in health care, suggesting a promising direction for future research and practical applications in the field. %M 39566044 %R 10.2196/58329 %U https://www.jmir.org/2024/1/e58329 %U https://doi.org/10.2196/58329 %U http://www.ncbi.nlm.nih.gov/pubmed/39566044 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51433 %T Leveraging Open-Source Large Language Models for Data Augmentation in Hospital Staff Surveys: Mixed Methods Study %A Ehrett,Carl %A Hegde,Sudeep %A Andre,Kwame %A Liu,Dixizi %A Wilson,Timothy %K data augmentation %K large language models %K medical education %K natural language processing %K data security %K ethics %K AI %K artificial intelligence %K data privacy %K medical staff %D 2024 %7 19.11.2024 %9 %J JMIR Med Educ %G English %X Background: Generative large language models (LLMs) have the potential to revolutionize medical education by generating tailored learning materials, enhancing teaching efficiency, and improving learner engagement. However, the application of LLMs in health care settings, particularly for augmenting small datasets in text classification tasks, remains underexplored, especially for cost- and privacy-conscious applications that do not permit the use of third-party services such as OpenAI’s ChatGPT. Objective: This study aims to explore the use of open-source LLMs, such as Large Language Model Meta AI (LLaMA) and Alpaca models, for data augmentation in a specific text classification task related to hospital staff surveys. Methods: The surveys were designed to elicit narratives of everyday adaptation by frontline radiology staff during the initial phase of the COVID-19 pandemic. A 2-step process of data augmentation and text classification was conducted. The study generated synthetic data similar to the survey reports using 4 generative LLMs for data augmentation. A different set of 3 classifier LLMs was then used to classify the augmented text for thematic categories. The study evaluated performance on the classification task. Results: The overall best-performing combination of LLM, temperature, classifier, and number of synthetic data cases was augmentation with LLaMA 7B at temperature 0.7 with 100 augments, using the Robustly Optimized BERT Pretraining Approach (RoBERTa) for the classification task, achieving an average area under the receiver operating characteristic curve (AUC) of 0.87 (SD 0.02; ie, 1 SD). The results demonstrate that open-source LLMs can enhance text classifiers’ performance for small datasets in health care contexts, providing promising pathways for improving medical education processes and patient care practices. 
Conclusions: The study demonstrates the value of data augmentation with open-source LLMs, highlights the importance of privacy and ethical considerations when using LLMs, and suggests future directions for research in this field. %R 10.2196/51433 %U https://mededu.jmir.org/2024/1/e51433 %U https://doi.org/10.2196/51433 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e63445 %T Using Large Language Models to Abstract Complex Social Determinants of Health From Original and Deidentified Medical Notes: Development and Validation Study %A Ralevski,Alexandra %A Taiyab,Nadaa %A Nossal,Michael %A Mico,Lindsay %A Piekos,Samantha %A Hadlock,Jennifer %+ Institute for Systems Biology, 401 Terry Ave N, Seattle, WA, 98121, United States, 1 732 1359, jhadlock@isbscience.org %K housing instability %K housing insecurity %K housing %K machine learning %K artificial intelligence %K AI %K large language model %K LLM %K natural language processing %K NLP %K electronic health record %K EHR %K electronic medical record %K EMR %K social determinants of health %K exposome %K pregnancy %K obstetric %K deidentification %D 2024 %7 19.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Social determinants of health (SDoH) such as housing insecurity are known to be intricately linked to patients’ health status. More efficient methods for abstracting structured data on SDoH can help accelerate the inclusion of exposome variables in biomedical research and support health care systems in identifying patients who could benefit from proactive outreach. Large language models (LLMs) developed from Generative Pre-trained Transformers (GPTs) have shown potential for performing complex abstraction tasks on unstructured clinical notes. Objective: Here, we assess the performance of GPTs on identifying temporal aspects of housing insecurity and compare results between both original and deidentified notes. Methods: We compared the ability of GPT-3.5 and GPT-4 to identify instances of both current and past housing instability, as well as general housing status, from 25,217 notes from 795 pregnant women. Results were compared with manual abstraction, a named entity recognition model, and regular expressions. Results: Compared with GPT-3.5 and the named entity recognition model, GPT-4 had the highest performance and had a much higher recall (0.924) than human abstractors (0.702) in identifying patients experiencing current or past housing instability, although precision was lower (0.850) compared with human abstractors (0.971). GPT-4’s precision improved slightly (0.936 original, 0.939 deidentified) on deidentified versions of the same notes, while recall dropped (0.781 original, 0.704 deidentified). Conclusions: This work demonstrates that while manual abstraction is likely to yield slightly more accurate results overall, LLMs can provide a scalable, cost-effective solution with the advantage of greater recall. This could support semiautomated abstraction, but given the potential risk for harm, human review would be essential before using results for any patient engagement or care decisions. Furthermore, recall was lower when notes were deidentified prior to LLM abstraction. 
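The precision and recall figures reported for housing-instability abstraction follow the usual definitions over true positives, false positives, and false negatives. The Python sketch below shows how such note-level figures are typically computed when comparing model output against manual abstraction; the labels are invented, not the study data.

    def precision_recall(predicted, truth):
        """Precision and recall for binary labels (1 = housing instability documented)."""
        tp = sum(p == 1 and t == 1 for p, t in zip(predicted, truth))
        fp = sum(p == 1 and t == 0 for p, t in zip(predicted, truth))
        fn = sum(p == 0 and t == 1 for p, t in zip(predicted, truth))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        return precision, recall

    # Invented note-level labels for illustration only.
    model_output = [1, 1, 0, 1, 0, 1]
    manual_abstraction = [1, 0, 0, 1, 1, 1]
    print(precision_recall(model_output, manual_abstraction))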
%M 39561354 %R 10.2196/63445 %U https://www.jmir.org/2024/1/e63445 %U https://doi.org/10.2196/63445 %U http://www.ncbi.nlm.nih.gov/pubmed/39561354 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e59439 %T Mitigating Cognitive Biases in Clinical Decision-Making Through Multi-Agent Conversations Using Large Language Models: Simulation Study %A Ke,Yuhe %A Yang,Rui %A Lie,Sui An %A Lim,Taylor Xin Yi %A Ning,Yilin %A Li,Irene %A Abdullah,Hairil Rizal %A Ting,Daniel Shu Wei %A Liu,Nan %+ Centre for Quantitative Medicine, Duke-NUS Medical School, 8 College Road, Singapore, 169857, Singapore, 65 66016503, liu.nan@duke-nus.edu.sg %K clinical decision-making %K cognitive bias %K generative artificial intelligence %K large language model %K multi-agent %D 2024 %7 19.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Cognitive biases in clinical decision-making significantly contribute to errors in diagnosis and suboptimal patient outcomes. Addressing these biases presents a formidable challenge in the medical field. Objective: This study aimed to explore the role of large language models (LLMs) in mitigating these biases through the use of the multi-agent framework. We simulate the clinical decision-making processes through multi-agent conversation and evaluate its efficacy in improving diagnostic accuracy compared with humans. Methods: A total of 16 published and unpublished case reports where cognitive biases have resulted in misdiagnoses were identified from the literature. In the multi-agent framework, we leveraged GPT-4 (OpenAI) to facilitate interactions among different simulated agents to replicate clinical team dynamics. Each agent was assigned a distinct role: (1) making the final diagnosis after considering the discussions, (2) acting as a devil’s advocate to correct confirmation and anchoring biases, (3) serving as a field expert in the required medical subspecialty, (4) facilitating discussions to mitigate premature closure bias, and (5) recording and summarizing findings. We tested varying combinations of these agents within the framework to determine which configuration yielded the highest rate of correct final diagnoses. Each scenario was repeated 5 times for consistency. The accuracy of the initial diagnoses and the final differential diagnoses were evaluated, and comparisons with human-generated answers were made using the Fisher exact test. Results: A total of 240 responses were evaluated (3 different multi-agent frameworks). The initial diagnosis had an accuracy of 0% (0/80). However, following multi-agent discussions, the accuracy for the top 2 differential diagnoses increased to 76% (61/80) for the best-performing multi-agent framework (Framework 4-C). This was significantly higher compared with the accuracy achieved by human evaluators (odds ratio 3.49; P=.002). Conclusions: The multi-agent framework demonstrated an ability to re-evaluate and correct misconceptions, even in scenarios with misleading initial investigations. In addition, the LLM-driven, multi-agent conversation framework shows promise in enhancing diagnostic accuracy in diagnostically challenging medical scenarios. 
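The comparison with human evaluators in this abstract rests on a Fisher exact test over counts of correct versus incorrect diagnoses. The Python sketch below shows the general form of that test; the 61/80 figure comes from the abstract, while the human-evaluator counts are placeholders because they are not given here.

    from scipy.stats import fisher_exact

    # Rows: best multi-agent framework vs human evaluators.
    # Columns: correct vs incorrect top-2 differential diagnoses.
    framework_counts = [61, 80 - 61]   # reported in the abstract
    human_counts = [24, 80 - 24]       # placeholder counts for illustration only

    odds_ratio, p_value = fisher_exact([framework_counts, human_counts])
    print(f"odds ratio={odds_ratio:.2f}, P={p_value:.4f}")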
%M 39561363 %R 10.2196/59439 %U https://www.jmir.org/2024/1/e59439 %U https://doi.org/10.2196/59439 %U http://www.ncbi.nlm.nih.gov/pubmed/39561363 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54297 %T Using ChatGPT in Nursing: Scoping Review of Current Opinions %A Zhou,You %A Li,Si-Jia %A Tang,Xing-Yi %A He,Yi-Chen %A Ma,Hao-Ming %A Wang,Ao-Qi %A Pei,Run-Yuan %A Piao,Mei-Hua %K ChatGPT %K large language model %K nursing %K artificial intelligence %K scoping review %K generative AI %K nursing education %D 2024 %7 19.11.2024 %9 %J JMIR Med Educ %G English %X Background: Since the release of ChatGPT in November 2022, this emerging technology has garnered a lot of attention in various fields, and nursing is no exception. However, to date, no study has comprehensively summarized the status and opinions of using ChatGPT across different nursing fields. Objective: We aim to synthesize the status and opinions of using ChatGPT according to different nursing fields, as well as assess ChatGPT’s strengths, weaknesses, and the potential impacts it may cause. Methods: This scoping review was conducted following the framework of Arksey and O’Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). A comprehensive literature search was conducted in 4 web-based databases (PubMed, Embase, Web of Science, and CINAHL) to identify studies reporting the opinions of using ChatGPT in nursing fields from 2022 to September 3, 2023. The references of the included studies were screened manually to further identify relevant studies. Two authors conducted study screening, eligibility assessments, and data extraction independently. Results: A total of 30 studies were included. The United States (7 studies), Canada (5 studies), and China (4 studies) were the countries with the most publications. In terms of fields of concern, studies mainly focused on “ChatGPT and nursing education” (20 studies), “ChatGPT and nursing practice” (10 studies), and “ChatGPT and nursing research, writing, and examination” (6 studies). Six studies addressed the use of ChatGPT in multiple nursing fields. Conclusions: As an emerging artificial intelligence technology, ChatGPT has great potential to revolutionize nursing education, nursing practice, and nursing research. However, researchers, institutions, and administrations still need to critically examine its accuracy, safety, and privacy, as well as academic misconduct and potential ethical issues that it may lead to before applying ChatGPT to practice. 
%R 10.2196/54297 %U https://mededu.jmir.org/2024/1/e54297 %U https://doi.org/10.2196/54297 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e66453 %T The Vast Potential of ChatGPT in Pediatric Surgery %A Tang,Ran %A Qi,Shi-qin %+ Department of Pediatric Surgery, Anhui Provincal Children's Hospital, 39 Wangjiang Street, Hefei, 230051, China, 86 13637080508, qishiqin@163.com %K ChatGPT %K pediatric %K surgery %K artificial intelligence %K AI %K diagnosis %K surgeon %D 2024 %7 18.11.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X %R 10.2196/66453 %U https://www.jmir.org/2024/1/e66453 %U https://doi.org/10.2196/66453 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e59607 %T Examining the Role of Large Language Models in Orthopedics: Systematic Review %A Zhang,Cheng %A Liu,Shanshan %A Zhou,Xingyu %A Zhou,Siyu %A Tian,Yinglun %A Wang,Shenglin %A Xu,Nanfang %A Li,Weishi %+ Department of Orthopaedics, Peking University Third Hospital, 49 North Garden Road, Beijing, 100191, China, 86 01082267360, puh3liweishi@163.com %K large language model %K LLM %K orthopedics %K generative pretrained transformer %K GPT %K ChatGPT %K digital health %K clinical practice %K artificial intelligence %K AI %K generative AI %K Bard %D 2024 %7 15.11.2024 %9 Review %J J Med Internet Res %G English %X Background: Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent. Objective: The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges. Methods: PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of “large language model,” “generative artificial intelligence,” “ChatGPT,” and “orthopaedics,” were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment. Results: A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. 
There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs’ performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. When performing disease classification tasks, ChatGPT with GPT-4’s accuracy ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections. Conclusions: LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision. %M 39546795 %R 10.2196/59607 %U https://www.jmir.org/2024/1/e59607 %U https://doi.org/10.2196/59607 %U http://www.ncbi.nlm.nih.gov/pubmed/39546795 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56762 %T Evaluating AI Competence in Specialized Medicine: Comparative Analysis of ChatGPT and Neurologists in a Neurology Specialist Examination in Spain %A Ros-Arlanzón,Pablo %A Perez-Sempere,Angel %K artificial intelligence %K ChatGPT %K clinical decision-making %K medical education %K medical knowledge assessment %K OpenAI %D 2024 %7 14.11.2024 %9 %J JMIR Med Educ %G English %X Background: With the rapid advancement of artificial intelligence (AI) in various fields, evaluating its application in specialized medical contexts becomes crucial. ChatGPT, a large language model developed by OpenAI, has shown potential in diverse applications, including medicine. Objective: This study aims to compare the performance of ChatGPT with that of attending neurologists in a real neurology specialist examination conducted in the Valencian Community, Spain, assessing the AI’s capabilities and limitations in medical knowledge. Methods: We conducted a comparative analysis using the 2022 neurology specialist examination results from 120 neurologists and responses generated by ChatGPT versions 3.5 and 4. The examination consisted of 80 multiple-choice questions, with a focus on clinical neurology and health legislation. Questions were classified according to Bloom’s Taxonomy. Statistical analysis of performance, including the κ coefficient for response consistency, was performed. Results: Human participants exhibited a median score of 5.91 (IQR: 4.93-6.76), with 32 neurologists failing to pass. ChatGPT-3.5 ranked 116th out of 122, answering 54.5% of questions correctly (score 3.94). ChatGPT-4 showed marked improvement, ranking 17th with 81.8% of correct answers (score 7.57), surpassing several human specialists. No significant variations were observed in the performance on lower-order questions versus higher-order questions. Additionally, ChatGPT-4 demonstrated increased interrater reliability, as reflected by a higher κ coefficient of 0.73, compared to ChatGPT-3.5’s coefficient of 0.69. Conclusions: This study underscores the evolving capabilities of AI in medical knowledge assessment, particularly in specialized fields. ChatGPT-4’s performance, outperforming the median score of human participants in a rigorous neurology examination, represents a significant milestone in AI development, suggesting its potential as an effective tool in specialized medical education and assessment. 
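The neurology examination study above reports a κ coefficient for the consistency of ChatGPT's responses alongside percentage scores. As a generic illustration only (not the authors' analysis code, and using invented answer data), agreement between two runs of a model on the same multiple-choice examination can be quantified with scikit-learn's Cohen κ as follows.

```python
# Sketch: consistency (Cohen's kappa) between two runs of a model on the same
# multiple-choice exam, plus a simple percentage score. Answer data are invented.
from sklearn.metrics import cohen_kappa_score

run_1 = ["A", "C", "B", "D", "A", "B", "C", "A", "D", "B"]  # first attempt
run_2 = ["A", "C", "B", "A", "A", "B", "C", "A", "D", "C"]  # repeat attempt
answer_key = ["A", "C", "B", "D", "A", "B", "D", "A", "D", "B"]

kappa = cohen_kappa_score(run_1, run_2)
accuracy_run_1 = sum(r == k for r, k in zip(run_1, answer_key)) / len(answer_key)

print(f"Consistency between runs (kappa): {kappa:.2f}")
print(f"Accuracy of run 1: {accuracy_run_1:.0%}")
```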
%R 10.2196/56762 %U https://mededu.jmir.org/2024/1/e56762 %U https://doi.org/10.2196/56762 %0 Journal Article %@ 2371-4379 %I JMIR Publications %V 9 %N %P e58680 %T Lightening the Load: Generative AI to Mitigate the Burden of the New Era of Obesity Medical Therapy %A Stevens,Elizabeth R %A Elmaleh-Sachs,Arielle %A Lofton,Holly %A Mann,Devin M %K obesity %K artificial intelligence %K AI %K clinical management %K GLP-1 %K glucagon-like peptide 1 %K medical therapy %K antiobesity %K diabetes %K medication %K agonists %K glucose-dependent insulinotropic polypeptide %K treatment %K clinician %K health care delivery system %K incretin mimetic %D 2024 %7 14.11.2024 %9 %J JMIR Diabetes %G English %X Highly effective antiobesity and diabetes medications such as glucagon-like peptide 1 (GLP-1) agonists and glucose-dependent insulinotropic polypeptide/GLP-1 (dual) receptor agonists (RAs) have ushered in a new era of treatment of these highly prevalent, morbid conditions that have increased across the globe. However, the rapidly escalating use of GLP-1/dual RA medications is poised to overwhelm an already overburdened health care provider workforce and health care delivery system, stifling its potentially dramatic benefits. Relying on existing systems and resources to address the oncoming rise in GLP-1/dual RA use will be insufficient. Generative artificial intelligence (GenAI) has the potential to offset the clinical and administrative demands associated with the management of patients on these medication types. Early adoption of GenAI to facilitate the management of these GLP-1/dual RAs has the potential to improve health outcomes while decreasing its concomitant workload. Research and development efforts are urgently needed to develop GenAI obesity medication management tools, as well as to ensure their accessibility and use by encouraging their integration into health care delivery systems. %R 10.2196/58680 %U https://diabetes.jmir.org/2024/1/e58680 %U https://doi.org/10.2196/58680 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e64226 %T Economics and Equity of Large Language Models: Health Care Perspective %A Nagarajan,Radha %A Kondo,Midori %A Salas,Franz %A Sezgin,Emre %A Yao,Yuan %A Klotzman,Vanessa %A Godambe,Sandip A %A Khan,Naqi %A Limon,Alfonso %A Stephenson,Graham %A Taraman,Sharief %A Walton,Nephi %A Ehwerhemuepha,Louis %A Pandit,Jay %A Pandita,Deepti %A Weiss,Michael %A Golden,Charles %A Gold,Adam %A Henderson,John %A Shippy,Angela %A Celi,Leo Anthony %A Hogan,William R %A Oermann,Eric K %A Sanger,Terence %A Martel,Steven %+ Children's Hospital of Orange County, 1201 W. La Veta Ave, Orange, CA, 92868, United States, 1 714 997 3000, Radha.Nagarajan@choc.org %K large language model %K LLM %K health care %K economics %K equity %K cloud service providers %K cloud %K health outcome %K implementation %K democratization %D 2024 %7 14.11.2024 %9 Viewpoint %J J Med Internet Res %G English %X Large language models (LLMs) continue to exhibit noteworthy capabilities across a spectrum of areas, including emerging proficiencies across the health care continuum. Successful LLM implementation and adoption depend on digital readiness, modern infrastructure, a trained workforce, privacy, and an ethical regulatory landscape. These factors can vary significantly across health care ecosystems, dictating the choice of a particular LLM implementation pathway. 
This perspective discusses 3 LLM implementation pathways—training from scratch pathway (TSP), fine-tuned pathway (FTP), and out-of-the-box pathway (OBP)—as potential onboarding points for health systems while facilitating equitable adoption. The choice of a particular pathway is governed by needs as well as affordability. Therefore, the risks, benefits, and economics of these pathways across 4 major cloud service providers (Amazon, Microsoft, Google, and Oracle) are presented. While cost comparisons, such as on-demand and spot pricing across the cloud service providers for the 3 pathways, are presented for completeness, the usefulness of managed services and cloud enterprise tools is elucidated. Managed services can complement the traditional workforce and expertise, while enterprise tools, such as federated learning, can overcome sample size challenges when implementing LLMs using health care data. Of the 3 pathways, TSP is expected to be the most resource-intensive regarding infrastructure and workforce while providing maximum customization, enhanced transparency, and performance. Because TSP trains the LLM using enterprise health care data, it is expected to harness the digital signatures of the population served by the health care system with the potential to impact outcomes. The use of pretrained models in FTP is a limitation. It may impact its performance because the training data used in the pretrained model may have hidden bias and may not necessarily be health care–related. However, FTP provides a balance between customization, cost, and performance. While OBP can be rapidly deployed, it provides minimal customization and transparency without guaranteeing long-term availability. OBP may also present challenges in interfacing seamlessly with downstream applications in health care settings with variations in pricing and use over time. Lack of customization in OBP can significantly limit its ability to impact outcomes. Finally, potential applications of LLMs in health care, including conversational artificial intelligence, chatbots, summarization, and machine translation, are highlighted. While the 3 implementation pathways discussed in this perspective have the potential to facilitate equitable adoption and democratization of LLMs, transitions between them may be necessary as the needs of health systems evolve. Understanding the economics and trade-offs of these onboarding pathways can guide their strategic adoption and demonstrate value while impacting health care outcomes favorably. 
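The pathway comparison in the preceding viewpoint is essentially a trade-off between compute cost and customization across training from scratch, fine-tuning, and out-of-the-box use. The toy Python sketch below only illustrates how such a back-of-the-envelope comparison could be structured; every GPU-hour figure and price in it is an invented placeholder, not data from the article or from any cloud provider.

```python
# Toy cost comparison of LLM onboarding pathways. All numbers are invented
# placeholders for illustration; real on-demand and spot prices vary by provider,
# region, and hardware, and out-of-the-box services are typically priced per token.

PATHWAYS = {
    # pathway: (gpu_hours, on_demand_usd_per_gpu_hour, spot_usd_per_gpu_hour)
    "train_from_scratch": (200_000, 4.00, 1.60),
    "fine_tune": (2_000, 4.00, 1.60),
    "out_of_the_box": (0, 0.0, 0.0),
}


def pathway_cost(gpu_hours: float, on_demand: float, spot: float) -> dict:
    """Return rough compute cost under on-demand versus spot pricing."""
    return {
        "on_demand_usd": gpu_hours * on_demand,
        "spot_usd": gpu_hours * spot,
    }


for name, params in PATHWAYS.items():
    print(name, pathway_cost(*params))
```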
%M 39541580 %R 10.2196/64226 %U https://www.jmir.org/2024/1/e64226 %U https://doi.org/10.2196/64226 %U http://www.ncbi.nlm.nih.gov/pubmed/39541580 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60226 %T Performance of ChatGPT in Ophthalmic Registration and Clinical Diagnosis: Cross-Sectional Study %A Ming,Shuai %A Yao,Xi %A Guo,Xiaohong %A Guo,Qingge %A Xie,Kunpeng %A Chen,Dandan %A Lei,Bo %+ Department of Ophthalmology, Henan Eye Institute, Henan Eye Hospital, Henan Provincial People's Hospital, No.7 Weiwu Road, Zhengzhou, China, 86 037167120925, bolei99@126.com %K artificial intelligence %K chatbot %K ChatGPT %K ophthalmic registration %K clinical diagnosis %K AI %K cross-sectional study %K eye disease %K eye disorder %K ophthalmology %K health care %K outpatient registration %K clinical %K decision-making %K generative AI %K vision impairment %D 2024 %7 14.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence (AI) chatbots such as ChatGPT are expected to impact vision health care significantly. Their potential to optimize the consultation process and diagnostic capabilities across a range of ophthalmic subspecialties has yet to be fully explored. Objective: This study aims to investigate the performance of AI chatbots in recommending ophthalmic outpatient registration and diagnosing eye diseases within clinical case profiles. Methods: This cross-sectional study used clinical cases from Chinese Standardized Resident Training–Ophthalmology (2nd Edition). For each case, 2 profiles were created: patient with history (Hx) and patient with history and examination (Hx+Ex). These profiles served as independent queries for GPT-3.5 and GPT-4.0 (accessed from March 5 to 18, 2024). Similarly, 3 ophthalmic residents were posed the same profiles in a questionnaire format. The accuracy of recommending ophthalmic subspecialty registration was primarily evaluated using Hx profiles. The accuracy of the top-ranked diagnosis and the accuracy of the diagnosis within the top 3 suggestions (do-not-miss diagnosis) were assessed using Hx+Ex profiles. The gold standard for judgment was the published, official diagnosis. Characteristics of incorrect diagnoses by ChatGPT were also analyzed. Results: A total of 208 clinical profiles from 12 ophthalmic subspecialties were analyzed (104 Hx and 104 Hx+Ex profiles). For Hx profiles, GPT-3.5, GPT-4.0, and residents showed comparable accuracy in registration suggestions (66/104, 63.5%; 81/104, 77.9%; and 72/104, 69.2%, respectively; P=.07), with ocular trauma, retinal diseases, and strabismus and amblyopia achieving the top 3 accuracies. For Hx+Ex profiles, both GPT-4.0 and residents demonstrated higher diagnostic accuracy than GPT-3.5 (62/104, 59.6% and 63/104, 60.6% vs 41/104, 39.4%; P=.003 and P=.001, respectively). Accuracy for do-not-miss diagnoses also improved (79/104, 76% and 68/104, 65.4% vs 51/104, 49%; P<.001 and P=.02, respectively). The highest diagnostic accuracies were observed in glaucoma; lens diseases; and eyelid, lacrimal, and orbital diseases. GPT-4.0 recorded fewer incorrect top-3 diagnoses (25/42, 60% vs 53/63, 84%; P=.005) and more partially correct diagnoses (21/42, 50% vs 7/63, 11%; P<.001) than GPT-3.5, while GPT-3.5 had more completely incorrect (27/63, 43% vs 7/42, 17%; P=.005) and less precise diagnoses (22/63, 35% vs 5/42, 12%; P=.009). Conclusions: GPT-3.5 and GPT-4.0 showed intermediate performance in recommending ophthalmic subspecialties for registration. 
While GPT-3.5 underperformed, GPT-4.0 approached and numerically surpassed residents in differential diagnosis. AI chatbots show promise in facilitating ophthalmic patient registration. However, their integration into diagnostic decision-making requires more validation. %M 39541581 %R 10.2196/60226 %U https://www.jmir.org/2024/1/e60226 %U https://doi.org/10.2196/60226 %U http://www.ncbi.nlm.nih.gov/pubmed/39541581 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58041 %T Enhancement of the Performance of Large Language Models in Diabetes Education through Retrieval-Augmented Generation: Comparative Study %A Wang,Dingqiao %A Liang,Jiangbo %A Ye,Jinguo %A Li,Jingni %A Li,Jingpeng %A Zhang,Qikai %A Hu,Qiuling %A Pan,Caineng %A Wang,Dongliang %A Liu,Zhong %A Shi,Wen %A Shi,Danli %A Li,Fei %A Qu,Bo %A Zheng,Yingfeng %+ State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, 07 Jinsui Road, GuangZhou, 510060, China, 86 139 2228 6455, zhyfeng@mail.sysu.edu.cn %K large language models %K LLMs %K retrieval-augmented generation %K RAG %K GPT-4.0 %K Claude-2 %K Google Bard %K diabetes education %D 2024 %7 8.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) demonstrated advanced performance in processing clinical information. However, commercially available LLMs lack specialized medical knowledge and remain susceptible to generating inaccurate information. Given the need for self-management in diabetes, patients commonly seek information online. We introduce the Retrieval-augmented Information System for Enhancement (RISE) framework and evaluate its performance in enhancing LLMs to provide accurate responses to diabetes-related inquiries. Objective: This study aimed to evaluate the potential of the RISE framework, an information retrieval and augmentation tool, to improve the LLM’s performance to accurately and safely respond to diabetes-related inquiries. Methods: The RISE, an innovative retrieval augmentation framework, comprises 4 steps: rewriting query, information retrieval, summarization, and execution. Using a set of 43 common diabetes-related questions, we evaluated 3 base LLMs (GPT-4, Anthropic Claude 2, Google Bard) and their RISE-enhanced versions respectively. Assessments were conducted by clinicians for accuracy and comprehensiveness and by patients for understandability. Results: The integration of RISE significantly improved the accuracy and comprehensiveness of responses from all 3 base LLMs. On average, the percentage of accurate responses increased by 12% (15/129) with RISE. Specifically, the rates of accurate responses increased by 7% (3/43) for GPT-4, 19% (8/43) for Claude 2, and 9% (4/43) for Google Bard. The framework also enhanced response comprehensiveness, with mean scores improving by 0.44 (SD 0.10). Understandability was also enhanced by 0.19 (SD 0.13) on average. Data collection was conducted from September 30, 2023 to February 5, 2024. Conclusions: The RISE significantly improves LLMs’ performance in responding to diabetes-related inquiries, enhancing accuracy, comprehensiveness, and understandability. 
These improvements have crucial implications for RISE’s future role in patient education and chronic illness self-management, which contributes to relieving medical resource pressures and raising public awareness of medical knowledge. %M 39046096 %R 10.2196/58041 %U https://www.jmir.org/2024/1/e58041 %U https://doi.org/10.2196/58041 %U http://www.ncbi.nlm.nih.gov/pubmed/39046096 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e22769 %T Applications and Concerns of ChatGPT and Other Conversational Large Language Models in Health Care: Systematic Review %A Wang,Leyao %A Wan,Zhiyu %A Ni,Congning %A Song,Qingyuan %A Li,Yang %A Clayton,Ellen %A Malin,Bradley %A Yin,Zhijun %+ Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave Ste 1475, Nashville, TN, 37203, United States, 1 6159363690, zhijun.yin@vumc.org %K large language model %K ChatGPT %K artificial intelligence %K natural language processing %K health care %K summarization %K medical knowledge inquiry %K reliability %K bias %K privacy %D 2024 %7 7.11.2024 %9 Review %J J Med Internet Res %G English %X Background: The launch of ChatGPT (OpenAI) in November 2022 attracted public attention and academic interest to large language models (LLMs), facilitating the emergence of many other innovative LLMs. These LLMs have been applied in various fields, including health care. Numerous studies have since been conducted regarding how to use state-of-the-art LLMs in health-related scenarios. Objective: This review aims to summarize applications of and concerns regarding conversational LLMs in health care and provide an agenda for future research in this field. Methods: We used PubMed, ACM, and the IEEE digital libraries as primary sources for this review. We followed the guidance of PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) to screen and select peer-reviewed research articles that (1) were related to health care applications and conversational LLMs and (2) were published before September 1, 2023, the date when we started paper collection. We investigated these papers and classified them according to their applications and concerns. Results: Our search initially identified 820 papers according to targeted keywords, out of which 65 (7.9%) papers met our criteria and were included in the review. The most popular conversational LLM was ChatGPT (60/65, 92% of papers), followed by Bard (Google LLC; 1/65, 2% of papers), LLaMA (Meta; 1/65, 2% of papers), and other LLMs (6/65, 9% papers). These papers were classified into four categories of applications: (1) summarization, (2) medical knowledge inquiry, (3) prediction (eg, diagnosis, treatment recommendation, and drug synergy), and (4) administration (eg, documentation and information collection), and four categories of concerns: (1) reliability (eg, training data quality, accuracy, interpretability, and consistency in responses), (2) bias, (3) privacy, and (4) public acceptability. There were 49 (75%) papers using LLMs for either summarization or medical knowledge inquiry, or both, and there are 58 (89%) papers expressing concerns about either reliability or bias, or both. We found that conversational LLMs exhibited promising results in summarization and providing general medical knowledge to patients with a relatively high accuracy. However, conversational LLMs such as ChatGPT are not always able to provide reliable answers to complex health-related tasks (eg, diagnosis) that require specialized domain expertise. 
While bias or privacy issues are often noted as concerns, no experiments in our reviewed papers thoughtfully examined how conversational LLMs lead to these issues in health care research. Conclusions: Future studies should focus on improving the reliability of LLM applications in complex health-related tasks, as well as investigating the mechanisms of how LLM applications bring bias and privacy issues. Considering the vast accessibility of LLMs, legal, social, and technical efforts are all needed to address concerns about LLMs to promote, improve, and regularize the application of LLMs in health care. %M 39509695 %R 10.2196/22769 %U https://www.jmir.org/2024/1/e22769 %U https://doi.org/10.2196/22769 %U http://www.ncbi.nlm.nih.gov/pubmed/39509695 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e63430 %T ChatGPT-4 Omni Performance in USMLE Disciplines and Clinical Skills: Comparative Analysis %A Bicknell,Brenton T %A Butler,Danner %A Whalen,Sydney %A Ricks,James %A Dixon,Cory J %A Clark,Abigail B %A Spaedy,Olivia %A Skelton,Adam %A Edupuganti,Neel %A Dzubinski,Lance %A Tate,Hudson %A Dyess,Garrett %A Lindeman,Brenessa %A Lehmann,Lisa Soleymani %K large language model %K ChatGPT %K medical education %K USMLE %K AI in medical education %K medical student resources %K educational technology %K artificial intelligence in medicine %K clinical skills %K LLM %K medical licensing examination %K medical students %K United States Medical Licensing Examination %K ChatGPT 4 Omni %K ChatGPT 4 %K ChatGPT 3.5 %D 2024 %7 6.11.2024 %9 %J JMIR Med Educ %G English %X Background: Recent studies, including those by the National Board of Medical Examiners, have highlighted the remarkable capabilities of recent large language models (LLMs) such as ChatGPT in passing the United States Medical Licensing Examination (USMLE). However, there is a gap in detailed analysis of LLM performance in specific medical content areas, thus limiting an assessment of their potential utility in medical education. Objective: This study aimed to assess and compare the accuracy of successive ChatGPT versions (GPT-3.5, GPT-4, and GPT-4 Omni) in USMLE disciplines, clinical clerkships, and the clinical skills of diagnostics and management. Methods: This study used 750 clinical vignette-based multiple-choice questions to characterize the performance of successive ChatGPT versions (ChatGPT 3.5 [GPT-3.5], ChatGPT 4 [GPT-4], and ChatGPT 4 Omni [GPT-4o]) across USMLE disciplines, clinical clerkships, and in clinical skills (diagnostics and management). Accuracy was assessed using a standardized protocol, with statistical analyses conducted to compare the models’ performances. Results: GPT-4o achieved the highest accuracy across 750 multiple-choice questions at 90.4%, outperforming GPT-4 and GPT-3.5, which scored 81.1% and 60.0%, respectively. GPT-4o’s highest performances were in social sciences (95.5%), behavioral and neuroscience (94.2%), and pharmacology (93.2%). In clinical skills, GPT-4o’s diagnostic accuracy was 92.7% and management accuracy was 88.8%, significantly higher than its predecessors. Notably, both GPT-4o and GPT-4 significantly outperformed the medical student average accuracy of 59.3% (95% CI 58.3‐60.3). Conclusions: GPT-4o’s performance in USMLE disciplines, clinical clerkships, and clinical skills indicates substantial improvements over its predecessors, suggesting significant potential for the use of this technology as an educational aid for medical students. 
These findings underscore the need for careful consideration when integrating LLMs into medical education, emphasizing the importance of structured curricula to guide their appropriate use and the need for ongoing critical analyses to ensure their reliability and effectiveness. %R 10.2196/63430 %U https://mededu.jmir.org/2024/1/e63430 %U https://doi.org/10.2196/63430 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56532 %T The Accuracy and Capability of Artificial Intelligence Solutions in Health Care Examinations and Certificates: Systematic Review and Meta-Analysis %A Waldock,William J %A Zhang,Joe %A Guni,Ahmad %A Nabeel,Ahmad %A Darzi,Ara %A Ashrafian,Hutan %+ Institute of Global Health Innovation, Imperial College London, 10th Floor, Queen Elizabeth Queen Mother Building, Praed Street, London, United Kingdom, 44 07799871597, h.ashrafian@imperial.ac.uk %K large language model %K LLM %K artificial intelligence %K AI %K health care exam %K narrative medical response %K health care examination %K clinical commissioning %K health services %K safety %D 2024 %7 5.11.2024 %9 Review %J J Med Internet Res %G English %X Background: Large language models (LLMs) have dominated public interest due to their apparent capability to accurately replicate learned knowledge in narrative text. However, there is a lack of clarity about the accuracy and capability standards of LLMs in health care examinations. Objective: We conducted a systematic review of LLM accuracy, as tested under health care examination conditions, as compared to known human performance standards. Methods: We quantified the accuracy of LLMs in responding to health care examination questions and evaluated the consistency and quality of study reporting. The search included all papers up until September 10, 2023, with all LLMs published in English journals that report clear LLM accuracy standards. The exclusion criteria were as follows: the assessment was not a health care exam, there was no LLM, there was no evaluation of comparable success accuracy, and the literature was not original research.The literature search included the following Medical Subject Headings (MeSH) terms used in all possible combinations: “artificial intelligence,” “ChatGPT,” “GPT,” “LLM,” “large language model,” “machine learning,” “neural network,” “Generative Pre-trained Transformer,” “Generative Transformer,” “Generative Language Model,” “Generative Model,” “medical exam,” “healthcare exam,” and “clinical exam.” Sensitivity, accuracy, and precision data were extracted, including relevant CIs. Results: The search identified 1673 relevant citations. After removing duplicate results, 1268 (75.8%) papers were screened for titles and abstracts, and 32 (2.5%) studies were included for full-text review. Our meta-analysis suggested that LLMs are able to perform with an overall medical examination accuracy of 0.61 (CI 0.58-0.64) and a United States Medical Licensing Examination (USMLE) accuracy of 0.51 (CI 0.46-0.56), while Chat Generative Pretrained Transformer (ChatGPT) can perform with an overall medical examination accuracy of 0.64 (CI 0.6-0.67). Conclusions: LLMs offer promise to remediate health care demand and staffing challenges by providing accurate and efficient context-specific information to critical decision makers. 
For policy and deployment decisions about LLMs to advance health care, we proposed a new framework called RUBRICC (Regulatory, Usability, Bias, Reliability [Evidence and Safety], Interoperability, Cost, and Codesign–Patient and Public Involvement and Engagement [PPIE]). This presents a valuable opportunity to direct the clinical commissioning of new LLM capabilities into health services, while respecting patient safety considerations. Trial Registration: OSF Registries osf.io/xqzkw; https://osf.io/xqzkw %M 39499913 %R 10.2196/56532 %U https://www.jmir.org/2024/1/e56532 %U https://doi.org/10.2196/56532 %U http://www.ncbi.nlm.nih.gov/pubmed/39499913 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51446 %T The Potential of Artificial Intelligence Tools for Reducing Uncertainty in Medicine and Directions for Medical Education %A Alli,Sauliha Rabia %A Hossain,Soaad Qahhār %A Das,Sunit %A Upshur,Ross %K artificial intelligence %K machine learning %K uncertainty %K clinical decision-making %K medical education %K generative AI %K generative artificial intelligence %D 2024 %7 4.11.2024 %9 %J JMIR Med Educ %G English %X In the field of medicine, uncertainty is inherent. Physicians are asked to make decisions on a daily basis without complete certainty, whether it is in understanding the patient’s problem, performing the physical examination, interpreting the findings of diagnostic tests, or proposing a management plan. The reasons for this uncertainty are widespread, including the lack of knowledge about the patient, individual physician limitations, and the limited predictive power of objective diagnostic tools. This uncertainty poses significant problems in providing competent patient care. Research efforts and teaching are attempts to reduce uncertainty that have now become inherent to medicine. Despite this, uncertainty is rampant. Artificial intelligence (AI) tools, which are being rapidly developed and integrated into practice, may change the way we navigate uncertainty. In their strongest forms, AI tools may have the ability to improve data collection on diseases, patient beliefs, values, and preferences, thereby allowing more time for physician-patient communication. By using methods not previously considered, these tools hold the potential to reduce the uncertainty in medicine, such as those arising due to the lack of clinical information and provider skill and bias. Despite this possibility, there has been considerable resistance to the implementation of AI tools in medical practice. In this viewpoint article, we discuss the impact of AI on medical uncertainty and discuss practical approaches to teaching the use of AI tools in medical schools and residency training programs, including AI ethics, real-world skills, and technological aptitude. 
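The RISE framework described in the diabetes education record above follows a generic retrieval-augmented generation pattern: rewrite the query, retrieve information, summarize it, and execute the final answer. The Python sketch below illustrates that generic pattern under stated assumptions (a toy two-entry corpus, naive keyword retrieval, and a stubbed generate() call); it is not the RISE implementation.

```python
# Minimal retrieval-augmented generation sketch following a generic
# rewrite -> retrieve -> summarize -> execute pattern. The corpus, retriever,
# and generate() stub are illustrative placeholders only.

CORPus_note = "Toy corpus; a real system would index vetted patient education material."
CORPUS = {
    "hypoglycemia": "If blood glucose is low, take fast-acting carbohydrate and recheck.",
    "foot care": "Inspect feet daily and report wounds that do not heal.",
}


def generate(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real chat-completion client."""
    return f"(model answer conditioned on) {prompt[:120]}..."


def rewrite_query(question: str) -> str:
    return generate(f"Rewrite as a search query: {question}")


def retrieve(query: str, k: int = 1) -> list[str]:
    # Naive keyword-overlap retrieval; a real system would use embeddings.
    scored = sorted(
        CORPUS.items(),
        key=lambda kv: -sum(w in kv[1].lower() for w in query.lower().split()),
    )
    return [text for _, text in scored[:k]]


def answer(question: str) -> str:
    query = rewrite_query(question)
    evidence = retrieve(query)
    summary = generate("Summarize: " + " ".join(evidence))
    return generate(f"Question: {question}\nEvidence summary: {summary}\nAnswer:")


if __name__ == "__main__":
    print(answer("What should I do if my blood sugar drops suddenly?"))
```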
%R 10.2196/51446 %U https://mededu.jmir.org/2024/1/e51446 %U https://doi.org/10.2196/51446 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60291 %T Accuracy of Prospective Assessments of 4 Large Language Model Chatbot Responses to Patient Questions About Emergency Care: Experimental Comparative Study %A Yau,Jonathan Yi-Shin %A Saadat,Soheil %A Hsu,Edmund %A Murphy,Linda Suk-Ling %A Roh,Jennifer S %A Suchard,Jeffrey %A Tapia,Antonio %A Wiechmann,Warren %A Langdorf,Mark I %+ Department of Emergency Medicine, University of California - Irvine, 101 the City Drive, Route 128-01, Orange, CA, 92868, United States, 1 7147452663, milangdo@hs.uci.edu %K artificial intelligence %K AI %K chatbots %K generative AI %K natural language processing %K consumer health information %K patient education %K literacy %K emergency care information %K chatbot %K misinformation %K health care %K medical consultation %D 2024 %7 4.11.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Recent surveys indicate that 48% of consumers actively use generative artificial intelligence (AI) for health-related inquiries. Despite widespread adoption and the potential to improve health care access, scant research examines the performance of AI chatbot responses regarding emergency care advice. Objective: We assessed the quality of AI chatbot responses to common emergency care questions. We sought to determine qualitative differences in responses from 4 free-access AI chatbots, for 10 different serious and benign emergency conditions. Methods: We created 10 emergency care questions that we fed into the free-access versions of ChatGPT 3.5 (OpenAI), Google Bard, Bing AI Chat (Microsoft), and Claude AI (Anthropic) on November 26, 2023. Each response was graded by 5 board-certified emergency medicine (EM) faculty for 8 domains of percentage accuracy, presence of dangerous information, factual accuracy, clarity, completeness, understandability, source reliability, and source relevancy. We determined the correct, complete response to the 10 questions from reputable and scholarly emergency medical references. These were compiled by an EM resident physician. For the readability of the chatbot responses, we used the Flesch-Kincaid Grade Level of each response from readability statistics embedded in Microsoft Word. Differences between chatbots were determined by the chi-square test. Results: Each of the 4 chatbots’ responses to the 10 clinical questions were scored across 8 domains by 5 EM faculty, for 400 assessments for each chatbot. Together, the 4 chatbots had the best performance in clarity and understandability (both 85%), intermediate performance in accuracy and completeness (both 50%), and poor performance (10%) for source relevance and reliability (mostly unreported). Chatbots contained dangerous information in 5% to 35% of responses, with no statistical difference between chatbots on this metric (P=.24). ChatGPT, Google Bard, and Claude AI had similar performances across 6 out of 8 domains. Only Bing AI performed better with more identified or relevant sources (40%; the others had 0%-10%). Flesch-Kincaid reading level was 7.7-8.9 grade for all chatbots, except ChatGPT at 10.8, which were all too advanced for average emergency patients. Responses included both dangerous (eg, starting cardiopulmonary resuscitation with no pulse check) and generally inappropriate advice (eg, loosening the collar to improve breathing without evidence of airway compromise). 
Conclusions: AI chatbots, though ubiquitous, have significant deficiencies in EM patient advice, despite relatively consistent performance. Information for when to seek urgent or emergent care is frequently incomplete and inaccurate, and patients may be unaware of misinformation. Sources are not generally provided. Patients who use AI to guide health care decisions assume potential risks. AI chatbots for health should be subject to further research, refinement, and regulation. We strongly recommend proper medical consultation to prevent potential adverse outcomes. %R 10.2196/60291 %U https://www.jmir.org/2024/1/e60291 %U https://doi.org/10.2196/60291 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e51095 %T Assessing the Role of the Generative Pretrained Transformer (GPT) in Alzheimer’s Disease Management: Comparative Study of Neurologist- and Artificial Intelligence–Generated Responses %A Zeng,Jiaqi %A Zou,Xiaoyi %A Li,Shirong %A Tang,Yao %A Teng,Sisi %A Li,Huanhuan %A Wang,Changyu %A Wu,Yuxuan %A Zhang,Luyao %A Zhong,Yunheng %A Liu,Jialin %A Liu,Siru %+ Department of Medical Informatics, West China Medical School, No 37 Guoxue Road, Chengdu, 610041, China, 86 28 85422306, Dljl8@163.com %K Alzheimer's disease %K artificial intelligence %K AI %K large language model %K LLM %K Generative Pretrained Transformer %K GPT %K ChatGPT %K patient information %D 2024 %7 31.10.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Alzheimer’s disease (AD) is a progressive neurodegenerative disorder posing challenges to patients, caregivers, and society. Accessible and accurate information is crucial for effective AD management. Objective: This study aimed to evaluate the accuracy, comprehensibility, clarity, and usefulness of the Generative Pretrained Transformer’s (GPT) answers concerning the management and caregiving of patients with AD. Methods: In total, 14 questions related to the prevention, treatment, and care of AD were identified and posed to GPT-3.5 and GPT-4 in Chinese and English, respectively, and 4 respondent neurologists were asked to answer them. We generated 8 sets of responses (total 112) and randomly coded them in answer sheets. Next, 5 evaluator neurologists and 5 family members of patients were asked to rate the 112 responses using separate 5-point Likert scales. We evaluated the quality of the responses using a set of 8 questions rated on a 5-point Likert scale. To gauge comprehensibility and participant satisfaction, we included 3 questions dedicated to each aspect within the same set of 8 questions. Results: As of April 10, 2023, the 5 evaluator neurologists and 5 family members of patients with AD rated the 112 responses: GPT-3.5: n=28, 25%, responses; GPT-4: n=28, 25%, responses; respondent neurologists: 56 (50%) responses. The top 5 (4.5%) responses rated by evaluator neurologists had 4 (80%) GPT (GPT-3.5+GPT-4) responses and 1 (20%) respondent neurologist’s response. For the top 5 (4.5%) responses rated by patients’ family members, all but the third response were GPT responses. Based on the evaluation by neurologists, the neurologist-generated responses achieved a mean score of 3.9 (SD 0.7), while the GPT-generated responses scored significantly higher (mean 4.4, SD 0.6; P<.001). Language and model analyses revealed no significant differences in response quality between the GPT-3.5 and GPT-4 models (GPT-3.5: mean 4.3, SD 0.7; GPT-4: mean 4.4, SD 0.5; P=.51). 
However, English responses outperformed Chinese responses in terms of comprehensibility (Chinese responses: mean 4.1, SD 0.7; English responses: mean 4.6, SD 0.5; P=.005) and participant satisfaction (Chinese responses: mean 4.2, SD 0.8; English responses: mean 4.5, SD 0.5; P=.04). According to the evaluator neurologists’ review, Chinese responses had a mean score of 4.4 (SD 0.6), whereas English responses had a mean score of 4.5 (SD 0.5; P=.002). As for the family members of patients with AD, no significant differences were observed between GPT and neurologists, GPT-3.5 and GPT-4, or Chinese and English responses. Conclusions: GPT can provide patient education materials on AD for patients, their families and caregivers, nurses, and neurologists. This capability can contribute to the effective health care management of patients with AD, leading to enhanced patient outcomes. %M 39481104 %R 10.2196/51095 %U https://www.jmir.org/2024/1/e51095 %U https://doi.org/10.2196/51095 %U http://www.ncbi.nlm.nih.gov/pubmed/39481104 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e57124 %T Leveraging Artificial Intelligence and Data Science for Integration of Social Determinants of Health in Emergency Medicine: Scoping Review %A Abbott,Ethan E %A Apakama,Donald %A Richardson,Lynne D %A Chan,Lili %A Nadkarni,Girish N %K data science %K social determinants of health %K natural language processing %K artificial intelligence %K NLP %K machine learning %K review methods %K review methodology %K scoping review %K emergency medicine %K PRISMA %D 2024 %7 30.10.2024 %9 %J JMIR Med Inform %G English %X Background: Social determinants of health (SDOH) are critical drivers of health disparities and patient outcomes. However, accessing and collecting patient-level SDOH data can be operationally challenging in the emergency department (ED) clinical setting, requiring innovative approaches. Objective: This scoping review examines the potential of AI and data science for modeling, extraction, and incorporation of SDOH data specifically within EDs, further identifying areas for advancement and investigation. Methods: We conducted a standardized search for studies published between 2015 and 2022, across Medline (Ovid), Embase (Ovid), CINAHL, Web of Science, and ERIC databases. We focused on identifying studies using AI or data science related to SDOH within emergency care contexts or conditions. Two specialized reviewers in emergency medicine (EM) and clinical informatics independently assessed each article, resolving discrepancies through iterative reviews and discussion. We then extracted data covering study details, methodologies, patient demographics, care settings, and principal outcomes. Results: Of the 1047 studies screened, 26 met the inclusion criteria. Notably, 9 out of 26 (35%) studies were solely concentrated on ED patients. Conditions studied spanned broad EM complaints and included sepsis, acute myocardial infarction, and asthma. The majority of studies (n=16) explored multiple SDOH domains, with homelessness/housing insecurity and neighborhood/built environment predominating. Machine learning (ML) techniques were used in 23 of 26 studies, with natural language processing (NLP) being the most commonly used approach (n=11). Rule-based NLP (n=5), deep learning (n=2), and pattern matching (n=4) were the most commonly used NLP techniques. 
NLP models in the reviewed studies displayed significant predictive performance with outcomes, with F1-scores ranging between 0.40 and 0.75 and specificities nearing 95.9%. Conclusions: Although in its infancy, the convergence of AI and data science techniques, especially ML and NLP, with SDOH in EM offers transformative possibilities for better usage and integration of social data into clinical care and research. With a significant focus on the ED and notable NLP model performance, there is an imperative to standardize SDOH data collection, refine algorithms for diverse patient groups, and champion interdisciplinary synergies. These efforts aim to harness SDOH data optimally, enhancing patient care and mitigating health disparities. Our research underscores the vital need for continued investigation in this domain. %R 10.2196/57124 %U https://medinform.jmir.org/2024/1/e57124 %U https://doi.org/10.2196/57124 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e60939 %T Ensuring Accuracy and Equity in Vaccination Information From ChatGPT and CDC: Mixed-Methods Cross-Language Evaluation %A Joshi,Saubhagya %A Ha,Eunbin %A Amaya,Andee %A Mendoza,Melissa %A Rivera,Yonaira %A Singh,Vivek K %+ School of Communication & Information, Rutgers University, 4 Huntington Street, New Brunswick, NJ, 08901, United States, 1 848 932 7588, v.singh@rutgers.edu %K vaccination %K health equity %K multilingualism %K language equity %K health literacy %K online health information %K conversational agents %K artificial intelligence %K large language models %K health information %K public health %D 2024 %7 30.10.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: In the digital age, large language models (LLMs) like ChatGPT have emerged as important sources of health care information. Their interactive capabilities offer promise for enhancing health access, particularly for groups facing traditional barriers such as insurance and language constraints. Despite their growing public health use, with millions of medical queries processed weekly, the quality of LLM-provided information remains inconsistent. Previous studies have predominantly assessed ChatGPT’s English responses, overlooking the needs of non–English speakers in the United States. This study addresses this gap by evaluating the quality and linguistic parity of vaccination information from ChatGPT and the Centers for Disease Control and Prevention (CDC), emphasizing health equity. Objective: This study aims to assess the quality and language equity of vaccination information provided by ChatGPT and the CDC in English and Spanish. It highlights the critical need for cross-language evaluation to ensure equitable health information access for all linguistic groups. Methods: We conducted a comparative analysis of ChatGPT’s and CDC’s responses to frequently asked vaccination-related questions in both languages. The evaluation encompassed quantitative and qualitative assessments of accuracy, readability, and understandability. Accuracy was gauged by the perceived level of misinformation; readability, by the Flesch-Kincaid grade level and readability score; and understandability, by items from the National Institutes of Health’s Patient Education Materials Assessment Tool (PEMAT) instrument. Results: The study found that both ChatGPT and CDC provided mostly accurate and understandable (eg, scores over 95 out of 100) responses. 
However, Flesch-Kincaid grade levels often exceeded the American Medical Association’s recommended levels, particularly in English (eg, average grade level in English for ChatGPT=12.84, Spanish=7.93, recommended=6). CDC responses outperformed ChatGPT in readability across both languages. Notably, some Spanish responses appeared to be direct translations from English, leading to unnatural phrasing. The findings underscore the potential and challenges of using ChatGPT for health care access. Conclusions: ChatGPT holds potential as a health information resource but requires improvements in readability and linguistic equity to be truly effective for diverse populations. Crucially, the default user experience with ChatGPT, typically encountered by those without advanced language and prompting skills, can significantly shape health perceptions. This is vital from a public health standpoint, as the majority of users will interact with LLMs in their most accessible form. Ensuring that default responses are accurate, understandable, and equitable is imperative for fostering informed health decisions across diverse communities. %M 39476380 %R 10.2196/60939 %U https://formative.jmir.org/2024/1/e60939 %U https://doi.org/10.2196/60939 %U http://www.ncbi.nlm.nih.gov/pubmed/39476380 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e59501 %T Role of Synchronous, Moderated, and Anonymous Peer Support Chats on Reducing Momentary Loneliness in Older Adults: Retrospective Observational Study %A Dana,Zara %A Nagra,Harpreet %A Kilby,Kimberly %+ Supportiv, 2222 Harold Way, Berkeley, CA, 94704, United States, 1 800 845 0015, harpreet@supportiv.com %K digital peer support %K social loneliness %K chat-based interactions %K older adults %D 2024 %7 25.10.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Older adults have a high rate of loneliness, which contributes to increased psychosocial risk, medical morbidity, and mortality. Digital emotional support interventions provide a convenient and rapid avenue for additional support. Digital peer support interventions for emotional struggles contrast the usual provider-based clinical care models because they offer more accessible, direct support for empowerment, highlighting the users’ autonomy, competence, and relatedness. Objective: This study aims to examine a novel anonymous and synchronous peer-to-peer digital chat service facilitated by trained human moderators. The experience of a cohort of 699 adults aged ≥65 years was analyzed to determine (1) if participation, alone, led to measurable aggregate change in momentary loneliness and optimism and (2) the impact of peers on momentary loneliness and optimism. Methods: Participants were each prompted with a single question: “What’s your struggle?” Using a proprietary artificial intelligence model, the free-text response automatched the respondent based on their self-expressed emotional struggle to peers and a chat moderator. Exchanged messages were analyzed to quantitatively measure the change in momentary loneliness and optimism using a third-party, public, natural language processing model (GPT-4 [OpenAI]). The sentiment change analysis was initially performed at the individual level and then averaged across all users with similar emotion types to produce a statistically significant (P<.05) collective trend per emotion. 
To evaluate the peer impact on momentary loneliness and optimism, we performed propensity matching to align the moderator+single user and moderator+small group chat cohorts and then compare the emotion trends between the matched cohorts. Results: Loneliness and optimism trends significantly improved after 8 (P=.02) to 9 minutes (P=.03) into the chat. We observed a significant improvement in the momentary loneliness and optimism trends between the moderator+small group compared to the moderator+single user chat cohort after 19 (P=.049) and 21 minutes (P=.04) for optimism and loneliness, respectively. Conclusions: Chat-based peer support may be a viable intervention to help address momentary loneliness in older adults and present an alternative to traditional care. The promising results support the need for further study to expand the evidence for such cost-effective options. %M 39453688 %R 10.2196/59501 %U https://formative.jmir.org/2024/1/e59501 %U https://doi.org/10.2196/59501 %U http://www.ncbi.nlm.nih.gov/pubmed/39453688 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e58418 %T Aligning Large Language Models for Enhancing Psychiatric Interviews Through Symptom Delineation and Summarization: Pilot Study %A So,Jae-hee %A Chang,Joonhwan %A Kim,Eunji %A Na,Junho %A Choi,JiYeon %A Sohn,Jy-yong %A Kim,Byung-Hoon %A Chu,Sang Hui %+ Department of Applied Statistics, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul, 03722, Republic of Korea, 82 2 2123 2472, jysohn1108@gmail.com %K large language model %K psychiatric interview %K interview summarization %K symptom delineation %D 2024 %7 24.10.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Recent advancements in large language models (LLMs) have accelerated their use across various domains. Psychiatric interviews, which are goal-oriented and structured, represent a significantly underexplored area where LLMs can provide substantial value. In this study, we explore the application of LLMs to enhance psychiatric interviews by analyzing counseling data from North Korean defectors who have experienced traumatic events and mental health issues. Objective: This study aims to investigate whether LLMs can (1) delineate parts of the conversation that suggest psychiatric symptoms and identify those symptoms, and (2) summarize stressors and symptoms based on the interview dialogue transcript. Methods: Given the interview transcripts, we align the LLMs to perform 3 tasks: (1) extracting stressors from the transcripts, (2) delineating symptoms and their indicative sections, and (3) summarizing the patients based on the extracted stressors and symptoms. These 3 tasks address the 2 objectives, where delineating symptoms is based on the output from the second task, and generating the summary of the interview incorporates the outputs from all 3 tasks. In this context, the transcript data were labeled by mental health experts for the training and evaluation of the LLMs. Results: First, we present the performance of LLMs in estimating (1) the transcript sections related to psychiatric symptoms and (2) the names of the corresponding symptoms. In the zero-shot inference setting using the GPT-4 Turbo model, 73 out of 102 transcript segments demonstrated a recall mid-token distance d<20 for estimating the sections associated with the symptoms. For evaluating the names of the corresponding symptoms, the fine-tuning method demonstrates a performance advantage over the zero-shot inference setting of the GPT-4 Turbo model. 
On average, the fine-tuning method achieves an accuracy of 0.82, a precision of 0.83, a recall of 0.82, and an F1-score of 0.82. Second, the transcripts are used to generate summaries for each interviewee using LLMs. This generative task was evaluated using metrics such as Generative Evaluation (G-Eval) and Bidirectional Encoder Representations from Transformers Score (BERTScore). The summaries generated by the GPT-4 Turbo model, utilizing both symptom and stressor information, achieve high average G-Eval scores: coherence of 4.66, consistency of 4.73, fluency of 2.16, and relevance of 4.67. Furthermore, it is noted that the use of retrieval-augmented generation did not lead to a significant improvement in performance. Conclusions: LLMs, using either (1) appropriate prompting techniques or (2) fine-tuning methods with data labeled by mental health experts, achieved an accuracy of over 0.8 for the symptom delineation task when measured across all segments in the transcript. Additionally, they attained a G-Eval score of over 4.6 for coherence in the summarization task. This research contributes to the emerging field of applying LLMs in psychiatric interviews and demonstrates their potential effectiveness in assisting mental health practitioners. %M 39447159 %R 10.2196/58418 %U https://formative.jmir.org/2024/1/e58418 %U https://doi.org/10.2196/58418 %U http://www.ncbi.nlm.nih.gov/pubmed/39447159 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54242 %T Gender Bias in AI's Perception of Cardiovascular Risk %A Achtari,Margaux %A Salihu,Adil %A Muller,Olivier %A Abbé,Emmanuel %A Clair,Carole %A Schwarz,Joëlle %A Fournier,Stephane %+ Department of Cardiology, Lausanne University Hospital and University of Lausanne, 21 Rue Du Bugnon, Lausanne, CH-1011, Switzerland, 41 21 314 00 12, stephane.fournier@chuv.ch %K artificial intelligence %K gender equity %K coronary artery disease %K AI %K cardiovascular %K risk %K CAD %K artery %K coronary %K chatbot: health care %K men: women %K gender bias %K gender %D 2024 %7 22.10.2024 %9 Research Letter %J J Med Internet Res %G English %X The study investigated gender bias in GPT-4’s assessment of coronary artery disease risk by presenting identical clinical vignettes of men and women with and without psychiatric comorbidities. Results suggest that psychiatric conditions may influence GPT-4’s coronary artery disease risk assessment among men and women. %M 39437384 %R 10.2196/54242 %U https://www.jmir.org/2024/1/e54242 %U https://doi.org/10.2196/54242 %U http://www.ncbi.nlm.nih.gov/pubmed/39437384 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e60164 %T Health Care Language Models and Their Fine-Tuning for Information Extraction: Scoping Review %A Nunes,Miguel %A Bone,Joao %A Ferreira,Joao C %A Elvas,Luis B %+ Department of Logistics, Molde, University College, Britvegen 2, Noruega, Molde, 6410, Norway, 47 969152334, luis.m.elvas@himolde.no %K language model %K information extraction %K healthcare %K PRISMA-ScR %K scoping literature review %K transformers %K natural language processing %K European Portuguese %D 2024 %7 21.10.2024 %9 Review %J JMIR Med Inform %G English %X Background: In response to the intricate language, specialized terminology outside everyday life, and the frequent presence of abbreviations and acronyms inherent in health care text data, domain adaptation techniques have emerged as crucial to transformer-based models. 
This refinement in the knowledge of the language models (LMs) allows for a better understanding of the medical textual data, which results in an improvement in medical downstream tasks, such as information extraction (IE). We have identified a gap in the literature regarding health care LMs. Therefore, this study presents a scoping literature review investigating domain adaptation methods for transformers in health care, differentiating between English and non-English languages, focusing on Portuguese. More specifically, we investigated the development of health care LMs, with the aim of comparing Portuguese with other more developed languages to guide the path of a non-English language with fewer resources. Objective: This study aimed to research health care IE models, regardless of language, to understand the efficacy of transformers and which medical entities are most commonly extracted. Methods: This scoping review was conducted using the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) methodology on Scopus and Web of Science Core Collection databases. Only studies that mentioned the creation of health care LMs or health care IE models were included, while large language models (LLMs) were excluded. The latter were not included since we wanted to research LMs and not LLMs, which are architecturally different and have distinct purposes. Results: Our search query retrieved 137 studies, 60 of which met the inclusion criteria, and none of them were systematic literature reviews. English and Chinese are the languages with the most health care LMs developed. These languages already have disease-specific LMs, while others only have general health care LMs. European Portuguese does not have any public health care LM and should take examples from other languages to develop, first, general health care LMs and then, in an advanced phase, disease-specific LMs. Regarding IE models, transformers were the most commonly used method, and named entity recognition was the most popular topic, with only a few studies mentioning Assertion Status or addressing medical lexical problems. The most extracted entities were diagnosis, posology, and symptoms. Conclusions: The findings indicate that domain adaptation is beneficial, achieving better results in downstream tasks. Our analysis allowed us to understand that the use of transformers is more developed for the English and Chinese languages. European Portuguese lacks relevant studies and should draw examples from other non-English languages to develop these models and drive progress in AI. Health care professionals could benefit from highlighting medically relevant information and optimizing the reading of the textual data, or this information could be used to create patient medical timelines, allowing for profiling. 
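The domain adaptation this review examines, continued pretraining of a transformer language model on in-domain clinical text before fine-tuning it for information extraction, can be outlined with the Hugging Face libraries. A minimal sketch under assumed choices (a generic bert-base-uncased checkpoint and a two-sentence toy corpus); a real health care LM would be trained on a large de-identified clinical corpus:

    # Minimal sketch of domain-adaptive (continued) masked-language-model
    # pretraining of a transformer LM on clinical text.
    from datasets import Dataset
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling, Trainer,
                              TrainingArguments)

    base = "bert-base-uncased"  # placeholder; a Portuguese or multilingual base model could be used
    tokenizer = AutoTokenizer.from_pretrained(base)
    model = AutoModelForMaskedLM.from_pretrained(base)

    # Toy in-domain corpus; in practice, a large de-identified clinical text collection.
    corpus = Dataset.from_dict({"text": [
        "Patient reports dyspnea on exertion; started furosemide 40 mg daily.",
        "No evidence of metastatic disease on follow-up CT of the chest.",
    ]})
    tokenized = corpus.map(
        lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
        batched=True, remove_columns=["text"],
    )

    collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
    trainer = Trainer(
        model=model,
        args=TrainingArguments(output_dir="clinical-lm", num_train_epochs=1,
                               per_device_train_batch_size=8),
        train_dataset=tokenized,
        data_collator=collator,
    )
    trainer.train()  # the adapted LM is then fine-tuned for IE tasks such as NER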
%M 39432345 %R 10.2196/60164 %U https://medinform.jmir.org/2024/1/e60164 %U https://doi.org/10.2196/60164 %U http://www.ncbi.nlm.nih.gov/pubmed/39432345 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e57400 %T Large Language Models for Mental Health Applications: Systematic Review %A Guo,Zhijun %A Lai,Alvina %A Thygesen,Johan H %A Farrington,Joseph %A Keen,Thomas %A Li,Kezhi %+ Institute of Health Informatics University College, London, 222 Euston Road, London, NW1 2DA, United Kingdom, 44 7859 995590, ken.li@ucl.ac.uk %K large language models %K mental health %K digital health care %K ChatGPT %K Bidirectional Encoder Representations from Transformers %K BERT %D 2024 %7 18.10.2024 %9 Review %J JMIR Ment Health %G English %X Background: Large language models (LLMs) are advanced artificial neural networks trained on extensive datasets to accurately understand and generate natural language. While they have received much attention and demonstrated potential in digital health, their application in mental health, particularly in clinical settings, has generated considerable debate. Objective: This systematic review aims to critically assess the use of LLMs in mental health, specifically focusing on their applicability and efficacy in early screening, digital interventions, and clinical settings. By systematically collating and assessing the evidence from current studies, our work analyzes models, methodologies, data sources, and outcomes, thereby highlighting the potential of LLMs in mental health, the challenges they present, and the prospects for their clinical use. Methods: Adhering to the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines, this review searched 5 open-access databases: MEDLINE (accessed by PubMed), IEEE Xplore, Scopus, JMIR, and ACM Digital Library. Keywords used were (mental health OR mental illness OR mental disorder OR psychiatry) AND (large language models). This study included articles published between January 1, 2017, and April 30, 2024, and excluded articles published in languages other than English. Results: In total, 40 articles were evaluated, including 15 (38%) articles on mental health conditions and suicidal ideation detection through text analysis, 7 (18%) on the use of LLMs as mental health conversational agents, and 18 (45%) on other applications and evaluations of LLMs in mental health. LLMs show good effectiveness in detecting mental health issues and providing accessible, destigmatized eHealth services. However, assessments also indicate that the current risks associated with clinical use might surpass their benefits. These risks include inconsistencies in generated text; the production of hallucinations; and the absence of a comprehensive, benchmarked ethical framework. Conclusions: This systematic review examines the clinical applications of LLMs in mental health, highlighting their potential and inherent risks. The study identifies several issues: the lack of multilingual datasets annotated by experts, concerns regarding the accuracy and reliability of generated content, challenges in interpretability due to the “black box” nature of LLMs, and ongoing ethical dilemmas. These ethical concerns include the absence of a clear, benchmarked ethical framework; data privacy issues; and the potential for overreliance on LLMs by both physicians and patients, which could compromise traditional medical practices. As a result, LLMs should not be considered substitutes for professional mental health services. 
However, the rapid development of LLMs underscores their potential as valuable clinical aids, emphasizing the need for continued research and development in this area. Trial Registration: PROSPERO CRD42024508617; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=508617 %M 39423368 %R 10.2196/57400 %U https://mental.jmir.org/2024/1/e57400 %U https://doi.org/10.2196/57400 %U http://www.ncbi.nlm.nih.gov/pubmed/39423368 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e62963 %T Describing the Framework for AI Tool Assessment in Mental Health and Applying It to a Generative AI Obsessive-Compulsive Disorder Platform: Tutorial %A Golden,Ashleigh %A Aboujaoude,Elias %+ Department of Psychiatry and Behavioral Sciences, Stanford University School of Medicine, 401 Quarry Rd, Stanford, CA, 94304, United States, 1 650 498 9111, eaboujaoude@stanford.edu %K artificial intelligence %K ChatGPT %K generative artificial intelligence %K generative AI %K large language model %K chatbots %K machine learning %K digital health %K telemedicine %K psychotherapy %K obsessive-compulsive disorder %D 2024 %7 18.10.2024 %9 Tutorial %J JMIR Form Res %G English %X As artificial intelligence (AI) technologies occupy a bigger role in psychiatric and psychological care and become the object of increased research attention, industry investment, and public scrutiny, tools for evaluating their clinical, ethical, and user-centricity standards have become essential. In this paper, we first review the history of rating systems used to evaluate AI mental health interventions. We then describe the recently introduced Framework for AI Tool Assessment in Mental Health (FAITA-Mental Health), whose scoring system allows users to grade AI mental health platforms on key domains, including credibility, user experience, crisis management, user agency, health equity, and transparency. Finally, we demonstrate the use of FAITA-Mental Health scale by systematically applying it to OCD Coach, a generative AI tool readily available on the ChatGPT store and designed to help manage the symptoms of obsessive-compulsive disorder. The results offer insights into the utility and limitations of FAITA-Mental Health when applied to “real-world” generative AI platforms in the mental health space, suggesting that the framework effectively identifies key strengths and gaps in AI-driven mental health tools, particularly in areas such as credibility, user experience, and acute crisis management. The results also highlight the need for stringent standards to guide AI integration into mental health care in a manner that is not only effective but also safe and protective of the users’ rights and welfare. %M 39423001 %R 10.2196/62963 %U https://formative.jmir.org/2024/1/e62963 %U https://doi.org/10.2196/62963 %U http://www.ncbi.nlm.nih.gov/pubmed/39423001 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e58011 %T An Ethical Perspective on the Democratization of Mental Health With Generative AI %A Elyoseph,Zohar %A Gur,Tamar %A Haber,Yuval %A Simon,Tomer %A Angert,Tal %A Navon,Yuval %A Tal,Amir %A Asman,Oren %K ethics %K generative artificial intelligence %K generative AI %K mental health %K ChatGPT %K large language model %K LLM %K digital mental health %K machine learning %K AI %K technology %K accessibility %K knowledge %K GenAI %D 2024 %7 17.10.2024 %9 %J JMIR Ment Health %G English %X Knowledge has become more open and accessible to a large audience with the “democratization of information” facilitated by technology. 
This paper provides a sociohistorical perspective for the theme issue “Responsible Design, Integration, and Use of Generative AI in Mental Health.” It evaluates ethical considerations in using generative artificial intelligence (GenAI) for the democratization of mental health knowledge and practice. It explores the historical context of democratizing information, transitioning from restricted access to widespread availability due to the internet, open-source movements, and most recently, GenAI technologies such as large language models. The paper highlights why GenAI technologies represent a new phase in the democratization movement, offering unparalleled access to highly advanced technology as well as information. In the realm of mental health, this requires delicate and nuanced ethical deliberation. Including GenAI in mental health may allow, among other things, improved accessibility to mental health care, personalized responses, and conceptual flexibility, and could facilitate a flattening of traditional hierarchies between health care providers and patients. At the same time, it also entails significant risks and challenges that must be carefully addressed. To navigate these complexities, the paper proposes a strategic questionnaire for assessing artificial intelligence–based mental health applications. This tool evaluates both the benefits and the risks, emphasizing the need for a balanced and ethical approach to GenAI integration in mental health. The paper calls for a cautious yet positive approach to GenAI in mental health, advocating for the active engagement of mental health professionals in guiding GenAI development. It emphasizes the importance of ensuring that GenAI advancements are not only technologically sound but also ethically grounded and patient-centered. 
%R 10.2196/58011 %U https://mental.jmir.org/2024/1/e58011 %U https://doi.org/10.2196/58011 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60695 %T Performance of Retrieval-Augmented Large Language Models to Recommend Head and Neck Cancer Clinical Trials %A Hung,Tony K W %A Kuperman,Gilad J %A Sherman,Eric J %A Ho,Alan L %A Weng,Chunhua %A Pfister,David G %A Mao,Jun J %+ Memorial Sloan Kettering Cancer Center, 530 E 74th St, New York, NY, 10021, United States, 1 646 608 4127, hungt@mskcc.org %K large language model %K LLM %K ChatGPT %K GPT-4 %K artificial intelligence %K AI %K clinical trials %K decision support %K LookUpTrials %K cancer care delivery %K head and neck oncology %K head and neck cancer %K retrieval augmented generation %D 2024 %7 15.10.2024 %9 Research Letter %J J Med Internet Res %G English %X %M 39405514 %R 10.2196/60695 %U https://www.jmir.org/2024/1/e60695 %U https://doi.org/10.2196/60695 %U http://www.ncbi.nlm.nih.gov/pubmed/39405514 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e53505 %T The Dual Nature of AI in Information Dissemination: Ethical Considerations %A Germani,Federico %A Spitale,Giovanni %A Biller-Andorno,Nikola %+ Institute of Biomedical Ethics and History of Medicine, University of Zurich, Switzerland, Winterthurerstrasse 30, Zurich, 8006, Switzerland, 41 44 634 40 81, biller-andorno@ibme.uzh.ch %K AI %K bioethics %K infodemic management %K disinformation %K artificial intelligence %K ethics %K ethical %K infodemic %K infodemics %K public health %K misinformation %K information dissemination %K information literacy %D 2024 %7 15.10.2024 %9 Viewpoint %J JMIR AI %G English %X Infodemics pose significant dangers to public health and to the societal fabric, as the spread of misinformation can have far-reaching consequences. While artificial intelligence (AI) systems have the potential to craft compelling and valuable information campaigns with positive repercussions for public health and democracy, concerns have arisen regarding the potential use of AI systems to generate convincing disinformation. The consequences of this dual nature of AI, capable of both illuminating and obscuring the information landscape, are complex and multifaceted. We contend that the rapid integration of AI into society demands a comprehensive understanding of its ethical implications and the development of strategies to harness its potential for the greater good while mitigating harm. Thus, in this paper we explore the ethical dimensions of AI’s role in information dissemination and impact on public health, arguing that potential strategies to deal with AI and disinformation encompass generating regulated and transparent data sets used to train AI models, regulating content outputs, and promoting information literacy. 
%M 39405099 %R 10.2196/53505 %U https://ai.jmir.org/2024/1/e53505 %U https://doi.org/10.2196/53505 %U http://www.ncbi.nlm.nih.gov/pubmed/39405099 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e52974 %T Behavioral Nudging With Generative AI for Content Development in SMS Health Care Interventions: Case Study %A Harrison,Rachel M %A Lapteva,Ekaterina %A Bibin,Anton %+ GenAI Lab, Ophiuchus LLC, 1111B S Governors Ave, STE 7359, Dover, DE, 19904, United States, 1 302 526 0926, rae@ophiuchus.ai %K generative artificial intelligence %K generative AI %K prompt engineering %K large language models %K GPT %K content design %K brief message interventions %K mHealth %K behavior change techniques %K medication adherence %K type 2 diabetes %D 2024 %7 15.10.2024 %9 Original Paper %J JMIR AI %G English %X Background: Brief message interventions have demonstrated immense promise in health care, yet the development of these messages has suffered from a dearth of transparency and a scarcity of publicly accessible data sets. Moreover, the researcher-driven content creation process has raised resource allocation issues, necessitating a more efficient and transparent approach to content development. Objective: This research sets out to address the challenges of content development for SMS interventions by showcasing the use of generative artificial intelligence (AI) as a tool for content creation, transparently explaining the prompt design and content generation process, and providing the largest publicly available data set of brief messages and source code for future replication of our process. Methods: Leveraging the pretrained large language model GPT-3.5 (OpenAI), we generate a collection of messages in the context of medication adherence for individuals with type 2 diabetes using evidence-derived behavior change techniques identified in a prior systematic review. We create an attributed prompt designed to adhere to content (readability and tone) and SMS (character count and encoder type) standards while encouraging message variability to reflect differences in behavior change techniques. Results: We deliver the most extensive repository of brief messages for a singular health care intervention and the first library of messages crafted with generative AI. In total, our method yields a data set comprising 1150 messages, with 89.91% (n=1034) meeting character length requirements and 80.7% (n=928) meeting readability requirements. Furthermore, our analysis reveals that all messages exhibit diversity comparable to an existing publicly available data set created under the same theoretical framework for a similar setting. Conclusions: This research provides a novel approach to content creation for health care interventions using state-of-the-art generative AI tools. Future research is needed to assess the generated content for ethical, safety, and research standards, as well as to determine whether the intervention is successful in improving the target behaviors. 
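The generation-and-check workflow described in the abstract above, prompting a general-purpose LLM for a brief message and verifying SMS constraints such as character count, can be sketched as follows. This is an illustrative outline using the OpenAI Python client, not the authors' released code; the prompt wording, model name, and acceptance thresholds are assumptions:

    # Illustrative sketch: generate a brief adherence message with an LLM,
    # then check the SMS character-count constraint before accepting it.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    prompt = (
        "Write one supportive SMS (max 160 characters, plain language, "
        "friendly tone) encouraging a person with type 2 diabetes to take "
        "their medication today. Use the behavior change technique "
        "'prompts/cues'. Return only the message text."
    )

    response = client.chat.completions.create(
        model="gpt-3.5-turbo",          # assumed model
        messages=[{"role": "user", "content": prompt}],
        temperature=0.9,                # encourage variability across messages
    )
    message = response.choices[0].message.content.strip()

    # Basic SMS checks: length, plus plain ASCII as a conservative stand-in
    # for an SMS-friendly character set.
    if len(message) <= 160 and message.isascii():
        print("ACCEPT:", message)
    else:
        print("REJECT (length or encoding):", message)

In practice, readability and tone checks, and coverage of the intended behavior change technique, could be verified in the same loop before a message is added to the data set.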
%M 39405108 %R 10.2196/52974 %U https://ai.jmir.org/2024/1/e52974 %U https://doi.org/10.2196/52974 %U http://www.ncbi.nlm.nih.gov/pubmed/39405108 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e57157 %T Navigating Nephrology's Decline Through a GPT-4 Analysis of Internal Medicine Specialties in the United States: Qualitative Study %A Miao,Jing %A Thongprayoon,Charat %A Garcia Valencia,Oscar %A Craici,Iasmina M %A Cheungpasitporn,Wisit %K artificial intelligence %K ChatGPT %K nephrology fellowship training %K fellowship matching %K medical education %K AI %K nephrology %K fellowship %K United States %K factor %K chatbots %K intellectual %K complexity %K work-life balance %K procedural involvement %K opportunity %K career demand %K financial compensation %D 2024 %7 10.10.2024 %9 %J JMIR Med Educ %G English %X Background: The 2024 Nephrology fellowship match data show the declining interest in nephrology in the United States, with an 11% drop in candidates and a mere 66% (321/488) of positions filled. Objective: The study aims to discern the factors influencing this trend using ChatGPT, a leading chatbot model, for insights into the comparative appeal of nephrology versus other internal medicine specialties. Methods: Using the GPT-4 model, the study compared nephrology with 13 other internal medicine specialties, evaluating each on 7 criteria including intellectual complexity, work-life balance, procedural involvement, research opportunities, patient relationships, career demand, and financial compensation. Each criterion was assigned scores from 1 to 10, with the cumulative score determining the ranking. The approach included counteracting potential bias by instructing GPT-4 to favor other specialties over nephrology in reverse scenarios. Results: GPT-4 ranked nephrology only above sleep medicine. While nephrology scored higher than hospice and palliative medicine, it fell short in key criteria such as work-life balance, patient relationships, and career demand. When examining the percentage of filled positions in the 2024 appointment year match, nephrology’s filled rate was 66%, only higher than the 45% (155/348) filled rate of geriatric medicine. Nephrology’s score decreased by 4%‐14% in 5 criteria including intellectual challenge and complexity, procedural involvement, career opportunity and demand, research and academic opportunities, and financial compensation. Conclusions: ChatGPT does not favor nephrology over most internal medicine specialties, highlighting its diminishing appeal as a career choice. This trend raises significant concerns, especially considering the overall physician shortage, and prompts a reevaluation of factors affecting specialty choice among medical residents. %R 10.2196/57157 %U https://mededu.jmir.org/2024/1/e57157 %U https://doi.org/10.2196/57157 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e58195 %T A Novel Cognitive Behavioral Therapy–Based Generative AI Tool (Socrates 2.0) to Facilitate Socratic Dialogue: Protocol for a Mixed Methods Feasibility Study %A Held,Philip %A Pridgen,Sarah A %A Chen,Yaozhong %A Akhtar,Zuhaib %A Amin,Darpan %A Pohorence,Sean %+ Department of Psychiatry and Behavioral Sciences, Rush University Medical Center, 1645 W. 
Jackson Blvd, Suite 602, Chicago, IL, 60612, United States, 1 3129421423, philip_held@rush.edu %K generative artificial intelligence %K mental health %K feasibility %K cognitive restructuring %K Socratic dialogue %K mobile phone %D 2024 %7 10.10.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Digital mental health tools, designed to augment traditional mental health treatments, are becoming increasingly important due to a wide range of barriers to accessing mental health care, including a growing shortage of clinicians. Most existing tools use rule-based algorithms, often leading to interactions that feel unnatural compared with human therapists. Large language models (LLMs) offer a solution for the development of more natural, engaging digital tools. In this paper, we detail the development of Socrates 2.0, which was designed to engage users in Socratic dialogue surrounding unrealistic or unhelpful beliefs, a core technique in cognitive behavioral therapies. The multiagent LLM-based tool features an artificial intelligence (AI) therapist, Socrates, which receives automated feedback from an AI supervisor and an AI rater. The combination of multiple agents appeared to help address common LLM issues such as looping, and it improved the overall dialogue experience. Initial user feedback from individuals with lived experiences of mental health problems as well as cognitive behavioral therapists has been positive. Moreover, tests in approximately 500 scenarios showed that Socrates 2.0 engaged in harmful responses in under 1% of cases, with the AI supervisor promptly correcting the dialogue each time. However, formal feasibility studies with potential end users are needed. Objective: This mixed methods study examines the feasibility of Socrates 2.0. Methods: On the basis of the initial data, we devised a formal feasibility study of Socrates 2.0 to gather qualitative and quantitative data about users’ and clinicians’ experience of interacting with the tool. Using a mixed method approach, the goal is to gather feasibility and acceptability data from 100 users and 50 clinicians to inform the eventual implementation of generative AI tools, such as Socrates 2.0, in mental health treatment. We designed this study to better understand how users and clinicians interact with the tool, including the frequency, length, and time of interactions, users’ satisfaction with the tool overall, quality of each dialogue and individual responses, as well as ways in which the tool should be improved before it is used in efficacy trials. Descriptive and inferential analyses will be performed on data from validated usability measures. Thematic analysis will be performed on the qualitative data. Results: Recruitment will begin in February 2024 and is expected to conclude by February 2025. As of September 25, 2024, overall, 55 participants have been recruited. Conclusions: The development of Socrates 2.0 and the outlined feasibility study are important first steps in applying generative AI to mental health treatment delivery and lay the foundation for formal feasibility studies. 
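The multiagent design described for Socrates 2.0, a therapist agent whose draft replies receive automated feedback from a supervisor agent, can be illustrated as a short review loop. This is a generic sketch of the pattern rather than the authors' implementation; the system prompts and model name are assumptions:

    # Generic sketch of a "therapist + supervisor" multiagent loop:
    # a draft reply is generated, reviewed, and revised if flagged.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # assumed model

    def ask(system: str, user: str) -> str:
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user}],
        )
        return resp.choices[0].message.content.strip()

    user_belief = "If I fail this exam, it proves I am worthless."

    draft = ask(
        "You are a CBT-informed assistant using Socratic questioning. "
        "Ask one gentle, open question that examines the user's belief.",
        user_belief,
    )
    review = ask(
        "You are a clinical supervisor. Reply APPROVE if the question below is "
        "safe, non-judgmental, and Socratic; otherwise give one-line feedback.",
        draft,
    )
    if not review.startswith("APPROVE"):
        draft = ask(
            "Revise your question according to this supervisor feedback: " + review,
            user_belief,
        )
    print(draft)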
International Registered Report Identifier (IRRID): DERR1-10.2196/58195 %M 39388255 %R 10.2196/58195 %U https://www.researchprotocols.org/2024/1/e58195 %U https://doi.org/10.2196/58195 %U http://www.ncbi.nlm.nih.gov/pubmed/39388255 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56128 %T Assessment of ChatGPT-4 in Family Medicine Board Examinations Using Advanced AI Learning and Analytical Methods: Observational Study %A Goodings,Anthony James %A Kajitani,Sten %A Chhor,Allison %A Albakri,Ahmad %A Pastrak,Mila %A Kodancha,Megha %A Ives,Rowan %A Lee,Yoo Bin %A Kajitani,Kari %K ChatGPT-4 %K Family Medicine Board Examination %K artificial intelligence in medical education %K AI performance assessment %K prompt engineering %K ChatGPT %K artificial intelligence %K AI %K medical education %K assessment %K observational %K analytical method %K data analysis %K examination %D 2024 %7 8.10.2024 %9 %J JMIR Med Educ %G English %X Background: This research explores the capabilities of ChatGPT-4 in passing the American Board of Family Medicine (ABFM) Certification Examination. Addressing a gap in existing literature, where earlier artificial intelligence (AI) models showed limitations in medical board examinations, this study evaluates the enhanced features and potential of ChatGPT-4, especially in document analysis and information synthesis. Objective: The primary goal is to assess whether ChatGPT-4, when provided with extensive preparation resources and when using sophisticated data analysis, can achieve a score equal to or above the passing threshold for the Family Medicine Board Examinations. Methods: In this study, ChatGPT-4 was embedded in a specialized subenvironment, “AI Family Medicine Board Exam Taker,” designed to closely mimic the conditions of the ABFM Certification Examination. This subenvironment enabled the AI to access and analyze a range of relevant study materials, including a primary medical textbook and supplementary web-based resources. The AI was presented with a series of ABFM-type examination questions, reflecting the breadth and complexity typical of the examination. Emphasis was placed on assessing the AI’s ability to interpret and respond to these questions accurately, leveraging its advanced data processing and analysis capabilities within this controlled subenvironment. Results: In our study, ChatGPT-4’s performance was quantitatively assessed on 300 practice ABFM examination questions. The AI achieved a correct response rate of 88.67% (95% CI 85.08%-92.25%) for the Custom Robot version and 87.33% (95% CI 83.57%-91.10%) for the Regular version. Statistical analysis, including the McNemar test (P=.45), indicated no significant difference in accuracy between the 2 versions. In addition, the chi-square test for error-type distribution (P=.32) revealed no significant variation in the pattern of errors across versions. These results highlight ChatGPT-4’s capacity for high-level performance and consistency in responding to complex medical examination questions under controlled conditions. Conclusions: The study demonstrates that ChatGPT-4, particularly when equipped with specialized preparation and when operating in a tailored subenvironment, shows promising potential in handling the intricacies of medical board examinations. While its performance is comparable with the expected standards for passing the ABFM Certification Examination, further enhancements in AI technology and tailored training methods could push these capabilities to new heights. 
This exploration opens avenues for integrating AI tools such as ChatGPT-4 in medical education and assessment, emphasizing the importance of continuous advancement and specialized training in medical applications of AI. %R 10.2196/56128 %U https://mededu.jmir.org/2024/1/e56128 %U https://doi.org/10.2196/56128 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e62924 %T Use of SNOMED CT in Large Language Models: Scoping Review %A Chang,Eunsuk %A Sung,Sumi %+ Department of Nursing Science, Research Institute of Nursing Science, Chungbuk National University, 1 Chungdae-ro, Seowon-gu, Cheongju, 28644, Republic of Korea, 82 43 249 1731, sumisung@cbnu.ac.kr %K SNOMED CT %K ontology %K knowledge graph %K large language models %K natural language processing %K language models %D 2024 %7 7.10.2024 %9 Review %J JMIR Med Inform %G English %X Background: Large language models (LLMs) have substantially advanced natural language processing (NLP) capabilities but often struggle with knowledge-driven tasks in specialized domains such as biomedicine. Integrating biomedical knowledge sources such as SNOMED CT into LLMs may enhance their performance on biomedical tasks. However, the methodologies and effectiveness of incorporating SNOMED CT into LLMs have not been systematically reviewed. Objective: This scoping review aims to examine how SNOMED CT is integrated into LLMs, focusing on (1) the types and components of LLMs being integrated with SNOMED CT, (2) which contents of SNOMED CT are being integrated, and (3) whether this integration improves LLM performance on NLP tasks. Methods: Following the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) guidelines, we searched ACM Digital Library, ACL Anthology, IEEE Xplore, PubMed, and Embase for relevant studies published from 2018 to 2023. Studies were included if they incorporated SNOMED CT into LLM pipelines for natural language understanding or generation tasks. Data on LLM types, SNOMED CT integration methods, end tasks, and performance metrics were extracted and synthesized. Results: The review included 37 studies. Bidirectional Encoder Representations from Transformers and its biomedical variants were the most commonly used LLMs. Three main approaches for integrating SNOMED CT were identified: (1) incorporating SNOMED CT into LLM inputs (28/37, 76%), primarily using concept descriptions to expand training corpora; (2) integrating SNOMED CT into additional fusion modules (5/37, 14%); and (3) using SNOMED CT as an external knowledge retriever during inference (5/37, 14%). The most frequent end task was medical concept normalization (15/37, 41%), followed by entity extraction or typing and classification. While most studies (17/19, 89%) reported performance improvements after SNOMED CT integration, only a small fraction (19/37, 51%) provided direct comparisons. The reported gains varied widely across different metrics and tasks, ranging from 0.87% to 131.66%. However, some studies showed either no improvement or a decline in certain performance metrics. Conclusions: This review demonstrates diverse approaches for integrating SNOMED CT into LLMs, with a focus on using concept descriptions to enhance biomedical language understanding and generation. While the results suggest potential benefits of SNOMED CT integration, the lack of standardized evaluation methods and comprehensive performance reporting hinders definitive conclusions about its effectiveness. 
Future research should prioritize consistent reporting of performance comparisons and explore more sophisticated methods for incorporating SNOMED CT’s relational structure into LLMs. In addition, the biomedical NLP community should develop standardized evaluation frameworks to better assess the impact of ontology integration on LLM performance. %M 39374057 %R 10.2196/62924 %U https://medinform.jmir.org/2024/1/e62924 %U https://doi.org/10.2196/62924 %U http://www.ncbi.nlm.nih.gov/pubmed/39374057 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e57673 %T The Utility and Implications of Ambient Scribes in Primary Care %A Seth,Puneet %A Carretas,Romina %A Rudzicz,Frank %+ Department of Family Medicine, McMaster University, 100 Main Street West, Hamilton, ON, L8P 1H6, Canada, 1 416 671 5114, sethp1@mcmaster.ca %K artificial intelligence %K AI %K large language model %K LLM %K digital scribe %K ambient scribe %K organizational efficiency %K electronic health record %K documentation burden %K administrative burden %D 2024 %7 4.10.2024 %9 Viewpoint %J JMIR AI %G English %X Ambient scribe technology, utilizing large language models, represents an opportunity for addressing several current pain points in the delivery of primary care. We explore the evolution of ambient scribes and their current use in primary care. We discuss the suitability of primary care for ambient scribe integration, considering the varied nature of patient presentations and the emphasis on comprehensive care. We also propose the stages of maturation in the use of ambient scribes in primary care and their impact on care delivery. Finally, we call for focused research on safety, bias, patient impact, and privacy in ambient scribe technology, emphasizing the need for early training and education of health care providers in artificial intelligence and digital health tools. %M 39365655 %R 10.2196/57673 %U https://ai.jmir.org/2024/1/e57673 %U https://doi.org/10.2196/57673 %U http://www.ncbi.nlm.nih.gov/pubmed/39365655 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e51635 %T Engine of Innovation in Hospital Pharmacy: Applications and Reflections of ChatGPT %A Li,Xingang %A Guo,Heng %A Li,Dandan %A Zheng,Yingming %+ Department of Pharmacy, Beijing Friendship Hospital, Capital Medical University, No 95, Yongan Road, Xicheng District, Beijing, 100050, China, 86 1081608511, lxg198320022003@163.com %K ChatGPT %K hospital pharmacy %K natural language processing %K drug information %K drug therapy %K drug interaction %K scientific research %K innovation %K pharmacy %K quality %K safety %K pharmaceutical care %K tool %K medical care quality %D 2024 %7 4.10.2024 %9 Viewpoint %J J Med Internet Res %G English %X Hospital pharmacy plays an important role in ensuring medical care quality and safety, especially in the area of drug information retrieval, therapy guidance, and drug-drug interaction management. ChatGPT is a powerful artificial intelligence language model that can generate natural-language texts. Here, we explored the applications and reflections of ChatGPT in hospital pharmacy, where it may enhance the quality and efficiency of pharmaceutical care. We also explored ChatGPT’s prospects in hospital pharmacy and discussed its working principle, diverse applications, and practical cases in daily operations and scientific research. Meanwhile, the challenges and limitations of ChatGPT, such as data privacy, ethical issues, bias and discrimination, and human oversight, are discussed. 
ChatGPT is a promising tool for hospital pharmacy, but it requires careful evaluation and validation before it can be integrated into clinical practice. Some suggestions for future research and development of ChatGPT in hospital pharmacy are provided. %M 39365643 %R 10.2196/51635 %U https://www.jmir.org/2024/1/e51635 %U https://doi.org/10.2196/51635 %U http://www.ncbi.nlm.nih.gov/pubmed/39365643 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52746 %T Performance of ChatGPT on Nursing Licensure Examinations in the United States and China: Cross-Sectional Study %A Wu,Zelin %A Gan,Wenyi %A Xue,Zhaowen %A Ni,Zhengxin %A Zheng,Xiaofei %A Zhang,Yiyi %K artificial intelligence %K ChatGPT %K nursing licensure examination %K nursing %K LLMs %K large language models %K nursing education %K AI %K nursing student %K large language model %K licensing %K observation %K observational study %K China %K USA %K United States of America %K auxiliary tool %K accuracy rate %K theoretical %D 2024 %7 3.10.2024 %9 %J JMIR Med Educ %G English %X Background: The creation of large language models (LLMs) such as ChatGPT is an important step in the development of artificial intelligence, which shows great potential in medical education due to its powerful language understanding and generative capabilities. The purpose of this study was to quantitatively evaluate and comprehensively analyze ChatGPT’s performance in handling questions for the National Nursing Licensure Examination (NNLE) in China and the United States, including the National Council Licensure Examination for Registered Nurses (NCLEX-RN) and the NNLE. Objective: This study aims to examine how well LLMs respond to the NCLEX-RN and the NNLE multiple-choice questions (MCQs) in various language inputs. To evaluate whether LLMs can be used as multilingual learning assistance for nursing, and to assess whether they possess a repository of professional knowledge applicable to clinical nursing practice. Methods: First, we compiled 150 NCLEX-RN Practical MCQs, 240 NNLE Theoretical MCQs, and 240 NNLE Practical MCQs. Then, the translation function of ChatGPT 3.5 was used to translate NCLEX-RN questions from English to Chinese and NNLE questions from Chinese to English. Finally, the original version and the translated version of the MCQs were inputted into ChatGPT 4.0, ChatGPT 3.5, and Google Bard. Different LLMs were compared according to the accuracy rate, and the differences between different language inputs were compared. Results: The accuracy rates of ChatGPT 4.0 for NCLEX-RN practical questions and Chinese-translated NCLEX-RN practical questions were 88.7% (133/150) and 79.3% (119/150), respectively. Despite the statistical significance of the difference (P=.03), the correct rate was generally satisfactory. Around 71.9% (169/235) of NNLE Theoretical MCQs and 69.1% (161/233) of NNLE Practical MCQs were correctly answered by ChatGPT 4.0. The accuracy of ChatGPT 4.0 in processing NNLE Theoretical MCQs and NNLE Practical MCQs translated into English was 71.5% (168/235; P=.92) and 67.8% (158/233; P=.77), respectively, and there was no statistically significant difference between the results of text input in different languages. ChatGPT 3.5 (NCLEX-RN P=.003, NNLE Theoretical P<.001, NNLE Practical P=.12) and Google Bard (NCLEX-RN P<.001, NNLE Theoretical P<.001, NNLE Practical P<.001) had lower accuracy rates for nursing-related MCQs than ChatGPT 4.0 in English input. 
English accuracy was higher when compared with ChatGPT 3.5’s Chinese input, and the difference was statistically significant (NCLEX-RN P=.02, NNLE Practical P=.02). Whether submitted in Chinese or English, the MCQs from the NCLEX-RN and NNLE demonstrated that ChatGPT 4.0 had the highest number of unique correct responses and the lowest number of unique incorrect responses among the 3 LLMs. Conclusions: This study, focusing on 618 nursing MCQs including NCLEX-RN and NNLE exams, found that ChatGPT 4.0 outperformed ChatGPT 3.5 and Google Bard in accuracy. It excelled in processing English and Chinese inputs, underscoring its potential as a valuable tool in nursing education and clinical decision-making. %R 10.2196/52746 %U https://mededu.jmir.org/2024/1/e52746 %U https://doi.org/10.2196/52746 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60601 %T Ascle—A Python Natural Language Processing Toolkit for Medical Text Generation: Development and Evaluation Study %A Yang,Rui %A Zeng,Qingcheng %A You,Keen %A Qiao,Yujie %A Huang,Lucas %A Hsieh,Chia-Chun %A Rosand,Benjamin %A Goldwasser,Jeremy %A Dave,Amisha %A Keenan,Tiarnan %A Ke,Yuhe %A Hong,Chuan %A Liu,Nan %A Chew,Emily %A Radev,Dragomir %A Lu,Zhiyong %A Xu,Hua %A Chen,Qingyu %A Li,Irene %+ Information Technology Center, University of Tokyo, 6-2-3 Kashiwanoha, Kashiwa, 277-8582, Japan, 81 09014707813, ireneli@ds.itc.u-tokyo.ac.jp %K natural language processing %K machine learning %K deep learning %K generative artificial intelligence %K large language models %K retrieval-augmented generation %K healthcare %D 2024 %7 3.10.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Medical texts present significant domain-specific challenges, and manually curating these texts is a time-consuming and labor-intensive process. To address this, natural language processing (NLP) algorithms have been developed to automate text processing. In the biomedical field, various toolkits for text processing exist, which have greatly improved the efficiency of handling unstructured text. However, these existing toolkits tend to emphasize different perspectives, and none of them offer generation capabilities, leaving a significant gap in the current offerings. Objective: This study aims to describe the development and preliminary evaluation of Ascle. Ascle is tailored for biomedical researchers and clinical staff with an easy-to-use, all-in-one solution that requires minimal programming expertise. For the first time, Ascle provides 4 advanced and challenging generative functions: question-answering, text summarization, text simplification, and machine translation. In addition, Ascle integrates 12 essential NLP functions, along with query and search capabilities for clinical databases. Methods: We fine-tuned 32 domain-specific language models and evaluated them thoroughly on 27 established benchmarks. In addition, for the question-answering task, we developed a retrieval-augmented generation (RAG) framework for large language models that incorporated a medical knowledge graph with ranking techniques to enhance the reliability of generated answers. Additionally, we conducted a physician validation to assess the quality of generated content beyond automated metrics. Results: The fine-tuned models and RAG framework consistently enhanced text generation tasks. For example, the fine-tuned models improved the machine translation task by 20.27 in terms of BLEU score. 
In the question-answering task, the RAG framework raised the ROUGE-L score by 18% over the vanilla models. Physician validation of generated answers showed high scores for readability (4.95/5) and relevancy (4.43/5), with lower scores for accuracy (3.90/5) and completeness (3.31/5). Conclusions: This study introduces the development and evaluation of Ascle, a user-friendly NLP toolkit designed for medical text generation. All code is publicly available through the Ascle GitHub repository. All fine-tuned language models can be accessed through Hugging Face. %M 39361955 %R 10.2196/60601 %U https://www.jmir.org/2024/1/e60601 %U https://doi.org/10.2196/60601 %U http://www.ncbi.nlm.nih.gov/pubmed/39361955 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e63010 %T Comparative Study to Evaluate the Accuracy of Differential Diagnosis Lists Generated by Gemini Advanced, Gemini, and Bard for a Case Report Series Analysis: Cross-Sectional Study %A Hirosawa,Takanobu %A Harada,Yukinori %A Tokumasu,Kazuki %A Ito,Takahiro %A Suzuki,Tomoharu %A Shimizu,Taro %+ Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Shimotsuga, 321-0293, Japan, 81 282861111, hirosawa@dokkyomed.ac.jp %K artificial intelligence %K clinical decision support %K diagnostic excellence %K generative artificial intelligence %K large language models %K natural language processing %D 2024 %7 2.10.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Generative artificial intelligence (GAI) systems by Google have recently been updated from Bard to Gemini and Gemini Advanced as of December 2023. Gemini is a basic, free-to-use model after a user’s login, while Gemini Advanced operates on a more advanced model requiring a fee-based subscription. These systems have the potential to enhance medical diagnostics. However, the impact of these updates on comprehensive diagnostic accuracy remains unknown. Objective: This study aimed to compare the accuracy of the differential diagnosis lists generated by Gemini Advanced, Gemini, and Bard across comprehensive medical fields using case report series. Methods: We identified a case report series with relevant final diagnoses published in the American Journal of Case Reports from January 2022 to March 2023. After excluding nondiagnostic cases and patients aged 10 years and younger, we included the remaining case reports. After refining the case parts as case descriptions, we input the same case descriptions into Gemini Advanced, Gemini, and Bard to generate the top 10 differential diagnosis lists. In total, 2 expert physicians independently evaluated whether the final diagnosis was included in the lists and its ranking. Any discrepancies were resolved by another expert physician. Bonferroni correction was applied to adjust the P values for the number of comparisons among 3 GAI systems, setting the corrected significance level at P value <.02. Results: In total, 392 case reports were included. The inclusion rates of the final diagnosis within the top 10 differential diagnosis lists were 73% (286/392) for Gemini Advanced, 76.5% (300/392) for Gemini, and 68.6% (269/392) for Bard. The top diagnoses matched the final diagnoses in 31.6% (124/392) for Gemini Advanced, 42.6% (167/392) for Gemini, and 31.4% (123/392) for Bard. Gemini demonstrated higher diagnostic accuracy than Bard both within the top 10 differential diagnosis lists (P=.02) and as the top diagnosis (P=.001). 
In addition, Gemini Advanced achieved significantly lower accuracy than Gemini in identifying the most probable diagnosis (P=.002). Conclusions: The results of this study suggest that Gemini outperformed Bard in diagnostic accuracy following the model update. However, Gemini Advanced requires further refinement to optimize its performance for future artificial intelligence–enhanced diagnostics. These findings should be interpreted cautiously and considered primarily for research purposes, as these GAI systems have not been adjusted for medical diagnostics nor approved for clinical use. %M 39357052 %R 10.2196/63010 %U https://medinform.jmir.org/2024/1/e63010 %U https://doi.org/10.2196/63010 %U http://www.ncbi.nlm.nih.gov/pubmed/39357052 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e51383 %T Optimizing ChatGPT’s Interpretation and Reporting of Delirium Assessment Outcomes: Exploratory Study %A Choi,Yong K %A Lin,Shih-Yin %A Fick,Donna Marie %A Shulman,Richard W %A Lee,Sangil %A Shrestha,Priyanka %A Santoso,Kate %+ Department of Health Information Management, School of Health and Rehabilitation Sciences, University of Pittsburgh, 6051B Forbes Tower, Pittsburgh, PA, 15260, United States, 1 412 624 6442, yong.choi@pitt.edu %K generative artificial intelligence %K generative AI %K large language models %K ChatGPT %K delirium detection %K Sour Seven Questionnaire %K prompt engineering %K clinical vignettes %K medical education %K caregiver education %D 2024 %7 1.10.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Generative artificial intelligence (AI) and large language models, such as OpenAI’s ChatGPT, have shown promising potential in supporting medical education and clinical decision-making, given their vast knowledge base and natural language processing capabilities. As a general-purpose AI system, ChatGPT can complete a wide range of tasks, including differential diagnosis without additional training. However, the specific application of ChatGPT in learning and applying a series of specialized, context-specific tasks mimicking the workflow of a human assessor, such as administering a standardized assessment questionnaire, followed by inputting assessment results in a standardized form, and interpreting assessment results strictly following credible, published scoring criteria, has not been thoroughly studied. Objective: This exploratory study aims to evaluate and optimize ChatGPT’s capabilities in administering and interpreting the Sour Seven Questionnaire, an informant-based delirium assessment tool. Specifically, the objectives were to train ChatGPT-3.5 and ChatGPT-4 to understand and correctly apply the Sour Seven Questionnaire to clinical vignettes using prompt engineering, assess the performance of these AI models in identifying and scoring delirium symptoms against scores from human experts, and refine and enhance the models’ interpretation and reporting accuracy through iterative prompt optimization. Methods: We used prompt engineering to train ChatGPT-3.5 and ChatGPT-4 models on the Sour Seven Questionnaire, a tool for assessing delirium through caregiver input. Prompt engineering is a methodology used to enhance the AI’s processing of inputs by meticulously structuring the prompts to improve accuracy and consistency in outputs. In this study, prompt engineering involved creating specific, structured commands that guided the AI models in understanding and applying the assessment tool’s criteria accurately to clinical vignettes. 
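The kind of structured command the study describes, with fixed response options and a required output format, can be expressed as a reusable prompt template. A minimal sketch with placeholder items and scoring; it does not reproduce the actual Sour Seven Questionnaire content:

    # Illustrative prompt template enforcing binary answers and tabular output
    # for an informant-based symptom checklist (placeholder items, not the
    # actual Sour Seven wording or scoring).
    ITEMS = [
        "Was the person more confused than usual?",
        "Did the person have trouble staying awake during the day?",
    ]

    def build_prompt(vignette: str) -> str:
        numbered = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(ITEMS))
        return (
            "Read the caregiver vignette, then answer each item with exactly "
            "'Yes' or 'No' (no other wording).\n"
            "Return a table with columns: Item | Answer | Points.\n"
            "Score 1 point for each 'Yes', 0 for each 'No', and finish with "
            "'Total: <sum>'.\n\n"
            f"Vignette:\n{vignette}\n\nItems:\n{numbered}"
        )

    print(build_prompt("The caregiver reports new confusion since yesterday ..."))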
This approach also included designing prompts to explicitly instruct the AI on how to format its responses, ensuring they were consistent with clinical documentation standards. Results: Both ChatGPT models demonstrated promising proficiency in applying the Sour Seven Questionnaire to the vignettes, despite initial inconsistencies and errors. Performance notably improved through iterative prompt engineering, enhancing the models’ capacity to detect delirium symptoms and assign scores. Prompt optimizations included adjusting the scoring methodology to accept only definitive “Yes” or “No” responses, revising the evaluation prompt to mandate responses in a tabular format, and guiding the models to adhere to the 2 recommended actions specified in the Sour Seven Questionnaire. Conclusions: Our findings provide preliminary evidence supporting the potential utility of AI models such as ChatGPT in administering standardized clinical assessment tools. The results highlight the significance of context-specific training and prompt engineering in harnessing the full potential of these AI models for health care applications. Despite the encouraging results, broader generalizability and further validation in real-world settings warrant additional research. %M 39353189 %R 10.2196/51383 %U https://formative.jmir.org/2024/1/e51383 %U https://doi.org/10.2196/51383 %U http://www.ncbi.nlm.nih.gov/pubmed/39353189 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58831 %T “Doctor ChatGPT, Can You Help Me?” The Patient’s Perspective: Cross-Sectional Study %A Armbruster,Jonas %A Bussmann,Florian %A Rothhaas,Catharina %A Titze,Nadine %A Grützner,Paul Alfred %A Freischmidt,Holger %+ Department of Trauma and Orthopedic Surgery, BG Klinik Ludwigshafen, Ludwig-Guttmann-Strasse 13, Ludwigshafen am Rhein, 67071, Germany, 49 6216810, Holger.Freischmidt@bgu-ludwigshafen.de %K artificial intelligence %K AI %K large language models %K LLM %K ChatGPT %K patient education %K patient information %K patient perceptions %K chatbot %K chatbots %K empathy %D 2024 %7 1.10.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence and the language models derived from it, such as ChatGPT, offer immense possibilities, particularly in the field of medicine. It is already evident that ChatGPT can provide adequate and, in some cases, expert-level responses to health-related queries and advice for patients. However, it is currently unknown how patients perceive these capabilities, whether they can derive benefit from them, and whether potential risks, such as harmful suggestions, are detected by patients. Objective: This study aims to clarify whether patients can get useful and safe health care advice from an artificial intelligence chatbot assistant. Methods: This cross-sectional study was conducted using 100 publicly available health-related questions from 5 medical specialties (trauma, general surgery, otolaryngology, pediatrics, and internal medicine) from a web-based platform for patients. Responses generated by ChatGPT-4.0 and by an expert panel (EP) of experienced physicians from the aforementioned web-based platform were packed into 10 sets consisting of 10 questions each. The blinded evaluation was carried out by patients regarding empathy and usefulness (assessed through the question: “Would this answer have helped you?”) on a scale from 1 to 5. 
As a control, evaluation was also performed by 3 physicians in each respective medical specialty, who were additionally asked about the potential harm of the response and its correctness. Results: In total, 200 sets of questions were submitted by 64 patients (mean 45.7, SD 15.9 years; 29/64, 45.3% male), resulting in 2000 evaluated answers of ChatGPT and the EP each. ChatGPT scored higher in terms of empathy (4.18 vs 2.7; P<.001) and usefulness (4.04 vs 2.98; P<.001). Subanalysis revealed a small bias in terms of levels of empathy given by women in comparison with men (4.46 vs 4.14; P=.049). Ratings of ChatGPT were high regardless of the participant’s age. The same highly significant results were observed in the evaluation of the respective specialist physicians. ChatGPT outperformed significantly in correctness (4.51 vs 3.55; P<.001). Specialists rated the usefulness (3.93 vs 4.59) and correctness (4.62 vs 3.84) significantly lower in potentially harmful responses from ChatGPT (P<.001). This was not the case among patients. Conclusions: The results indicate that ChatGPT is capable of supporting patients in health-related queries better than physicians, at least in terms of written advice through a web-based platform. In this study, ChatGPT’s responses had a lower percentage of potentially harmful advice than the web-based EP. However, it is crucial to note that this finding is based on a specific study design and may not generalize to all health care settings. Alarmingly, patients are not able to independently recognize these potential dangers. %M 39352738 %R 10.2196/58831 %U https://www.jmir.org/2024/1/e58831 %U https://doi.org/10.2196/58831 %U http://www.ncbi.nlm.nih.gov/pubmed/39352738 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e64143 %T Practical Aspects of Using Large Language Models to Screen Abstracts for Cardiovascular Drug Development: Cross-Sectional Study %A Ronquillo,Jay G %A Ye,Jamie %A Gorman,Donal %A Lemeshow,Adina R %A Watt,Stephen J %K biomedical informatics %K drug development %K cardiology %K cardio %K LLM %K biomedical %K drug %K cross-sectional study %K biomarker %K cardiovascular %K screening optimization %K GPT %K large language model %K AI %K artificial intelligence %D 2024 %7 30.9.2024 %9 %J JMIR Med Inform %G English %X Cardiovascular drug development requires synthesizing relevant literature about indications, mechanisms, biomarkers, and outcomes. This short study investigates the performance, cost, and prompt engineering trade-offs of 3 large language models accelerating the literature screening process for cardiovascular drug development applications. 
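The screening task and cost trade-offs summarized in the abstract above can be sketched as a loop that requests an include/exclude decision per abstract and estimates prompt size with a tokenizer. The model name, screening criteria, and price figure below are illustrative assumptions only:

    # Illustrative include/exclude screening of abstracts with an LLM,
    # with a rough token-based input-cost estimate per call.
    import tiktoken
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"                  # assumed model
    PRICE_PER_1K_INPUT_TOKENS = 0.00015    # assumed price; check current rates
    enc = tiktoken.get_encoding("cl100k_base")

    criteria = ("Include only studies of drug interventions for heart failure "
                "reporting a cardiovascular biomarker outcome.")

    def screen(abstract: str) -> tuple[str, float]:
        prompt = (f"Screening criteria: {criteria}\n\nAbstract:\n{abstract}\n\n"
                  "Answer with exactly INCLUDE or EXCLUDE.")
        cost = len(enc.encode(prompt)) / 1000 * PRICE_PER_1K_INPUT_TOKENS
        resp = client.chat.completions.create(
            model=MODEL,
            messages=[{"role": "user", "content": prompt}],
            temperature=0,
        )
        return resp.choices[0].message.content.strip(), cost

    decision, est_cost = screen("We evaluated drug X in 200 heart failure patients ...")
    print(decision, f"(estimated input cost ~${est_cost:.5f})")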
%R 10.2196/64143 %U https://medinform.jmir.org/2024/1/e64143 %U https://doi.org/10.2196/64143 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55648 %T Accuracy of a Commercial Large Language Model (ChatGPT) to Perform Disaster Triage of Simulated Patients Using the Simple Triage and Rapid Treatment (START) Protocol: Gage Repeatability and Reproducibility Study %A Franc,Jeffrey Micheal %A Hertelendy,Attila Julius %A Cheng,Lenard %A Hata,Ryan %A Verde,Manuela %+ Department of Emergency Medicine, University of Alberta, 736c University Terrace, 8203-112 Street NW, Edmonton, AB, T6R2Z6, Canada, 1 7807006730, jeffrey.franc@ualberta.ca %K disaster medicine %K large language models %K triage %K disaster %K emergency %K disasters %K emergencies %K LLM %K LLMs %K GPT %K ChatGPT %K language model %K language models %K NLP %K natural language processing %K artificial intelligence %K repeatability %K reproducibility %K accuracy %K accurate %K reproducible %K repeatable %D 2024 %7 30.9.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The release of ChatGPT (OpenAI) in November 2022 drastically reduced the barrier to using artificial intelligence by allowing a simple web-based text interface to a large language model (LLM). One use case where ChatGPT could be useful is in triaging patients at the site of a disaster using the Simple Triage and Rapid Treatment (START) protocol. However, LLMs experience several common errors including hallucinations (also called confabulations) and prompt dependency. Objective: This study addresses the research problem: “Can ChatGPT adequately triage simulated disaster patients using the START protocol?” by measuring three outcomes: repeatability, reproducibility, and accuracy. Methods: Nine prompts were developed by 5 disaster medicine physicians. A Python script queried ChatGPT Version 4 for each prompt combined with 391 validated simulated patient vignettes. Ten repetitions of each combination were performed for a total of 35,190 simulated triages. A reference standard START triage code for each simulated case was assigned by 2 disaster medicine specialists (JMF and MV), with a third specialist (LC) added if the first two did not agree. Results were evaluated using a gage repeatability and reproducibility study (gage R and R). Repeatability was defined as variation due to repeated use of the same prompt. Reproducibility was defined as variation due to the use of different prompts on the same patient vignette. Accuracy was defined as agreement with the reference standard. Results: Although 35,102 (99.7%) queries returned a valid START score, there was considerable variability. Repeatability (use of the same prompt repeatedly) was 14% of the overall variation. Reproducibility (use of different prompts) was 4.1% of the overall variation. The accuracy of ChatGPT for START was 63.9% with a 32.9% overtriage rate and a 3.1% undertriage rate. Accuracy varied by prompt with a maximum of 71.8% and a minimum of 46.7%. Conclusions: This study indicates that ChatGPT version 4 is insufficient to triage simulated disaster patients via the START protocol. It demonstrated suboptimal repeatability and reproducibility. The overall accuracy of triage was only 63.9%. Health care professionals are advised to exercise caution while using commercial LLMs for vital medical determinations, given that these tools may commonly produce inaccurate data, colloquially referred to as hallucinations or confabulations. 
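The repeated-query design in the preceding abstract, in which each prompt and vignette combination is submitted many times, lends itself to a simple per-combination agreement summary once the triage outputs are collected. A minimal sketch with made-up outputs; it illustrates only the agreement computation, not a full gage repeatability and reproducibility analysis:

    # Illustrative repeatability summary: for each (prompt, vignette) pair,
    # how often did repeated runs return the modal START category?
    from collections import Counter

    # Hypothetical triage outputs from 10 repetitions per prompt/vignette pair.
    runs = {
        ("prompt_1", "vignette_017"): ["yellow"] * 8 + ["red"] * 2,
        ("prompt_1", "vignette_018"): ["green"] * 10,
    }

    def modal_agreement(outputs: list[str]) -> float:
        """Share of repetitions matching the most common category."""
        _, count = Counter(outputs).most_common(1)[0]
        return count / len(outputs)

    for key, outputs in runs.items():
        print(key, f"agreement={modal_agreement(outputs):.0%}")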
Artificial intelligence–guided tools should undergo rigorous statistical evaluation—using methods such as gage R and R—before implementation into clinical settings. %M 39348189 %R 10.2196/55648 %U https://www.jmir.org/2024/1/e55648 %U https://doi.org/10.2196/55648 %U http://www.ncbi.nlm.nih.gov/pubmed/39348189 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52346 %T Artificial Intelligence in Dental Education: Opportunities and Challenges of Large Language Models and Multimodal Foundation Models %A Claman,Daniel %A Sezgin,Emre %K artificial intelligence %K large language models %K dental education %K GPT %K ChatGPT %K periodontal health %K AI %K LLM %K LLMs %K chatbot %K natural language %K generative pretrained transformer %K innovation %K technology %K large language model %D 2024 %7 27.9.2024 %9 %J JMIR Med Educ %G English %X Instructional and clinical technologies have been transforming dental education. With the emergence of artificial intelligence (AI), the opportunities of using AI in education has increased. With the recent advancement of generative AI, large language models (LLMs) and foundation models gained attention with their capabilities in natural language understanding and generation as well as combining multiple types of data, such as text, images, and audio. A common example has been ChatGPT, which is based on a powerful LLM—the GPT model. This paper discusses the potential benefits and challenges of incorporating LLMs in dental education, focusing on periodontal charting with a use case to outline capabilities of LLMs. LLMs can provide personalized feedback, generate case scenarios, and create educational content to contribute to the quality of dental education. However, challenges, limitations, and risks exist, including bias and inaccuracy in the content created, privacy and security concerns, and the risk of overreliance. With guidance and oversight, and by effectively and ethically integrating LLMs, dental education can incorporate engaging and personalized learning experiences for students toward readiness for real-life clinical practice. %R 10.2196/52346 %U https://mededu.jmir.org/2024/1/e52346 %U https://doi.org/10.2196/52346 %0 Journal Article %@ 2564-1891 %I JMIR Publications %V 4 %N %P e60678 %T Evaluating the Influence of Role-Playing Prompts on ChatGPT’s Misinformation Detection Accuracy: Quantitative Study %A Haupt,Michael Robert %A Yang,Luning %A Purnat,Tina %A Mackey,Tim %+ Global Health Program, Department of Anthropology, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA, 92093, United States, 1 858 534 4145, tkmackey@ucsd.edu %K large language models %K ChatGPT %K artificial intelligence %K AI %K experiment %K prompt engineering %K role-playing %K social identity %K misinformation detection %K COVID-19 %D 2024 %7 26.9.2024 %9 Original Paper %J JMIR Infodemiology %G English %X Background: During the COVID-19 pandemic, the rapid spread of misinformation on social media created significant public health challenges. Large language models (LLMs), pretrained on extensive textual data, have shown potential in detecting misinformation, but their performance can be influenced by factors such as prompt engineering (ie, modifying LLM requests to assess changes in output). One form of prompt engineering is role-playing, where, upon request, OpenAI’s ChatGPT imitates specific social roles or identities. 
This research examines how ChatGPT’s accuracy in detecting COVID-19–related misinformation is affected when it is assigned social identities in the request prompt. Understanding how LLMs respond to different identity cues can inform messaging campaigns, ensuring effective use in public health communications. Objective: This study investigates the impact of role-playing prompts on ChatGPT’s accuracy in detecting misinformation. This study also assesses differences in performance when misinformation is explicitly stated versus implied, based on contextual knowledge, and examines the reasoning given by ChatGPT for classification decisions. Methods: Overall, 36 real-world tweets about COVID-19 collected in September 2021 were categorized into misinformation, sentiment (opinions aligned vs unaligned with public health guidelines), corrections, and neutral reporting. ChatGPT was tested with prompts incorporating different combinations of multiple social identities (ie, political beliefs, education levels, locality, religiosity, and personality traits), resulting in 51,840 runs. Two control conditions were used to compare results: prompts with no identities and those including only political identity. Results: The findings reveal that including social identities in prompts reduces average detection accuracy, with a notable drop from 68.1% (SD 41.2%; no identities) to 29.3% (SD 31.6%; all identities included). Prompts with only political identity resulted in the lowest accuracy (19.2%, SD 29.2%). ChatGPT was also able to distinguish between sentiments expressing opinions not aligned with public health guidelines from misinformation making declarative statements. There were no consistent differences in performance between explicit and implicit misinformation requiring contextual knowledge. While the findings show that the inclusion of identities decreased detection accuracy, it remains uncertain whether ChatGPT adopts views aligned with social identities: when assigned a conservative identity, ChatGPT identified misinformation with nearly the same accuracy as it did when assigned a liberal identity. While political identity was mentioned most frequently in ChatGPT’s explanations for its classification decisions, the rationales for classifications were inconsistent across study conditions, and contradictory explanations were provided in some instances. Conclusions: These results indicate that ChatGPT’s ability to classify misinformation is negatively impacted when role-playing social identities, highlighting the complexity of integrating human biases and perspectives in LLMs. This points to the need for human oversight in the use of LLMs for misinformation detection. Further research is needed to understand how LLMs weigh social identities in prompt-based tasks and explore their application in different cultural contexts. 
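The role-playing manipulation described in the abstract above amounts to prepending an identity description to an otherwise fixed classification instruction and comparing accuracy across prompt conditions. The sketch below illustrates that pattern only; the persona wording, the example tweet, and the query_model stub are hypothetical and are not the study's materials.

# Illustrative sketch: role-playing prompt conditions for misinformation labeling.
# Personas, labels, and the query_model stub are hypothetical placeholders.
import random

PERSONAS = {
    "no_identity": "",
    "political_only": "Respond as a politically conservative adult. ",
    "all_identities": (
        "Respond as a politically conservative, highly religious adult with a "
        "high school education who lives in a rural area. "
    ),
}

INSTRUCTION = (
    "Label the following tweet as MISINFORMATION or NOT_MISINFORMATION with "
    "respect to COVID-19 public health guidance.\nTweet: {tweet}"
)

def query_model(prompt: str) -> str:
    # Placeholder for a chat-completion call; returns a random label here.
    return random.choice(["MISINFORMATION", "NOT_MISINFORMATION"])

def accuracy_by_condition(tweets: list[tuple[str, str]]) -> dict[str, float]:
    # tweets: (text, gold_label) pairs; returns detection accuracy per persona condition.
    scores = {}
    for name, persona in PERSONAS.items():
        correct = sum(
            query_model(persona + INSTRUCTION.format(tweet=text)) == gold
            for text, gold in tweets
        )
        scores[name] = correct / len(tweets)
    return scores

print(accuracy_by_condition([("Vaccines alter your DNA.", "MISINFORMATION")]))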
%M 39326035 %R 10.2196/60678 %U https://infodemiology.jmir.org/2024/1/e60678 %U https://doi.org/10.2196/60678 %U http://www.ncbi.nlm.nih.gov/pubmed/39326035 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e57362 %T The Most Effective Interventions for Classification Model Development to Predict Chat Outcomes Based on the Conversation Content in Online Suicide Prevention Chats: Machine Learning Approach %A Salmi,Salim %A Mérelle,Saskia %A Gilissen,Renske %A van der Mei,Rob %A Bhulai,Sandjai %+ Research Department, 113 Suicide Prevention, Paasheuvelweg 25, Amsterdam, 1105 BP, Netherlands, 31 640673474, s.salmi@113.nl %K suicide %K suicidality %K suicide prevention %K helpline %K suicide helpline %K classification %K interpretable AI %K explainable AI %K conversations %K BERT %K bidirectional encoder representations from transformers %K machine learning %K artificial intelligence %K large language models %K LLM %K natural language processing %D 2024 %7 26.9.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: For the provision of optimal care in a suicide prevention helpline, it is important to know what contributes to positive or negative effects on help seekers. Helplines can often be contacted through text-based chat services, which produce large amounts of text data for use in large-scale analysis. Objective: We trained a machine learning classification model to predict chat outcomes based on the content of the chat conversations in suicide helplines and identified the counsellor utterances that had the most impact on its outputs. Methods: From August 2021 until January 2023, help seekers (N=6903) scored themselves on factors known to be associated with suicidality (eg, hopelessness, feeling entrapped, will to live) before and after a chat conversation with the suicide prevention helpline in the Netherlands (113 Suicide Prevention). Machine learning text analysis was used to predict help seeker scores on these factors. Using 2 approaches for interpreting machine learning models, we identified text messages from helpers in a chat that contributed the most to the prediction of the model. Results: According to the machine learning model, helpers’ positive affirmations and expressing involvement contributed to improved scores of the help seekers. Use of macros and ending the chat prematurely due to the help seeker being in an unsafe situation had negative effects on help seekers. Conclusions: This study reveals insights for improving helpline chats, emphasizing the value of an evocative style with questions, positive affirmations, and practical advice. It also underscores the potential of machine learning in helpline chat analysis. 
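As a rough illustration of the text-classification-plus-interpretation pipeline the preceding abstract describes, the sketch below uses a simple bag-of-words model from scikit-learn instead of the transformer-based methods the study references, and fabricated toy conversations instead of helpline data. It shows only the general shape of training a classifier on counselor text and inspecting which phrasings push predictions toward improved outcomes.

# Simplified stand-in for the chat-outcome classifier described above: a
# bag-of-words model rather than a transformer, trained on fabricated examples.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

counselor_text = [
    "you are not alone, that sounds incredibly heavy, thank you for telling me",
    "i have to end this chat now because you are not in a safe place",
    "what has helped you get through moments like this before",
    "please hold on while i paste our standard list of resources",
]
improved = np.array([1, 0, 1, 0])  # 1 = help seeker's scores improved after the chat

model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
model.fit(counselor_text, improved)

# Crude interpretation step: terms with the largest positive coefficients are the
# counselor phrasings most associated with improvement in this toy model.
vectorizer = model.named_steps["tfidfvectorizer"]
classifier = model.named_steps["logisticregression"]
top_idx = np.argsort(classifier.coef_[0])[-5:]
print(vectorizer.get_feature_names_out()[top_idx])

The study itself interprets its models with methods suited to transformer architectures rather than linear coefficients, but the overall workflow of predicting outcome scores and then attributing them to individual helper messages is the same in spirit.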
%M 39326039 %R 10.2196/57362 %U https://mental.jmir.org/2024/1/e57362 %U https://doi.org/10.2196/57362 %U http://www.ncbi.nlm.nih.gov/pubmed/39326039 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58741 %T Using Natural Language Processing (GPT-4) for Computed Tomography Image Analysis of Cerebral Hemorrhages in Radiology: Retrospective Analysis %A Zhang,Daiwen %A Ma,Zixuan %A Gong,Ru %A Lian,Liangliang %A Li,Yanzhuo %A He,Zhenghui %A Han,Yuhan %A Hui,Jiyuan %A Huang,Jialin %A Jiang,Jiyao %A Weng,Weiji %A Feng,Junfeng %+ Brain Injury Centre, Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, 160 Pujian Road, Pudong New District, Shanghai, 200127, China, 86 136 1186 0825, fengjfmail@163.com %K GPT-4 %K natural language processing %K NLP %K artificial intelligence %K AI %K cerebral hemorrhage %K computed tomography %K CT %D 2024 %7 26.9.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Cerebral hemorrhage is a critical medical condition that necessitates a rapid and precise diagnosis for timely medical intervention, including emergency operation. Computed tomography (CT) is essential for identifying cerebral hemorrhage, but its effectiveness is limited by the availability of experienced radiologists, especially in resource-constrained regions or when shorthanded during holidays or at night. Despite advancements in artificial intelligence–driven diagnostic tools, most require technical expertise. This poses a challenge for widespread adoption in radiological imaging. The introduction of advanced natural language processing (NLP) models such as GPT-4, which can annotate and analyze images without extensive algorithmic training, offers a potential solution. Objective: This study investigates GPT-4’s capability to identify and annotate cerebral hemorrhages in cranial CT scans. It represents a novel application of NLP models in radiological imaging. Methods: In this retrospective analysis, we collected 208 CT scans with 6 types of cerebral hemorrhages at Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, between January and September 2023. All CT images were mixed together and sequentially numbered, so each CT image had its own corresponding number. A random sequence from 1 to 208 was generated, and all CT images were inputted into GPT-4 for analysis in the order of the random sequence. The outputs were subsequently examined using Photoshop and evaluated by experienced radiologists on a 4-point scale to assess identification completeness, accuracy, and success. Results: The overall identification completeness percentage for the 6 types of cerebral hemorrhages was 72.6% (SD 18.6%). Specifically, GPT-4 achieved higher identification completeness in epidural and intraparenchymal hemorrhages (89.0%, SD 19.1% and 86.9%, SD 17.7%, respectively), yet its identification completeness percentage in chronic subdural hemorrhages was very low (37.3%, SD 37.5%). The misidentification percentages for complex hemorrhages (54.0%, SD 28.0%), epidural hemorrhages (50.2%, SD 22.7%), and subarachnoid hemorrhages (50.5%, SD 29.2%) were relatively high, whereas they were relatively low for acute subdural hemorrhages (32.6%, SD 26.3%), chronic subdural hemorrhages (40.3%, SD 27.2%), and intraparenchymal hemorrhages (26.2%, SD 23.8%). The identification completeness percentages in both massive and minor bleeding showed no significant difference (P=.06). 
However, the misidentification percentage in recognizing massive bleeding was significantly lower than that for minor bleeding (P=.04). The identification completeness percentages and misidentification percentages for cerebral hemorrhages at different locations showed no significant differences (all P>.05). Lastly, radiologists showed relative acceptance regarding identification completeness (3.60, SD 0.54), accuracy (3.30, SD 0.65), and success (3.38, SD 0.64). Conclusions: GPT-4, a standout among NLP models, exhibits both promising capabilities and certain limitations in the realm of radiological imaging, particularly when it comes to identifying cerebral hemorrhages in CT scans. This opens up new directions and insights for the future development of NLP models in radiology. Trial Registration: ClinicalTrials.gov NCT06230419; https://clinicaltrials.gov/study/NCT06230419 %M 39326037 %R 10.2196/58741 %U https://www.jmir.org/2024/1/e58741 %U https://doi.org/10.2196/58741 %U http://www.ncbi.nlm.nih.gov/pubmed/39326037 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e53778 %T Generation of Backward-Looking Complex Reflections for a Motivational Interviewing–Based Smoking Cessation Chatbot Using GPT-4: Algorithm Development and Validation %A Kumar,Ash Tanuj %A Wang,Cindy %A Dong,Alec %A Rose,Jonathan %K motivational interviewing %K smoking cessation %K therapy %K automated therapy %K natural language processing %K large language models %K GPT-4 %K chatbot %K dialogue agent %K reflections %K reflection generation %K smoking %K cessation %K ChatGPT %K smokers %K smoker %K effectiveness %K messages %D 2024 %7 26.9.2024 %9 %J JMIR Ment Health %G English %X Background: Motivational interviewing (MI) is a therapeutic technique that has been successful in helping smokers reduce smoking but has limited accessibility due to the high cost and low availability of clinicians. To address this, the MIBot project has sought to develop a chatbot that emulates an MI session with a client with the specific goal of moving an ambivalent smoker toward the direction of quitting. One key element of an MI conversation is reflective listening, where a therapist expresses their understanding of what the client has said by uttering a reflection that encourages the client to continue their thought process. Complex reflections link the client’s responses to relevant ideas and facts to enhance this contemplation. Backward-looking complex reflections (BLCRs) link the client’s most recent response to a relevant selection of the client’s previous statements. Our current chatbot can generate complex reflections—but not BLCRs—using large language models (LLMs) such as GPT-2, which allows the generation of unique, human-like messages customized to client responses. Recent advancements in these models, such as the introduction of GPT-4, provide a novel way to generate complex text by feeding the models instructions and conversational history directly, making this a promising approach to generate BLCRs. Objective: This study aims to develop a method to generate BLCRs for an MI-based smoking cessation chatbot and to measure the method’s effectiveness. Methods: LLMs such as GPT-4 can be stimulated to produce specific types of responses to their inputs by “asking” them with an English-based description of the desired output. These descriptions are called prompts, and the goal of writing a description that causes an LLM to generate the required output is termed prompt engineering. 
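As a minimal illustration of the prompting approach just described, the sketch below assembles an instruction plus the conversation transcript up to the reflection point into a single prompt. The instruction wording, the example transcript, and the send_to_llm placeholder are hypothetical and are not the instruction the authors evolved.

# Illustrative sketch: building a backward-looking reflection prompt from an
# instruction and conversation history. Wording is hypothetical; send_to_llm is a stub.
INSTRUCTION = (
    "You are a motivational interviewing counselor for smoking cessation. Write one "
    "backward-looking complex reflection that links the client's most recent message "
    "to something they said earlier in this conversation. Reply with the reflection only."
)

def build_prompt(transcript: list[tuple[str, str]]) -> str:
    # transcript: (speaker, utterance) pairs up to the point where a reflection is needed.
    history = "\n".join(f"{speaker}: {utterance}" for speaker, utterance in transcript)
    return f"{INSTRUCTION}\n\nConversation so far:\n{history}\n\nReflection:"

def send_to_llm(prompt: str) -> str:
    # Placeholder for a chat-completion call to GPT-4 or a comparable model.
    return ("It sounds like the work stress you mentioned earlier is a big part of "
            "why quitting feels out of reach today.")

example_transcript = [
    ("counselor", "What has you thinking about quitting smoking?"),
    ("client", "Work has been stressful lately and smoking is my only break."),
    ("client", "Today I mostly feel like it is not worth the money anymore."),
]
print(send_to_llm(build_prompt(example_transcript)))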
We evolved an instruction to prompt GPT-4 to generate a BLCR, given the portions of the transcript of the conversation up to the point where the reflection was needed. The approach was tested on 50 previously collected MIBot transcripts of conversations with smokers and was used to generate a total of 150 reflections. The quality of the reflections was rated on a 4-point scale by 3 independent raters to determine whether they met specific criteria for acceptability. Results: Of the 150 generated reflections, 132 (88%) met the level of acceptability. The remaining 18 (12%) had one or more flaws that made them inappropriate as BLCRs. The 3 raters had pairwise agreement on 80% to 88% of these scores. Conclusions: The method presented to generate BLCRs is good enough to be used as one source of reflections in an MI-style conversation but would need an automatic checker to eliminate the unacceptable ones. This work illustrates the power of the new LLMs to generate therapeutic client-specific responses under the command of a language-based specification. %R 10.2196/53778 %U https://mental.jmir.org/2024/1/e53778 %U https://doi.org/10.2196/53778 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e62679 %T Empathy Toward Artificial Intelligence Versus Human Experiences and the Role of Transparency in Mental Health and Social Support Chatbot Design: Comparative Study %A Shen,Jocelyn %A DiPaola,Daniella %A Ali,Safinah %A Sap,Maarten %A Park,Hae Won %A Breazeal,Cynthia %+ MIT Media Lab, 75 Amherst Street, Cambridge, MA, 02139, United States, 1 3109802254, joceshen@mit.edu %K empathy %K large language models %K ethics %K transparency %K crowdsourcing %K human-computer interaction %D 2024 %7 25.9.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Empathy is a driving force in our connection to others, our mental well-being, and resilience to challenges. With the rise of generative artificial intelligence (AI) systems, mental health chatbots, and AI social support companions, it is important to understand how empathy unfolds toward stories from human versus AI narrators and how transparency plays a role in user emotions. Objective: We aim to understand how empathy shifts across human-written versus AI-written stories, and how these findings inform ethical implications and human-centered design of using mental health chatbots as objects of empathy. Methods: We conducted crowd-sourced studies with 985 participants who each wrote a personal story and then rated empathy toward 2 retrieved stories, where one was written by a language model, and another was written by a human. Our studies varied disclosing whether a story was written by a human or an AI system to see how transparent author information affects empathy toward the narrator. We conducted mixed methods analyses: through statistical tests, we compared user’s self-reported state empathy toward the stories across different conditions. In addition, we qualitatively coded open-ended feedback about reactions to the stories to understand how and why transparency affects empathy toward human versus AI storytellers. Results: We found that participants significantly empathized with human-written over AI-written stories in almost all conditions, regardless of whether they are aware (t196=7.07, P<.001, Cohen d=0.60) or not aware (t298=3.46, P<.001, Cohen d=0.24) that an AI system wrote the story. 
We also found that participants reported greater willingness to empathize with AI-written stories when there was transparency about the story author (t494=–5.49, P<.001, Cohen d=0.36). Conclusions: Our work sheds light on how empathy toward AI or human narrators is tied to the way the text is presented, thus informing ethical considerations of empathetic artificial social support or mental health chatbots. %M 39321450 %R 10.2196/62679 %U https://mental.jmir.org/2024/1/e62679 %U https://doi.org/10.2196/62679 %U http://www.ncbi.nlm.nih.gov/pubmed/39321450 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e59505 %T Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook %A AlSaad,Rawan %A Abd-alrazaq,Alaa %A Boughorbel,Sabri %A Ahmed,Arfan %A Renault,Max-Antoine %A Damseh,Rafat %A Sheikh,Javaid %+ Weill Cornell Medicine-Qatar, Education City, Street 2700, Doha, Qatar, 974 44928830, rta4003@qatar-med.cornell.edu %K artificial intelligence %K large language models %K multimodal large language models %K multimodality %K multimodal generative artificial intelligence %K multimodal generative AI %K generative artificial intelligence %K generative AI %K health care %D 2024 %7 25.9.2024 %9 Viewpoint %J J Med Internet Res %G English %X In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data–driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems. 
%M 39321458 %R 10.2196/59505 %U https://www.jmir.org/2024/1/e59505 %U https://doi.org/10.2196/59505 %U http://www.ncbi.nlm.nih.gov/pubmed/39321458 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e60361 %T Neural Conversational Agent for Weight Loss Counseling: Protocol for an Implementation and Feasibility Study %A Kotov,Alexander %A Idalski Carcone,April %A Towner,Elizabeth %+ Department of Computer Science, College of Engineering, Wayne State University, Suite 14001.6, 5057 Woodward Ave, Detroit, MI, 48202, United States, 1 3135779307, kotov@wayne.edu %K conversational agents %K artificial intelligence %K behavior change %K weight loss %K obesity %K motivational interviewing %K web-based application %K deep learning %K transformers %K large language models %K feasibility study %D 2024 %7 20.9.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Obesity is a common, serious, and costly chronic disease. Current clinical practice guidelines recommend that providers augment the longitudinal care of people living with obesity with consistent support for the development of self-efficacy and motivation to modify their lifestyle behaviors. Lifestyle behavior change aligns with the goals of motivational interviewing (MI), a client-centered yet directive counseling modality. However, training health care providers to be proficient in MI is expensive and time-consuming, resulting in a lack of trained counselors and limiting the widespread adoption of MI in clinical practice. Artificial intelligence (AI) counselors accessible via the internet can help circumvent these barriers. Objective: The primary objective is to explore the feasibility of conducting unscripted MI-consistent counseling using Neural Agent for Obesity Motivational Interviewing (NAOMI), a large language model (LLM)–based web app for weight loss counseling. The secondary objectives are to test the acceptability and usability of NAOMI’s counseling and examine its ability to shift motivational precursors in a sample of patients with overweight and obesity recruited from primary care clinics. Methods: NAOMI will be developed based on recent advances in deep learning in 4 stages. In stages 1 and 2, NAOMI will be implemented using an open-source foundation LLM and (1) few-shot learning based on a prompt with task-specific instructions and (2) a domain adaptation strategy based on fine-tuning the LLM using a large corpus of general psychotherapy and MI treatment transcripts. In stages 3 and 4, we will refine the best of these 2 approaches. Each NAOMI version will be evaluated using a mixed methods approach in which 10 adults (18-65 years) meeting the criteria for overweight or obesity (25.0≤BMI≤39.9) interact with NAOMI and provide feedback. NAOMI’s fidelity to the MI framework will be assessed using the Motivational Interviewing Treatment Integrity scale. Participants’ general perceptions of AI conversational agents and NAOMI specifically will be assessed via Pre- and Post-Interaction Questionnaires. Motivational precursors, such as participants’ confidence, importance, and readiness for changing lifestyle behaviors (eg, diet and activity), will be measured before and after the interaction, and 1 week later. A qualitative analysis of changes in the measures of perceptions of AI agents and counselors and motivational precursors will be performed. 
Participants will rate NAOMI’s usability and empathic skills post interaction via questionnaire-based assessments along with providing feedback about their experience with NAOMI via a qualitative interview. Results: NAOMI (version 1.0) has been developed. Participant recruitment will commence in September 2024. Data collection activities are expected to conclude in May 2025. Conclusions: If proven effective, LLM-based counseling agents can become a cost-effective approach for addressing the obesity epidemic at a public health level. They can also have a broad, transformative impact on the delivery of MI and other psychotherapeutic treatment modalities extending their reach and broadening access. International Registered Report Identifier (IRRID): PRR1-10.2196/60361 %M 39303273 %R 10.2196/60361 %U https://www.researchprotocols.org/2024/1/e60361 %U https://doi.org/10.2196/60361 %U http://www.ncbi.nlm.nih.gov/pubmed/39303273 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54617 %T Using Large Language Models to Detect Depression From User-Generated Diary Text Data as a Novel Approach in Digital Mental Health Screening: Instrument Validation Study %A Shin,Daun %A Kim,Hyoseung %A Lee,Seunghwan %A Cho,Younhee %A Jung,Whanbo %+ Department of Psychiatry, Anam Hospital, Korea University, 73 Goryeodae-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea, 82 1093649735, rune1018@gmail.com %K depression %K screening %K artificial intelligence %K digital health technology %K text data %D 2024 %7 18.9.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Depressive disorders have substantial global implications, leading to various social consequences, including decreased occupational productivity and a high disability burden. Early detection and intervention for clinically significant depression have gained attention; however, the existing depression screening tools, such as the Center for Epidemiologic Studies Depression Scale, have limitations in objectivity and accuracy. Therefore, researchers are identifying objective indicators of depression, including image analysis, blood biomarkers, and ecological momentary assessments (EMAs). Among EMAs, user-generated text data, particularly from diary writing, have emerged as a clinically significant and analyzable source for detecting or diagnosing depression, leveraging advancements in large language models such as ChatGPT. Objective: We aimed to detect depression based on user-generated diary text through an emotional diary writing app using a large language model (LLM). We aimed to validate the value of the semistructured diary text data as an EMA data source. Methods: Participants were assessed for depression using the Patient Health Questionnaire and suicide risk was evaluated using the Beck Scale for Suicide Ideation before starting and after completing the 2-week diary writing period. The text data from the daily diaries were also used in the analysis. The performance of leading LLMs, such as ChatGPT with GPT-3.5 and GPT-4, was assessed with and without GPT-3.5 fine-tuning on the training data set. The model performance comparison involved the use of chain-of-thought and zero-shot prompting to analyze the text structure and content. Results: We used 428 diaries from 91 participants; GPT-3.5 fine-tuning demonstrated superior performance in depression detection, achieving an accuracy of 0.902 and a specificity of 0.955. 
However, the balanced accuracy was the highest (0.844) for GPT-3.5 without fine-tuning and prompt techniques; it displayed a recall of 0.929. Conclusions: Both GPT-3.5 and GPT-4.0 demonstrated relatively reasonable performance in recognizing the risk of depression based on diaries. Our findings highlight the potential clinical usefulness of user-generated text data for detecting depression. In addition to measurable indicators, such as step count and physical activity, future research should increasingly emphasize qualitative digital expression. %M 39292502 %R 10.2196/54617 %U https://www.jmir.org/2024/1/e54617 %U https://doi.org/10.2196/54617 %U http://www.ncbi.nlm.nih.gov/pubmed/39292502 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e65527 %T Considerations and Challenges in the Application of Large Language Models for Patient Complaint Resolution %A Wei,Bin %A Hu,Xin %A Wu,XiaoRong %+ The 1st Affiliated Hospital, Jiangxi Medical College, Nanchang University, No. 17 Yongwai Zheng Street, Donghu District, Nanchang, 330000, China, 86 13617093259, wxr98021@126.com %K ChatGPT %K large language model %K LLM %K artificial intelligence %K AI %K patient complaint %K empathy %K efficiency %K patient satisfaction %K resource allocation %D 2024 %7 17.9.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 39288405 %R 10.2196/65527 %U https://www.jmir.org/2024/1/e65527 %U https://doi.org/10.2196/65527 %U http://www.ncbi.nlm.nih.gov/pubmed/39288405 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56859 %T Performance of ChatGPT in the In-Training Examination for Anesthesiology and Pain Medicine Residents in South Korea: Observational Study %A Yoon,Soo-Hyuk %A Oh,Seok Kyeong %A Lim,Byung Gun %A Lee,Ho-Jin %+ Department of Anesthesiology and Pain Medicine, Seoul National University Hospital, Seoul National University College of Medicine, Daehak-ro 101, Jongno-gu, Seoul, 03080, Republic of Korea, 82 220720039, hjpainfree@snu.ac.kr %K AI tools %K problem solving %K anesthesiology %K artificial intelligence %K pain medicine %K ChatGPT %K health care %K medical education %K South Korea %D 2024 %7 16.9.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT has been tested in health care, including the US Medical Licensing Examination and specialty exams, showing near-passing results. Its performance in the field of anesthesiology has been assessed using English board examination questions; however, its effectiveness in Korea remains unexplored. Objective: This study investigated the problem-solving performance of ChatGPT in the fields of anesthesiology and pain medicine in the Korean language context, highlighted advancements in artificial intelligence (AI), and explored its potential applications in medical education. Methods: We investigated the performance (number of correct answers/number of questions) of GPT-4, GPT-3.5, and CLOVA X in the fields of anesthesiology and pain medicine, using in-training examinations that have been administered to Korean anesthesiology residents over the past 5 years, with an annual composition of 100 questions. Questions containing images, diagrams, or photographs were excluded from the analysis. Furthermore, to assess the performance differences of the GPT across different languages, we conducted a comparative analysis of the GPT-4’s problem-solving proficiency using both the original Korean texts and their English translations. Results: A total of 398 questions were analyzed. 
GPT-4 (67.8%) demonstrated a significantly better overall performance than GPT-3.5 (37.2%) and CLOVA X (36.7%). However, GPT-3.5 and CLOVA X did not show significant differences in their overall performance. Additionally, GPT-4 showed superior performance on questions translated into English, indicating a language processing discrepancy (English: 75.4% vs Korean: 67.8%; difference 7.5%; 95% CI 3.1%-11.9%; P=.001). Conclusions: This study underscores the potential of AI tools, such as ChatGPT, in medical education and practice but emphasizes the need for cautious application and further refinement, especially in non-English medical contexts. The findings suggest that although AI advancements are promising, they require careful evaluation and development to ensure acceptable performance across diverse linguistic and professional settings. %M 39284182 %R 10.2196/56859 %U https://mededu.jmir.org/2024/1/e56859 %U https://doi.org/10.2196/56859 %U http://www.ncbi.nlm.nih.gov/pubmed/39284182 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e56797 %T ChatGPT Use Among Pediatric Health Care Providers: Cross-Sectional Survey Study %A Kisvarday,Susannah %A Yan,Adam %A Yarahuan,Julia %A Kats,Daniel J %A Ray,Mondira %A Kim,Eugene %A Hong,Peter %A Spector,Jacob %A Bickel,Jonathan %A Parsons,Chase %A Rabbani,Naveed %A Hron,Jonathan D %+ Division of General Pediatrics, Boston Children's Hospital, 300 Longwood Avenue, Boston, MA, 02115, United States, 1 5704283137, susannah.kisvarday@childrens.harvard.edu %K ChatGPT %K machine learning %K surveys and questionnaires %K medical informatics applications %K OpenAI %K large language model %K LLM %K machine learning %K pediatric %K chatbot %K artificial intelligence %K AI %K digital tools %D 2024 %7 12.9.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: The public launch of OpenAI’s ChatGPT platform generated immediate interest in the use of large language models (LLMs). Health care institutions are now grappling with establishing policies and guidelines for the use of these technologies, yet little is known about how health care providers view LLMs in medical settings. Moreover, there are no studies assessing how pediatric providers are adopting these readily accessible tools. Objective: The aim of this study was to determine how pediatric providers are currently using LLMs in their work as well as their interest in using a Health Insurance Portability and Accountability Act (HIPAA)–compliant version of ChatGPT in the future. Methods: A survey instrument consisting of structured and unstructured questions was iteratively developed by a team of informaticians from various pediatric specialties. The survey was sent via Research Electronic Data Capture (REDCap) to all Boston Children’s Hospital pediatric providers. Participation was voluntary and uncompensated, and all survey responses were anonymous. Results: Surveys were completed by 390 pediatric providers. Approximately 50% (197/390) of respondents had used an LLM; of these, almost 75% (142/197) were already using an LLM for nonclinical work and 27% (52/195) for clinical work. Providers detailed the various ways they are currently using an LLM in their clinical and nonclinical work. Only 29% (n=105) of 362 respondents indicated that ChatGPT should be used for patient care in its present state; however, 73.8% (273/368) reported they would use a HIPAA-compliant version of ChatGPT if one were available. Providers’ proposed future uses of LLMs in health care are described. 
Conclusions: Despite significant concerns and barriers to LLM use in health care, pediatric providers are already using LLMs at work. This study will give policy makers needed information about how providers are using LLMs clinically. %M 39265163 %R 10.2196/56797 %U https://formative.jmir.org/2024/1/e56797 %U https://doi.org/10.2196/56797 %U http://www.ncbi.nlm.nih.gov/pubmed/39265163 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60501 %T Prompt Engineering Paradigms for Medical Applications: Scoping Review %A Zaghir,Jamil %A Naguib,Marco %A Bjelogrlic,Mina %A Névéol,Aurélie %A Tannier,Xavier %A Lovis,Christian %+ Department of Radiology and Medical Informatics, University of Geneva, Chemin des Mines, 9, Geneva, 1202, Switzerland, 41 022 379 08 18, Jamil.Zaghir@unige.ch %K prompt engineering %K prompt design %K prompt learning %K prompt tuning %K large language models %K LLMs %K scoping review %K clinical natural language processing %K natural language processing %K NLP %K medical texts %K medical application %K medical applications %K clinical practice %K privacy %K medicine %K computer science %K medical informatics %D 2024 %7 10.9.2024 %9 Review %J J Med Internet Res %G English %X Background: Prompt engineering, focusing on crafting effective prompts to large language models (LLMs), has garnered attention for its capabilities at harnessing the potential of LLMs. This is even more crucial in the medical domain due to its specialized terminology and language technicity. Clinical natural language processing applications must navigate complex language and ensure privacy compliance. Prompt engineering offers a novel approach by designing tailored prompts to guide models in exploiting clinically relevant information from complex medical texts. Despite its promise, the efficacy of prompt engineering in the medical domain remains to be fully explored. Objective: The aim of the study is to review research efforts and technical approaches in prompt engineering for medical applications as well as provide an overview of opportunities and challenges for clinical practice. Methods: Databases indexing the fields of medicine, computer science, and medical informatics were queried in order to identify relevant published papers. Since prompt engineering is an emerging field, preprint databases were also considered. Multiple data were extracted, such as the prompt paradigm, the involved LLMs, the languages of the study, the domain of the topic, the baselines, and several learning, design, and architecture strategies specific to prompt engineering. We include studies that apply prompt engineering–based methods to the medical domain, published between 2022 and 2024, and covering multiple prompt paradigms such as prompt learning (PL), prompt tuning (PT), and prompt design (PD). Results: We included 114 recent prompt engineering studies. Among the 3 prompt paradigms, we have observed that PD is the most prevalent (78 papers). In 12 papers, PD, PL, and PT terms were used interchangeably. While ChatGPT is the most commonly used LLM, we have identified 7 studies using this LLM on a sensitive clinical data set. Chain-of-thought, present in 17 studies, emerges as the most frequent PD technique. While PL and PT papers typically provide a baseline for evaluating prompt-based approaches, 61% (48/78) of the PD studies do not report any nonprompt-related baseline. 
Finally, we individually examine each key piece of prompt engineering–specific information reported across papers and find that many studies neglect to mention these details explicitly, posing a challenge for advancing prompt engineering research. Conclusions: In addition to reporting on trends and the scientific landscape of prompt engineering, we provide reporting guidelines for future studies to help advance research in the medical field. We also disclose tables and figures summarizing the available medical prompt engineering papers and hope that future contributions will leverage these existing works to better advance the field. %M 39255030 %R 10.2196/60501 %U https://www.jmir.org/2024/1/e60501 %U https://doi.org/10.2196/60501 %U http://www.ncbi.nlm.nih.gov/pubmed/39255030 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54985 %T The Diagnostic Ability of GPT-3.5 and GPT-4.0 in Surgery: Comparative Analysis %A Liu,Jiayu %A Liang,Xiuting %A Fang,Dandong %A Zheng,Jiqi %A Yin,Chengliang %A Xie,Hui %A Li,Yanteng %A Sun,Xiaochun %A Tong,Yue %A Che,Hebin %A Hu,Ping %A Yang,Fan %A Wang,Bingxian %A Chen,Yuanyuan %A Cheng,Gang %A Zhang,Jianning %+ Department of Neurosurgery, The First Medical Centre, Chinese PLA General Hospital, No. 28 Fuxing Road, Haidian District, Beijing, 100853, China, 86 01066938439, jnzhang2018@163.com %K ChatGPT %K accuracy rates %K artificial intelligence %K diagnosis %K surgeon %D 2024 %7 10.9.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT (OpenAI) has shown great potential in clinical diagnosis and could become an excellent auxiliary tool in clinical practice. This study investigates and evaluates the diagnostic capabilities of ChatGPT by comparing the performance of GPT-3.5 and GPT-4.0 across model iterations. Objective: This study aims to evaluate the precise diagnostic ability of GPT-3.5 and GPT-4.0 for colon cancer and its potential as an auxiliary diagnostic tool for surgeons and to compare the diagnostic accuracy rates between GPT-3.5 and GPT-4.0. We precisely assess the accuracy of primary and secondary diagnoses and analyze the causes of misdiagnoses in GPT-3.5 and GPT-4.0 according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings. Methods: We retrieved 316 case reports for intestinal cancer from the Chinese Medical Association Publishing House database, of which 286 cases were deemed valid after data cleansing. The cases were translated from Mandarin to English and then input into GPT-3.5 and GPT-4.0 using a simple, direct prompt to elicit primary and secondary diagnoses. We conducted a comparative study to evaluate the diagnostic accuracy of GPT-4.0 and GPT-3.5. Three senior surgeons from the General Surgery Department, specializing in Colorectal Surgery, assessed the diagnostic information at the Chinese PLA (People’s Liberation Army) General Hospital. The accuracy of primary and secondary diagnoses was scored based on predefined criteria. Additionally, we analyzed and compared the causes of misdiagnoses in both models according to 7 categories: patient histories, symptoms, physical signs, laboratory examinations, imaging examinations, pathological examinations, and intraoperative findings. 
Results: Out of 286 cases, GPT-4.0 and GPT-3.5 both demonstrated high diagnostic accuracy for primary diagnoses, but the accuracy rates of GPT-4.0 were significantly higher than GPT-3.5 (mean 0.972, SD 0.137 vs mean 0.855, SD 0.335; t285=5.753; P<.001). For secondary diagnoses, the accuracy rates of GPT-4.0 were also significantly higher than GPT-3.5 (mean 0.908, SD 0.159 vs mean 0.617, SD 0.349; t285=–7.727; P<.001). GPT-3.5 showed limitations in processing patient history, symptom presentation, laboratory tests, and imaging data. While GPT-4.0 improved upon GPT-3.5, it still has limitations in identifying symptoms and laboratory test data. For both primary and secondary diagnoses, there was no significant difference in accuracy related to age, gender, or system group between GPT-4.0 and GPT-3.5. Conclusions: This study demonstrates that ChatGPT, particularly GPT-4.0, possesses significant diagnostic potential, with GPT-4.0 exhibiting higher accuracy than GPT-3.5. However, GPT-4.0 still has limitations, particularly in recognizing patient symptoms and laboratory data, indicating a need for more research in real-world clinical settings to enhance its diagnostic capabilities. %M 39255016 %R 10.2196/54985 %U https://www.jmir.org/2024/1/e54985 %U https://doi.org/10.2196/54985 %U http://www.ncbi.nlm.nih.gov/pubmed/39255016 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56121 %T Quality and Accountability of ChatGPT in Health Care in Low- and Middle-Income Countries: Simulated Patient Study %A Si,Yafei %A Yang,Yuyi %A Wang,Xi %A Zu,Jiaqi %A Chen,Xi %A Fan,Xiaojing %A An,Ruopeng %A Gong,Sen %+ School of Public Policy and Administration, Xi’an Jiaotong University, 28 West Xianning Road, Xi'an, 710049, China, 86 15891725861, emirada@163.com %K ChatGPT %K generative AI %K simulated patient %K health care %K quality and safety %K low- and middle-income countries %K quality %K LMIC %K patient study %K effectiveness %K reliability %K medication prescription %K prescription %K noncommunicable diseases %K AI integration %K AI %K artificial intelligence %D 2024 %7 9.9.2024 %9 Research Letter %J J Med Internet Res %G English %X Using simulated patients to mimic 9 established noncommunicable and infectious diseases, we assessed ChatGPT’s performance in treatment recommendations for common diseases in low- and middle-income countries. ChatGPT had a high level of accuracy in both correct diagnoses (20/27, 74%) and medication prescriptions (22/27, 82%) but a concerning level of unnecessary or harmful medications (23/27, 85%) even with correct diagnoses. ChatGPT performed better in managing noncommunicable diseases than infectious ones. These results highlight the need for cautious AI integration in health care systems to ensure quality and safety. 
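A small sketch of how the kind of proportions reported in the research letter above can be tallied from simulated-patient runs follows; the visit records and field names are fabricated placeholders for illustration, not the study's data.

# Illustrative tally of simulated-patient outcomes; records and field names are fabricated.
visits = [
    {"disease": "hypertension", "diagnosis_correct": True,
     "prescription_correct": True, "unnecessary_or_harmful_drug": True},
    {"disease": "tuberculosis", "diagnosis_correct": False,
     "prescription_correct": True, "unnecessary_or_harmful_drug": True},
    {"disease": "type 2 diabetes", "diagnosis_correct": True,
     "prescription_correct": False, "unnecessary_or_harmful_drug": False},
]

def proportion(records: list[dict], key: str) -> str:
    hits = sum(bool(r[key]) for r in records)
    return f"{key}: {hits}/{len(records)} ({100 * hits / len(records):.0f}%)"

for key in ("diagnosis_correct", "prescription_correct", "unnecessary_or_harmful_drug"):
    print(proportion(visits, key))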
%M 39250188 %R 10.2196/56121 %U https://www.jmir.org/2024/1/e56121 %U https://doi.org/10.2196/56121 %U http://www.ncbi.nlm.nih.gov/pubmed/39250188 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e58478 %T Practical Applications of Large Language Models for Health Care Professionals and Scientists %A Reis,Florian %A Lenz,Christian %A Gossen,Manfred %A Volk,Hans-Dieter %A Drzeniek,Norman Michael %K artificial intelligence %K healthcare %K chatGPT %K large language model %K prompting %K LLM %K applications %K AI %K scientists %K physicians %K health care %D 2024 %7 5.9.2024 %9 %J JMIR Med Inform %G English %X With the popularization of large language models (LLMs), strategies for their effective and safe usage in health care and research have become increasingly pertinent. Despite the growing interest and eagerness among health care professionals and scientists to exploit the potential of LLMs, initial attempts may yield suboptimal results due to a lack of user experience, thus complicating the integration of artificial intelligence (AI) tools into workplace routine. Focusing on scientists and health care professionals with limited LLM experience, this viewpoint article highlights and discusses 6 easy-to-implement use cases of practical relevance. These encompass customizing translations, refining text and extracting information, generating comprehensive overviews and specialized insights, compiling ideas into cohesive narratives, crafting personalized educational materials, and facilitating intellectual sparring. Additionally, we discuss general prompting strategies and precautions for the implementation of AI tools in biomedicine. Despite various hurdles and challenges, the integration of LLMs into daily routines of physicians and researchers promises heightened workplace productivity and efficiency. %R 10.2196/58478 %U https://medinform.jmir.org/2024/1/e58478 %U https://doi.org/10.2196/58478 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56022 %T An Advanced Machine Learning Model for a Web-Based Artificial Intelligence–Based Clinical Decision Support System Application: Model Development and Validation Study %A Lin,Tai-Han %A Chung,Hsing-Yi %A Jian,Ming-Jr %A Chang,Chih-Kai %A Perng,Cherng-Lih %A Liao,Guo-Shiou %A Yu,Jyh-Cherng %A Dai,Ming-Shen %A Yu,Cheng-Ping %A Shang,Hung-Sheng %+ Division of Clinical Pathology, Department of Pathology, Tri-Service General Hospital, National Defense Medical Center, No. 161, Sec. 6, Minquan E. Road, Neihu District, Taipei, 11490, Taiwan, 886 920713130, iamkeith001@gmail.com %K breast cancer recurrence %K artificial intelligence–based clinical decision support system %K machine learning %K personalized treatment planning %K ChatGPT %K predictive model accuracy %D 2024 %7 4.9.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Breast cancer is a leading global health concern, necessitating advancements in recurrence prediction and management. The development of an artificial intelligence (AI)–based clinical decision support system (AI-CDSS) using ChatGPT addresses this need with the aim of enhancing both prediction accuracy and user accessibility. Objective: This study aims to develop and validate an advanced machine learning model for a web-based AI-CDSS application, leveraging the question-and-answer guidance capabilities of ChatGPT to enhance data preprocessing and model development, thereby improving the prediction of breast cancer recurrence. 
Methods: This study focused on developing an advanced machine learning model by leveraging data from the Tri-Service General Hospital breast cancer registry of 3577 patients (2004-2016). As a tertiary medical center, it accepts referrals from four branches—3 branches in the northern region and 1 branch on an offshore island in our country—that manage chronic diseases but refer complex surgical cases, including breast cancer, to the main center, enriching our study population’s diversity. Model training used patient data from 2004 to 2012, with subsequent validation using data from 2013 to 2016, ensuring comprehensive assessment and robustness of our predictive models. ChatGPT is integral to preprocessing and model development, aiding in hormone receptor categorization, age binning, and one-hot encoding. Techniques such as the synthetic minority oversampling technique address the imbalance of data sets. Various algorithms, including light gradient-boosting machine, gradient boosting, and extreme gradient boosting, were used, and their performance was evaluated using metrics such as the area under the curve, accuracy, sensitivity, and F1-score. Results: The light gradient-boosting machine model demonstrated superior performance, with an area under the curve of 0.80, followed closely by the gradient boosting and extreme gradient boosting models. The web interface of the AI-CDSS tool was effectively tested in clinical decision-making scenarios, proving its use in personalized treatment planning and patient involvement. Conclusions: The AI-CDSS tool, enhanced by ChatGPT, marks a significant advancement in breast cancer recurrence prediction, offering a more individualized and accessible approach for clinicians and patients. Although promising, further validation in diverse clinical settings is recommended to confirm its efficacy and expand its use. %M 39231422 %R 10.2196/56022 %U https://www.jmir.org/2024/1/e56022 %U https://doi.org/10.2196/56022 %U http://www.ncbi.nlm.nih.gov/pubmed/39231422 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e59258 %T Evaluating the Capabilities of Generative AI Tools in Understanding Medical Papers: Qualitative Study %A Akyon,Seyma Handan %A Akyon,Fatih Cagatay %A Camyar,Ahmet Sefa %A Hızlı,Fatih %A Sari,Talha %A Hızlı,Şamil %+ Golpazari Family Health Center, Istiklal Mahallesi Fevzi Cakmak Caddesi No:23 Golpazari, Bilecik, 11700, Turkey, 90 5052568096, drseymahandan@gmail.com %K large language models %K LLM %K LLMs %K ChatGPT %K artificial intelligence %K AI %K natural language processing %K medicine %K health care %K GPT %K machine learning %K language model %K language models %K generative %K research paper %K research papers %K scientific research %K answer %K answers %K response %K responses %K comprehension %K STROBE %K Strengthening the Reporting of Observational Studies in Epidemiology %D 2024 %7 4.9.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Reading medical papers is a challenging and time-consuming task for doctors, especially when the papers are long and complex. A tool that can help doctors efficiently process and understand medical papers is needed. 
Objective: This study aims to critically assess and compare the comprehension capabilities of large language models (LLMs) in accurately and efficiently understanding medical research papers using the STROBE (Strengthening the Reporting of Observational Studies in Epidemiology) checklist, which provides a standardized framework for evaluating key elements of observational study. Methods: The study is a methodological type of research. The study aims to evaluate the understanding capabilities of new generative artificial intelligence tools in medical papers. A novel benchmark pipeline processed 50 medical research papers from PubMed, comparing the answers of 6 LLMs (GPT-3.5-Turbo, GPT-4-0613, GPT-4-1106, PaLM 2, Claude v1, and Gemini Pro) to the benchmark established by expert medical professors. Fifteen questions, derived from the STROBE checklist, assessed LLMs’ understanding of different sections of a research paper. Results: LLMs exhibited varying performance, with GPT-3.5-Turbo achieving the highest percentage of correct answers (n=3916, 66.9%), followed by GPT-4-1106 (n=3837, 65.6%), PaLM 2 (n=3632, 62.1%), Claude v1 (n=2887, 58.3%), Gemini Pro (n=2878, 49.2%), and GPT-4-0613 (n=2580, 44.1%). Statistical analysis revealed statistically significant differences between LLMs (P<.001), with older models showing inconsistent performance compared to newer versions. LLMs showcased distinct performances for each question across different parts of a scholarly paper—with certain models like PaLM 2 and GPT-3.5 showing remarkable versatility and depth in understanding. Conclusions: This study is the first to evaluate the performance of different LLMs in understanding medical papers using the retrieval augmented generation method. The findings highlight the potential of LLMs to enhance medical research by improving efficiency and facilitating evidence-based decision-making. Further research is needed to address limitations such as the influence of question formats, potential biases, and the rapid evolution of LLM models. %M 39230947 %R 10.2196/59258 %U https://medinform.jmir.org/2024/1/e59258 %U https://doi.org/10.2196/59258 %U http://www.ncbi.nlm.nih.gov/pubmed/39230947 %0 Journal Article %@ 2564-1891 %I JMIR Publications %V 4 %N %P e59641 %T Large Language Models Can Enable Inductive Thematic Analysis of a Social Media Corpus in a Single Prompt: Human Validation Study %A Deiner,Michael S %A Honcharov,Vlad %A Li,Jiawei %A Mackey,Tim K %A Porco,Travis C %A Sarkar,Urmimala %+ Departments of Ophthalmology, Epidemiology and Biostatistics, Global Health Sciences, and Francis I Proctor Foundation, University of California San Francisco, 490 Illinois St, 2nd Floor, San Francisco, CA, 94158, United States, 1 415 476 4101, travis.porco@ucsf.edu %K generative large language model %K generative pretrained transformer %K GPT %K Claude %K Twitter %K X formerly known as Twitter %K social media %K inductive content analysis %K COVID-19 %K vaccine hesitancy %K infodemiology %D 2024 %7 29.8.2024 %9 Original Paper %J JMIR Infodemiology %G English %X Background: Manually analyzing public health–related content from social media provides valuable insights into the beliefs, attitudes, and behaviors of individuals, shedding light on trends and patterns that can inform public understanding, policy decisions, targeted interventions, and communication strategies. Unfortunately, the time and effort needed from well-trained human subject matter experts makes extensive manual social media listening unfeasible. 
Generative large language models (LLMs) can potentially summarize and interpret large amounts of text, but it is unclear to what extent LLMs can glean subtle health-related meanings in large sets of social media posts and reasonably report health-related themes. Objective: We aimed to assess the feasibility of using LLMs for topic model selection or inductive thematic analysis of large contents of social media posts by attempting to answer the following question: Can LLMs conduct topic model selection and inductive thematic analysis as effectively as humans did in a prior manual study, or at least reasonably, as judged by subject matter experts? Methods: We asked the same research question and used the same set of social media content for both the LLM selection of relevant topics and the LLM analysis of themes as was conducted manually in a published study about vaccine rhetoric. We used the results from that study as background for this LLM experiment by comparing the results from the prior manual human analyses with the analyses from 3 LLMs: GPT4-32K, Claude-instant-100K, and Claude-2-100K. We also assessed if multiple LLMs had equivalent ability and assessed the consistency of repeated analysis from each LLM. Results: The LLMs generally gave high rankings to the topics chosen previously by humans as most relevant. We reject a null hypothesis (P<.001, overall comparison) and conclude that these LLMs are more likely to include the human-rated top 5 content areas in their top rankings than would occur by chance. Regarding theme identification, LLMs identified several themes similar to those identified by humans, with very low hallucination rates. Variability occurred between LLMs and between test runs of an individual LLM. Despite not consistently matching the human-generated themes, subject matter experts found themes generated by the LLMs were still reasonable and relevant. Conclusions: LLMs can effectively and efficiently process large social media–based health-related data sets. LLMs can extract themes from such data that human subject matter experts deem reasonable. However, we were unable to show that the LLMs we tested can replicate the depth of analysis from human subject matter experts by consistently extracting the same themes from the same data. There is vast potential, once better validated, for automated LLM-based real-time social listening for common and rare health conditions, informing public health understanding of the public’s interests and concerns and determining the public’s ideas to address them. %M 39207842 %R 10.2196/59641 %U https://infodemiology.jmir.org/2024/1/e59641 %U https://doi.org/10.2196/59641 %U http://www.ncbi.nlm.nih.gov/pubmed/39207842 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57896 %T Current Status of ChatGPT Use in Medical Education: Potentials, Challenges, and Strategies %A Xu,Tianhui %A Weng,Huiting %A Liu,Fang %A Yang,Li %A Luo,Yuanyuan %A Ding,Ziwei %A Wang,Qin %+ Clinical Nursing Teaching and Research Section, The Second Xiangya Hospital of Central South University, 139 Middle Renmin Road, Changsha, 410011, China, 86 18774806226, wangqin3421@csu.edu.cn %K chat generative pretrained transformer %K ChatGPT %K artificial intelligence %K medical education %K natural language processing %K clinical practice %D 2024 %7 28.8.2024 %9 Viewpoint %J J Med Internet Res %G English %X ChatGPT, a generative pretrained transformer, has garnered global attention and sparked discussions since its introduction on November 30, 2022. 
However, it has generated controversy within the realms of medical education and scientific research. This paper examines the potential applications, limitations, and strategies for using ChatGPT. ChatGPT offers personalized learning support to medical students through its robust natural language generation capabilities, enabling it to furnish answers. Moreover, it has demonstrated significant use in simulating clinical scenarios, facilitating teaching and learning processes, and revitalizing medical education. Nonetheless, numerous challenges accompany these advancements. In the context of education, it is of paramount importance to prevent excessive reliance on ChatGPT and combat academic plagiarism. Likewise, in the field of medicine, it is vital to guarantee the timeliness, accuracy, and reliability of content generated by ChatGPT. Concurrently, ethical challenges and concerns regarding information security arise. In light of these challenges, this paper proposes targeted strategies for addressing them. First, the risk of overreliance on ChatGPT and academic plagiarism must be mitigated through ideological education, fostering comprehensive competencies, and implementing diverse evaluation criteria. The integration of contemporary pedagogical methodologies in conjunction with the use of ChatGPT serves to enhance the overall quality of medical education. To enhance the professionalism and reliability of the generated content, it is recommended to implement measures to optimize ChatGPT’s training data professionally and enhance the transparency of the generation process. This ensures that the generated content is aligned with the most recent standards of medical practice. Moreover, the enhancement of value alignment and the establishment of pertinent legislation or codes of practice address ethical concerns, including those pertaining to algorithmic discrimination, the allocation of medical responsibility, privacy, and security. In conclusion, while ChatGPT presents significant potential in medical education, it also encounters various challenges. Through comprehensive research and the implementation of suitable strategies, it is anticipated that ChatGPT’s positive impact on medical education will be harnessed, laying the groundwork for advancing the discipline and fostering the development of high-caliber medical professionals. 
%M 39196640 %R 10.2196/57896 %U https://www.jmir.org/2024/1/e57896 %U https://doi.org/10.2196/57896 %U http://www.ncbi.nlm.nih.gov/pubmed/39196640 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54616 %T AI-Driven Diagnostic Assistance in Medical Inquiry: Reinforcement Learning Algorithm Development and Validation %A Zou,Xuan %A He,Weijie %A Huang,Yu %A Ouyang,Yi %A Zhang,Zhen %A Wu,Yu %A Wu,Yongsheng %A Feng,Lili %A Wu,Sheng %A Yang,Mengqi %A Chen,Xuyan %A Zheng,Yefeng %A Jiang,Rui %A Chen,Ting %+ Department of Computer Science and Technology, Tsinghua University, Room 3-609, Future Internet Technology Research Center, Tsinghua University, Beijing, 100084, China, 86 010 62797101, tingchen@tsinghua.edu.cn %K inquiry and diagnosis %K electronic health record %K reinforcement learning %K natural language processing %K artificial intelligence %D 2024 %7 23.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: For medical diagnosis, clinicians typically begin with a patient’s chief concerns, followed by questions about symptoms and medical history, physical examinations, and requests for necessary auxiliary examinations to gather comprehensive medical information. This complex medical investigation process has yet to be modeled by existing artificial intelligence (AI) methodologies. Objective: The aim of this study was to develop an AI-driven medical inquiry assistant for clinical diagnosis that provides inquiry recommendations by simulating clinicians’ medical investigating logic via reinforcement learning. Methods: We compiled multicenter, deidentified outpatient electronic health records from 76 hospitals in Shenzhen, China, spanning the period from July to November 2021. These records consisted of both unstructured textual information and structured laboratory test results. We first performed feature extraction and standardization using natural language processing techniques and then used a reinforcement learning actor-critic framework to explore the rational and effective inquiry logic. To align the inquiry process with actual clinical practice, we segmented the inquiry into 4 stages: inquiring about symptoms and medical history, conducting physical examinations, requesting auxiliary examinations, and terminating the inquiry with a diagnosis. External validation was conducted to validate the inquiry logic of the AI model. Results: This study focused on 2 retrospective inquiry-and-diagnosis tasks in the emergency and pediatrics departments. The emergency departments provided records of 339,020 consultations including mainly children (median age 5.2, IQR 2.6-26.1 years) with various types of upper respiratory tract infections (250,638/339,020, 73.93%). The pediatrics department provided records of 561,659 consultations, mainly of children (median age 3.8, IQR 2.0-5.7 years) with various types of upper respiratory tract infections (498,408/561,659, 88.73%). When conducting its own inquiries in both scenarios, the AI model demonstrated high diagnostic performance, with areas under the receiver operating characteristic curve of 0.955 (95% CI 0.953-0.956) and 0.943 (95% CI 0.941-0.944), respectively. 
When the AI model was used in a simulated collaboration with physicians, it notably reduced the average number of physicians’ inquiries to 46% (6.037/13.26; 95% CI 6.009-6.064) and 43% (6.245/14.364; 95% CI 6.225-6.269) while achieving areas under the receiver operating characteristic curve of 0.972 (95% CI 0.970-0.973) and 0.968 (95% CI 0.967-0.969) in the scenarios. External validation revealed a normalized Kendall τ distance of 0.323 (95% CI 0.301-0.346), indicating the inquiry consistency of the AI model with physicians. Conclusions: This retrospective analysis of predominantly respiratory pediatric presentations in emergency and pediatrics departments demonstrated that an AI-driven diagnostic assistant had high diagnostic performance both in stand-alone use and in simulated collaboration with clinicians. Its investigation process was found to be consistent with the clinicians’ medical investigation logic. These findings highlight the diagnostic assistant’s promise in assisting the decision-making processes of health care professionals. %M 39178403 %R 10.2196/54616 %U https://www.jmir.org/2024/1/e54616 %U https://doi.org/10.2196/54616 %U http://www.ncbi.nlm.nih.gov/pubmed/39178403 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50545 %T Integration of ChatGPT Into a Course for Medical Students: Explorative Study on Teaching Scenarios, Students’ Perception, and Applications %A Thomae,Anita V %A Witt,Claudia M %A Barth,Jürgen %K medical education %K ChatGPT %K artificial intelligence %K information for patients %K critical appraisal %K evaluation %K blended learning %K AI %K digital skills %K teaching %D 2024 %7 22.8.2024 %9 %J JMIR Med Educ %G English %X Background: Text-generating artificial intelligence (AI) such as ChatGPT offers many opportunities and challenges in medical education. Acquiring practical skills necessary for using AI in a clinical context is crucial, especially for medical education. Objective: This explorative study aimed to investigate the feasibility of integrating ChatGPT into teaching units and to evaluate the course and the importance of AI-related competencies for medical students. Since a possible application of ChatGPT in the medical field could be the generation of information for patients, we further investigated how such information is perceived by students in terms of persuasiveness and quality. Methods: ChatGPT was integrated into 3 different teaching units of a blended learning course for medical students. Using a mixed methods approach, quantitative and qualitative data were collected. As baseline data, we assessed students’ characteristics, including their openness to digital innovation. The students evaluated the integration of ChatGPT into the course and shared their thoughts regarding the future of text-generating AI in medical education. The course was evaluated based on the Kirkpatrick Model, with satisfaction, learning progress, and applicable knowledge considered as key assessment levels. In ChatGPT-integrating teaching units, students evaluated videos featuring information for patients regarding their persuasiveness on treatment expectations in a self-experience experiment and critically reviewed information for patients written using ChatGPT 3.5 based on different prompts. Results: A total of 52 medical students participated in the study. 
The comprehensive evaluation of the course revealed elevated levels of satisfaction, learning progress, and applicability specifically in relation to the ChatGPT-integrating teaching units. Furthermore, all evaluation levels demonstrated an association with each other. Higher openness to digital innovation was associated with higher satisfaction and, to a lesser extent, with higher applicability. AI-related competencies in other courses of the medical curriculum were perceived as highly important by medical students. Qualitative analysis highlighted potential use cases of ChatGPT in teaching and learning. In ChatGPT-integrating teaching units, students rated information for patients generated using a basic ChatGPT prompt as “moderate” in terms of comprehensibility, patient safety, and the correct application of communication rules taught during the course. The students’ ratings were considerably improved using an extended prompt. The same text, however, showed the smallest increase in treatment expectations when compared with information provided by humans (patient, clinician, and expert) via videos. Conclusions: This study offers valuable insights into integrating the development of AI competencies into a blended learning course. Integration of ChatGPT enhanced learning experiences for medical students. %R 10.2196/50545 %U https://mededu.jmir.org/2024/1/e50545 %U https://doi.org/10.2196/50545 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56500 %T Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study %A Li,Kevin Danis %A Fernandez,Adrian M %A Schwartz,Rachel %A Rios,Natalie %A Carlisle,Marvin Nathaniel %A Amend,Gregory M %A Patel,Hiren V %A Breyer,Benjamin N %+ Department of Urology, University of California San Francisco, 400 Parnassus Ave, San Francisco, CA, United States, 1 415 353 2200, kevin.d.li@ucsf.edu %K artificial intelligence %K ChatGPT %K large language models %K qualitative analysis %K content analysis %K buried penis %K qualitative interviews %K qualitative description %K urology %D 2024 %7 21.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models including GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although large language models have demonstrated enhanced contextual understanding and inferencing compared with traditional natural language processing, their performance in qualitative analysis versus that of humans remains unexplored. Objective: We evaluated the effectiveness of GPT-4 versus human researchers in qualitative analysis of interviews with patients with adult-acquired buried penis (AABP). Methods: Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis involved a structured 3-stage process—initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 underwent two phases: (1) a naïve phase, where GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes and (2) a comparison phase, where AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach. 
Results: The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m2. Human qualitative analysis identified “urinary issues” in 95% (19/20) and GPT-4 in 75% (15/20) of interviews, with the subtheme “spray or stream” noted in 60% (12/20) and 35% (7/20), respectively. “Sexual issues” were prominent (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including “pain with sex or masturbation” (7/20, 35%) and “difficulty with sex or masturbation” (4/20, 20%). Both analyses similarly highlighted “mental health issues” (11/20, 55%, both), although humans coded “depression” more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited “issues using public restrooms” (12/20, 60%) as impacting social life, whereas GPT-4 emphasized “struggles with romantic relationships” (9/20, 45%). “Hygiene issues” were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Humans uniquely identified “contributing factors” as a theme in all interviews. There was moderate agreement between human and GPT-4 coding (κ=0.401). Reliability assessments of GPT-4’s analyses showed consistent coding for themes including “body image struggles,” “chronic pain” (10/10, 100%), and “depression” (9/10, 90%). Other themes like “motivation for surgery” and “weight challenges” were reliably coded (8/10, 80%), while less frequent themes were variably identified across multiple iterations. Conclusions: Large language models including GPT-4 can effectively identify key themes in analyzing qualitative health care data, showing moderate agreement with human analysis. While human analysis provided a richer diversity of subthemes, the consistency of AI suggests its use as a complementary tool in qualitative research. With AI rapidly advancing, future studies should iterate analyses and circumvent token limitations by segmenting data, furthering the breadth and depth of large language model–driven qualitative analyses. %M 39167785 %R 10.2196/56500 %U https://www.jmir.org/2024/1/e56500 %U https://doi.org/10.2196/56500 %U http://www.ncbi.nlm.nih.gov/pubmed/39167785 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57037 %T Integrating ChatGPT in Orthopedic Education for Medical Undergraduates: Randomized Controlled Trial %A Gan,Wenyi %A Ouyang,Jianfeng %A Li,Hua %A Xue,Zhaowen %A Zhang,Yiming %A Dong,Qiu %A Huang,Jiadong %A Zheng,Xiaofei %A Zhang,Yiyi %+ The First Clinical Medical College of Jinan University, The First Affiliated Hospital of Jinan University, No. 613, Huangpu Avenue West, Tianhe District, Guangzhou, 510630, China, 86 130 76855735, yiyizjun@126.com %K ChatGPT %K medical education %K orthopedics %K artificial intelligence %K large language model %K natural language processing %K randomized controlled trial %K learning aid %D 2024 %7 20.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT is a natural language processing model developed by OpenAI, which can be iteratively updated and optimized to accommodate the changing and complex requirements of human verbal communication. Objective: The study aimed to evaluate ChatGPT’s accuracy in answering orthopedics-related multiple-choice questions (MCQs) and assess its short-term effects as a learning aid through a randomized controlled trial. 
In addition, long-term effects on student performance in other subjects were measured using final examination results. Methods: We first evaluated ChatGPT’s accuracy in answering MCQs pertaining to orthopedics across various question formats. Then, 129 undergraduate medical students participated in a randomized controlled study in which the ChatGPT group used ChatGPT as a learning tool, while the control group was prohibited from using artificial intelligence software to support learning. Following a 2-week intervention, the 2 groups’ understanding of orthopedics was assessed by an orthopedics test, and variations in the 2 groups’ performance in other disciplines were noted through a follow-up at the end of the semester. Results: ChatGPT-4.0 answered 1051 orthopedics-related MCQs with a 70.60% (742/1051) accuracy rate, including 71.8% (237/330) accuracy for A1 MCQs, 73.7% (330/448) accuracy for A2 MCQs, 70.2% (92/131) accuracy for A3/4 MCQs, and 58.5% (83/142) accuracy for case analysis MCQs. As of April 7, 2023, a total of 129 individuals participated in the experiment. However, 19 individuals withdrew from the experiment at various phases; thus, as of July 1, 2023, a total of 110 individuals accomplished the trial and completed all follow-up work. After we intervened in the learning style of the students in the short term, the ChatGPT group answered more questions correctly than the control group (ChatGPT group: mean 141.20, SD 26.68; control group: mean 130.80, SD 25.56; P=.04) in the orthopedics test, particularly on A1 (ChatGPT group: mean 46.57, SD 8.52; control group: mean 42.18, SD 9.43; P=.01), A2 (ChatGPT group: mean 60.59, SD 10.58; control group: mean 56.66, SD 9.91; P=.047), and A3/4 MCQs (ChatGPT group: mean 19.57, SD 5.48; control group: mean 16.46, SD 4.58; P=.002). At the end of the semester, we found that the ChatGPT group performed better on final examinations in surgery (ChatGPT group: mean 76.54, SD 9.79; control group: mean 72.54, SD 8.11; P=.02) and obstetrics and gynecology (ChatGPT group: mean 75.98, SD 8.94; control group: mean 72.54, SD 8.66; P=.04) than the control group. Conclusions: ChatGPT answers orthopedics-related MCQs accurately, and students using it excel in both short-term and long-term assessments. Our findings strongly support ChatGPT’s integration into medical education, enhancing contemporary instructional methods. Trial Registration: Chinese Clinical Trial Registry Chictr2300071774; https://www.chictr.org.cn/hvshowproject.html ?id=225740&v=1.0 %M 39163598 %R 10.2196/57037 %U https://www.jmir.org/2024/1/e57037 %U https://doi.org/10.2196/57037 %U http://www.ncbi.nlm.nih.gov/pubmed/39163598 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e56537 %T Evaluating Literature Reviews Conducted by Humans Versus ChatGPT: Comparative Study %A Mostafapour,Mehrnaz %A Fortier,Jacqueline H %A Pacheco,Karen %A Murray,Heather %A Garber,Gary %+ Canadian Medical Protective Association, 875 Carling Ave, Ottawa, ON, K1S 5P1, Canada, 1 800 267 6522, research@cmpa.org %K OpenAIs %K chatGPT %K AI vs. 
human %K literature search %K Chat GPT performance evaluation %K large language models %K artificial intelligence %K AI %K algorithm %K algorithms %K predictive model %K predictive models %K literature review %K literature reviews %D 2024 %7 19.8.2024 %9 Original Paper %J JMIR AI %G English %X Background: With the rapid evolution of artificial intelligence (AI), particularly large language models (LLMs) such as ChatGPT-4 (OpenAI), there is an increasing interest in their potential to assist in scholarly tasks, including conducting literature reviews. However, the efficacy of AI-generated reviews compared with traditional human-led approaches remains underexplored. Objective: This study aims to compare the quality of literature reviews conducted by the ChatGPT-4 model with those conducted by human researchers, focusing on the relational dynamics between physicians and patients. Methods: We included 2 literature reviews in the study on the same topic, namely, exploring factors affecting relational dynamics between physicians and patients in medicolegal contexts. One review used GPT-4, last updated in September 2021, and the other was conducted by human researchers. The human review involved a comprehensive literature search using medical subject headings and keywords in Ovid MEDLINE, followed by a thematic analysis of the literature to synthesize information from selected articles. The AI-generated review used a new prompt engineering approach, using iterative and sequential prompts to generate results. Comparative analysis was based on qualitative measures such as accuracy, response time, consistency, breadth and depth of knowledge, contextual understanding, and transparency. Results: GPT-4 produced an extensive list of relational factors rapidly. The AI model demonstrated an impressive breadth of knowledge but exhibited limitations in in-depth and contextual understanding, occasionally producing irrelevant or incorrect information. In comparison, human researchers provided a more nuanced and contextually relevant review. The comparative analysis assessed the reviews based on criteria including accuracy, response time, consistency, breadth and depth of knowledge, contextual understanding, and transparency. While GPT-4 showed advantages in response time and breadth of knowledge, human-led reviews excelled in accuracy, depth of knowledge, and contextual understanding. Conclusions: The study suggests that GPT-4, with structured prompt engineering, can be a valuable tool for conducting preliminary literature reviews by providing a broad overview of topics quickly. However, its limitations necessitate careful expert evaluation and refinement, making it an assistant rather than a substitute for human expertise in comprehensive literature reviews. Moreover, this research highlights the potential and limitations of using AI tools like GPT-4 in academic research, particularly in the fields of health services and medical research. It underscores the necessity of combining AI’s rapid information retrieval capabilities with human expertise for more accurate and contextually rich scholarly outputs. 
%M 39159446 %R 10.2196/56537 %U https://ai.jmir.org/2024/1/e56537 %U https://doi.org/10.2196/56537 %U http://www.ncbi.nlm.nih.gov/pubmed/39159446 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e56243 %T Extraction of Substance Use Information From Clinical Notes: Generative Pretrained Transformer–Based Investigation %A Shah-Mohammadi,Fatemeh %A Finkelstein,Joseph %+ Department of Biomedical Informatics, School of Medicine, The University of Utah, 421 Wakara Way, Ste 140, Salt Lake City, UT, 84108, United States, 1 801 581 4080, fatemeh.shah-mohammadi@utah.edu %K substance use %K natural language processing %K GPT %K prompt engineering %K zero-shot learning %K few-shot learning %D 2024 %7 19.8.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Understanding the multifaceted nature of health outcomes requires a comprehensive examination of the social, economic, and environmental determinants that shape individual well-being. Among these determinants, behavioral factors play a crucial role, particularly the consumption patterns of psychoactive substances, which have important implications for public health. The Global Burden of Disease Study shows a growing impact in disability-adjusted life years due to substance use. The successful identification of patients’ substance use information equips clinical care teams to address substance-related issues more effectively, enabling targeted support and ultimately improving patient outcomes. Objective: Traditional natural language processing methods face limitations in accurately parsing diverse clinical language associated with substance use. Large language models offer promise in overcoming these challenges by adapting to diverse language patterns. This study investigates the application of the generative pretrained transformer (GPT) model, specifically GPT-3.5, for extracting tobacco, alcohol, and substance use information from patient discharge summaries in zero-shot and few-shot learning settings. This study contributes to the evolving landscape of health care informatics by showcasing the potential of advanced language models in extracting nuanced information critical for enhancing patient care. Methods: The main data source for analysis in this paper is the Medical Information Mart for Intensive Care III data set. Among all notes in this data set, we focused on discharge summaries. Prompt engineering was undertaken, involving an iterative exploration of diverse prompts. Leveraging carefully curated examples and refined prompts, we investigate the model’s proficiency through zero-shot as well as few-shot prompting strategies. Results: The results show GPT’s varying effectiveness in identifying mentions of tobacco, alcohol, and substance use across learning scenarios. Zero-shot learning showed high accuracy in identifying substance use, whereas few-shot learning reduced accuracy but improved in identifying substance use status, enhancing recall and F1-score at the expense of lower precision. Conclusions: The excellence of zero-shot learning in precisely extracting text spans mentioning substance use demonstrates its effectiveness in situations in which comprehensive recall is important. Conversely, few-shot learning offers advantages when accurately determining the status of substance use is the primary focus, even if it involves a trade-off in precision. 
The results contribute to enhancement of early detection and intervention strategies, tailor treatment plans with greater precision, and ultimately, contribute to a holistic understanding of patient health profiles. By integrating these artificial intelligence–driven methods into electronic health record systems, clinicians can gain immediate, comprehensive insights into substance use that results in shaping interventions that are not only timely but also more personalized and effective. %M 39037700 %R 10.2196/56243 %U https://medinform.jmir.org/2024/1/e56243 %U https://doi.org/10.2196/56243 %U http://www.ncbi.nlm.nih.gov/pubmed/39037700 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52758 %T Human-Comparable Sensitivity of Large Language Models in Identifying Eligible Studies Through Title and Abstract Screening: 3-Layer Strategy Using GPT-3.5 and GPT-4 for Systematic Reviews %A Matsui,Kentaro %A Utsumi,Tomohiro %A Aoki,Yumi %A Maruki,Taku %A Takeshima,Masahiro %A Takaesu,Yoshikazu %+ Department of Neuropsychiatry, Graduate School of Medicine, University of the Ryukyus, 207 Uehara, Nishihara, Okinawa, 903-0215, Japan, 81 98 895 3331, takaesuy@med.u-ryukyu.ac.jp %K systematic review %K screening %K GPT-3.5 %K GPT-4 %K language model %K information science %K library science %K artificial intelligence %K prompt engineering %K meta-analysis %D 2024 %7 16.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The screening process for systematic reviews is resource-intensive. Although previous machine learning solutions have reported reductions in workload, they risked excluding relevant papers. Objective: We evaluated the performance of a 3-layer screening method using GPT-3.5 and GPT-4 to streamline the title and abstract-screening process for systematic reviews. Our goal is to develop a screening method that maximizes sensitivity for identifying relevant records. Methods: We conducted screenings on 2 of our previous systematic reviews related to the treatment of bipolar disorder, with 1381 records from the first review and 3146 from the second. Screenings were conducted using GPT-3.5 (gpt-3.5-turbo-0125) and GPT-4 (gpt-4-0125-preview) across three layers: (1) research design, (2) target patients, and (3) interventions and controls. The 3-layer screening was conducted using prompts tailored to each study. During this process, information extraction according to each study’s inclusion criteria and optimization for screening were carried out using a GPT-4–based flow without manual adjustments. Records were evaluated at each layer, and those meeting the inclusion criteria at all layers were subsequently judged as included. Results: On each layer, both GPT-3.5 and GPT-4 were able to process about 110 records per minute, and the total time required for screening the first and second studies was approximately 1 hour and 2 hours, respectively. In the first study, the sensitivities/specificities of the GPT-3.5 and GPT-4 were 0.900/0.709 and 0.806/0.996, respectively. Both screenings by GPT-3.5 and GPT-4 judged all 6 records used for the meta-analysis as included. In the second study, the sensitivities/specificities of the GPT-3.5 and GPT-4 were 0.958/0.116 and 0.875/0.855, respectively. The sensitivities for the relevant records align with those of human evaluators: 0.867-1.000 for the first study and 0.776-0.979 for the second study. Both screenings by GPT-3.5 and GPT-4 judged all 9 records used for the meta-analysis as included. 
After accounting for justifiably excluded records by GPT-4, the sensitivities/specificities of the GPT-4 screening were 0.962/0.996 in the first study and 0.943/0.855 in the second study. Further investigation indicated that the cases incorrectly excluded by GPT-3.5 were due to a lack of domain knowledge, while the cases incorrectly excluded by GPT-4 were due to misinterpretations of the inclusion criteria. Conclusions: Our 3-layer screening method with GPT-4 demonstrated acceptable level of sensitivity and specificity that supports its practical application in systematic review screenings. Future research should aim to generalize this approach and explore its effectiveness in diverse settings, both medical and nonmedical, to fully establish its use and operational feasibility. %M 39151163 %R 10.2196/52758 %U https://www.jmir.org/2024/1/e52758 %U https://doi.org/10.2196/52758 %U http://www.ncbi.nlm.nih.gov/pubmed/39151163 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59213 %T A Language Model–Powered Simulated Patient With Automated Feedback for History Taking: Prospective Study %A Holderried,Friederike %A Stegemann-Philipps,Christian %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Holderried,Martin %A Eickhoff,Carsten %A Mahling,Moritz %+ Tübingen Institute for Medical Education (TIME), Medical Faculty, University of Tübingen, Elfriede-Aulhorn-Strasse 10, Tübingen, 72076, Germany, 49 707129 ext 73688, friederike.holderried@med.uni-tuebingen.de %K virtual patients communication %K communication skills %K technology enhanced education %K TEL %K medical education %K ChatGPT %K GPT: LLM %K LLMs %K NLP %K natural language processing %K machine learning %K artificial intelligence %K language model %K language models %K communication %K relationship %K relationships %K chatbot %K chatbots %K conversational agent %K conversational agents %K history %K histories %K simulated %K student %K students %K interaction %K interactions %D 2024 %7 16.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Although history taking is fundamental for diagnosing medical conditions, teaching and providing feedback on the skill can be challenging due to resource constraints. Virtual simulated patients and web-based chatbots have thus emerged as educational tools, with recent advancements in artificial intelligence (AI) such as large language models (LLMs) enhancing their realism and potential to provide feedback. Objective: In our study, we aimed to evaluate the effectiveness of a Generative Pretrained Transformer (GPT) 4 model to provide structured feedback on medical students’ performance in history taking with a simulated patient. Methods: We conducted a prospective study involving medical students performing history taking with a GPT-powered chatbot. To that end, we designed a chatbot to simulate patients’ responses and provide immediate feedback on the comprehensiveness of the students’ history taking. Students’ interactions with the chatbot were analyzed, and feedback from the chatbot was compared with feedback from a human rater. We measured interrater reliability and performed a descriptive analysis to assess the quality of feedback. Results: Most of the study’s participants were in their third year of medical school. A total of 1894 question-answer pairs from 106 conversations were included in our analysis. GPT-4’s role-play and responses were medically plausible in more than 99% of cases. 
Interrater reliability between GPT-4 and the human rater showed “almost perfect” agreement (Cohen κ=0.832). Lower agreement (κ<0.6), detected for 8 of 45 feedback categories, highlighted topics about which the model’s assessments were overly specific or diverged from human judgment. Conclusions: The GPT model was effective in providing structured feedback on history-taking dialogs provided by medical students. Although we identified some limitations regarding the specificity of feedback for certain feedback categories, the overall high agreement with human raters suggests that LLMs can be a valuable tool for medical education. Our findings, thus, advocate the careful integration of AI-driven feedback mechanisms in medical training and highlight important aspects when LLMs are used in that context. %M 39150749 %R 10.2196/59213 %U https://mededu.jmir.org/2024/1/e59213 %U https://doi.org/10.2196/59213 %U http://www.ncbi.nlm.nih.gov/pubmed/39150749 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52401 %T ChatGPT and Google Assistant as a Source of Patient Education for Patients With Amblyopia: Content Analysis %A Wu,Gloria %A Lee,David A %A Zhao,Weichen %A Wong,Adrial %A Jhangiani,Rohan %A Kurniawan,Sri %+ University of California, San Francisco School of Medicine, 533 Parnassus Ave, San Francisco, CA, 94143, United States, 1 408 621 9074, gwu2550@gmail.com %K ChatGPT %K Google Assistant %K amblyopia %K health literacy %K American Association for Pediatric Ophthalmology and Strabismus %K pediatric %K ophthalmology %K patient education %K education %K ophthalmologist %K Google %K monitoring %D 2024 %7 15.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: We queried ChatGPT (OpenAI) and Google Assistant about amblyopia and compared their answers with the keywords found on the American Association for Pediatric Ophthalmology and Strabismus (AAPOS) website, specifically the section on amblyopia. Out of the 26 keywords chosen from the website, ChatGPT included 11 (42%) in its responses, while Google included 8 (31%). Objective: Our study investigated the adherence of ChatGPT-3.5 and Google Assistant to the guidelines of the AAPOS for patient education on amblyopia. Methods: ChatGPT-3.5 was used. The four questions taken from the AAPOS website, specifically its glossary section for amblyopia, are as follows: (1) What is amblyopia? (2) What causes amblyopia? (3) How is amblyopia treated? (4) What happens if amblyopia is untreated? Approved and selected by ophthalmologists (GW and DL), the keywords from AAPOS were words or phrases that were deemed significant for the education of patients with amblyopia. The “Flesch-Kincaid Grade Level” formula, approved by the US Department of Education, was used to evaluate the reading comprehension level for the responses from ChatGPT, Google Assistant, and AAPOS. Results: In their responses, ChatGPT did not mention the term “ophthalmologist,” whereas Google Assistant and AAPOS both mentioned the term once and twice, respectively. ChatGPT did, however, use the term “eye doctors” once. According to the Flesch-Kincaid test, the average reading level of AAPOS was 11.4 (SD 2.1; the lowest level) while that of Google was 13.1 (SD 4.8; the highest required reading level), also showing the greatest variation in grade level in its responses. ChatGPT’s answers, on average, scored a 12.4 (SD 1.1) grade level. All three sources were similar in reading difficulty. 
For the keywords, out of the 4 responses, ChatGPT used 42% (11/26) of the keywords, whereas Google Assistant used 31% (8/26). Conclusions: ChatGPT trains on texts and phrases and generates new sentences, while Google Assistant automatically copies website links. As ophthalmologists, we should consider including “see an ophthalmologist” on our websites and journals. While ChatGPT is here to stay, we, as physicians, need to monitor its answers. %M 39146013 %R 10.2196/52401 %U https://www.jmir.org/2024/1/e52401 %U https://doi.org/10.2196/52401 %U http://www.ncbi.nlm.nih.gov/pubmed/39146013 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55939 %T Evaluating the Efficacy of ChatGPT as a Patient Education Tool in Prostate Cancer: Multimetric Assessment %A Gibson,Damien %A Jackson,Stuart %A Shanmugasundaram,Ramesh %A Seth,Ishith %A Siu,Adrian %A Ahmadi,Nariman %A Kam,Jonathan %A Mehan,Nicholas %A Thanigasalam,Ruban %A Jeffery,Nicola %A Patel,Manish I %A Leslie,Scott %+ Department of Urology, Saint George Hospital, Gray St, Kogarah, 2217, Australia, 61 (02) 9113 1111, Damien.p.gibson@gmail.com %K prostate cancer %K patient education %K large language model %K ChatGPT %K AI language model %K multimetric assessment %K artificial intelligence %K AI %K AI chatbots %K health care professional %K health care professionals %K men %K man %K prostate %K cancer %K decision-making %K prostate specific %K antigen screening %K medical information %K natural language processing %K NLP %D 2024 %7 14.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence (AI) chatbots, such as ChatGPT, have made significant progress. These chatbots, particularly popular among health care professionals and patients, are transforming patient education and disease experience with personalized information. Accurate, timely patient education is crucial for informed decision-making, especially regarding prostate-specific antigen screening and treatment options. However, the accuracy and reliability of AI chatbots’ medical information must be rigorously evaluated. Studies testing ChatGPT’s knowledge of prostate cancer are emerging, but there is a need for ongoing evaluation to ensure the quality and safety of information provided to patients. Objective: This study aims to evaluate the quality, accuracy, and readability of ChatGPT-4’s responses to common prostate cancer questions posed by patients. Methods: Overall, 8 questions were formulated with an inductive approach based on information topics in peer-reviewed literature and Google Trends data. Adapted versions of the Patient Education Materials Assessment Tool for AI (PEMAT-AI), Global Quality Score, and DISCERN-AI tools were used by 4 independent reviewers to assess the quality of the AI responses. The 8 AI outputs were judged by 7 expert urologists, using an assessment framework developed to assess accuracy, safety, appropriateness, actionability, and effectiveness. The AI responses’ readability was assessed using established algorithms (Flesch Reading Ease score, Gunning Fog Index, Flesch-Kincaid Grade Level, The Coleman-Liau Index, and Simple Measure of Gobbledygook [SMOG] Index). A brief tool (Reference Assessment AI [REF-AI]) was developed to analyze the references provided by AI outputs, assessing for reference hallucination, relevance, and quality of references. 
Results: The PEMAT-AI understandability score was very good (mean 79.44%, SD 10.44%), the DISCERN-AI rating was scored as “good” quality (mean 13.88, SD 0.93), and the Global Quality Score was high (mean 4.46/5, SD 0.50). Natural Language Assessment Tool for AI had pooled mean accuracy of 3.96 (SD 0.91), safety of 4.32 (SD 0.86), appropriateness of 4.45 (SD 0.81), actionability of 4.05 (SD 1.15), and effectiveness of 4.09 (SD 0.98). The readability algorithm consensus was “difficult to read” (Flesch Reading Ease score mean 45.97, SD 8.69; Gunning Fog Index mean 14.55, SD 4.79), averaging an 11th-grade reading level, equivalent to 15- to 17-year-olds (Flesch-Kincaid Grade Level mean 12.12, SD 4.34; The Coleman-Liau Index mean 12.75, SD 1.98; SMOG Index mean 11.06, SD 3.20). REF-AI identified 2 reference hallucinations, while the majority (28/30, 93%) of references appropriately supplemented the text. Most references (26/30, 86%) were from reputable government organizations, while a handful were direct citations from scientific literature. Conclusions: Our analysis found that ChatGPT-4 provides generally good responses to common prostate cancer queries, making it a potentially valuable tool for patient education in prostate cancer care. Objective quality assessment tools indicated that the natural language processing outputs were generally reliable and appropriate, but there is room for improvement. %M 39141904 %R 10.2196/55939 %U https://www.jmir.org/2024/1/e55939 %U https://doi.org/10.2196/55939 %U http://www.ncbi.nlm.nih.gov/pubmed/39141904 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e55204 %T Readability of Information Generated by ChatGPT for Hidradenitis Suppurativa %A Gawey,Lauren %A Dagenet,Caitlyn B %A Tran,Khiem A %A Park,Sarah %A Hsiao,Jennifer L %A Shi,Vivian %+ Department of Dermatology, University of Arkansas for Medical Sciences, 4301 W Markham St, #576, Little Rock, AR, 72205, United States, 1 8148022747, vivian.shi.publications@gmail.com %K hidradenitis suppurativa %K ChatGPT %K Chat-GPT %K chatbot %K chatbots %K chat-bot %K chat-bots %K machine learning %K ML %K artificial intelligence %K AI %K algorithm %K algorithms %K predictive model %K predictive models %K predictive analytics %K predictive system %K practical model %K practical models %K deep learning %K patient resources %K readability %D 2024 %7 14.8.2024 %9 Research Letter %J JMIR Dermatol %G English %X %M 39141908 %R 10.2196/55204 %U https://derma.jmir.org/2024/1/e55204 %U https://doi.org/10.2196/55204 %U http://www.ncbi.nlm.nih.gov/pubmed/39141908 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55138 %T Characterizing the Adoption and Experiences of Users of Artificial Intelligence–Generated Health Information in the United States: Cross-Sectional Questionnaire Study %A Ayo-Ajibola,Oluwatobiloba %A Davis,Ryan J %A Lin,Matthew E %A Riddell,Jeffrey %A Kravitz,Richard L %+ Division of General Medicine, University of California Davis, 4150 V Street, PSSB Suite 2400, Sacramento, CA, 95817, United States, 1 916 734 7005, rlkravitz@ucdavis.edu %K artificial intelligence %K ChatGPT %K health information %K patient information-seeking %K online health information %K health literacy %K ResearchMatch %K users %K diagnosis %K decision-making %K cross-sectional %K survey %K surveys %K adoption %K utilization %K AI %K less-educated %K poor health %K worse health %K experience %K experiences %K user %K users %K non user %K non users %K AI-generated %K health information %K implication %K 
implications %K medical practice %K medical practices %K public health %K descriptive statistics %K t test %K t tests %K chi-square test %K chi-square tests %K health-seeking behavior %K health-seeking behaviors %K patient-provider %K interaction %K interactions %K patient %K patients %D 2024 %7 14.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: OpenAI’s ChatGPT is a source of advanced online health information (OHI) that may be integrated into individuals’ health information-seeking routines. However, concerns have been raised about its factual accuracy and impact on health outcomes. To forecast implications for medical practice and public health, more information is needed on who uses the tool, how often, and for what. Objective: This study aims to characterize the reasons for and types of ChatGPT OHI use and describe the users most likely to engage with the platform. Methods: In this cross-sectional survey, patients received invitations to participate via the ResearchMatch platform, a nonprofit affiliate of the National Institutes of Health. A web-based survey measured demographic characteristics, use of ChatGPT and other sources of OHI, experience characterization, and resultant health behaviors. Descriptive statistics were used to summarize the data. Both 2-tailed t tests and Pearson chi-square tests were used to compare users of ChatGPT OHI to nonusers. Results: Of 2406 respondents, 21.5% (n=517) reported using ChatGPT for OHI. ChatGPT users were younger than nonusers (32.8 vs 39.1 years, P<.001) with lower advanced degree attainment (BA or higher; 49.9% vs 67%, P<.001) and greater use of transient health care (ED and urgent care; P<.001). ChatGPT users were more avid consumers of general non-ChatGPT OHI (percentage of weekly or greater OHI seeking frequency in past 6 months, 28.2% vs 22.8%, P<.001). Around 39.3% (n=206) of respondents endorsed using the platform for OHI 2-3 times weekly or more, and most sought the tool to determine if a consultation was required (47.4%, n=245) or to explore alternative treatment (46.2%, n=239). Use characterization was favorable as many believed ChatGPT to be just as or more useful than other OHIs (87.7%, n=429) and their doctor (81%, n=407). About one-third of respondents requested a referral (35.6%, n=184) or changed medications (31%, n=160) based on the information received from ChatGPT. As many users reported skepticism regarding the ChatGPT output (67.9%, n=336), most turned to their physicians (67.5%, n=349). Conclusions: This study underscores the significant role of AI-generated OHI in shaping health-seeking behaviors and the potential evolution of patient-provider interactions. Given the proclivity of these users to enact health behavior changes based on AI-generated content, there is an opportunity for physicians to guide ChatGPT OHI users on an informed and examined use of the technology. 
%M 39141910 %R 10.2196/55138 %U https://www.jmir.org/2024/1/e55138 %U https://doi.org/10.2196/55138 %U http://www.ncbi.nlm.nih.gov/pubmed/39141910 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e58653 %T A Chatbot (Juno) Prototype to Deploy a Behavioral Activation Intervention to Pregnant Women: Qualitative Evaluation Using a Multiple Case Study %A Mancinelli,Elisa %A Magnolini,Simone %A Gabrielli,Silvia %A Salcuni,Silvia %+ Department of Developmental and Socialization Psychology, University of Padova, Via Venezia 8, Padova, 35131, Italy, 39 3342799698, elisa.mancinelli@phd.unipd.it %K chatbot prototype %K co-design %K pregnancy %K prevention %K behavioral activation %K multiple case study %D 2024 %7 14.8.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Despite the increasing focus on perinatal care, preventive digital interventions are still scarce. Furthermore, the literature suggests that the design and development of these interventions are mainly conducted through a top-down approach that limitedly accounts for direct end user perspectives. Objective: Building from a previous co-design study, this study aimed to qualitatively evaluate pregnant women’s experiences with a chatbot (Juno) prototype designed to deploy a preventive behavioral activation intervention. Methods: Using a multiple-case study design, the research aimed to uncover similarities and differences in participants’ perceptions of the chatbot while also exploring women’s desires for improvement and technological advancements in chatbot-based interventions in perinatal mental health. Five pregnant women interacted weekly with the chatbot, operationalized in Telegram, following a 6-week intervention. Self-report questionnaires were administered at baseline and postintervention time points. About 10-14 days after concluding interactions with Juno, women participated in a semistructured interview focused on (1) their personal experience with Juno, (2) user experience and user engagement, and (3) their opinions on future technological advancements. Interview transcripts, comprising 15 questions, were qualitatively evaluated and compared. Finally, a text-mining analysis of transcripts was performed. Results: Similarities and differences emerged regarding women’s experiences with Juno: they appreciated its esthetic but highlighted technical issues and desired clearer guidance. They found the content useful and pertinent to pregnancy but differed on when they deemed it most helpful. Women expressed interest in receiving increasingly personalized responses and in future integration with existing health care systems for better support. Accordingly, they generally viewed Juno as an effective momentary support but emphasized the need for human interaction in mental health care, particularly if increasingly personalized. Further concerns included overreliance on chatbots when seeking psychological support and the importance of clearly educating users on the chatbot’s limitations. Conclusions: Overall, the results highlighted both the positive aspects and the shortcomings of the chatbot-based intervention, providing insight into its refinement and future developments. However, women stressed the need to balance technological support with human interactions, particularly when the intervention extends beyond a preventive mental health context, to enable greater and more reliable monitoring. 
%M 39140593 %R 10.2196/58653 %U https://formative.jmir.org/2024/1/e58653 %U https://doi.org/10.2196/58653 %U http://www.ncbi.nlm.nih.gov/pubmed/39140593 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52784 %T Influence of Model Evolution and System Roles on ChatGPT’s Performance in Chinese Medical Licensing Exams: Comparative Study %A Ming,Shuai %A Guo,Qingge %A Cheng,Wenjun %A Lei,Bo %K ChatGPT %K Chinese National Medical Licensing Examination %K large language models %K medical education %K system role %K LLM %K LLMs %K language model %K language models %K artificial intelligence %K chatbot %K chatbots %K conversational agent %K conversational agents %K exam %K exams %K examination %K examinations %K OpenAI %K answer %K answers %K response %K responses %K accuracy %K performance %K China %K Chinese %D 2024 %7 13.8.2024 %9 %J JMIR Med Educ %G English %X Background: With the increasing application of large language models like ChatGPT in various industries, their potential in the medical domain, especially in standardized examinations, has become a focal point of research. Objective: The aim of this study is to assess the clinical performance of ChatGPT, focusing on its accuracy and reliability in the Chinese National Medical Licensing Examination (CNMLE). Methods: The CNMLE 2022 question set, consisting of 500 single-answer multiple-choice questions, was reclassified into 15 medical subspecialties. Each question was tested 8 to 12 times in Chinese on the OpenAI platform from April 24 to May 15, 2023. Three key factors were considered: the version of GPT-3.5 and 4.0, the prompt’s designation of system roles tailored to medical subspecialties, and repetition for coherence. A passing accuracy threshold was established as 60%. The χ2 tests and κ values were employed to evaluate the model’s accuracy and consistency. Results: GPT-4.0 achieved a passing accuracy of 72.7%, which was significantly higher than that of GPT-3.5 (54%; P<.001). The variability rate of repeated responses from GPT-4.0 was lower than that of GPT-3.5 (9% vs 19.5%; P<.001). However, both models showed relatively good response coherence, with κ values of 0.778 and 0.610, respectively. System roles numerically increased accuracy for both GPT-4.0 (0.3%‐3.7%) and GPT-3.5 (1.3%‐4.5%), and reduced variability by 1.7% and 1.8%, respectively (P>.05). In subgroup analysis, ChatGPT achieved comparable accuracy among different question types (P>.05). GPT-4.0 surpassed the accuracy threshold in 14 of 15 subspecialties, while GPT-3.5 did so in 7 of 15 on the first response. Conclusions: GPT-4.0 passed the CNMLE and outperformed GPT-3.5 in key areas such as accuracy, consistency, and medical subspecialty expertise. Adding a system role enhanced the model’s reliability and answer coherence, although not to a statistically significant degree. GPT-4.0 showed promising potential in medical education and clinical practice, meriting further study. 
%R 10.2196/52784 %U https://mededu.jmir.org/2024/1/e52784 %U https://doi.org/10.2196/52784 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e54371 %T Evaluation of Generative Language Models in Personalizing Medical Information: Instrument Validation Study %A Spina,Aidin %A Andalib,Saman %A Flores,Daniel %A Vermani,Rishi %A Halaseh,Faris F %A Nelson,Ariana M %+ School of Medicine, University of California, Irvine, 1001 Health Sciences Road, Irvine, CA, 92617, United States, 1 949 290 8347, acspina@hs.uci.edu %K generative language model %K GLM %K artificial intelligence %K AI %K low health literacy %K LHL %K readability %K GLMs %K language model %K language models %K health literacy %K understandable %K understandability %K knowledge translation %K comprehension %K generative %K NLP %K natural language processing %K reading level %K reading levels %K education %K medical text %K medical texts %K medical information %K health information %D 2024 %7 13.8.2024 %9 Original Paper %J JMIR AI %G English %X Background: Although uncertainties exist regarding implementation, artificial intelligence–driven generative language models (GLMs) have enormous potential in medicine. Deployment of GLMs could improve patient comprehension of clinical texts and improve low health literacy. Objective: The goal of this study is to evaluate the potential of ChatGPT-3.5 and GPT-4 to tailor the complexity of medical information to patient-specific input education level, which is crucial if it is to serve as a tool in addressing low health literacy. Methods: Input templates related to 2 prevalent chronic diseases—type II diabetes and hypertension—were designed. Each clinical vignette was adjusted for hypothetical patient education levels to evaluate output personalization. To assess the success of a GLM (GPT-3.5 and GPT-4) in tailoring output writing, the readability of pre- and posttransformation outputs were quantified using the Flesch reading ease score (FKRE) and the Flesch-Kincaid grade level (FKGL). Results: Responses (n=80) were generated using GPT-3.5 and GPT-4 across 2 clinical vignettes. For GPT-3.5, FKRE means were 57.75 (SD 4.75), 51.28 (SD 5.14), 32.28 (SD 4.52), and 28.31 (SD 5.22) for 6th grade, 8th grade, high school, and bachelor’s, respectively; FKGL mean scores were 9.08 (SD 0.90), 10.27 (SD 1.06), 13.4 (SD 0.80), and 13.74 (SD 1.18). GPT-3.5 only aligned with the prespecified education levels at the bachelor’s degree. Conversely, GPT-4’s FKRE mean scores were 74.54 (SD 2.6), 71.25 (SD 4.96), 47.61 (SD 6.13), and 13.71 (SD 5.77), with FKGL mean scores of 6.3 (SD 0.73), 6.7 (SD 1.11), 11.09 (SD 1.26), and 17.03 (SD 1.11) for the same respective education levels. GPT-4 met the target readability for all groups except the 6th-grade FKRE average. Both GLMs produced outputs with statistically significant differences (P<.001; 8th grade P<.001; high school P<.001; bachelors P=.003; FKGL: 6th grade P=.001; 8th grade P<.001; high school P<.001; bachelors P<.001) between mean FKRE and FKGL across input education levels. Conclusions: GLMs can change the structure and readability of medical text outputs according to input-specified education. However, GLMs categorize input education designation into 3 broad tiers of output readability: easy (6th and 8th grade), medium (high school), and difficult (bachelor’s degree). This is the first result to suggest that there are broader boundaries in the success of GLMs in output text simplification. 
Future research must establish how GLMs can reliably personalize medical texts to prespecified education levels to enable a broader impact on health care literacy. %M 39137416 %R 10.2196/54371 %U https://ai.jmir.org/2024/1/e54371 %U https://doi.org/10.2196/54371 %U http://www.ncbi.nlm.nih.gov/pubmed/39137416 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51757 %T Understanding Health Care Students’ Perceptions, Beliefs, and Attitudes Toward AI-Powered Language Models: Cross-Sectional Study %A Cherrez-Ojeda,Ivan %A Gallardo-Bastidas,Juan C %A Robles-Velasco,Karla %A Osorio,María F %A Velez Leon,Eleonor Maria %A Leon Velastegui,Manuel %A Pauletto,Patrícia %A Aguilar-Díaz,F C %A Squassi,Aldo %A González Eras,Susana Patricia %A Cordero Carrasco,Erita %A Chavez Gonzalez,Karol Leonor %A Calderon,Juan C %A Bousquet,Jean %A Bedbrook,Anna %A Faytong-Haro,Marco %+ Universidad Espiritu Santo, Km. 2.5 via Samborondon, Samborondon, 0901952, Ecuador, 593 999981769, ivancherrez@gmail.com %K artificial intelligence %K ChatGPT %K education %K health care %K students %D 2024 %7 13.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT was not intended for use in health care, but it has potential benefits that depend on end-user understanding and acceptability, which is where health care students become crucial. There is still a limited amount of research in this area. Objective: The primary aim of our study was to assess the frequency of ChatGPT use, the perceived level of knowledge, the perceived risks associated with its use, and the ethical issues, as well as attitudes toward the use of ChatGPT in the context of education in the field of health. In addition, we aimed to examine whether there were differences across groups based on demographic variables. The second part of the study aimed to assess the association between the frequency of use, the level of perceived knowledge, the level of risk perception, and the level of perception of ethics as predictive factors for participants’ attitudes toward the use of ChatGPT. Methods: A cross-sectional survey was conducted from May to June 2023 encompassing students of medicine, nursing, dentistry, nutrition, and laboratory science across the Americas. The study used descriptive analysis, chi-square tests, and ANOVA to assess statistical significance across different categories. The study used several ordinal logistic regression models to analyze the impact of predictive factors (frequency of use, perception of knowledge, perception of risk, and ethics perception scores) on attitude as the dependent variable. The models were adjusted for gender, institution type, major, and country. Stata was used to conduct all the analyses. Results: Of 2661 health care students, 42.99% (n=1144) were unaware of ChatGPT. The median score of knowledge was “minimal” (median 2.00, IQR 1.00-3.00). Most respondents (median 2.61, IQR 2.11-3.11) regarded ChatGPT as neither ethical nor unethical. Most participants (median 3.89, IQR 3.44-4.34) “somewhat agreed” that ChatGPT (1) benefits health care settings, (2) provides trustworthy data, (3) is a helpful tool for clinical and educational medical information access, and (4) makes the work easier. In total, 70% (7/10) of people used it for homework. As the perceived knowledge of ChatGPT increased, there was a stronger tendency with regard to having a favorable attitude toward ChatGPT. 
Higher ethical consideration perception ratings increased the likelihood of considering ChatGPT as a source of trustworthy health care information (odds ratio [OR] 1.620, 95% CI 1.498-1.752), beneficial in medical issues (OR 1.495, 95% CI 1.452-1.539), and useful for medical literature (OR 1.494, 95% CI 1.426-1.564; P<.001 for all results). Conclusions: Over 40% of American health care students (1144/2661, 42.99%) were unaware of ChatGPT despite its extensive use in the health field. Our data revealed the positive attitudes toward ChatGPT and the desire to learn more about it. Medical educators must explore how chatbots may be included in undergraduate health care education programs. %M 39137029 %R 10.2196/51757 %U https://mededu.jmir.org/2024/1/e51757 %U https://doi.org/10.2196/51757 %U http://www.ncbi.nlm.nih.gov/pubmed/39137029 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e59133 %T Educational Utility of Clinical Vignettes Generated in Japanese by ChatGPT-4: Mixed Methods Study %A Takahashi,Hiromizu %A Shikino,Kiyoshi %A Kondo,Takeshi %A Komori,Akira %A Yamada,Yuji %A Saita,Mizue %A Naito,Toshio %+ Department of General Medicine, Juntendo University Faculty of Medicine, Bunkyo, 3-1-3 Hongo, Tokyo, 113-0033, Japan, 81 3 3813 3111, hrtakaha@juntendo.ac.jp %K generative AI %K ChatGPT-4 %K medical case generation %K medical education %K clinical vignettes %K AI %K artificial intelligence %K Japanese %K Japan %D 2024 %7 13.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Evaluating the accuracy and educational utility of artificial intelligence–generated medical cases, especially those produced by large language models such as ChatGPT-4 (developed by OpenAI), is crucial yet underexplored. Objective: This study aimed to assess the educational utility of ChatGPT-4–generated clinical vignettes and their applicability in educational settings. Methods: Using a convergent mixed methods design, a web-based survey was conducted from January 8 to 28, 2024, to evaluate 18 medical cases generated by ChatGPT-4 in Japanese. In the survey, 6 main question items were used to evaluate the quality of the generated clinical vignettes and their educational utility, which are information quality, information accuracy, educational usefulness, clinical match, terminology accuracy (TA), and diagnosis difficulty. Feedback was solicited from physicians specializing in general internal medicine or general medicine and experienced in medical education. Chi-square and Mann-Whitney U tests were performed to identify differences among cases, and linear regression was used to examine trends associated with physicians’ experience. Thematic analysis of qualitative feedback was performed to identify areas for improvement and confirm the educational utility of the cases. Results: Of the 73 invited participants, 71 (97%) responded. The respondents, primarily male (64/71, 90%), spanned a broad range of practice years (from 1976 to 2017) and represented diverse hospital sizes throughout Japan. The majority deemed the information quality (mean 0.77, 95% CI 0.75-0.79) and information accuracy (mean 0.68, 95% CI 0.65-0.71) to be satisfactory, with these responses being based on binary data. The average scores assigned were 3.55 (95% CI 3.49-3.60) for educational usefulness, 3.70 (95% CI 3.65-3.75) for clinical match, 3.49 (95% CI 3.44-3.55) for TA, and 2.34 (95% CI 2.28-2.40) for diagnosis difficulty, based on a 5-point Likert scale. 
Statistical analysis showed significant variability in content quality and relevance across the cases (P<.001 after Bonferroni correction). Participants suggested improvements in generating physical findings, using natural language, and enhancing medical TA. The thematic analysis highlighted the need for clearer documentation, clinical information consistency, content relevance, and patient-centered case presentations. Conclusions: ChatGPT-4–generated medical cases written in Japanese possess considerable potential as resources in medical education, with recognized adequacy in quality and accuracy. Nevertheless, there is a notable need for enhancements in the precision and realism of case details. This study emphasizes ChatGPT-4’s value as an adjunctive educational tool in the medical field, requiring expert oversight for optimal application. %M 39137031 %R 10.2196/59133 %U https://mededu.jmir.org/2024/1/e59133 %U https://doi.org/10.2196/59133 %U http://www.ncbi.nlm.nih.gov/pubmed/39137031 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56413 %T Performance of Large Language Models in Patient Complaint Resolution: Web-Based Cross-Sectional Survey %A Yong,Lorraine Pei Xian %A Tung,Joshua Yi Min %A Lee,Zi Yao %A Kuan,Win Sen %A Chua,Mui Teng %+ Emergency Medicine Department, National University Hospital, National University Health System, 5 Lower Kent Ridge Road, Singapore, 119074, Singapore, 65 67725000, lorraineyong@nus.edu.sg %K ChatGPT %K large language models %K artificial intelligence %K patient complaint %K health care complaint %K empathy %K efficiency %K patient satisfaction %K resource allocation %D 2024 %7 9.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Patient complaints are a perennial challenge faced by health care institutions globally, requiring extensive time and effort from health care workers. Despite these efforts, patient dissatisfaction remains high. Recent studies on the use of large language models (LLMs) such as the GPT models developed by OpenAI in the health care sector have shown great promise, with the ability to provide more detailed and empathetic responses as compared to physicians. LLMs could potentially be used in responding to patient complaints to improve patient satisfaction and complaint response time. Objective: This study aims to evaluate the performance of LLMs in addressing patient complaints received by a tertiary health care institution, with the goal of enhancing patient satisfaction. Methods: Anonymized patient complaint emails and associated responses from the patient relations department were obtained. ChatGPT-4.0 (OpenAI, Inc) was provided with the same complaint email and tasked to generate a response. The complaints and the respective responses were uploaded onto a web-based questionnaire. Respondents were asked to rate both responses on a 10-point Likert scale for 4 items: appropriateness, completeness, empathy, and satisfaction. Participants were also asked to choose a preferred response at the end of each scenario. Results: There was a total of 188 respondents, of which 115 (61.2%) were health care workers. A majority of the respondents, including both health care and non–health care workers, preferred replies from ChatGPT (n=164, 87.2% to n=183, 97.3%). 
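A minimal sketch of comparing the paired 10-point ratings that respondents gave the ChatGPT and human responses to the same complaint scenario, as described above. The paper does not name the test behind its rating comparison; a Wilcoxon signed-rank test on paired ratings is shown here as one reasonable choice, and the score vectors are illustrative.

```python
# Sketch: paired comparison of 10-point ratings for ChatGPT vs human replies.
# The choice of test and the score vectors are assumptions for illustration.
import numpy as np
from scipy.stats import wilcoxon

gpt_scores = np.array([8, 9, 7, 8, 10, 8, 9, 7, 8, 9])    # illustrative ratings
human_scores = np.array([5, 4, 6, 5, 3, 6, 5, 4, 5, 6])   # illustrative ratings

stat, p = wilcoxon(gpt_scores, human_scores)
print(f"median GPT={np.median(gpt_scores)}, "
      f"median human={np.median(human_scores)}, p={p:.4f}")
```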
GPT-4.0 responses were rated higher in all 4 assessed items with all median scores of 8 (IQR 7-9) compared to human responses (appropriateness 5, IQR 3-7; empathy 4, IQR 3-6; quality 5, IQR 3-6; satisfaction 5, IQR 3-6; P<.001) and had higher average word counts as compared to human responses (238 vs 76 words). Regression analyses showed that a higher word count was a statistically significant predictor of higher score in all 4 items, with every 1-word increment resulting in an increase in scores of between 0.015 and 0.019 (all P<.001). However, on subgroup analysis by authorship, this only held true for responses written by patient relations department staff and not those generated by ChatGPT which received consistently high scores irrespective of response length. Conclusions: This study provides significant evidence supporting the effectiveness of LLMs in resolution of patient complaints. ChatGPT demonstrated superiority in terms of response appropriateness, empathy, quality, and overall satisfaction when compared against actual human responses to patient complaints. Future research can be done to measure the degree of improvement that artificial intelligence generated responses can bring in terms of time savings, cost-effectiveness, patient satisfaction, and stress reduction for the health care system. %M 39121468 %R 10.2196/56413 %U https://www.jmir.org/2024/1/e56413 %U https://doi.org/10.2196/56413 %U http://www.ncbi.nlm.nih.gov/pubmed/39121468 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e46800 %T Assessing ChatGPT’s Capability for Multiple Choice Questions Using RaschOnline: Observational Study %A Chow,Julie Chi %A Cheng,Teng Yun %A Chien,Tsair-Wei %A Chou,Willy %+ Department of Physical Medicine and Rehabilitation, Chi Mei Medical Center, No. 901, Chung Hwa Road, Yung Kung District, Tainan, 710, Taiwan, 886 937399106, smilewilly@mail.chimei.org.tw %K RaschOnline %K ChatGPT %K multiple choice questions %K differential item functioning %K Wright map %K KIDMAP %K website tool %K evaluation tool %K tool %K application %K artificial intelligence %K scoring %K testing %K college %K students %D 2024 %7 8.8.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: ChatGPT (OpenAI), a state-of-the-art large language model, has exhibited remarkable performance in various specialized applications. Despite the growing popularity and efficacy of artificial intelligence, there is a scarcity of studies that assess ChatGPT’s competence in addressing multiple-choice questions (MCQs) using KIDMAP of Rasch analysis—a website tool used to evaluate ChatGPT’s performance in MCQ answering. Objective: This study aims to (1) showcase the utility of the website (Rasch analysis, specifically RaschOnline), and (2) determine the grade achieved by ChatGPT when compared to a normal sample. Methods: The capability of ChatGPT was evaluated using 10 items from the English tests conducted for Taiwan college entrance examinations in 2023. Under a Rasch model, 300 simulated students with normal distributions were simulated to compete with ChatGPT’s responses. RaschOnline was used to generate 5 visual presentations, including item difficulties, differential item functioning, item characteristic curve, Wright map, and KIDMAP, to address the research objectives. 
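A minimal sketch of the simulation step described above: 300 examinees with normally distributed ability answering 10 dichotomous items under a Rasch model, against which ChatGPT's responses can then be compared. The item difficulties and random seed are illustrative placeholders, not the study's estimates.

```python
# Sketch: simulating 300 normally distributed examinees on a 10-item test
# under a Rasch model. Difficulties below are placeholders, not study values.
import numpy as np

rng = np.random.default_rng(2023)
abilities = rng.normal(0.0, 1.0, size=300)   # theta ~ N(0, 1)
difficulties = np.linspace(-2.5, 2.5, 10)    # b_1 ... b_10 (illustrative)

# Rasch model: P(correct) = 1 / (1 + exp(-(theta - b)))
p_correct = 1.0 / (1.0 + np.exp(-(abilities[:, None] - difficulties[None, :])))
responses = rng.random((300, 10)) < p_correct  # simulated 0/1 response matrix

print("mean raw score of simulated examinees:", responses.sum(axis=1).mean())
```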
Results: The findings revealed the following: (1) the difficulty of the 10 items increased in a monotonous pattern from easier to harder, represented by logits (–2.43, –1.78, –1.48, –0.64, –0.1, 0.33, 0.59, 1.34, 1.7, and 2.47); (2) evidence of differential item functioning was observed between gender groups for item 5 (P=.04); (3) item 5 displayed a good fit to the Rasch model (P=.61); (4) all items demonstrated a satisfactory fit to the Rasch model, indicated by Infit mean square errors below the threshold of 1.5; (5) no significant difference was found in the measures obtained between gender groups (P=.83); (6) a significant difference was observed among ability grades (P<.001); and (7) ChatGPT’s capability was graded as A, surpassing grades B to E. Conclusions: By using RaschOnline, this study provides evidence that ChatGPT possesses the ability to achieve a grade A when compared to a normal sample. It exhibits excellent proficiency in answering MCQs from the English tests conducted in 2023 for the Taiwan college entrance examinations. %M 39115919 %R 10.2196/46800 %U https://formative.jmir.org/2024/1/e46800 %U https://doi.org/10.2196/46800 %U http://www.ncbi.nlm.nih.gov/pubmed/39115919 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51157 %T Assessing ChatGPT’s Competency in Addressing Interdisciplinary Inquiries on Chatbot Uses in Sports Rehabilitation: Simulation Study %A McBee,Joseph C %A Han,Daniel Y %A Liu,Li %A Ma,Leah %A Adjeroh,Donald A %A Xu,Dong %A Hu,Gangqing %+ Department of Microbiology, Immunology, & Cell Biology, West Virginia University, 64 Medical Center Drive, Morgantown, WV, 26506-9177, United States, 1 304 581 1692, gh00001@mix.wvu.edu %K ChatGPT %K chatbots %K multirole-playing %K interdisciplinary inquiry %K medical education %K sports medicine %D 2024 %7 7.8.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT showcases exceptional conversational capabilities and extensive cross-disciplinary knowledge. In addition, it can perform multiple roles in a single chat session. This unique multirole-playing feature positions ChatGPT as a promising tool for exploring interdisciplinary subjects. Objective: The aim of this study was to evaluate ChatGPT’s competency in addressing interdisciplinary inquiries based on a case study exploring the opportunities and challenges of chatbot uses in sports rehabilitation. Methods: We developed a model termed PanelGPT to assess ChatGPT’s competency in addressing interdisciplinary topics through simulated panel discussions. Taking chatbot uses in sports rehabilitation as an example of an interdisciplinary topic, we prompted ChatGPT through PanelGPT to role-play a physiotherapist, psychologist, nutritionist, artificial intelligence expert, and athlete in a simulated panel discussion. During the simulation, we posed questions to the panel while ChatGPT acted as both the panelists for responses and the moderator for steering the discussion. We performed the simulation using ChatGPT-4 and evaluated the responses by referring to the literature and our human expertise. Results: By tackling questions related to chatbot uses in sports rehabilitation with respect to patient education, physiotherapy, physiology, nutrition, and ethical considerations, responses from the ChatGPT-simulated panel discussion reasonably pointed to various benefits such as 24/7 support, personalized advice, automated tracking, and reminders. 
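A minimal sketch of a multirole "panel discussion" prompt in the spirit of the PanelGPT setup described above, using the OpenAI chat completions API. The system prompt wording and the model identifier are assumptions; this is not the authors' actual PanelGPT prompt.

```python
# Sketch: prompting one model to play all panelists plus a moderator.
# Prompt wording and model name are assumptions for illustration only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

system_prompt = (
    "Simulate a panel discussion on chatbot uses in sports rehabilitation. "
    "You play every role: a physiotherapist, a psychologist, a nutritionist, "
    "an AI expert, an athlete, and a moderator who steers the discussion. "
    "Label each speaker's turn with their role."
)
question = ("Panel question: what are the main opportunities and risks of "
            "chatbot-guided advice during rehabilitation?")

response = client.chat.completions.create(
    model="gpt-4",  # assumed model identifier
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question},
    ],
)
print(response.choices[0].message.content)
```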
ChatGPT also correctly emphasized the importance of patient education, and identified challenges such as limited interaction modes, inaccuracies in emotion-related advice, assurance of data privacy and security, transparency in data handling, and fairness in model training. It also stressed that chatbots are to assist as a copilot, not to replace human health care professionals in the rehabilitation process. Conclusions: ChatGPT exhibits strong competency in addressing interdisciplinary inquiry by simulating multiple experts from complementary backgrounds, with significant implications in assisting medical education. %M 39042885 %R 10.2196/51157 %U https://mededu.jmir.org/2024/1/e51157 %U https://doi.org/10.2196/51157 %U http://www.ncbi.nlm.nih.gov/pubmed/39042885 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e52973 %T Engagement With Conversational Agent–Enabled Interventions in Cardiometabolic Disease Management: Protocol for a Systematic Review %A Kashyap,Nick %A Sebastian,Ann Tresa %A Lynch,Chris %A Jansons,Paul %A Maddison,Ralph %A Dingler,Tilman %A Oldenburg,Brian %+ Baker Department of Cardiovascular Research, Translation and Implementation, La Trobe University, Plenty Road and Kingsbury Dr, Bundoora, Melbourne, 3086, Australia, 61 422023197, Nick.Kashyap@baker.edu.au %K cardiometabolic disease %K cardiovascular disease %K diabetes %K chronic disease %K chatbot %K acceptability %K technology acceptance model %K design %K natural language processing %K adult %K heart failure %K digital health intervention %K Australia %K systematic review %K meta-analysis %K digital health %K conversational agent–enabled %K health informatics %K management %D 2024 %7 7.8.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Cardiometabolic diseases (CMDs) are a group of interrelated conditions, including heart failure and diabetes, that increase the risk of cardiovascular and metabolic complications. The rising number of Australians with CMDs has necessitated new strategies for those managing these conditions, such as digital health interventions. The effectiveness of digital health interventions in supporting people with CMDs is dependent on the extent to which users engage with the tools. Augmenting digital health interventions with conversational agents, technologies that interact with people using natural language, may enhance engagement because of their human-like attributes. To date, no systematic review has compiled evidence on how design features influence the engagement of conversational agent–enabled interventions supporting people with CMDs. This review seeks to address this gap, thereby guiding developers in creating more engaging and effective tools for CMD management. Objective: The aim of this systematic review is to synthesize evidence pertaining to conversational agent–enabled intervention design features and their impacts on the engagement of people managing CMD. Methods: The review is conducted in accordance with the Cochrane Handbook for Systematic Reviews of Interventions and reported in accordance with PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Searches will be conducted in the Ovid (Medline), Web of Science, and Scopus databases, which will be run again prior to manuscript submission. Inclusion criteria will consist of primary research studies reporting on conversational agent–enabled interventions, including measures of engagement, in adults with CMD. 
Data extraction will seek to capture the perspectives of people with CMD on the use of conversational agent–enabled interventions. Joanna Briggs Institute critical appraisal tools will be used to evaluate the overall quality of evidence collected. Results: This review was initiated in May 2023 and was registered with the International Prospective Register of Systematic Reviews (PROSPERO) in June 2023, prior to title and abstract screening. Full-text screening of articles was completed in July 2023 and data extraction began August 2023. Final searches were conducted in April 2024 prior to finalizing the review and the manuscript was submitted for peer review in July 2024. Conclusions: This review will synthesize diverse observations pertaining to conversational agent–enabled intervention design features and their impacts on engagement among people with CMDs. These observations can be used to guide the development of more engaging conversational agent–enabled interventions, thereby increasing the likelihood of regular intervention use and improved CMD health outcomes. Additionally, this review will identify gaps in the literature in terms of how engagement is reported, thereby highlighting areas for future exploration and supporting researchers in advancing the understanding of conversational agent–enabled interventions. Trial Registration: PROSPERO CRD42023431579; https://tinyurl.com/55cxkm26 International Registered Report Identifier (IRRID): DERR1-10.2196/52973 %M 39110504 %R 10.2196/52973 %U https://www.researchprotocols.org/2024/1/e52973 %U https://doi.org/10.2196/52973 %U http://www.ncbi.nlm.nih.gov/pubmed/39110504 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e59273 %T Claude 3 Opus and ChatGPT With GPT-4 in Dermoscopic Image Analysis for Melanoma Diagnosis: Comparative Performance Analysis %A Liu,Xu %A Duan,Chaoli %A Kim,Min-kyu %A Zhang,Lu %A Jee,Eunjin %A Maharjan,Beenu %A Huang,Yuwei %A Du,Dan %A Jiang,Xian %+ Department of Dermatology, West China Hospital, Sichuan University, No. 37, Guoxue Xiang, Wuhou District, Chengdu, 610041, China, 86 02885423315, jiangxian@scu.edu.cn %K artificial intelligence %K AI %K large language model %K LLM %K Claude %K ChatGPT %K dermatologist %D 2024 %7 6.8.2024 %9 Research Letter %J JMIR Med Inform %G English %X Background: Recent advancements in artificial intelligence (AI) and large language models (LLMs) have shown potential in medical fields, including dermatology. With the introduction of image analysis capabilities in LLMs, their application in dermatological diagnostics has garnered significant interest. These capabilities are enabled by the integration of computer vision techniques into the underlying architecture of LLMs. Objective: This study aimed to compare the diagnostic performance of Claude 3 Opus and ChatGPT with GPT-4 in analyzing dermoscopic images for melanoma detection, providing insights into their strengths and limitations. Methods: We randomly selected 100 histopathology-confirmed dermoscopic images (50 malignant, 50 benign) from the International Skin Imaging Collaboration (ISIC) archive using a computer-generated randomization process. The ISIC archive was chosen due to its comprehensive and well-annotated collection of dermoscopic images, ensuring a diverse and representative sample. Images were included if they were dermoscopic images of melanocytic lesions with histopathologically confirmed diagnoses. 
Each model was given the same prompt, instructing it to provide the top 3 differential diagnoses for each image, ranked by likelihood. Primary diagnosis accuracy, accuracy of the top 3 differential diagnoses, and malignancy discrimination ability were assessed. The McNemar test was chosen to compare the diagnostic performance of the 2 models, as it is suitable for analyzing paired nominal data. Results: In the primary diagnosis, Claude 3 Opus achieved 54.9% sensitivity (95% CI 44.08%-65.37%), 57.14% specificity (95% CI 46.31%-67.46%), and 56% accuracy (95% CI 46.22%-65.42%), while ChatGPT demonstrated 56.86% sensitivity (95% CI 45.99%-67.21%), 38.78% specificity (95% CI 28.77%-49.59%), and 48% accuracy (95% CI 38.37%-57.75%). The McNemar test showed no significant difference between the 2 models (P=.17). For the top 3 differential diagnoses, Claude 3 Opus and ChatGPT included the correct diagnosis in 76% (95% CI 66.33%-83.77%) and 78% (95% CI 68.46%-85.45%) of cases, respectively. The McNemar test showed no significant difference (P=.56). In malignancy discrimination, Claude 3 Opus outperformed ChatGPT with 47.06% sensitivity, 81.63% specificity, and 64% accuracy, compared to 45.1%, 42.86%, and 44%, respectively. The McNemar test showed a significant difference (P<.001). Claude 3 Opus had an odds ratio of 3.951 (95% CI 1.685-9.263) in discriminating malignancy, while ChatGPT-4 had an odds ratio of 0.616 (95% CI 0.297-1.278). Conclusions: Our study highlights the potential of LLMs in assisting dermatologists but also reveals their limitations. Both models made errors in diagnosing melanoma and benign lesions. These findings underscore the need for developing robust, transparent, and clinically validated AI models through collaborative efforts between AI researchers, dermatologists, and other health care professionals. While AI can provide valuable insights, it cannot yet replace the expertise of trained clinicians. %M 39106482 %R 10.2196/59273 %U https://medinform.jmir.org/2024/1/e59273 %U https://doi.org/10.2196/59273 %U http://www.ncbi.nlm.nih.gov/pubmed/39106482 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53134 %T Effectiveness and User Experience of a Smoking Cessation Chatbot: Mixed Methods Study Comparing Motivational Interviewing and Confrontational Counseling %A He,Linwei %A Basar,Erkan %A Krahmer,Emiel %A Wiers,Reinout %A Antheunis,Marjolijn %+ Department of Communication and Cognition, Tilburg School of Humanities and Digital Sciences, Tilburg University, Dante Building, D407, Warandelaan 2, Tilburg, 5037AB, Netherlands, 31 644911989, l.he_1@tilburguniversity.edu %K chatbot %K smoking cessation %K counseling %K motivational interviewing %K confrontational counseling %K user experience %K engagement %D 2024 %7 6.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Cigarette smoking poses a major public health risk. Chatbots may serve as an accessible and useful tool to promote cessation due to their high accessibility and potential in facilitating long-term personalized interactions. To increase effectiveness and acceptability, there remains a need to identify and evaluate counseling strategies for these chatbots, an aspect that has not been comprehensively addressed in previous research. Objective: This study aims to identify effective counseling strategies for such chatbots to support smoking cessation. In addition, we sought to gain insights into smokers’ expectations of and experiences with the chatbot. 
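For the paired-model comparison in the dermoscopy study above, a minimal sketch of a McNemar test on per-image correctness plus sensitivity and specificity for malignancy calls. All arrays are illustrative stand-ins, not the study's data.

```python
# Sketch: McNemar test on paired per-image correctness of two models, then
# sensitivity/specificity for malignancy calls. Data are simulated stand-ins.
import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

rng = np.random.default_rng(0)
claude_correct = rng.random(100) < 0.56   # illustrative correctness flags
gpt_correct = rng.random(100) < 0.48

# Paired 2x2 table: rows = Claude correct/incorrect, cols = GPT correct/incorrect
table = np.array([
    [np.sum(claude_correct & gpt_correct),  np.sum(claude_correct & ~gpt_correct)],
    [np.sum(~claude_correct & gpt_correct), np.sum(~claude_correct & ~gpt_correct)],
])
print(mcnemar(table, exact=True))

truth = np.array([1] * 50 + [0] * 50)         # 50 malignant, 50 benign
calls = (rng.random(100) < 0.5).astype(int)   # illustrative malignancy calls
sens = np.sum((calls == 1) & (truth == 1)) / np.sum(truth == 1)
spec = np.sum((calls == 0) & (truth == 0)) / np.sum(truth == 0)
print(f"sensitivity={sens:.2f}, specificity={spec:.2f}")
```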
Methods: This mixed methods study incorporated a web-based experiment and semistructured interviews. Smokers (N=229) interacted with either a motivational interviewing (MI)–style (n=112, 48.9%) or a confrontational counseling–style (n=117, 51.1%) chatbot. Both cessation-related (ie, intention to quit and self-efficacy) and user experience–related outcomes (ie, engagement, therapeutic alliance, perceived empathy, and interaction satisfaction) were assessed. Semistructured interviews were conducted with 16 participants, 8 (50%) from each condition, and data were analyzed using thematic analysis. Results: Results from a multivariate ANOVA showed that participants had a significantly higher overall rating for the MI (vs confrontational counseling) chatbot. Follow-up discriminant analysis revealed that the better perception of the MI chatbot was mostly explained by the user experience–related outcomes, with cessation-related outcomes playing a lesser role. Exploratory analyses indicated that smokers in both conditions reported increased intention to quit and self-efficacy after the chatbot interaction. Interview findings illustrated several constructs (eg, affective attitude and engagement) explaining people’s previous expectations and timely and retrospective experience with the chatbot. Conclusions: The results confirmed that chatbots are a promising tool in motivating smoking cessation and the use of MI can improve user experience. We did not find extra support for MI to motivate cessation and have discussed possible reasons. Smokers expressed both relational and instrumental needs in the quitting process. Implications for future research and practice are discussed. %M 39106097 %R 10.2196/53134 %U https://www.jmir.org/2024/1/e53134 %U https://doi.org/10.2196/53134 %U http://www.ncbi.nlm.nih.gov/pubmed/39106097 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e55577 %T Benchmarking Large Language Models for Cervical Spondylosis %A Zhang,Boyan %A Du,Yueqi %A Duan,Wanru %A Chen,Zan %+ Xuanwu Hospital, Capital Medical University, 45 Changchun Street, Beijing, 100000, China, 86 13911712120, chenzan66@163.com %K cervical spondylosis %K large language model %K LLM %K patient %K ChatGPT %D 2024 %7 5.8.2024 %9 Research Letter %J JMIR Form Res %G English %X Cervical spondylosis is the most common degenerative spinal disorder in modern societies. Patients require a great deal of medical knowledge, and large language models (LLMs) offer patients a novel and convenient tool for accessing medical advice. In this study, we collected the most frequently asked questions by patients with cervical spondylosis in clinical work and internet consultations. The accuracy of the answers provided by LLMs was evaluated and graded by 3 experienced spinal surgeons. Comparative analysis of responses showed that all LLMs could provide satisfactory results, and that among them, GPT-4 had the highest accuracy rate. Variation across each section in all LLMs revealed their ability boundaries and the development direction of artificial intelligence. 
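For the multivariate ANOVA reported in the smoking cessation chatbot study above, a minimal sketch of a one-way MANOVA across the two counseling conditions. The simulated data frame and outcome column names are assumptions for illustration.

```python
# Sketch: one-way MANOVA of user-experience outcomes across two chatbot
# conditions. All values are simulated; column names are illustrative.
import numpy as np
import pandas as pd
from statsmodels.multivariate.manova import MANOVA

rng = np.random.default_rng(7)
n = 229  # matches the reported sample size; outcome values are simulated
df = pd.DataFrame({
    "condition": rng.choice(["MI", "confrontational"], size=n),
    "engagement": rng.normal(3.5, 0.8, n),
    "alliance": rng.normal(3.4, 0.8, n),
    "empathy": rng.normal(3.6, 0.9, n),
    "satisfaction": rng.normal(3.5, 0.9, n),
})
maov = MANOVA.from_formula(
    "engagement + alliance + empathy + satisfaction ~ condition", data=df)
print(maov.mv_test())
```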
%M 39102674 %R 10.2196/55577 %U https://formative.jmir.org/2024/1/e55577 %U https://doi.org/10.2196/55577 %U http://www.ncbi.nlm.nih.gov/pubmed/39102674 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57224 %T Predictors of Health Care Practitioners’ Intention to Use AI-Enabled Clinical Decision Support Systems: Meta-Analysis Based on the Unified Theory of Acceptance and Use of Technology %A Dingel,Julius %A Kleine,Anne-Kathrin %A Cecil,Julia %A Sigl,Anna Leonie %A Lermer,Eva %A Gaube,Susanne %+ Human-AI-Interaction Group, Center for Leadership and People Management, Ludwig Maximilian University of Munich, Geschwister-Scholl-Platz 1, Munich, 80539, Germany, 49 8921809775, anne-kathrin.kleine@psy.lmu.de %K Unified Theory of Acceptance and Use of Technology %K UTAUT %K artificial intelligence–enabled clinical decision support systems %K AI-CDSSs %K meta-analysis %K health care practitioners %D 2024 %7 5.8.2024 %9 Review %J J Med Internet Res %G English %X Background: Artificial intelligence–enabled clinical decision support systems (AI-CDSSs) offer potential for improving health care outcomes, but their adoption among health care practitioners remains limited. Objective: This meta-analysis identified predictors influencing health care practitioners’ intention to use AI-CDSSs based on the Unified Theory of Acceptance and Use of Technology (UTAUT). Additional predictors were examined based on existing empirical evidence. Methods: The literature search using electronic databases, forward searches, conference programs, and personal correspondence yielded 7731 results, of which 17 (0.22%) studies met the inclusion criteria. Random-effects meta-analysis, relative weight analyses, and meta-analytic moderation and mediation analyses were used to examine the relationships between relevant predictor variables and the intention to use AI-CDSSs. Results: The meta-analysis results supported the application of the UTAUT to the context of the intention to use AI-CDSSs. The results showed that performance expectancy (r=0.66), effort expectancy (r=0.55), social influence (r=0.66), and facilitating conditions (r=0.66) were positively associated with the intention to use AI-CDSSs, in line with the predictions of the UTAUT. The meta-analysis further identified positive attitude (r=0.63), trust (r=0.73), anxiety (r=–0.41), perceived risk (r=–0.21), and innovativeness (r=0.54) as additional relevant predictors. Trust emerged as the most influential predictor overall. The results of the moderation analyses show that the relationship between social influence and use intention becomes weaker with increasing age. In addition, the relationship between effort expectancy and use intention was stronger for diagnostic AI-CDSSs than for devices that combined diagnostic and treatment recommendations. Finally, the relationship between facilitating conditions and use intention was mediated through performance and effort expectancy. Conclusions: This meta-analysis contributes to the understanding of the predictors of intention to use AI-CDSSs based on an extended UTAUT model. More research is needed to substantiate the identified relationships and explain the observed variations in effect sizes by identifying relevant moderating factors. The research findings bear important implications for the design and implementation of training programs for health care practitioners to ease the adoption of AI-CDSSs into their practice. 
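A minimal sketch of pooling correlations with a random-effects (DerSimonian-Laird) model via Fisher's z, the kind of synthesis used in the meta-analysis above. The per-study r and n values are illustrative, not the review's data.

```python
# Sketch: random-effects pooling of correlations via Fisher's z.
# Example r and n values are illustrative placeholders.
import numpy as np

r = np.array([0.70, 0.62, 0.58, 0.75, 0.66])  # per-study correlations
n = np.array([120, 85, 200, 60, 150])         # per-study sample sizes

z = np.arctanh(r)      # Fisher z transform
v = 1.0 / (n - 3)      # sampling variance of z
w = 1.0 / v            # fixed-effect weights

z_fixed = np.sum(w * z) / np.sum(w)
Q = np.sum(w * (z - z_fixed) ** 2)
C = np.sum(w) - np.sum(w ** 2) / np.sum(w)
tau2 = max(0.0, (Q - (len(r) - 1)) / C)       # between-study variance (DL)

w_re = 1.0 / (v + tau2)                       # random-effects weights
z_re = np.sum(w_re * z) / np.sum(w_re)
se_re = np.sqrt(1.0 / np.sum(w_re))
ci = np.tanh([z_re - 1.96 * se_re, z_re + 1.96 * se_re])
print(f"pooled r = {np.tanh(z_re):.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```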
%M 39102675 %R 10.2196/57224 %U https://www.jmir.org/2024/1/e57224 %U https://doi.org/10.2196/57224 %U http://www.ncbi.nlm.nih.gov/pubmed/39102675 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60336 %T Patient-Representing Population's Perceptions of GPT-Generated Versus Standard Emergency Department Discharge Instructions: Randomized Blind Survey Assessment %A Huang,Thomas %A Safranek,Conrad %A Socrates,Vimig %A Chartash,David %A Wright,Donald %A Dilip,Monisha %A Sangal,Rohit B %A Taylor,Richard Andrew %+ Department of Emergency Medicine, Yale School of Medicine, 333 Cedar Street, New Haven, CT, 06510, United States, 1 2034324771, richard.taylor@yale.edu %K machine learning %K artificial intelligence %K large language models %K natural language processing %K ChatGPT %K discharge instructions %K emergency medicine %K emergency department %K discharge instructions %K surveys and questionaries %D 2024 %7 2.8.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Discharge instructions are a key form of documentation and patient communication in the time of transition from the emergency department (ED) to home. Discharge instructions are time-consuming and often underprioritized, especially in the ED, leading to discharge delays and possibly impersonal patient instructions. Generative artificial intelligence and large language models (LLMs) offer promising methods of creating high-quality and personalized discharge instructions; however, there exists a gap in understanding patient perspectives of LLM-generated discharge instructions. Objective: We aimed to assess the use of LLMs such as ChatGPT in synthesizing accurate and patient-accessible discharge instructions in the ED. Methods: We synthesized 5 unique, fictional ED encounters to emulate real ED encounters that included a diverse set of clinician history, physical notes, and nursing notes. These were passed to GPT-4 in Azure OpenAI Service (Microsoft) to generate LLM-generated discharge instructions. Standard discharge instructions were also generated for each of the 5 unique ED encounters. All GPT-generated and standard discharge instructions were then formatted into standardized after-visit summary documents. These after-visit summaries containing either GPT-generated or standard discharge instructions were randomly and blindly administered to Amazon MTurk respondents representing patient populations through Amazon MTurk Survey Distribution. Discharge instructions were assessed based on metrics of interpretability of significance, understandability, and satisfaction. Results: Our findings revealed that survey respondents’ perspectives regarding GPT-generated and standard discharge instructions were significantly (P=.01) more favorable toward GPT-generated return precautions, and all other sections were considered noninferior to standard discharge instructions. Of the 156 survey respondents, GPT-generated discharge instructions were assigned favorable ratings, “agree” and “strongly agree,” more frequently along the metric of interpretability of significance in discharge instruction subsections regarding diagnosis, procedures, treatment, post-ED medications or any changes to medications, and return precautions. Survey respondents found GPT-generated instructions to be more understandable when rating procedures, treatment, post-ED medications or medication changes, post-ED follow-up, and return precautions. 
Satisfaction with GPT-generated discharge instruction subsections was the most favorable in procedures, treatment, post-ED medications or medication changes, and return precautions. Wilcoxon rank-sum test of Likert responses revealed significant differences (P=.01) in the interpretability of significant return precautions in GPT-generated discharge instructions compared to standard discharge instructions but not for other evaluation metrics and discharge instruction subsections. Conclusions: This study demonstrates the potential for LLMs such as ChatGPT to act as a method of augmenting current documentation workflows in the ED to reduce the documentation burden of physicians. The ability of LLMs to provide tailored instructions for patients by improving readability and making instructions more applicable to patients could improve upon the methods of communication that currently exist. %M 39094112 %R 10.2196/60336 %U https://www.jmir.org/2024/1/e60336 %U https://doi.org/10.2196/60336 %U http://www.ncbi.nlm.nih.gov/pubmed/39094112 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e58129 %T Large Language Models Versus Expert Clinicians in Crisis Prediction Among Telemental Health Patients: Comparative Study %A Lee,Christine %A Mohebbi,Matthew %A O'Callaghan,Erin %A Winsberg,Mirène %+ Brightside Health, 2261 Market Street, STE 10222, San Francisco, CA, 94114, United States, 1 415 279 2042, mimi.winsberg@brightside.com %K mental health %K telehealth %K PHQ-9 %K Patient Health Questionnaire-9 %K suicidal ideation %K AI %K LLM %K OpenAI %K GPT-4 %K generative pretrained transformer 4 %K tele-mental health %K large language model %K clinician %K clinicians %K artificial intelligence %K patient information %K suicide %K suicidal %K mental disorder %K suicide attempt %K psychologist %K psychologists %K psychiatrist %K psychiatrists %K psychiatry %K clinical setting %K self-reported %K treatment %K medication %K digital mental health %K machine learning %K language model %K suicide %K crisis %K telemental health %K tele health %K e-health %K digital health %D 2024 %7 2.8.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Due to recent advances in artificial intelligence, large language models (LLMs) have emerged as a powerful tool for a variety of language-related tasks, including sentiment analysis, and summarization of provider-patient interactions. However, there is limited research on these models in the area of crisis prediction. Objective: This study aimed to evaluate the performance of LLMs, specifically OpenAI’s generative pretrained transformer 4 (GPT-4), in predicting current and future mental health crisis episodes using patient-provided information at intake among users of a national telemental health platform. Methods: Deidentified patient-provided data were pulled from specific intake questions of the Brightside telehealth platform, including the chief complaint, for 140 patients who indicated suicidal ideation (SI), and another 120 patients who later indicated SI with a plan during the course of treatment. Similar data were pulled for 200 randomly selected patients, treated during the same time period, who never endorsed SI. In total, 6 senior Brightside clinicians (3 psychologists and 3 psychiatrists) were shown patients’ self-reported chief complaint and self-reported suicide attempt history but were blinded to the future course of treatment and other reported symptoms, including SI. 
They were asked a simple yes or no question regarding their prediction of endorsement of SI with plan, along with their confidence level about the prediction. GPT-4 was provided with similar information and asked to answer the same questions, enabling us to directly compare the performance of artificial intelligence and clinicians. Results: Overall, the clinicians’ average precision (0.7) was higher than that of GPT-4 (0.6) in identifying the SI with plan at intake (n=140) versus no SI (n=200) when using the chief complaint alone, while sensitivity was higher for the GPT-4 (0.62) than the clinicians’ average (0.53). The addition of suicide attempt history increased the clinicians’ average sensitivity (0.59) and precision (0.77) while increasing the GPT-4 sensitivity (0.59) but decreasing the GPT-4 precision (0.54). Performance decreased comparatively when predicting future SI with plan (n=120) versus no SI (n=200) with a chief complaint only for the clinicians (average sensitivity=0.4; average precision=0.59) and the GPT-4 (sensitivity=0.46; precision=0.48). The addition of suicide attempt history increased performance comparatively for the clinicians (average sensitivity=0.46; average precision=0.69) and the GPT-4 (sensitivity=0.74; precision=0.48). Conclusions: GPT-4, with a simple prompt design, produced results on some metrics that approached those of a trained clinician. Additional work must be done before such a model can be piloted in a clinical setting. The model should undergo safety checks for bias, given evidence that LLMs can perpetuate the biases of the underlying data on which they are trained. We believe that LLMs hold promise for augmenting the identification of higher-risk patients at intake and potentially delivering more timely care to patients. %M 38876484 %R 10.2196/58129 %U https://mental.jmir.org/2024/1/e58129 %U https://doi.org/10.2196/58129 %U http://www.ncbi.nlm.nih.gov/pubmed/38876484 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60083 %T Ethical Considerations and Fundamental Principles of Large Language Models in Medical Education: Viewpoint %A Zhui,Li %A Fenghe,Li %A Xuehu,Wang %A Qining,Fu %A Wei,Ren %+ Department of Vascular Surgery, The First Affiliated Hospital of Chongqing Medical University, No. 1 of Youyi Road, Yuzhong District, Chongqing, 400016, China, 86 13658339771, renwei_2301@yeah.net %K medical education %K artificial intelligence %K large language models %K medical ethics %K AI %K LLMs %K ethics %K academic integrity %K privacy and data risks %K data security %K data protection %K intellectual property rights %K educational research %D 2024 %7 1.8.2024 %9 Viewpoint %J J Med Internet Res %G English %X This viewpoint article first explores the ethical challenges associated with the future application of large language models (LLMs) in the context of medical education. These challenges include not only ethical concerns related to the development of LLMs, such as artificial intelligence (AI) hallucinations, information bias, privacy and data risks, and deficiencies in terms of transparency and interpretability but also issues concerning the application of LLMs, including deficiencies in emotional intelligence, educational inequities, problems with academic integrity, and questions of responsibility and copyright ownership. This paper then analyzes existing AI-related legal and ethical frameworks and highlights their limitations with regard to the application of LLMs in the context of medical education. 
To ensure that LLMs are integrated in a responsible and safe manner, the authors recommend the development of a unified ethical framework that is specifically tailored for LLMs in this field. This framework should be based on 8 fundamental principles: quality control and supervision mechanisms; privacy and data protection; transparency and interpretability; fairness and equal treatment; academic integrity and moral norms; accountability and traceability; protection and respect for intellectual property; and the promotion of educational research and innovation. The authors further discuss specific measures that can be taken to implement these principles, thereby laying a solid foundation for the development of a comprehensive and actionable ethical framework. Such a unified ethical framework based on these 8 fundamental principles can provide clear guidance and support for the application of LLMs in the context of medical education. This approach can help establish a balance between technological advancement and ethical safeguards, thereby ensuring that medical education can progress without compromising the principles of fairness, justice, or patient safety and establishing a more equitable, safer, and more efficient environment for medical education. %M 38971715 %R 10.2196/60083 %U https://www.jmir.org/2024/1/e60083 %U https://doi.org/10.2196/60083 %U http://www.ncbi.nlm.nih.gov/pubmed/38971715 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e54345 %T Reference Hallucination Score for Medical Artificial Intelligence Chatbots: Development and Usability Study %A Aljamaan,Fadi %A Temsah,Mohamad-Hani %A Altamimi,Ibraheem %A Al-Eyadhy,Ayman %A Jamal,Amr %A Alhasan,Khalid %A Mesallam,Tamer A %A Farahat,Mohamed %A Malki,Khalid H %+ Department of Otolaryngology, College of Medicine, Research Chair of Voice, Swallowing, and Communication Disorders, King Saud University, 12629 Abdulaziz Rd, Al Malaz, Riyadh, P.BOX 2925 Zip 11461, Saudi Arabia, 966 114876100, kalmalki@ksu.edu.sa %K artificial intelligence (AI) chatbots %K reference hallucination %K bibliographic verification %K ChatGPT %K Perplexity %K SciSpace %K Elicit %K Bing %D 2024 %7 31.7.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Artificial intelligence (AI) chatbots have recently gained use in medical practice by health care practitioners. Interestingly, the output of these AI chatbots was found to have varying degrees of hallucination in content and references. Such hallucinations generate doubts about their output and their implementation. Objective: The aim of our study was to propose a reference hallucination score (RHS) to evaluate the authenticity of AI chatbots’ citations. Methods: Six AI chatbots were challenged with the same 10 medical prompts, requesting 10 references per prompt. The RHS is composed of 6 bibliographic items and the reference’s relevance to prompts’ keywords. RHS was calculated for each reference, prompt, and type of prompt (basic vs complex). The average RHS was calculated for each AI chatbot and compared across the different types of prompts and AI chatbots. Results: Bard failed to generate any references. ChatGPT 3.5 and Bing generated the highest RHS (score=11), while Elicit and SciSpace generated the lowest RHS (score=1), and Perplexity generated a middle RHS (score=7). The highest degree of hallucination was observed for reference relevancy to the prompt keywords (308/500, 61.6%), while the lowest was for reference titles (169/500, 33.8%). 
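A minimal sketch of one way to operationalize a reference hallucination score of the kind described above, checking 6 bibliographic items plus relevance of the reference to the prompt keywords. The field names, the equal weighting, and the verification flags are assumptions; the published RHS may differ in detail.

```python
# Sketch: scoring one AI-generated reference for hallucination. Field names,
# weights, and the verification flags are assumptions for illustration only.
def reference_hallucination_score(ref: dict, prompt_keywords: set) -> int:
    bibliographic_items = ["authors", "title", "journal", "year", "doi", "pages"]
    score = 0
    for item in bibliographic_items:
        if not ref.get(f"{item}_verified", False):  # unverifiable item counts as hallucinated
            score += 1
    title_words = set(ref.get("title", "").lower().split())
    if not (title_words & {k.lower() for k in prompt_keywords}):
        score += 1  # reference not relevant to the prompt keywords
    return score    # 0 (fully verifiable) to 7 (fully hallucinated)

example = {"title": "Statin therapy in heart failure", "authors_verified": True,
           "title_verified": True, "journal_verified": False, "year_verified": True,
           "doi_verified": False, "pages_verified": False}
print(reference_hallucination_score(example, {"statin", "heart", "failure"}))
```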
ChatGPT and Bing had comparable RHS (β coefficient=–0.069; P=.32), while Perplexity had significantly lower RHS than ChatGPT (β coefficient=–0.345; P<.001). AI chatbots generally had significantly higher RHS when prompted with scenarios or complex format prompts (β coefficient=0.486; P<.001). Conclusions: The variation in RHS underscores the necessity for a robust reference evaluation tool to improve the authenticity of AI chatbots. Further, the variations highlight the importance of verifying their output and citations. Elicit and SciSpace had negligible hallucination, while ChatGPT and Bing had critical hallucination levels. The proposed AI chatbots’ RHS could contribute to ongoing efforts to enhance AI’s general reliability in medical research. %M 39083799 %R 10.2196/54345 %U https://medinform.jmir.org/2024/1/e54345 %U https://doi.org/10.2196/54345 %U http://www.ncbi.nlm.nih.gov/pubmed/39083799 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e54633 %T A Reliable and Accessible Caregiving Language Model (CaLM) to Support Tools for Caregivers: Development and Evaluation Study %A Parmanto,Bambang %A Aryoyudanta,Bayu %A Soekinto,Timothius Wilbert %A Setiawan,I Made Agus %A Wang,Yuhan %A Hu,Haomin %A Saptono,Andi %A Choi,Yong Kyung %+ Department of Health Information Management, University of Pittsburgh, 6052 Forbes Tower, Pittsburgh, PA, 15260, United States, 1 412 383 6649, parmanto@pitt.edu %K large language model %K caregiving %K caregiver %K informal care %K carer %K GPT %K language model %K LLM %K elderly %K aging %K ChatGPT %K machine learning %K natural language processing %K NLP %D 2024 %7 31.7.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: In the United States, 1 in 5 adults currently serves as a family caregiver for an individual with a serious illness or disability. Unlike professional caregivers, family caregivers often assume this role without formal preparation or training. Thus, there is an urgent need to enhance the capacity of family caregivers to provide quality care. Leveraging technology as an educational tool or an adjunct to care is a promising approach that has the potential to enhance the learning and caregiving capabilities of family caregivers. Large language models (LLMs) can potentially be used as a foundation technology for supporting caregivers. An LLM can be categorized as a foundation model (FM), which is a large-scale model trained on a broad data set that can be adapted to a range of different domain tasks. Despite their potential, FMs have the critical weakness of “hallucination,” where the models generate information that can be misleading or inaccurate. Information reliability is essential when language models are deployed as front-line help tools for caregivers. Objective: This study aimed to (1) develop a reliable caregiving language model (CaLM) by using FMs and a caregiving knowledge base, (2) develop an accessible CaLM using a small FM that requires fewer computing resources, and (3) evaluate the model’s performance compared with a large FM. Methods: We developed a CaLM using the retrieval augmented generation (RAG) framework combined with FM fine-tuning for improving the quality of FM answers by grounding the model on a caregiving knowledge base. The key components of the CaLM are the caregiving knowledge base, a fine-tuned FM, and a retriever module. 
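A minimal sketch of the retrieval step in a RAG pipeline like the CaLM described above: retrieve the most relevant knowledge-base passages for a caregiver question and build a grounded prompt for the fine-tuned foundation model. A TF-IDF retriever is used here as a simplified stand-in; the study's actual retriever module is not specified at this level of detail, and the passages are illustrative.

```python
# Sketch: retrieve top passages from a caregiving knowledge base and build a
# grounded prompt. TF-IDF is a simplified stand-in; passages are illustrative.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

knowledge_base = [
    "Sundowning in dementia often worsens in the late afternoon; keep routines stable.",
    "Caregivers should secure medications and review dosing schedules with a pharmacist.",
    "Respite care services give family caregivers short-term relief.",
]
question = "How do I handle my mother's agitation in the evenings?"

vectorizer = TfidfVectorizer().fit(knowledge_base + [question])
doc_vecs = vectorizer.transform(knowledge_base)
q_vec = vectorizer.transform([question])

scores = cosine_similarity(q_vec, doc_vecs).ravel()
top_passages = [knowledge_base[i] for i in scores.argsort()[::-1][:2]]

prompt = ("Answer the caregiver's question using only the context below and "
          "cite the passages you used.\n\nContext:\n- " + "\n- ".join(top_passages)
          + f"\n\nQuestion: {question}")
print(prompt)  # this grounded prompt would then be sent to the fine-tuned FM
```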
We used 2 small FMs as candidates for the foundation of the CaLM (LLaMA [large language model Meta AI] 2 and Falcon with 7 billion parameters) and adopted a large FM (GPT-3.5 with an estimated 175 billion parameters) as a benchmark. We developed the caregiving knowledge base by gathering various types of documents from the internet. We focused on caregivers of individuals with Alzheimer disease and related dementias. We evaluated the models’ performances using the benchmark metrics commonly used in evaluating language models and their reliability for providing accurate references with their answers. Results: The RAG framework improved the performance of all FMs used in this study across all measures. As expected, the large FM performed better than the small FMs across all metrics. Interestingly, the small fine-tuned FMs with RAG performed significantly better than GPT 3.5 across all metrics. The fine-tuned LLaMA 2 with a small FM performed better than GPT 3.5 (even with RAG) in returning references with the answers. Conclusions: The study shows that a reliable and accessible CaLM can be developed using small FMs with a knowledge base specific to the caregiving domain. %M 39083337 %R 10.2196/54633 %U https://formative.jmir.org/2024/1/e54633 %U https://doi.org/10.2196/54633 %U http://www.ncbi.nlm.nih.gov/pubmed/39083337 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e52500 %T Can Large Language Models Replace Therapists? Evaluating Performance at Simple Cognitive Behavioral Therapy Tasks %A Hodson,Nathan %A Williamson,Simon %+ Warwick Medical School, University of Warwick, Warwick Medical School, Gibbett Hill Road, Coventry, CV4 7AL, United Kingdom, 44 02476574880, nathan.hodson@warwick.ac.uk %K mental health %K psychotherapy %K digital therapy %K CBT %K ChatGPT %K cognitive behavioral therapy %K cognitive behavioural therapy %K LLM %K LLMs %K language model %K language models %K NLP %K natural language processing %K artificial intelligence %K performance %K chatbot %K chatbots %K conversational agent %K conversational agents %D 2024 %7 30.7.2024 %9 Research Letter %J JMIR AI %G English %X The advent of large language models (LLMs) such as ChatGPT has potential implications for psychological therapies such as cognitive behavioral therapy (CBT). We systematically investigated whether LLMs could recognize an unhelpful thought, examine its validity, and reframe it to a more helpful one. LLMs currently have the potential to offer reasonable suggestions for the identification and reframing of unhelpful thoughts but should not be relied on to lead CBT delivery. 
%M 39078696 %R 10.2196/52500 %U https://ai.jmir.org/2024/1/e52500 %U https://doi.org/10.2196/52500 %U http://www.ncbi.nlm.nih.gov/pubmed/39078696 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e55933 %T Impact of Large Language Models on Medical Education and Teaching Adaptations %A Zhui,Li %A Yhap,Nina %A Liping,Liu %A Zhengjie,Wang %A Zhonghao,Xiong %A Xiaoshu,Yuan %A Hong,Cui %A Xuexiu,Liu %A Wei,Ren %K large language models %K medical education %K opportunities %K challenges %K critical thinking %K educator %D 2024 %7 25.7.2024 %9 %J JMIR Med Inform %G English %X This viewpoint article explores the transformative role of large language models (LLMs) in the field of medical education, highlighting their potential to enhance teaching quality, promote personalized learning paths, strengthen clinical skills training, optimize teaching assessment processes, boost the efficiency of medical research, and support continuing medical education. However, the use of LLMs entails certain challenges, such as questions regarding the accuracy of information, the risk of overreliance on technology, a lack of emotional recognition capabilities, and concerns related to ethics, privacy, and data security. This article emphasizes that to maximize the potential of LLMs and overcome these challenges, educators must exhibit leadership in medical education, adjust their teaching strategies flexibly, cultivate students’ critical thinking, and emphasize the importance of practical experience, thus ensuring that students can use LLMs correctly and effectively. By adopting such a comprehensive and balanced approach, educators can train health care professionals who are proficient in the use of advanced technologies and who exhibit solid professional ethics and practical skills, thus laying a strong foundation for these professionals to overcome future challenges in the health care sector. %R 10.2196/55933 %U https://medinform.jmir.org/2024/1/e55933 %U https://doi.org/10.2196/55933 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e56342 %T Assessing the Ability of a Large Language Model to Score Free-Text Medical Student Clinical Notes: Quantitative Study %A Burke,Harry B %A Hoang,Albert %A Lopreiato,Joseph O %A King,Heidi %A Hemmer,Paul %A Montgomery,Michael %A Gagarin,Viktoria %K medical education %K generative artificial intelligence %K natural language processing %K ChatGPT %K generative pretrained transformer %K standardized patients %K clinical notes %K free-text notes %K history and physical examination %K large language model %K LLM %K medical student %K medical students %K clinical information %K artificial intelligence %K AI %K patients %K patient %K medicine %D 2024 %7 25.7.2024 %9 %J JMIR Med Educ %G English %X Background: Teaching medical students the skills required to acquire, interpret, apply, and communicate clinical information is an integral part of medical education. A crucial aspect of this process involves providing students with feedback regarding the quality of their free-text clinical notes. Objective: The goal of this study was to assess the ability of ChatGPT 3.5, a large language model, to score medical students’ free-text history and physical notes. Methods: This is a single-institution, retrospective study. Standardized patients learned a prespecified clinical case and, acting as the patient, interacted with medical students. Each student wrote a free-text history and physical note of their interaction. 
The students’ notes were scored independently by the standardized patients and ChatGPT using a prespecified scoring rubric that consisted of 85 case elements. The measure of accuracy was percent correct. Results: The study population consisted of 168 first-year medical students. There was a total of 14,280 scores. The ChatGPT incorrect scoring rate was 1.0%, and the standardized patient incorrect scoring rate was 7.2%. The ChatGPT error rate was 86% lower than the standardized patient error rate. The ChatGPT mean incorrect scoring rate of 12 (SD 11) was significantly lower than the standardized patient mean incorrect scoring rate of 85 (SD 74; P=.002). Conclusions: ChatGPT demonstrated a significantly lower error rate compared to standardized patients. This is the first study to assess the ability of a generative pretrained transformer (GPT) program to score medical students’ standardized patient-based free-text clinical notes. It is expected that, in the near future, large language models will provide real-time feedback to practicing physicians regarding their free-text notes. GPT artificial intelligence programs represent an important advance in medical education and medical practice. %R 10.2196/56342 %U https://mededu.jmir.org/2024/1/e56342 %U https://doi.org/10.2196/56342 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e60807 %T Performance of ChatGPT Across Different Versions in Medical Licensing Examinations Worldwide: Systematic Review and Meta-Analysis %A Liu,Mingxin %A Okuhara,Tsuyoshi %A Chang,XinYi %A Shirabe,Ritsuko %A Nishiie,Yuriko %A Okada,Hiroko %A Kiuchi,Takahiro %+ Department of Health Communication, Graduate School of Medicine, The University of Tokyo, 7-3-1 Hongo, Bunkyo, Tokyo, 113-8655, Japan, 81 03 5800 6549, liumingxin98@g.ecc.u-tokyo.ac.jp %K large language model, ChatGPT, medical licensing examination, medical education %K LLMs %K NLP %K natural language processing %K artificial intelligence %K language models %K review methods %K systematic %K meta-analysis %D 2024 %7 25.7.2024 %9 Review %J J Med Internet Res %G English %X Background: Over the past 2 years, researchers have used various medical licensing examinations to test whether ChatGPT (OpenAI) possesses accurate medical knowledge. The performance of each version of ChatGPT on the medical licensing examination in multiple environments showed remarkable differences. At this stage, there is still a lack of a comprehensive understanding of the variability in ChatGPT’s performance on different medical licensing examinations. Objective: In this study, we reviewed all studies on ChatGPT performance in medical licensing examinations up to March 2024. This review aims to contribute to the evolving discourse on artificial intelligence (AI) in medical education by providing a comprehensive analysis of the performance of ChatGPT in various environments. The insights gained from this systematic review will guide educators, policymakers, and technical experts to effectively and judiciously use AI in medical education. Methods: We searched the literature published between January 1, 2022, and March 29, 2024, by searching query strings in Web of Science, PubMed, and Scopus. Two authors screened the literature according to the inclusion and exclusion criteria, extracted data, and independently assessed the quality of the literature concerning Quality Assessment of Diagnostic Accuracy Studies-2. We conducted both qualitative and quantitative analyses.
Results: A total of 45 studies on the performance of different versions of ChatGPT in medical licensing examinations were included in this study. GPT-4 achieved an overall accuracy rate of 81% (95% CI 78-84; P<.01), significantly surpassing the 58% (95% CI 53-63; P<.01) accuracy rate of GPT-3.5. GPT-4 passed the medical examinations in 26 of 29 cases, outperforming the average scores of medical students in 13 of 17 cases. Translating the examination questions into English improved GPT-3.5’s performance but did not affect GPT-4. GPT-3.5 showed no difference in performance between examinations from English-speaking and non–English-speaking countries (P=.72), but GPT-4 performed better on examinations from English-speaking countries significantly (P=.02). Any type of prompt could significantly improve GPT-3.5’s (P=.03) and GPT-4’s (P<.01) performance. GPT-3.5 performed better on short-text questions than on long-text questions. The difficulty of the questions affected the performance of GPT-3.5 and GPT-4. In image-based multiple-choice questions (MCQs), ChatGPT’s accuracy rate ranges from 13.1% to 100%. ChatGPT performed significantly worse on open-ended questions than on MCQs. Conclusions: GPT-4 demonstrates considerable potential for future use in medical education. However, due to its insufficient accuracy, inconsistent performance, and the challenges posed by differing medical policies and knowledge across countries, GPT-4 is not yet suitable for use in medical education. Trial Registration: PROSPERO CRD42024506687; https://www.crd.york.ac.uk/prospero/display_record.php?RecordID=506687 %M 39052324 %R 10.2196/60807 %U https://www.jmir.org/2024/1/e60807 %U https://doi.org/10.2196/60807 %U http://www.ncbi.nlm.nih.gov/pubmed/39052324 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e59050 %T ChatGPT for Automated Qualitative Research: Content Analysis %A Bijker,Rimke %A Merkouris,Stephanie S %A Dowling,Nicki A %A Rodda,Simone N %+ Department of Psychology and Neuroscience, Auckland University of Technology, 90 Akoranga Drive, Auckland, 0627, New Zealand, 64 9921 9999 ext 29079, simone.rodda@aut.ac.nz %K ChatGPT %K natural language processing %K qualitative content analysis %K Theoretical Domains Framework %D 2024 %7 25.7.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Data analysis approaches such as qualitative content analysis are notoriously time and labor intensive because of the time to detect, assess, and code a large amount of data. Tools such as ChatGPT may have tremendous potential in automating at least some of the analysis. Objective: The aim of this study was to explore the utility of ChatGPT in conducting qualitative content analysis through the analysis of forum posts from people sharing their experiences on reducing their sugar consumption. Methods: Inductive and deductive content analysis were performed on 537 forum posts to detect mechanisms of behavior change. Thorough prompt engineering provided appropriate instructions for ChatGPT to execute data analysis tasks. Data identification involved extracting change mechanisms from a subset of forum posts. The precision of the extracted data was assessed through comparison with human coding. On the basis of the identified change mechanisms, coding schemes were developed with ChatGPT using data-driven (inductive) and theory-driven (deductive) content analysis approaches. 
The deductive approach was informed by the Theoretical Domains Framework using both an unconstrained coding scheme and a structured coding matrix. In total, 10 coding schemes were created from a subset of data and then applied to the full data set in 10 new conversations, resulting in 100 conversations each for inductive and unconstrained deductive analysis. A total of 10 further conversations coded the full data set into the structured coding matrix. Intercoder agreement was evaluated across and within coding schemes. ChatGPT output was also evaluated by the researchers to assess whether it reflected prompt instructions. Results: The precision of detecting change mechanisms in the data subset ranged from 66% to 88%. Overall κ scores for intercoder agreement ranged from 0.72 to 0.82 across inductive coding schemes and from 0.58 to 0.73 across unconstrained coding schemes and structured coding matrix. Coding into the best-performing coding scheme resulted in category-specific κ scores ranging from 0.67 to 0.95 for the inductive approach and from 0.13 to 0.87 for the deductive approaches. ChatGPT largely followed prompt instructions in producing a description of each coding scheme, although the wording for the inductively developed coding schemes was lengthier than specified. Conclusions: ChatGPT appears fairly reliable in assisting with qualitative analysis. ChatGPT performed better in developing an inductive coding scheme that emerged from the data than adapting an existing framework into an unconstrained coding scheme or coding directly into a structured matrix. The potential for ChatGPT to act as a second coder also appears promising, with almost perfect agreement in at least 1 coding scheme. The findings suggest that ChatGPT could prove useful as a tool to assist in each phase of qualitative content analysis, but multiple iterations are required to determine the reliability of each stage of analysis. %M 39052327 %R 10.2196/59050 %U https://www.jmir.org/2024/1/e59050 %U https://doi.org/10.2196/59050 %U http://www.ncbi.nlm.nih.gov/pubmed/39052327 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57721 %T Comparison of the Quality of Discharge Letters Written by Large Language Models and Junior Clinicians: Single-Blinded Study %A Tung,Joshua Yi Min %A Gill,Sunil Ravinder %A Sng,Gerald Gui Ren %A Lim,Daniel Yan Zheng %A Ke,Yuhe %A Tan,Ting Fang %A Jin,Liyuan %A Elangovan,Kabilan %A Ong,Jasmine Chiat Ling %A Abdullah,Hairil Rizal %A Ting,Daniel Shu Wei %A Chong,Tsung Wen %+ Department of Urology, Singapore General Hospital, 16 College Road, Block 4 Level 1, Singapore, 169854, Singapore, 65 62223322, joshua.tung@gmail.com %K artificial intelligence %K AI %K discharge summaries %K continuity of care %K large language model %K LLM %K junior clinician %K letter writing %K single-blinded %K ChatGPT %K urology %K primary care %K fictional electronic record %K consultation note %K referral letter %K simulated environment %D 2024 %7 24.7.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Discharge letters are a critical component in the continuity of care between specialists and primary care providers. However, these letters are time-consuming to write, underprioritized in comparison to direct clinical care, and are often tasked to junior doctors. Prior studies assessing the quality of discharge summaries written for inpatient hospital admissions show inadequacies in many domains. 
Large language models such as GPT have the ability to summarize large volumes of unstructured free text such as electronic medical records and have the potential to automate such tasks, providing time savings and consistency in quality. Objective: The aim of this study was to assess the performance of GPT-4 in generating discharge letters written from urology specialist outpatient clinics to primary care providers and to compare their quality against letters written by junior clinicians. Methods: Fictional electronic records were written by physicians simulating 5 common urology outpatient cases with long-term follow-up. Records comprised simulated consultation notes, referral letters and replies, and relevant discharge summaries from inpatient admissions. GPT-4 was tasked to write discharge letters for these cases with a specified target audience of primary care providers who would be continuing the patient’s care. Prompts were written for safety, content, and style. Concurrently, junior clinicians were provided with the same case records and instructional prompts. GPT-4 output was assessed for instances of hallucination. A blinded panel of primary care physicians then evaluated the letters using a standardized questionnaire tool. Results: GPT-4 outperformed human counterparts in information provision (mean 4.32, SD 0.95 vs 3.70, SD 1.27; P=.03) and had no instances of hallucination. There were no statistically significant differences in the mean clarity (4.16, SD 0.95 vs 3.68, SD 1.24; P=.12), collegiality (4.36, SD 1.00 vs 3.84, SD 1.22; P=.05), conciseness (3.60, SD 1.12 vs 3.64, SD 1.27; P=.71), follow-up recommendations (4.16, SD 1.03 vs 3.72, SD 1.13; P=.08), and overall satisfaction (3.96, SD 1.14 vs 3.62, SD 1.34; P=.36) between the letters generated by GPT-4 and humans, respectively. Conclusions: Discharge letters written by GPT-4 had equivalent quality to those written by junior clinicians, without any hallucinations. This study provides a proof of concept that large language models can be useful and safe tools in clinical documentation. %M 39047282 %R 10.2196/57721 %U https://www.jmir.org/2024/1/e57721 %U https://doi.org/10.2196/57721 %U http://www.ncbi.nlm.nih.gov/pubmed/39047282 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e58396 %T NVIDIA’s “Chat with RTX” Custom Large Language Model and Personalized AI Chatbot Augments the Value of Electronic Dermatology Reference Material %A Kamel Boulos,Maged N %A Dellavalle,Robert %+ School of Medicine, University of Lisbon, Av Prof Egas Moniz MB, Lisbon, 1649-028, Portugal, 351 920531573, mnkboulos@ieee.org %K AI chatbots %K artificial intelligence %K AI %K generative AI %K large language models %K dermatology %K education %K self-study %K NVIDIA RTX %K retrieval-augmented generation %K RAG %D 2024 %7 24.7.2024 %9 Editorial %J JMIR Dermatol %G English %X This paper demonstrates a new, promising method using generative artificial intelligence (AI) to augment the educational value of electronic textbooks and research papers (locally stored on user’s machine) and maximize their potential for self-study, in a way that goes beyond the standard electronic search and indexing that is already available in all of these textbooks and files. The presented method runs fully locally on the user’s machine, is generally affordable, and does not require high technical expertise to set up and customize with the user’s own content. 
%M 39047285 %R 10.2196/58396 %U https://derma.jmir.org/2024/1/e58396 %U https://doi.org/10.2196/58396 %U http://www.ncbi.nlm.nih.gov/pubmed/39047285 %0 Journal Article %@ 2292-9495 %I %V 11 %N %P e51086 %T AI Hesitancy and Acceptability—Perceptions of AI Chatbots for Chronic Health Management and Long COVID Support: Survey Study %A Wu,Philip Fei %A Summers,Charlotte %A Panesar,Arjun %A Kaura,Amit %A Zhang,Li %K AI hesitancy %K chatbot %K long COVID %K diabetes %K chronic disease management %K technology acceptance %K post–COVID-19 condition %K artificial intelligence %D 2024 %7 23.7.2024 %9 %J JMIR Hum Factors %G English %X Background: Artificial intelligence (AI) chatbots have the potential to assist individuals with chronic health conditions by providing tailored information, monitoring symptoms, and offering mental health support. Despite their potential benefits, research on public attitudes toward health care chatbots is still limited. To effectively support individuals with long-term health conditions like long COVID (or post–COVID-19 condition), it is crucial to understand their perspectives and preferences regarding the use of AI chatbots. Objective: This study has two main objectives: (1) provide insights into AI chatbot acceptance among people with chronic health conditions, particularly adults older than 55 years and (2) explore the perceptions of using AI chatbots for health self-management and long COVID support. Methods: A web-based survey study was conducted between January and March 2023, specifically targeting individuals with diabetes and other chronic conditions. This particular population was chosen due to their potential awareness and ability to self-manage their condition. The survey aimed to capture data at multiple intervals, taking into consideration the public launch of ChatGPT, which could have potentially impacted public opinions during the project timeline. The survey received 1310 clicks and garnered 900 responses, resulting in a total of 888 usable data points. Results: Although past experience with chatbots (P<.001, 95% CI .110-.302) and online information seeking (P<.001, 95% CI .039-.084) are strong indicators of respondents’ future adoption of health chatbots, they are in general skeptical or unsure about the use of AI chatbots for health care purposes. Less than one-third of the respondents (n=203, 30.1%) indicated that they were likely to use a health chatbot in the next 12 months if available. Most were uncertain about a chatbot’s capability to provide accurate medical advice. However, people seemed more receptive to using voice-based chatbots for mental well-being, health data collection, and analysis. Half of the respondents with long COVID showed interest in using emotionally intelligent chatbots. Conclusions: AI hesitancy is not uniform across all health domains and user groups. Despite persistent AI hesitancy, there are promising opportunities for chatbots to offer support for chronic conditions in areas of lifestyle enhancement and mental well-being, potentially through voice-based user interfaces. 
%R 10.2196/51086 %U https://humanfactors.jmir.org/2024/1/e51086 %U https://doi.org/10.2196/51086 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52818 %T Appraisal of ChatGPT’s Aptitude for Medical Education: Comparative Analysis With Third-Year Medical Students in a Pulmonology Examination %A Cherif,Hela %A Moussa,Chirine %A Missaoui,Abdel Mouhaymen %A Salouage,Issam %A Mokaddem,Salma %A Dhahri,Besma %+ Faculté de Médecine de Tunis, Université de Tunis El Manar, 15, Rue Djebel Lakhdhar – Bab Saadoun, Tunis, 1007, Tunisia, 216 50424534, hela.cherif@fmt.utm.tn %K medical education %K ChatGPT %K GPT %K artificial intelligence %K natural language processing %K NLP %K pulmonary medicine %K pulmonary %K lung %K lungs %K respiratory %K respiration %K pneumology %K comparative analysis %K large language models %K LLMs %K LLM %K language model %K generative AI %K generative artificial intelligence %K generative %K exams %K exam %K examinations %K examination %D 2024 %7 23.7.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The rapid evolution of ChatGPT has generated substantial interest and led to extensive discussions in both public and academic domains, particularly in the context of medical education. Objective: This study aimed to evaluate ChatGPT’s performance in a pulmonology examination through a comparative analysis with that of third-year medical students. Methods: In this cross-sectional study, we conducted a comparative analysis with 2 distinct groups. The first group comprised 244 third-year medical students who had previously taken our institution’s 2020 pulmonology examination, which was conducted in French. The second group involved ChatGPT-3.5 in 2 separate sets of conversations: without contextualization (V1) and with contextualization (V2). In both V1 and V2, ChatGPT received the same set of questions administered to the students. Results: V1 demonstrated exceptional proficiency in radiology, microbiology, and thoracic surgery, surpassing the majority of medical students in these domains. However, it faced challenges in pathology, pharmacology, and clinical pneumology. In contrast, V2 consistently delivered more accurate responses across various question categories, regardless of the specialization. ChatGPT exhibited suboptimal performance in multiple choice questions compared to medical students. V2 excelled in responding to structured open-ended questions. Both ChatGPT conversations, particularly V2, outperformed students in addressing questions of low and intermediate difficulty. Interestingly, students showcased enhanced proficiency when confronted with highly challenging questions. V1 fell short of passing the examination. Conversely, V2 successfully achieved examination success, outperforming 139 (62.1%) medical students. Conclusions: While ChatGPT has access to a comprehensive web-based data set, its performance closely mirrors that of an average medical student. Outcomes are influenced by question format, item complexity, and contextual nuances. The model faces challenges in medical contexts requiring information synthesis, advanced analytical aptitude, and clinical judgment, as well as in non-English language assessments and when confronted with data outside mainstream internet sources. 
%M 39042876 %R 10.2196/52818 %U https://mededu.jmir.org/2024/1/e52818 %U https://doi.org/10.2196/52818 %U http://www.ncbi.nlm.nih.gov/pubmed/39042876 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56930 %T Roles, Users, Benefits, and Limitations of Chatbots in Health Care: Rapid Review %A Laymouna,Moustafa %A Ma,Yuanchao %A Lessard,David %A Schuster,Tibor %A Engler,Kim %A Lebouché,Bertrand %+ Centre for Outcomes Research and Evaluation, Research Institute of the McGill University Health Centre, D02.4110 – Glen Site, 1001 Decarie Blvd, Montreal, QC, H4A 3J1, Canada, 1 514 843 2090, bertrand.lebouche@mcgill.ca %K chatbot %K conversational agent %K conversational assistant %K user-computer interface %K digital health %K mobile health %K electronic health %K telehealth %K artificial intelligence %K AI %K health information technology %D 2024 %7 23.7.2024 %9 Review %J J Med Internet Res %G English %X Background: Chatbots, or conversational agents, have emerged as significant tools in health care, driven by advancements in artificial intelligence and digital technology. These programs are designed to simulate human conversations, addressing various health care needs. However, no comprehensive synthesis of health care chatbots’ roles, users, benefits, and limitations is available to inform future research and application in the field. Objective: This review aims to describe health care chatbots’ characteristics, focusing on their diverse roles in the health care pathway, user groups, benefits, and limitations. Methods: A rapid review of published literature from 2017 to 2023 was performed with a search strategy developed in collaboration with a health sciences librarian and implemented in the MEDLINE and Embase databases. Primary research studies reporting on chatbot roles or benefits in health care were included. Two reviewers dual-screened the search results. Extracted data on chatbot roles, users, benefits, and limitations were subjected to content analysis. Results: The review categorized chatbot roles into 2 themes: delivery of remote health services, including patient support, care management, education, skills building, and health behavior promotion, and provision of administrative assistance to health care providers. User groups spanned across patients with chronic conditions as well as patients with cancer; individuals focused on lifestyle improvements; and various demographic groups such as women, families, and older adults. Professionals and students in health care also emerged as significant users, alongside groups seeking mental health support, behavioral change, and educational enhancement. The benefits of health care chatbots were also classified into 2 themes: improvement of health care quality and efficiency and cost-effectiveness in health care delivery. The identified limitations encompassed ethical challenges, medicolegal and safety concerns, technical difficulties, user experience issues, and societal and economic impacts. Conclusions: Health care chatbots offer a wide spectrum of applications, potentially impacting various aspects of health care. While they are promising tools for improving health care efficiency and quality, their integration into the health care system must be approached with consideration of their limitations to ensure optimal, safe, and equitable use. 
%M 39042446 %R 10.2196/56930 %U https://www.jmir.org/2024/1/e56930 %U https://doi.org/10.2196/56930 %U http://www.ncbi.nlm.nih.gov/pubmed/39042446 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55927 %T Benchmarking State-of-the-Art Large Language Models for Migraine Patient Education: Performance Comparison of Responses to Common Queries %A Li,Linger %A Li,Pengfei %A Wang,Kun %A Zhang,Liang %A Ji,Hongwei %A Zhao,Hongqin %+ Department of Neurology, The Affiliated Hospital of Qingdao University, No. 59 Haier Road, Qingdao, 266035, China, 86 13864873935, zhaohongq@qdu.edu.cn %K migraine %K large language models %K patient education %K ChatGPT %K Google Bard %K language model %K patient education %K education %K headache %K accuracy %K OpenAI %K AI %K artificial intelligence %K AI-assisted %K holistic %K migraine management %K management %D 2024 %7 23.7.2024 %9 Research Letter %J J Med Internet Res %G English %X This study assessed the potential of large language models (OpenAI’s ChatGPT 3.5 and 4.0, Google Bard, Meta Llama2, and Anthropic Claude2) in addressing 30 common migraine-related queries, providing a foundation to advance artificial intelligence–assisted patient education and insights for a holistic approach to migraine management. %M 38828692 %R 10.2196/55927 %U https://www.jmir.org/2024/1/e55927 %U https://doi.org/10.2196/55927 %U http://www.ncbi.nlm.nih.gov/pubmed/38828692 %0 Journal Article %@ 2291-5222 %I JMIR Publications %V 12 %N %P e57318 %T Conversational Chatbot for Cigarette Smoking Cessation: Results From the 11-Step User-Centered Design Development Process and Randomized Controlled Trial %A Bricker,Jonathan B %A Sullivan,Brianna %A Mull,Kristin %A Santiago-Torres,Margarita %A Lavista Ferres,Juan M %+ Division of Public Health Sciences, Fred Hutch Cancer Center, 1100 Fairview Avenue N, Seattle, WA, 98109, United States, 1 206 667 5074, jbricker@fredhutch.org %K chatbot %K conversational agent %K conversational agents %K digital therapeutics %K smoking cessation %K development %K develop %K design %K smoking %K smoke %K smokers %K quit %K quitting %K cessation %K chatbots %K large language model %K LLM %K LLMs %K large language models %K addict %K addiction %K addictions %K mobile phone %D 2024 %7 23.7.2024 %9 Original Paper %J JMIR Mhealth Uhealth %G English %X Background: Conversational chatbots are an emerging digital intervention for smoking cessation. No studies have reported on the entire development process of a cessation chatbot. Objective: We aim to report results of the user-centered design development process and randomized controlled trial for a novel and comprehensive quit smoking conversational chatbot called QuitBot. Methods: The 4 years of formative research for developing QuitBot followed an 11-step process: (1) specifying a conceptual model; (2) conducting content analysis of existing interventions (63 hours of intervention transcripts); (3) assessing user needs; (4) developing the chat’s persona (“personality”); (5) prototyping content and persona; (6) developing full functionality; (7) programming the QuitBot; (8) conducting a diary study; (9) conducting a pilot randomized controlled trial (RCT); (10) reviewing results of the RCT; and (11) adding a free-form question and answer (QnA) function, based on user feedback from pilot RCT results. The process of adding a QnA function itself involved a three-step process: (1) generating QnA pairs, (2) fine-tuning large language models (LLMs) on QnA pairs, and (3) evaluating the LLM outputs. 
Results: We developed a quit smoking program spanning 42 days of 2- to 3-minute conversations covering topics ranging from motivations to quit, setting a quit date, choosing Food and Drug Administration–approved cessation medications, coping with triggers, and recovering from lapses and relapses. In a pilot RCT with 96% three-month outcome data retention, QuitBot demonstrated high user engagement and promising cessation rates compared to the National Cancer Institute’s SmokefreeTXT text messaging program, particularly among those who viewed all 42 days of program content: 30-day, complete-case, point prevalence abstinence rates at 3-month follow-up were 63% (39/62) for QuitBot versus 38.5% (45/117) for SmokefreeTXT (odds ratio 2.58, 95% CI 1.34-4.99; P=.005). However, Facebook Messenger intermittently blocked participants’ access to QuitBot, so we transitioned from Facebook Messenger to a stand-alone smartphone app as the communication channel. Participants’ frustration with QuitBot’s inability to answer their open-ended questions led us to develop a core conversational feature, enabling users to ask open-ended questions about quitting cigarette smoking and for the QuitBot to respond with accurate and professional answers. To support this functionality, we developed a library of 11,000 QnA pairs on topics associated with quitting cigarette smoking. Model testing results showed that Microsoft’s Azure-based QnA maker effectively handled questions that matched our library of 11,000 QnA pairs. A fine-tuned, contextualized GPT-3.5 (OpenAI) responds to questions that are not within our library of QnA pairs. Conclusions: The development process yielded the first LLM-based quit smoking program delivered as a conversational chatbot. Iterative testing led to significant enhancements, including improvements to the delivery channel. A pivotal addition was the inclusion of a core LLM–supported conversational feature allowing users to ask open-ended questions. Trial Registration: ClinicalTrials.gov NCT03585231; https://clinicaltrials.gov/study/NCT03585231 %M 38913882 %R 10.2196/57318 %U https://mhealth.jmir.org/2024/1/e57318 %U https://doi.org/10.2196/57318 %U http://www.ncbi.nlm.nih.gov/pubmed/38913882 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e58158 %T Evaluating and Enhancing Large Language Models’ Performance in Domain-Specific Medicine: Development and Usability Study With DocOA %A Chen,Xi %A Wang,Li %A You,MingKe %A Liu,WeiZhi %A Fu,Yu %A Xu,Jie %A Zhang,Shaoting %A Chen,Gang %A Li,Kang %A Li,Jian %+ Sports Medicine Center, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Wuhou District, Chengdu, 610041, China, 86 18980601388, lijian_sportsmed@163.com %K large language model %K retrieval-augmented generation %K domain-specific benchmark framework %K osteoarthritis management %D 2024 %7 22.7.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study. Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. 
DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and a specialized assistant, DocOA, using objective and human evaluations. Results: Results showed that general LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs. %M 38833165 %R 10.2196/58158 %U https://www.jmir.org/2024/1/e58158 %U https://doi.org/10.2196/58158 %U http://www.ncbi.nlm.nih.gov/pubmed/38833165 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e55799 %T Evaluating Large Language Models for Automated Reporting and Data Systems Categorization: Cross-Sectional Study %A Wu,Qingxia %A Wu,Qingxia %A Li,Huali %A Wang,Yan %A Bai,Yan %A Wu,Yaping %A Yu,Xuan %A Li,Xiaodong %A Dong,Pei %A Xue,Jon %A Shen,Dinggang %A Wang,Meiyun %+ Department of Medical Imaging, Henan Provincial People’s Hospital & People’s Hospital of Zhengzhou University, No 7, Weiwu Road, Jinshui District, Zhengzhou, 450001, China, 86 037165580267, mywang@zzu.edu.cn %K Radiology Reporting and Data Systems %K LI-RADS %K Lung-RADS %K O-RADS %K large language model %K ChatGPT %K chatbot %K chatbots %K categorization %K recommendation %K recommendations %K accuracy %D 2024 %7 17.7.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Large language models show promise for improving radiology workflows, but their performance on structured radiological tasks such as Reporting and Data Systems (RADS) categorization remains unexplored. Objective: This study aims to evaluate 3 large language model chatbots—Claude-2, GPT-3.5, and GPT-4—on assigning RADS categories to radiology reports and assess the impact of different prompting strategies. Methods: This cross-sectional study compared 3 chatbots using 30 radiology reports (10 per RADS criteria), using a 3-level prompting strategy: zero-shot, few-shot, and guideline PDF-informed prompts. The cases were grounded in Liver Imaging Reporting & Data System (LI-RADS) version 2018, Lung CT (computed tomography) Screening Reporting & Data System (Lung-RADS) version 2022, and Ovarian-Adnexal Reporting & Data System (O-RADS) magnetic resonance imaging, meticulously prepared by board-certified radiologists. Each report underwent 6 assessments. Two blinded reviewers assessed the chatbots’ response at patient-level RADS categorization and overall ratings. The agreement across repetitions was assessed using Fleiss κ. Results: Claude-2 achieved the highest accuracy in overall ratings with few-shot prompts and guideline PDFs (prompt-2), attaining 57% (17/30) average accuracy over 6 runs and 50% (15/30) accuracy with k-pass voting. Without prompt engineering, all chatbots performed poorly. The introduction of a structured exemplar prompt (prompt-1) increased the accuracy of overall ratings for all chatbots. 
Providing prompt-2 further improved Claude-2’s performance, an enhancement not replicated by GPT-4. The interrun agreement was substantial for Claude-2 (k=0.66 for overall rating and k=0.69 for RADS categorization), fair for GPT-4 (k=0.39 for both), and fair for GPT-3.5 (k=0.21 for overall rating and k=0.39 for RADS categorization). All chatbots showed significantly higher accuracy with LI-RADS version 2018 than with Lung-RADS version 2022 and O-RADS (P<.05); with prompt-2, Claude-2 achieved the highest overall rating accuracy of 75% (45/60) in LI-RADS version 2018. Conclusions: When equipped with structured prompts and guideline PDFs, Claude-2 demonstrated potential in assigning RADS categories to radiology cases according to established criteria such as LI-RADS version 2018. However, the current generation of chatbots lags in accurately categorizing cases based on more recent RADS criteria. %M 39018102 %R 10.2196/55799 %U https://medinform.jmir.org/2024/1/e55799 %U https://doi.org/10.2196/55799 %U http://www.ncbi.nlm.nih.gov/pubmed/39018102 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e51282 %T Assessing GPT-4’s Performance in Delivering Medical Advice: Comparative Analysis With Human Experts %A Jo,Eunbeen %A Song,Sanghoun %A Kim,Jong-Ho %A Lim,Subin %A Kim,Ju Hyeon %A Cha,Jung-Joon %A Kim,Young-Min %A Joo,Hyung Joon %K GPT-4 %K medical advice %K ChatGPT %K cardiology %K cardiologist %K heart %K advice %K recommendation %K recommendations %K linguistic %K linguistics %K artificial intelligence %K NLP %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K response %K responses %D 2024 %7 8.7.2024 %9 %J JMIR Med Educ %G English %X Background: Accurate medical advice is paramount in ensuring optimal patient care, and misinformation can lead to misguided decisions with potentially detrimental health outcomes. The emergence of large language models (LLMs) such as OpenAI’s GPT-4 has spurred interest in their potential health care applications, particularly in automated medical consultation. Yet, rigorous investigations comparing their performance to human experts remain sparse. Objective: This study aims to compare the medical accuracy of GPT-4 with human experts in providing medical advice using real-world user-generated queries, with a specific focus on cardiology. It also sought to analyze the performance of GPT-4 and human experts in specific question categories, including drug or medication information and preliminary diagnoses. Methods: We collected 251 pairs of cardiology-specific questions from general users and answers from human experts via an internet portal. GPT-4 was tasked with generating responses to the same questions. Three independent cardiologists (SL, JHK, and JJC) evaluated the answers provided by both human experts and GPT-4. Using a computer interface, each evaluator compared the pairs and determined which answer was superior, and they quantitatively measured the clarity and complexity of the questions as well as the accuracy and appropriateness of the responses, applying a 3-tiered grading scale (low, medium, and high). Furthermore, a linguistic analysis was conducted to compare the length and vocabulary diversity of the responses using word count and type-token ratio. Results: GPT-4 and human experts displayed comparable efficacy in medical accuracy (“GPT-4 is better” at 132/251, 52.6% vs “Human expert is better” at 119/251, 47.4%). 
In accuracy level categorization, humans had more high-accuracy responses than GPT-4 (50/237, 21.1% vs 30/238, 12.6%) but also a greater proportion of low-accuracy responses (11/237, 4.6% vs 1/238, 0.4%; P=.001). GPT-4 responses were generally longer and used a less diverse vocabulary than those of human experts, potentially enhancing their comprehensibility for general users (sentence count: mean 10.9, SD 4.2 vs mean 5.9, SD 3.7; P<.001; type-token ratio: mean 0.69, SD 0.07 vs mean 0.79, SD 0.09; P<.001). Nevertheless, human experts outperformed GPT-4 in specific question categories, notably those related to drug or medication information and preliminary diagnoses. These findings highlight the limitations of GPT-4 in providing advice based on clinical experience. Conclusions: GPT-4 has shown promising potential in automated medical consultation, with comparable medical accuracy to human experts. However, challenges remain particularly in the realm of nuanced clinical judgment. Future improvements in LLMs may require the integration of specific clinical reasoning pathways and regulatory oversight for safe use. Further research is needed to understand the full potential of LLMs across various medical specialties and conditions. %R 10.2196/51282 %U https://mededu.jmir.org/2024/1/e51282 %U https://doi.org/10.2196/51282 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e53308 %T The Ability of ChatGPT in Paraphrasing Texts and Reducing Plagiarism: A Descriptive Analysis %A Hassanipour,Soheil %A Nayak,Sandeep %A Bozorgi,Ali %A Keivanlou,Mohammad-Hossein %A Dave,Tirth %A Alotaibi,Abdulhadi %A Joukar,Farahnaz %A Mellatdoust,Parinaz %A Bakhshi,Arash %A Kuriyakose,Dona %A Polisetty,Lakshmi D %A Chimpiri,Mallika %A Amini-Salehi,Ehsan %K ChatGPT %K paraphrasing %K text generation %K prompts %K academic journals %K plagiarize %K plagiarism %K paraphrase %K wording %K LLM %K LLMs %K language model %K language models %K prompt %K generative %K artificial intelligence %K NLP %K natural language processing %K rephrase %K plagiarizing %K honesty %K integrity %K texts %K text %K textual %K generation %K large language model %K large language models %D 2024 %7 8.7.2024 %9 %J JMIR Med Educ %G English %X Background: The introduction of ChatGPT by OpenAI has garnered significant attention. Among its capabilities, paraphrasing stands out. Objective: This study aims to investigate the satisfactory levels of plagiarism in the paraphrased text produced by this chatbot. Methods: Three texts of varying lengths were presented to ChatGPT. ChatGPT was then instructed to paraphrase the provided texts using five different prompts. In the subsequent stage of the study, the texts were divided into separate paragraphs, and ChatGPT was requested to paraphrase each paragraph individually. Lastly, in the third stage, ChatGPT was asked to paraphrase the texts it had previously generated. Results: The average plagiarism rate in the texts generated by ChatGPT was 45% (SD 10%). ChatGPT exhibited a substantial reduction in plagiarism for the provided texts (mean difference −0.51, 95% CI −0.54 to −0.48; P<.001). Furthermore, when comparing the second attempt with the initial attempt, a significant decrease in the plagiarism rate was observed (mean difference −0.06, 95% CI −0.08 to −0.03; P<.001). The number of paragraphs in the texts demonstrated a noteworthy association with the percentage of plagiarism, with texts consisting of a single paragraph exhibiting the lowest plagiarism rate (P<.001). 
Conclusion: Although ChatGPT demonstrates a notable reduction of plagiarism within texts, the existing levels of plagiarism remain relatively high. This underscores a crucial caution for researchers when incorporating this chatbot into their work. %R 10.2196/53308 %U https://mededu.jmir.org/2024/1/e53308 %U https://doi.org/10.2196/53308 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56110 %T ChatGPT With GPT-4 Outperforms Emergency Department Physicians in Diagnostic Accuracy: Retrospective Analysis %A Hoppe,John Michael %A Auer,Matthias K %A Strüven,Anna %A Massberg,Steffen %A Stremmel,Christopher %+ Department of Medicine I, LMU University Hospital, Marchioninistr 15, Munich, 81377, Germany, 49 89 4400 712622, christopher.stremmel@med.uni-muenchen.de %K emergency department %K diagnosis %K accuracy %K artificial intelligence %K ChatGPT %K internal medicine %K AI %K natural language processing %K NLP %K emergency medicine triage %K triage %K physicians %K physician %K diagnostic accuracy %K OpenAI %D 2024 %7 8.7.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: OpenAI’s ChatGPT is a pioneering artificial intelligence (AI) in the field of natural language processing, and it holds significant potential in medicine for providing treatment advice. Additionally, recent studies have demonstrated promising results using ChatGPT for emergency medicine triage. However, its diagnostic accuracy in the emergency department (ED) has not yet been evaluated. Objective: This study compares the diagnostic accuracy of ChatGPT with GPT-3.5 and GPT-4 and primary treating resident physicians in an ED setting. Methods: Among 100 adults admitted to our ED in January 2023 with internal medicine issues, the diagnostic accuracy was assessed by comparing the diagnoses made by ED resident physicians and those made by ChatGPT with GPT-3.5 or GPT-4 against the final hospital discharge diagnosis, using a point system for grading accuracy. Results: The study enrolled 100 patients with a median age of 72 (IQR 58.5-82.0) years who were admitted to our internal medicine ED primarily for cardiovascular, endocrine, gastrointestinal, or infectious diseases. GPT-4 outperformed both GPT-3.5 (P<.001) and ED resident physicians (P=.01) in diagnostic accuracy for internal medicine emergencies. Furthermore, across various disease subgroups, GPT-4 consistently outperformed GPT-3.5 and resident physicians. It demonstrated significant superiority in cardiovascular (GPT-4 vs ED physicians: P=.03) and endocrine or gastrointestinal diseases (GPT-4 vs GPT-3.5: P=.01). However, in other categories, the differences were not statistically significant. Conclusions: In this study, which compared the diagnostic accuracy of GPT-3.5, GPT-4, and ED resident physicians against a discharge diagnosis gold standard, GPT-4 outperformed both the resident physicians and its predecessor, GPT-3.5. Despite the retrospective design of the study and its limited sample size, the results underscore the potential of AI as a supportive diagnostic tool in ED settings. 
%M 38976865 %R 10.2196/56110 %U https://www.jmir.org/2024/1/e56110 %U https://doi.org/10.2196/56110 %U http://www.ncbi.nlm.nih.gov/pubmed/38976865 %0 Journal Article %@ 2368-7959 %I %V 11 %N %P e56569 %T The Role of Humanization and Robustness of Large Language Models in Conversational Artificial Intelligence for Individuals With Depression: A Critical Analysis %A Ferrario,Andrea %A Sedlakova,Jana %A Trachsel,Manuel %K generative AI %K large language models %K large language model %K LLM %K LLMs %K machine learning %K ML %K natural language processing %K NLP %K deep learning %K depression %K mental health %K mental illness %K mental disease %K mental diseases %K mental illnesses %K artificial intelligence %K AI %K digital health %K digital technology %K digital intervention %K digital interventions %K ethics %D 2024 %7 2.7.2024 %9 %J JMIR Ment Health %G English %X Large language model (LLM)–powered services are gaining popularity in various applications due to their exceptional performance in many tasks, such as sentiment analysis and answering questions. Recently, research has been exploring their potential use in digital health contexts, particularly in the mental health domain. However, implementing LLM-enhanced conversational artificial intelligence (CAI) presents significant ethical, technical, and clinical challenges. In this viewpoint paper, we discuss 2 challenges that affect the use of LLM-enhanced CAI for individuals with mental health issues, focusing on the use case of patients with depression: the tendency to humanize LLM-enhanced CAI and their lack of contextualized robustness. Our approach is interdisciplinary, relying on considerations from philosophy, psychology, and computer science. We argue that the humanization of LLM-enhanced CAI hinges on the reflection of what it means to simulate “human-like” features with LLMs and what role these systems should play in interactions with humans. Further, ensuring the contextualization of the robustness of LLMs requires considering the specificities of language production in individuals with depression, as well as its evolution over time. Finally, we provide a series of recommendations to foster the responsible design and deployment of LLM-enhanced CAI for the therapeutic support of individuals with depression. %R 10.2196/56569 %U https://mental.jmir.org/2024/1/e56569 %U https://doi.org/10.2196/56569 %0 Journal Article %@ 2291-9694 %I %V 12 %N %P e57674 %T Data Set and Benchmark (MedGPTEval) to Evaluate Responses From Large Language Models in Medicine: Evaluation Development and Validation %A Xu,Jie %A Lu,Lu %A Peng,Xinwei %A Pang,Jiali %A Ding,Jinru %A Yang,Lingrui %A Song,Huan %A Li,Kang %A Sun,Xin %A Zhang,Shaoting %K ChatGPT %K LLM %K assessment %K data set %K benchmark %K medicine %D 2024 %7 28.6.2024 %9 %J JMIR Med Inform %G English %X Background: Large language models (LLMs) have achieved great progress in natural language processing tasks and demonstrated the potential for use in clinical applications. Despite their capabilities, LLMs in the medical domain are prone to generating hallucinations (not fully reliable responses). Hallucinations in LLMs’ responses create substantial risks, potentially threatening patients’ physical safety. Thus, to perceive and prevent this safety risk, it is essential to evaluate LLMs in the medical domain and build a systematic evaluation. Objective: We developed a comprehensive evaluation system, MedGPTEval, composed of criteria, medical data sets in Chinese, and publicly available benchmarks. 
Methods: First, a set of evaluation criteria was designed based on a comprehensive literature review. Second, existing candidate criteria were optimized by using a Delphi method with 5 experts in medicine and engineering. Third, 3 clinical experts designed medical data sets to interact with LLMs. Finally, benchmarking experiments were conducted on the data sets. The responses generated by chatbots based on LLMs were recorded for blind evaluations by 5 licensed medical experts. The evaluation criteria that were obtained covered medical professional capabilities, social comprehensive capabilities, contextual capabilities, and computational robustness, with 16 detailed indicators. The medical data sets include 27 medical dialogues and 7 case reports in Chinese. Three chatbots were evaluated: ChatGPT by OpenAI; ERNIE Bot by Baidu, Inc; and Doctor PuJiang (Dr PJ) by Shanghai Artificial Intelligence Laboratory. Results: Dr PJ outperformed ChatGPT and ERNIE Bot in the multiple-turn medical dialogues and case report scenarios. Dr PJ also outperformed ChatGPT in the semantic consistency rate and complete error rate category, indicating better robustness. However, Dr PJ had slightly lower scores in medical professional capabilities compared with ChatGPT in the multiple-turn dialogue scenario. Conclusions: MedGPTEval provides comprehensive criteria to evaluate chatbots by LLMs in the medical domain, open-source data sets, and benchmarks assessing 3 LLMs. Experimental results demonstrate that Dr PJ outperforms ChatGPT and ERNIE Bot in social and professional contexts. Therefore, such an assessment system can be easily adopted by researchers in this community to augment an open-source data set. %R 10.2196/57674 %U https://medinform.jmir.org/2024/1/e57674 %U https://doi.org/10.2196/57674 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54571 %T Assessing Generative Pretrained Transformers (GPT) in Clinical Decision-Making: Comparative Analysis of GPT-3.5 and GPT-4 %A Lahat,Adi %A Sharif,Kassem %A Zoabi,Narmin %A Shneor Patt,Yonatan %A Sharif,Yousra %A Fisher,Lior %A Shani,Uria %A Arow,Mohamad %A Levin,Roni %A Klang,Eyal %+ Department of Gastroenterology, Chaim Sheba Medical Center, Affiliated with Tel Aviv University, Tel Hashomer, Ramat Gan, 5262100, Israel, 972 5302060, zokadi@gmail.com %K ChatGPT %K chat-GPT %K chatbot %K chatbots %K chat-bot %K chat-bots %K natural language processing %K NLP %K artificial intelligence %K AI %K machine learning %K ML %K algorithm %K algorithms %K predictive model %K predictive models %K predictive analytics %K predictive system %K practical model %K practical models %K internal medicine %K ethics %K ethical %K ethical dilemma %K ethical dilemmas %K bioethics %K emergency medicine %K EM medicine %K ED physician %K emergency physician %K emergency doctor %D 2024 %7 27.6.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence, particularly chatbot systems, is becoming an instrumental tool in health care, aiding clinical decision-making and patient engagement. Objective: This study aims to analyze the performance of ChatGPT-3.5 and ChatGPT-4 in addressing complex clinical and ethical dilemmas, and to illustrate their potential role in health care decision-making while comparing seniors’ and residents’ ratings, and specific question types. Methods: A total of 4 specialized physicians formulated 176 real-world clinical questions. 
A total of 8 senior physicians and residents assessed responses from GPT-3.5 and GPT-4 on a 1-5 scale across 5 categories: accuracy, relevance, clarity, utility, and comprehensiveness. Evaluations were conducted within internal medicine, emergency medicine, and ethics. Comparisons were made globally, between seniors and residents, and across classifications. Results: Both GPT models received high mean scores (4.4, SD 0.8 for GPT-4 and 4.1, SD 1.0 for GPT-3.5). GPT-4 outperformed GPT-3.5 across all rating dimensions, with seniors consistently rating responses higher than residents for both models. Specifically, seniors rated GPT-4 as more beneficial and complete (mean 4.6 vs 4.0 and 4.6 vs 4.1, respectively; P<.001), and GPT-3.5 similarly (mean 4.1 vs 3.7 and 3.9 vs 3.5, respectively; P<.001). Ethical queries received the highest ratings for both models, with mean scores reflecting consistency across accuracy and completeness criteria. Distinctions among question types were significant, particularly for the GPT-4 mean scores in completeness across emergency, internal, and ethical questions (4.2, SD 1.0; 4.3, SD 0.8; and 4.5, SD 0.7, respectively; P<.001), and for GPT-3.5’s accuracy, beneficial, and completeness dimensions. Conclusions: ChatGPT’s potential to assist physicians with medical issues is promising, with prospects to enhance diagnostics, treatments, and ethics. While integration into clinical workflows may be valuable, it must complement, not replace, human expertise. Continued research is essential to ensure safe and effective implementation in clinical environments. %M 38935937 %R 10.2196/54571 %U https://www.jmir.org/2024/1/e54571 %U https://doi.org/10.2196/54571 %U http://www.ncbi.nlm.nih.gov/pubmed/38935937 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52001 %T Assessing the Reproducibility of the Structured Abstracts Generated by ChatGPT and Bard Compared to Human-Written Abstracts in the Field of Spine Surgery: Comparative Analysis %A Kim,Hong Jin %A Yang,Jae Hyuk %A Chang,Dong-Gune %A Lenke,Lawrence G %A Pizones,Javier %A Castelein,René %A Watanabe,Kota %A Trobisch,Per D %A Mundis Jr,Gregory M %A Suh,Seung Woo %A Suk,Se-Il %+ Department of Orthopedic Surgery, Inje University Sanggye Paik Hospital, College of Medicine, Inje University, 1342, Dongil-Ro, Nowon-Gu, Seoul, 01757, Republic of Korea, 82 2 950 1284, dgchangmd@gmail.com %K artificial intelligence %K AI %K ChatGPT %K Bard %K scientific abstract %K orthopedic surgery %K spine %K journal guidelines %K plagiarism %K ethics %K spine surgery %K surgery %K language model %K chatbot %K formatting guidelines %K abstract %D 2024 %7 26.6.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Due to recent advances in artificial intelligence (AI), language model applications can generate logical text output that is difficult to distinguish from human writing. ChatGPT (OpenAI) and Bard (subsequently rebranded as “Gemini”; Google AI) were developed using distinct approaches, but little has been studied about the difference in their capability to generate the abstract. The use of AI to write scientific abstracts in the field of spine surgery is the center of much debate and controversy. Objective: The objective of this study is to assess the reproducibility of the structured abstracts generated by ChatGPT and Bard compared to human-written abstracts in the field of spine surgery. 
Methods: In total, 60 abstracts dealing with spine sections were randomly selected from 7 reputable journals and used as ChatGPT and Bard input statements to generate abstracts based on supplied paper titles. A total of 174 abstracts, divided into human-written abstracts, ChatGPT-generated abstracts, and Bard-generated abstracts, were evaluated for compliance with the structured format of journal guidelines and consistency of content. The likelihood of plagiarism and AI output was assessed using the iThenticate and ZeroGPT programs, respectively. A total of 8 reviewers in the spinal field evaluated 30 randomly extracted abstracts to determine whether they were produced by AI or human authors. Results: The proportion of abstracts that met journal formatting guidelines was greater among ChatGPT abstracts (34/60, 56.6%) compared with those generated by Bard (6/54, 11.1%; P<.001). However, a higher proportion of Bard abstracts (49/54, 90.7%) had word counts that met journal guidelines compared with ChatGPT abstracts (30/60, 50%; P<.001). The similarity index was significantly lower among ChatGPT-generated abstracts (20.7%) compared with Bard-generated abstracts (32.1%; P<.001). The AI-detection program predicted that 21.7% (13/60) of the human group, 63.3% (38/60) of the ChatGPT group, and 87% (47/54) of the Bard group were possibly generated by AI, with an area under the curve value of 0.863 (P<.001). The mean detection rate by human reviewers was 53.8% (SD 11.2%), achieving a sensitivity of 56.3% and a specificity of 48.4%. A total of 56.3% (63/112) of the actual human-written abstracts and 55.9% (62/128) of AI-generated abstracts were recognized as human-written and AI-generated by human reviewers, respectively. Conclusions: Both ChatGPT and Bard can be used to help write abstracts, but most AI-generated abstracts are currently considered unethical due to high plagiarism and AI-detection rates. ChatGPT-generated abstracts appear to be superior to Bard-generated abstracts in meeting journal formatting guidelines. Because humans are unable to accurately distinguish abstracts written by humans from those produced by AI programs, it is crucial to exercise special caution and examine the ethical boundaries of using AI programs, including ChatGPT and Bard. %M 38924787 %R 10.2196/52001 %U https://www.jmir.org/2024/1/e52001 %U https://doi.org/10.2196/52001 %U http://www.ncbi.nlm.nih.gov/pubmed/38924787 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54607 %T Multimodal ChatGPT-4V for Electrocardiogram Interpretation: Promise and Limitations %A Zhu,Lingxuan %A Mou,Weiming %A Wu,Keren %A Lai,Yancheng %A Lin,Anqi %A Yang,Tao %A Zhang,Jian %A Luo,Peng %+ Department of Oncology, Zhujiang Hospital, Southern Medical University, 253 Industrial Avenue, Guangzhou, 510282, China, 86 020 61643888, luopeng@smu.edu.cn %K ChatGPT %K ECG %K electrocardiogram %K multimodal %K artificial intelligence %K AI %K large language model %K diagnostic %K quantitative analysis %K clinical %K clinicians %K ECG interpretation %K cardiovascular care %K cardiovascular %D 2024 %7 26.6.2024 %9 Research Letter %J J Med Internet Res %G English %X This study evaluated the capabilities of the newly released ChatGPT-4V, a large language model with visual recognition abilities, in interpreting electrocardiogram waveforms and answering related multiple-choice questions for assisting with cardiovascular care. 
%M 38764297 %R 10.2196/54607 %U https://www.jmir.org/2024/1/e54607 %U https://doi.org/10.2196/54607 %U http://www.ncbi.nlm.nih.gov/pubmed/38764297 %0 Journal Article %@ 2291-5222 %I JMIR Publications %V 12 %N %P e54945 %T A Chatbot-Delivered Stress Management Coaching for Students (MISHA App): Pilot Randomized Controlled Trial %A Ulrich,Sandra %A Lienhard,Natascha %A Künzli,Hansjörg %A Kowatsch,Tobias %+ School of Applied Psychology, Zurich University of Applied Sciences, Pfingstweidstrasse 96, Zurich, 8005, Switzerland, 41 58 934 ext 8451, sandra.ulrich@zhaw.ch %K conversational agent %K mobile health %K mHealth %K smartphone %K stress management %K lifestyle %K behavior change %K coaching %K mobile phone %D 2024 %7 26.6.2024 %9 Original Paper %J JMIR Mhealth Uhealth %G English %X Background: Globally, students face increasing mental health challenges, including elevated stress levels and declining well-being, leading to academic performance issues and mental health disorders. However, due to stigma and symptom underestimation, students rarely seek effective stress management solutions. Conversational agents in the health sector have shown promise in reducing stress, depression, and anxiety. Nevertheless, research on their effectiveness for students with stress remains limited. Objective: This study aims to develop a conversational agent–delivered stress management coaching intervention for students called MISHA and to evaluate its effectiveness, engagement, and acceptance. Methods: In an unblinded randomized controlled trial, Swiss students experiencing stress were recruited on the web. Using a 1:1 randomization ratio, participants (N=140) were allocated to either the intervention or waitlist control group. Treatment effectiveness on changes in the primary outcome, that is, perceived stress, and secondary outcomes, including depression, anxiety, psychosomatic symptoms, and active coping, were self-assessed and evaluated using ANOVA for repeated measure and general estimating equations. Results: The per-protocol analysis revealed evidence for improvement of stress, depression, and somatic symptoms with medium effect sizes (Cohen d=−0.36 to Cohen d=−0.60), while anxiety and active coping did not change (Cohen d=−0.29 and Cohen d=0.13). In the intention-to-treat analysis, similar results were found, indicating reduced stress (β estimate=−0.13, 95% CI −0.20 to −0.05; P<.001), depressive symptoms (β estimate=−0.23, 95% CI −0.38 to −0.08; P=.003), and psychosomatic symptoms (β estimate=−0.16, 95% CI −0.27 to −0.06; P=.003), while anxiety and active coping did not change. Overall, 60% (42/70) of the participants in the intervention group completed the coaching by completing the postintervention survey. They particularly appreciated the quality, quantity, credibility, and visual representation of information. While individual customization was rated the lowest, the target group fitting was perceived as high. Conclusions: Findings indicate that MISHA is feasible, acceptable, and effective in reducing perceived stress among students in Switzerland. Future research is needed with different populations, for example, in students with high stress levels or compared to active controls. 
Trial Registration: German Clinical Trials Register DRKS 00030004; https://drks.de/search/en/trial/DRKS00030004 %M 38922677 %R 10.2196/54945 %U https://mhealth.jmir.org/2024/1/e54945 %U https://doi.org/10.2196/54945 %U http://www.ncbi.nlm.nih.gov/pubmed/38922677 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56780 %T Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses %A Luo,Xufei %A Chen,Fengxian %A Zhu,Di %A Wang,Ling %A Wang,Zijun %A Liu,Hui %A Lyu,Meng %A Wang,Ye %A Wang,Qi %A Chen,Yaolong %+ Evidence-Based Medicine Center, School of Basic Medical Sciences, Lanzhou University, No 199 Donggang West Road, Chengguan District, Lanzhou, 730000, China, 86 13893104140, chevidence@lzu.edu.cn %K large language model %K ChatGPT %K systematic review %K chatbot %K meta-analysis %D 2024 %7 25.6.2024 %9 Viewpoint %J J Med Internet Res %G English %X Large language models (LLMs) such as ChatGPT have become widely applied in the field of medical research. In the process of conducting systematic reviews, similar tools can be used to expedite various steps, including defining clinical questions, performing the literature search, document screening, information extraction, and language refinement, thereby conserving resources and enhancing efficiency. However, when using LLMs, attention should be paid to transparent reporting, distinguishing between genuine and false content, and avoiding academic misconduct. In this viewpoint, we highlight the potential roles of LLMs in the creation of systematic reviews and meta-analyses, elucidating their advantages, limitations, and future research directions, aiming to provide insights and guidance for authors planning systematic reviews and meta-analyses. %M 38819655 %R 10.2196/56780 %U https://www.jmir.org/2024/1/e56780 %U https://doi.org/10.2196/56780 %U http://www.ncbi.nlm.nih.gov/pubmed/38819655 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e58758 %T Evaluation of ChatGPT-Generated Differential Diagnosis for Common Diseases With Atypical Presentation: Descriptive Research %A Shikino,Kiyoshi %A Shimizu,Taro %A Otsuka,Yuki %A Tago,Masaki %A Takahashi,Hiromizu %A Watari,Takashi %A Sasaki,Yosuke %A Iizuka,Gemmei %A Tamura,Hiroki %A Nakashima,Koichi %A Kunitomo,Kotaro %A Suzuki,Morika %A Aoyama,Sayaka %A Kosaka,Shintaro %A Kawahigashi,Teiko %A Matsumoto,Tomohiro %A Orihara,Fumina %A Morikawa,Toru %A Nishizawa,Toshinori %A Hoshina,Yoji %A Yamamoto,Yu %A Matsuo,Yuichiro %A Unoki,Yuto %A Kimura,Hirofumi %A Tokushima,Midori %A Watanuki,Satoshi %A Saito,Takuma %A Otsuka,Fumio %A Tokuda,Yasuharu %K atypical presentation %K ChatGPT %K common disease %K diagnostic accuracy %K diagnosis %K patient safety %D 2024 %7 21.6.2024 %9 %J JMIR Med Educ %G English %X Background: The persistence of diagnostic errors, despite advances in medical knowledge and diagnostics, highlights the importance of understanding atypical disease presentations and their contribution to mortality and morbidity. Artificial intelligence (AI), particularly generative pre-trained transformers like GPT-4, holds promise for improving diagnostic accuracy, but requires further exploration in handling atypical presentations. Objective: This study aimed to assess the diagnostic accuracy of ChatGPT in generating differential diagnoses for atypical presentations of common diseases, with a focus on the model’s reliance on patient history during the diagnostic process. 
Methods: We used 25 clinical vignettes from the Journal of Generalist Medicine characterizing atypical manifestations of common diseases. Two general medicine physicians categorized the cases based on atypicality. ChatGPT was then used to generate differential diagnoses based on the clinical information provided. The concordance between AI-generated and final diagnoses was measured, with a focus on the top-ranked disease (top 1) and the top 5 differential diagnoses (top 5). Results: ChatGPT’s diagnostic accuracy decreased with an increase in atypical presentation. For category 1 (C1) cases, the concordance rates were 17% (n=1) for the top 1 and 67% (n=4) for the top 5. Categories 3 (C3) and 4 (C4) showed a 0% concordance for top 1 and markedly lower rates for the top 5, indicating difficulties in handling highly atypical cases. The χ2 test revealed no significant difference in the top 1 differential diagnosis accuracy between less atypical (C1+C2) and more atypical (C3+C4) groups (χ²1=2.07; n=25; P=.13). However, a significant difference was found in the top 5 analyses, with less atypical cases showing higher accuracy (χ²1=4.01; n=25; P=.048). Conclusions: ChatGPT-4 demonstrates potential as an auxiliary tool for diagnosing typical and mildly atypical presentations of common diseases. However, its performance declines with greater atypicality. The study findings underscore the need for AI systems to encompass a broader range of linguistic capabilities, cultural understanding, and diverse clinical scenarios to improve diagnostic utility in real-world settings. %R 10.2196/58758 %U https://mededu.jmir.org/2024/1/e58758 %U https://doi.org/10.2196/58758 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56894 %T Parents’ Perceptions of Their Parenting Journeys and a Mobile App Intervention (Parentbot—A Digital Healthcare Assistant): Qualitative Process Evaluation %A Chua,Joelle Yan Xin %A Choolani,Mahesh %A Chee,Cornelia Yin Ing %A Yi,Huso %A Chan,Yiong Huak %A Lalor,Joan Gabrielle %A Chong,Yap Seng %A Shorey,Shefaly %+ Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Level 2, Clinical Research Centre, Block MD11 10 Medical Drive, Singapore, 117597, Singapore, 65 66011294, nurssh@nus.edu.sg %K perinatal %K parents %K mobile app %K chatbot %K qualitative study %K interviews %K experiences %K mobile phone %D 2024 %7 21.6.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Parents experience many challenges during the perinatal period. Mobile app–based interventions and chatbots show promise in delivering health care support for parents during the perinatal period. Objective: This descriptive qualitative process evaluation study aims to explore the perinatal experiences of parents in Singapore, as well as examine the user experiences of the mobile app–based intervention with an in-built chatbot titled Parentbot—a Digital Healthcare Assistant (PDA). Methods: A total of 20 heterosexual English-speaking parents were recruited via purposive sampling from a single tertiary hospital in Singapore. The parents (control group: 10/20, 50%; intervention group: 10/20, 50%) were also part of an ongoing randomized trial between November 2022 and August 2023 that aimed to evaluate the effectiveness of the PDA in improving parenting outcomes. Semistructured one-to-one interviews were conducted via Zoom from February to June 2023. All interviews were conducted in English, audio recorded, and transcribed verbatim. 
Data analysis was guided by the thematic analysis framework. The COREQ (Consolidated Criteria for Reporting Qualitative Research) checklist was used to guide the reporting of data. Results: Three themes with 10 subthemes describing parents’ perceptions of their parenting journeys and their experiences with the PDA were identified. The main themes were (1) new babies, new troubles, and new wonders; (2) support system for the parents; and (3) reshaping perinatal support for future parents. Conclusions: Overall, the PDA provided parents with informational, socioemotional, and psychological support and could be used to supplement the perinatal care provided for future parents. To optimize users’ experience with the PDA, the intervention could be equipped with a more sophisticated chatbot, equipped with more gamification features, and programmed to deliver personalized care to parents. Researchers and health care providers could also strive to promote more peer-to-peer interactions among users. The provision of continuous, holistic, and family-centered care by health care professionals could also be emphasized. Moreover, policy changes regarding maternity and paternity leaves, availability of infant care centers, and flexible work arrangements could be further explored to promote healthy work-family balance for parents. %M 38905628 %R 10.2196/56894 %U https://www.jmir.org/2024/1/e56894 %U https://doi.org/10.2196/56894 %U http://www.ncbi.nlm.nih.gov/pubmed/38905628 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e53203 %T The Machine Speaks: Conversational AI and the Importance of Effort to Relationships of Meaning %A Hartford,Anna %A Stein,Dan J %+ Neuroscience Institute, University of Cape Town, Groote Schuur Hospital, Observatory, Cape Town, 7935, South Africa, 27 214042174, annahartford@gmail.com %K artificial intelligence %K AI %K conversational AIs %K generative AI %K intimacy %K human-machine interaction %K interpersonal relationships %K effort %K psychotherapy %K conversation %D 2024 %7 18.6.2024 %9 Viewpoint %J JMIR Ment Health %G English %X The focus of debates about conversational artificial intelligence (CAI) has largely been on social and ethical concerns that arise when we speak to machines—what is gained and what is lost when we replace our human interlocutors, including our human therapists, with AI. In this viewpoint, we focus instead on a distinct and growing phenomenon: letting machines speak for us. What is at stake when we replace our own efforts at interpersonal engagement with CAI? The purpose of these technologies is, in part, to remove effort, but effort has enormous value, and in some cases, even intrinsic value. This is true in many realms, but especially in interpersonal relationships. To make an effort for someone, irrespective of what that effort amounts to, often conveys value and meaning in itself. We elaborate on the meaning, worth, and significance that may be lost when we relinquish effort in our interpersonal engagements as well as on the opportunities for self-understanding and growth that we may forsake. 
%M 38889401 %R 10.2196/53203 %U https://mental.jmir.org/2024/1/e53203 %U https://doi.org/10.2196/53203 %U http://www.ncbi.nlm.nih.gov/pubmed/38889401 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e50182 %T Developing a Chatbot to Support Individuals With Neurodevelopmental Disorders: Tutorial %A Singla,Ashwani %A Khanna,Ritvik %A Kaur,Manpreet %A Kelm,Karen %A Zaiane,Osmar %A Rosenfelt,Cory Scott %A Bui,Truong An %A Rezaei,Navid %A Nicholas,David %A Reformat,Marek Z %A Majnemer,Annette %A Ogourtsova,Tatiana %A Bolduc,Francois %+ Department of Pediatrics, University of Alberta, 11315 87th Avenue, Edmonton, AB, T6G 2E1, Canada, 1 780 492 9713, fbolduc@ualberta.ca %K chatbot %K user interface %K knowledge graph %K neurodevelopmental disability %K autism %K intellectual disability %K attention-deficit/hyperactivity disorder %D 2024 %7 18.6.2024 %9 Tutorial %J J Med Internet Res %G English %X Families of individuals with neurodevelopmental disabilities or differences (NDDs) often struggle to find reliable health information on the web. NDDs encompass various conditions affecting up to 14% of children in high-income countries, and most individuals present with complex phenotypes and related conditions. It is challenging for their families to develop literacy solely by searching information on the internet. While in-person coaching can enhance care, it is only available to a minority of those with NDDs. Chatbots, or computer programs that simulate conversation, have emerged in the commercial sector as useful tools for answering questions, but their use in health care remains limited. To address this challenge, the researchers developed a chatbot named CAMI (Coaching Assistant for Medical/Health Information) that can provide information about trusted resources covering core knowledge and services relevant to families of individuals with NDDs. The chatbot was developed, in collaboration with individuals with lived experience, to provide information about trusted resources covering core knowledge and services that may be of interest. The developers used the Django framework (Django Software Foundation) for the development and used a knowledge graph to depict the key entities in NDDs and their relationships to allow the chatbot to suggest web resources that may be related to the user queries. To identify NDD domain–specific entities from user input, a combination of standard sources (the Unified Medical Language System) and other entities were used which were identified by health professionals as well as collaborators. Although most entities were identified in the text, some were not captured in the system and therefore went undetected. Nonetheless, the chatbot was able to provide resources addressing most user queries related to NDDs. The researchers found that enriching the vocabulary with synonyms and lay language terms for specific subdomains enhanced entity detection. By using a data set of numerous individuals with NDDs, the researchers developed a knowledge graph that established meaningful connections between entities, allowing the chatbot to present related symptoms, diagnoses, and resources. To the researchers’ knowledge, CAMI is the first chatbot to provide resources related to NDDs. Our work highlighted the importance of engaging end users to supplement standard generic ontologies to named entities for language recognition. 
It also demonstrates that complex medical and health-related information can be integrated using knowledge graphs and leveraging existing large datasets. This has multiple implications: generalizability to other health domains as well as reducing the need for experts and optimizing their input while keeping health care professionals in the loop. The researchers' work also shows how health and computer science domains need to collaborate to achieve the granularity needed to make chatbots truly useful and impactful. %M 38888947 %R 10.2196/50182 %U https://www.jmir.org/2024/1/e50182 %U https://doi.org/10.2196/50182 %U http://www.ncbi.nlm.nih.gov/pubmed/38888947 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e53897 %T Implementation of Anxiety UK’s Ask Anxia Chatbot Service: Lessons Learned %A Collins,Luke %A Nicholson,Niamh %A Lidbetter,Nicky %A Smithson,Dave %A Baker,Paul %+ Linguistics and English Language, Lancaster University, Economic and Social Research Council Centre for Corpus Approaches to Social Science, Bailrigg, Lancaster, LA1 4YW, United Kingdom, 44 1524 65201, l.collins3@lancaster.ac.uk %K chatbots %K anxiety disorders %K corpus linguistics %K conversational agents %K web-based care %D 2024 %7 17.6.2024 %9 Viewpoint %J JMIR Hum Factors %G English %X Chatbots are increasingly being applied in the context of health care, providing access to services when there are constraints on human resources. Simple, rule-based chatbots are suited to high-volume, repetitive tasks and can therefore be used effectively in providing users with important health information. In this Viewpoint paper, we report on the implementation of a chatbot service called Ask Anxia as part of a wider provision of information and support services offered by the UK national charity, Anxiety UK. We reflect on the changes made to the chatbot over the course of approximately 18 months as the Anxiety UK team monitored its performance and responded to recurrent themes in user queries by developing further information and services. We demonstrate how corpus linguistics can contribute to the evaluation of user queries and the optimization of responses. On the basis of these observations of how Anxiety UK has developed its own chatbot service, we offer recommendations for organizations looking to add automated conversational interfaces to their services. 
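The Ask Anxia viewpoint above notes that corpus linguistics can help evaluate user queries and optimize chatbot responses. As a minimal sketch of the general idea, assuming a hypothetical log of user queries, the snippet below builds a simple frequency profile of query vocabulary, a typical first step toward spotting recurrent themes; it is not the authors' corpus-linguistic method.

```python
from collections import Counter
import re

# Hypothetical user queries submitted to an anxiety-support chatbot
queries = [
    "How do I manage panic attacks at night?",
    "panic attack before exams, what helps?",
    "Do you have resources on health anxiety?",
    "membership question about health anxiety support groups",
]

STOPWORDS = {"how", "do", "i", "at", "what", "on", "you", "have", "about", "before", "a", "the"}

def tokenize(text: str) -> list[str]:
    """Lowercase, keep alphabetic tokens, drop stopwords."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]

# Frequency profile of the query vocabulary across the whole log
freq = Counter(tok for q in queries for tok in tokenize(q))
print(freq.most_common(5))
# -> [('panic', 2), ('health', 2), ('anxiety', 2), ...]
```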
%M 38885016 %R 10.2196/53897 %U https://humanfactors.jmir.org/2024/1/e53897 %U https://doi.org/10.2196/53897 %U http://www.ncbi.nlm.nih.gov/pubmed/38885016 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53297 %T Triage Performance Across Large Language Models, ChatGPT, and Untrained Doctors in Emergency Medicine: Comparative Study %A Masanneck,Lars %A Schmidt,Linea %A Seifert,Antonia %A Kölsche,Tristan %A Huntemann,Niklas %A Jansen,Robin %A Mehsin,Mohammed %A Bernhard,Michael %A Meuth,Sven G %A Böhm,Lennert %A Pawlitzki,Marc %+ Department of Neurology, Medical Faculty and University Hospital Düsseldorf, Heinrich Heine University Düsseldorf, Moorenstraße 5, Düsseldorf, 40225, Germany, 49 0211 81 17880, lars.masanneck@med.uni-duesseldorf.de %K emergency medicine %K triage %K artificial intelligence %K large language models %K ChatGPT %K untrained doctors %K doctor %K doctors %K comparative study %K digital health %K personnel %K staff %K cohort %K Germany %K German %D 2024 %7 14.6.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) have demonstrated impressive performances in various medical domains, prompting an exploration of their potential utility within the high-demand setting of emergency department (ED) triage. This study evaluated the triage proficiency of different LLMs and ChatGPT, an LLM-based chatbot, compared to professionally trained ED staff and untrained personnel. We further explored whether LLM responses could guide untrained staff in effective triage. Objective: This study aimed to assess the efficacy of LLMs and the associated product ChatGPT in ED triage compared to personnel of varying training status and to investigate if the models’ responses can enhance the triage proficiency of untrained personnel. Methods: A total of 124 anonymized case vignettes were triaged by untrained doctors; different versions of currently available LLMs; ChatGPT; and professionally trained raters, who subsequently agreed on a consensus set according to the Manchester Triage System (MTS). The prototypical vignettes were adapted from cases at a tertiary ED in Germany. The main outcome was the level of agreement between raters’ MTS level assignments, measured via quadratic-weighted Cohen κ. The extent of over- and undertriage was also determined. Notably, instances of ChatGPT were prompted using zero-shot approaches without extensive background information on the MTS. The tested LLMs included raw GPT-4, Llama 3 70B, Gemini 1.5, and Mixtral 8x7b. Results: GPT-4–based ChatGPT and untrained doctors showed substantial agreement with the consensus triage of professional raters (κ=mean 0.67, SD 0.037 and κ=mean 0.68, SD 0.056, respectively), significantly exceeding the performance of GPT-3.5–based ChatGPT (κ=mean 0.54, SD 0.024; P<.001). When untrained doctors used this LLM for second-opinion triage, there was a slight but statistically insignificant performance increase (κ=mean 0.70, SD 0.047; P=.97). Other tested LLMs performed similar to or worse than GPT-4–based ChatGPT or showed odd triaging behavior with the used parameters. LLMs and ChatGPT models tended toward overtriage, whereas untrained doctors undertriaged. Conclusions: While LLMs and the LLM-based product ChatGPT do not yet match professionally trained raters, their best models’ triage proficiency equals that of untrained ED doctors. 
In its current form, LLMs or ChatGPT thus did not demonstrate gold-standard performance in ED triage and, in the setting of this study, failed to significantly improve untrained doctors’ triage when used as decision support. Notable performance enhancements in newer LLM versions over older ones hint at future improvements with further technological development and specific training. %M 38875696 %R 10.2196/53297 %U https://www.jmir.org/2024/1/e53297 %U https://doi.org/10.2196/53297 %U http://www.ncbi.nlm.nih.gov/pubmed/38875696 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e54987 %T Evolution of Chatbots in Nursing Education: Narrative Review %A Zhang,Fang %A Liu,Xiaoliu %A Wu,Wenyan %A Zhu,Shiben %K nursing education %K chatbots %K artificial intelligence %K narrative review %K ChatGPT %D 2024 %7 13.6.2024 %9 %J JMIR Med Educ %G English %X Background: The integration of chatbots in nursing education is a rapidly evolving area with potential transformative impacts. This narrative review aims to synthesize and analyze the existing literature on chatbots in nursing education. Objective: This study aims to comprehensively examine the temporal trends, international distribution, study designs, and implications of chatbots in nursing education. Methods: A comprehensive search was conducted across 3 databases (PubMed, Web of Science, and Embase) following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) flow diagram. Results: A total of 40 articles met the eligibility criteria, with a notable increase of publications in 2023 (n=28, 70%). Temporal analysis revealed a notable surge in publications from 2021 to 2023, emphasizing the growing scholarly interest. Geographically, Taiwan province made substantial contributions (n=8, 20%), followed by the United States (n=6, 15%) and South Korea (n=4, 10%). Study designs varied, with reviews (n=8, 20%) and editorials (n=7, 18%) being predominant, showcasing the richness of research in this domain. Conclusions: Integrating chatbots into nursing education presents a promising yet relatively unexplored avenue. This review highlights the urgent need for original research, emphasizing the importance of ethical considerations. %R 10.2196/54987 %U https://mededu.jmir.org/2024/1/e54987 %U https://doi.org/10.2196/54987 %0 Journal Article %@ 2562-7600 %I JMIR Publications %V 7 %N %P e52105 %T Navigating the Pedagogical Landscape: Exploring the Implications of AI and Chatbots in Nursing Education %A Srinivasan,Muthuvenkatachalam %A Venugopal,Ambili %A Venkatesan,Latha %A Kumar,Rajesh %+ College of Nursing, All India Institute of Medical Sciences, Guntur Dt, Mangalagiri, 522503, India, 91 9410366146, muthu.venky@gmail.com %K AI %K artificial intelligence %K ChatGPT %K chatbots %K nursing education %K education %K chatbot %K nursing %K ethical %K ethics %K ethical consideration %K accessible %K learning %K efficiency %K student %K student engagement %K student learning %D 2024 %7 13.6.2024 %9 Viewpoint %J JMIR Nursing %G English %X This viewpoint paper explores the pedagogical implications of artificial intelligence (AI) and AI-based chatbots such as ChatGPT in nursing education, examining their potential uses, benefits, challenges, and ethical considerations. AI and chatbots offer transformative opportunities for nursing education, such as personalized learning, simulation and practice, accessible learning, and improved efficiency. 
They have the potential to increase student engagement and motivation, enhance learning outcomes, and augment teacher support. However, the integration of these technologies also raises ethical considerations, such as privacy, confidentiality, and bias. The viewpoint paper provides a comprehensive overview of the current state of AI and chatbots in nursing education, offering insights into best practices and guidelines for their integration. By examining the impact of AI and ChatGPT on student learning, engagement, and teacher effectiveness and efficiency, this review aims to contribute to the ongoing discussion on the use of AI and chatbots in nursing education and provide recommendations for future research and development in the field. %M 38870516 %R 10.2196/52105 %U https://nursing.jmir.org/2024/1/e52105 %U https://doi.org/10.2196/52105 %U http://www.ncbi.nlm.nih.gov/pubmed/38870516 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e56529 %T Considering the Role of Human Empathy in AI-Driven Therapy %A Rubin,Matan %A Arnon,Hadar %A Huppert,Jonathan D %A Perry,Anat %+ Psychology Department, Hebrew University of Jerusalem, Mt Scopus, Jerusalem, 91905, Israel, 972 2 588 3027, anat.perry@mail.huji.ac.il %K empathy %K empathetic %K empathic %K artificial empathy %K AI %K artificial intelligence %K mental health %K machine learning %K algorithm %K algorithms %K predictive model %K predictive models %K predictive analytics %K predictive system %K practical model %K practical models %K model %K models %K therapy %K mental illness %K mental illnesses %K mental disease %K mental diseases %K mood disorder %K mood disorders %K emotion %K emotions %K e-mental health %K digital mental health %K internet-based therapy %D 2024 %7 11.6.2024 %9 Viewpoint %J JMIR Ment Health %G English %X Recent breakthroughs in artificial intelligence (AI) language models have elevated the vision of using conversational AI support for mental health, with a growing body of literature indicating varying degrees of efficacy. In this paper, we ask when, in therapy, it will be easier to replace humans and, conversely, in what instances, human connection will still be more valued. We suggest that empathy lies at the heart of the answer to this question. First, we define different aspects of empathy and outline the potential empathic capabilities of humans versus AI. Next, we consider what determines when these aspects are needed most in therapy, both from the perspective of therapeutic methodology and from the perspective of patient objectives. Ultimately, our goal is to prompt further investigation and dialogue, urging both practitioners and scholars engaged in AI-mediated therapy to keep these questions and considerations in mind when investigating AI implementation in mental health. 
%M 38861302 %R 10.2196/56529 %U https://mental.jmir.org/2024/1/e56529 %U https://doi.org/10.2196/56529 %U http://www.ncbi.nlm.nih.gov/pubmed/38861302 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e56117 %T A Use Case for Generative AI in Medical Education %A Sekhar,Tejas C %A Nayak,Yash R %A Abdoler,Emily A %K medical education %K med ed %K generative artificial intelligence %K artificial intelligence %K GAI %K AI %K Anki %K flashcard %K undergraduate medical education %K UME %D 2024 %7 7.6.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/56117 %U https://mededu.jmir.org/2024/1/e56117 %U https://doi.org/10.2196/56117 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e58370 %T Authors’ Reply: A Use Case for Generative AI in Medical Education %A Pendergrast,Tricia %A Chalmers,Zachary %K ChatGPT %K undergraduate medical education %K large language models %D 2024 %7 7.6.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/58370 %U https://mededu.jmir.org/2024/1/e58370 %U https://doi.org/10.2196/58370 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e56165 %T Clinical Accuracy, Relevance, Clarity, and Emotional Sensitivity of Large Language Models to Surgical Patient Questions: Cross-Sectional Study %A Dagli,Mert Marcel %A Oettl,Felix Conrad %A Gujral,Jaskeerat %A Malhotra,Kashish %A Ghenbot,Yohannes %A Yoon,Jang W %A Ozturk,Ali K %A Welch,William C %+ Department of Neurosurgery, University of Pennsylvania Perelman School of Medicine, 801 Spruce Street, Philadelphia, PA, 19106, United States, 1 2672306493, marcel.dagli@pennmedicine.upenn.edu %K artificial intelligence %K AI %K natural language processing %K NLP %K large language model %K LLM %K generative AI %K cross-sectional study %K health information %K patient education %K clinical accuracy %K emotional sensitivity %K surgical patient %K surgery %K surgical %D 2024 %7 7.6.2024 %9 Research Letter %J JMIR Form Res %G English %X This cross-sectional study evaluates the clinical accuracy, relevance, clarity, and emotional sensitivity of responses to inquiries from patients undergoing surgery provided by large language models (LLMs), highlighting their potential as adjunct tools in patient communication and education. Our findings demonstrated high performance of LLMs across accuracy, relevance, clarity, and emotional sensitivity, with Anthropic’s Claude 2 outperforming OpenAI’s ChatGPT and Google’s Bard, suggesting LLMs’ potential to serve as complementary tools for enhanced information delivery and patient-surgeon interaction. 
%M 38848553 %R 10.2196/56165 %U https://formative.jmir.org/2024/1/e56165 %U https://doi.org/10.2196/56165 %U http://www.ncbi.nlm.nih.gov/pubmed/38848553 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e58342 %T Feasibility of Multimodal Artificial Intelligence Using GPT-4 Vision for the Classification of Middle Ear Disease: Qualitative Study and Validation %A Noda,Masao %A Yoshimura,Hidekane %A Okubo,Takuya %A Koshu,Ryota %A Uchiyama,Yuki %A Nomura,Akihiro %A Ito,Makoto %A Takumi,Yutaka %+ Department of Otolaryngology, Head and Neck Surgery, Jichi Medical University, 3311-1 Yakushiji, Shimotsuke, 329-0498, Japan, 81 285442111, doforanabdosuc@gmail.com %K artificial intelligence %K deep learning %K machine learning %K generative AI %K generative %K tympanic membrane %K middle ear disease %K GPT4-Vision %K otolaryngology %K ears %K ear %K tympanic %K vision %K GPT %K GPT4V %K otoscopic %K image %K images %K imaging %K diagnosis %K diagnoses %K diagnostic %K diagnostics %K otitis %K mobile phone %D 2024 %7 31.5.2024 %9 Original Paper %J JMIR AI %G English %X Background: The integration of artificial intelligence (AI), particularly deep learning models, has transformed the landscape of medical technology, especially in the field of diagnosis using imaging and physiological data. In otolaryngology, AI has shown promise in image classification for middle ear diseases. However, existing models often lack patient-specific data and clinical context, limiting their universal applicability. The emergence of GPT-4 Vision (GPT-4V) has enabled a multimodal diagnostic approach, integrating language processing with image analysis. Objective: In this study, we investigated the effectiveness of GPT-4V in diagnosing middle ear diseases by integrating patient-specific data with otoscopic images of the tympanic membrane. Methods: The design of this study was divided into two phases: (1) establishing a model with appropriate prompts and (2) validating the ability of the optimal prompt model to classify images. In total, 305 otoscopic images of 4 middle ear diseases (acute otitis media, middle ear cholesteatoma, chronic otitis media, and otitis media with effusion) were obtained from patients who visited Shinshu University or Jichi Medical University between April 2010 and December 2023. The optimized GPT-4V settings were established using prompts and patients’ data, and the model created with the optimal prompt was used to verify the diagnostic accuracy of GPT-4V on 190 images. To compare the diagnostic accuracy of GPT-4V with that of physicians, 30 clinicians completed a web-based questionnaire consisting of 190 images. Results: The multimodal AI approach achieved an accuracy of 82.1%, which is superior to that of certified pediatricians at 70.6%, but trailing behind that of otolaryngologists at more than 95%. The model’s disease-specific accuracy rates were 89.2% for acute otitis media, 76.5% for chronic otitis media, 79.3% for middle ear cholesteatoma, and 85.7% for otitis media with effusion, which highlights the need for disease-specific optimization. Comparisons with physicians revealed promising results, suggesting the potential of GPT-4V to augment clinical decision-making. Conclusions: Despite its advantages, challenges such as data privacy and ethical considerations must be addressed. Overall, this study underscores the potential of multimodal AI for enhancing diagnostic accuracy and improving patient care in otolaryngology. 
Further research is warranted to optimize and validate this approach in diverse clinical settings. %M 38875669 %R 10.2196/58342 %U https://ai.jmir.org/2024/1/e58342 %U https://doi.org/10.2196/58342 %U http://www.ncbi.nlm.nih.gov/pubmed/38875669 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54974 %T Harnessing ChatGPT for Thematic Analysis: Are We Ready? %A Lee,V Vien %A van der Lubbe,Stephanie C C %A Goh,Lay Hoon %A Valderas,Jose Maria %+ Division of Family Medicine, Yong Loo Lin School of Medicine, National University of Singapore, NUHS Tower Block, Level 9, 1E Kent Ridge Road, Singapore, 119228, Singapore, 65 67723874, jmvalderas@nus.edu.sg %K ChatGPT %K thematic analysis %K natural language processing %K NLP %K medical research %K qualitative research %K qualitative data %K technology %K viewpoint %K efficiency %D 2024 %7 31.5.2024 %9 Viewpoint %J J Med Internet Res %G English %X ChatGPT (OpenAI) is an advanced natural language processing tool with growing applications across various disciplines in medical research. Thematic analysis, a qualitative research method to identify and interpret patterns in data, is one application that stands to benefit from this technology. This viewpoint explores the use of ChatGPT in three core phases of thematic analysis within a medical context: (1) direct coding of transcripts, (2) generating themes from a predefined list of codes, and (3) preprocessing quotes for manuscript inclusion. Additionally, we explore the potential of ChatGPT to generate interview transcripts, which may be used for training purposes. We assess the strengths and limitations of using ChatGPT in these roles, highlighting areas where human intervention remains necessary. Overall, we argue that ChatGPT can function as a valuable tool during analysis, enhancing the efficiency of the thematic analysis and offering additional insights into the qualitative data. While ChatGPT may not adequately capture the full context of each participant, it can serve as an additional member of the analysis team, contributing to researcher triangulation through knowledge building and sensemaking. %M 38819896 %R 10.2196/54974 %U https://www.jmir.org/2024/1/e54974 %U https://doi.org/10.2196/54974 %U http://www.ncbi.nlm.nih.gov/pubmed/38819896 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56614 %T Redefining Health Care Data Interoperability: Empirical Exploration of Large Language Models in Information Exchange %A Yoon,Dukyong %A Han,Changho %A Kim,Dong Won %A Kim,Songsoo %A Bae,SungA %A Ryu,Jee An %A Choi,Yujin %+ Department of Biomedical Systems Informatics, Yonsei University College of Medicine, 50-1 Yonsei-ro Seodaemun-gu, Seoul, 03722, Republic of Korea, 82 31 5189 8450, dukyong.yoon@yonsei.ac.kr %K health care interoperability %K large language models %K medical data transformation %K data standardization %K text-based %D 2024 %7 31.5.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Efficient data exchange and health care interoperability are impeded by medical records often being in nonstandardized or unstructured natural language format. Advanced language models, such as large language models (LLMs), may help overcome current challenges in information exchange. Objective: This study aims to evaluate the capability of LLMs in transforming and transferring health care data to support interoperability. Methods: Using data from the Medical Information Mart for Intensive Care III and UK Biobank, the study conducted 3 experiments. 
Experiment 1 assessed the accuracy of transforming structured laboratory results into unstructured format. Experiment 2 explored the conversion of diagnostic codes between the coding frameworks of the ICD-9-CM (International Classification of Diseases, Ninth Revision, Clinical Modification), and Systematized Nomenclature of Medicine Clinical Terms (SNOMED-CT) using a traditional mapping table and a text-based approach facilitated by the LLM ChatGPT. Experiment 3 focused on extracting targeted information from unstructured records that included comprehensive clinical information (discharge notes). Results: The text-based approach showed a high conversion accuracy in transforming laboratory results (experiment 1) and an enhanced consistency in diagnostic code conversion, particularly for frequently used diagnostic names, compared with the traditional mapping approach (experiment 2). In experiment 3, the LLM showed a positive predictive value of 87.2% in extracting generic drug names. Conclusions: This study highlighted the potential role of LLMs in significantly improving health care data interoperability, demonstrated by their high accuracy and efficiency in data transformation and exchange. The LLMs hold vast potential for enhancing medical data exchange without complex standardization for medical terms and data structure. %M 38819879 %R 10.2196/56614 %U https://www.jmir.org/2024/1/e56614 %U https://doi.org/10.2196/56614 %U http://www.ncbi.nlm.nih.gov/pubmed/38819879 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e50025 %T Effectiveness of a Mental Health Chatbot for People With Chronic Diseases: Randomized Controlled Trial %A MacNeill,A Luke %A Doucet,Shelley %A Luke,Alison %+ Centre for Research in Integrated Care, University of New Brunswick, 355 Campus Ring Road, Saint John, NB, E2L 4L5, Canada, 1 506 648 5777, luke.macneill@unb.ca %K chatbot %K chronic disease %K arthritis %K diabetes %K mental health %K depression %K anxiety %K stress %K effectiveness %K application %D 2024 %7 30.5.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: People with chronic diseases tend to experience more mental health issues than their peers without these health conditions. Mental health chatbots offer a potential source of mental health support for people with chronic diseases. Objective: The aim of this study was to determine whether a mental health chatbot can improve mental health in people with chronic diseases. We focused on 2 chronic diseases in particular: arthritis and diabetes. Methods: Individuals with arthritis or diabetes were recruited using various web-based methods. Participants were randomly assigned to 1 of 2 groups. Those in the treatment group used a mental health chatbot app (Wysa [Wysa Inc]) over a period of 4 weeks. Those in the control group received no intervention. Participants completed measures of depression (Patient Health Questionnaire–9), anxiety (Generalized Anxiety Disorder Scale–7), and stress (Perceived Stress Scale–10) at baseline, with follow-up testing 2 and 4 weeks later. Participants in the treatment group completed feedback questions on their experiences with the app at the final assessment point. Results: A total of 68 participants (n=47, 69% women; mean age 42.87, SD 11.27 years) were included in the analysis. Participants were divided evenly between the treatment and control groups. Those in the treatment group reported decreases in depression (P<.001) and anxiety (P<.001) severity over the study period. 
No such changes were found among participants in the control group. No changes in stress were reported by participants in either group. Participants with arthritis reported higher levels of depression (P=.004) and anxiety (P=.004) severity than participants with diabetes over the course of the study, as well as higher levels of stress (P=.01); otherwise, patterns of results were similar across these health conditions. In response to the feedback questions, participants in the treatment group said that they liked many of the functions and features of the app, the general design of the app, and the user experience. They also disliked some aspects of the app, with most of these reports focusing on the chatbot’s conversational abilities. Conclusions: The results of this study suggest that mental health chatbots can be an effective source of mental health support for people with chronic diseases such as arthritis and diabetes. Although cost-effective and accessible, these programs have limitations and may not be well suited for all individuals. Trial Registration: ClinicalTrials.gov NCT04620668; https://www.clinicaltrials.gov/study/NCT04620668 %M 38814681 %R 10.2196/50025 %U https://formative.jmir.org/2024/1/e50025 %U https://doi.org/10.2196/50025 %U http://www.ncbi.nlm.nih.gov/pubmed/38814681 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e50454 %T Effects of a Chatbot-Based Intervention on Stress and Health-Related Parameters in a Stressed Sample: Randomized Controlled Trial %A Schillings,Christine %A Meißner,Echo %A Erb,Benjamin %A Bendig,Eileen %A Schultchen,Dana %A Pollatos,Olga %+ Department of Clinical and Health Psychology, Institute of Psychology and Education, Ulm University, Albert-Einstein-Allee 43, Ulm, 89081, Germany, 49 731 50 31738, christine.schillings@uni-ulm.de %K chatbot %K intervention %K stress %K interoception %K interoceptive sensibility %K mindfulness %K emotion regulation %K RCT %K randomized controlled trial %D 2024 %7 28.5.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Stress levels and the prevalence of mental disorders in the general population have been rising in recent years. Chatbot-based interventions represent novel and promising digital approaches to improve health-related parameters. However, there is a lack of research on chatbot-based interventions in the area of mental health. Objective: The aim of this study was to investigate the effects of a 3-week chatbot-based intervention guided by the chatbot ELME, specifically with respect to the ability to reduce stress and improve various health-related parameters in a stressed sample. Methods: In this multicenter two-armed randomized controlled trial, 118 individuals with medium to high stress levels were randomized to the intervention group (n=59) or the treatment-as-usual control group (n=59). The ELME chatbot guided participants of the intervention group through 3 weeks of training based on the topics stress, mindfulness, and interoception, with practical and psychoeducative elements delivered in two daily interactive intervention sessions via a smartphone (approximately 10-20 minutes each). The primary outcome (perceived stress) and secondary outcomes (mindfulness; interoception or interoceptive sensibility; subjective well-being; and emotion regulation, including the subfacets reappraisal and suppression) were assessed preintervention (T1), post intervention (T2; after 3 weeks), and at follow-up (T3; after 6 weeks). 
During both conditions, participants also underwent ecological momentary assessments of stress and interoceptive sensibility. Results: There were no significant changes in perceived stress (β03=–.018, SE=.329; P=.96) and momentary stress. Mindfulness and the subfacet reappraisal significantly increased in the intervention group over time, whereas there was no change in the subfacet suppression. Well-being and momentary interoceptive sensibility increased in both groups over time. Conclusions: To gain insight into how the intervention can be improved to achieve its full potential for stress reduction, besides a longer intervention duration, specific sample subgroups should be considered. The chatbot-based intervention seems to have the potential to improve mindfulness and emotion regulation in a stressed sample. Future chatbot-based studies and interventions in health care should be designed based on the latest findings on the efficacy of rule-based and artificial intelligence–based chatbots. Trial Registration: German Clinical Trials Register DRKS00027560; https://drks.de/search/en/trial/DRKS00027560 International Registered Report Identifier (IRRID): RR2-doi.org/10.3389/fdgth.2023.1046202 %M 38805259 %R 10.2196/50454 %U https://mental.jmir.org/2024/1/e50454 %U https://doi.org/10.2196/50454 %U http://www.ncbi.nlm.nih.gov/pubmed/38805259 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e55399 %T The Impact of Performance Expectancy, Workload, Risk, and Satisfaction on Trust in ChatGPT: Cross-Sectional Survey Analysis %A Choudhury,Avishek %A Shamszare,Hamid %+ Industrial and Management Systems Engineering, Benjamin M. Statler College of Engineering and Mineral Resources, West Virginia University, 321 Engineering Sciences Building, 1306 Evansdale Drive, Morgantown, WV, 26506, United States, 1 3042939431, avishek.choudhury@mail.wvu.edu %K ChatGPT %K chatbots %K health care %K health care decision-making %K health-related decision-making %K health care management %K decision-making %K user perception %K usability %K usable %K usableness %K usefulness %K artificial intelligence %K algorithms %K predictive models %K predictive analytics %K predictive system %K practical models %K deep learning %K cross-sectional survey %D 2024 %7 27.5.2024 %9 Original Paper %J JMIR Hum Factors %G English %X Background: ChatGPT (OpenAI) is a powerful tool for a wide range of tasks, from entertainment and creativity to health care queries. There are potential risks and benefits associated with this technology. In the discourse concerning the deployment of ChatGPT and similar large language models, it is sensible to recommend their use primarily for tasks a human user can execute accurately. As we transition into the subsequent phase of ChatGPT deployment, establishing realistic performance expectations and understanding users’ perceptions of risk associated with its use are crucial in determining the successful integration of this artificial intelligence (AI) technology. Objective: The aim of the study is to explore how perceived workload, satisfaction, performance expectancy, and risk-benefit perception influence users’ trust in ChatGPT. Methods: A semistructured, web-based survey was conducted with 607 adults in the United States who actively use ChatGPT. 
The survey questions were adapted from constructs used in various models and theories such as the technology acceptance model, the theory of planned behavior, the unified theory of acceptance and use of technology, and research on trust and security in digital environments. To test our hypotheses and structural model, we used the partial least squares structural equation modeling method, a widely used approach for multivariate analysis. Results: A total of 607 people responded to our survey. A significant portion of the participants held at least a high school diploma (n=204, 33.6%), and the majority had a bachelor’s degree (n=262, 43.1%). The primary motivations for participants to use ChatGPT were for acquiring information (n=219, 36.1%), amusement (n=203, 33.4%), and addressing problems (n=135, 22.2%). Some participants used it for health-related inquiries (n=44, 7.2%), while a few others (n=6, 1%) used it for miscellaneous activities such as brainstorming, grammar verification, and blog content creation. Our model explained 64.6% of the variance in trust. Our analysis indicated a significant relationship between (1) workload and satisfaction, (2) trust and satisfaction, (3) performance expectations and trust, and (4) risk-benefit perception and trust. Conclusions: The findings underscore the importance of ensuring user-friendly design and functionality in AI-based applications to reduce workload and enhance user satisfaction, thereby increasing user trust. Future research should further explore the relationship between risk-benefit perception and trust in the context of AI chatbots. %M 38801658 %R 10.2196/55399 %U https://humanfactors.jmir.org/2024/1/e55399 %U https://doi.org/10.2196/55399 %U http://www.ncbi.nlm.nih.gov/pubmed/38801658 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e54507 %T Evidence-Based Learning Strategies in Medicine Using AI %A Arango-Ibanez,Juan Pablo %A Posso-Nuñez,Jose Alejandro %A Díaz-Solórzano,Juan Pablo %A Cruz-Suárez,Gustavo %K artificial intelligence %K large language models %K ChatGPT %K active recall %K memory cues %K LLMs %K evidence-based %K learning strategy %K medicine %K AI %K medical education %K knowledge %K relevance %D 2024 %7 24.5.2024 %9 %J JMIR Med Educ %G English %X Large language models (LLMs), like ChatGPT, are transforming the landscape of medical education. They offer a vast range of applications, such as tutoring (personalized learning), patient simulation, generation of examination questions, and streamlined access to information. The rapid advancement of medical knowledge and the need for personalized learning underscore the relevance and timeliness of exploring innovative strategies for integrating artificial intelligence (AI) into medical education. In this paper, we propose coupling evidence-based learning strategies, such as active recall and memory cues, with AI to optimize learning. These strategies include the generation of tests, mnemonics, and visual cues. 
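The viewpoint above proposes coupling evidence-based learning strategies such as active recall and memory cues with AI, for example by generating practice questions and mnemonics. Shown below is one possible prompt sketch using the OpenAI Python client (v1.x); the model name, prompt wording, and topic are placeholder assumptions, and the paper does not prescribe any particular implementation.

```python
from openai import OpenAI  # assumes the openai>=1.0 package and an API key set in the environment

client = OpenAI()

topic = "pharmacology of beta-blockers"  # placeholder topic
prompt = (
    f"Create 3 active-recall questions (with answers listed separately at the end) and one short "
    f"mnemonic to help a medical student remember key facts about {topic}. "
    f"Keep each question to one sentence."
)

response = client.chat.completions.create(
    model="gpt-4o",  # placeholder model name; any capable chat model could be substituted
    messages=[
        {"role": "system", "content": "You are a tutor who writes concise, evidence-based study aids."},
        {"role": "user", "content": prompt},
    ],
)

print(response.choices[0].message.content)
```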
%R 10.2196/54507 %U https://mededu.jmir.org/2024/1/e54507 %U https://doi.org/10.2196/54507 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e57001 %T Assessing and Optimizing Large Language Models on Spondyloarthritis Multi-Choice Question Answering: Protocol for Enhancement and Assessment %A Wang,Anan %A Wu,Yunong %A Ji,Xiaojian %A Wang,Xiangyang %A Hu,Jiawen %A Zhang,Fazhan %A Zhang,Zhanchao %A Pu,Dong %A Tang,Lulu %A Ma,Shikui %A Liu,Qiang %A Dong,Jing %A He,Kunlun %A Li,Kunpeng %A Teng,Da %A Li,Tao %+ Department of Medical Innovation Research, Chinese PLA General Hospital, No.28 Fuxing Road, Wanshou Road, Haidian District, Beijing, China, 86 13810398393, litao301hospital@163.com %K spondyloarthritis %K benchmark %K large language model %K artificial intelligence %K AI %K AI chatbot %K AI-assistant diagnosis %D 2024 %7 24.5.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Spondyloarthritis (SpA), a chronic inflammatory disorder, predominantly impacts the sacroiliac joints and spine, significantly escalating the risk of disability. SpA’s complexity, as evidenced by its diverse clinical presentations and symptoms that often mimic other diseases, presents substantial challenges in its accurate diagnosis and differentiation. This complexity becomes even more pronounced in nonspecialist health care environments due to limited resources, resulting in delayed referrals, increased misdiagnosis rates, and exacerbated disability outcomes for patients with SpA. The emergence of large language models (LLMs) in medical diagnostics introduces a revolutionary potential to overcome these diagnostic hurdles. Despite recent advancements in artificial intelligence and LLMs demonstrating effectiveness in diagnosing and treating various diseases, their application in SpA remains underdeveloped. Currently, there is a notable absence of SpA-specific LLMs and an established benchmark for assessing the performance of such models in this particular field. Objective: Our objective is to develop a foundational medical model, creating a comprehensive evaluation benchmark tailored to the essential medical knowledge of SpA and its unique diagnostic and treatment protocols. The model, post-pretraining, will be subject to further enhancement through supervised fine-tuning. It is projected to significantly aid physicians in SpA diagnosis and treatment, especially in settings with limited access to specialized care. Furthermore, this initiative is poised to promote early and accurate SpA detection at the primary care level, thereby diminishing the risks associated with delayed or incorrect diagnoses. Methods: A rigorous benchmark, comprising 222 meticulously formulated multiple-choice questions on SpA, will be established and developed. These questions will be extensively revised to ensure their suitability for accurately evaluating LLMs’ performance in real-world diagnostic and therapeutic scenarios. Our methodology involves selecting and refining top foundational models using public data sets. The best-performing model in our benchmark will undergo further training. Subsequently, more than 80,000 real-world inpatient and outpatient cases from hospitals will enhance LLM training, incorporating techniques such as supervised fine-tuning and low-rank adaptation. We will rigorously assess the models’ generated responses for accuracy and evaluate their reasoning processes using the metrics of fluency, relevance, completeness, and medical proficiency. 
Results: Development of the model is progressing, with significant enhancements anticipated by early 2024. The benchmark, along with the results of evaluations, is expected to be released in the second quarter of 2024. Conclusions: Our trained model aims to capitalize on the capabilities of LLMs in analyzing complex clinical data, thereby enabling precise detection, diagnosis, and treatment of SpA. This innovation is anticipated to play a vital role in diminishing the disabilities arising from delayed or incorrect SpA diagnoses. By promoting this model across diverse health care settings, we anticipate a significant improvement in SpA management, culminating in enhanced patient outcomes and a reduced overall burden of the disease. International Registered Report Identifier (IRRID): DERR1-10.2196/57001 %M 38788208 %R 10.2196/57001 %U https://www.researchprotocols.org/2024/1/e57001 %U https://doi.org/10.2196/57001 %U http://www.ncbi.nlm.nih.gov/pubmed/38788208 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e54283 %T The Performance of ChatGPT-4V in Interpreting Images and Tables in the Japanese Medical Licensing Exam %A Takagi,Soshi %A Koda,Masahide %A Watari,Takashi %K ChatGPT %K medical licensing examination %K generative artificial intelligence %K medical education %K large language model %K images %K tables %K artificial intelligence %K AI %K Japanese %K reliability %K medical application %K medical applications %K diagnostic %K diagnostics %K online data %K web-based data %D 2024 %7 23.5.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/54283 %U https://mededu.jmir.org/2024/1/e54283 %U https://doi.org/10.2196/54283 %0 Journal Article %@ 2368-7959 %I %V 11 %N %P e54781 %T The Artificial Third: A Broad View of the Effects of Introducing Generative Artificial Intelligence on Psychotherapy %A Haber,Yuval %A Levkovich,Inbar %A Hadar-Shoval,Dorit %A Elyoseph,Zohar %K psychoanalysis %K generative artificial intelligence %K psychotherapy %K large language models %K narcissism %K narcissist %K narcissistic %K perception %K perceptions %K critical thinking %K transparency %K autonomy %K mental health %K interpersonal %K LLM %K LLMs %K language model %K language models %K artificial intelligence %K generative %K AI %K ethic %K ethics %K ethical %D 2024 %7 23.5.2024 %9 %J JMIR Ment Health %G English %X This paper explores a significant shift in the field of mental health in general and psychotherapy in particular following generative artificial intelligence’s new capabilities in processing and generating humanlike language. Following Freud, this lingo-technological development is conceptualized as the “fourth narcissistic blow” that science inflicts on humanity. We argue that this narcissistic blow has a potentially dramatic influence on perceptions of human society, interrelationships, and the self. We should, accordingly, expect dramatic changes in perceptions of the therapeutic act following the emergence of what we term the artificial third in the field of psychotherapy. The introduction of an artificial third marks a critical juncture, prompting us to ask the following important core questions that address two basic elements of critical thinking, namely, transparency and autonomy: (1) What is this new artificial presence in therapy relationships? (2) How does it reshape our perception of ourselves and our interpersonal dynamics? and (3) What remains of the irreplaceable human elements at the core of therapy? 
Given the ethical implications that arise from these questions, this paper proposes that the artificial third can be a valuable asset when applied with insight and ethical consideration, enhancing but not replacing the human touch in therapy. %R 10.2196/54781 %U https://mental.jmir.org/2024/1/e54781 %U https://doi.org/10.2196/54781 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53164 %T Hallucination Rates and Reference Accuracy of ChatGPT and Bard for Systematic Reviews: Comparative Analysis %A Chelli,Mikaël %A Descamps,Jules %A Lavoué,Vincent %A Trojani,Christophe %A Azar,Michel %A Deckert,Marcel %A Raynier,Jean-Luc %A Clowez,Gilles %A Boileau,Pascal %A Ruetsch-Chelli,Caroline %+ Institute for Sports and Reconstructive Bone and Joint Surgery, Groupe Kantys, 7 Avenue, Durante, Nice, 06000, France, 33 4 93 16 76 40, mikael.chelli@gmail.com %K artificial intelligence %K large language models %K ChatGPT %K Bard %K rotator cuff %K systematic reviews %K literature search %K hallucinated %K human conducted %D 2024 %7 22.5.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) have raised both interest and concern in the academic community. They offer the potential for automating literature search and synthesis for systematic reviews but raise concerns regarding their reliability, as the tendency to generate unsupported (hallucinated) content persist. Objective: The aim of the study is to assess the performance of LLMs such as ChatGPT and Bard (subsequently rebranded Gemini) to produce references in the context of scientific writing. Methods: The performance of ChatGPT and Bard in replicating the results of human-conducted systematic reviews was assessed. Using systematic reviews pertaining to shoulder rotator cuff pathology, these LLMs were tested by providing the same inclusion criteria and comparing the results with original systematic review references, serving as gold standards. The study used 3 key performance metrics: recall, precision, and F1-score, alongside the hallucination rate. Papers were considered “hallucinated” if any 2 of the following information were wrong: title, first author, or year of publication. Results: In total, 11 systematic reviews across 4 fields yielded 33 prompts to LLMs (3 LLMs×11 reviews), with 471 references analyzed. Precision rates for GPT-3.5, GPT-4, and Bard were 9.4% (13/139), 13.4% (16/119), and 0% (0/104) respectively (P<.001). Recall rates were 11.9% (13/109) for GPT-3.5 and 13.7% (15/109) for GPT-4, with Bard failing to retrieve any relevant papers (P<.001). Hallucination rates stood at 39.6% (55/139) for GPT-3.5, 28.6% (34/119) for GPT-4, and 91.4% (95/104) for Bard (P<.001). Further analysis of nonhallucinated papers retrieved by GPT models revealed significant differences in identifying various criteria, such as randomized studies, participant criteria, and intervention criteria. The study also noted the geographical and open-access biases in the papers retrieved by the LLMs. Conclusions: Given their current performance, it is not recommended for LLMs to be deployed as the primary or exclusive tool for conducting systematic reviews. Any references generated by such models warrant thorough validation by researchers. The high occurrence of hallucinations in LLMs highlights the necessity for refining their training and functionality before confidently using them for rigorous academic purposes. 
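The comparative analysis above scores LLM-suggested references against the human-conducted reviews' reference lists using recall, precision, F1-score, and a hallucination rate. The set-based arithmetic behind those metrics can be sketched as follows; the reference identifiers and the verifiability check are simplified placeholders (the study's hallucination criterion combined title, first author, and publication year).

```python
# Illustrative scoring of LLM-suggested references against a gold-standard reference list
gold = {"smith-2019", "lee-2021", "garcia-2020", "chen-2018"}          # references in the human review
suggested = {"smith-2019", "lee-2021", "made-up-2022", "fake-2023"}    # references returned by an LLM
verifiable = {"smith-2019", "lee-2021"}                                # suggestions that actually exist

true_positives = len(gold & suggested)
precision = true_positives / len(suggested)                # share of suggestions found in the gold list
recall = true_positives / len(gold)                        # share of gold references that were retrieved
f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
hallucination_rate = 1 - len(verifiable) / len(suggested)  # share of suggestions that do not exist

print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f} hallucination={hallucination_rate:.2f}")
# precision=0.50 recall=0.50 F1=0.50 hallucination=0.50
```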
%M 38776130 %R 10.2196/53164 %U https://www.jmir.org/2024/1/e53164 %U https://doi.org/10.2196/53164 %U http://www.ncbi.nlm.nih.gov/pubmed/38776130 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54758 %T Utility of Large Language Models for Health Care Professionals and Patients in Navigating Hematopoietic Stem Cell Transplantation: Comparison of the Performance of ChatGPT-3.5, ChatGPT-4, and Bard %A Xue,Elisabetta %A Bracken-Clarke,Dara %A Iannantuono,Giovanni Maria %A Choo-Wosoba,Hyoyoung %A Gulley,James L %A Floudas,Charalampos S %+ Center for Immuno-Oncology, Center for Cancer Research, National Cancer Institute, National Institutes of Health, 9000 Rockville Pike, Building 10, B2L312, Bethesda, MD, 20892, United States, 1 2403518904, elisabetta.xue@nih.gov %K hematopoietic stem cell transplant %K large language models %K chatbot %K chatbots %K stem cell %K large language model %K artificial intelligence %K AI %K medical information %K hematopoietic %K HSCT %K ChatGPT %D 2024 %7 17.5.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence is increasingly being applied to many workflows. Large language models (LLMs) are publicly accessible platforms trained to understand, interact with, and produce human-readable text; their ability to deliver relevant and reliable information is also of particular interest for the health care providers and the patients. Hematopoietic stem cell transplantation (HSCT) is a complex medical field requiring extensive knowledge, background, and training to practice successfully and can be challenging for the nonspecialist audience to comprehend. Objective: We aimed to test the applicability of 3 prominent LLMs, namely ChatGPT-3.5 (OpenAI), ChatGPT-4 (OpenAI), and Bard (Google AI), in guiding nonspecialist health care professionals and advising patients seeking information regarding HSCT. Methods: We submitted 72 open-ended HSCT–related questions of variable difficulty to the LLMs and rated their responses based on consistency—defined as replicability of the response—response veracity, language comprehensibility, specificity to the topic, and the presence of hallucinations. We then rechallenged the 2 best performing chatbots by resubmitting the most difficult questions and prompting to respond as if communicating with either a health care professional or a patient and to provide verifiable sources of information. Responses were then rerated with the additional criterion of language appropriateness, defined as language adaptation for the intended audience. Results: ChatGPT-4 outperformed both ChatGPT-3.5 and Bard in terms of response consistency (66/72, 92%; 54/72, 75%; and 63/69, 91%, respectively; P=.007), response veracity (58/66, 88%; 40/54, 74%; and 16/63, 25%, respectively; P<.001), and specificity to the topic (60/66, 91%; 43/54, 80%; and 27/63, 43%, respectively; P<.001). Both ChatGPT-4 and ChatGPT-3.5 outperformed Bard in terms of language comprehensibility (64/66, 97%; 53/54, 98%; and 52/63, 83%, respectively; P=.002). All displayed episodes of hallucinations. ChatGPT-3.5 and ChatGPT-4 were then rechallenged with a prompt to adapt their language to the audience and to provide source of information, and responses were rated. 
ChatGPT-3.5 showed better ability to adapt its language to nonmedical audience than ChatGPT-4 (17/21, 81% and 10/22, 46%, respectively; P=.03); however, both failed to consistently provide correct and up-to-date information resources, reporting either out-of-date materials, incorrect URLs, or unfocused references, making their output not verifiable by the reader. Conclusions: In conclusion, despite LLMs’ potential capability in confronting challenging medical topics such as HSCT, the presence of mistakes and lack of clear references make them not yet appropriate for routine, unsupervised clinical use, or patient counseling. Implementation of LLMs’ ability to access and to reference current and updated websites and research papers, as well as development of LLMs trained in specialized domain knowledge data sets, may offer potential solutions for their future clinical application. %M 38758582 %R 10.2196/54758 %U https://www.jmir.org/2024/1/e54758 %U https://doi.org/10.2196/54758 %U http://www.ncbi.nlm.nih.gov/pubmed/38758582 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e52095 %T Sample Size Considerations for Fine-Tuning Large Language Models for Named Entity Recognition Tasks: Methodological Study %A Majdik,Zoltan P %A Graham,S Scott %A Shiva Edward,Jade C %A Rodriguez,Sabrina N %A Karnes,Martha S %A Jensen,Jared T %A Barbour,Joshua B %A Rousseau,Justin F %+ Department of Rhetoric & Writing, The University of Texas at Austin, Parlin Hall 29, Mail Code: B5500, Austin, TX, 78712, United States, 1 512 475 9507, ssg@utexas.edu %K named-entity recognition %K large language models %K fine-tuning %K transfer learning %K expert annotation %K annotation %K sample size %K sample %K language model %K machine learning %K natural language processing %K disclosure %K disclosures %K statement %K statements %K conflict of interest %D 2024 %7 16.5.2024 %9 Original Paper %J JMIR AI %G English %X Background: Large language models (LLMs) have the potential to support promising new applications in health informatics. However, practical data on sample size considerations for fine-tuning LLMs to perform specific tasks in biomedical and health policy contexts are lacking. Objective: This study aims to evaluate sample size and sample selection techniques for fine-tuning LLMs to support improved named entity recognition (NER) for a custom data set of conflicts of interest disclosure statements. Methods: A random sample of 200 disclosure statements was prepared for annotation. All “PERSON” and “ORG” entities were identified by each of the 2 raters, and once appropriate agreement was established, the annotators independently annotated an additional 290 disclosure statements. From the 490 annotated documents, 2500 stratified random samples in different size ranges were drawn. The 2500 training set subsamples were used to fine-tune a selection of language models across 2 model architectures (Bidirectional Encoder Representations from Transformers [BERT] and Generative Pre-trained Transformer [GPT]) for improved NER, and multiple regression was used to assess the relationship between sample size (sentences), entity density (entities per sentence [EPS]), and trained model performance (F1-score). Additionally, single-predictor threshold regression models were used to evaluate the possibility of diminishing marginal returns from increased sample size or entity density. Results: Fine-tuned models ranged in topline NER performance from F1-score=0.79 to F1-score=0.96 across architectures. 
Two-predictor multiple linear regression models were statistically significant with multiple R2 ranging from 0.6057 to 0.7896 (all P<.001). EPS and the number of sentences were significant predictors of F1-scores in all cases (P<.001), except for the GPT-2_large model, where EPS was not a significant predictor (P=.184). Model thresholds indicate points of diminishing marginal return from increased training data set sample size measured by the number of sentences, with point estimates ranging from 439 sentences for RoBERTa_large to 527 sentences for GPT-2_large. Likewise, the threshold regression models indicate a diminishing marginal return for EPS with point estimates between 1.36 and 1.38. Conclusions: Relatively modest sample sizes can be used to fine-tune LLMs for NER tasks applied to biomedical text, and training data entity density should representatively approximate entity density in production data. Training data quality and a model architecture’s intended use (text generation vs text processing or classification) may be as important as, or more important than, training data volume and model parameter size. %M 38875593 %R 10.2196/52095 %U https://ai.jmir.org/2024/1/e52095 %U https://doi.org/10.2196/52095 %U http://www.ncbi.nlm.nih.gov/pubmed/38875593 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e55898 %T Assessing the Application of Large Language Models in Generating Dermatologic Patient Education Materials According to Reading Level: Qualitative Study %A Lambert,Raphaella %A Choo,Zi-Yi %A Gradwohl,Kelsey %A Schroedl,Liesl %A Ruiz De Luzuriaga,Arlene %+ Pritzker School of Medicine, University of Chicago, 924 East 57th Street #104, Chicago, IL, 60637, United States, 1 7737021937, aleksalambert@uchicagomedicine.org %K artificial intelligence %K large language models %K large language model %K LLM %K LLMs %K machine learning %K natural language processing %K deep learning %K ChatGPT %K health literacy %K health knowledge %K health information %K patient education %K dermatology %K dermatologist %K dermatologists %K derm %K dermatology resident %K dermatology residents %K dermatologic patient education material %K dermatologic patient education materials %K patient education material %K patient education materials %K education material %K education materials %D 2024 %7 16.5.2024 %9 Original Paper %J JMIR Dermatol %G English %X Background: Dermatologic patient education materials (PEMs) are often written above the national average seventh- to eighth-grade reading level. ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT are large language models (LLMs) that are responsive to user prompts. Our project assesses their use in generating dermatologic PEMs at specified reading levels. Objective: This study aims to assess the ability of select LLMs to generate PEMs for common and rare dermatologic conditions at unspecified and specified reading levels. Further, the study aims to assess the preservation of meaning across such LLM-generated PEMs, as assessed by dermatology resident trainees. Methods: The Flesch-Kincaid reading level (FKRL) of current American Academy of Dermatology PEMs was evaluated for 4 common (atopic dermatitis, acne vulgaris, psoriasis, and herpes zoster) and 4 rare (epidermolysis bullosa, bullous pemphigoid, lamellar ichthyosis, and lichen planus) dermatologic conditions. 
We prompted ChatGPT-3.5, GPT-4, DermGPT, and DocsGPT to “Create a patient education handout about [condition] at a [FKRL]” to iteratively generate 10 PEMs per condition at unspecified fifth- and seventh-grade FKRLs, evaluated with Microsoft Word readability statistics. The preservation of meaning across LLMs was assessed by 2 dermatology resident trainees. Results: The current American Academy of Dermatology PEMs had an average (SD) FKRL of 9.35 (1.26) and 9.50 (2.3) for common and rare diseases, respectively. For common diseases, the FKRLs of LLM-produced PEMs ranged between 9.8 and 11.21 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). For rare diseases, the FKRLs of LLM-produced PEMs ranged between 9.85 and 11.45 (unspecified prompt), between 4.22 and 7.43 (fifth-grade prompt), and between 5.98 and 7.28 (seventh-grade prompt). At the fifth-grade reading level, GPT-4 was better at producing PEMs for both common and rare conditions than ChatGPT-3.5 (P=.001 and P=.01, respectively), DermGPT (P<.001 and P=.03, respectively), and DocsGPT (P<.001 and P=.02, respectively). At the seventh-grade reading level, no significant difference was found between ChatGPT-3.5, GPT-4, DocsGPT, or DermGPT in producing PEMs for common conditions (all P>.05); however, for rare conditions, ChatGPT-3.5 and DocsGPT outperformed GPT-4 (P=.003 and P<.001, respectively). The preservation of meaning analysis revealed that for common conditions, DermGPT ranked the highest for overall ease of reading, patient understandability, and accuracy (14.75/15, 98%); for rare conditions, handouts generated by GPT-4 ranked the highest (14.5/15, 97%). Conclusions: GPT-4 appeared to outperform ChatGPT-3.5, DocsGPT, and DermGPT at the fifth-grade FKRL for both common and rare conditions, although both ChatGPT-3.5 and DocsGPT performed better than GPT-4 at the seventh-grade FKRL for rare conditions. LLM-produced PEMs may reliably meet seventh-grade FKRLs for select common and rare dermatologic conditions and are easy to read, understandable for patients, and mostly accurate. LLMs may play a role in enhancing health literacy and disseminating accessible, understandable PEMs in dermatology. %M 38754096 %R 10.2196/55898 %U https://derma.jmir.org/2024/1/e55898 %U https://doi.org/10.2196/55898 %U http://www.ncbi.nlm.nih.gov/pubmed/38754096 %0 Journal Article %@ 2291-9694 %I %V 12 %N %P e51187 %T The Use of Generative AI for Scientific Literature Searches for Systematic Reviews: ChatGPT and Microsoft Bing AI Performance Evaluation %A Gwon,Yong Nam %A Kim,Jae Heon %A Chung,Hyun Soo %A Jung,Eun Jee %A Chun,Joey %A Lee,Serin %A Shim,Sung Ryul %K artificial intelligence %K search engine %K systematic review %K evidence-based medicine %K ChatGPT %K language model %K education %K tool %K clinical decision support system %K decision support %K support %K treatment %D 2024 %7 14.5.2024 %9 %J JMIR Med Inform %G English %X Background: A large language model is a type of artificial intelligence (AI) model that opens up great possibilities for health care practice, research, and education, although scholars have emphasized the need to proactively address the issue of unvalidated and inaccurate information regarding its use. One of the best-known large language models is ChatGPT (OpenAI). 
It is believed to be of great help to medical research, as it facilitates more efficient data set analysis, code generation, and literature review, allowing researchers to focus on experimental design as well as drug discovery and development. Objective: This study aims to explore the potential of ChatGPT as a real-time literature search tool for systematic reviews and clinical decision support systems, to enhance their efficiency and accuracy in health care settings. Methods: The search results of a published systematic review by human experts on the treatment of Peyronie disease were selected as a benchmark, and the literature search formula of the study was applied to ChatGPT and Microsoft Bing AI as a comparison to human researchers. Peyronie disease typically presents with discomfort, curvature, or deformity of the penis in association with palpable plaques and erectile dysfunction. To evaluate the quality of individual studies derived from AI answers, we created a structured rating system based on bibliographic information related to the publications. We classified its answers into 4 grades if the title existed: A, B, C, and F. No grade was given for a fake title or no answer. Results: From ChatGPT, 7 (0.5%) out of 1287 identified studies were directly relevant, whereas Bing AI resulted in 19 (40%) relevant studies out of 48, compared to the human benchmark of 24 studies. In the qualitative evaluation, ChatGPT had 7 grade A, 18 grade B, 167 grade C, and 211 grade F studies, and Bing AI had 19 grade A and 28 grade C studies. Conclusions: This is the first study to compare AI and conventional human systematic review methods as a real-time literature collection tool for evidence-based medicine. The results suggest that the use of ChatGPT as a tool for real-time evidence generation is not yet accurate and feasible. Therefore, researchers should be cautious about using such AI. The limitations of this study using the generative pre-trained transformer model are that the search for research topics was not diverse and that it did not prevent the hallucination of generative AI. However, this study will serve as a standard for future studies by providing an index to verify the reliability and consistency of generative AI from a user’s point of view. If the reliability and consistency of AI literature search services are verified, then the use of these technologies will help medical research greatly. %R 10.2196/51187 %U https://medinform.jmir.org/2024/1/e51187 %U https://doi.org/10.2196/51187 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53724 %T Evaluating the Diagnostic Performance of Large Language Models on Complex Multimodal Medical Cases %A Chiu,Wan Hang Keith %A Ko,Wei Sum Koel %A Cho,William Chi Shing %A Hui,Sin Yu Joanne %A Chan,Wing Chi Lawrence %A Kuo,Michael D %+ Ensemble Group, 10541 E Firewheel Drive, Scottsdale, AZ, 85259, United States, 1 4084512341, mikedkuo@gmail.com %K large language model %K hospital %K health center %K Massachusetts %K statistical analysis %K chi-square %K ANOVA %K clinician %K physician %K performance %K proficiency %K disease etiology %D 2024 %7 13.5.2024 %9 Research Letter %J J Med Internet Res %G English %X Large language models showed interpretative reasoning in solving diagnostically challenging medical cases. 
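The ChatGPT and Microsoft Bing AI literature-search study summarized earlier on this record grades each AI-returned citation as A, B, C, or F (with no grade for a fabricated title or no answer) and compares the count of directly relevant studies with a human-conducted benchmark. The Python sketch below illustrates that tallying step under the simplifying assumption that grade A corresponds to "directly relevant"; the grades and counts shown are invented, not the study's data.

from collections import Counter

def summarize_grades(grades, human_benchmark_count):
    # grades: one entry per AI-returned citation; None marks a fabricated
    # title or no answer. Grade A is treated here as directly relevant.
    counts = Counter(grades)
    relevant = counts.get("A", 0)
    return {"grade_counts": dict(counts),
            "directly_relevant": relevant,
            "relevant_share": relevant / len(grades) if grades else 0.0,
            "human_benchmark": human_benchmark_count}

# Toy usage with invented grades:
print(summarize_grades(["A", "A", "B", "C", "C", "F", None], human_benchmark_count=24))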
%M 38739441 %R 10.2196/53724 %U https://www.jmir.org/2024/1/e53724 %U https://doi.org/10.2196/53724 %U http://www.ncbi.nlm.nih.gov/pubmed/38739441 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52399 %T Potential of Large Language Models in Health Care: Delphi Study %A Denecke,Kerstin %A May,Richard %A , %A Rivera Romero,Octavio %+ Bern University of Applied Sciences, Quallgasse 21, Biel, 2502, Switzerland, 41 323216794, kerstin.denecke@bfh.ch %K large language models %K LLMs %K health care %K Delphi study %K natural language processing %K NLP %K artificial intelligence %K language model %K Delphi %K future %K innovation %K interview %K interviews %K informatics %K experience %K experiences %K attitude %K attitudes %K opinion %K perception %K perceptions %K perspective %K perspectives %K implementation %D 2024 %7 13.5.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: A large language model (LLM) is a machine learning model inferred from text data that captures subtle patterns of language use in context. Modern LLMs are based on neural network architectures that incorporate transformer methods. They allow the model to relate words together through attention to multiple words in a text sequence. LLMs have been shown to be highly effective for a range of tasks in natural language processing (NLP), including classification and information extraction tasks and generative applications. Objective: The aim of this adapted Delphi study was to collect researchers’ opinions on how LLMs might influence health care and on the strengths, weaknesses, opportunities, and threats of LLM use in health care. Methods: We invited researchers in the fields of health informatics, nursing informatics, and medical NLP to share their opinions on LLM use in health care. We started the first round with open questions based on our strengths, weaknesses, opportunities, and threats framework. In the second and third round, the participants scored these items. Results: The first, second, and third rounds had 28, 23, and 21 participants, respectively. Almost all participants (26/28, 93% in round 1 and 20/21, 95% in round 3) were affiliated with academic institutions. Agreement was reached on 103 items related to use cases, benefits, risks, reliability, adoption aspects, and the future of LLMs in health care. Participants offered several use cases, including supporting clinical tasks, documentation tasks, and medical research and education, and agreed that LLM-based systems will act as health assistants for patient education. The agreed-upon benefits included increased efficiency in data handling and extraction, improved automation of processes, improved quality of health care services and overall health outcomes, provision of personalized care, accelerated diagnosis and treatment processes, and improved interaction between patients and health care professionals. In total, 5 risks to health care in general were identified: cybersecurity breaches, the potential for patient misinformation, ethical concerns, the likelihood of biased decision-making, and the risk associated with inaccurate communication. Overconfidence in LLM-based systems was recognized as a risk to the medical profession. 
The 6 agreed-upon privacy risks included the use of unregulated cloud services that compromise data security, exposure of sensitive patient data, breaches of confidentiality, fraudulent use of information, vulnerabilities in data storage and communication, and inappropriate access or use of patient data. Conclusions: Future research related to LLMs should not only focus on testing their possibilities for NLP-related tasks but also consider the workflows the models could contribute to and the requirements regarding quality, integration, and regulations needed for successful implementation in practice. %M 38739445 %R 10.2196/52399 %U https://www.jmir.org/2024/1/e52399 %U https://doi.org/10.2196/52399 %U http://www.ncbi.nlm.nih.gov/pubmed/38739445 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e53787 %T The Role of Large Language Models in Transforming Emergency Medicine: Scoping Review %A Preiksaitis,Carl %A Ashenburg,Nicholas %A Bunney,Gabrielle %A Chu,Andrew %A Kabeer,Rana %A Riley,Fran %A Ribeira,Ryan %A Rose,Christian %+ Department of Emergency Medicine, Stanford University School of Medicine, 900 Welch Road, Suite 350, Palo Alto, CA, 94304, United States, 1 650 723 6576, cpreiksaitis@stanford.edu %K large language model %K LLM %K emergency medicine %K clinical decision support %K workflow efficiency %K medical education %K artificial intelligence %K AI %K natural language processing %K NLP %K AI literacy %K ChatGPT %K Bard %K Pathways Language Model %K Med-PaLM %K Bidirectional Encoder Representations from Transformers %K BERT %K generative pretrained transformer %K GPT %K United States %K US %K China %K scoping review %K Preferred Reporting Items for Systematic Reviews and Meta-Analyses %K PRISMA %K decision support %K workflow efficiency %K risk %K ethics %K education %K communication %K medical training %K physician %K health literacy %K emergency care %D 2024 %7 10.5.2024 %9 Review %J JMIR Med Inform %G English %X Background: Artificial intelligence (AI), more specifically large language models (LLMs), holds significant potential in revolutionizing emergency care delivery by optimizing clinical workflows and enhancing the quality of decision-making. Although enthusiasm for integrating LLMs into emergency medicine (EM) is growing, the existing literature is characterized by a disparate collection of individual studies, conceptual analyses, and preliminary implementations. Given these complexities and gaps in understanding, a cohesive framework is needed to comprehend the existing body of knowledge on the application of LLMs in EM. Objective: Given the absence of a comprehensive framework for exploring the roles of LLMs in EM, this scoping review aims to systematically map the existing literature on LLMs’ potential applications within EM and identify directions for future research. Addressing this gap will allow for informed advancements in the field. Methods: Using PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) criteria, we searched Ovid MEDLINE, Embase, Web of Science, and Google Scholar for papers published between January 2018 and August 2023 that discussed LLMs’ use in EM. We excluded other forms of AI. A total of 1994 unique titles and abstracts were screened, and each full-text paper was independently reviewed by 2 authors. Data were abstracted independently, and 5 authors performed a collaborative quantitative and qualitative synthesis of the data. Results: A total of 43 papers were included. 
Studies were predominantly from 2022 to 2023 and conducted in the United States and China. We uncovered four major themes: (1) clinical decision-making and support was highlighted as a pivotal area, with LLMs playing a substantial role in enhancing patient care, notably through their application in real-time triage, allowing early recognition of patient urgency; (2) efficiency, workflow, and information management demonstrated the capacity of LLMs to significantly boost operational efficiency, particularly through the automation of patient record synthesis, which could reduce administrative burden and enhance patient-centric care; (3) risks, ethics, and transparency were identified as areas of concern, especially regarding the reliability of LLMs’ outputs, and specific studies highlighted the challenges of ensuring unbiased decision-making amidst potentially flawed training data sets, stressing the importance of thorough validation and ethical oversight; and (4) education and communication possibilities included LLMs’ capacity to enrich medical training, such as through using simulated patient interactions that enhance communication skills. Conclusions: LLMs have the potential to fundamentally transform EM, enhancing clinical decision-making, optimizing workflows, and improving patient outcomes. This review sets the stage for future advancements by identifying key research areas: prospective validation of LLM applications, establishing standards for responsible use, understanding provider and patient perceptions, and improving physicians’ AI literacy. Effective integration of LLMs into EM will require collaborative efforts and thorough evaluation to ensure these technologies can be safely and effectively applied. %M 38728687 %R 10.2196/53787 %U https://medinform.jmir.org/2024/1/e53787 %U https://doi.org/10.2196/53787 %U http://www.ncbi.nlm.nih.gov/pubmed/38728687 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e51346 %T ChatGPT as a Tool for Medical Education and Clinical Decision-Making on the Wards: Case Study %A Skryd,Anthony %A Lawrence,Katharine %+ Department of Medicine, NYU Langone Health, 550 1st Avenue, New York City, NY, 10016, United States, 1 646 929 7800, anthony.skryd@nyulangone.org %K ChatGPT %K medical education %K large language models %K LLMs %K clinical decision-making %D 2024 %7 8.5.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Large language models (LLMs) are computational artificial intelligence systems with advanced natural language processing capabilities that have recently been popularized among health care students and educators due to their ability to provide real-time access to a vast amount of medical knowledge. The adoption of LLM technology into medical education and training has varied, and little empirical evidence exists to support its use in clinical teaching environments. Objective: The aim of the study is to identify and qualitatively evaluate potential use cases and limitations of LLM technology for real-time ward-based educational contexts. Methods: A brief, single-site exploratory evaluation of the publicly available ChatGPT-3.5 (OpenAI) was conducted by implementing the tool into the daily attending rounds of a general internal medicine inpatient service at a large urban academic medical center. ChatGPT was integrated into rounds via both structured and organic use, using the web-based “chatbot” style interface to interact with the LLM through conversational free-text and discrete queries. 
A qualitative approach using phenomenological inquiry was used to identify key insights related to the use of ChatGPT through analysis of ChatGPT conversation logs and associated shorthand notes from the clinical sessions. Results: Identified use cases for ChatGPT integration included addressing medical knowledge gaps through discrete medical knowledge inquiries, building differential diagnoses and engaging dual-process thinking, challenging medical axioms, using cognitive aids to support acute care decision-making, and improving complex care management by facilitating conversations with subspecialties. Potential additional uses included engaging in difficult conversations with patients, exploring ethical challenges and general medical ethics teaching, personal continuing medical education resources, developing ward-based teaching tools, supporting and automating clinical documentation, and supporting productivity and task management. LLM biases, misinformation, ethics, and health equity were identified as areas of concern and potential limitations to clinical and training use. A code of conduct on ethical and appropriate use was also developed to guide team usage on the wards. Conclusions: Overall, ChatGPT offers a novel tool to enhance ward-based learning through rapid information querying, second-order content exploration, and engaged team discussion regarding generated responses. More research is needed to fully understand contexts for educational use, particularly regarding the risks and limitations of the tool in clinical settings and its impacts on trainee development. %M 38717811 %R 10.2196/51346 %U https://formative.jmir.org/2024/1/e51346 %U https://doi.org/10.2196/51346 %U http://www.ncbi.nlm.nih.gov/pubmed/38717811 %0 Journal Article %@ 2563-3570 %I JMIR Publications %V 5 %N %P e52700 %T ChatGPT and Medicine: Together We Embrace the AI Renaissance %A Hacking,Sean %+ NYU Langone, Tisch Hospital, 560 First Avenue, Suite TH 461, New York, NY, 10016, United States, 1 6466836133, hackingsean1@gmail.com %K ChatGPT %K generative AI %K NLP %K medicine %K bioinformatics %K AI democratization %K AI renaissance %K artificial intelligence %K natural language processing %D 2024 %7 7.5.2024 %9 Editorial %J JMIR Bioinform Biotech %G English %X The generative artificial intelligence (AI) model ChatGPT holds transformative prospects in medicine. The development of such models has signaled the beginning of a new era where complex biological data can be made more accessible and interpretable. ChatGPT is a natural language processing tool that can process, interpret, and summarize vast data sets. It can serve as a digital assistant for physicians and researchers, aiding in integrating medical imaging data with other multiomics data and facilitating the understanding of complex biological systems. The physician’s and AI’s viewpoints emphasize the value of such AI models in medicine, providing tangible examples of how this could enhance patient care. The editorial also discusses the rise of generative AI, highlighting its substantial impact in democratizing AI applications for modern medicine. While AI may not supersede health care professionals, practitioners incorporating AI into their practices could potentially have a competitive edge. 
%M 38935938 %R 10.2196/52700 %U https://bioinform.jmir.org/2024/1/e52700 %U https://doi.org/10.2196/52700 %U http://www.ncbi.nlm.nih.gov/pubmed/38935938 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e46036 %T Effectiveness of an Artificial Intelligence-Assisted App for Improving Eating Behaviors: Mixed Methods Evaluation %A Chew,Han Shi Jocelyn %A Chew,Nicholas WS %A Loong,Shaun Seh Ern %A Lim,Su Lin %A Tam,Wai San Wilson %A Chin,Yip Han %A Chao,Ariana M %A Dimitriadis,Georgios K %A Gao,Yujia %A So,Jimmy Bok Yan %A Shabbir,Asim %A Ngiam,Kee Yuan %+ Alice Lee Centre for Nursing Studies, Yong Loo Lin School of Medicine, National University of Singapore, Level 3, Clinical Research Centre, Block MD11, 10 Medical Drive, Singapore, 117597, Singapore, 65 65168687, jocelyn.chew.hs@nus.edu.sg %K artificial intelligence %K chatbot %K chatbots %K weight %K overweight %K eating %K food %K weight loss %K mHealth %K mobile health %K app %K apps %K applications %K self-regulation %K self-monitoring %K anxiety %K depression %K consideration of future consequences %K mental health %K conversational agent %K conversational agents %K eating behavior %K healthy eating %K food consumption %K obese %K obesity %K diet %K dietary %D 2024 %7 7.5.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: A plethora of weight management apps are available, but many individuals, especially those living with overweight and obesity, still struggle to achieve adequate weight loss. An emerging area in weight management is the support for one’s self-regulation over momentary eating impulses. Objective: This study aims to examine the feasibility and effectiveness of a novel artificial intelligence–assisted weight management app in improving eating behaviors in a Southeast Asian cohort. Methods: A single-group pretest-posttest study was conducted. Participants completed the 1-week run-in period of a 12-week app-based weight management program called the Eating Trigger-Response Inhibition Program (eTRIP). This self-monitoring system was built upon 3 main components, namely, (1) chatbot-based check-ins on eating lapse triggers, (2) food-based computer vision image recognition (system built based on local food items), and (3) automated time-based nudges and meal stopwatch. At every mealtime, participants were prompted to take a picture of their food items, which were identified by a computer vision image recognition technology, thereby triggering a set of chatbot-initiated questions on eating triggers such as who the users were eating with. Paired 2-sided t tests were used to compare the differences in the psychobehavioral constructs before and after the 7-day program, including overeating habits, snacking habits, consideration of future consequences, self-regulation of eating behaviors, anxiety, depression, and physical activity. Qualitative feedback were analyzed by content analysis according to 4 steps, namely, decontextualization, recontextualization, categorization, and compilation. Results: The mean age, self-reported BMI, and waist circumference of the participants were 31.25 (SD 9.98) years, 28.86 (SD 7.02) kg/m2, and 92.60 (SD 18.24) cm, respectively. There were significant improvements in all the 7 psychobehavioral constructs, except for anxiety. 
After adjusting for multiple comparisons, statistically significant improvements were found for overeating habits (mean –0.32, SD 1.16; P<.001), snacking habits (mean –0.22, SD 1.12; P<.002), self-regulation of eating behavior (mean 0.08, SD 0.49; P=.007), depression (mean –0.12, SD 0.74; P=.007), and physical activity (mean 1288.60, SD 3055.20 metabolic equivalent task-min/day; P<.001). Forty-one participants reported skipping at least 1 meal (ie, breakfast, lunch, or dinner), summing to 578 (67.1%) of the 862 meals skipped. Of the 230 participants, 80 (34.8%) provided textual feedback that indicated satisfactory user experience with eTRIP. Four themes emerged, namely, (1) becoming more mindful of self-monitoring, (2) personalized reminders with prompts and chatbot, (3) food logging with image recognition, and (4) engaging with a simple, easy, and appealing user interface. The attrition rate was 8.4% (21/251). Conclusions: eTRIP is a feasible and effective weight management program to be tested in a larger population for its effectiveness and sustainability as a personalized weight management program for people with overweight and obesity. Trial Registration: ClinicalTrials.gov NCT04833803; https://classic.clinicaltrials.gov/ct2/show/NCT04833803 %M 38713909 %R 10.2196/46036 %U https://www.jmir.org/2024/1/e46036 %U https://doi.org/10.2196/46036 %U http://www.ncbi.nlm.nih.gov/pubmed/38713909 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e55559 %T Development and Pilot-Testing of an Optimized Conversational Agent or “Chatbot” for Peruvian Adolescents Living With HIV to Facilitate Mental Health Screening, Education, Self-Help, and Linkage to Care: Protocol for a Mixed Methods, Community-Engaged Study %A Galea,Jerome T %A Vasquez,Diego H %A Rupani,Neil %A Gordon,Moya B %A Tapia,Milagros %A Greene,Karah Y %A Kolevic,Lenka %A Franke,Molly F %A Contreras,Carmen %+ School of Social Work, College of Behavioral and Community Sciences, University of South Florida, 13301 Bruce B Downs Boulevard, MHC 1400, Tampa, FL, 33612-3807, United States, 1 813 974 2310, jeromegalea@usf.edu %K chatbot %K digital assistant %K depression %K HIV %K adolescents %D 2024 %7 7.5.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Adolescents living with HIV are disproportionally affected by depression, which worsens antiretroviral therapy adherence, increases viral load, and doubles the risk of mortality. Because most adolescents living with HIV live in low- and middle-income countries, few receive depression treatment due to a lack of mental health services and specialists in low-resource settings. Chatbot technology, used increasingly in health service delivery, is a promising approach for delivering low-intensity depression care to adolescents living with HIV in resource-constrained settings. Objective: The goal of this study is to develop and pilot-test for the feasibility and acceptability of a prototype, optimized conversational agent (chatbot) to provide mental health education, self-help skills, and care linkage for adolescents living with HIV. Methods: Chatbot development comprises 3 phases conducted over 2 years. In the first phase (year 1), formative research will be conducted to understand the views, opinions, and preferences of up to 48 youths aged 10-19 years (6 focus groups of up to 8 adolescents living with HIV per group), their caregivers (5 in-depth interviews), and HIV program personnel (5 in-depth interviews) regarding depression among adolescents living with HIV. 
We will also investigate the perceived acceptability of a mental health chatbot, including barriers and facilitators to accessing and using a chatbot for depression care by adolescents living with HIV. In the second phase (year 1), we will iteratively program a chatbot using the SmartBot360 software with successive versions (0.1, 0.2, and 0.3), meeting regularly with a Youth Advisory Board comprised of adolescents living with HIV who will guide and inform the chatbot development and content to arrive at a prototype version (version 1.0) for pilot-testing. In the third phase (year 2), we will pilot-test the prototype chatbot among 50 adolescents living with HIV naïve to its development. Participants will interact with the chatbot for up to 2 weeks, and data will be collected on the acceptability of the chatbot-delivered depression education and self-help strategies, depression knowledge changes, and intention to seek care linkage. Results: The study was awarded in April 2022, received institutional review board approval in November 2022, received funding in December 2022, and commenced recruitment in March 2023. By the completion of study phases 1 and 2, we expect our chatbot to incorporate key needs and preferences gathered from focus groups and interviews to develop the chatbot. By the completion of study phase 3, we will have assessed the feasibility and acceptability of the prototype chatbot. Study phase 3 began in April 2024. Final results are expected by January 2025 and published thereafter. Conclusions: The study will produce a prototype mental health chatbot developed with and for adolescents living with HIV that will be ready for efficacy testing in a subsequent, larger study. International Registered Report Identifier (IRRID): DERR1-10.2196/55559 %M 38713501 %R 10.2196/55559 %U https://www.researchprotocols.org/2024/1/e55559 %U https://doi.org/10.2196/55559 %U http://www.ncbi.nlm.nih.gov/pubmed/38713501 %0 Journal Article %@ 2561-7605 %I %V 7 %N %P e53019 %T Assessing the Quality of ChatGPT Responses to Dementia Caregivers’ Questions: Qualitative Analysis %A Aguirre,Alyssa %A Hilsabeck,Robin %A Smith,Tawny %A Xie,Bo %A He,Daqing %A Wang,Zhendong %A Zou,Ning %K Alzheimer’s disease %K information technology %K social media %K neurology %K dementia %K Alzheimer disease %K caregiver %K ChatGPT %D 2024 %7 6.5.2024 %9 %J JMIR Aging %G English %X Background: Artificial intelligence (AI) such as ChatGPT by OpenAI holds great promise to improve the quality of life of patients with dementia and their caregivers by providing high-quality responses to their questions about typical dementia behaviors. So far, however, evidence on the quality of such ChatGPT responses is limited. A few recent publications have investigated the quality of ChatGPT responses in other health conditions. Our study is the first to assess ChatGPT using real-world questions asked by dementia caregivers themselves. Objectives: This pilot study examines the potential of ChatGPT-3.5 to provide high-quality information that may enhance dementia care and patient-caregiver education. Methods: Our interprofessional team used a formal rating scale (scoring range: 0-5; the higher the score, the better the quality) to evaluate ChatGPT responses to real-world questions posed by dementia caregivers. We selected 60 posts by dementia caregivers from Reddit, a popular social media platform. 
These posts were verified by 3 interdisciplinary dementia clinicians as representing dementia caregivers’ desire for information in the areas of memory loss and confusion, aggression, and driving. Word count for posts in the memory loss and confusion category ranged from 71 to 531 (mean 218; median 188), aggression posts ranged from 58 to 602 words (mean 254; median 200), and driving posts ranged from 93 to 550 words (mean 272; median 276). Results: ChatGPT’s response quality scores ranged from 3 to 5. Of the 60 responses, 26 (43%) received 5 points, 21 (35%) received 4 points, and 13 (22%) received 3 points, suggesting high quality. ChatGPT obtained consistently high scores in synthesizing information to provide follow-up recommendations (n=58, 96%), with the lowest scores in the area of comprehensiveness (n=38, 63%). Conclusions: ChatGPT provided high-quality responses to complex questions posted by dementia caregivers, but it did have limitations. ChatGPT was unable to anticipate future problems that a human professional might recognize and address in a clinical encounter. At other times, ChatGPT recommended a strategy that the caregiver had already explicitly tried. This pilot study indicates the potential of AI to provide high-quality information to enhance dementia care and patient-caregiver education in tandem with information provided by licensed health care professionals. Evaluating the quality of responses is necessary to ensure that caregivers can make informed decisions. ChatGPT has the potential to transform health care practice by shaping how caregivers receive health information. %R 10.2196/53019 %U https://aging.jmir.org/2024/1/e53019 %U https://doi.org/10.2196/53019 %0 Journal Article %@ 2291-5222 %I JMIR Publications %V 12 %N %P e57978 %T The Evaluation of Generative AI Should Include Repetition to Assess Stability %A Zhu,Lingxuan %A Mou,Weiming %A Hong,Chenglin %A Yang,Tao %A Lai,Yancheng %A Qi,Chang %A Lin,Anqi %A Zhang,Jian %A Luo,Peng %+ Department of Oncology, Zhujiang Hospital, Southern Medical University, 253 Industrial Avenue, Guangzhou, China, 86 020 61643888, luopeng@smu.edu.cn %K large language model %K generative AI %K ChatGPT %K artificial intelligence %K health care %D 2024 %7 6.5.2024 %9 Commentary %J JMIR Mhealth Uhealth %G English %X The increasing interest in the potential applications of generative artificial intelligence (AI) models like ChatGPT in health care has prompted numerous studies to explore its performance in various medical contexts. However, evaluating ChatGPT poses unique challenges due to the inherent randomness in its responses. Unlike traditional AI models, ChatGPT generates different responses for the same input, making it imperative to assess its stability through repetition. This commentary highlights the importance of including repetition in the evaluation of ChatGPT to ensure the reliability of conclusions drawn from its performance. Similar to biological experiments, which often require multiple repetitions for validity, we argue that assessing generative AI models like ChatGPT demands a similar approach. Failure to acknowledge the impact of repetition can lead to biased conclusions and undermine the credibility of research findings. We urge researchers to incorporate appropriate repetition in their studies from the outset and transparently report their methods to enhance the robustness and reproducibility of findings in this rapidly evolving field. 
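The commentary above argues that evaluations of generative AI should repeat the same prompt and report how stable the responses are. The Python sketch below illustrates that repetition-and-agreement idea in minimal form; ask_model is a hypothetical stub standing in for whatever chat API an evaluation would actually call, and the candidate answers are invented.

import random
from collections import Counter

def ask_model(prompt):
    # Hypothetical stand-in for a real chat-model API call; a genuine
    # evaluation would send the prompt to the model and return its answer.
    return random.choice(["Answer A", "Answer A", "Answer B"])

def stability(prompt, n_repeats=10):
    # Repeat the identical prompt and summarize how consistent the answers are.
    answers = [ask_model(prompt) for _ in range(n_repeats)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {"answer_counts": dict(counts),
            "majority_answer": top_answer,
            "agreement_rate": top_count / n_repeats}

print(stability("Is the described finding reproducible across repeated runs?"))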
%M 38688841 %R 10.2196/57978 %U https://mhealth.jmir.org/2024/1/e57978 %U https://doi.org/10.2196/57978 %U http://www.ncbi.nlm.nih.gov/pubmed/38688841 %0 Journal Article %@ 2291-5222 %I JMIR Publications %V 12 %N %P e51526 %T Assessing the Efficacy of ChatGPT Versus Human Researchers in Identifying Relevant Studies on mHealth Interventions for Improving Medication Adherence in Patients With Ischemic Stroke When Conducting Systematic Reviews: Comparative Analysis %A Ruksakulpiwat,Suebsarn %A Phianhasin,Lalipat %A Benjasirisan,Chitchanok %A Ding,Kedong %A Ajibade,Anuoluwapo %A Kumar,Ayanesh %A Stewart,Cassie %+ Department of Medical Nursing, Faculty of Nursing, Mahidol University, 2 Wang Lang Road, Siriraj, Bangkok Noi, Bangkok, 10700, Thailand, 66 984782692, suebsarn25@gmail.com %K ChatGPT %K systematic reviews %K medication adherence %K mobile health %K mHealth %K ischemic stroke %K mobile phone %D 2024 %7 6.5.2024 %9 Original Paper %J JMIR Mhealth Uhealth %G English %X Background: ChatGPT by OpenAI emerged as a potential tool for researchers, aiding in various aspects of research. One such application was the identification of relevant studies in systematic reviews. However, a comprehensive comparison of the efficacy of relevant study identification between human researchers and ChatGPT has not been conducted. Objective: This study aims to compare the efficacy of ChatGPT and human researchers in identifying relevant studies on medication adherence improvement using mobile health interventions in patients with ischemic stroke during systematic reviews. Methods: This study used the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines. Four electronic databases, including CINAHL Plus with Full Text, Web of Science, PubMed, and MEDLINE, were searched to identify articles published from inception until 2023 using search terms based on MeSH (Medical Subject Headings) terms generated by human researchers versus ChatGPT. The authors independently screened the titles, abstracts, and full text of the studies identified through separate searches conducted by human researchers and ChatGPT. The comparison encompassed several aspects, including the ability to retrieve relevant studies, accuracy, efficiency, limitations, and challenges associated with each method. Results: A total of 6 articles identified through search terms generated by human researchers were included in the final analysis, of which 4 (67%) reported improvements in medication adherence after the intervention. However, 33% (2/6) of the included studies did not clearly state whether medication adherence improved after the intervention. A total of 10 studies were included based on search terms generated by ChatGPT, of which 6 (60%) overlapped with studies identified by human researchers. Regarding the impact of mobile health interventions on medication adherence, most included studies (8/10, 80%) based on search terms generated by ChatGPT reported improvements in medication adherence after the intervention. However, 20% (2/10) of the studies did not clearly state whether medication adherence improved after the intervention. The precision in accurately identifying relevant studies was higher in human researchers (0.86) than in ChatGPT (0.77). This is consistent with the percentage of relevance, where human researchers (9.8%) demonstrated a higher percentage of relevance than ChatGPT (3%). 
However, when considering the time required for both humans and ChatGPT to identify relevant studies, ChatGPT substantially outperformed human researchers as it took less time to identify relevant studies. Conclusions: Our comparative analysis highlighted the strengths and limitations of both approaches. Ultimately, the choice between human researchers and ChatGPT depends on the specific requirements and objectives of each review, but the collaborative synergy of both approaches holds the potential to advance evidence-based research and decision-making in the health care field. %M 38710069 %R 10.2196/51526 %U https://mhealth.jmir.org/2024/1/e51526 %U https://doi.org/10.2196/51526 %U http://www.ncbi.nlm.nih.gov/pubmed/38710069 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e52145 %T A Factorial Randomized Controlled Trial to Optimize User Engagement With a Chatbot-Led Parenting Intervention: Protocol for the ParentText Optimisation Trial %A Ambrosio,Maria Da Graca %A Lachman,Jamie M %A Zinzer,Paula %A Gwebu,Hlengiwe %A Vyas,Seema %A Vallance,Inge %A Calderon,Francisco %A Gardner,Frances %A Markle,Laurie %A Stern,David %A Facciola,Chiara %A Schley,Anne %A Danisa,Nompumelelo %A Brukwe,Kanyisile %A Melendez-Torres,GJ %+ University of Oxford, Barnett House, 32-37 Wellington Square, Oxford, OX1 2ER, United Kingdom, 44 (0)1865270325, maria.ambrosio@wolfson.ox.ac.uk %K parenting intervention %K chatbot-led public health intervention %K engagement %K implementation science %K mobile phone %D 2024 %7 3.5.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Violence against children (VAC) is a serious public health concern with long-lasting adverse effects. Evidence-based parenting programs are one effective means to prevent VAC; however, these interventions are not scalable in their typical in-person group format, especially in low- and middle-income countries where the need is greatest. While digital delivery, including via chatbots, offers a scalable and cost-effective means to scale up parenting programs within these settings, it is crucial to understand the key pillars of user engagement to ensure their effective implementation. Objective: This study aims to investigate the most effective and cost-effective combination of external components to optimize user engagement with ParentText, an open-source chatbot-led parenting intervention to prevent VAC in Mpumalanga, South Africa. Methods: This study will use a mixed methods design incorporating a 2 × 2 factorial cluster-randomized controlled trial and qualitative interviews. Parents of adolescent girls (32 clusters, 120 participants [60 parents and 60 girls aged 10 to 17 years] per cluster; N=3840 total participants) will be recruited from the Ehlanzeni and Nkangala districts of Mpumalanga. Clusters will be randomly assigned to receive 1 of the 4 engagement packages that include ParentText alone or combined with in-person sessions and a facilitated WhatsApp support group. Quantitative data collected will include pretest-posttest parent- and adolescent-reported surveys, facilitator-reported implementation data, and digitally tracked engagement data. Qualitative data will be collected from parents and facilitators through in-person or over-the-phone individual semistructured interviews and used to expand the interpretation and understanding of the quantitative findings. Results: Recruitment and data collection started in August 2023 and were finalized in November 2023. 
The total number of participants enrolled in the study is 1009, with 744 caregivers having completed onboarding to the chatbot-led intervention. Female participants represent 92.96% (938/1009) of the sample population, whereas male participants represent 7.03% (71/1009). The average participant age is 43 (SD 9) years. Conclusions: The ParentText Optimisation Trial is the first study to rigorously test engagement with a chatbot-led parenting intervention in a low- or middle-income country. The results of this study will inform the final selection of external delivery components to support engagement with ParentText in preparation for further evaluation in a randomized controlled trial in 2024. Trial Registration: Open Science Framework (OSF); https://doi.org/10.17605/OSF.IO/WFXNE International Registered Report Identifier (IRRID): DERR1-10.2196/52145 %M 38700935 %R 10.2196/52145 %U https://www.researchprotocols.org/2024/1/e52145 %U https://doi.org/10.2196/52145 %U http://www.ncbi.nlm.nih.gov/pubmed/38700935 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52499 %T Using Large Language Models to Support Content Analysis: A Case Study of ChatGPT for Adverse Event Detection %A Leas,Eric C %A Ayers,John W %A Desai,Nimit %A Dredze,Mark %A Hogarth,Michael %A Smith,Davey M %+ Herbert Wertheim School of Public Health and Human Longevity Science, University of California San Diego, 9500 Gilman Drive, Mail Code: 0725, La Jolla, CA, 92093, United States, 1 951 346 9131, ecleas@ucsd.edu %K adverse events %K artificial intelligence %K AI %K text analysis %K annotation %K ChatGPT %K LLM %K large language model %K cannabis %K delta-8-THC %K delta-8-tetrahydrocannabiol %D 2024 %7 2.5.2024 %9 Research Letter %J J Med Internet Res %G English %X This study explores the potential of using large language models to assist content analysis by conducting a case study to identify adverse events (AEs) in social media posts. The case study compares ChatGPT’s performance with human annotators’ in detecting AEs associated with delta-8-tetrahydrocannabinol, a cannabis-derived product. Using the identical instructions given to human annotators, ChatGPT closely approximated human results, with a high degree of agreement noted: 94.4% (9436/10,000) for any AE detection (Fleiss κ=0.95) and 99.3% (9931/10,000) for serious AEs (κ=0.96). These findings suggest that ChatGPT has the potential to replicate human annotation accurately and efficiently. The study recognizes possible limitations, including concerns about the generalizability due to ChatGPT’s training data, and prompts further research with different models, data sources, and content analysis tasks. The study highlights the promise of large language models for enhancing the efficiency of biomedical research. %M 38696245 %R 10.2196/52499 %U https://www.jmir.org/2024/1/e52499 %U https://doi.org/10.2196/52499 %U http://www.ncbi.nlm.nih.gov/pubmed/38696245 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54948 %T Integrating Text and Image Analysis: Exploring GPT-4V’s Capabilities in Advanced Radiological Applications Across Subspecialties %A Busch,Felix %A Han,Tianyu %A Makowski,Marcus R %A Truhn,Daniel %A Bressem,Keno K %A Adams,Lisa %+ Department of Neuroradiology, Charité – Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Charitépl. 
1, Berlin, 10117, Germany, 49 3045050, felix.busch@charite.de %K GPT-4 %K ChatGPT %K Generative Pre-Trained Transformer %K multimodal large language models %K artificial intelligence %K AI applications in medicine %K diagnostic radiology %K clinical decision support systems %K generative AI %K medical image analysis %D 2024 %7 1.5.2024 %9 Research Letter %J J Med Internet Res %G English %X This study demonstrates that GPT-4V outperforms GPT-4 across radiology subspecialties in analyzing 207 cases with 1312 images from the Radiological Society of North America Case Collection. %M 38691404 %R 10.2196/54948 %U https://www.jmir.org/2024/1/e54948 %U https://doi.org/10.2196/54948 %U http://www.ncbi.nlm.nih.gov/pubmed/38691404 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54706 %T Physician Versus Large Language Model Chatbot Responses to Web-Based Questions From Autistic Patients in Chinese: Cross-Sectional Comparative Analysis %A He,Wenjie %A Zhang,Wenyan %A Jin,Ya %A Zhou,Qiang %A Zhang,Huadan %A Xia,Qing %+ Tianjin University of Traditional Chinese Medicine, 10 Poyang Lake Road, Tuanpo New Town West, Jinghai District, Tianjin, 301617, China, 86 13820689541, xiaqingcho@163.com %K artificial intelligence %K chatbot %K ChatGPT %K ERNIE Bot %K autism %D 2024 %7 30.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: There is a dearth of feasibility assessments regarding using large language models (LLMs) for responding to inquiries from autistic patients within a Chinese-language context. Despite Chinese being one of the most widely spoken languages globally, the predominant research focus on applying these models in the medical field has been on English-speaking populations. Objective: This study aims to assess the effectiveness of LLM chatbots, specifically ChatGPT-4 (OpenAI) and ERNIE Bot (version 2.2.3; Baidu, Inc), one of the most advanced LLMs in China, in addressing inquiries from autistic individuals in a Chinese setting. Methods: For this study, we gathered data from DXY—a widely acknowledged, web-based, medical consultation platform in China with a user base of over 100 million individuals. A total of 100 patient consultation samples were rigorously selected from January 2018 to August 2023, amounting to 239 questions extracted from publicly available autism-related documents on the platform. To maintain objectivity, both the original questions and responses were anonymized and randomized. An evaluation team of 3 chief physicians assessed the responses across 4 dimensions: relevance, accuracy, usefulness, and empathy. The team completed 717 evaluations. The team initially identified the best response and then used a Likert scale with 5 response categories to gauge the responses, each representing a distinct level of quality. Finally, we compared the responses collected from different sources. Results: Among the 717 evaluations conducted, 46.86% (95% CI 43.21%-50.51%) of assessors displayed varying preferences for responses from physicians, with 34.87% (95% CI 31.38%-38.36%) of assessors favoring ChatGPT and 18.27% (95% CI 15.44%-21.10%) of assessors favoring ERNIE Bot. The average relevance scores for physicians, ChatGPT, and ERNIE Bot were 3.75 (95% CI 3.69-3.82), 3.69 (95% CI 3.63-3.74), and 3.41 (95% CI 3.35-3.46), respectively. Physicians (3.66, 95% CI 3.60-3.73) and ChatGPT (3.73, 95% CI 3.69-3.77) demonstrated higher accuracy ratings compared to ERNIE Bot (3.52, 95% CI 3.47-3.57). 
In terms of usefulness scores, physicians (3.54, 95% CI 3.47-3.62) received higher ratings than ChatGPT (3.40, 95% CI 3.34-3.47) and ERNIE Bot (3.05, 95% CI 2.99-3.12). Finally, concerning the empathy dimension, ChatGPT (3.64, 95% CI 3.57-3.71) outperformed physicians (3.13, 95% CI 3.04-3.21) and ERNIE Bot (3.11, 95% CI 3.04-3.18). Conclusions: In this cross-sectional study, physicians’ responses exhibited superiority in the present Chinese-language context. Nonetheless, LLMs can provide valuable medical guidance to autistic patients and may even surpass physicians in demonstrating empathy. However, it is crucial to acknowledge that further optimization and research are imperative prerequisites before the effective integration of LLMs in clinical settings across diverse linguistic environments can be realized. Trial Registration: Chinese Clinical Trial Registry ChiCTR2300074655; https://www.chictr.org.cn/bin/project/edit?pid=199432 %M 38687566 %R 10.2196/54706 %U https://www.jmir.org/2024/1/e54706 %U https://doi.org/10.2196/54706 %U http://www.ncbi.nlm.nih.gov/pubmed/38687566 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e55048 %T Exploring the Performance of ChatGPT Versions 3.5, 4, and 4 With Vision in the Chilean Medical Licensing Examination: Observational Study %A Rojas,Marcos %A Rojas,Marcelo %A Burgess,Valentina %A Toro-Pérez,Javier %A Salehi,Shima %K artificial intelligence %K AI %K generative artificial intelligence %K medical education %K ChatGPT %K EUNACOM %K medical licensure %K medical license %K medical licensing exam %D 2024 %7 29.4.2024 %9 %J JMIR Med Educ %G English %X Background: The deployment of OpenAI’s ChatGPT-3.5 and its subsequent versions, ChatGPT-4 and ChatGPT-4 With Vision (4V; also known as “GPT-4 Turbo With Vision”), has notably influenced the medical field. Having demonstrated remarkable performance in medical examinations globally, these models show potential for educational applications. However, their effectiveness in non-English contexts, particularly in Chile’s medical licensing examinations—a critical step for medical practitioners in Chile—is less explored. This gap highlights the need to evaluate ChatGPT’s adaptability to diverse linguistic and cultural contexts. Objective: This study aims to evaluate the performance of ChatGPT versions 3.5, 4, and 4V in the EUNACOM (Examen Único Nacional de Conocimientos de Medicina), a major medical examination in Chile. Methods: Three official practice drills (540 questions) from the University of Chile, mirroring the EUNACOM’s structure and difficulty, were used to test ChatGPT versions 3.5, 4, and 4V. The 3 ChatGPT versions were provided 3 attempts for each drill. Responses to questions during each attempt were systematically categorized and analyzed to assess their accuracy rate. Results: All versions of ChatGPT passed the EUNACOM drills. Specifically, versions 4 and 4V outperformed version 3.5, achieving average accuracy rates of 79.32% and 78.83%, respectively, compared to 57.53% for version 3.5 (P<.001). Version 4V, however, did not outperform version 4 (P=.73), despite the additional visual capabilities. We also evaluated ChatGPT’s performance in different medical areas of the EUNACOM and found that versions 4 and 4V consistently outperformed version 3.5. Across the different medical areas, version 3.5 displayed the highest accuracy in psychiatry (69.84%), while versions 4 and 4V achieved the highest accuracy in surgery (90.00% and 86.11%, respectively). 
Versions 3.5 and 4 had the lowest performance in internal medicine (52.74% and 75.62%, respectively), while version 4V had the lowest performance in public health (74.07%). Conclusions: This study reveals ChatGPT’s ability to pass the EUNACOM, with distinct proficiencies across versions 3.5, 4, and 4V. Notably, advancements in artificial intelligence (AI) have not significantly led to enhancements in performance on image-based questions. The variations in proficiency across medical fields suggest the need for more nuanced AI training. Additionally, the study underscores the importance of exploring innovative approaches to using AI to augment human cognition and enhance the learning process. Such advancements have the potential to significantly influence medical education, fostering not only knowledge acquisition but also the development of critical thinking and problem-solving skills among health care professionals. %R 10.2196/55048 %U https://mededu.jmir.org/2024/1/e55048 %U https://doi.org/10.2196/55048 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e54581 %T Usability Comparison Among Healthy Participants of an Anthropomorphic Digital Human and a Text-Based Chatbot as a Responder to Questions on Mental Health: Randomized Controlled Trial %A Thunström,Almira Osmanovic %A Carlsen,Hanne Krage %A Ali,Lilas %A Larson,Tomas %A Hellström,Andreas %A Steingrimsson,Steinn %+ Region Västra Götaland, Psychiatric Department, Sahlgrenska University Hospital, Journalvägen 5, Gothenburg, 41650, Sweden, 46 313421000, steinn.steingrimsson@gu.se %K chatbot %K chatbots %K chat-bot %K chat-bots %K text-only chatbot, voice-only chatbot %K mental health %K mental illness %K mental disease %K mental diseases %K mental illnesses %K mental health service %K mental health services %K interface %K system usability %K usability %K digital health %K machine learning %K ML %K artificial intelligence %K AI %K algorithm %K algorithms %K NLP %K natural language processing %D 2024 %7 29.4.2024 %9 Original Paper %J JMIR Hum Factors %G English %X Background: The use of chatbots in mental health support has increased exponentially in recent years, with studies showing that they may be effective in treating mental health problems. More recently, the use of visual avatars called digital humans has been introduced. Digital humans have the capability to use facial expressions as another dimension in human-computer interactions. It is important to study the difference in emotional response and usability preferences between text-based chatbots and digital humans for interacting with mental health services. Objective: This study aims to explore to what extent a digital human interface and a text-only chatbot interface differed in usability when tested by healthy participants, using BETSY (Behavior, Emotion, Therapy System, and You) which uses 2 distinct interfaces: a digital human with anthropomorphic features and a text-only user interface. We also set out to explore how chatbot-generated conversations on mental health (specific to each interface) affected self-reported feelings and biometrics. Methods: We explored to what extent a digital human with anthropomorphic features differed from a traditional text-only chatbot regarding perception of usability through the System Usability Scale, emotional reactions through electroencephalography, and feelings of closeness. 
Healthy participants (n=45) were randomized to 2 groups that used a digital human with anthropomorphic features (n=25) or a text-only chatbot with no such features (n=20). The groups were compared by linear regression analysis and t tests. Results: No differences were observed between the text-only and digital human groups regarding demographic features. The mean System Usability Scale score was 75.34 (SD 10.01; range 57-90) for the text-only chatbot versus 64.80 (SD 14.14; range 40-90) for the digital human interface. Both groups scored their respective chatbot interfaces as average or above average in usability. Women were more likely to report feeling annoyed by BETSY. Conclusions: The text-only chatbot was perceived as significantly more user-friendly than the digital human, although there were no significant differences in electroencephalography measurements. Male participants exhibited lower levels of annoyance with both interfaces, contrary to previously reported findings. %M 38683664 %R 10.2196/54581 %U https://humanfactors.jmir.org/2024/1/e54581 %U https://doi.org/10.2196/54581 %U http://www.ncbi.nlm.nih.gov/pubmed/38683664 %0 Journal Article %@ 2369-3762 %I %V 10 %N %P e55595 %T Exploring the Performance of ChatGPT-4 in the Taiwan Audiologist Qualification Examination: Preliminary Observational Study Highlighting the Potential of AI Chatbots in Hearing Care %A Wang,Shangqiguo %A Mo,Changgeng %A Chen,Yuan %A Dai,Xiaolu %A Wang,Huiyi %A Shen,Xiaoli %K ChatGPT %K medical education %K artificial intelligence %K AI %K audiology %K hearing care %K natural language processing %K large language model %K Taiwan %K hearing %K hearing specialist %K audiologist %K examination %K information accuracy %K educational technology %K healthcare services %K chatbot %K health care services %D 2024 %7 26.4.2024 %9 %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) chatbots, such as ChatGPT-4, have shown immense potential for application across various aspects of medicine, including medical education, clinical practice, and research. Objective: This study aimed to evaluate the performance of ChatGPT-4 in the 2023 Taiwan Audiologist Qualification Examination, thereby preliminarily exploring the potential utility of AI chatbots in the fields of audiology and hearing care services. Methods: ChatGPT-4 was tasked to provide answers and reasoning for the 2023 Taiwan Audiologist Qualification Examination. The examination encompassed six subjects: (1) basic auditory science, (2) behavioral audiology, (3) electrophysiological audiology, (4) principles and practice of hearing devices, (5) health and rehabilitation of the auditory and balance systems, and (6) auditory and speech communication disorders (including professional ethics). Each subject included 50 multiple-choice questions, with the exception of behavioral audiology, which had 49 questions, amounting to a total of 299 questions. Results: The correct answer rates across the 6 subjects were as follows: 88% for basic auditory science, 63% for behavioral audiology, 58% for electrophysiological audiology, 72% for principles and practice of hearing devices, 80% for health and rehabilitation of the auditory and balance systems, and 86% for auditory and speech communication disorders (including professional ethics). The overall accuracy rate for the 299 questions was 75%, which surpasses the examination’s passing criteria of an average 60% accuracy rate across all subjects. 
A comprehensive review of ChatGPT-4’s responses indicated that incorrect answers were predominantly due to information errors. Conclusions: ChatGPT-4 demonstrated a robust performance in the Taiwan Audiologist Qualification Examination, showcasing effective logical reasoning skills. Our results suggest that with enhanced information accuracy, ChatGPT-4’s performance could be further improved. This study indicates significant potential for the application of AI chatbots in audiology and hearing care services. %R 10.2196/55595 %U https://mededu.jmir.org/2024/1/e55595 %U https://doi.org/10.2196/55595 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55847 %T Leveraging Large Language Models for Improved Patient Access and Self-Management: Assessor-Blinded Comparison Between Expert- and AI-Generated Content %A Lv,Xiaolei %A Zhang,Xiaomeng %A Li,Yuan %A Ding,Xinxin %A Lai,Hongchang %A Shi,Junyu %+ Department of Oral and Maxillofacial Implantology, Shanghai PerioImplant Innovation Center, Shanghai Ninth People's Hospital, Shanghai Jiao Tong University School of Medicine, Quxi Road No 500, Shanghai, 200011, China, 86 21 23271699 ext 5298, sakyamuni_jin@163.com %K large language model %K artificial intelligence %K public oral health %K health care access %K patient education %D 2024 %7 25.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: While large language models (LLMs) such as ChatGPT and Google Bard have shown significant promise in various fields, their broader impact on enhancing patient health care access and quality, particularly in specialized domains such as oral health, requires comprehensive evaluation. Objective: This study aims to assess the effectiveness of Google Bard, ChatGPT-3.5, and ChatGPT-4 in offering recommendations for common oral health issues, benchmarked against responses from human dental experts. Methods: This comparative analysis used 40 questions derived from patient surveys on prevalent oral diseases, which were executed in a simulated clinical environment. Responses, obtained from both human experts and LLMs, were subject to a blinded evaluation process by experienced dentists and lay users, focusing on readability, appropriateness, harmlessness, comprehensiveness, intent capture, and helpfulness. Additionally, the stability of artificial intelligence responses was also assessed by submitting each question 3 times under consistent conditions. Results: Google Bard excelled in readability but lagged in appropriateness when compared to human experts (mean 8.51, SD 0.37 vs mean 9.60, SD 0.33; P=.03). ChatGPT-3.5 and ChatGPT-4, however, performed comparably with human experts in terms of appropriateness (mean 8.96, SD 0.35 and mean 9.34, SD 0.47, respectively), with ChatGPT-4 demonstrating the highest stability and reliability. Furthermore, all 3 LLMs received superior harmlessness scores comparable to human experts, with lay users finding minimal differences in helpfulness and intent capture between the artificial intelligence models and human responses. Conclusions: LLMs, particularly ChatGPT-4, show potential in oral health care, providing patient-centric information for enhancing patient education and clinical care. The observed performance variations underscore the need for ongoing refinement and ethical considerations in health care settings. Future research focuses on developing strategies for the safe integration of LLMs in health care settings. 
%M 38663010 %R 10.2196/55847 %U https://www.jmir.org/2024/1/e55847 %U https://doi.org/10.2196/55847 %U http://www.ncbi.nlm.nih.gov/pubmed/38663010 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56764 %T Large Language Models and User Trust: Consequence of Self-Referential Learning Loop and the Deskilling of Health Care Professionals %A Choudhury,Avishek %A Chaudhry,Zaira %+ Industrial and Management Systems Engineering, West Virginia University, 321 Engineering Sciences Bdlg, 1306 Evansdale Drive, Morgantown, WV, 26506, United States, 1 5156080777, avishek.choudhury@mail.wvu.edu %K trust %K ChatGPT %K human factors %K healthcare %K LLMs %K large language models %K LLM user trust %K AI accountability %K artificial intelligence %K AI technology %K technologies %K effectiveness %K policy %K medical student %K medical students %K risk factor %K quality of care %K healthcare professional %K healthcare professionals %K human element %D 2024 %7 25.4.2024 %9 Viewpoint %J J Med Internet Res %G English %X As the health care industry increasingly embraces large language models (LLMs), understanding the consequence of this integration becomes crucial for maximizing benefits while mitigating potential pitfalls. This paper explores the evolving relationship among clinician trust in LLMs, the transition of data sources from predominantly human-generated to artificial intelligence (AI)–generated content, and the subsequent impact on the performance of LLMs and clinician competence. One of the primary concerns identified in this paper is the LLMs’ self-referential learning loops, where AI-generated content feeds into the learning algorithms, threatening the diversity of the data pool, potentially entrenching biases, and reducing the efficacy of LLMs. While theoretical at this stage, this feedback loop poses a significant challenge as the integration of LLMs in health care deepens, emphasizing the need for proactive dialogue and strategic measures to ensure the safe and effective use of LLM technology. Another key takeaway from our investigation is the role of user expertise and the necessity for a discerning approach to trusting and validating LLM outputs. The paper highlights how expert users, particularly clinicians, can leverage LLMs to enhance productivity by off-loading routine tasks while maintaining a critical oversight to identify and correct potential inaccuracies in AI-generated content. This balance of trust and skepticism is vital for ensuring that LLMs augment rather than undermine the quality of patient care. We also discuss the risks associated with the deskilling of health care professionals. Frequent reliance on LLMs for critical tasks could result in a decline in health care providers’ diagnostic and thinking skills, particularly affecting the training and development of future professionals. The legal and ethical considerations surrounding the deployment of LLMs in health care are also examined. We discuss the medicolegal challenges, including liability in cases of erroneous diagnoses or treatment advice generated by LLMs. The paper references recent legislative efforts, such as The Algorithmic Accountability Act of 2023, as crucial steps toward establishing a framework for the ethical and responsible use of AI-based technologies in health care. In conclusion, this paper advocates for a strategic approach to integrating LLMs into health care. 
By emphasizing the importance of maintaining clinician expertise, fostering critical engagement with LLM outputs, and navigating the legal and ethical landscape, we can ensure that LLMs serve as valuable tools in enhancing patient care and supporting health care professionals. This approach addresses the immediate challenges posed by integrating LLMs and sets a foundation for their maintainable and responsible use in the future. %M 38662419 %R 10.2196/56764 %U https://www.jmir.org/2024/1/e56764 %U https://doi.org/10.2196/56764 %U http://www.ncbi.nlm.nih.gov/pubmed/38662419 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54419 %T Using ChatGPT-4 to Create Structured Medical Notes From Audio Recordings of Physician-Patient Encounters: Comparative Study %A Kernberg,Annessa %A Gold,Jeffrey A %A Mohan,Vishnu %+ Department of Medical Informatics and Clinical Epidemiology, Oregon Health and Sciences University, 3181 SW Sam Jackson Park Road, Portland, OR, 97239, United States, 1 5034944469, mohanV@ohsu.edu %K generative AI %K generative artificial intelligence %K ChatGPT %K simulation %K large language model %K clinical documentation %K quality %K accuracy %K reproducibility %K publicly available %K medical note %K medical notes %K generation %K medical documentation %K documentation %K documentations %K AI %K artificial intelligence %K transcript %K transcripts %K ChatGPT-4 %D 2024 %7 22.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Medical documentation plays a crucial role in clinical practice, facilitating accurate patient management and communication among health care professionals. However, inaccuracies in medical notes can lead to miscommunication and diagnostic errors. Additionally, the demands of documentation contribute to physician burnout. Although intermediaries like medical scribes and speech recognition software have been used to ease this burden, they have limitations in terms of accuracy and addressing provider-specific metrics. The integration of ambient artificial intelligence (AI)–powered solutions offers a promising way to improve documentation while fitting seamlessly into existing workflows. Objective: This study aims to assess the accuracy and quality of Subjective, Objective, Assessment, and Plan (SOAP) notes generated by ChatGPT-4, an AI model, using established transcripts of History and Physical Examination as the gold standard. We seek to identify potential errors and evaluate the model’s performance across different categories. Methods: We conducted simulated patient-provider encounters representing various ambulatory specialties and transcribed the audio files. Key reportable elements were identified, and ChatGPT-4 was used to generate SOAP notes based on these transcripts. Three versions of each note were created and compared to the gold standard via chart review; errors generated from the comparison were categorized as omissions, incorrect information, or additions. We compared the accuracy of data elements across versions, transcript length, and data categories. Additionally, we assessed note quality using the Physician Documentation Quality Instrument (PDQI) scoring system. Results: Although ChatGPT-4 consistently generated SOAP-style notes, there were, on average, 23.6 errors per clinical case, with errors of omission (86%) being the most common, followed by addition errors (10.5%) and inclusion of incorrect facts (3.2%). 
There was significant variance between replicates of the same case, with only 52.9% of data elements reported correctly across all 3 replicates. The accuracy of data elements varied across cases, with the highest accuracy observed in the “Objective” section. Consequently, the measure of note quality, assessed by PDQI, demonstrated intra- and intercase variance. Finally, the accuracy of ChatGPT-4 was inversely correlated to both the transcript length (P=.05) and the number of scorable data elements (P=.05). Conclusions: Our study reveals substantial variability in errors, accuracy, and note quality generated by ChatGPT-4. Errors were not limited to specific sections, and the inconsistency in error types across replicates complicated predictability. Transcript length and data complexity were inversely correlated with note accuracy, raising concerns about the model’s effectiveness in handling complex medical cases. The quality and reliability of clinical notes produced by ChatGPT-4 do not meet the standards required for clinical use. Although AI holds promise in health care, caution should be exercised before widespread adoption. Further research is needed to address accuracy, variability, and potential errors. ChatGPT-4, while valuable in various applications, should not be considered a safe alternative to human-generated clinical documentation at this time. %M 38648636 %R 10.2196/54419 %U https://www.jmir.org/2024/1/e54419 %U https://doi.org/10.2196/54419 %U http://www.ncbi.nlm.nih.gov/pubmed/38648636 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55037 %T ChatGPT’s Performance in Cardiac Arrest and Bradycardia Simulations Using the American Heart Association's Advanced Cardiovascular Life Support Guidelines: Exploratory Study %A Pham,Cecilia %A Govender,Romi %A Tehami,Salik %A Chavez,Summer %A Adepoju,Omolola E %A Liaw,Winston %+ Tilman J Fertitta Family College of Medicine, University of Houston, 5055 Medical Circle, Houston, TX, 77204, United States, 1 713 743 7047, cmpham4@uh.edu %K ChatGPT %K artificial intelligence %K AI %K large language model %K LLM %K cardiac arrest %K bradycardia %K simulation %K advanced cardiovascular life support %K ACLS %K bradycardia simulations %K America %K American %K heart association %K cardiac %K life support %K exploratory study %K heart %K heart attack %K clinical decision support %K diagnostics %K algorithms %D 2024 %7 22.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT is the most advanced large language model to date, with prior iterations having passed medical licensing examinations, providing clinical decision support, and improved diagnostics. Although limited, past studies of ChatGPT’s performance found that artificial intelligence could pass the American Heart Association’s advanced cardiovascular life support (ACLS) examinations with modifications. ChatGPT’s accuracy has not been studied in more complex clinical scenarios. As heart disease and cardiac arrest remain leading causes of morbidity and mortality in the United States, finding technologies that help increase adherence to ACLS algorithms, which improves survival outcomes, is critical. Objective: This study aims to examine the accuracy of ChatGPT in following ACLS guidelines for bradycardia and cardiac arrest. 
Methods: We evaluated the accuracy of ChatGPT’s responses to 2 simulations based on the 2020 American Heart Association ACLS guidelines with 3 primary outcomes of interest: the mean individual step accuracy, the accuracy score per simulation attempt, and the accuracy score for each algorithm. For each simulation step, ChatGPT was scored for correctness (1 point) or incorrectness (0 points). Each simulation was conducted 20 times. Results: ChatGPT’s median accuracy for each step was 85% (IQR 40%-100%) for cardiac arrest and 30% (IQR 13%-81%) for bradycardia. ChatGPT’s median accuracy over 20 simulation attempts for cardiac arrest was 69% (IQR 67%-74%) and for bradycardia was 42% (IQR 33%-50%). We found that ChatGPT’s outputs varied despite consistent input, the same actions were persistently missed, repetitive overemphasis hindered guidance, and erroneous medication information was presented. Conclusions: This study highlights the need for consistent and reliable guidance to prevent potential medical errors and optimize the application of ChatGPT to enhance its reliability and effectiveness in clinical practice. %M 38648098 %R 10.2196/55037 %U https://www.jmir.org/2024/1/e55037 %U https://doi.org/10.2196/55037 %U http://www.ncbi.nlm.nih.gov/pubmed/38648098 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e55388 %T Evaluation of Prompts to Simplify Cardiovascular Disease Information Generated Using a Large Language Model: Cross-Sectional Study %A Mishra,Vishala %A Sarraju,Ashish %A Kalwani,Neil M %A Dexter,Joseph P %+ Data Science Initiative, Harvard University, Science and Engineering Complex 1.312-10, 150 Western Avenue, Allston, MA, 02134, United States, 1 8023381330, jdexter@fas.harvard.edu %K artificial intelligence %K ChatGPT %K GPT %K digital health %K large language model %K NLP %K language model %K language models %K prompt engineering %K health communication %K generative %K health literacy %K natural language processing %K patient-physician communication %K health communication %K prevention %K cardiology %K cardiovascular %K heart %K education %K educational %K human-in-the-loop %K machine learning %D 2024 %7 22.4.2024 %9 Research Letter %J J Med Internet Res %G English %X In this cross-sectional study, we evaluated the completeness, readability, and syntactic complexity of cardiovascular disease prevention information produced by GPT-4 in response to 4 kinds of prompts. 
%M 38648104 %R 10.2196/55388 %U https://www.jmir.org/2024/1/e55388 %U https://doi.org/10.2196/55388 %U http://www.ncbi.nlm.nih.gov/pubmed/38648104 %0 Journal Article %@ 2561-1011 %I JMIR Publications %V 8 %N %P e53421 %T A Multidisciplinary Assessment of ChatGPT’s Knowledge of Amyloidosis: Observational Study %A King,Ryan C %A Samaan,Jamil S %A Yeo,Yee Hui %A Peng,Yuxin %A Kunkel,David C %A Habib,Ali A %A Ghashghaei,Roxana %+ Division of Cardiology, Department of Medicine, University of California, Irvine Medical Center, 101 The City Drive South, Orange, CA, 92868, United States, 1 714 456 7890, kingrc@hs.uci.edu %K amyloidosis %K ChatGPT %K large language models %K cardiology %K gastroenterology %K neurology %K artificial intelligence %K multidisciplinary care %K assessment %K patient education %K large language model %K accuracy %K reliability %K accessibility %K educational resources %K dissemination %K gastroenterologist %K cardiologist %K medical society %K institution %K institutions %K Facebook %K neurologist %K reproducibility %K amyloidosis-related %D 2024 %7 19.4.2024 %9 Original Paper %J JMIR Cardio %G English %X Background: Amyloidosis, a rare multisystem condition, often requires complex, multidisciplinary care. Its low prevalence underscores the importance of efforts to ensure the availability of high-quality patient education materials for better outcomes. ChatGPT (OpenAI) is a large language model powered by artificial intelligence that offers a potential avenue for disseminating accurate, reliable, and accessible educational resources for both patients and providers. Its user-friendly interface, engaging conversational responses, and the capability for users to ask follow-up questions make it a promising future tool in delivering accurate and tailored information to patients. Objective: We performed a multidisciplinary assessment of the accuracy, reproducibility, and readability of ChatGPT in answering questions related to amyloidosis. Methods: In total, 98 amyloidosis questions related to cardiology, gastroenterology, and neurology were curated from medical societies, institutions, and amyloidosis Facebook support groups and inputted into ChatGPT-3.5 and ChatGPT-4. Cardiology- and gastroenterology-related responses were independently graded by a board-certified cardiologist and gastroenterologist, respectively, who specialize in amyloidosis. These 2 reviewers (RG and DCK) also graded general questions, for which disagreements were resolved with discussion. Neurology-related responses were graded by a board-certified neurologist (AAH) who specializes in amyloidosis. Reviewers used the following grading scale: (1) comprehensive, (2) correct but inadequate, (3) some correct and some incorrect, and (4) completely incorrect. Questions were stratified by categories for further analysis. Reproducibility was assessed by inputting each question twice into each model. The readability of ChatGPT-4 responses was also evaluated using the Textstat library in Python (Python Software Foundation) and the Textstat readability package in R software (R Foundation for Statistical Computing). Results: ChatGPT-4 (n=98) provided 93 (95%) responses with accurate information, and 82 (84%) were comprehensive. ChatGPT-3.5 (n=83) provided 74 (89%) responses with accurate information, and 66 (79%) were comprehensive. When examined by question category, ChatGPT-4 and ChatGPT-3.5 provided 53 (95%) and 48 (86%) comprehensive responses, respectively, to “general questions” (n=56).
When examined by subject, ChatGPT-4 and ChatGPT-3.5 performed best in response to cardiology questions (n=12), with both models producing 10 (83%) comprehensive responses. For gastroenterology (n=15), ChatGPT-4 received comprehensive grades for 9 (60%) responses, and ChatGPT-3.5 for 8 (53%). Overall, 96 of 98 (98%) responses for ChatGPT-4 and 73 of 83 (88%) for ChatGPT-3.5 were reproducible. The readability of ChatGPT-4’s responses ranged from 10th to beyond graduate US grade levels with an average of 15.5 (SD 1.9). Conclusions: Large language models are a promising tool for accurate and reliable health information for patients living with amyloidosis. However, ChatGPT’s responses exceeded the American Medical Association’s recommended fifth- to sixth-grade reading level. Future studies focusing on improving response accuracy and readability are warranted. Prior to widespread implementation, the technology’s limitations and ethical implications must be further explored to ensure patient safety and equitable implementation. %M 38640472 %R 10.2196/53421 %U https://cardio.jmir.org/2024/1/e53421 %U https://doi.org/10.2196/53421 %U http://www.ncbi.nlm.nih.gov/pubmed/38640472 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56655 %T Quality of Answers of Generative Large Language Models Versus Peer Users for Interpreting Laboratory Test Results for Lay Patients: Evaluation Study %A He,Zhe %A Bhasuran,Balu %A Jin,Qiao %A Tian,Shubo %A Hanna,Karim %A Shavor,Cindy %A Arguello,Lisbeth Garcia %A Murray,Patrick %A Lu,Zhiyong %+ School of Information, Florida State University, 142 Collegiate Loop, Tallahassee, FL, 32306, United States, 1 8506445775, zhe@fsu.edu %K large language models %K generative artificial intelligence %K generative AI %K ChatGPT %K laboratory test results %K patient education %K natural language processing %D 2024 %7 17.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Although patients have easy access to their electronic health records and laboratory test result data through patient portals, laboratory test results are often confusing and hard to understand. Many patients turn to web-based forums or question-and-answer (Q&A) sites to seek advice from their peers. The quality of answers from social Q&A sites on health-related questions varies significantly, and not all responses are accurate or reliable. Large language models (LLMs) such as ChatGPT have opened a promising avenue for patients to have their questions answered. Objective: We aimed to assess the feasibility of using LLMs to generate relevant, accurate, helpful, and unharmful responses to laboratory test–related questions asked by patients and identify potential issues that can be mitigated using augmentation approaches. Methods: We collected laboratory test result–related Q&A data from Yahoo! Answers and selected 53 Q&A pairs for this study. Using the LangChain framework and ChatGPT web portal, we generated responses to the 53 questions from 5 LLMs: GPT-4, GPT-3.5, LLaMA 2, MedAlpaca, and ORCA_mini. We assessed the similarity of their answers using standard Q&A similarity-based evaluation metrics, including Recall-Oriented Understudy for Gisting Evaluation, Bilingual Evaluation Understudy, Metric for Evaluation of Translation With Explicit Ordering, and Bidirectional Encoder Representations from Transformers Score.
We used an LLM-based evaluator to judge whether a target model had higher quality in terms of relevance, correctness, helpfulness, and safety than the baseline model. We performed a manual evaluation with medical experts for all the responses to 7 selected questions on the same 4 aspects. Results: Regarding the similarity of the responses from the 4 LLMs, with the GPT-4 output used as the reference answer, the responses from GPT-3.5 were the most similar, followed by those from LLaMA 2, ORCA_mini, and MedAlpaca. Human answers from Yahoo data were scored the lowest and, thus, as the least similar to GPT-4–generated answers. The results of the win rate and medical expert evaluation both showed that GPT-4’s responses achieved better scores than all the other LLM responses and human responses on all 4 aspects (relevance, correctness, helpfulness, and safety). LLM responses occasionally also suffered from lack of interpretation in one’s medical context, incorrect statements, and lack of references. Conclusions: By evaluating LLMs in generating responses to patients’ laboratory test result–related questions, we found that, compared to the other 4 LLMs and human answers from a Q&A website, GPT-4’s responses were more accurate, helpful, relevant, and safer. There were cases in which GPT-4 responses were inaccurate and not individualized. We identified a number of ways to improve the quality of LLM responses, including prompt engineering, prompt augmentation, retrieval-augmented generation, and response evaluation. %M 38630520 %R 10.2196/56655 %U https://www.jmir.org/2024/1/e56655 %U https://doi.org/10.2196/56655 %U http://www.ncbi.nlm.nih.gov/pubmed/38630520 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e57778 %T Authors’ Reply: “Evaluating GPT-4’s Cognitive Functions Through the Bloom Taxonomy: Insights and Clarifications” %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Holderried,Friederike %A Herschbach,Lea %A Griewatz,Jan %A Masters,Ken %A Zipfel,Stephan %A Mahling,Moritz %+ Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Elfriede-Aulhorn-Strasse 10, Tübingen, 72076, Germany, 49 7071 29 73715, teresa.festl-wietek@med.uni-tuebingen.de %K answer %K artificial intelligence %K assessment %K Bloom’s taxonomy %K ChatGPT %K classification %K error %K exam %K examination %K generative %K GPT-4 %K Generative Pre-trained Transformer 4 %K language model %K learning outcome %K LLM %K MCQ %K medical education %K medical exam %K multiple-choice question %K natural language processing %K NLP %K psychosomatic %K question %K response %K taxonomy %D 2024 %7 16.4.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 38625723 %R 10.2196/57778 %U https://www.jmir.org/2024/1/e57778 %U https://doi.org/10.2196/57778 %U http://www.ncbi.nlm.nih.gov/pubmed/38625723 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e56997 %T Evaluating GPT-4’s Cognitive Functions Through the Bloom Taxonomy: Insights and Clarifications %A Huang,Kuan-Ju %+ Department of Obstetrics and Gynecology, National Taiwan University Hospital Yunlin Branch, No 579, Sec 2, Yunlin Rd, Douliu City, Yunlin County, 640, Taiwan, 886 55323911 ext 563413, restroomer@icloud.com %K artificial intelligence %K ChatGPT %K Bloom taxonomy %K AI %K cognition %D 2024 %7 16.4.2024 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 38625725 %R 10.2196/56997 %U https://www.jmir.org/2024/1/e56997 %U https://doi.org/10.2196/56997 %U http://www.ncbi.nlm.nih.gov/pubmed/38625725 %0
Journal Article %@ 2369-3762 %I %V 10 %N %P e57696 %T A Student’s Viewpoint on ChatGPT Use and Automation Bias in Medical Education %A Dsouza,Jeanne Maria %K AI %K artificial intelligence %K ChatGPT %K medical education %D 2024 %7 15.4.2024 %9 %J JMIR Med Educ %G English %X %R 10.2196/57696 %U https://mededu.jmir.org/2024/1/e57696 %U https://doi.org/10.2196/57696 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e45959 %T Mental Distress, Label Avoidance, and Use of a Mental Health Chatbot: Results From a US Survey %A Kosyluk,Kristin %A Baeder,Tanner %A Greene,Karah Yeona %A Tran,Jennifer T %A Bolton,Cassidy %A Loecher,Nele %A DiEva,Daniel %A Galea,Jerome T %+ Department of Mental Health Law & Policy, University of South Florida, 13301 Bruce B Downs Boulevard, MHC 2735, Tampa, FL, 33612, United States, 1 8139746019, kkosyluk@usf.edu %K chatbots %K conversational agents %K mental health %K resources %K screening %K resource referral %K stigma %K label avoidance %K survey %K training %K behavioral %K COVID-19 %K pilot test %K design %K users %K psychological distress %K symptoms %D 2024 %7 12.4.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: For almost two decades, researchers and clinicians have argued that certain aspects of mental health treatment can be removed from clinicians’ responsibilities and allocated to technology, preserving valuable clinician time and alleviating the burden on the behavioral health care system. The service delivery tasks that could arguably be allocated to technology without negatively impacting patient outcomes include screening, triage, and referral. Objective: We pilot-tested a chatbot for mental health screening and referral to understand the relationship between potential users’ demographics and chatbot use; the completion rate of mental health screening when delivered by a chatbot; and the acceptability of a prototype chatbot designed for mental health screening and referral. This chatbot not only screened participants for psychological distress but also referred them to appropriate resources that matched their level of distress and preferences. The goal of this study was to determine whether a mental health screening and referral chatbot would be feasible and acceptable to users. Methods: We conducted an internet-based survey among a sample of US-based adults. Our survey collected demographic data along with a battery of measures assessing behavioral health and symptoms, stigma (label avoidance and perceived stigma), attitudes toward treatment-seeking, readiness for change, and technology readiness and acceptance. Participants were then offered to engage with our chatbot. Those who engaged with the chatbot completed a mental health screening, received a distress score based on this screening, were referred to resources appropriate for their current level of distress, and were asked to rate the acceptability of the chatbot. Results: We found that mental health screening using a chatbot was feasible, with 168 (75.7%) of our 222 participants completing mental health screening within the chatbot sessions. Various demographic characteristics were associated with a willingness to use the chatbot. The participants who used the chatbot found it to be acceptable. Logistic regression produced a significant model with perceived usefulness and symptoms as significant positive predictors of chatbot use for the overall sample, and label avoidance as the only significant predictor of chatbot use for those currently experiencing distress. 
Conclusions: Label avoidance, the desire to avoid mental health services to avoid the stigmatized label of mental illness, is a significant negative predictor of care seeking. Therefore, our finding regarding label avoidance and chatbot use has significant public health implications in terms of facilitating access to mental health resources. Those who are high on label avoidance are not likely to seek care in a community mental health clinic, yet they are likely willing to engage with a mental health chatbot, participate in mental health screening, and receive mental health resources within the chatbot session. Chatbot technology may prove to be a way to engage those in care who have previously avoided treatment due to stigma. %M 38607665 %R 10.2196/45959 %U https://formative.jmir.org/2024/1/e45959 %U https://doi.org/10.2196/45959 %U http://www.ncbi.nlm.nih.gov/pubmed/38607665 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e52483 %T Embracing ChatGPT for Medical Education: Exploring Its Impact on Doctors and Medical Students %A Wu,Yijun %A Zheng,Yue %A Feng,Baijie %A Yang,Yuqi %A Kang,Kai %A Zhao,Ailin %+ Department of Hematology, West China Hospital, Sichuan University, 37 Guoxue Street, Chengdu, China, 86 17888841669, irenez20@outlook.com %K artificial intelligence %K AI %K ChatGPT %K medical education %K doctors %K medical students %D 2024 %7 10.4.2024 %9 Viewpoint %J JMIR Med Educ %G English %X ChatGPT (OpenAI), a cutting-edge natural language processing model, holds immense promise for revolutionizing medical education. With its remarkable performance in language-related tasks, ChatGPT offers personalized and efficient learning experiences for medical students and doctors. Through training, it enhances clinical reasoning and decision-making skills, leading to improved case analysis and diagnosis. The model facilitates simulated dialogues, intelligent tutoring, and automated question-answering, enabling the practical application of medical knowledge. However, integrating ChatGPT into medical education raises ethical and legal concerns. Safeguarding patient data and adhering to data protection regulations are critical. Transparent communication with students, physicians, and patients is essential to ensure their understanding of the technology’s purpose and implications, as well as the potential risks and benefits. Maintaining a balance between personalized learning and face-to-face interactions is crucial to avoid hindering critical thinking and communication skills. Despite challenges, ChatGPT offers transformative opportunities. Integrating it with problem-based learning, team-based learning, and case-based learning methodologies can further enhance medical education. With proper regulation and supervision, ChatGPT can contribute to a well-rounded learning environment, nurturing skilled and knowledgeable medical professionals ready to tackle health care challenges. By emphasizing ethical considerations and human-centric approaches, ChatGPT’s potential can be fully harnessed in medical education, benefiting both students and patients alike. 
%M 38598263 %R 10.2196/52483 %U https://mededu.jmir.org/2024/1/e52483 %U https://doi.org/10.2196/52483 %U http://www.ncbi.nlm.nih.gov/pubmed/38598263 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e55988 %T Assessing the Alignment of Large Language Models With Human Values for Mental Health Integration: Cross-Sectional Study Using Schwartz’s Theory of Basic Values %A Hadar-Shoval,Dorit %A Asraf,Kfir %A Mizrachi,Yonathan %A Haber,Yuval %A Elyoseph,Zohar %+ Department of Brain Sciences, Faculty of Medicine, Imperial College London, Fulham Palace Rd, London, W6 8RF, United Kingdom, 44 547836088, Zohar.j.a@gmail.com %K large language models %K LLMs %K large language model %K LLM %K machine learning %K ML %K natural language processing %K NLP %K deep learning %K ChatGPT %K Chat-GPT %K chatbot %K chatbots %K chat-bot %K chat-bots %K Claude %K values %K Bard %K artificial intelligence %K AI %K algorithm %K algorithms %K predictive model %K predictive models %K predictive analytics %K predictive system %K practical model %K practical models %K mental health %K mental illness %K mental illnesses %K mental disease %K mental diseases %K mental disorder %K mental disorders %K mobile health %K mHealth %K eHealth %K mood disorder %K mood disorders %D 2024 %7 9.4.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Large language models (LLMs) hold potential for mental health applications. However, their opaque alignment processes may embed biases that shape problematic perspectives. Evaluating the values embedded within LLMs that guide their decision-making have ethical importance. Schwartz’s theory of basic values (STBV) provides a framework for quantifying cultural value orientations and has shown utility for examining values in mental health contexts, including cultural, diagnostic, and therapist-client dynamics. Objective: This study aimed to (1) evaluate whether the STBV can measure value-like constructs within leading LLMs and (2) determine whether LLMs exhibit distinct value-like patterns from humans and each other. Methods: In total, 4 LLMs (Bard, Claude 2, Generative Pretrained Transformer [GPT]-3.5, GPT-4) were anthropomorphized and instructed to complete the Portrait Values Questionnaire—Revised (PVQ-RR) to assess value-like constructs. Their responses over 10 trials were analyzed for reliability and validity. To benchmark the LLMs’ value profiles, their results were compared to published data from a diverse sample of 53,472 individuals across 49 nations who had completed the PVQ-RR. This allowed us to assess whether the LLMs diverged from established human value patterns across cultural groups. Value profiles were also compared between models via statistical tests. Results: The PVQ-RR showed good reliability and validity for quantifying value-like infrastructure within the LLMs. However, substantial divergence emerged between the LLMs’ value profiles and population data. The models lacked consensus and exhibited distinct motivational biases, reflecting opaque alignment processes. For example, all models prioritized universalism and self-direction, while de-emphasizing achievement, power, and security relative to humans. Successful discriminant analysis differentiated the 4 LLMs’ distinct value profiles. Further examination found the biased value profiles strongly predicted the LLMs’ responses when presented with mental health dilemmas requiring choosing between opposing values. 
This provided further validation for the models embedding distinct motivational value-like constructs that shape their decision-making. Conclusions: This study leveraged the STBV to map the motivational value-like infrastructure underpinning leading LLMs. Although the study demonstrated the STBV can effectively characterize value-like infrastructure within LLMs, substantial divergence from human values raises ethical concerns about aligning these models with mental health applications. The biases toward certain cultural value sets pose risks if integrated without proper safeguards. For example, prioritizing universalism could promote unconditional acceptance even when clinically unwise. Furthermore, the differences between the LLMs underscore the need to standardize alignment processes to capture true cultural diversity. Thus, any responsible integration of LLMs into mental health care must account for their embedded biases and motivation mismatches to ensure equitable delivery across diverse populations. Achieving this will require transparency and refinement of alignment techniques to instill comprehensive human values. %M 38593424 %R 10.2196/55988 %U https://mental.jmir.org/2024/1/e55988 %U https://doi.org/10.2196/55988 %U http://www.ncbi.nlm.nih.gov/pubmed/38593424 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e55627 %T Evaluating ChatGPT-4’s Diagnostic Accuracy: Impact of Visual Data Integration %A Hirosawa,Takanobu %A Harada,Yukinori %A Tokumasu,Kazuki %A Ito,Takahiro %A Suzuki,Tomoharu %A Shimizu,Taro %+ Department of Diagnostic and Generalist Medicine, Dokkyo Medical University, 880 Kitakobayashi, Mibu-cho, Shimotsuga, 321-0293, Japan, 81 282 87 2498, hirosawa@dokkyomed.ac.jp %K artificial intelligence %K large language model %K LLM %K LLMs %K language model %K language models %K ChatGPT %K GPT %K ChatGPT-4V %K ChatGPT-4 Vision %K clinical decision support %K natural language processing %K decision support %K NLP %K diagnostic excellence %K diagnosis %K diagnoses %K diagnose %K diagnostic %K diagnostics %K image %K images %K imaging %D 2024 %7 9.4.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: In the evolving field of health care, multimodal generative artificial intelligence (AI) systems, such as ChatGPT-4 with vision (ChatGPT-4V), represent a significant advancement, as they integrate visual data with text data. This integration has the potential to revolutionize clinical diagnostics by offering more comprehensive analysis capabilities. However, the impact on diagnostic accuracy of using image data to augment ChatGPT-4 remains unclear. Objective: This study aims to assess the impact of adding image data on ChatGPT-4’s diagnostic accuracy and provide insights into how image data integration can enhance the accuracy of multimodal AI in medical diagnostics. Specifically, this study endeavored to compare the diagnostic accuracy between ChatGPT-4V, which processed both text and image data, and its counterpart, ChatGPT-4, which only uses text data. Methods: We identified a total of 557 case reports published in the American Journal of Case Reports from January 2022 to March 2023. After excluding cases that were nondiagnostic, pediatric, and lacking image data, we included 363 case descriptions with their final diagnoses and associated images. We compared the diagnostic accuracy of ChatGPT-4V and ChatGPT-4 without vision based on their ability to include the final diagnoses within differential diagnosis lists. 
Two independent physicians evaluated their accuracy, with a third resolving any discrepancies, ensuring a rigorous and objective analysis. Results: The integration of image data into ChatGPT-4V did not significantly enhance diagnostic accuracy, showing that final diagnoses were included in the top 10 differential diagnosis lists at a rate of 85.1% (n=309), comparable to the rate of 87.9% (n=319) for the text-only version (P=.33). Notably, ChatGPT-4V’s performance in correctly identifying the top diagnosis was inferior, at 44.4% (n=161), compared with 55.9% (n=203) for the text-only version (P=.002, χ2 test). Additionally, ChatGPT-4’s self-reports showed that image data accounted for 30% of the weight in developing the differential diagnosis lists in more than half of cases. Conclusions: Our findings reveal that currently, ChatGPT-4V predominantly relies on textual data, limiting its ability to fully use the diagnostic potential of visual information. This study underscores the need for further development of multimodal generative AI systems to effectively integrate and use clinical image data. Enhancing the diagnostic performance of such AI systems through improved multimodal data integration could significantly benefit patient care by providing more accurate and comprehensive diagnostic insights. Future research should focus on overcoming these limitations, paving the way for the practical application of advanced AI in medicine. %M 38592758 %R 10.2196/55627 %U https://medinform.jmir.org/2024/1/e55627 %U https://doi.org/10.2196/55627 %U http://www.ncbi.nlm.nih.gov/pubmed/38592758 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e55318 %T An Empirical Evaluation of Prompting Strategies for Large Language Models in Zero-Shot Clinical Natural Language Processing: Algorithm Development and Validation Study %A Sivarajkumar,Sonish %A Kelley,Mark %A Samolyk-Mazzanti,Alyssa %A Visweswaran,Shyam %A Wang,Yanshan %+ Department of Health Information Management, University of Pittsburgh, 6026 Forbes Tower, Pittsburgh, PA, 15260, United States, 1 4123832712, yanshan.wang@pitt.edu %K large language model %K LLM %K LLMs %K natural language processing %K NLP %K in-context learning %K prompt engineering %K evaluation %K zero-shot %K few shot %K prompting %K GPT %K language model %K language %K models %K machine learning %K clinical data %K clinical information %K extraction %K BARD %K Gemini %K LLaMA-2 %K heuristic %K prompt %K prompts %K ensemble %D 2024 %7 8.4.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Large language models (LLMs) have shown remarkable capabilities in natural language processing (NLP), especially in domains where labeled data are scarce or expensive, such as the clinical domain. However, to unlock the clinical knowledge hidden in these LLMs, we need to design effective prompts that can guide them to perform specific clinical NLP tasks without any task-specific training data. This is known as in-context learning, which is an art and science that requires understanding the strengths and weaknesses of different LLMs and prompt engineering approaches. Objective: The objective of this study is to assess the effectiveness of various prompt engineering techniques, including 2 newly introduced types—heuristic and ensemble prompts, for zero-shot and few-shot clinical information extraction using pretrained language models. 
Methods: This comprehensive experimental study evaluated different prompt types (simple prefix, simple cloze, chain of thought, anticipatory, heuristic, and ensemble) across 5 clinical NLP tasks: clinical sense disambiguation, biomedical evidence extraction, coreference resolution, medication status extraction, and medication attribute extraction. The performance of these prompts was assessed using 3 state-of-the-art language models: GPT-3.5 (OpenAI), Gemini (Google), and LLaMA-2 (Meta). The study contrasted zero-shot with few-shot prompting and explored the effectiveness of ensemble approaches. Results: The study revealed that task-specific prompt tailoring is vital for the high performance of LLMs for zero-shot clinical NLP. In clinical sense disambiguation, GPT-3.5 achieved an accuracy of 0.96 with heuristic prompts and 0.94 in biomedical evidence extraction. Heuristic prompts, alongside chain of thought prompts, were highly effective across tasks. Few-shot prompting improved performance in complex scenarios, and ensemble approaches capitalized on multiple prompt strengths. GPT-3.5 consistently outperformed Gemini and LLaMA-2 across tasks and prompt types. Conclusions: This study provides a rigorous evaluation of prompt engineering methodologies and introduces innovative techniques for clinical information extraction, demonstrating the potential of in-context learning in the clinical domain. These findings offer clear guidelines for future prompt-based clinical NLP research, facilitating engagement by non-NLP experts in clinical NLP advancements. To the best of our knowledge, this is one of the first works on the empirical evaluation of different prompt engineering approaches for clinical NLP in this era of generative artificial intelligence, and we hope that it will inspire and inform future research in this area. %M 38587879 %R 10.2196/55318 %U https://medinform.jmir.org/2024/1/e55318 %U https://doi.org/10.2196/55318 %U http://www.ncbi.nlm.nih.gov/pubmed/38587879 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52935 %T Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study %A Mugaanyi,Joseph %A Cai,Liuying %A Cheng,Sumei %A Lu,Caide %A Huang,Jing %+ Department of Hepato-Pancreato-Biliary Surgery, Ningbo Medical Center Lihuili Hospital, Health Science Center, Ningbo University, No 1111 Jiangnan Road, Ningbo, 315000, China, 86 13819803591, huangjingonline@163.com %K large language models %K accuracy %K academic writing %K AI %K cross-disciplinary evaluation %K scholarly writing %K ChatGPT %K GPT-3.5 %K writing tool %K scholarly %K academic discourse %K LLMs %K machine learning algorithms %K NLP %K natural language processing %K citations %K references %K natural science %K humanities %K chatbot %K artificial intelligence %D 2024 %7 5.4.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs) have gained prominence since the release of ChatGPT in late 2022. Objective: The aim of this study was to assess the accuracy of citations and references generated by ChatGPT (GPT-3.5) in two distinct academic domains: the natural sciences and humanities. Methods: Two researchers independently prompted ChatGPT to write an introduction section for a manuscript and include citations; they then evaluated the accuracy of the citations and Digital Object Identifiers (DOIs). Results were compared between the two disciplines. 
Results: Ten topics were included, including 5 in the natural sciences and 5 in the humanities. A total of 102 citations were generated, with 55 in the natural sciences and 47 in the humanities. Among these, 40 citations (72.7%) in the natural sciences and 36 citations (76.6%) in the humanities were confirmed to exist (P=.42). There were significant disparities found in DOI presence in the natural sciences (39/55, 70.9%) and the humanities (18/47, 38.3%), along with significant differences in accuracy between the two disciplines (18/55, 32.7% vs 4/47, 8.5%). DOI hallucination was more prevalent in the humanities (42/55, 89.4%). The Levenshtein distance was significantly higher in the humanities than in the natural sciences, reflecting the lower DOI accuracy. Conclusions: ChatGPT’s performance in generating citations and references varies across disciplines. Differences in DOI standards and disciplinary nuances contribute to performance variations. Researchers should consider the strengths and limitations of artificial intelligence writing tools with respect to citation accuracy. The use of domain-specific models may enhance accuracy. %M 38578685 %R 10.2196/52935 %U https://www.jmir.org/2024/1/e52935 %U https://doi.org/10.2196/52935 %U http://www.ncbi.nlm.nih.gov/pubmed/38578685 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54580 %T An Entity Extraction Pipeline for Medical Text Records Using Large Language Models: Analytical Study %A Wang,Lei %A Ma,Yinyao %A Bi,Wenshuai %A Lv,Hanlin %A Li,Yuxiang %+ BGI Research, 1-2F, Building 2, Wuhan Optics Valley International Biomedical Enterprise Accelerator Phase 3.1, No 388 Gaoxin Road 2, Donghu New Technology Development Zone, Wuhan, 430074, China, 86 18707190886, lvhanlin@genomics.cn %K clinical data extraction %K large language models %K feature hallucination %K modular approach %K unstructured data processing %D 2024 %7 29.3.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The study of disease progression relies on clinical data, including text data, and extracting valuable features from text data has been a research hot spot. With the rise of large language models (LLMs), semantic-based extraction pipelines are gaining acceptance in clinical research. However, the security and feature hallucination issues of LLMs require further attention. Objective: This study aimed to introduce a novel modular LLM pipeline, which could semantically extract features from textual patient admission records. Methods: The pipeline was designed to process a systematic succession of concept extraction, aggregation, question generation, corpus extraction, and question-and-answer scale extraction, which was tested via 2 low-parameter LLMs: Qwen-14B-Chat (QWEN) and Baichuan2-13B-Chat (BAICHUAN). A data set of 25,709 pregnancy cases from the People’s Hospital of Guangxi Zhuang Autonomous Region, China, was used for evaluation with the help of a local expert’s annotation. The pipeline was evaluated with the metrics of accuracy and precision, null ratio, and time consumption. Additionally, we evaluated its performance via a quantified version of Qwen-14B-Chat on a consumer-grade GPU. Results: The pipeline demonstrates a high level of precision in feature extraction, as evidenced by the accuracy and precision results of Qwen-14B-Chat (95.52% and 92.93%, respectively) and Baichuan2-13B-Chat (95.86% and 90.08%, respectively). Furthermore, the pipeline exhibited low null ratios and variable time consumption. 
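As a rough sketch of the kind of staged, modular extraction pipeline described in the Methods above (concept extraction, aggregation, question generation, and question-and-answer extraction), the skeleton below may be helpful. Every function name and the ask_llm callable are hypothetical placeholders standing in for the authors' implementation and their locally deployed models.

```python
from typing import Callable, Dict, List

# Hypothetical skeleton of a staged, LLM-backed extraction pipeline.
# `ask_llm` stands in for whichever chat model is used; it is a placeholder,
# not a real API.

def extract_concepts(record: str, ask_llm: Callable[[str], str]) -> List[str]:
    reply = ask_llm(f"List the clinical concepts mentioned in this admission record:\n{record}")
    return [c.strip() for c in reply.split(",") if c.strip()]

def aggregate_concepts(concepts: List[str]) -> List[str]:
    return sorted(set(concepts))  # de-duplicate concepts across records

def generate_questions(concepts: List[str]) -> List[str]:
    return [f"Does the record mention {c}? Answer yes/no and quote the evidence." for c in concepts]

def extract_answers(record: str, questions: List[str], ask_llm: Callable[[str], str]) -> Dict[str, str]:
    return {q: ask_llm(f"{q}\n\nRecord:\n{record}") for q in questions}

def run_pipeline(records: List[str], ask_llm: Callable[[str], str]) -> List[Dict[str, str]]:
    concepts = aggregate_concepts(
        [c for r in records for c in extract_concepts(r, ask_llm)]
    )
    questions = generate_questions(concepts)
    return [extract_answers(r, questions, ask_llm) for r in records]
```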
The INT4-quantified version of QWEN delivered an enhanced performance with 97.28% accuracy and a 0% null ratio. Conclusions: The pipeline exhibited consistent performance across different LLMs and efficiently extracted clinical features from textual data. It also showed reliable performance on consumer-grade hardware. This approach offers a viable and effective solution for mining clinical research data from textual records. %M 38551633 %R 10.2196/54580 %U https://www.jmir.org/2024/1/e54580 %U https://doi.org/10.2196/54580 %U http://www.ncbi.nlm.nih.gov/pubmed/38551633 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e57054 %T Performance of GPT-4V in Answering the Japanese Otolaryngology Board Certification Examination Questions: Evaluation Study %A Noda,Masao %A Ueno,Takayoshi %A Koshu,Ryota %A Takaso,Yuji %A Shimada,Mari Dias %A Saito,Chizu %A Sugimoto,Hisashi %A Fushiki,Hiroaki %A Ito,Makoto %A Nomura,Akihiro %A Yoshizaki,Tomokazu %+ Department of Otolaryngology and Head and Neck Surgery, Jichi Medical University, Yakushiji 3311-1, Shimotsuke, 329-0498, Japan, 1 0285442111, doforanabdosuc@gmail.com %K artificial intelligence %K GPT-4v %K large language model %K otolaryngology %K GPT %K ChatGPT %K LLM %K LLMs %K language model %K language models %K head %K respiratory %K ENT: ear %K nose %K throat %K neck %K NLP %K natural language processing %K image %K images %K exam %K exams %K examination %K examinations %K answer %K answers %K answering %K response %K responses %D 2024 %7 28.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence models can learn from medical literature and clinical cases and generate answers that rival human experts. However, challenges remain in the analysis of complex data containing images and diagrams. Objective: This study aims to assess the answering capabilities and accuracy of ChatGPT-4 Vision (GPT-4V) for a set of 100 questions, including image-based questions, from the 2023 otolaryngology board certification examination. Methods: Answers to 100 questions from the 2023 otolaryngology board certification examination, including image-based questions, were generated using GPT-4V. The accuracy rate was evaluated using different prompts, and the presence of images, clinical area of the questions, and variations in the answer content were examined. Results: The accuracy rate for text-only input was, on average, 24.7% but improved to 47.3% with the addition of English translation and prompts (P<.001). The average nonresponse rate for text-only input was 46.3%; this decreased to 2.7% with the addition of English translation and prompts (P<.001). The accuracy rate was lower for image-based questions than for text-only questions across all types of input, with a relatively high nonresponse rate. General questions and questions from the fields of head and neck allergies and nasal allergies had relatively high accuracy rates, which increased with the addition of translation and prompts. In terms of content, questions related to anatomy had the highest accuracy rate. For all content types, the addition of translation and prompts increased the accuracy rate. As for the performance based on image-based questions, the average of correct answer rate with text-only input was 30.4%, and that with text-plus-image input was 41.3% (P=.02). 
Conclusions: Examination of artificial intelligence’s answering capabilities for the otolaryngology board certification examination improves our understanding of its potential and limitations in this field. Although improvement was noted with the addition of translation and prompts, the accuracy rate for image-based questions was lower than that for text-based questions, suggesting room for improvement in GPT-4V at this stage. Furthermore, text-plus-image input yielded a higher correct answer rate on image-based questions. Our findings imply the usefulness and potential of GPT-4V in medicine; however, future consideration of safe use methods is needed. %M 38546736 %R 10.2196/57054 %U https://mededu.jmir.org/2024/1/e57054 %U https://doi.org/10.2196/57054 %U http://www.ncbi.nlm.nih.gov/pubmed/38546736 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e49964 %T Performance of ChatGPT on the India Undergraduate Community Medicine Examination: Cross-Sectional Study %A Gandhi,Aravind P %A Joesph,Felista Karen %A Rajagopal,Vineeth %A Aparnavi,P %A Katkuri,Sushma %A Dayama,Sonal %A Satapathy,Prakasini %A Khatib,Mahalaqua Nazli %A Gaidhane,Shilpa %A Zahiruddin,Quazi Syed %A Behera,Ashish %+ Department of Community Medicine, All India Institute of Medical Sciences, Room 420 Department of Community Medicine, Plot 2, Sector 20, MIHAN, Nagpur, Maharashtra, 441108, India, 91 9585395395, aravindsocialdoc@gmail.com %K artificial intelligence %K ChatGPT %K community medicine %K India %K large language model %K medical education %K digitalization %D 2024 %7 25.3.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Medical students may increasingly use large language models (LLMs) in their learning. ChatGPT is an LLM at the forefront of this new development in medical education with the capacity to respond to multidisciplinary questions. Objective: The aim of this study was to evaluate the ability of ChatGPT 3.5 to complete the Indian undergraduate medical examination in the subject of community medicine. We further compared ChatGPT scores with the scores obtained by the students. Methods: The study was conducted at a publicly funded medical college in Hyderabad, India. The study was based on the internal assessment examination conducted in January 2023 for students in the Bachelor of Medicine and Bachelor of Surgery Final Year–Part I program; the examination of focus included 40 questions (divided between two papers) from the community medicine subject syllabus. Each paper had three sections with different weightage of marks for each section: section one had two long essay–type questions worth 15 marks each, section two had 8 short essay–type questions worth 5 marks each, and section three had 10 short-answer questions worth 3 marks each. The same questions were administered as prompts to ChatGPT 3.5 and the responses were recorded. Apart from scoring ChatGPT responses, two independent evaluators explored the responses to each question to further analyze their quality with regard to three subdomains: relevancy, coherence, and completeness. Each question was scored in these subdomains on a Likert scale of 1-5. The average of the two evaluators was taken as the subdomain score of the question. The proportion of questions with a score of at least 50% of the maximum score (5) in each subdomain was calculated. Results: ChatGPT 3.5 scored 72.3% on paper 1 and 61% on paper 2. The mean score of the 94 students was 43% on paper 1 and 45% on paper 2.
The responses of ChatGPT 3.5 were also rated to be satisfactorily relevant, coherent, and complete for most of the questions (>80%). Conclusions: ChatGPT 3.5 appears to have substantial and sufficient knowledge to understand and answer the Indian medical undergraduate examination in the subject of community medicine. ChatGPT may be introduced to students to enable the self-directed learning of community medicine in pilot mode. However, faculty oversight will be required as ChatGPT is still in the initial stages of development, and thus its potential and reliability of medical content from the Indian context need to be further explored comprehensively. %M 38526538 %R 10.2196/49964 %U https://formative.jmir.org/2024/1/e49964 %U https://doi.org/10.2196/49964 %U http://www.ncbi.nlm.nih.gov/pubmed/38526538 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e54840 %T Chatbots and COVID-19: Taking Stock of the Lessons Learned %A Arnold,Virginia %A Purnat,Tina D %A Marten,Robert %A Pattison,Andrew %A Gouda,Hebe %+ Department of Health Promotion, Division of UHC Healthier Populations, World Health Organization, Avenue Appia 20, Geneva, 1211, Switzerland, 41 793865070, goudah@who.int %K chatbots %K COVID-19 %K health %K public health %K pandemic %K health care %D 2024 %7 21.3.2024 %9 Editorial %J J Med Internet Res %G English %X While digital innovation in health was already rapidly evolving, the COVID-19 pandemic has accelerated the generation of digital technology tools, such as chatbots, to help increase access to crucial health information and services to those who were cut off or had limited contact with health services. This theme issue titled “Chatbots and COVID-19” presents articles from researchers and practitioners across the globe, describing the development, implementation, and evaluation of chatbots designed to address a wide range of health concerns and services. In this editorial, we present some of the key challenges and lessons learned arising from the content of this theme issue. Most notably, we note that a stronger evidence base is needed to ensure that chatbots and other digital tools are developed to best serve the needs of population health. %M 38512309 %R 10.2196/54840 %U https://www.jmir.org/2024/1/e54840 %U https://doi.org/10.2196/54840 %U http://www.ncbi.nlm.nih.gov/pubmed/38512309 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e52073 %T Preliminary Evidence of the Use of Generative AI in Health Care Clinical Services: Systematic Narrative Review %A Yim,Dobin %A Khuntia,Jiban %A Parameswaran,Vijaya %A Meyers,Arlen %+ University of Colorado Denver, 1475 Lawrence St., Denver, CO, United States, 1 3038548024, jiban.khuntia@ucdenver.edu %K generative artificial intelligence tools and applications %K GenAI %K service %K clinical %K health care %K transformation %K digital %D 2024 %7 20.3.2024 %9 Review %J JMIR Med Inform %G English %X Background: Generative artificial intelligence tools and applications (GenAI) are being increasingly used in health care. Physicians, specialists, and other providers have started primarily using GenAI as an aid or tool to gather knowledge, provide information, train, or generate suggestive dialogue between physicians and patients or between physicians and patients’ families or friends. However, unless the use of GenAI is oriented to be helpful in clinical service encounters that can improve the accuracy of diagnosis, treatment, and patient outcomes, the expected potential will not be achieved. 
As adoption continues, it is essential to validate the effectiveness of the infusion of GenAI as an intelligent technology in service encounters to understand the gap in actual clinical service use of GenAI. Objective: This study synthesizes preliminary evidence on how GenAI assists, guides, and automates clinical service rendering and encounters in health care. The review scope was limited to articles published in peer-reviewed medical journals. Methods: We screened and selected 0.38% (161/42,459) of articles published between January 1, 2020, and May 31, 2023, identified from PubMed. We followed the protocols outlined in the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines to select highly relevant studies with at least 1 element on clinical use, evaluation, and validation to provide evidence of GenAI use in clinical services. The articles were classified based on their relevance to clinical service functions or activities using the descriptive and analytical information presented in the articles. Results: Of 161 articles, 141 (87.6%) reported using GenAI to assist services through knowledge access, collation, and filtering. GenAI was used for disease detection (19/161, 11.8%), diagnosis (14/161, 8.7%), and screening processes (12/161, 7.5%) in the areas of radiology (17/161, 10.6%), cardiology (12/161, 7.5%), gastrointestinal medicine (4/161, 2.5%), and diabetes (6/161, 3.7%). The literature synthesis in this study suggests that GenAI is mainly used for diagnostic processes, improvement of diagnosis accuracy, and screening and diagnostic purposes using knowledge access. Although this solves the problem of knowledge access and may improve diagnostic accuracy, it is oriented toward higher value creation in health care. Conclusions: GenAI informs rather than assists or automates clinical service functions in health care. There is potential in clinical service, but it has yet to be actualized for GenAI. More clinical service–level evidence is needed that GenAI streamlines some functions or provides more automated help than information retrieval alone. To transform health care as purported, more studies related to GenAI applications must automate and guide human-performed services and keep up with the optimism that forward-thinking health care organizations will take advantage of GenAI. %M 38506918 %R 10.2196/52073 %U https://medinform.jmir.org/2024/1/e52073 %U https://doi.org/10.2196/52073 %U http://www.ncbi.nlm.nih.gov/pubmed/38506918 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51151 %T Incorporating ChatGPT in Medical Informatics Education: Mixed Methods Study on Student Perceptions and Experiential Integration Proposals %A Magalhães Araujo,Sabrina %A Cruz-Correia,Ricardo %+ Center for Health Technology and Services Research, Faculty of Medicine, University of Porto, Rua Dr Plácido da Costa, s/n, Porto, 4200-450, Portugal, 351 220 426 91 ext 26911, saraujo@med.up.pt %K education %K medical informatics %K artificial intelligence %K AI %K generative language model %K ChatGPT %D 2024 %7 20.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The integration of artificial intelligence (AI) technologies, such as ChatGPT, in the educational landscape has the potential to enhance the learning experience of medical informatics students and prepare them for using AI in professional settings.
The incorporation of AI in classes aims to develop critical thinking by encouraging students to interact with ChatGPT and critically analyze the responses generated by the chatbot. This approach also helps students develop important skills in the field of biomedical and health informatics to enhance their interaction with AI tools. Objective: The aim of the study is to explore the perceptions of students regarding the use of ChatGPT as a learning tool in their educational context and provide professors with examples of prompts for incorporating ChatGPT into their teaching and learning activities, thereby enhancing the educational experience for students in medical informatics courses. Methods: This study used a mixed methods approach to gain insights from students regarding the use of ChatGPT in education. To accomplish this, a structured questionnaire was applied to evaluate students’ familiarity with ChatGPT, gauge their perceptions of its use, and understand their attitudes toward its use in academic and learning tasks. Learning outcomes of 2 courses were analyzed to propose ChatGPT’s incorporation in master’s programs in medicine and medical informatics. Results: The majority of students expressed satisfaction with the use of ChatGPT in education, finding it beneficial for various purposes, including generating academic content, brainstorming ideas, and rewriting text. While some participants raised concerns about potential biases and the need for informed use, the overall perception was positive. Additionally, the study proposed integrating ChatGPT into 2 specific courses in the master’s programs in medicine and medical informatics. The incorporation of ChatGPT was envisioned to enhance student learning experiences and assist in project planning, programming code generation, examination preparation, workflow exploration, and technical interview preparation, thus advancing medical informatics education. In medical teaching, it will be used as an assistant for simplifying the explanation of concepts and solving complex problems, as well as for generating clinical narratives and patient simulators. Conclusions: The study’s valuable insights into medical faculty students’ perspectives and integration proposals for ChatGPT serve as an informative guide for professors aiming to enhance medical informatics education. The research delves into the potential of ChatGPT, emphasizes the necessity of collaboration in academic environments, identifies subject areas with discernible benefits, and underscores its transformative role in fostering innovative and engaging learning experiences. The envisaged proposals hold promise in empowering future health care professionals to work in the rapidly evolving era of digital health care. %M 38506920 %R 10.2196/51151 %U https://mededu.jmir.org/2024/1/e51151 %U https://doi.org/10.2196/51151 %U http://www.ncbi.nlm.nih.gov/pubmed/38506920 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e50882 %T Quality and Dependability of ChatGPT and DingXiangYuan Forums for Remote Orthopedic Consultations: Comparative Analysis %A Xue,Zhaowen %A Zhang,Yiming %A Gan,Wenyi %A Wang,Huajun %A She,Guorong %A Zheng,Xiaofei %+ Department of Bone and Joint Surgery and Sports Medicine Center, The First Affiliated Hospital, The First Affiliated Hospital of Jinan University, No. 
613, Huangpu Avenue West, Tianhe District, Guangzhou, 510630, China, 86 13076855735, zhengxiaofei12@163.com %K artificial intelligence %K ChatGPT %K consultation %K musculoskeletal %K natural language processing %K remote medical consultation %K orthopaedic %K orthopaedics %D 2024 %7 14.3.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The widespread use of artificial intelligence, such as ChatGPT (OpenAI), is transforming sectors, including health care, while separate advancements of the internet have enabled platforms such as China’s DingXiangYuan to offer remote medical services. Objective: This study evaluates ChatGPT-4’s responses against those of professional health care providers in telemedicine, assessing artificial intelligence’s capability to support the surge in remote medical consultations and its impact on health care delivery. Methods: We sourced remote orthopedic consultations from “Doctor DingXiang,” with responses from its certified physicians as the control and ChatGPT’s responses as the experimental group. In all, 3 blindfolded, experienced orthopedic surgeons assessed responses against 7 criteria: “logical reasoning,” “internal information,” “external information,” “guiding function,” “therapeutic effect,” “medical knowledge popularization education,” and “overall satisfaction.” We used Fleiss κ to measure agreement among multiple raters. Results: Initially, consultation records for a cumulative count of 8 maladies (equivalent to 800 cases) were gathered. We ultimately included 73 consultation records by May 2023, following primary and rescreening, in which no communication records containing private information, images, or voice messages were transmitted. After statistical scoring, we discovered that ChatGPT’s “internal information” score (mean 4.61, SD 0.52 points vs mean 4.66, SD 0.49 points; P=.43) and “therapeutic effect” score (mean 4.43, SD 0.75 points vs mean 4.55, SD 0.62 points; P=.32) were lower than those of the control group, but the differences were not statistically significant. ChatGPT showed better performance with a higher “logical reasoning” score (mean 4.81, SD 0.36 points vs mean 4.75, SD 0.39 points; P=.38), “external information” score (mean 4.06, SD 0.72 points vs mean 3.92, SD 0.77 points; P=.25), and “guiding function” score (mean 4.73, SD 0.51 points vs mean 4.72, SD 0.54 points; P=.96), although the differences were not statistically significant. Meanwhile, the “medical knowledge popularization education” score of ChatGPT was better than that of the control group (mean 4.49, SD 0.67 points vs mean 3.87, SD 1.01 points; P<.001), and the difference was statistically significant. In terms of “overall satisfaction,” the difference was not statistically significant between the groups (mean 8.35, SD 1.38 points vs mean 8.37, SD 1.24 points; P=.92). According to how Fleiss κ values were interpreted, 6 of the control group’s score points were classified as displaying “fair agreement” (P<.001), and 1 was classified as showing “substantial agreement” (P<.001). In the experimental group, 3 points were classified as indicating “fair agreement,” while 4 suggested “moderate agreement” (P<.001). Conclusions: ChatGPT-4 matches the expertise found in DingXiangYuan forums’ paid consultations, excelling particularly in scientific education. It presents a promising alternative for remote health advice. 
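For readers unfamiliar with the Fleiss κ statistic used above to quantify agreement among the three reviewers, a minimal sketch is shown below, with made-up ratings and assuming statsmodels is installed.

```python
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Hypothetical ratings: 10 consultation responses, 3 raters, scores 1-5.
rng = np.random.default_rng(0)
ratings = rng.integers(3, 6, size=(10, 3))  # subjects x raters

# aggregate_raters converts subject-by-rater scores into the
# subject-by-category count table that fleiss_kappa expects.
counts, _categories = aggregate_raters(ratings)
print(round(fleiss_kappa(counts, method="fleiss"), 3))
```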
For health care professionals, it could act as an aid in patient education, while patients may use it as a convenient tool for health inquiries. %M 38483451 %R 10.2196/50882 %U https://www.jmir.org/2024/1/e50882 %U https://doi.org/10.2196/50882 %U http://www.ncbi.nlm.nih.gov/pubmed/38483451 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e50056 %T Adapting the Number of Questions Based on Detected Psychological Distress for Cognitive Behavioral Therapy With an Embodied Conversational Agent: Comparative Study %A Shidara,Kazuhiro %A Tanaka,Hiroki %A Adachi,Hiroyoshi %A Kanayama,Daisuke %A Kudo,Takashi %A Nakamura,Satoshi %+ Nara Institute of Science and Technology, 8916-5, Takayama-cho, Ikoma, 630-0192, Japan, 81 80 4687 8116, shidara.kazuhiro.sc5@is.naist.jp %K cognitive behavioral therapy %K psychological distress detection %K embodied conversational agents %K automatic thoughts %K long short-term memory %K multitask learning %D 2024 %7 14.3.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: The high prevalence of mental illness is a critical social problem. The limited availability of mental health services is a major factor that exacerbates this problem. One solution is to deliver cognitive behavioral therapy (CBT) using an embodied conversational agent (ECA). ECAs make it possible to provide health care without location or time constraints. One of the techniques used in CBT is Socratic questioning, which guides users to correct negative thoughts. The effectiveness of this approach depends on a therapist’s skill to adapt to the user’s mood or distress level. However, current ECAs do not possess this skill. Therefore, it is essential to implement this adaptation ability to the ECAs. Objective: This study aims to develop and evaluate a method that automatically adapts the number of Socratic questions based on the level of detected psychological distress during a CBT session with an ECA. We hypothesize that this adaptive approach to selecting the number of questions will lower psychological distress, reduce negative emotional states, and produce more substantial cognitive changes compared with a random number of questions. Methods: In this study, which envisions health care support in daily life, we recruited participants aged from 18 to 65 years for an experiment that involved 2 different conditions: an ECA that adapts a number of questions based on psychological distress detection or an ECA that only asked a random number of questions. The participants were assigned to 1 of the 2 conditions, experienced a single CBT session with an ECA, and completed questionnaires before and after the session. Results: The participants completed the experiment. There were slight differences in sex, age, and preexperimental psychological distress levels between the 2 conditions. The adapted number of questions condition showed significantly lower psychological distress than the random number of questions condition after the session. We also found a significant difference in the cognitive change when the number of questions was adapted based on the detected distress level, compared with when the number of questions was fewer than what was appropriate for the level of distress detected. Conclusions: The results show that an ECA adapting the number of Socratic questions based on detected distress levels increases the effectiveness of CBT. 
Participants who received an adaptive number of questions experienced greater reductions in distress than those who received a random number of questions. In addition, the participants showed a greater amount of cognitive change when the number of questions matched the detected distress level. This suggests that adapting the question quantity based on distress level detection can improve the results of CBT delivered by an ECA. These results illustrate the advantages of ECAs, paving the way for mental health care that is more tailored and effective. %M 38483464 %R 10.2196/50056 %U https://formative.jmir.org/2024/1/e50056 %U https://doi.org/10.2196/50056 %U http://www.ncbi.nlm.nih.gov/pubmed/38483464 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e53656 %T What Is the Performance of ChatGPT in Determining the Gender of Individuals Based on Their First and Last Names? %A Sebo,Paul %+ University Institute for Primary Care, University of Geneva, Rue Michel-Servet 1, Geneva, 1211, Switzerland, 41 223794390, paulsebo@hotmail.com %K accuracy %K artificial intelligence %K AI %K ChatGPT %K gender %K gender detection tool %K misclassification %K name %K performance %K gender detection %K gender detection tools %K inequalities %K language model %K NamSor %K Gender API %K Switzerland %K physicians %K gender bias %K disparities %K gender disparities %K gender gap %D 2024 %7 13.3.2024 %9 Research Letter %J JMIR AI %G English %X %M 38875596 %R 10.2196/53656 %U https://ai.jmir.org/2024/1/e53656 %U https://doi.org/10.2196/53656 %U http://www.ncbi.nlm.nih.gov/pubmed/38875596 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e55508 %T Assessing the Utility of Multimodal Large Language Models (GPT-4 Vision and Large Language and Vision Assistant) in Identifying Melanoma Across Different Skin Tones %A Cirone,Katrina %A Akrout,Mohamed %A Abid,Latif %A Oakley,Amanda %+ Schulich School of Medicine and Dentistry, Western University, 1151 Richmond Street, London, ON, N6A 5C1, Canada, 1 6475324596, kcirone2024@meds.uwo.ca %K melanoma %K nevus %K skin pigmentation %K artificial intelligence %K AI %K multimodal large language models %K large language model %K large language models %K LLM %K LLMs %K machine learning %K expert systems %K natural language processing %K NLP %K GPT %K GPT-4V %K dermatology %K skin %K lesion %K lesions %K cancer %K oncology %K visual %D 2024 %7 13.3.2024 %9 Research Letter %J JMIR Dermatol %G English %X The large language models GPT-4 Vision and Large Language and Vision Assistant are capable of understanding and accurately differentiating between benign lesions and melanoma, indicating potential incorporation into dermatologic care, medical research, and education. 
%M 38477960 %R 10.2196/55508 %U https://derma.jmir.org/2024/1/e55508 %U https://doi.org/10.2196/55508 %U http://www.ncbi.nlm.nih.gov/pubmed/38477960 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e54393 %T Capability of GPT-4V(ision) in the Japanese National Medical Licensing Examination: Evaluation Study %A Nakao,Takahiro %A Miki,Soichiro %A Nakamura,Yuta %A Kikuchi,Tomohiro %A Nomura,Yukihiro %A Hanaoka,Shouhei %A Yoshikawa,Takeharu %A Abe,Osamu %+ Department of Computational Diagnostic Radiology and Preventive Medicine, The University of Tokyo Hospital, 7-3-1 Hongo, Bunkyo-ku, Tokyo, 113-8655, Japan, 81 358008666, tanakao-tky@umin.ac.jp %K AI %K artificial intelligence %K LLM %K large language model %K language model %K language models %K ChatGPT %K GPT-4 %K GPT-4V %K generative pretrained transformer %K image %K images %K imaging %K response %K responses %K exam %K examination %K exams %K examinations %K answer %K answers %K NLP %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K medical education %D 2024 %7 12.3.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Previous research applying large language models (LLMs) to medicine was focused on text-based information. Recently, multimodal variants of LLMs acquired the capability of recognizing images. Objective: We aim to evaluate the image recognition capability of generative pretrained transformer (GPT)-4V, a recent multimodal LLM developed by OpenAI, in the medical field by testing how visual information affects its performance to answer questions in the 117th Japanese National Medical Licensing Examination. Methods: We focused on 108 questions that had 1 or more images as part of a question and presented GPT-4V with the same questions under two conditions: (1) with both the question text and associated images and (2) with the question text only. We then compared the difference in accuracy between the 2 conditions using the exact McNemar test. Results: Among the 108 questions with images, GPT-4V’s accuracy was 68% (73/108) when presented with images and 72% (78/108) when presented without images (P=.36). For the 2 question categories, clinical and general, the accuracies with and those without images were 71% (70/98) versus 78% (76/98; P=.21) and 30% (3/10) versus 20% (2/10; P≥.99), respectively. Conclusions: The additional information from the images did not significantly improve the performance of GPT-4V in the Japanese National Medical Licensing Examination. 
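The with-images versus without-images comparison above relies on the exact McNemar test for paired binary outcomes. A minimal sketch with statsmodels follows; the marginal totals match the reported accuracies (73/108 and 78/108), but the split of discordant pairs is an assumption chosen for illustration.

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired outcomes for the same questions answered under two conditions.
# Rows: with images (correct, incorrect); columns: without images.
# Counts are illustrative, not the study's underlying data.
table = [[65, 8],    # correct with images: 65 also correct without, 8 correct only with images
         [13, 22]]   # incorrect with images: 13 correct without, 22 incorrect in both

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print(f"statistic={result.statistic}, p={result.pvalue:.3f}")
```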
%M 38470459 %R 10.2196/54393 %U https://mededu.jmir.org/2024/1/e54393 %U https://doi.org/10.2196/54393 %U http://www.ncbi.nlm.nih.gov/pubmed/38470459 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53008 %T Generative AI in Medical Practice: In-Depth Exploration of Privacy and Security Challenges %A Chen,Yan %A Esmaeilzadeh,Pouyan %+ Department of Information Systems and Business Analytics, College of Business, Florida International University, Modesto A Maidique Campus, 11200 SW 8th St, RB 261 B, Miami, FL, 33199, United States, 1 3053483302, pesmaeil@fiu.edu %K artificial intelligence %K AI %K generative artificial intelligence %K generative AI %K medical practices %K potential benefits %K security and privacy threats %D 2024 %7 8.3.2024 %9 Viewpoint %J J Med Internet Res %G English %X As advances in artificial intelligence (AI) continue to transform and revolutionize the field of medicine, understanding the potential uses of generative AI in health care becomes increasingly important. Generative AI, including models such as generative adversarial networks and large language models, shows promise in transforming medical diagnostics, research, treatment planning, and patient care. However, these data-intensive systems pose new threats to protected health information. This Viewpoint paper aims to explore various categories of generative AI in health care, including medical diagnostics, drug discovery, virtual health assistants, medical research, and clinical decision support, while identifying security and privacy threats within each phase of the life cycle of such systems (ie, data collection, model development, and implementation phases). The objectives of this study were to analyze the current state of generative AI in health care, identify opportunities and privacy and security challenges posed by integrating these technologies into existing health care infrastructure, and propose strategies for mitigating security and privacy risks. This study highlights the importance of addressing the security and privacy threats associated with generative AI in health care to ensure the safe and effective use of these systems. The findings of this study can inform the development of future generative AI systems in health care and help health care organizations better understand the potential benefits and risks associated with these systems. By examining the use cases and benefits of generative AI across diverse domains within health care, this paper contributes to theoretical discussions surrounding AI ethics, security vulnerabilities, and data privacy regulations. In addition, this study provides practical insights for stakeholders looking to adopt generative AI solutions within their organizations. 
%M 38457208 %R 10.2196/53008 %U https://www.jmir.org/2024/1/e53008 %U https://doi.org/10.2196/53008 %U http://www.ncbi.nlm.nih.gov/pubmed/38457208 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e53559 %T The Temperature Feature of ChatGPT: Modifying Creativity for Clinical Research %A Davis,Joshua %A Van Bulck,Liesbet %A Durieux,Brigitte N %A Lindvall,Charlotta %+ Department of Psychosocial Oncology and Palliative Care, Dana-Farber Cancer Institute, 450 Brookline Ave, Boston, MA, 02215, United States, 1 617 632 6464, charlotta_lindvall@dfci.harvard.edu %K artificial intelligence %K ChatGPT %K clinical communication %K creative %K creativity %K customization %K customize %K customized %K generation %K generative %K language model %K language models %K LLM %K LLMs %K natural language processing %K NLP %K random %K randomness %K tailor %K tailored %K temperature %K text %K texts %K textual %D 2024 %7 8.3.2024 %9 Viewpoint %J JMIR Hum Factors %G English %X More clinicians and researchers are exploring uses for large language model chatbots, such as ChatGPT, for research, dissemination, and educational purposes. Therefore, it becomes increasingly relevant to consider the full potential of this tool, including the special features that are currently available through the application programming interface. One of these features is a variable called temperature, which changes the degree to which randomness is involved in the model’s generated output. This is of particular interest to clinicians and researchers. By lowering this variable, one can generate more consistent outputs; by increasing it, one can receive more creative responses. For clinicians and researchers who are exploring these tools for a variety of tasks, the ability to tailor outputs to be less creative may be beneficial for work that demands consistency. Additionally, access to more creative text generation may enable scientific authors to describe their research in more general language and potentially connect with a broader public through social media. In this viewpoint, we present the temperature feature, discuss potential uses, and provide some examples. 
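As a concrete illustration of the temperature parameter discussed above, the sketch below sets it explicitly through the OpenAI Chat Completions API, assuming the v1+ Python SDK and an API key in the environment; the model name and the prompts are placeholders.

```python
from openai import OpenAI  # assumes the openai Python SDK (v1+) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def generate(prompt: str, temperature: float) -> str:
    """Request a completion at a given temperature (0 = most deterministic)."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name; substitute as needed
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return response.choices[0].message.content

# Lower temperature favors consistent, reproducible output (e.g., structured
# clinical summaries); higher temperature admits more varied, creative phrasing
# (e.g., lay-audience descriptions of a study).
consistent = generate("Summarize this discharge note in two sentences: ...", temperature=0.2)
creative = generate("Describe this study for a general audience: ...", temperature=1.2)
```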
%M 38457221 %R 10.2196/53559 %U https://humanfactors.jmir.org/2024/1/e53559 %U https://doi.org/10.2196/53559 %U http://www.ncbi.nlm.nih.gov/pubmed/38457221 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e52885 %T Leveraging Generative AI Tools to Support the Development of Digital Solutions in Health Care Research: Case Study %A Rodriguez,Danissa V %A Lawrence,Katharine %A Gonzalez,Javier %A Brandfield-Harvey,Beatrix %A Xu,Lynn %A Tasneem,Sumaiya %A Levine,Defne L %A Mann,Devin %+ Department of Population Health, New York University Grossman School of Medicine, 227 East 30th Street, 6th Floor, New York, NY, 10016, United States, 1 646 501 2684, danissa.rodriguez@nyulangone.org %K digital health %K GenAI %K generative %K artificial intelligence %K ChatGPT %K software engineering %K mHealth %K mobile health %K app %K apps %K application %K applications %K diabetes %K diabetic %K diabetes prevention %K digital prescription %K software %K engagement %K behaviour change %K behavior change %K developer %K developers %K LLM %K LLMs %K language model %K language models %K NLP %K natural language processing %D 2024 %7 6.3.2024 %9 Original Paper %J JMIR Hum Factors %G English %X Background: Generative artificial intelligence has the potential to revolutionize health technology product development by improving coding quality, efficiency, documentation, quality assessment and review, and troubleshooting. Objective: This paper explores the application of a commercially available generative artificial intelligence tool (ChatGPT) to the development of a digital health behavior change intervention designed to support patient engagement in a commercial digital diabetes prevention program. Methods: We examined the capacity, advantages, and limitations of ChatGPT to support digital product idea conceptualization, intervention content development, and the software engineering process, including software requirement generation, software design, and code production. In total, 11 evaluators, each with at least 10 years of experience in fields of study ranging from medicine and implementation science to computer science, participated in the output review process (ChatGPT vs human-generated output). All had familiarity or prior exposure to the original personalized automatic messaging system intervention. The evaluators rated the ChatGPT-produced outputs in terms of understandability, usability, novelty, relevance, completeness, and efficiency. Results: Most metrics received positive scores. We identified that ChatGPT can (1) support developers to achieve high-quality products faster and (2) facilitate nontechnical communication and system understanding between technical and nontechnical team members around the development goal of rapid and easy-to-build computational solutions for medical technologies. Conclusions: ChatGPT can serve as a usable facilitator for researchers engaging in the software development life cycle, from product conceptualization to feature identification and user story development to code generation. 
Trial Registration: ClinicalTrials.gov NCT04049500; https://clinicaltrials.gov/ct2/show/NCT04049500 %M 38446539 %R 10.2196/52885 %U https://humanfactors.jmir.org/2024/1/e52885 %U https://doi.org/10.2196/52885 %U http://www.ncbi.nlm.nih.gov/pubmed/38446539 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e50163 %T Readability and Health Literacy Scores for ChatGPT-Generated Dermatology Public Education Materials: Cross-Sectional Analysis of Sunscreen and Melanoma Questions %A Roster,Katie %A Kann,Rebecca B %A Farabi,Banu %A Gronbeck,Christian %A Brownstone,Nicholas %A Lipner,Shari R %+ Department of Dermatology, Weill Cornell Medicine, 1305 York Ave 9th Floor, New York, NY, 10021, United States, 1 646 962 3376, shl9032@med.cornell.edu %K ChatGPT %K artificial intelligence %K AI %K LLM %K LLMs %K large language model %K language model %K language models %K generative %K NLP %K natural language processing %K health disparities %K health literacy %K readability %K disparities %K disparity %K dermatology %K health information %K comprehensible %K comprehensibility %K understandability %K patient education %K public education %K health education %K online information %D 2024 %7 6.3.2024 %9 Research Letter %J JMIR Dermatol %G English %X %M 38446502 %R 10.2196/50163 %U https://derma.jmir.org/2024/1/e50163 %U https://doi.org/10.2196/50163 %U http://www.ncbi.nlm.nih.gov/pubmed/38446502 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 7 %N %P e48451 %T Potential Use of ChatGPT in Responding to Patient Questions and Creating Patient Resources %A Reynolds,Kelly %A Tejasvi,Trilokraj %+ Department of Dermatology, University of Michigan, 1500 East Medical Center Drive, Ann Arbor, MI, 48109, United States, 1 7349364054, ttejasvi@med.umich.edu %K artificial intelligence %K AI %K ChatGPT %K patient resources %K patient handouts %K natural language processing software %K language model %K language models %K natural language processing %K chatbot %K chatbots %K conversational agent %K conversational agents %K patient education %K educational resource %K educational %D 2024 %7 6.3.2024 %9 Viewpoint %J JMIR Dermatol %G English %X ChatGPT (OpenAI) is an artificial intelligence–based free natural language processing model that generates complex responses to user-generated prompts. The advent of this tool comes at a time when physician burnout is at an all-time high, which is attributed at least in part to time spent outside of the patient encounter within the electronic medical record (documenting the encounter, responding to patient messages, etc). Although ChatGPT is not specifically designed to provide medical information, it can generate preliminary responses to patients’ questions about their medical conditions and can precipitately create educational patient resources, which do inevitably require rigorous editing and fact-checking on the part of the health care provider to ensure accuracy. In this way, this assistive technology has the potential to not only enhance a physician’s efficiency and work-life balance but also enrich the patient-physician relationship and ultimately improve patient outcomes. %M 38446541 %R 10.2196/48451 %U https://derma.jmir.org/2024/1/e48451 %U https://doi.org/10.2196/48451 %U http://www.ncbi.nlm.nih.gov/pubmed/38446541 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e51837 %T What’s in a Name? 
Experimental Evidence of Gender Bias in Recommendation Letters Generated by ChatGPT %A Kaplan,Deanna M %A Palitsky,Roman %A Arconada Alvarez,Santiago J %A Pozzo,Nicole S %A Greenleaf,Morgan N %A Atkinson,Ciara A %A Lam,Wilbur A %+ Department of Family and Preventive Medicine, Emory University School of Medicine, Administrative Offices, Wesley Woods Campus, 1841 Clifton Road, NE, 5th Floor, Atlanta, GA, 30329, United States, 1 520 370 6752, deanna.m.kaplan@emory.edu %K chatbot %K generative artificial intelligence %K generative AI %K gender bias %K large language models %K letters of recommendation %K recommendation letter %K language model %K chatbots %K artificial intelligence %K AI %K gender-based language %K human written %K real-world %K scenario %D 2024 %7 5.3.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence chatbots such as ChatGPT (OpenAI) have garnered excitement about their potential for delegating writing tasks ordinarily performed by humans. Many of these tasks (eg, writing recommendation letters) have social and professional ramifications, making the potential social biases in ChatGPT’s underlying language model a serious concern. Objective: Three preregistered studies used the text analysis program Linguistic Inquiry and Word Count to investigate gender bias in recommendation letters written by ChatGPT in human-use sessions (N=1400 total letters). Methods: We conducted analyses using 22 existing Linguistic Inquiry and Word Count dictionaries, as well as 6 newly created dictionaries based on systematic reviews of gender bias in recommendation letters, to compare recommendation letters generated for the 200 most historically popular “male” and “female” names in the United States. Study 1 used 3 different letter-writing prompts intended to accentuate professional accomplishments associated with male stereotypes, female stereotypes, or neither. Study 2 examined whether lengthening each of the 3 prompts while holding the between-prompt word count constant modified the extent of bias. Study 3 examined the variability within letters generated for the same name and prompts. We hypothesized that when prompted with gender-stereotyped professional accomplishments, ChatGPT would evidence gender-based language differences replicating those found in systematic reviews of human-written recommendation letters (eg, more affiliative, social, and communal language for female names; more agentic and skill-based language for male names). Results: Significant differences in language between letters generated for female versus male names were observed across all prompts, including the prompt hypothesized to be neutral, and across nearly all language categories tested. Historically female names received significantly more social referents (5/6, 83% of prompts), communal or doubt-raising language (4/6, 67% of prompts), personal pronouns (4/6, 67% of prompts), and clout language (5/6, 83% of prompts). Contradicting the study hypotheses, some gender differences (eg, achievement language and agentic language) were significant in both the hypothesized and nonhypothesized directions, depending on the prompt. Heteroscedasticity between male and female names was observed in multiple linguistic categories, with greater variance for historically female names than for historically male names. 
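The letter comparisons above rest on dictionary-based word counting of the kind LIWC performs. The sketch below illustrates only that general counting approach; the category word lists are invented for the example, since the actual LIWC dictionaries are proprietary and far more extensive.

```python
import re
from collections import Counter

# Tiny, made-up category dictionaries; real LIWC lexicons are proprietary and
# far larger. This only illustrates the counting approach.
DICTIONARIES = {
    "communal": {"supportive", "warm", "helpful", "caring", "team"},
    "agentic": {"driven", "assertive", "independent", "leader", "decisive"},
}

def category_rates(letter: str) -> dict:
    """Relative frequency of each category's words per 100 words of the letter."""
    words = re.findall(r"[a-z']+", letter.lower())
    counts = Counter(words)
    total = max(len(words), 1)
    return {
        cat: 100 * sum(counts[w] for w in vocab) / total
        for cat, vocab in DICTIONARIES.items()
    }

letter = "She is a warm, supportive, and caring colleague and a decisive leader."
print(category_rates(letter))
```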
Conclusions: ChatGPT reproduces many gender-based language biases that have been reliably identified in investigations of human-written reference letters, although these differences vary across prompts and language categories. Caution should be taken when using ChatGPT for tasks that have social consequences, such as reference letter writing. The methods developed in this study may be useful for ongoing bias testing among progressive generations of chatbots across a range of real-world scenarios. Trial Registration: OSF Registries osf.io/ztv96; https://osf.io/ztv96 %M 38441945 %R 10.2196/51837 %U https://www.jmir.org/2024/1/e51837 %U https://doi.org/10.2196/51837 %U http://www.ncbi.nlm.nih.gov/pubmed/38441945 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e49139 %T Use of Large Language Models to Assess the Likelihood of Epidemics From the Content of Tweets: Infodemiology Study %A Deiner,Michael S %A Deiner,Natalie A %A Hristidis,Vagelis %A McLeod,Stephen D %A Doan,Thuy %A Lietman,Thomas M %A Porco,Travis C %+ Francis I. Proctor Foundation for Research in Ophthalmology, University of California, San Francisco, 490 Illinois St., Box 0944, San Francisco, CA, 94143-0944, United States, 1 415 476 0527, travis.porco@ucsf.edu %K conjunctivitis %K microblog %K social media %K generative large language model %K Generative Pre-trained Transformers %K GPT-3.5 %K GPT-4 %K epidemic detection %K Twitter %K X formerly known as Twitter %K infectious eye disease %D 2024 %7 1.3.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Previous work suggests that Google searches could be useful in identifying conjunctivitis epidemics. Content-based assessment of social media content may provide additional value in serving as early indicators of conjunctivitis and other systemic infectious diseases. Objective: We investigated whether large language models, specifically GPT-3.5 and GPT-4 (OpenAI), can provide probabilistic assessments of whether social media posts about conjunctivitis could indicate a regional outbreak. Methods: A total of 12,194 conjunctivitis-related tweets were obtained using a targeted Boolean search in multiple languages from India, Guam (United States), Martinique (France), the Philippines, American Samoa (United States), Fiji, Costa Rica, Haiti, and the Bahamas, covering the time frame from January 1, 2012, to March 13, 2023. By providing these tweets via prompts to GPT-3.5 and GPT-4, we obtained probabilistic assessments that were validated by 2 human raters. We then calculated Pearson correlations of these time series with tweet volume and the occurrence of known outbreaks in these 9 locations, with time series bootstrap used to compute CIs. Results: Probabilistic assessments derived from GPT-3.5 showed correlations of 0.60 (95% CI 0.47-0.70) and 0.53 (95% CI 0.40-0.65) with the 2 human raters, with higher results for GPT-4. The weekly averages of GPT-3.5 probabilities showed substantial correlations with weekly tweet volume for 44% (4/9) of the countries, with correlations ranging from 0.10 (95% CI 0.0-0.29) to 0.53 (95% CI 0.39-0.89), with larger correlations for GPT-4. More modest correlations were found for correlation with known epidemics, with substantial correlation only in American Samoa (0.40, 95% CI 0.16-0.81). Conclusions: These findings suggest that GPT prompting can efficiently assess the content of social media posts and indicate possible disease outbreaks to a degree of accuracy comparable to that of humans. 
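The correlations above pair weekly GPT-derived outbreak probabilities with weekly tweet volume, with CIs from a time series bootstrap. The sketch below shows one simplified way to do this using a moving-block bootstrap on synthetic series; the block length, the data, and the procedural details are illustrative assumptions rather than the authors' exact method.

```python
import numpy as np
from scipy.stats import pearsonr

def block_bootstrap_ci(x, y, block_len=8, n_boot=2000, seed=0):
    """95% CI for Pearson r via a simple moving-block bootstrap."""
    rng = np.random.default_rng(seed)
    n = len(x)
    starts = np.arange(n - block_len + 1)
    stats = []
    for _ in range(n_boot):
        # Resample whole blocks to preserve short-range temporal dependence.
        chosen = rng.choice(starts, size=int(np.ceil(n / block_len)), replace=True)
        idx = np.concatenate([np.arange(s, s + block_len) for s in chosen])[:n]
        stats.append(pearsonr(x[idx], y[idx])[0])
    return np.percentile(stats, [2.5, 97.5])

# Synthetic weekly series standing in for GPT-derived outbreak probabilities
# and tweet volume; real data would come from the collected posts.
rng = np.random.default_rng(1)
signal = rng.normal(size=120).cumsum()
prob = signal + rng.normal(scale=2, size=120)
volume = signal + rng.normal(scale=2, size=120)

r, _ = pearsonr(prob, volume)
lo, hi = block_bootstrap_ci(prob, volume, block_len=8)
print(f"r={r:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```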
Furthermore, we found that automated content analysis of tweets is related to tweet volume for conjunctivitis-related posts in some locations and to the occurrence of actual epidemics. Future work may improve the sensitivity and specificity of these methods for disease outbreak detection. %M 38427404 %R 10.2196/49139 %U https://www.jmir.org/2024/1/e49139 %U https://doi.org/10.2196/49139 %U http://www.ncbi.nlm.nih.gov/pubmed/38427404 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51426 %T Exploring the Feasibility of Using ChatGPT to Create Just-in-Time Adaptive Physical Activity mHealth Intervention Content: Case Study %A Willms,Amanda %A Liu,Sam %+ School of Exercise Science, Physical and Health Education, University of Victoria, PO Box 3010 STN CSC, Victoria, BC, V8W 2Y2, Canada, 1 250 721 8392, awillms@uvic.ca %K ChatGPT %K digital health %K mobile health %K mHealth %K physical activity %K application %K mobile app %K mobile apps %K content creation %K behavior change %K app design %D 2024 %7 29.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Achieving physical activity (PA) guidelines’ recommendation of 150 minutes of moderate-to-vigorous PA per week has been shown to reduce the risk of many chronic conditions. Despite the overwhelming evidence in this field, PA levels remain low globally. By creating engaging mobile health (mHealth) interventions through strategies such as just-in-time adaptive interventions (JITAIs) that are tailored to an individual’s dynamic state, there is potential to increase PA levels. However, generating personalized content can take a long time due to various versions of content required for the personalization algorithms. ChatGPT presents an incredible opportunity to rapidly produce tailored content; however, there is a lack of studies exploring its feasibility. Objective: This study aimed to (1) explore the feasibility of using ChatGPT to create content for a PA JITAI mobile app and (2) describe lessons learned and future recommendations for using ChatGPT in the development of mHealth JITAI content. Methods: During phase 1, we used Pathverse, a no-code app builder, and ChatGPT to develop a JITAI app to help parents support their child’s PA levels. The intervention was developed based on the Multi-Process Action Control (M-PAC) framework, and the necessary behavior change techniques targeting the M-PAC constructs were implemented in the app design to help parents support their child’s PA. The acceptability of using ChatGPT for this purpose was discussed to determine its feasibility. In phase 2, we summarized the lessons we learned during the JITAI content development process using ChatGPT and generated recommendations to inform future similar use cases. Results: In phase 1, by using specific prompts, we efficiently generated content for 13 lessons relating to increasing parental support for their child’s PA following the M-PAC framework. It was determined that using ChatGPT for this case study to develop PA content for a JITAI was acceptable. In phase 2, we summarized our recommendations into the following six steps when using ChatGPT to create content for mHealth behavior interventions: (1) determine target behavior, (2) ground the intervention in behavior change theory, (3) design the intervention structure, (4) input intervention structure and behavior change constructs into ChatGPT, (5) revise the ChatGPT response, and (6) customize the response to be used in the intervention. 
Conclusions: ChatGPT offers a remarkable opportunity for rapid content creation in the context of an mHealth JITAI. Although our case study demonstrated that ChatGPT was acceptable, it is essential to approach its use, along with other language models, with caution. Before delivering content to population groups, expert review is crucial to ensure accuracy and relevancy. Future research and application of these guidelines are imperative as we deepen our understanding of ChatGPT and its interactions with human input. %M 38421689 %R 10.2196/51426 %U https://mededu.jmir.org/2024/1/e51426 %U https://doi.org/10.2196/51426 %U http://www.ncbi.nlm.nih.gov/pubmed/38421689 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e48168 %T Attrition in Conversational Agent–Delivered Mental Health Interventions: Systematic Review and Meta-Analysis %A Jabir,Ahmad Ishqi %A Lin,Xiaowen %A Martinengo,Laura %A Sharp,Gemma %A Theng,Yin-Leng %A Tudor Car,Lorainne %+ Lee Kong Chian School of Medicine, Nanyang Technological University Singapore, 11 Mandalay Road, Level 18, Singapore, 308232, Singapore, 65 69041258, lorainne.tudor.car@ntu.edu.sg %K conversational agent %K chatbot %K mental health %K mHealth %K attrition %K dropout %K mobile phone %K artificial intelligence %K AI %K systematic review %K meta-analysis %K digital health interventions %D 2024 %7 27.2.2024 %9 Review %J J Med Internet Res %G English %X Background: Conversational agents (CAs) or chatbots are computer programs that mimic human conversation. They have the potential to improve access to mental health interventions through automated, scalable, and personalized delivery of psychotherapeutic content. However, digital health interventions, including those delivered by CAs, often have high attrition rates. Identifying the factors associated with attrition is critical to improving future clinical trials. Objective: This review aims to estimate the overall and differential rates of attrition in CA-delivered mental health interventions (CA interventions), evaluate the impact of study design and intervention-related aspects on attrition, and describe study design features aimed at reducing or mitigating study attrition. Methods: We searched PubMed, Embase (Ovid), PsycINFO (Ovid), Cochrane Central Register of Controlled Trials, and Web of Science, and conducted a gray literature search on Google Scholar in June 2022. We included randomized controlled trials that compared CA interventions against control groups and excluded studies that lasted for 1 session only and used Wizard of Oz interventions. We also assessed the risk of bias in the included studies using the Cochrane Risk of Bias Tool 2.0. Random-effects proportional meta-analysis was applied to calculate the pooled dropout rates in the intervention groups. Random-effects meta-analysis was used to compare the attrition rate in the intervention groups with that in the control groups. We used a narrative review to summarize the findings. Results: The systematic search retrieved 4566 records from peer-reviewed databases and citation searches, of which 41 (0.90%) randomized controlled trials met the inclusion criteria. The meta-analytic overall attrition rate in the intervention group was 21.84% (95% CI 16.74%-27.36%; I2=94%). Short-term studies that lasted ≤8 weeks showed a lower attrition rate (18.05%, 95% CI 9.91%- 27.76%; I2=94.6%) than long-term studies that lasted >8 weeks (26.59%, 95% CI 20.09%-33.63%; I2=93.89%). 
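The pooled dropout rates above come from a random-effects proportional meta-analysis. Before the remaining results, here is a compact sketch of one common way to compute such a pooled proportion (logit transform with DerSimonian-Laird weights), using made-up study counts; the authors' exact model may differ.

```python
import numpy as np

def pooled_proportion(events, totals):
    """Random-effects (DerSimonian-Laird) pooling of logit-transformed proportions."""
    events, totals = np.asarray(events, float), np.asarray(totals, float)
    p = events / totals
    y = np.log(p / (1 - p))                 # logit of each study's dropout rate
    v = 1 / events + 1 / (totals - events)  # approximate variance of the logit
    w = 1 / v                               # fixed-effect (inverse-variance) weights
    y_fe = np.sum(w * y) / np.sum(w)
    q = np.sum(w * (y - y_fe) ** 2)         # Cochran's Q heterogeneity statistic
    k = len(y)
    tau2 = max(0.0, (q - (k - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    w_re = 1 / (v + tau2)                   # random-effects weights
    y_re = np.sum(w_re * y) / np.sum(w_re)
    se = np.sqrt(1 / np.sum(w_re))
    ci = y_re + np.array([-1.96, 1.96]) * se
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0
    expit = lambda z: 1 / (1 + np.exp(-z))  # back-transform to the proportion scale
    return expit(y_re), expit(ci), i2

# Made-up dropout counts for 5 hypothetical trials (events = dropouts).
rate, (lo, hi), i2 = pooled_proportion(events=[10, 25, 8, 40, 15],
                                       totals=[60, 90, 50, 150, 70])
print(f"pooled dropout {rate:.1%} (95% CI {lo:.1%}-{hi:.1%}), I2={i2:.0f}%")
```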
Intervention group participants were more likely to attrit than control group participants for short-term (log odds ratio 1.22, 95% CI 0.99-1.50; I2=21.89%) and long-term studies (log odds ratio 1.33, 95% CI 1.08-1.65; I2=49.43%). Intervention-related characteristics associated with higher attrition include stand-alone CA interventions without human support, not having a symptom tracker feature, no visual representation of the CA, and comparing CA interventions with waitlist controls. No participant-level factor reliably predicted attrition. Conclusions: Our results indicated that approximately one-fifth of the participants will drop out from CA interventions in short-term studies. High heterogeneities made it difficult to generalize the findings. Our results suggested that future CA interventions should adopt a blended design with human support, use symptom tracking, compare CA intervention groups against active controls rather than waitlist controls, and include a visual representation of the CA to reduce the attrition rate. Trial Registration: PROSPERO International Prospective Register of Systematic Reviews CRD42022341415; https://www.crd.york.ac.uk/prospero/display_record.php?ID=CRD42022341415 %M 38412023 %R 10.2196/48168 %U https://www.jmir.org/2024/1/e48168 %U https://doi.org/10.2196/48168 %U http://www.ncbi.nlm.nih.gov/pubmed/38412023 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e46758 %T Chatbots That Deliver Contraceptive Support: Systematic Review %A Mills,Rhiana %A Mangone,Emily Rose %A Lesh,Neal %A Jayal,Gayatri %A Mohan,Diwakar %A Baraitser,Paula %+ SH24, 35A Westminster Bridge Road, London, SE1 7JB, United Kingdom, 44 7742932445, rhiana@sh24.org.uk %K chatbot %K contraceptives %K digital health %K AI %K systematic review %K conversational agent %K development best practices %K development %K counseling %K communication %K user feedback %K users %K feedback %K attitudes %K behavior %D 2024 %7 27.2.2024 %9 Review %J J Med Internet Res %G English %X Background: A chatbot is a computer program that is designed to simulate conversation with humans. Chatbots may offer rapid, responsive, and private contraceptive information; counseling; and linkages to products and services, which could improve contraceptive knowledge, attitudes, and behaviors. Objective: This review aimed to systematically collate and interpret evidence to determine whether and how chatbots improve contraceptive knowledge, attitudes, and behaviors. Contraceptive knowledge, attitudes, and behaviors include access to contraceptive information, understanding of contraceptive information, access to contraceptive services, contraceptive uptake, contraceptive continuation, and contraceptive communication or negotiation skills. A secondary aim of the review is to identify and summarize best practice recommendations for chatbot development to improve contraceptive outcomes, including the cost-effectiveness of chatbots where evidence is available. Methods: We systematically searched peer-reviewed and gray literature (2010-2022) for papers that evaluated chatbots offering contraceptive information and services. Sources were included if they featured a chatbot and addressed an element of contraception, for example, uptake of hormonal contraceptives. Literature was assessed for methodological quality using appropriate quality assessment tools. Data were extracted from the included sources using a data extraction framework. 
A narrative synthesis approach was used to collate qualitative evidence as quantitative evidence was too sparse for a quantitative synthesis to be carried out. Results: We identified 15 sources, including 8 original research papers and 7 gray literature papers. These sources included 16 unique chatbots. This review found the following evidence on the impact and efficacy of chatbots: a large, robust randomized controlled trial suggests that chatbots have no effect on intention to use contraception; a small, uncontrolled cohort study suggests increased uptake of contraception among adolescent girls; and a development report, using poor-quality methods, suggests no impact on improved access to services. There is also poor-quality evidence to suggest increased contraceptive knowledge from interacting with chatbot content. User engagement was mixed, with some chatbots reaching wide audiences and others reaching very small audiences. User feedback suggests that chatbots may be experienced as acceptable, convenient, anonymous, and private, but also as incompetent, inconvenient, and unsympathetic. The best practice guidance on the development of chatbots to improve contraceptive knowledge, attitudes, and behaviors is consistent with that in the literature on chatbots in other health care fields. Conclusions: We found limited and conflicting evidence on chatbots to improve contraceptive knowledge, attitudes, and behaviors. Further research that examines the impact of chatbot interventions in comparison with alternative technologies, acknowledges the varied and changing nature of chatbot interventions, and seeks to identify key features associated with improved contraceptive outcomes is needed. The limitations of this review include the limited evidence available on this topic, the lack of formal evaluation of chatbots in this field, and the lack of standardized definition of what a chatbot is. %M 38412028 %R 10.2196/46758 %U https://www.jmir.org/2024/1/e46758 %U https://doi.org/10.2196/46758 %U http://www.ncbi.nlm.nih.gov/pubmed/38412028 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51523 %T Evaluating Large Language Models for the National Premedical Exam in India: Comparative Analysis of GPT-3.5, GPT-4, and Bard %A Farhat,Faiza %A Chaudhry,Beenish Moalla %A Nadeem,Mohammad %A Sohail,Shahab Saquib %A Madsen,Dag Øivind %+ School of Business, University of South-Eastern Norway, Bredalsveien 14, Hønefoss, 3511, Norway, 47 31008732, dag.oivind.madsen@usn.no %K accuracy %K AI model %K artificial intelligence %K Bard %K ChatGPT %K educational task %K GPT-4 %K Generative Pre-trained Transformers %K large language models %K medical education, medical exam %K natural language processing %K performance %K premedical exams %K suitability %D 2024 %7 21.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models (LLMs) have revolutionized natural language processing with their ability to generate human-like text through extensive training on large data sets. These models, including Generative Pre-trained Transformers (GPT)-3.5 (OpenAI), GPT-4 (OpenAI), and Bard (Google LLC), find applications beyond natural language processing, attracting interest from academia and industry. Students are actively leveraging LLMs to enhance learning experiences and prepare for high-stakes exams, such as the National Eligibility cum Entrance Test (NEET) in India. 
Objective: This comparative analysis aims to evaluate the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. Methods: In this paper, we evaluated the performance of the 3 mainstream LLMs, namely GPT-3.5, GPT-4, and Google Bard, in answering questions related to the NEET-2023 exam. The questions of the NEET were provided to these artificial intelligence models, and the responses were recorded and compared against the correct answers from the official answer key. Consensus was used to evaluate the performance of all 3 models. Results: It was evident that GPT-4 passed the entrance test with flying colors (300/700, 42.9%), showcasing exceptional performance. On the other hand, GPT-3.5 managed to meet the qualifying criteria, but with a substantially lower score (145/700, 20.7%). However, Bard (115/700, 16.4%) failed to meet the qualifying criteria and did not pass the test. GPT-4 demonstrated consistent superiority over Bard and GPT-3.5 in all 3 subjects. Specifically, GPT-4 achieved accuracy rates of 73% (29/40) in physics, 44% (16/36) in chemistry, and 51% (50/99) in biology. Conversely, GPT-3.5 attained an accuracy rate of 45% (18/40) in physics, 33% (13/26) in chemistry, and 34% (34/99) in biology. The accuracy consensus metric showed that the matching responses between GPT-4 and Bard, as well as GPT-4 and GPT-3.5, had higher incidences of being correct, at 0.56 and 0.57, respectively, compared to the matching responses between Bard and GPT-3.5, which stood at 0.42. When all 3 models were considered together, their matching responses reached the highest accuracy consensus of 0.59. Conclusions: The study’s findings provide valuable insights into the performance of GPT-3.5, GPT-4, and Bard in answering NEET-2023 questions. GPT-4 emerged as the most accurate model, highlighting its potential for educational applications. Cross-checking responses across models may result in confusion as the compared models (as duos or a trio) tend to agree on only a little over half of the correct responses. Using GPT-4 as one of the compared models will result in higher accuracy consensus. The results underscore the suitability of LLMs for high-stakes exams and their positive impact on education. Additionally, the study establishes a benchmark for evaluating and enhancing LLMs’ performance in educational tasks, promoting responsible and informed use of these models in diverse learning environments. 
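The accuracy consensus metric reported above — the share of correct answers among questions on which two models agree — can be computed in a few lines. The sketch below is a toy illustration; the answer strings are placeholders, not NEET-2023 responses.

```python
# Sketch of the "accuracy consensus" idea described above: among questions where two
# models give the same answer, what fraction of those shared answers is correct?
def accuracy_consensus(answers_a, answers_b, key):
    matched = [(a, k) for a, b, k in zip(answers_a, answers_b, key) if a == b]
    if not matched:
        return 0.0
    return sum(a == k for a, k in matched) / len(matched)

gpt4 = ["A", "C", "B", "D", "A", "B"]   # toy answer sheets
bard = ["A", "C", "D", "D", "B", "B"]
key  = ["A", "B", "D", "D", "A", "B"]   # official answer key

print(accuracy_consensus(gpt4, bard, key))  # agreement-conditional accuracy (0.75 here)
```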
%M 38381486 %R 10.2196/51523 %U https://mededu.jmir.org/2024/1/e51523 %U https://doi.org/10.2196/51523 %U http://www.ncbi.nlm.nih.gov/pubmed/38381486 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e52164 %T Human-Written vs AI-Generated Texts in Orthopedic Academic Literature: Comparative Qualitative Analysis %A Hakam,Hassan Tarek %A Prill,Robert %A Korte,Lisa %A Lovreković,Bruno %A Ostojić,Marko %A Ramadanov,Nikolai %A Muehlensiepen,Felix %+ Center of Orthopaedics and Trauma Surgery, University Clinic of Brandenburg, Brandenburg Medical School, Hochstr 29, Brandenburg an der Havel, 14770, Germany, 49 03381 411940, hassantarek.hakam@mhb-fontane.de %K artificial intelligence %K AI %K large language model %K LLM %K research %K orthopedic surgery %K sports medicine %K orthopedics %K surgery %K orthopedic %K qualitative study %K medical database %K feedback %K detection %K tool %K scientific integrity %K study design %D 2024 %7 16.2.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: As large language models (LLMs) are becoming increasingly integrated into different aspects of health care, questions about the implications for medical academic literature have begun to emerge. Key aspects such as authenticity in academic writing are at stake with artificial intelligence (AI) generating highly linguistically accurate and grammatically sound texts. Objective: The objective of this study is to compare human-written with AI-generated scientific literature in orthopedics and sports medicine. Methods: Five original abstracts were selected from the PubMed database. These abstracts were subsequently rewritten with the assistance of 2 LLMs with different degrees of proficiency. Subsequently, researchers with varying degrees of expertise and with different areas of specialization were asked to rank the abstracts according to linguistic and methodological parameters. Finally, researchers had to classify the articles as AI generated or human written. Results: Neither the researchers nor the AI-detection software could successfully identify the AI-generated texts. Furthermore, the criteria previously suggested in the literature did not correlate with whether the researchers deemed a text to be AI generated or whether they judged the article correctly based on these parameters. Conclusions: The primary finding of this study was that researchers were unable to distinguish between LLM-generated and human-written texts. However, due to the small sample size, it is not possible to generalize the results of this study. As is the case with any tool used in academic research, the potential to cause harm can be mitigated by relying on the transparency and integrity of the researchers. With scientific integrity at stake, further research with a similar study design should be conducted to determine the magnitude of this issue. 
%M 38363631 %R 10.2196/52164 %U https://formative.jmir.org/2024/1/e52164 %U https://doi.org/10.2196/52164 %U http://www.ncbi.nlm.nih.gov/pubmed/38363631 %0 Journal Article %@ 1929-073X %I JMIR Publications %V 13 %N %P e54704 %T A Preliminary Checklist (METRICS) to Standardize the Design and Reporting of Studies on Generative Artificial Intelligence–Based Models in Health Care Education and Practice: Development Study Involving a Literature Review %A Sallam,Malik %A Barakat,Muna %A Sallam,Mohammed %+ Department of Pathology, Microbiology and Forensic Medicine, School of Medicine, The University of Jordan, Queen Rania Al-Abdullah Street-Aljubeiha, Amman, 11942, Jordan, 962 0791845186, malik.sallam@ju.edu.jo %K guidelines %K evaluation %K meaningful analytics %K large language models %K decision support %D 2024 %7 15.2.2024 %9 Original Paper %J Interact J Med Res %G English %X Background: Adherence to evidence-based practice is indispensable in health care. Recently, the utility of generative artificial intelligence (AI) models in health care has been evaluated extensively. However, the lack of consensus guidelines on the design and reporting of findings of these studies poses a challenge for the interpretation and synthesis of evidence. Objective: This study aimed to develop a preliminary checklist to standardize the reporting of generative AI-based studies in health care education and practice. Methods: A literature review was conducted in Scopus, PubMed, and Google Scholar. Published records with “ChatGPT,” “Bing,” or “Bard” in the title were retrieved. Careful examination of the methodologies employed in the included records was conducted to identify the common pertinent themes and the possible gaps in reporting. A panel discussion was held to establish a unified and thorough checklist for the reporting of AI studies in health care. The finalized checklist was used to evaluate the included records by 2 independent raters. Cohen κ was used as the method to evaluate the interrater reliability. Results: The final data set that formed the basis for pertinent theme identification and analysis comprised a total of 34 records. The finalized checklist included 9 pertinent themes collectively referred to as METRICS (Model, Evaluation, Timing, Range/Randomization, Individual factors, Count, and Specificity of prompts and language). Their details are as follows: (1) Model used and its exact settings; (2) Evaluation approach for the generated content; (3) Timing of testing the model; (4) Transparency of the data source; (5) Range of tested topics; (6) Randomization of selecting the queries; (7) Individual factors in selecting the queries and interrater reliability; (8) Count of queries executed to test the model; and (9) Specificity of the prompts and language used. The overall mean METRICS score was 3.0 (SD 0.58). The tested METRICS score was acceptable, with the range of Cohen κ of 0.558 to 0.962 (P<.001 for the 9 tested items). With classification per item, the highest average METRICS score was recorded for the “Model” item, followed by the “Specificity” item, while the lowest scores were recorded for the “Randomization” item (classified as suboptimal) and “Individual factors” item (classified as satisfactory). Conclusions: The METRICS checklist can facilitate the design of studies guiding researchers toward best practices in reporting results. 
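The interrater reliability analysis described above can be reproduced in outline with standard tooling. The sketch below computes Cohen κ (and an ordinal, linearly weighted variant) for two raters' scores on a single checklist item; the scores are invented placeholders, and the use of scikit-learn is an assumption rather than the authors' stated software.

```python
# Sketch: interrater reliability for checklist item ratings from 2 independent raters,
# as in the Cohen kappa analysis described above. Ratings below are invented placeholders.
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-record scores assigned by two raters for one METRICS item
rater_1 = [5, 4, 3, 5, 2, 4, 4, 3, 5, 1]
rater_2 = [5, 4, 3, 4, 2, 4, 5, 3, 5, 1]

print(cohen_kappa_score(rater_1, rater_2))                    # unweighted kappa
print(cohen_kappa_score(rater_1, rater_2, weights="linear"))  # ordinal-aware variant
```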
The findings highlight the need for standardized reporting algorithms for generative AI-based studies in health care, considering the variability observed in methodologies and reporting. The proposed METRICS checklist could be a preliminary helpful base to establish a universally accepted approach to standardize the design and reporting of generative AI-based studies in health care, which is a swiftly evolving research topic. %M 38276872 %R 10.2196/54704 %U https://www.i-jmr.org/2024/1/e54704 %U https://doi.org/10.2196/54704 %U http://www.ncbi.nlm.nih.gov/pubmed/38276872 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 13 %N %P e54349 %T Implementation of Chatbot Technology in Health Care: Protocol for a Bibliometric Analysis %A Ni,Zhao %A Peng,Mary L %A Balakrishnan,Vimala %A Tee,Vincent %A Azwa,Iskandar %A Saifi,Rumana %A Nelson,LaRon E %A Vlahov,David %A Altice,Frederick L %+ School of Nursing, Yale University, 400 West Campus Drive, Orange, CT, 06477, United States, 1 2037373039, zhao.ni@yale.edu %K artificial intelligence %K AI %K bibliometric analysis %K chatbots %K health care %K health promotion %D 2024 %7 15.2.2024 %9 Protocol %J JMIR Res Protoc %G English %X Background: Chatbots have the potential to increase people’s access to quality health care. However, the implementation of chatbot technology in the health care system is unclear due to the scarce analysis of publications on the adoption of chatbot in health and medical settings. Objective: This paper presents a protocol of a bibliometric analysis aimed at offering the public insights into the current state and emerging trends in research related to the use of chatbot technology for promoting health. Methods: In this bibliometric analysis, we will select published papers from the databases of CINAHL, IEEE Xplore, PubMed, Scopus, and Web of Science that pertain to chatbot technology and its applications in health care. Our search strategy includes keywords such as “chatbot,” “virtual agent,” “virtual assistant,” “conversational agent,” “conversational AI,” “interactive agent,” “health,” and “healthcare.” Five researchers who are AI engineers and clinicians will independently review the titles and abstracts of selected papers to determine their eligibility for a full-text review. The corresponding author (ZN) will serve as a mediator to address any discrepancies and disputes among the 5 reviewers. Our analysis will encompass various publication patterns of chatbot research, including the number of annual publications, their geographic or institutional distribution, and the number of annual grants supporting chatbot research, and further summarize the methodologies used in the development of health-related chatbots, along with their features and applications in health care settings. Software tool VOSViewer (version 1.6.19; Leiden University) will be used to construct and visualize bibliometric networks. Results: The preparation for the bibliometric analysis began on December 3, 2021, when the research team started the process of familiarizing themselves with the software tools that may be used in this analysis, VOSViewer and CiteSpace, during which they consulted 3 librarians at the Yale University regarding search terms and tentative results. Tentative searches on the aforementioned databases yielded a total of 2340 papers. The official search phase started on July 27, 2023. Our goal is to complete the screening of papers and the analysis by February 15, 2024. 
Conclusions: Artificial intelligence chatbots, such as ChatGPT (OpenAI Inc), have sparked numerous discussions within the health care industry regarding their impact on human health. Chatbot technology holds substantial promise for advancing health care systems worldwide. However, developing a sophisticated chatbot capable of precise interaction with health care consumers, delivering personalized care, and providing accurate health-related information and knowledge remain considerable challenges. This bibliometric analysis seeks to fill the knowledge gap in the existing literature on health-related chatbots, entailing their applications, the software used in their development, and their preferred functionalities among users. International Registered Report Identifier (IRRID): PRR1-10.2196/54349 %M 38228575 %R 10.2196/54349 %U https://www.researchprotocols.org/2024/1/e54349 %U https://doi.org/10.2196/54349 %U http://www.ncbi.nlm.nih.gov/pubmed/38228575 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51391 %T Learning to Make Rare and Complex Diagnoses With Generative AI Assistance: Qualitative Study of Popular Large Language Models %A Abdullahi,Tassallah %A Singh,Ritambhara %A Eickhoff,Carsten %+ School of Medicine, University of Tübingen, Schaffhausenstr, 77, Tübingen, 72072, Germany, 49 7071 29 843, carsten.eickhoff@uni-tuebingen.de %K clinical decision support %K rare diseases %K complex diseases %K prompt engineering %K reliability %K consistency %K natural language processing %K language model %K Bard %K ChatGPT 3.5 %K GPT-4 %K MedAlpaca %K medical education %K complex diagnosis %K artificial intelligence %K AI assistance %K medical training %K prediction model %D 2024 %7 13.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Patients with rare and complex diseases often experience delayed diagnoses and misdiagnoses because comprehensive knowledge about these diseases is limited to only a few medical experts. In this context, large language models (LLMs) have emerged as powerful knowledge aggregation tools with applications in clinical decision support and education domains. Objective: This study aims to explore the potential of 3 popular LLMs, namely Bard (Google LLC), ChatGPT-3.5 (OpenAI), and GPT-4 (OpenAI), in medical education to enhance the diagnosis of rare and complex diseases while investigating the impact of prompt engineering on their performance. Methods: We conducted experiments on publicly available complex and rare cases to achieve these objectives. We implemented various prompt strategies to evaluate the performance of these models using both open-ended and multiple-choice prompts. In addition, we used a majority voting strategy to leverage diverse reasoning paths within language models, aiming to enhance their reliability. Furthermore, we compared their performance with the performance of human respondents and MedAlpaca, a generative LLM specifically designed for medical tasks. Results: Notably, all LLMs outperformed the average human consensus and MedAlpaca, with a minimum margin of 5% and 13%, respectively, across all 30 cases from the diagnostic case challenge collection. On the frequently misdiagnosed cases category, Bard tied with MedAlpaca but surpassed the human average consensus by 14%, whereas GPT-4 and ChatGPT-3.5 outperformed MedAlpaca and the human respondents on the moderately often misdiagnosed cases category with minimum accuracy scores of 28% and 11%, respectively. 
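The majority-voting strategy mentioned in the Methods above amounts to sampling the model several times on the same case and keeping the most frequent diagnosis. The sketch below is a minimal illustration; ask_model is a hypothetical stand-in for a call to any of the evaluated LLMs, and the case text and diagnoses are placeholders.

```python
# Sketch of majority voting ("self-consistency") over repeated LLM samples:
# query the model several times and keep the most frequent diagnosis.
from collections import Counter
import random

def ask_model(case_text: str) -> str:
    # Placeholder: in practice this would be an API call returning a diagnosis string.
    return random.choice(["sarcoidosis", "sarcoidosis", "tuberculosis"])

def majority_vote_diagnosis(case_text: str, n_samples: int = 5) -> str:
    votes = Counter(ask_model(case_text) for _ in range(n_samples))
    diagnosis, _count = votes.most_common(1)[0]
    return diagnosis

print(majority_vote_diagnosis("45-year-old with bilateral hilar lymphadenopathy ..."))
```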
The majority voting strategy, particularly with GPT-4, demonstrated the highest overall score across all cases from the diagnostic complex case collection, surpassing that of other LLMs. On the Medical Information Mart for Intensive Care-III data sets, Bard and GPT-4 achieved the highest diagnostic accuracy scores, with multiple-choice prompts scoring 93%, whereas ChatGPT-3.5 and MedAlpaca scored 73% and 47%, respectively. Furthermore, our results demonstrate that there is no one-size-fits-all prompting approach for improving the performance of LLMs and that a single strategy does not universally apply to all LLMs. Conclusions: Our findings shed light on the diagnostic capabilities of LLMs and the challenges associated with identifying an optimal prompting strategy that aligns with each language model’s characteristics and specific task requirements. The significance of prompt engineering is highlighted, providing valuable insights for researchers and practitioners who use these language models for medical training. Furthermore, this study represents a crucial step toward understanding how LLMs can enhance diagnostic reasoning in rare and complex medical cases, paving the way for developing effective educational tools and accurate diagnostic aids to improve patient care and outcomes. %M 38349725 %R 10.2196/51391 %U https://mededu.jmir.org/2024/1/e51391 %U https://doi.org/10.2196/51391 %U http://www.ncbi.nlm.nih.gov/pubmed/38349725 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e48949 %T Cocreating an Automated mHealth Apps Systematic Review Process With Generative AI: Design Science Research Approach %A Giunti,Guido %A Doherty,Colin P %+ Academic Unit of Neurology, School of Medicine, Trinity College Dublin, College Green, Dublin, D02, Ireland, 353 1 896 1000, drguidogiunti@gmail.com %K generative artificial intelligence %K mHealth %K ChatGPT %K evidence-base %K apps %K qualitative study %K design science research %K eHealth %K mobile device %K AI %K language model %K mHealth intervention %K generative AI %K AI tool %K software code %K systematic review %K language model %D 2024 %7 12.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The use of mobile devices for delivering health-related services (mobile health [mHealth]) has rapidly increased, leading to a demand for summarizing the state of the art and practice through systematic reviews. However, the systematic review process is a resource-intensive and time-consuming process. Generative artificial intelligence (AI) has emerged as a potential solution to automate tedious tasks. Objective: This study aimed to explore the feasibility of using generative AI tools to automate time-consuming and resource-intensive tasks in a systematic review process and assess the scope and limitations of using such tools. Methods: We used the design science research methodology. The solution proposed is to use cocreation with a generative AI, such as ChatGPT, to produce software code that automates the process of conducting systematic reviews. Results: A triggering prompt was generated, and assistance from the generative AI was used to guide the steps toward developing, executing, and debugging a Python script. Errors in code were solved through conversational exchange with ChatGPT, and a tentative script was created. The code pulled the mHealth solutions from the Google Play Store and searched their descriptions for keywords that hinted toward evidence base. 
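The script described above — pulling mHealth app listings from the Google Play Store, scanning descriptions for evidence-related keywords, and writing the results out — might look roughly like the sketch below. It assumes the third-party google-play-scraper package and a pre-collected list of app IDs; the IDs and keyword list are illustrative, not the study's actual search terms or generated code.

```python
# Sketch of the kind of script described above: pull app descriptions, flag
# evidence-related keywords, and write the result to CSV.
import csv
from google_play_scraper import app  # pip install google-play-scraper (assumed dependency)

APP_IDS = ["com.example.mhealth.one", "com.example.mhealth.two"]  # hypothetical IDs
KEYWORDS = ["evidence", "clinical trial", "randomized", "validated", "peer-reviewed"]

with open("mhealth_keyword_scan.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["app_id", "title", "keywords_found"])
    for app_id in APP_IDS:
        details = app(app_id)                         # fetches store listing metadata
        description = details["description"].lower()
        hits = [kw for kw in KEYWORDS if kw in description]
        writer.writerow([app_id, details["title"], "; ".join(hits)])
```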
The results were exported to a CSV file, which was compared to the initial outputs of other similar systematic review processes. Conclusions: This study demonstrates the potential of using generative AI to automate the time-consuming process of conducting systematic reviews of mHealth apps. This approach could be particularly useful for researchers with limited coding skills. However, the study has limitations related to the design science research methodology, subjectivity bias, and the quality of the search results used to train the language model. %M 38345839 %R 10.2196/48949 %U https://mededu.jmir.org/2024/1/e48949 %U https://doi.org/10.2196/48949 %U http://www.ncbi.nlm.nih.gov/pubmed/38345839 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e48514 %T Performance of ChatGPT on the Chinese Postgraduate Examination for Clinical Medicine: Survey Study %A Yu,Peng %A Fang,Changchang %A Liu,Xiaolin %A Fu,Wanying %A Ling,Jitao %A Yan,Zhiwei %A Jiang,Yuan %A Cao,Zhengyu %A Wu,Maoxiong %A Chen,Zhiteng %A Zhu,Wengen %A Zhang,Yuling %A Abudukeremu,Ayiguli %A Wang,Yue %A Liu,Xiao %A Wang,Jingfeng %+ Department of Cardiology, Sun Yat-sen Memorial Hospital of Sun Yat-sen University, 107 Yanjiang West Road, Guangzhou, China, 86 15083827378, liux587@mail.sysu.edu.cn %K ChatGPT %K Chinese Postgraduate Examination for Clinical Medicine %K medical student %K performance %K artificial intelligence %K medical care %K qualitative feedback %K medical education %K clinical decision-making %D 2024 %7 9.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT, an artificial intelligence (AI) based on large-scale language models, has sparked interest in the field of health care. Nonetheless, the capabilities of AI in text comprehension and generation are constrained by the quality and volume of available training data for a specific language, and the performance of AI across different languages requires further investigation. While AI harbors substantial potential in medicine, it is imperative to tackle challenges such as the formulation of clinical care standards; facilitating cultural transitions in medical education and practice; and managing ethical issues including data privacy, consent, and bias. Objective: The study aimed to evaluate ChatGPT’s performance in processing Chinese Postgraduate Examination for Clinical Medicine questions, assess its clinical reasoning ability, investigate potential limitations with the Chinese language, and explore its potential as a valuable tool for medical professionals in the Chinese context. Methods: A data set of Chinese Postgraduate Examination for Clinical Medicine questions was used to assess the effectiveness of ChatGPT’s (version 3.5) medical knowledge in the Chinese language, which has a data set of 165 medical questions that were divided into three categories: (1) common questions (n=90) assessing basic medical knowledge, (2) case analysis questions (n=45) focusing on clinical decision-making through patient case evaluations, and (3) multichoice questions (n=30) requiring the selection of multiple correct answers. First of all, we assessed whether ChatGPT could meet the stringent cutoff score defined by the government agency, which requires a performance within the top 20% of candidates. Additionally, in our evaluation of ChatGPT’s performance on both original and encoded medical questions, 3 primary indicators were used: accuracy, concordance (which validates the answer), and the frequency of insights. 
Results: Our evaluation revealed that ChatGPT scored 153.5 out of 300 for original questions in Chinese, which signifies the minimum score set to ensure that at least 20% more candidates pass than the enrollment quota. However, ChatGPT had low accuracy in answering open-ended medical questions, with only 31.5% total accuracy. The accuracy for common questions, multichoice questions, and case analysis questions was 42%, 37%, and 17%, respectively. ChatGPT achieved a 90% concordance across all questions. Among correct responses, the concordance was 100%, significantly exceeding that of incorrect responses (n=57, 50%; P<.001). ChatGPT provided innovative insights for 80% (n=132) of all questions, with an average of 2.95 insights per accurate response. Conclusions: Although ChatGPT surpassed the passing threshold for the Chinese Postgraduate Examination for Clinical Medicine, its performance in answering open-ended medical questions was suboptimal. Nonetheless, ChatGPT exhibited high internal concordance and the ability to generate multiple insights in the Chinese language. Future research should investigate the language-based discrepancies in ChatGPT’s performance within the health care context. %M 38335017 %R 10.2196/48514 %U https://mededu.jmir.org/2024/1/e48514 %U https://doi.org/10.2196/48514 %U http://www.ncbi.nlm.nih.gov/pubmed/38335017 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50965 %T Comparison of the Performance of GPT-3.5 and GPT-4 With That of Medical Students on the Written German Medical Licensing Examination: Observational Study %A Meyer,Annika %A Riese,Janik %A Streichert,Thomas %+ Institute for Clinical Chemistry, University Hospital Cologne, Kerpener Str 62, Cologne, 50937, Germany, annika.meyer1@uk-koeln.de %K ChatGPT %K artificial intelligence %K large language model %K medical exams %K medical examinations %K medical education %K LLM %K public trust %K trust %K medical accuracy %K licensing exam %K licensing examination %K improvement %K patient care %K general population %K licensure examination %D 2024 %7 8.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The potential of artificial intelligence (AI)–based large language models, such as ChatGPT, has gained significant attention in the medical field. This enthusiasm is driven not only by recent breakthroughs and improved accessibility, but also by the prospect of democratizing medical knowledge and promoting equitable health care. However, the performance of ChatGPT is substantially influenced by the input language, and given the growing public trust in this AI tool compared to that in traditional sources of information, investigating its medical accuracy across different languages is of particular importance. Objective: This study aimed to compare the performance of GPT-3.5 and GPT-4 with that of medical students on the written German medical licensing examination. Methods: To assess GPT-3.5’s and GPT-4's medical proficiency, we used 937 original multiple-choice questions from 3 written German medical licensing examinations in October 2021, April 2022, and October 2022. Results: GPT-4 achieved an average score of 85% and ranked in the 92.8th, 99.5th, and 92.6th percentiles among medical students who took the same examinations in October 2021, April 2022, and October 2022, respectively. This represents a substantial improvement of 27% compared to GPT-3.5, which only passed 1 out of the 3 examinations. 
While GPT-3.5 performed well in psychiatry questions, GPT-4 exhibited strengths in internal medicine and surgery but showed weakness in academic research. Conclusions: The study results highlight ChatGPT’s remarkable improvement from moderate (GPT-3.5) to high competency (GPT-4) in answering medical licensing examination questions in German. While GPT-4’s predecessor (GPT-3.5) was imprecise and inconsistent, it demonstrates considerable potential to improve medical education and patient care, provided that medically trained users critically evaluate its results. As the replacement of search engines by AI tools seems possible in the future, further studies with nonprofessional questions are needed to assess the safety and accuracy of ChatGPT for the general population. %M 38329802 %R 10.2196/50965 %U https://mededu.jmir.org/2024/1/e50965 %U https://doi.org/10.2196/50965 %U http://www.ncbi.nlm.nih.gov/pubmed/38329802 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e53216 %T Investigating the Impact of Prompt Engineering on the Performance of Large Language Models for Standardizing Obstetric Diagnosis Text: Comparative Study %A Wang,Lei %A Bi,Wenshuai %A Zhao,Suling %A Ma,Yinyao %A Lv,Longting %A Meng,Chenwei %A Fu,Jingru %A Lv,Hanlin %+ BGI Research, Building 11, Beishan Industrial Zone, Yantian District, Shenzhen, 518083, China, 86 18707190886, lvhanlin@genomics.cn %K obstetric data %K similarity embedding %K term standardization %K large language models %K LLMs %D 2024 %7 8.2.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: The accumulation of vast electronic medical records (EMRs) through medical informatization creates significant research value, particularly in obstetrics. Diagnostic standardization across different health care institutions and regions is vital for medical data analysis. Large language models (LLMs) have been extensively used for various medical tasks. Prompt engineering is key to use LLMs effectively. Objective: This study aims to evaluate and compare the performance of LLMs with various prompt engineering techniques on the task of standardizing obstetric diagnostic terminology using real-world obstetric data. Methods: The paper describes a 4-step approach used for mapping diagnoses in electronic medical records to the International Classification of Diseases, 10th revision, observation domain. First, similarity measures were used for mapping the diagnoses. Second, candidate mapping terms were collected based on similarity scores above a threshold, to be used as the training data set. For generating optimal mapping terms, we used two LLMs (ChatGLM2 and Qwen-14B-Chat [QWEN]) for zero-shot learning in step 3. Finally, a performance comparison was conducted by using 3 pretrained bidirectional encoder representations from transformers (BERTs), including BERT, whole word masking BERT, and momentum contrastive learning with BERT (MC-BERT), for unsupervised optimal mapping term generation in the fourth step. Results: LLMs and BERT demonstrated comparable performance at their respective optimal levels. LLMs showed clear advantages in terms of performance and efficiency in unsupervised settings. Interestingly, the performance of the LLMs varied significantly across different prompt engineering setups. For instance, when applying the self-consistency approach in QWEN, the F1-score improved by 5%, with precision increasing by 7.9%, outperforming the zero-shot method. Likewise, ChatGLM2 delivered similar rates of accurately generated responses. 
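Steps 1 and 2 of the four-step mapping approach described above — similarity scoring followed by threshold-based candidate collection — can be sketched with an off-the-shelf sentence-embedding model. The model choice, threshold, and example diagnosis strings below are illustrative assumptions, not the study's actual configuration.

```python
# Sketch of steps 1-2 of the mapping pipeline: embed raw diagnosis strings and
# candidate standard terms, then keep candidates above a similarity threshold.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")  # illustrative choice

raw_diagnoses = ["G1P0 gestational diabetes on diet control", "PIH, mild"]
standard_terms = [
    "Gestational diabetes mellitus",
    "Pre-eclampsia",
    "Gestational hypertension",
    "Placenta previa",
]

raw_emb = model.encode(raw_diagnoses, convert_to_tensor=True)
std_emb = model.encode(standard_terms, convert_to_tensor=True)
scores = util.cos_sim(raw_emb, std_emb)            # pairwise cosine similarities

THRESHOLD = 0.5                                    # illustrative cutoff
for i, raw in enumerate(raw_diagnoses):
    candidates = [
        (standard_terms[j], float(scores[i][j]))
        for j in range(len(standard_terms))
        if scores[i][j] >= THRESHOLD
    ]
    print(raw, "->", sorted(candidates, key=lambda x: -x[1]))
```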
During the analysis, the BERT series served as a comparative model with comparable results. Among the 3 models, MC-BERT demonstrated the highest level of performance. However, the differences among the versions of BERT in this study were relatively insignificant. Conclusions: After applying LLMs to standardize diagnoses and designing 4 different prompts, we compared the results to those generated by the BERT model. Our findings indicate that QWEN prompts largely outperformed the other prompts, with precision comparable to that of the BERT model. These results demonstrate the potential of unsupervised approaches in improving the efficiency of aligning diagnostic terms in daily research and uncovering hidden information values in patient data. %M 38329787 %R 10.2196/53216 %U https://formative.jmir.org/2024/1/e53216 %U https://doi.org/10.2196/53216 %U http://www.ncbi.nlm.nih.gov/pubmed/38329787 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e54369 %T Capacity of Generative AI to Interpret Human Emotions From Visual and Textual Data: Pilot Evaluation Study %A Elyoseph,Zohar %A Refoua,Elad %A Asraf,Kfir %A Lvovsky,Maya %A Shimoni,Yoav %A Hadar-Shoval,Dorit %+ Imperial College London, Fulham Palace Road, London, W6 8RF, United Kingdom, 44 547836088, zohar.j.a@gmail.com %K Reading the Mind in the Eyes Test %K RMET %K emotional awareness %K emotional comprehension %K emotional cue %K emotional cues %K ChatGPT %K large language model %K LLM %K large language models %K LLMs %K empathy %K mentalizing %K mentalization %K machine learning %K artificial intelligence %K AI %K algorithm %K algorithms %K predictive model %K predictive models %K predictive analytics %K predictive system %K practical model %K practical models %K early warning %K early detection %K mental health %K mental disease %K mental illness %K mental illnesses %K mental diseases %D 2024 %7 6.2.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Mentalization, which is integral to human cognitive processes, pertains to the interpretation of one’s own and others’ mental states, including emotions, beliefs, and intentions. With the advent of artificial intelligence (AI) and the prominence of large language models in mental health applications, questions persist about their aptitude in emotional comprehension. The prior iteration of the large language model from OpenAI, ChatGPT-3.5, demonstrated an advanced capacity to interpret emotions from textual data, surpassing human benchmarks. Given the introduction of ChatGPT-4, with its enhanced visual processing capabilities, and considering Google Bard’s existing visual functionalities, a rigorous assessment of their proficiency in visual mentalizing is warranted. Objective: The aim of the research was to critically evaluate the capabilities of ChatGPT-4 and Google Bard with regard to their competence in discerning visual mentalizing indicators as contrasted with their textual-based mentalizing abilities. Methods: The Reading the Mind in the Eyes Test developed by Baron-Cohen and colleagues was used to assess the models’ proficiency in interpreting visual emotional indicators. Simultaneously, the Levels of Emotional Awareness Scale was used to evaluate the large language models’ aptitude in textual mentalizing. Collating data from both tests provided a holistic view of the mentalizing capabilities of ChatGPT-4 and Bard. 
Results: ChatGPT-4, displaying a pronounced ability in emotion recognition, secured scores of 26 and 27 in 2 distinct evaluations, significantly deviating from a random response paradigm (P<.001). These scores align with established benchmarks from the broader human demographic. Notably, ChatGPT-4 exhibited consistent responses, with no discernible biases pertaining to the sex of the model or the nature of the emotion. In contrast, Google Bard’s performance aligned with random response patterns, securing scores of 10 and 12 and rendering further detailed analysis redundant. In the domain of textual analysis, both ChatGPT and Bard surpassed established benchmarks from the general population, with their performances being remarkably congruent. Conclusions: ChatGPT-4 proved its efficacy in the domain of visual mentalizing, aligning closely with human performance standards. Although both models displayed commendable acumen in textual emotion interpretation, Bard’s capabilities in visual emotion interpretation necessitate further scrutiny and potential refinement. This study stresses the criticality of ethical AI development for emotional recognition, highlighting the need for inclusive data, collaboration with patients and mental health experts, and stringent governmental oversight to ensure transparency and protect patient privacy. %M 38319707 %R 10.2196/54369 %U https://mental.jmir.org/2024/1/e54369 %U https://doi.org/10.2196/54369 %U http://www.ncbi.nlm.nih.gov/pubmed/38319707 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50705 %T Increasing Realism and Variety of Virtual Patient Dialogues for Prenatal Counseling Education Through a Novel Application of ChatGPT: Exploratory Observational Study %A Gray,Megan %A Baird,Austin %A Sawyer,Taylor %A James,Jasmine %A DeBroux,Thea %A Bartlett,Michelle %A Krick,Jeanne %A Umoren,Rachel %+ Division of Neonatology, University of Washington, M/S FA.2.113, 4800 Sand Point Way, Seattle, WA, 98105, United States, 1 206 919 5476, graym1@uw.edu %K prenatal counseling %K virtual health %K virtual patient %K simulation %K neonatology %K ChatGPT %K AI %K artificial intelligence %D 2024 %7 1.2.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Using virtual patients, facilitated by natural language processing, provides a valuable educational experience for learners. Generating a large, varied sample of realistic and appropriate responses for virtual patients is challenging. Artificial intelligence (AI) programs can be a viable source for these responses, but their utility for this purpose has not been explored. Objective: In this study, we explored the effectiveness of generative AI (ChatGPT) in developing realistic virtual standardized patient dialogues to teach prenatal counseling skills. Methods: ChatGPT was prompted to generate a list of common areas of concern and questions that families expecting preterm delivery at 24 weeks gestation might ask during prenatal counseling. ChatGPT was then prompted to generate 2 role-plays with dialogues between a parent expecting a potential preterm delivery at 24 weeks and their counseling physician using each of the example questions. The prompt was repeated for 2 unique role-plays: one parent was characterized as anxious and the other as having low trust in the medical system. 
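The persona-conditioned prompting described above can be expressed as a simple template. The sketch below reconstructs the general shape of such a prompt for the two parent personas; the wording is an illustrative approximation, not the study's exact prompt.

```python
# Sketch of persona-conditioned role-play prompting: build one prompt per parent
# persona and concern, ready to send to a chat model. Template wording is hypothetical.
PERSONAS = {
    "anxious parent": "a parent who is very anxious and asks many worried questions",
    "low trust": "a parent who has low trust in the medical system and questions advice",
}

def build_roleplay_prompt(persona_key: str, concern: str) -> str:
    persona = PERSONAS[persona_key]
    return (
        "Write a role-play dialogue between a neonatologist giving prenatal counseling "
        "for a possible preterm delivery at 24 weeks gestation and "
        f"{persona}. The parent's main concern is: {concern}. "
        "Label each turn 'Doctor:' or 'Parent:' and keep the dialogue realistic."
    )

print(build_roleplay_prompt("low trust", "Will my baby survive if born this early?"))
```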
Role-play scripts were exported verbatim and independently reviewed by 2 neonatologists with experience in prenatal counseling, using a scale of 1-5 on realism, appropriateness, and utility for virtual standardized patient responses. Results: ChatGPT generated 7 areas of concern, with 35 example questions used to generate role-plays. The 35 role-play transcripts generated 176 unique parent responses (median 5, IQR 4-6, per role-play) with 268 unique sentences. Expert review identified 117 (65%) of the 176 responses as indicating an emotion, either directly or indirectly. Approximately half (98/176, 56%) of the responses had 2 or more sentences, and half (88/176, 50%) included at least 1 question. More than half (104/176, 58%) of the responses from role-played parent characters described a feeling, such as being scared, worried, or concerned. The role-plays of parents with low trust in the medical system generated many unique sentences (n=50). Most of the sentences in the responses were found to be reasonably realistic (214/268, 80%), appropriate for variable prenatal counseling conversation paths (233/268, 87%), and usable without more than a minimal modification in a virtual patient program (169/268, 63%). Conclusions: Generative AI programs, such as ChatGPT, may provide a viable source of training materials to expand virtual patient programs, with careful attention to the concerns and questions of patients and families. Given the potential for unrealistic or inappropriate statements and questions, an expert should review AI chat outputs before deploying them in an educational program. %M 38300696 %R 10.2196/50705 %U https://mededu.jmir.org/2024/1/e50705 %U https://doi.org/10.2196/50705 %U http://www.ncbi.nlm.nih.gov/pubmed/38300696 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51344 %T Evaluation of ChatGPT’s Real-Life Implementation in Undergraduate Dental Education: Mixed Methods Study %A Kavadella,Argyro %A Dias da Silva,Marco Antonio %A Kaklamanos,Eleftherios G %A Stamatopoulos,Vasileios %A Giannakopoulos,Kostis %+ School of Dentistry, European University Cyprus, 6, Diogenes street, Engomi, Nicosia, 2404, Cyprus, 357 22559620, a.kavadella@euc.ac.cy %K ChatGPT %K large language models %K LLM %K natural language processing %K artificial Intelligence %K dental education %K higher education %K learning assignments %K dental students %K AI pedagogy %K dentistry %K university %D 2024 %7 31.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The recent artificial intelligence tool ChatGPT seems to offer a range of benefits in academic education while also raising concerns. Relevant literature encompasses issues of plagiarism and academic dishonesty, as well as pedagogy and educational affordances; yet, no real-life implementation of ChatGPT in the educational process has been reported to our knowledge so far. Objective: This mixed methods study aimed to evaluate the implementation of ChatGPT in the educational process, both quantitatively and qualitatively. Methods: In March 2023, a total of 77 second-year dental students of the European University Cyprus were divided into 2 groups and asked to compose a learning assignment on “Radiation Biology and Radiation Protection in the Dental Office,” working collaboratively in small subgroups, as part of the educational semester program of the Dentomaxillofacial Radiology module. Careful planning ensured a seamless integration of ChatGPT, addressing potential challenges. 
One group searched the internet for scientific resources to perform the task and the other group used ChatGPT for this purpose. Both groups developed a PowerPoint (Microsoft Corp) presentation based on their research and presented it in class. The ChatGPT group students additionally registered all interactions with the language model during the prompting process and evaluated the final outcome; they also answered an open-ended evaluation questionnaire, including questions on their learning experience. Finally, all students undertook a knowledge examination on the topic, and the grades between the 2 groups were compared statistically, whereas the free-text comments of the questionnaires were thematically analyzed. Results: Out of the 77 students, 39 were assigned to the ChatGPT group and 38 to the literature research group. Seventy students undertook the multiple choice question knowledge examination, and examination grades ranged from 5 to 10 on the 0-10 grading scale. The Mann-Whitney U test showed that students of the ChatGPT group performed significantly better (P=.045) than students of the literature research group. The evaluation questionnaires revealed the benefits (human-like interface, immediate response, and wide knowledge base), the limitations (need for rephrasing the prompts to get a relevant answer, general content, false citations, and incapability to provide images or videos), and the prospects (in education, clinical practice, continuing education, and research) of ChatGPT. Conclusions: Students using ChatGPT for their learning assignments performed significantly better in the knowledge examination than their fellow students who used the literature research methodology. Students adapted quickly to the technological environment of the language model, recognized its opportunities and limitations, and used it creatively and efficiently. Implications for practice: the study underscores the adaptability of students to technological innovations including ChatGPT and its potential to enhance educational outcomes. Educators should consider integrating ChatGPT into curriculum design; awareness programs are warranted to educate both students and educators about the limitations of ChatGPT, encouraging critical engagement and responsible use. %M 38111256 %R 10.2196/51344 %U https://mededu.jmir.org/2024/1/e51344 %U https://doi.org/10.2196/51344 %U http://www.ncbi.nlm.nih.gov/pubmed/38111256 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 8 %N %P e51571 %T Investigating the Potential of a Conversational Agent (Phyllis) to Support Adolescent Health and Overcome Barriers to Physical Activity: Co-Design Study %A Moore,Richard %A Al-Tamimi,Abdel-Karim %A Freeman,Elizabeth %+ Sheffield Hallam University, Sport and Physical Activity Research Centre / Advanced Wellbeing Research Centre, Sheffield Hallam University, Olympic Legacy Park, 2 Old Hall Road, Advanced Wellbeing Research Centre, Sheffield, S9 3TU, United Kingdom, 44 7751234185, r.moore@shu.ac.uk %K physical activity %K inactivity %K conversational agent %K CA %K adolescent %K public health %K digital health interventions %K mobile phone %D 2024 %7 31.1.2024 %9 Original Paper %J JMIR Form Res %G English %X Background: Conversational agents (CAs) are a promising solution to support people in improving physical activity (PA) behaviors. However, there is a lack of CAs targeted at adolescents that aim to provide support to overcome barriers to PA. 
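The between-group comparison reported above is a standard Mann-Whitney U test on two independent grade samples. The sketch below shows how such a test is run with SciPy; the grade lists are invented placeholders on the 0-10 scale, not the study's data.

```python
# Sketch: Mann-Whitney U test comparing exam grades of two independent groups.
from scipy.stats import mannwhitneyu

chatgpt_group = [8, 9, 7, 8, 10, 9, 8, 7, 9, 8]      # placeholder grades, 0-10 scale
literature_group = [7, 6, 8, 7, 8, 7, 6, 9, 7, 7]

stat, p_value = mannwhitneyu(chatgpt_group, literature_group, alternative="two-sided")
print(f"U = {stat}, P = {p_value:.3f}")
```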
This study reports the results of the co-design, development, and evaluation of a prototype CA called “Phyllis” to support adolescents in overcoming barriers to PA with the aim of improving PA behaviors. The study presents one of the first theory-driven CAs that use existing research, a theoretical framework, and a behavior change model. Objective: The aim of the study is to use a mixed methods approach to investigate the potential of a CA to support adolescents in overcoming barriers to PA and enhance their confidence and motivation to engage in PA. Methods: The methodology involved co-designing with 8 adolescents to create a relational and persuasive CA with a suitable persona and dialogue. The CA was evaluated to determine its acceptability, usability, and effectiveness, with 46 adolescents participating in the study via a web-based survey. Results: The co-design participants were students aged 11 to 13 years, with a sex distribution of 56% (5/9) female and 44% (4/9) male, representing diverse ethnic backgrounds. Participants reported 37 specific barriers to PA, and the most common barriers included a “lack of confidence,” “fear of failure,” and a “lack of motivation.” The CA’s persona, named “Phyllis,” was co-designed with input from the students, reflecting their preferences for a friendly, understanding, and intelligent personality. Users engaged in 61 conversations with Phyllis and reported a positive user experience, and 73% of them expressed a definite intention to use the fully functional CA in the future, with a net promoter score indicating a high likelihood of recommendation. Phyllis also performed well, being able to recognize a range of different barriers to PA. The CA’s persuasive capacity was evaluated in modules focusing on confidence and motivation, with a significant increase in students’ agreement in feeling confident and motivated to engage in PA after interacting with Phyllis. Adolescents also expect to have a personalized experience and be able to personalize all aspects of the CA. Conclusions: The results showed high acceptability and a positive user experience, indicating the CA’s potential. Promising outcomes were observed, with increasing confidence and motivation for PA. Further research and development are needed to create further interventions to address other barriers to PA and assess long-term behavior change. Addressing concerns regarding bias and privacy is crucial for achieving acceptability in the future. The CA’s potential extends to health care systems and multimodal support, providing valuable insights for designing digital health interventions including tackling global inactivity issues among adolescents. 
%M 38294857 %R 10.2196/51571 %U https://formative.jmir.org/2024/1/e51571 %U https://doi.org/10.2196/51571 %U http://www.ncbi.nlm.nih.gov/pubmed/38294857 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e51069 %T Efficacy of ChatGPT in Cantonese Sentiment Analysis: Comparative Study %A Fu,Ziru %A Hsu,Yu Cheng %A Chan,Christian S %A Lau,Chaak Ming %A Liu,Joyce %A Yip,Paul Siu Fai %+ The Hong Kong Jockey Club Centre for Suicide Research and Prevention, Faculty of Social Sciences, The University of Hong Kong, 2/F, The Hong Kong Jockey Club Building for Interdisciplinary Research, 5 Sassoon Road, Pokfulam, Hong Kong SAR, China (Hong Kong), 852 28315232, sfpyip@hku.hk %K Cantonese %K ChatGPT %K counseling %K natural language processing %K NLP %K sentiment analysis %D 2024 %7 30.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Sentiment analysis is a significant yet difficult task in natural language processing. The linguistic peculiarities of Cantonese, including its high similarity with Standard Chinese, its grammatical and lexical uniqueness, and its colloquialism and multilingualism, make it different from other languages and pose additional challenges to sentiment analysis. Recent advances in models such as ChatGPT offer potential viable solutions. Objective: This study investigated the efficacy of GPT-3.5 and GPT-4 in Cantonese sentiment analysis in the context of web-based counseling and compared their performance with other mainstream methods, including lexicon-based methods and machine learning approaches. Methods: We analyzed transcripts from a web-based, text-based counseling service in Hong Kong, including a total of 131 individual counseling sessions and 6169 messages between counselors and help-seekers. First, a codebook was developed for human annotation. A simple prompt (“Is the sentiment of this Cantonese text positive, neutral, or negative? Respond with the sentiment label only.”) was then given to GPT-3.5 and GPT-4 to label each message’s sentiment. GPT-3.5 and GPT-4’s performance was compared with a lexicon-based method and 3 state-of-the-art models, including linear regression, support vector machines, and long short-term memory neural networks. Results: Our findings revealed ChatGPT’s remarkable accuracy in sentiment classification, with GPT-3.5 and GPT-4, respectively, achieving 92.1% (5682/6169) and 95.3% (5880/6169) accuracy in identifying positive, neutral, and negative sentiment, thereby outperforming the traditional lexicon-based method, which had an accuracy of 37.2% (2295/6169), and the 3 machine learning models, which had accuracies ranging from 66% (4072/6169) to 70.9% (4374/6169). Conclusions: Among many text analysis techniques, ChatGPT demonstrates superior accuracy and emerges as a promising tool for Cantonese sentiment analysis. This study also highlights ChatGPT’s applicability in real-world scenarios, such as monitoring the quality of text-based counseling services and detecting message-level sentiments in vivo. The insights derived from this study pave the way for further exploration into the capabilities of ChatGPT in the context of underresourced languages and specialized domains like psychotherapy and natural language processing. 
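The labeling setup described above — the one-line prompt followed by accuracy scoring against human annotations — can be sketched as follows. Only the prompt wording is taken from the abstract; the client call, model name, and example messages are illustrative assumptions.

```python
# Sketch: label each Cantonese message with the study's simple prompt, then score
# model labels against human codebook annotations. Example data are toy placeholders.
from openai import OpenAI

client = OpenAI()
PROMPT = ("Is the sentiment of this Cantonese text positive, neutral, or negative? "
          "Respond with the sentiment label only.")

def label_sentiment(message: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative; the study evaluated GPT-3.5 and GPT-4
        messages=[{"role": "user", "content": f"{PROMPT}\n\n{message}"}],
    )
    return response.choices[0].message.content.strip().lower()

messages = ["我今日好開心", "我唔知點算", "我覺得好辛苦"]   # toy messages
human_labels = ["positive", "neutral", "negative"]        # toy human annotations
model_labels = [label_sentiment(m) for m in messages]
accuracy = sum(m == h for m, h in zip(model_labels, human_labels)) / len(human_labels)
print(f"accuracy: {accuracy:.1%}")
```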
%M 38289662 %R 10.2196/51069 %U https://www.jmir.org/2024/1/e51069 %U https://doi.org/10.2196/51069 %U http://www.ncbi.nlm.nih.gov/pubmed/38289662 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 12 %N %P e48995 %T BERT-Based Neural Network for Inpatient Fall Detection From Electronic Medical Records: Retrospective Cohort Study %A Cheligeer,Cheligeer %A Wu,Guosong %A Lee,Seungwon %A Pan,Jie %A Southern,Danielle A %A Martin,Elliot A %A Sapiro,Natalie %A Eastwood,Cathy A %A Quan,Hude %A Xu,Yuan %+ Centre for Health Informatics, Cumming School of Medicine, University of Calgary, 3280 Hospital Dr NW, Calgary, AB, T2N 4Z6, Canada, 1 (403) 210 9554, yuxu@ucalgary.ca %K accidental falls %K electronic medical records %K data mining %K machine learning %K patient safety %K natural language processing %K adverse event %D 2024 %7 30.1.2024 %9 Original Paper %J JMIR Med Inform %G English %X Background: Inpatient falls are a substantial concern for health care providers and are associated with negative outcomes for patients. Automated detection of falls using machine learning (ML) algorithms may aid in improving patient safety and reducing the occurrence of falls. Objective: This study aims to develop and evaluate an ML algorithm for inpatient fall detection using multidisciplinary progress record notes and a pretrained Bidirectional Encoder Representation from Transformers (BERT) language model. Methods: A cohort of 4323 adult patients admitted to 3 acute care hospitals in Calgary, Alberta, Canada from 2016 to 2021 were randomly sampled. Trained reviewers determined falls from patient charts, which were linked to electronic medical records and administrative data. The BERT-based language model was pretrained on clinical notes, and a fall detection algorithm was developed based on a neural network binary classification architecture. Results: To address various use scenarios, we developed 3 different Alberta hospital notes-specific BERT models: a high sensitivity model (sensitivity 97.7, IQR 87.7-99.9), a high positive predictive value model (positive predictive value 85.7, IQR 57.2-98.2), and the high F1-score model (F1=64.4). Our proposed method outperformed 3 classical ML algorithms and an International Classification of Diseases code–based algorithm for fall detection, showing its potential for improved performance in diverse clinical settings. Conclusions: The developed algorithm provides an automated and accurate method for inpatient fall detection using multidisciplinary progress record notes and a pretrained BERT language model. This method could be implemented in clinical practice to improve patient safety and reduce the occurrence of falls in hospitals. 
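The note-level classifier described above pairs a BERT-style encoder with a binary classification head. The sketch below shows the inference path for a single progress note using the Hugging Face transformers API; the public bert-base-uncased checkpoint stands in for the study's hospital-notes-specific pretrained and fine-tuned models, so the probabilities it prints are illustrative only.

```python
# Sketch of BERT-based binary fall detection on a single progress note.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

CHECKPOINT = "bert-base-uncased"  # placeholder for a clinically pretrained, fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModelForSequenceClassification.from_pretrained(CHECKPOINT, num_labels=2)
model.eval()

note = "Pt found on floor beside bed at 0300, c/o hip pain, assisted back to bed."
inputs = tokenizer(note, truncation=True, max_length=512, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
probs = torch.softmax(logits, dim=-1).squeeze()
print({"no_fall": float(probs[0]), "fall": float(probs[1])})  # untrained head: illustrative only
```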
%M 38289643 %R 10.2196/48995 %U https://medinform.jmir.org/2024/1/e48995 %U https://doi.org/10.2196/48995 %U http://www.ncbi.nlm.nih.gov/pubmed/38289643 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 11 %N %P e52055 %T Testing the Feasibility and Acceptability of Using an Artificial Intelligence Chatbot to Promote HIV Testing and Pre-Exposure Prophylaxis in Malaysia: Mixed Methods Study %A Cheah,Min Hui %A Gan,Yan Nee %A Altice,Frederick L %A Wickersham,Jeffrey A %A Shrestha,Roman %A Salleh,Nur Afiqah Mohd %A Ng,Kee Seong %A Azwa,Iskandar %A Balakrishnan,Vimala %A Kamarulzaman,Adeeba %A Ni,Zhao %+ School of Nursing, Yale University, 400 West Campus Drive, Orange, CT, 06477, United States, 1 203 737 3039, zhao.ni@yale.edu %K artificial intelligence %K acceptability %K chatbot %K feasibility %K HIV prevention %K HIV testing %K men who have sex with men %K MSM %K mobile health %K mHealth %K preexposure prophylaxis %K PrEP %K mobile phone %D 2024 %7 26.1.2024 %9 Original Paper %J JMIR Hum Factors %G English %X Background: The HIV epidemic continues to grow fastest among men who have sex with men (MSM) in Malaysia in the presence of stigma and discrimination. Engaging MSM on the internet using chatbots supported through artificial intelligence (AI) can potentially help HIV prevention efforts. We previously identified the benefits, limitations, and preferred features of HIV prevention AI chatbots and developed an AI chatbot prototype that is now tested for feasibility and acceptability. Objective: This study aims to test the feasibility and acceptability of an AI chatbot in promoting the uptake of HIV testing and pre-exposure prophylaxis (PrEP) in MSM. Methods: We conducted beta testing with 14 MSM from February to April 2022 using Zoom (Zoom Video Communications, Inc). Beta testing involved 3 steps: a 45-minute human-chatbot interaction using the think-aloud method, a 35-minute semistructured interview, and a 10-minute web-based survey. The first 2 steps were recorded, transcribed verbatim, and analyzed using the Unified Theory of Acceptance and Use of Technology. Emerging themes from the qualitative data were mapped on the 4 domains of the Unified Theory of Acceptance and Use of Technology: performance expectancy, effort expectancy, facilitating conditions, and social influence. Results: Most participants (13/14, 93%) perceived the chatbot to be useful because it provided comprehensive information on HIV testing and PrEP (performance expectancy). All participants indicated that the chatbot was easy to use because of its simple, straightforward design and quick, friendly responses (effort expectancy). Moreover, 93% (13/14) of the participants rated the overall chatbot quality as high, and all participants perceived the chatbot as a helpful tool and would refer it to others. Approximately 79% (11/14) of the participants agreed they would continue using the chatbot. They suggested adding a local language (ie, Bahasa Malaysia) to customize the chatbot to the Malaysian context (facilitating condition) and suggested that the chatbot should also incorporate more information on mental health, HIV risk assessment, and consequences of HIV. In terms of social influence, all participants perceived the chatbot as helpful in avoiding stigma-inducing interactions and thus could increase the frequency of HIV testing and PrEP uptake among MSM. Conclusions: The current AI chatbot is feasible and acceptable to promote the uptake of HIV testing and PrEP. 
To ensure the successful implementation and dissemination of AI chatbots in Malaysia, they should be customized to communicate in Bahasa Malaysia and upgraded to provide other HIV-related information to improve usability, such as mental health support, risk assessment for sexually transmitted infections, AIDS treatment, and the consequences of contracting HIV. %M 38277206 %R 10.2196/52055 %U https://humanfactors.jmir.org/2024/1/e52055 %U https://doi.org/10.2196/52055 %U http://www.ncbi.nlm.nih.gov/pubmed/38277206 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e48443 %T Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study %A Lee,You-Qian %A Chen,Ching-Tai %A Chen,Chien-Chang %A Lee,Chung-Hong %A Chen,Peitsz %A Wu,Chi-Shin %A Dai,Hong-Jie %+ Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, No. 415, Jiangong Road, Sanmin District, Kaohsiung, 80778, Taiwan, 886 73814526 ext 15510, hjdai@nkust.edu.tw %K code mixing %K electronic health record %K deidentification %K pretrained language model %K large language model %K ChatGPT %D 2024 %7 25.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of information is recorded in unstructured textual forms, posing a challenge for deidentification. In multilingual countries, medical records could be written in a mixture of more than one language, referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, and there is a need to address the deidentification of code-mixed text. Objective: The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pretrained language models (PLMs) in identifying PHI in the code-mixed context. Additionally, we aimed to evaluate the potential of prompting large language models (LLMs) for recognizing PHI in a zero-shot manner. Methods: We compiled the first clinical code-mixed deidentification data set consisting of text written in Chinese and English. We explored the effectiveness of fine-tuned PLMs for recognizing PHI in code-mixed content, with a focus on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models’ outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning of LLMs for recognizing PHI in code-mixed text. Results: The developed methods were evaluated on a code-mixed deidentification corpus of 1700 discharge summaries. We observed that different PHI types had preferences in their occurrences within the different types of language-mixed sentences, and PLMs could effectively recognize PHI by exploiting the learned name regularity. However, the models may exhibit suboptimal results when regularity is weak or mentions contain unknown words that the representations cannot generate well. We also found that the availability of code-mixed training instances is essential for the model’s performance. 
Furthermore, the LLM-based deidentification method was a feasible and appealing approach that can be controlled and enhanced through natural language prompts. Conclusions: The study contributes to understanding the underlying mechanism of PLMs in addressing the deidentification process in the code-mixed context and highlights the significance of incorporating code-mixed training instances into the model training phase. To support the advancement of research, we created a manipulated subset of the resynthesized data set available for research purposes. Based on the compiled data set, we found that the LLM-based deidentification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their strength in recognizing rare PHI. %M 38271060 %R 10.2196/48443 %U https://www.jmir.org/2024/1/e48443 %U https://doi.org/10.2196/48443 %U http://www.ncbi.nlm.nih.gov/pubmed/38271060 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 11 %N %P e50150 %T A Comparison of ChatGPT and Fine-Tuned Open Pre-Trained Transformers (OPT) Against Widely Used Sentiment Analysis Tools: Sentiment Analysis of COVID-19 Survey Data %A Lossio-Ventura,Juan Antonio %A Weger,Rachel %A Lee,Angela Y %A Guinee,Emily P %A Chung,Joyce %A Atlas,Lauren %A Linos,Eleni %A Pereira,Francisco %+ National Institute of Mental Health, National Institutes of Health, 3D41, 10 Center Dr, Bethesda, MD, 20814, United States, 1 3018272632, juan.lossio@nih.gov %K sentiment analysis %K COVID-19 survey %K large language model %K few-shot learning %K zero-shot learning %K ChatGPT %K COVID-19 %D 2024 %7 25.1.2024 %9 Original Paper %J JMIR Ment Health %G English %X Background: Health care providers and health-related researchers face significant challenges when applying sentiment analysis tools to health-related free-text survey data. Most state-of-the-art applications were developed in domains such as social media, and their performance in the health care context remains relatively unknown. Moreover, existing studies indicate that these tools often lack accuracy and produce inconsistent results. Objective: This study aims to address the lack of comparative analysis on sentiment analysis tools applied to health-related free-text survey data in the context of COVID-19. The objective was to automatically predict sentence sentiment for 2 independent COVID-19 survey data sets from the National Institutes of Health and Stanford University. Methods: Gold standard labels were created for a subset of each data set using a panel of human raters. We compared 8 state-of-the-art sentiment analysis tools on both data sets to evaluate variability and disagreement across tools. In addition, few-shot learning was explored by fine-tuning Open Pre-Trained Transformers (OPT; a large language model [LLM] with publicly available weights) using a small annotated subset and zero-shot learning using ChatGPT (an LLM without available weights). Results: The comparison of sentiment analysis tools revealed high variability and disagreement across the evaluated tools when applied to health-related survey data. OPT and ChatGPT demonstrated superior performance, outperforming all other sentiment analysis tools. Moreover, ChatGPT outperformed OPT, exhibited higher accuracy by 6% and higher F-measure by 4% to 7%. 
Conclusions: This study demonstrates the effectiveness of LLMs, particularly the few-shot learning and zero-shot learning approaches, in the sentiment analysis of health-related survey data. These results have implications for saving human labor and improving efficiency in sentiment analysis tasks, contributing to advancements in the field of automated sentiment analysis. %M 38271138 %R 10.2196/50150 %U https://mental.jmir.org/2024/1/e50150 %U https://doi.org/10.2196/50150 %U http://www.ncbi.nlm.nih.gov/pubmed/38271138 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e50132 %T Development and Evaluation of a Smartphone-Based Chatbot Coach to Facilitate a Balanced Lifestyle in Individuals With Headaches (BalanceUP App): Randomized Controlled Trial %A Ulrich,Sandra %A Gantenbein,Andreas R %A Zuber,Viktor %A Von Wyl,Agnes %A Kowatsch,Tobias %A Künzli,Hansjörg %+ School of Applied Psychology, Zurich University of Applied Sciences, Pfingstweidstrasse 96, 2, Zurich, 8005, Switzerland, 41 58 934 ext 8451, sandra.ulrich@zhaw.ch %K chatbot %K mobile health %K mHealth %K smartphone %K headache management %K psychoeducation %K behavior change %K stress management %K mental well-being %K lifestyle %K mindfulness %K relaxation %K mobile phone %D 2024 %7 24.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Primary headaches, including migraine and tension-type headaches, are widespread and have a social, physical, mental, and economic impact. Among the key components of treatment are behavior interventions such as lifestyle modification. Scalable conversational agents (CAs) have the potential to deliver behavior interventions at a low threshold. To our knowledge, there is no evidence of behavioral interventions delivered by CAs for the treatment of headaches. Objective: This study has 2 aims. The first aim was to develop and test a smartphone-based coaching intervention (BalanceUP) for people experiencing frequent headaches, delivered by a CA and designed to improve mental well-being using various behavior change techniques. The second aim was to evaluate the effectiveness of BalanceUP by comparing the intervention and waitlist control groups and assess the engagement and acceptance of participants using BalanceUP. Methods: In an unblinded randomized controlled trial, adults with frequent headaches were recruited on the web and in collaboration with experts and allocated to either a CA intervention (BalanceUP) or a control condition. The effects of the treatment on changes in the primary outcome of the study, that is, mental well-being (as measured by the Patient Health Questionnaire Anxiety and Depression Scale), and secondary outcomes (eg, psychosomatic symptoms, stress, headache-related self-efficacy, intention to change behavior, presenteeism and absenteeism, and pain coping) were analyzed using linear mixed models and Cohen d. Primary and secondary outcomes were self-assessed before and after the intervention, and acceptance was assessed after the intervention. Engagement was measured during the intervention using self-reports and usage data. Results: A total of 198 participants (mean age 38.7, SD 12.14 y; n=172, 86.9% women) participated in the study (intervention group: n=110; waitlist control group: n=88). 
After the intervention, the intention-to-treat analysis revealed evidence for improved well-being (treatment: β estimate=–3.28, 95% CI –5.07 to –1.48) with moderate between-group effects (Cohen d=–0.66, 95% CI –0.99 to –0.33) in favor of the intervention group. We also found evidence of reduced somatic symptoms, perceived stress, and absenteeism and presenteeism, as well as improved headache management self-efficacy, application of behavior change techniques, and pain coping skills, with effects ranging from medium to large (Cohen d=0.43-1.05). Overall, 64.8% (118/182) of the participants used coaching as intended by engaging throughout the coaching and completing the outro. Conclusions: BalanceUP was well accepted, and the results suggest that coaching delivered by a CA can be effective in reducing the burden of people who experience headaches by improving their well-being. Trial Registration: German Clinical Trials Register DRKS00017422; https://trialsearch.who.int/Trial2.aspx?TrialID=DRKS00017422 %M 38265863 %R 10.2196/50132 %U https://www.jmir.org/2024/1/e50132 %U https://doi.org/10.2196/50132 %U http://www.ncbi.nlm.nih.gov/pubmed/38265863 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e52113 %T Assessing ChatGPT’s Mastery of Bloom’s Taxonomy Using Psychosomatic Medicine Exam Questions: Mixed-Methods Study %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Holderried,Friederike %A Herschbach,Lea %A Griewatz,Jan %A Masters,Ken %A Zipfel,Stephan %A Mahling,Moritz %+ Tübingen Institute for Medical Education, Faculty of Medicine, University of Tübingen, Elfriede-Aulhorn-Strasse 10, Tübingen, 72076 Tübingen, Germany, 49 7071 29 73715, teresa.festl-wietek@med.uni-tuebingen.de %K answer %K artificial intelligence %K assessment %K Bloom’s taxonomy %K ChatGPT %K classification %K error %K exam %K examination %K generative %K GPT-4 %K Generative Pre-trained Transformer 4 %K language model %K learning outcome %K LLM %K MCQ %K medical education %K medical exam %K multiple-choice question %K natural language processing %K NLP %K psychosomatic %K question %K response %K taxonomy %D 2024 %7 23.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models such as GPT-4 (Generative Pre-trained Transformer 4) are being increasingly used in medicine and medical education. However, these models are prone to “hallucinations” (ie, outputs that seem convincing while being factually incorrect). It is currently unknown how these errors by large language models relate to the different cognitive levels defined in Bloom’s taxonomy. Objective: This study aims to explore how GPT-4 performs in terms of Bloom’s taxonomy using psychosomatic medicine exam questions. Methods: We used a large data set of psychosomatic medicine multiple-choice questions (N=307) with real-world results derived from medical school exams. GPT-4 answered the multiple-choice questions using 2 distinct prompt versions: detailed and short. The answers were analyzed using a quantitative approach and a qualitative approach. Focusing on incorrectly answered questions, we categorized reasoning errors according to the hierarchical framework of Bloom’s taxonomy. Results: GPT-4’s performance in answering exam questions yielded a high success rate: 93% (284/307) for the detailed prompt and 91% (278/307) for the short prompt. 
Questions answered correctly by GPT-4 had a statistically significant higher difficulty than questions answered incorrectly (P=.002 for the detailed prompt and P<.001 for the short prompt). Independent of the prompt, GPT-4’s lowest exam performance was 78.9% (15/19), thereby always surpassing the “pass” threshold. Our qualitative analysis of incorrect answers, based on Bloom’s taxonomy, showed that errors were primarily in the “remember” (29/68) and “understand” (23/68) cognitive levels; specific issues arose in recalling details, understanding conceptual relationships, and adhering to standardized guidelines. Conclusions: GPT-4 demonstrated a remarkable success rate when confronted with psychosomatic medicine multiple-choice exam questions, aligning with previous findings. When evaluated through Bloom’s taxonomy, our data revealed that GPT-4 occasionally ignored specific facts (remember), provided illogical reasoning (understand), or failed to apply concepts to a new situation (apply). These errors, which were confidently presented, could be attributed to inherent model biases and the tendency to generate outputs that maximize likelihood. %M 38261378 %R 10.2196/52113 %U https://www.jmir.org/2024/1/e52113 %U https://doi.org/10.2196/52113 %U http://www.ncbi.nlm.nih.gov/pubmed/38261378 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e51926 %T Uncovering Language Disparity of ChatGPT on Retinal Vascular Disease Classification: Cross-Sectional Study %A Liu,Xiaocong %A Wu,Jiageng %A Shao,An %A Shen,Wenyue %A Ye,Panpan %A Wang,Yao %A Ye,Juan %A Jin,Kai %A Yang,Jie %+ Eye Center, The Second Affiliated Hospital, Zhejiang University, 88 Jiefang Road, Hangzhou, Zhejiang, 310009, China, 86 571 87783907, jinkai@zju.edu.cn %K large language models %K ChatGPT %K clinical decision support %K retinal vascular disease %K artificial intelligence %D 2024 %7 22.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: Benefiting from rich knowledge and the exceptional ability to understand text, large language models like ChatGPT have shown great potential in English clinical environments. However, the performance of ChatGPT in non-English clinical settings, as well as its reasoning, have not been explored in depth. Objective: This study aimed to evaluate ChatGPT’s diagnostic performance and inference abilities for retinal vascular diseases in a non-English clinical environment. Methods: In this cross-sectional study, we collected 1226 fundus fluorescein angiography reports and corresponding diagnoses written in Chinese and tested ChatGPT with 4 prompting strategies (direct diagnosis or diagnosis with a step-by-step reasoning process and in Chinese or English). Results: Compared with ChatGPT using Chinese prompts for direct diagnosis that achieved an F1-score of 70.47%, ChatGPT using English prompts for direct diagnosis achieved the best diagnostic performance (80.05%), which was inferior to ophthalmologists (89.35%) but close to ophthalmologist interns (82.69%). As for its inference abilities, although ChatGPT can derive a reasoning process with a low error rate (0.4 per report) for both Chinese and English prompts, ophthalmologists identified that the latter brought more reasoning steps with less incompleteness (44.31%), misinformation (1.96%), and hallucinations (0.59%) (all P<.001). Also, analysis of the robustness of ChatGPT with different language prompts indicated significant differences in the recall (P=.03) and F1-score (P=.04) between Chinese and English prompts. 
In short, when prompted in English, ChatGPT exhibited enhanced diagnostic and inference capabilities for retinal vascular disease classification based on Chinese fundus fluorescein angiography reports. Conclusions: ChatGPT can serve as a helpful medical assistant to provide diagnosis in non-English clinical environments, but there are still performance gaps, language disparities, and errors compared to professionals, which demonstrate the potential limitations and the need to continually explore more robust large language models in ophthalmology practice. %M 38252483 %R 10.2196/51926 %U https://www.jmir.org/2024/1/e51926 %U https://doi.org/10.2196/51926 %U http://www.ncbi.nlm.nih.gov/pubmed/38252483 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e49082 %T Beyond the Hype—The Actual Role and Risks of AI in Today’s Medical Practice: Comparative-Approach Study %A Hansen,Steffan %A Brandt,Carl Joakim %A Søndergaard,Jens %+ Research Unit of General Practice, Institution of Public Health, University of Southern Denmark, J.B. Winsløws Vej 9, Odense, 5000, Denmark, 45 65 50 36 19, sholsthansen@health.sdu.dk %K AI %K artificial intelligence %K ChatGPT-4 %K Microsoft Bing %K general practice %K ChatGPT %K chatbot %K chatbots %K writing %K academic %K academia %K Bing %D 2024 %7 22.1.2024 %9 Original Paper %J JMIR AI %G English %X Background: The evolution of artificial intelligence (AI) has significantly impacted various sectors, with health care witnessing some of its most groundbreaking contributions. Contemporary models, such as ChatGPT-4 and Microsoft Bing, have showcased capabilities beyond just generating text, aiding in complex tasks like literature searches and refining web-based queries. Objective: This study explores a compelling query: can AI author an academic paper independently? Our assessment focuses on four core dimensions: relevance (to ensure that AI’s response directly addresses the prompt), accuracy (to ascertain that AI’s information is both factually correct and current), clarity (to examine AI’s ability to present coherent and logical ideas), and tone and style (to evaluate whether AI can align with the formality expected in academic writings). Additionally, we will consider the ethical implications and practicality of integrating AI into academic writing. Methods: To assess the capabilities of ChatGPT-4 and Microsoft Bing in the context of academic paper assistance in general practice, we used a systematic approach. ChatGPT-4, an advanced AI language model by Open AI, excels in generating human-like text and adapting responses based on user interactions, though it has a knowledge cut-off in September 2021. Microsoft Bing's AI chatbot facilitates user navigation on the Bing search engine, offering tailored search results. Results: In terms of relevance, ChatGPT-4 delved deeply into AI’s health care role, citing academic sources and discussing diverse applications and concerns, while Microsoft Bing provided a concise, less detailed overview. In terms of accuracy, ChatGPT-4 correctly cited 72% (23/32) of its peer-reviewed articles but included some nonexistent references. Microsoft Bing’s accuracy stood at 46% (6/13), supplemented by relevant non–peer-reviewed articles. In terms of clarity, both models conveyed clear, coherent text. ChatGPT-4 was particularly adept at detailing technical concepts, while Microsoft Bing was more general. In terms of tone, both models maintained an academic tone, but ChatGPT-4 exhibited superior depth and breadth in content delivery. 
Conclusions: Comparing ChatGPT-4 and Microsoft Bing for academic assistance revealed strengths and limitations. ChatGPT-4 excels in depth and relevance but falters in citation accuracy. Microsoft Bing is concise but lacks robust detail. Though both models have potential, neither can independently handle comprehensive academic tasks. As AI evolves, combining ChatGPT-4’s depth with Microsoft Bing’s up-to-date referencing could optimize academic support. Researchers should critically assess AI outputs to maintain academic credibility. %M 38875597 %R 10.2196/49082 %U https://ai.jmir.org/2024/1/e49082 %U https://doi.org/10.2196/49082 %U http://www.ncbi.nlm.nih.gov/pubmed/38875597 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e53225 %T Redefining Virtual Assistants in Health Care: The Future With Large Language Models %A Sezgin,Emre %+ The Abigail Wexner Reseach Institute at Nationwide Children's Hospital, 700 Children's Drive, Columbus, OH, 43205, United States, 1 6147223179, emre.sezgin@nationwidechildrens.org %K large language models %K voice assistants %K virtual assistants %K chatbots %K conversational agents %K health care %D 2024 %7 19.1.2024 %9 Editorial %J J Med Internet Res %G English %X This editorial explores the evolving and transformative role of large language models (LLMs) in enhancing the capabilities of virtual assistants (VAs) in the health care domain, highlighting recent research on the performance of VAs and LLMs in health care information sharing. Focusing on recent research, this editorial unveils the marked improvement in the accuracy and clinical relevance of responses from LLMs, such as GPT-4, compared to current VAs, especially in addressing complex health care inquiries, like those related to postpartum depression. The improved accuracy and clinical relevance with LLMs mark a paradigm shift in digital health tools and VAs. Furthermore, such LLM applications have the potential to dynamically adapt and be integrated into existing VA platforms, offering cost-effective, scalable, and inclusive solutions. These suggest a significant increase in the applicable range of VA applications, as well as the increased value, risk, and impact in health care, moving toward more personalized digital health ecosystems. However, alongside these advancements, it is necessary to develop and adhere to ethical guidelines, regulatory frameworks, governance principles, and privacy and safety measures. We need a robust interdisciplinary collaboration to navigate the complexities of safely and effectively integrating LLMs into health care applications, ensuring that these emerging technologies align with the diverse needs and ethical considerations of the health care domain. 
%M 38241074 %R 10.2196/53225 %U https://www.jmir.org/2024/1/e53225 %U https://doi.org/10.2196/53225 %U http://www.ncbi.nlm.nih.gov/pubmed/38241074 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50842 %T Performance of ChatGPT on Ophthalmology-Related Questions Across Various Examination Levels: Observational Study %A Haddad,Firas %A Saade,Joanna S %+ Department of Ophthalmology, American University of Beirut Medical Center, Bliss Street, Beirut, 1107 2020, Lebanon, 961 1350000 ext 8031, js62@aub.edu.lb %K ChatGPT %K artificial intelligence %K AI %K board examinations %K ophthalmology %K testing %D 2024 %7 18.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT and language learning models have gained attention recently for their ability to answer questions on various examinations across various disciplines. The question of whether ChatGPT could be used to aid in medical education is yet to be answered, particularly in the field of ophthalmology. Objective: The aim of this study is to assess the ability of ChatGPT-3.5 (GPT-3.5) and ChatGPT-4.0 (GPT-4.0) to answer ophthalmology-related questions across different levels of ophthalmology training. Methods: Questions from the United States Medical Licensing Examination (USMLE) steps 1 (n=44), 2 (n=60), and 3 (n=28) were extracted from AMBOSS, and 248 questions (64 easy, 122 medium, and 62 difficult questions) were extracted from the book, Ophthalmology Board Review Q&A, for the Ophthalmic Knowledge Assessment Program and the Board of Ophthalmology (OB) Written Qualifying Examination (WQE). Questions were prompted identically and inputted to GPT-3.5 and GPT-4.0. Results: GPT-3.5 achieved a total of 55% (n=210) of correct answers, while GPT-4.0 achieved a total of 70% (n=270) of correct answers. GPT-3.5 answered 75% (n=33) of questions correctly in USMLE step 1, 73.33% (n=44) in USMLE step 2, 60.71% (n=17) in USMLE step 3, and 46.77% (n=116) in the OB-WQE. GPT-4.0 answered 70.45% (n=31) of questions correctly in USMLE step 1, 90.32% (n=56) in USMLE step 2, 96.43% (n=27) in USMLE step 3, and 62.90% (n=156) in the OB-WQE. GPT-3.5 performed poorer as examination levels advanced (P<.001), while GPT-4.0 performed better on USMLE steps 2 and 3 and worse on USMLE step 1 and the OB-WQE (P<.001). The coefficient of correlation (r) between ChatGPT answering correctly and human users answering correctly was 0.21 (P=.01) for GPT-3.5 as compared to –0.31 (P<.001) for GPT-4.0. GPT-3.5 performed similarly across difficulty levels, while GPT-4.0 performed more poorly with an increase in the difficulty level. Both GPT models performed significantly better on certain topics than on others. Conclusions: ChatGPT is far from being considered a part of mainstream medical education. Future models with higher accuracy are needed for the platform to be effective in medical education. 
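The ophthalmology examination study above (article e50842) summarizes performance as per-level accuracy plus a correlation between whether ChatGPT answered each question correctly and the proportion of human users who did. A small sketch of that scoring is below; the per-question records are hypothetical placeholders, not the study's question bank.

```python
# Minimal sketch: per-level accuracy and the correlation between model
# correctness and human per-question performance. Records are hypothetical.
from collections import defaultdict
from scipy.stats import pearsonr

questions = [
    # (exam level, model answered correctly?, fraction of human users correct)
    ("USMLE step 1", True, 0.81),
    ("USMLE step 1", False, 0.44),
    ("USMLE step 2", True, 0.73),
    ("OB-WQE", False, 0.52),
    ("OB-WQE", True, 0.66),
]

by_level = defaultdict(list)
for level, correct, _ in questions:
    by_level[level].append(correct)

for level, results in by_level.items():
    print(f"{level}: {sum(results)}/{len(results)} correct "
          f"({100 * sum(results) / len(results):.1f}%)")

model_correct = [float(c) for _, c, _ in questions]
human_correct = [h for _, _, h in questions]
r, p = pearsonr(model_correct, human_correct)  # point-biserial correlation
print(f"r = {r:.2f}, P = {p:.3f}")
```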
%M 38236632 %R 10.2196/50842 %U https://mededu.jmir.org/2024/1/e50842 %U https://doi.org/10.2196/50842 %U http://www.ncbi.nlm.nih.gov/pubmed/38236632 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e53961 %T A Generative Pretrained Transformer (GPT)–Powered Chatbot as a Simulated Patient to Practice History Taking: Prospective, Mixed Methods Study %A Holderried,Friederike %A Stegemann–Philipps,Christian %A Herschbach,Lea %A Moldt,Julia-Astrid %A Nevins,Andrew %A Griewatz,Jan %A Holderried,Martin %A Herrmann-Werner,Anne %A Festl-Wietek,Teresa %A Mahling,Moritz %+ Tübingen Institute for Medical Education, Eberhard Karls University, Elfriede-Aulhorn-Str 10, Tübingen, 72076, Germany, 49 7071 2973715, friederike.holderried@med.uni-tuebingen.de %K simulated patient %K GPT %K generative pretrained transformer %K ChatGPT %K history taking %K medical education %K documentation %K history %K simulated %K simulation %K simulations %K NLP %K natural language processing %K artificial intelligence %K interactive %K chatbot %K chatbots %K conversational agent %K conversational agents %K answer %K answers %K response %K responses %K human computer %K human machine %K usability %K satisfaction %D 2024 %7 16.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Communication is a core competency of medical professionals and of utmost importance for patient safety. Although medical curricula emphasize communication training, traditional formats, such as real or simulated patient interactions, can present psychological stress and are limited in repetition. The recent emergence of large language models (LLMs), such as generative pretrained transformer (GPT), offers an opportunity to overcome these restrictions Objective: The aim of this study was to explore the feasibility of a GPT-driven chatbot to practice history taking, one of the core competencies of communication. Methods: We developed an interactive chatbot interface using GPT-3.5 and a specific prompt including a chatbot-optimized illness script and a behavioral component. Following a mixed methods approach, we invited medical students to voluntarily practice history taking. To determine whether GPT provides suitable answers as a simulated patient, the conversations were recorded and analyzed using quantitative and qualitative approaches. We analyzed the extent to which the questions and answers aligned with the provided script, as well as the medical plausibility of the answers. Finally, the students filled out the Chatbot Usability Questionnaire (CUQ). Results: A total of 28 students practiced with our chatbot (mean age 23.4, SD 2.9 years). We recorded a total of 826 question-answer pairs (QAPs), with a median of 27.5 QAPs per conversation and 94.7% (n=782) pertaining to history taking. When questions were explicitly covered by the script (n=502, 60.3%), the GPT-provided answers were mostly based on explicit script information (n=471, 94.4%). For questions not covered by the script (n=195, 23.4%), the GPT answers used 56.4% (n=110) fictitious information. Regarding plausibility, 842 (97.9%) of 860 QAPs were rated as plausible. Of the 14 (2.1%) implausible answers, GPT provided answers rated as socially desirable, leaving role identity, ignoring script information, illogical reasoning, and calculation error. Despite these results, the CUQ revealed an overall positive user experience (77/100 points). 
Conclusions: Our data showed that LLMs, such as GPT, can provide a simulated patient experience and yield a good user experience and a majority of plausible answers. Our analysis revealed that GPT-provided answers use either explicit script information or are based on available information, which can be understood as abductive reasoning. Although rare, the GPT-based chatbot provides implausible information in some instances, with the major tendency being socially desirable instead of medically plausible information. %M 38227363 %R 10.2196/53961 %U https://mededu.jmir.org/2024/1/e53961 %U https://doi.org/10.2196/53961 %U http://www.ncbi.nlm.nih.gov/pubmed/38227363 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51388 %T Enriching Data Science and Health Care Education: Application and Impact of Synthetic Data Sets Through the Health Gym Project %A Kuo,Nicholas I-Hsien %A Perez-Concha,Oscar %A Hanly,Mark %A Mnatzaganian,Emmanuel %A Hao,Brandon %A Di Sipio,Marcus %A Yu,Guolin %A Vanjara,Jash %A Valerie,Ivy Cerelia %A de Oliveira Costa,Juliana %A Churches,Timothy %A Lujic,Sanja %A Hegarty,Jo %A Jorm,Louisa %A Barbieri,Sebastiano %+ Centre for Big Data Research in Health, The University of New South Wales, Level 2, AGSM Building (G27), Botany St, Kensington NSW, Sydney, 2052, Australia, 61 0293850645, n.kuo@unsw.edu.au %K medical education %K generative model %K generative adversarial networks %K privacy %K antiretroviral therapy (ART) %K human immunodeficiency virus (HIV) %K data science %K educational purposes %K accessibility %K data privacy %K data sets %K sepsis %K hypotension %K HIV %K science education %K health care AI %D 2024 %7 16.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Large-scale medical data sets are vital for hands-on education in health data science but are often inaccessible due to privacy concerns. Addressing this gap, we developed the Health Gym project, a free and open-source platform designed to generate synthetic health data sets applicable to various areas of data science education, including machine learning, data visualization, and traditional statistical models. Initially, we generated 3 synthetic data sets for sepsis, acute hypotension, and antiretroviral therapy for HIV infection. This paper discusses the educational applications of Health Gym’s synthetic data sets. We illustrate this through their use in postgraduate health data science courses delivered by the University of New South Wales, Australia, and a Datathon event, involving academics, students, clinicians, and local health district professionals. We also include adaptable worked examples using our synthetic data sets, designed to enrich hands-on tutorial and workshop experiences. Although we highlight the potential of these data sets in advancing data science education and health care artificial intelligence, we also emphasize the need for continued research into the inherent limitations of synthetic data. 
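The Health Gym viewpoint above (article e51388) describes free synthetic data sets (sepsis, acute hypotension, antiretroviral therapy for HIV) intended for hands-on teaching. A worked example in that spirit might load one synthetic table and fit a simple model; the file name and column names below are hypothetical placeholders, since the abstract does not specify the download format or schema.

```python
# Minimal sketch: a teaching exercise on a synthetic (privacy-preserving)
# data set. The CSV path and column names are hypothetical placeholders;
# substitute the fields of the synthetic release you actually download.
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

df = pd.read_csv("synthetic_sepsis.csv")                        # hypothetical file name

features = ["heart_rate", "mean_arterial_pressure", "lactate"]  # hypothetical columns
target = "mortality"                                            # hypothetical label

X_train, X_test, y_train, y_test = train_test_split(
    df[features], df[target], test_size=0.3, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
auc = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
print(f"Toy logistic regression AUC on synthetic data: {auc:.2f}")
```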
%M 38227356 %R 10.2196/51388 %U https://mededu.jmir.org/2024/1/e51388 %U https://doi.org/10.2196/51388 %U http://www.ncbi.nlm.nih.gov/pubmed/38227356 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e49970 %T A Novel Evaluation Model for Assessing ChatGPT on Otolaryngology–Head and Neck Surgery Certification Examinations: Performance Study %A Long,Cai %A Lowe,Kayle %A Zhang,Jessica %A Santos,André dos %A Alanazi,Alaa %A O'Brien,Daniel %A Wright,Erin D %A Cote,David %+ Division of Otolaryngology–Head and Neck Surgery, University of Alberta, 8440-112 Street, Edmonton, AB, T6G 2B7, Canada, 1 (780) 407 8822, cai.long.med@gmail.com %K medical licensing %K otolaryngology %K otology %K laryngology %K ear %K nose %K throat %K ENT %K surgery %K surgical %K exam %K exams %K response %K responses %K answer %K answers %K chatbot %K chatbots %K examination %K examinations %K medical education %K otolaryngology/head and neck surgery %K OHNS %K artificial intelligence %K AI %K ChatGPT %K medical examination %K large language models %K language model %K LLM %K LLMs %K wide range information %K patient safety %K clinical implementation %K safety %K machine learning %K NLP %K natural language processing %D 2024 %7 16.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is among the most popular large language models (LLMs), exhibiting proficiency in various standardized tests, including multiple-choice medical board examinations. However, its performance on otolaryngology–head and neck surgery (OHNS) certification examinations and open-ended medical board certification examinations has not been reported. Objective: We aimed to evaluate the performance of ChatGPT on OHNS board examinations and propose a novel method to assess an AI model’s performance on open-ended medical board examination questions. Methods: Twenty-one open-ended questions were adopted from the Royal College of Physicians and Surgeons of Canada’s sample examination to query ChatGPT on April 11, 2023, with and without prompts. A new model, named Concordance, Validity, Safety, Competency (CVSC), was developed to evaluate its performance. Results: In an open-ended question assessment, ChatGPT achieved a passing mark (an average of 75% across 3 trials) in the attempts and demonstrated higher accuracy with prompts. The model demonstrated high concordance (92.06%) and satisfactory validity. While demonstrating considerable consistency in regenerating answers, it often provided only partially correct responses. Notably, concerning features such as hallucinations and self-conflicting answers were observed. Conclusions: ChatGPT achieved a passing score in the sample examination and demonstrated the potential to pass the OHNS certification examination of the Royal College of Physicians and Surgeons of Canada. Some concerns remain due to its hallucinations, which could pose risks to patient safety. Further adjustments are necessary to yield safer and more accurate answers for clinical implementation. 
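The otolaryngology study above (article e49970) scores open-ended answers with its CVSC rubric (Concordance, Validity, Safety, Competency) across repeated trials. The abstract does not give the numeric scale, so the aggregation sketch below uses a hypothetical 0-1 score per dimension purely to show how per-trial ratings could be rolled up.

```python
# Minimal sketch: aggregating rubric ratings across repeated trials.
# The 0-1 scores and three-trial structure are hypothetical; only the
# CVSC dimension names come from the study.
from statistics import mean

DIMENSIONS = ("concordance", "validity", "safety", "competency")

# ratings[trial][question_id] -> {dimension: score in [0, 1]}
ratings = [
    {"q1": {"concordance": 1.0, "validity": 1.0, "safety": 1.0, "competency": 0.5}},
    {"q1": {"concordance": 1.0, "validity": 0.5, "safety": 1.0, "competency": 0.5}},
    {"q1": {"concordance": 0.5, "validity": 1.0, "safety": 1.0, "competency": 1.0}},
]

def dimension_averages(all_trials):
    """Average each CVSC dimension over every question and trial."""
    totals = {d: [] for d in DIMENSIONS}
    for trial in all_trials:
        for scores in trial.values():
            for d in DIMENSIONS:
                totals[d].append(scores[d])
    return {d: mean(v) for d, v in totals.items()}

for dim, avg in dimension_averages(ratings).items():
    print(f"{dim}: {avg:.2f}")
```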
%M 38227351 %R 10.2196/49970 %U https://mededu.jmir.org/2024/1/e49970 %U https://doi.org/10.2196/49970 %U http://www.ncbi.nlm.nih.gov/pubmed/38227351 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e47339 %T The Use of ChatGPT for Education Modules on Integrated Pharmacotherapy of Infectious Disease: Educators' Perspectives %A Al-Worafi,Yaser Mohammed %A Goh,Khang Wen %A Hermansyah,Andi %A Tan,Ching Siang %A Ming,Long Chiau %+ School of Pharmacy, KPJ Healthcare University, Lot PT 17010 Persiaran Seriemas, Kota Seriemas, Nilai, 71800, Malaysia, 60 67942692, tcsiang@kpju.edu.my %K innovation and technology %K quality education %K sustainable communities %K innovation and infrastructure %K partnerships for the goals %K sustainable education %K social justice %K ChatGPT %K artificial intelligence %K feasibility %D 2024 %7 12.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial Intelligence (AI) plays an important role in many fields, including medical education, practice, and research. Many medical educators started using ChatGPT at the end of 2022 for many purposes. Objective: The aim of this study was to explore the potential uses, benefits, and risks of using ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Methods: A content analysis was conducted to investigate the applications of ChatGPT in education modules on integrated pharmacotherapy of infectious disease. Questions pertaining to curriculum development, syllabus design, lecture note preparation, and examination construction were posed during data collection. Three experienced professors rated the appropriateness and precision of the answers provided by ChatGPT. The consensus rating was considered. The professors also discussed the prospective applications, benefits, and risks of ChatGPT in this educational setting. Results: ChatGPT demonstrated the ability to contribute to various aspects of curriculum design, with ratings ranging from 50% to 92% for appropriateness and accuracy. However, there were limitations and risks associated with its use, including incomplete syllabi, the absence of essential learning objectives, and the inability to design valid questionnaires and qualitative studies. It was suggested that educators use ChatGPT as a resource rather than relying primarily on its output. There are recommendations for effectively incorporating ChatGPT into the curriculum of the education modules on integrated pharmacotherapy of infectious disease. Conclusions: Medical and health sciences educators can use ChatGPT as a guide in many aspects related to the development of the curriculum of the education modules on integrated pharmacotherapy of infectious disease, syllabus design, lecture notes preparation, and examination preparation with caution. 
%M 38214967 %R 10.2196/47339 %U https://mededu.jmir.org/2024/1/e47339 %U https://doi.org/10.2196/47339 %U http://www.ncbi.nlm.nih.gov/pubmed/38214967 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e48996 %T Automated Paper Screening for Clinical Reviews Using Large Language Models: Data Analysis Study %A Guo,Eddie %A Gupta,Mehul %A Deng,Jiawen %A Park,Ye-Jean %A Paget,Michael %A Naugler,Christopher %+ Cumming School of Medicine, University of Calgary, 3330 University Dr NW, Calgary, AB, T2N 1N4, Canada, 1 5879880292, eddie.guo@ucalgary.ca %K abstract screening %K Chat GPT %K classification %K extract %K extraction %K free text %K GPT %K GPT-4 %K language model %K large language models %K LLM %K natural language processing %K NLP %K nonopiod analgesia %K review methodology %K review methods %K screening %K systematic review %K systematic %K unstructured data %D 2024 %7 12.1.2024 %9 Original Paper %J J Med Internet Res %G English %X Background: The systematic review of clinical research papers is a labor-intensive and time-consuming process that often involves the screening of thousands of titles and abstracts. The accuracy and efficiency of this process are critical for the quality of the review and subsequent health care decisions. Traditional methods rely heavily on human reviewers, often requiring a significant investment of time and resources. Objective: This study aims to assess the performance of the OpenAI generative pretrained transformer (GPT) and GPT-4 application programming interfaces (APIs) in accurately and efficiently identifying relevant titles and abstracts from real-world clinical review data sets and comparing their performance against ground truth labeling by 2 independent human reviewers. Methods: We introduce a novel workflow using the Chat GPT and GPT-4 APIs for screening titles and abstracts in clinical reviews. A Python script was created to make calls to the API with the screening criteria in natural language and a corpus of title and abstract data sets filtered by a minimum of 2 human reviewers. We compared the performance of our model against human-reviewed papers across 6 review papers, screening over 24,000 titles and abstracts. Results: Our results show an accuracy of 0.91, a macro F1-score of 0.60, a sensitivity of excluded papers of 0.91, and a sensitivity of included papers of 0.76. The interrater variability between 2 independent human screeners was κ=0.46, and the prevalence and bias-adjusted κ between our proposed methods and the consensus-based human decisions was κ=0.96. On a randomly selected subset of papers, the GPT models demonstrated the ability to provide reasoning for their decisions and corrected their initial decisions upon being asked to explain their reasoning for incorrect classifications. Conclusions: Large language models have the potential to streamline the clinical review process, save valuable time and effort for researchers, and contribute to the overall quality of clinical reviews. By prioritizing the workflow and acting as an aid rather than a replacement for researchers and reviewers, models such as GPT-4 can enhance efficiency and lead to more accurate and reliable conclusions in medical research. 
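The screening study above (article e48996) describes a Python script that sends natural-language screening criteria plus each title and abstract to the GPT API and compares the include/exclude decisions against dual human review. A minimal sketch of that loop is below; the criteria string, records, model choice, and agreement metric shown are illustrative assumptions rather than the authors' exact script.

```python
# Minimal sketch: LLM-assisted title/abstract screening against
# natural-language criteria, compared with consensus human decisions.
# The criteria, records, and model name are hypothetical placeholders.
from openai import OpenAI
from sklearn.metrics import cohen_kappa_score

client = OpenAI()

CRITERIA = ("Include randomized trials of nonopioid analgesia in adults; "
            "exclude animal studies, case reports, and protocols.")  # hypothetical

def screen(title: str, abstract: str, model: str = "gpt-4") -> str:
    """Return 'include' or 'exclude' for one record."""
    prompt = (f"Screening criteria: {CRITERIA}\n\n"
              f"Title: {title}\nAbstract: {abstract}\n\n"
              "Answer with exactly one word: include or exclude.")
    reply = client.chat.completions.create(
        model=model,
        temperature=0,
        messages=[{"role": "user", "content": prompt}],
    )
    answer = reply.choices[0].message.content.strip().lower()
    return "include" if answer.startswith("include") else "exclude"

records = [  # hypothetical records with consensus human labels
    {"title": "RCT of ketorolac vs placebo", "abstract": "...", "human": "include"},
    {"title": "Case report of opioid overdose", "abstract": "...", "human": "exclude"},
]

model_labels = [screen(r["title"], r["abstract"]) for r in records]
human_labels = [r["human"] for r in records]

accuracy = sum(m == h for m, h in zip(model_labels, human_labels)) / len(records)
kappa = cohen_kappa_score(human_labels, model_labels)
print(f"accuracy = {accuracy:.2f}, kappa = {kappa:.2f}")
```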
%M 38214966 %R 10.2196/48996 %U https://www.jmir.org/2024/1/e48996 %U https://doi.org/10.2196/48996 %U http://www.ncbi.nlm.nih.gov/pubmed/38214966 %0 Journal Article %@ 2817-1705 %I JMIR Publications %V 3 %N %P e50442 %T Assessment of ChatGPT-3.5's Knowledge in Oncology: Comparative Study with ASCO-SEP Benchmarks %A Odabashian,Roupen %A Bastin,Donald %A Jones,Georden %A Manzoor,Maria %A Tangestaniapour,Sina %A Assad,Malke %A Lakhani,Sunita %A Odabashian,Maritsa %A McGee,Sharon %+ Department of Oncology, Barbara Ann Karmanos Cancer Institute, Wayne State University, 4100 John R St, Detroit, MI, 48201, United States, 1 (313) 745 3000 ext 7731, roupen.odabashian@mclaren.org %K artificial intelligence %K ChatGPT-3.5 %K language model %K medical oncology %D 2024 %7 12.1.2024 %9 Original Paper %J JMIR AI %G English %X Background: ChatGPT (Open AI) is a state-of-the-art large language model that uses artificial intelligence (AI) to address questions across diverse topics. The American Society of Clinical Oncology Self-Evaluation Program (ASCO-SEP) created a comprehensive educational program to help physicians keep up to date with the many rapid advances in the field. The question bank consists of multiple choice questions addressing the many facets of cancer care, including diagnosis, treatment, and supportive care. As ChatGPT applications rapidly expand, it becomes vital to ascertain if the knowledge of ChatGPT-3.5 matches the established standards that oncologists are recommended to follow. Objective: This study aims to evaluate whether ChatGPT-3.5’s knowledge aligns with the established benchmarks that oncologists are expected to adhere to. This will furnish us with a deeper understanding of the potential applications of this tool as a support for clinical decision-making. Methods: We conducted a systematic assessment of the performance of ChatGPT-3.5 on the ASCO-SEP, the leading educational and assessment tool for medical oncologists in training and practice. Over 1000 multiple choice questions covering the spectrum of cancer care were extracted. Questions were categorized by cancer type or discipline, with subcategorization as treatment, diagnosis, or other. Answers were scored as correct if ChatGPT-3.5 selected the answer as defined by ASCO-SEP. Results: Overall, ChatGPT-3.5 achieved a score of 56.1% (583/1040) for the correct answers provided. The program demonstrated varying levels of accuracy across cancer types or disciplines. The highest accuracy was observed in questions related to developmental therapeutics (8/10; 80% correct), while the lowest accuracy was observed in questions related to gastrointestinal cancer (102/209; 48.8% correct). There was no significant difference in the program’s performance across the predefined subcategories of diagnosis, treatment, and other (P=.16, which is greater than .05). Conclusions: This study evaluated ChatGPT-3.5’s oncology knowledge using the ASCO-SEP, aiming to address uncertainties regarding AI tools like ChatGPT in clinical decision-making. Our findings suggest that while ChatGPT-3.5 offers a hopeful outlook for AI in oncology, its present performance in ASCO-SEP tests necessitates further refinement to reach the requisite competency levels. Future assessments could explore ChatGPT’s clinical decision support capabilities with real-world clinical scenarios, its ease of integration into medical workflows, and its potential to foster interdisciplinary collaboration and patient engagement in health care settings. 
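The oncology knowledge study above (article e50442) compares correct and incorrect answer counts across question subcategories with a chi-square test. The sketch below runs that comparison on a hypothetical contingency table; the counts are placeholders, not the study's data.

```python
# Minimal sketch: chi-square test of independence between question
# subcategory and answer correctness. Counts are hypothetical placeholders.
from scipy.stats import chi2_contingency

# rows: diagnosis, treatment, other; columns: correct, incorrect
table = [
    [120, 95],
    [310, 240],
    [153, 122],
]

chi2, p, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, dof = {dof}, P = {p:.3f}")
```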
%M 38875575 %R 10.2196/50442 %U https://ai.jmir.org/2024/1/e50442 %U https://doi.org/10.2196/50442 %U http://www.ncbi.nlm.nih.gov/pubmed/38875575 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51308 %T Comprehensiveness, Accuracy, and Readability of Exercise Recommendations Provided by an AI-Based Chatbot: Mixed Methods Study %A Zaleski,Amanda L %A Berkowsky,Rachel %A Craig,Kelly Jean Thomas %A Pescatello,Linda S %+ Clinical Evidence Development, Aetna Medical Affairs, CVS Health Corporation, 151 Farmington Avenue, Hartford, CT, 06156, United States, 1 8605385003, zaleskia@aetna.com %K exercise prescription %K health literacy %K large language model %K patient education %K artificial intelligence %K AI %K chatbot %D 2024 %7 11.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: Regular physical activity is critical for health and disease prevention. Yet, health care providers and patients face barriers to implement evidence-based lifestyle recommendations. The potential to augment care with the increased availability of artificial intelligence (AI) technologies is limitless; however, the suitability of AI-generated exercise recommendations has yet to be explored. Objective: The purpose of this study was to assess the comprehensiveness, accuracy, and readability of individualized exercise recommendations generated by a novel AI chatbot. Methods: A coding scheme was developed to score AI-generated exercise recommendations across ten categories informed by gold-standard exercise recommendations, including (1) health condition–specific benefits of exercise, (2) exercise preparticipation health screening, (3) frequency, (4) intensity, (5) time, (6) type, (7) volume, (8) progression, (9) special considerations, and (10) references to the primary literature. The AI chatbot was prompted to provide individualized exercise recommendations for 26 clinical populations using an open-source application programming interface. Two independent reviewers coded AI-generated content for each category and calculated comprehensiveness (%) and factual accuracy (%) on a scale of 0%-100%. Readability was assessed using the Flesch-Kincaid formula. Qualitative analysis identified and categorized themes from AI-generated output. Results: AI-generated exercise recommendations were 41.2% (107/260) comprehensive and 90.7% (146/161) accurate, with the majority (8/15, 53%) of inaccuracy related to the need for exercise preparticipation medical clearance. Average readability level of AI-generated exercise recommendations was at the college level (mean 13.7, SD 1.7), with an average Flesch reading ease score of 31.1 (SD 7.7). Several recurring themes and observations of AI-generated output included concern for liability and safety, preference for aerobic exercise, and potential bias and direct discrimination against certain age-based populations and individuals with disabilities. Conclusions: There were notable gaps in the comprehensiveness, accuracy, and readability of AI-generated exercise recommendations. Exercise and health care professionals should be aware of these limitations when using and endorsing AI-based technologies as a tool to support lifestyle change involving exercise. 
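The exercise-recommendation study above (article e51308) grades readability with the Flesch-Kincaid formulas. The sketch below applies the standard formulas with a crude vowel-group syllable counter; the sample text is a placeholder, and a dedicated readability library or syllable dictionary (the abstract does not name the study's exact tooling) would count syllables more carefully.

```python
# Minimal sketch: Flesch reading ease and Flesch-Kincaid grade level with
# a naive syllable counter (contiguous vowel groups). Real analyses should
# use a proper syllable dictionary or an established readability library.
import re

def count_syllables(word: str) -> int:
    """Very rough syllable estimate: count contiguous vowel groups."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def readability(text: str) -> tuple[float, float]:
    sentences = max(1, len(re.findall(r"[.!?]+", text)))
    words = re.findall(r"[A-Za-z']+", text)
    syllables = sum(count_syllables(w) for w in words)
    wps = len(words) / sentences            # words per sentence
    spw = syllables / len(words)            # syllables per word
    reading_ease = 206.835 - 1.015 * wps - 84.6 * spw
    grade_level = 0.39 * wps + 11.8 * spw - 15.59
    return reading_ease, grade_level

sample = ("Moderate-intensity aerobic exercise is recommended on most days "
          "of the week. Consult your clinician before starting a new program.")
ease, grade = readability(sample)
print(f"Flesch reading ease = {ease:.1f}, Flesch-Kincaid grade = {grade:.1f}")
```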
%M 38206661 %R 10.2196/51308 %U https://mededu.jmir.org/2024/1/e51308 %U https://doi.org/10.2196/51308 %U http://www.ncbi.nlm.nih.gov/pubmed/38206661 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 26 %N %P e47134 %T Embodied Conversational Agents for Chronic Diseases: Scoping Review %A Jiang,Zhili %A Huang,Xiting %A Wang,Zhiqian %A Liu,Yang %A Huang,Lihua %A Luo,Xiaolin %+ Department of Nursing, The First Affiliated Hospital, Zhejiang University School of Medicine, Building 17, 3rd Floor, 79 Qingchun Road, Hangzhou, 310003, China, 86 13867129329, lihuahuang818@zju.edu.cn %K embodied conversational agent %K ECA %K chronic diseases %K eHealth %K health care %K mobile phone %D 2024 %7 9.1.2024 %9 Review %J J Med Internet Res %G English %X Background: Embodied conversational agents (ECAs) are computer-generated animated humanlike characters that interact with users through verbal and nonverbal behavioral cues. They are increasingly used in a range of fields, including health care. Objective: This scoping review aims to identify the current practice in the development and evaluation of ECAs for chronic diseases. Methods: We applied a methodological framework in this review. A total of 6 databases (ie, PubMed, Embase, CINAHL, ACM Digital Library, IEEE Xplore Digital Library, and Web of Science) were searched using a combination of terms related to ECAs and health in October 2023. Two independent reviewers selected the studies and extracted the data. This review followed the PRISMA-ScR (Preferred Reporting Items of Systematic Reviews and Meta-Analyses Extension for Scoping Reviews) statement. Results: The literature search found 6332 papers, of which 36 (0.57%) met the inclusion criteria. Among the 36 studies, 27 (75%) originated from the United States, and 28 (78%) were published from 2020 onward. The reported ECAs covered a wide range of chronic diseases, with a focus on cancers, atrial fibrillation, and type 2 diabetes, primarily to promote screening and self-management. Most ECAs were depicted as middle-aged women based on screenshots and communicated with users through voice and nonverbal behavior. The most frequently reported evaluation outcomes were acceptability and effectiveness. Conclusions: This scoping review provides valuable insights for technology developers and health care professionals regarding the development and implementation of ECAs. It emphasizes the importance of technological advances in the embodiment, personalized strategy, and communication modality and requires in-depth knowledge of user preferences regarding appearance, animation, and intervention content. Future studies should incorporate measures of cost, efficiency, and productivity to provide a comprehensive evaluation of the benefits of using ECAs in health care. 
%M 38194260 %R 10.2196/47134 %U https://www.jmir.org/2024/1/e47134 %U https://doi.org/10.2196/47134 %U http://www.ncbi.nlm.nih.gov/pubmed/38194260 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51247 %T Artificial Intelligence in Medicine: Cross-Sectional Study Among Medical Students on Application, Education, and Ethical Aspects %A Weidener,Lukas %A Fischer,Michael %+ Research Unit for Quality and Ethics in Health Care, UMIT TIROL – Private University for Health Sciences and Health Technology, Eduard-Wallnöfer-Zentrum 1, Hall in Tirol, 6060, Austria, 43 17670491594, lukas.weidener@edu.umit-tirol.at %K artificial intelligence %K AI technology %K medicine %K medical education %K medical curriculum %K medical school %K AI ethics %K ethics %D 2024 %7 5.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The use of artificial intelligence (AI) in medicine not only directly impacts the medical profession but is also increasingly associated with various potential ethical aspects. In addition, the expanding use of AI and AI-based applications such as ChatGPT demands a corresponding shift in medical education to adequately prepare future practitioners for the effective use of these tools and address the associated ethical challenges they present. Objective: This study aims to explore how medical students from Germany, Austria, and Switzerland perceive the use of AI in medicine and the teaching of AI and AI ethics in medical education in accordance with their use of AI-based chat applications, such as ChatGPT. Methods: This cross-sectional study, conducted from June 15 to July 15, 2023, surveyed medical students across Germany, Austria, and Switzerland using a web-based survey. This study aimed to assess students’ perceptions of AI in medicine and the integration of AI and AI ethics into medical education. The survey, which included 53 items across 6 sections, was developed and pretested. Data analysis used descriptive statistics (median, mode, IQR, total number, and percentages) and either the chi-square or Mann-Whitney U tests, as appropriate. Results: Surveying 487 medical students across Germany, Austria, and Switzerland revealed limited formal education on AI or AI ethics within medical curricula, although 38.8% (189/487) had prior experience with AI-based chat applications, such as ChatGPT. Despite varied prior exposures, 71.7% (349/487) anticipated a positive impact of AI on medicine. There was widespread consensus (385/487, 74.9%) on the need for AI and AI ethics instruction in medical education, although the current offerings were deemed inadequate. Regarding the AI ethics education content, all proposed topics were rated as highly relevant. Conclusions: This study revealed a pronounced discrepancy between the use of AI-based (chat) applications, such as ChatGPT, among medical students in Germany, Austria, and Switzerland and the teaching of AI in medical education. To adequately prepare future medical professionals, there is an urgent need to integrate the teaching of AI and AI ethics into the medical curricula. %M 38180787 %R 10.2196/51247 %U https://mededu.jmir.org/2024/1/e51247 %U https://doi.org/10.2196/51247 %U http://www.ncbi.nlm.nih.gov/pubmed/38180787 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51148 %T Pure Wisdom or Potemkin Villages? 
A Comparison of ChatGPT 3.5 and ChatGPT 4 on USMLE Step 3 Style Questions: Quantitative Analysis %A Knoedler,Leonard %A Alfertshofer,Michael %A Knoedler,Samuel %A Hoch,Cosima C %A Funk,Paul F %A Cotofana,Sebastian %A Maheta,Bhagvat %A Frank,Konstantin %A Brébant,Vanessa %A Prantl,Lukas %A Lamby,Philipp %+ Department of Plastic, Hand and Reconstructive Surgery, University Hospital Regensburg, Franz-Josef-Strauß-Allee 11, Regensburg, 93053, Germany, 49 151 44824958, leonardknoedler@t-online.de %K ChatGPT %K United States Medical Licensing Examination %K artificial intelligence %K USMLE %K USMLE Step 1 %K OpenAI %K medical education %K clinical decision-making %D 2024 %7 5.1.2024 %9 Original Paper %J JMIR Med Educ %G English %X Background: The United States Medical Licensing Examination (USMLE) has been critical in medical education since 1992, testing various aspects of a medical student’s knowledge and skills through different steps, based on their training level. Artificial intelligence (AI) tools, including chatbots like ChatGPT, are emerging technologies with potential applications in medicine. However, comprehensive studies analyzing ChatGPT’s performance on USMLE Step 3 in large-scale scenarios and comparing different versions of ChatGPT are limited. Objective: This paper aimed to analyze ChatGPT’s performance on USMLE Step 3 practice test questions to better elucidate the strengths and weaknesses of AI use in medical education and deduce evidence-based strategies to counteract AI cheating. Methods: A total of 2069 USMLE Step 3 practice questions were extracted from the AMBOSS study platform. After excluding 229 image-based questions, a total of 1840 text-based questions were further categorized and entered into ChatGPT 3.5, while a subset of 229 questions were entered into ChatGPT 4. Responses were recorded, and the accuracy of ChatGPT answers as well as its performance in different test question categories and for different difficulty levels were compared between both versions. Results: Overall, ChatGPT 4 demonstrated a statistically significant superior performance compared to ChatGPT 3.5, achieving an accuracy of 84.7% (194/229) and 56.9% (1047/1840), respectively. A noteworthy correlation was observed between the length of test questions and the performance of ChatGPT 3.5 (ρ=–0.069; P=.003), which was absent in ChatGPT 4 (P=.87). Additionally, the difficulty of test questions, as categorized by AMBOSS hammer ratings, showed a statistically significant correlation with performance for both ChatGPT versions, with ρ=–0.289 for ChatGPT 3.5 and ρ=–0.344 for ChatGPT 4. ChatGPT 4 surpassed ChatGPT 3.5 in all levels of test question difficulty, except for the 2 highest difficulty tiers (4 and 5 hammers), where statistical significance was not reached. Conclusions: In this study, ChatGPT 4 demonstrated remarkable proficiency in taking the USMLE Step 3, with an accuracy rate of 84.7% (194/229), outshining ChatGPT 3.5 with an accuracy rate of 56.9% (1047/1840). Although ChatGPT 4 performed exceptionally, it encountered difficulties in questions requiring the application of theoretical concepts, particularly in cardiology and neurology. These insights are pivotal for the development of examination strategies that are resilient to AI and underline the promising role of AI in the realm of medical education and diagnostics. 
%M 38180782 %R 10.2196/51148 %U https://mededu.jmir.org/2024/1/e51148 %U https://doi.org/10.2196/51148 %U http://www.ncbi.nlm.nih.gov/pubmed/38180782 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e51183 %T Generative Language Models and Open Notes: Exploring the Promise and Limitations %A Blease,Charlotte %A Torous,John %A McMillan,Brian %A Hägglund,Maria %A Mandl,Kenneth D %+ Department of Women's and Children's Health, Uppsala University, Box 256, Uppsala, 751 05, Sweden, 46 18 471 00 0, charlotteblease@gmail.com %K ChatGPT %K generative language models %K large language models %K medical education %K Open Notes %K online record access %K patient-centered care %K empathy %K language model %K online record access %K documentation %K communication tool %K clinical documentation %D 2024 %7 4.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Patients’ online record access (ORA) is growing worldwide. In some countries, including the United States and Sweden, access is advanced with patients obtaining rapid access to their full records on the web including laboratory and test results, lists of prescribed medications, vaccinations, and even the very narrative reports written by clinicians (the latter, commonly referred to as “open notes”). In the United States, patient’s ORA is also available in a downloadable form for use with other apps. While survey studies have shown that some patients report many benefits from ORA, there remain challenges with implementation around writing clinical documentation that patients may now read. With ORA, the functionality of the record is evolving; it is no longer only an aide memoire for doctors but also a communication tool for patients. Studies suggest that clinicians are changing how they write documentation, inviting worries about accuracy and completeness. Other concerns include work burdens; while few objective studies have examined the impact of ORA on workload, some research suggests that clinicians are spending more time writing notes and answering queries related to patients’ records. Aimed at addressing some of these concerns, clinician and patient education strategies have been proposed. In this viewpoint paper, we explore these approaches and suggest another longer-term strategy: the use of generative artificial intelligence (AI) to support clinicians in documenting narrative summaries that patients will find easier to understand. Applied to narrative clinical documentation, we suggest that such approaches may significantly help preserve the accuracy of notes, strengthen writing clarity and signals of empathy and patient-centered care, and serve as a buffer against documentation work burdens. However, we also consider the current risks associated with existing generative AI. We emphasize that for this innovation to play a key role in ORA, the cocreation of clinical notes will be imperative. We also caution that clinicians will need to be supported in how to work alongside generative AI to optimize its considerable potential. 
%M 38175688 %R 10.2196/51183 %U https://mededu.jmir.org/2024/1/e51183 %U https://doi.org/10.2196/51183 %U http://www.ncbi.nlm.nih.gov/pubmed/38175688 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 10 %N %P e50869 %T Patients, Doctors, and Chatbots %A Erren,Thomas C %+ Institute and Policlinic for Occupational Medicine, Environmental Medicine and Prevention Research, University Hospital of Cologne, University of Cologne, Berlin-Kölnische Allee 4, Köln (Zollstock), 50937, Germany, 49 022147876780, tim.erren@uni-koeln.de %K chatbot %K ChatGPT %K medical advice %K ethics %K patients %K doctors %D 2024 %7 4.1.2024 %9 Viewpoint %J JMIR Med Educ %G English %X Medical advice is key to the relationship between doctor and patient. The question I will address is “how may chatbots affect the interaction between patients and doctors in regards to medical advice?” I describe what lies ahead when using chatbots and identify questions galore for the daily work of doctors. I conclude with a gloomy outlook, expectations for the urgently needed ethical discourse, and a hope in relation to humans and machines. %M 38175695 %R 10.2196/50869 %U https://mededu.jmir.org/2024/1/e50869 %U https://doi.org/10.2196/50869 %U http://www.ncbi.nlm.nih.gov/pubmed/38175695 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51501 %T Text Dialogue Analysis for Primary Screening of Mild Cognitive Impairment: Development and Validation Study %A Wang,Changyu %A Liu,Siru %A Li,Aiqing %A Liu,Jialin %+ Information Center, West China Hospital, Sichuan University, No. 37 Guo Xue Xiang, Chengdu, 610041, China, 86 28 85422306, DLJL8@163.com %K artificial intelligence %K AI %K AI models %K ChatGPT %K primary screening %K mild cognitive impairment %K standardization %K prompt design %K design %K artificial intelligence %K cognitive impairment %K screening %K model %K clinician %K diagnosis %D 2023 %7 29.12.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence models tailored to diagnose cognitive impairment have shown excellent results. However, it is unclear whether large linguistic models can rival specialized models by text alone. Objective: In this study, we explored the performance of ChatGPT for primary screening of mild cognitive impairment (MCI) and standardized the design steps and components of the prompts. Methods: We gathered a total of 174 participants from the DementiaBank screening and classified 70% of them into the training set and 30% of them into the test set. Only text dialogues were kept. Sentences were cleaned using a macro code, followed by a manual check. The prompt consisted of 5 main parts, including character setting, scoring system setting, indicator setting, output setting, and explanatory information setting. Three dimensions of variables from published studies were included: vocabulary (ie, word frequency and word ratio, phrase frequency and phrase ratio, and lexical complexity), syntax and grammar (ie, syntactic complexity and grammatical components), and semantics (ie, semantic density and semantic coherence). We used R 4.3.0 for the analysis of variables and diagnostic indicators. Results: Three additional indicators related to the severity of MCI were incorporated into the final prompt for the model. These indicators were effective in discriminating between MCI and cognitively normal participants: tip-of-the-tongue phenomenon (P<.001), difficulty with complex ideas (P<.001), and memory issues (P<.001). 
The final GPT-4 model achieved a sensitivity of 0.8636, a specificity of 0.9487, and an area under the curve of 0.9062 on the training set; on the test set, the sensitivity, specificity, and area under the curve reached 0.7727, 0.8333, and 0.8030, respectively. Conclusions: ChatGPT was effective in the primary screening of participants with possible MCI. Improved standardization of prompts by clinicians would also improve the performance of the model. It is important to note that ChatGPT is not a substitute for a clinician making a diagnosis. %M 38157230 %R 10.2196/51501 %U https://www.jmir.org/2023/1/e51501 %U https://doi.org/10.2196/51501 %U http://www.ncbi.nlm.nih.gov/pubmed/38157230 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51580 %T Evaluation of the Performance of Generative AI Large Language Models ChatGPT, Google Bard, and Microsoft Bing Chat in Supporting Evidence-Based Dentistry: Comparative Mixed Methods Study %A Giannakopoulos,Kostis %A Kavadella,Argyro %A Aaqel Salim,Anas %A Stamatopoulos,Vassilis %A Kaklamanos,Eleftherios G %+ School of Dentistry, European University Cyprus, 6 Diogenis St, Engomi, Nicosia, 2404, Cyprus, 357 22559622, k.giannakopoulos@euc.ac.cy %K artificial intelligence %K AI %K large language models %K generative pretrained transformers %K evidence-based dentistry %K ChatGPT %K Google Bard %K Microsoft Bing %K clinical practice %K dental professional %K dental practice %K clinical decision-making %K clinical practice guidelines %D 2023 %7 28.12.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: The increasing application of generative artificial intelligence large language models (LLMs) in various fields, including dentistry, raises questions about their accuracy. Objective: This study aims to comparatively evaluate the answers provided by 4 LLMs, namely Bard (Google LLC), ChatGPT-3.5 and ChatGPT-4 (OpenAI), and Bing Chat (Microsoft Corp), to clinically relevant questions from the field of dentistry. Methods: The LLMs were queried with 20 open-type, clinical dentistry–related questions from different disciplines, developed by the respective faculty of the School of Dentistry, European University Cyprus. The LLMs’ answers were graded 0 (minimum) to 10 (maximum) points against strong, traditionally collected scientific evidence, such as guidelines and consensus statements, using a rubric, as if they were examination questions posed to students, by 2 experienced faculty members. The scores were statistically compared to identify the best-performing model using the Friedman and Wilcoxon tests. Moreover, the evaluators were asked to provide a qualitative evaluation of the comprehensiveness, scientific accuracy, clarity, and relevance of the LLMs’ answers. Results: Overall, no statistically significant difference was detected between the scores given by the 2 evaluators; therefore, an average score was computed for every LLM. Although ChatGPT-4 statistically outperformed ChatGPT-3.5 (P=.008), Bing Chat (P=.049), and Bard (P=.045), all models occasionally exhibited inaccuracies, generality, outdated content, and a lack of source references. The evaluators noted instances where the LLMs delivered irrelevant information, vague answers, or information that was not fully accurate. Conclusions: This study demonstrates that although LLMs hold promising potential as an aid in the implementation of evidence-based dentistry, their current limitations can lead to potentially harmful health care decisions if not used judiciously. 
Therefore, these tools should not replace the dentist’s critical thinking and in-depth understanding of the subject matter. Further research, clinical validation, and model improvements are necessary for these tools to be fully integrated into dental practice. Dental practitioners must be aware of the limitations of LLMs, as their imprudent use could potentially impact patient care. Regulatory measures should be established to oversee the use of these evolving technologies. %M 38009003 %R 10.2196/51580 %U https://www.jmir.org/2023/1/e51580 %U https://doi.org/10.2196/51580 %U http://www.ncbi.nlm.nih.gov/pubmed/38009003 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51199 %T Empathy and Equity: Key Considerations for Large Language Model Adoption in Health Care %A Koranteng,Erica %A Rao,Arya %A Flores,Efren %A Lev,Michael %A Landman,Adam %A Dreyer,Keith %A Succi,Marc %+ Massachusetts General Hospital, 55 Fruit St, Boston, 02114, United States, 1 617 935 9144, msucci@mgh.harvard.edu %K ChatGPT %K AI %K artificial intelligence %K large language models %K LLMs %K ethics %K empathy %K equity %K bias %K language model %K health care application %K patient care %K care %K development %K framework %K model %K ethical implication %D 2023 %7 28.12.2023 %9 Viewpoint %J JMIR Med Educ %G English %X The growing presence of large language models (LLMs) in health care applications holds significant promise for innovative advancements in patient care. However, concerns about ethical implications and potential biases have been raised by various stakeholders. Here, we evaluate the ethics of LLMs in medicine along 2 key axes: empathy and equity. We outline the importance of these factors in novel models of care and develop frameworks for addressing these alongside LLM deployment. %M 38153778 %R 10.2196/51199 %U https://mededu.jmir.org/2023/1/e51199 %U https://doi.org/10.2196/51199 %U http://www.ncbi.nlm.nih.gov/pubmed/38153778 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48904 %T Differentiating ChatGPT-Generated and Human-Written Medical Texts: Quantitative Study %A Liao,Wenxiong %A Liu,Zhengliang %A Dai,Haixing %A Xu,Shaochen %A Wu,Zihao %A Zhang,Yiyang %A Huang,Xiaoke %A Zhu,Dajiang %A Cai,Hongmin %A Li,Quanzheng %A Liu,Tianming %A Li,Xiang %+ Department of Radiology, Massachusetts General Hospital, 55 Fruit St, Boston, MA, 02114, United States, 1 7062480264, xli60@mgh.harvard.edu %K ChatGPT %K medical ethics %K linguistic analysis %K text classification %K artificial intelligence %K medical texts %K machine learning %D 2023 %7 28.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models, such as ChatGPT, are capable of generating grammatically perfect and human-like text content, and a large number of ChatGPT-generated texts have appeared on the internet. However, medical texts, such as clinical notes and diagnoses, require rigorous validation, and erroneous medical content generated by ChatGPT could potentially lead to disinformation that poses significant harm to health care and the general public. Objective: This study is among the first on responsible artificial intelligence–generated content in medicine. We focus on analyzing the differences between medical texts written by human experts and those generated by ChatGPT and designing machine learning workflows to effectively detect and differentiate medical texts generated by ChatGPT. 
Methods: We first constructed a suite of data sets containing medical texts written by human experts and generated by ChatGPT. We analyzed the linguistic features of these 2 types of content and uncovered differences in vocabulary, parts-of-speech, dependency, sentiment, perplexity, and other aspects. Finally, we designed and implemented machine learning methods to detect medical text generated by ChatGPT. The data and code used in this paper are published on GitHub. Results: Medical texts written by humans were more concrete, more diverse, and typically contained more useful information, while medical texts generated by ChatGPT paid more attention to fluency and logic and usually expressed general terminologies rather than effective information specific to the context of the problem. A bidirectional encoder representations from transformers–based model effectively detected medical texts generated by ChatGPT, and the F1 score exceeded 95%. Conclusions: Although text generated by ChatGPT is grammatically perfect and human-like, the linguistic characteristics of generated medical texts were different from those written by human experts. Medical text generated by ChatGPT could be effectively detected by the proposed machine learning algorithms. This study provides a pathway toward trustworthy and accountable use of large language models in medicine. %M 38153785 %R 10.2196/48904 %U https://mededu.jmir.org/2023/1/e48904 %U https://doi.org/10.2196/48904 %U http://www.ncbi.nlm.nih.gov/pubmed/38153785 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 10 %N %P e43120 %T Democratizing the Development of Chatbots to Improve Public Health: Feasibility Study of COVID-19 Misinformation %A Powell,Leigh %A Nour,Radwa %A Sleibi,Randa %A Al Suwaidi,Hanan %A Zary,Nabil %+ Institute for Excellence in Health Professions Education, Mohammed Bin Rashid University of Medicine and Health Sciences, Building 14, Dubai Healthcare City, PO Box 505055, Dubai, United Arab Emirates, 971 585960762, nabil.zary@mbru.ac.ae %K COVID-19 %K vaccine hesitancy %K infodemic %K chatbot %K motivational interviewing %K social media %K conversational agent %K misinformation %K online health information %K usability study %K vaccine misinformation %D 2023 %7 28.12.2023 %9 Original Paper %J JMIR Hum Factors %G English %X Background: Chatbots enable users to have humanlike conversations on various topics and can vary widely in complexity and functionality. An area of research priority in chatbots is democratizing chatbots to all, removing barriers to entry, such as financial ones, to help make chatbots a possibility for the wider global population to improve access to information, help reduce the digital divide between nations, and improve areas of public good (eg, health communication). Chatbots in this space may help create the potential for improved health outcomes, potentially alleviating some of the burdens on health care providers and systems to be the sole voices of outreach to public health. Objective: This study explored the feasibility of developing a chatbot using approaches that are accessible in low- and middle-resource settings, such as using technology that is low cost, can be developed by nonprogrammers, and can be deployed over social media platforms to reach the broadest-possible audience without the need for a specialized technical team. Methods: This study is presented in 2 parts. 
First, we detailed the design and development of a chatbot, VWise, including the resources used and development considerations for the conversational model. Next, we conducted a case study of 33 participants who engaged in a pilot with our chatbot. We explored the following 3 research questions: (1) Is it feasible to develop and implement a chatbot addressing a public health issue with only minimal resources? (2) What is the participants’ experience with using the chatbot? (3) What kinds of measures of engagement are observed from using the chatbot? Results: A high level of engagement with the chatbot was demonstrated by the large number of participants who stayed with the conversation to its natural end (n=17, 52%), requested to see the free online resource, selected to view all information about a given concern, and returned to have a dialogue about a second concern (n=12, 36%). Conclusions: This study explored the feasibility of and the design and development considerations for a chatbot, VWise. Our early findings from this initial pilot suggest that developing a functioning and low-cost chatbot is feasible, even in low-resource environments. Our results show that low-resource environments can enter the health communication chatbot space using readily available human and technical resources. However, despite these early indicators, many limitations exist in this study and further work with a larger sample size and greater diversity of participants is needed. This study represents early work on a chatbot in its virtual infancy. We hope this study will help provide those who feel chatbot access may be out of reach with a useful guide to enter this space, enabling more democratized access to chatbots for all. %M 37290040 %R 10.2196/43120 %U https://humanfactors.jmir.org/2023/1/e43120 %U https://doi.org/10.2196/43120 %U http://www.ncbi.nlm.nih.gov/pubmed/37290040 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e49239 %T Patient Acceptability of Symptom Screening and Patient Education Using a Chatbot for Autoimmune Inflammatory Diseases: Survey Study %A Tan,Tze Chin %A Roslan,Nur Emillia Binte %A Li,James Weiquan %A Zou,Xinying %A Chen,Xiangmei %A Ratnasari, %A Santosa,Anindita %+ Division of Rheumatology and Immunology, Department of Medicine, Changi General Hospital, 2 Simei Street 3, Level 6, Medical Centre, Singapore, 529889, Singapore, 65 90128379, anindita.santosa@singhealth.com.sg %K conversational agents %K digital technology in medicine %K rheumatology %K early diagnosis %K education %K patient‒physician interactions %K autoimmune rheumatic diseases %K chatbot %K implementation %K patient survey %K digital health intervention %D 2023 %7 28.12.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Chatbots have the potential to enhance health care interaction, satisfaction, and service delivery. However, data regarding their acceptance across diverse patient populations are limited. In-depth studies on the reception of chatbots by patients with chronic autoimmune inflammatory diseases are lacking, although such studies are vital for facilitating the effective integration of chatbots in rheumatology care. Objective: We aim to assess patient perceptions and acceptance of a chatbot designed for autoimmune inflammatory rheumatic diseases (AIIRDs). Methods: We administered a comprehensive survey in an outpatient setting at a top-tier rheumatology referral center. 
The target cohort included patients who interacted with a chatbot explicitly tailored to facilitate diagnosis and obtain information on AIIRDs. Following the RE-AIM (Reach, Effectiveness, Adoption, Implementation and Maintenance) framework, the survey was designed to gauge the effectiveness, user acceptability, and implementation of the chatbot. Results: Between June and October 2022, we received survey responses from 200 patients, with an equal number of 100 initial consultations and 100 follow-up (FU) visits. The mean scores on a 5-point acceptability scale ranged from 4.01 (SD 0.63) to 4.41 (SD 0.54), indicating consistently high ratings across the different aspects of chatbot performance. Multivariate regression analysis indicated that having a FU visit was significantly associated with a greater willingness to reuse the chatbot for symptom determination (P=.01). Further, patients’ comfort with chatbot diagnosis increased significantly after meeting physicians (P<.001). We observed no significant differences in chatbot acceptance according to sex, education level, or diagnosis category. Conclusions: This study underscores that chatbots tailored to AIIRDs have a favorable reception. The inclination of FU patients to engage with the chatbot signifies the possible influence of past clinical encounters and physician affirmation on its use. Although further exploration is required to refine their integration, the prevalent positive perceptions suggest that chatbots have the potential to strengthen the bridge between patients and health care providers, thus enhancing the delivery of rheumatology care to various cohorts. %M 37219234 %R 10.2196/49239 %U https://formative.jmir.org/2023/1/e49239 %U https://doi.org/10.2196/49239 %U http://www.ncbi.nlm.nih.gov/pubmed/37219234 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e51798 %T Exploring the Potential of ChatGPT-4 in Predicting Refractive Surgery Categorizations: Comparative Study %A Ćirković,Aleksandar %A Katz,Toam %+ Care Vision Germany, Ltd, Zeltnerstraße 1-3, Nuremberg, 90443, Germany, 49 9119564950, aleksandar.cirkovic@mailbox.org %K artificial intelligence %K machine learning %K decision support systems %K clinical %K refractive surgical procedures %K risk assessment %K ophthalmology %K health informatics %K predictive modeling %K data analysis %K medical decision-making %K eHealth %K ChatGPT-4 %K ChatGPT %K refractive surgery %K categorization %K AI-powered algorithm %K large language model %K decision-making %D 2023 %7 28.12.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: Refractive surgery research aims to optimally precategorize patients by their suitability for various types of surgery. Recent advances have led to the development of artificial intelligence–powered algorithms, including machine learning approaches, to assess risks and enhance workflow. Large language models (LLMs) like ChatGPT-4 (OpenAI LP) have emerged as potential general artificial intelligence tools that can assist across various disciplines, possibly including refractive surgery decision-making. However, their actual capabilities in precategorizing refractive surgery patients based on real-world parameters remain unexplored. Objective: This exploratory study aimed to validate ChatGPT-4’s capabilities in precategorizing refractive surgery patients based on commonly used clinical parameters. The goal was to assess whether ChatGPT-4’s performance when categorizing batch inputs is comparable to those made by a refractive surgeon. 
A simple binary set of categories (patient suitable for laser refractive surgery or not) as well as a more detailed set were compared. Methods: Data from 100 consecutive patients from a refractive clinic were anonymized and analyzed. Parameters included age, sex, manifest refraction, visual acuity, and various corneal measurements and indices from Scheimpflug imaging. This study compared ChatGPT-4’s performance with a clinician’s categorizations using Cohen κ coefficient, a chi-square test, a confusion matrix, accuracy, precision, recall, F1-score, and receiver operating characteristic area under the curve. Results: A statistically significant noncoincidental accordance was found between ChatGPT-4 and the clinician’s categorizations with a Cohen κ coefficient of 0.399 for 6 categories (95% CI 0.256-0.537) and 0.610 for binary categorization (95% CI 0.372-0.792). The model showed temporal instability and response variability, however. The chi-square test on 6 categories indicated an association between the 2 raters’ distributions (χ²5=94.7, P<.001). Here, the accuracy was 0.68, precision 0.75, recall 0.68, and F1-score 0.70. For 2 categories, the accuracy was 0.88, precision 0.88, recall 0.88, F1-score 0.88, and area under the curve 0.79. Conclusions: This study revealed that ChatGPT-4 exhibits potential as a precategorization tool in refractive surgery, showing promising agreement with clinician categorizations. However, its main limitations include, among others, dependency on solely one human rater, small sample size, the instability and variability of ChatGPT’s (OpenAI LP) output between iterations and nontransparency of the underlying models. The results encourage further exploration into the application of LLMs like ChatGPT-4 in health care, particularly in decision-making processes that require understanding vast clinical data. Future research should focus on defining the model’s accuracy with prompt and vignette standardization, detecting confounding factors, and comparing to other versions of ChatGPT-4 and other LLMs to pave the way for larger-scale validation and real-world implementation. %M 38153777 %R 10.2196/51798 %U https://formative.jmir.org/2023/1/e51798 %U https://doi.org/10.2196/51798 %U http://www.ncbi.nlm.nih.gov/pubmed/38153777 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51229 %T Comparisons of Quality, Correctness, and Similarity Between ChatGPT-Generated and Human-Written Abstracts for Basic Research: Cross-Sectional Study %A Cheng,Shu-Li %A Tsai,Shih-Jen %A Bai,Ya-Mei %A Ko,Chih-Hung %A Hsu,Chih-Wei %A Yang,Fu-Chi %A Tsai,Chia-Kuang %A Tu,Yu-Kang %A Yang,Szu-Nian %A Tseng,Ping-Tao %A Hsu,Tien-Wei %A Liang,Chih-Sung %A Su,Kuan-Pin %+ Department of Psychiatry, E-Da Dachang Hospital, I-Shou University, No. 305, Dachang 1st Rd., Sanmin District, Kaohsiung, 807, Taiwan, 886 7 5599123, s9801101@gmail.com %K ChatGPT %K abstract %K AI-generated scientific content %K plagiarism %K artificial intelligence %K NLP %K natural language processing %K LLM %K language model %K language models %K text %K textual %K generation %K generative %K extract %K extraction %K scientific research %K academic research %K publication %K publications %K abstracts %D 2023 %7 25.12.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT may act as a research assistant to help organize the direction of thinking and summarize research findings. 
However, few studies have examined the quality, similarity (abstracts being similar to the original one), and accuracy of the abstracts generated by ChatGPT when researchers provide full-text basic research papers. Objective: We aimed to assess the applicability of an artificial intelligence (AI) model in generating abstracts for basic preclinical research. Methods: We selected 30 basic research papers from Nature, Genome Biology, and Biological Psychiatry. Excluding abstracts, we inputted the full text into ChatPDF, an application of a language model based on ChatGPT, and we prompted it to generate abstracts with the same style as used in the original papers. A total of 8 experts were invited to evaluate the quality of these abstracts (based on a Likert scale of 0-10) and identify which abstracts were generated by ChatPDF, using a blind approach. These abstracts were also evaluated for their similarity to the original abstracts and the accuracy of the AI content. Results: The quality of ChatGPT-generated abstracts was lower than that of the actual abstracts (10-point Likert scale: mean 4.72, SD 2.09 vs mean 8.09, SD 1.03; P<.001). The difference in quality was significant in the unstructured format (mean difference –4.33; 95% CI –4.79 to –3.86; P<.001) but minimal in the 4-subheading structured format (mean difference –2.33; 95% CI –2.79 to –1.86). Among the 30 ChatGPT-generated abstracts, 3 showed wrong conclusions, and 10 were identified as AI content. The mean percentage of similarity between the original and the generated abstracts was not high (2.10%-4.40%). The blinded reviewers achieved a 93% (224/240) accuracy rate in guessing which abstracts were written using ChatGPT. Conclusions: Using ChatGPT to generate a scientific abstract may not lead to issues of similarity when using real full texts written by humans. However, the quality of the ChatGPT-generated abstracts was suboptimal, and their accuracy was not 100%. %M 38145486 %R 10.2196/51229 %U https://www.jmir.org/2023/1/e51229 %U https://doi.org/10.2196/51229 %U http://www.ncbi.nlm.nih.gov/pubmed/38145486 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50373 %T AI-Enabled Medical Education: Threads of Change, Promising Futures, and Risky Realities Across Four Potential Future Worlds %A Knopp,Michelle I %A Warm,Eric J %A Weber,Danielle %A Kelleher,Matthew %A Kinnear,Benjamin %A Schumacher,Daniel J %A Santen,Sally A %A Mendonça,Eneida %A Turner,Laurah %+ Department of Medical Education, College of Medicine, University of Cincinnati, Cincinnati, OH, United States, 1 5133303999, turnela@ucmail.uc.edu %K artificial intelligence %K medical education %K scenario planning %K future of healthcare %K ethics and AI %K future %K scenario %K ChatGPT %K generative %K GPT-4 %K ethic %K ethics %K ethical %K strategic planning %K Open-AI %K OpenAI %K privacy %K autonomy %K autonomous %D 2023 %7 25.12.2023 %9 Viewpoint %J JMIR Med Educ %G English %X Background: The rapid trajectory of artificial intelligence (AI) development and advancement is quickly outpacing society's ability to determine its future role. As AI continues to transform various aspects of our lives, one critical question arises for medical education: what will be the nature of education, teaching, and learning in a future world where the acquisition, retention, and application of knowledge in the traditional sense are fundamentally altered by AI? 
Objective: The purpose of this perspective is to plan for the intersection of health care and medical education in the future. Methods: We used GPT-4 and scenario-based strategic planning techniques to craft 4 hypothetical future worlds influenced by AI's integration into health care and medical education. This method, used by organizations such as Shell and the Accreditation Council for Graduate Medical Education, assesses readiness for alternative futures and effectively manages uncertainty, risk, and opportunity. The detailed scenarios provide insights into potential environments the medical profession may face and lay the foundation for hypothesis generation and idea-building regarding responsible AI implementation. Results: The following 4 worlds were created using OpenAI’s GPT model: AI Harmony, AI conflict, The world of Ecological Balance, and Existential Risk. Risks include disinformation and misinformation, loss of privacy, widening inequity, erosion of human autonomy, and ethical dilemmas. Benefits involve improved efficiency, personalized interventions, enhanced collaboration, early detection, and accelerated research. Conclusions: To ensure responsible AI use, the authors suggest focusing on 3 key areas: developing a robust ethical framework, fostering interdisciplinary collaboration, and investing in education and training. A strong ethical framework emphasizes patient safety, privacy, and autonomy while promoting equity and inclusivity. Interdisciplinary collaboration encourages cooperation among various experts in developing and implementing AI technologies, ensuring that they address the complex needs and challenges in health care and medical education. Investing in education and training prepares professionals and trainees with necessary skills and knowledge to effectively use and critically evaluate AI technologies. The integration of AI in health care and medical education presents a critical juncture between transformative advancements and significant risks. By working together to address both immediate and long-term risks and consequences, we can ensure that AI integration leads to a more equitable, sustainable, and prosperous future for both health care and medical education. As we engage with AI technologies, our collective actions will ultimately determine the state of the future of health care and medical education to harness AI's power while ensuring the safety and well-being of humanity. 
%M 38145471 %R 10.2196/50373 %U https://mededu.jmir.org/2023/1/e50373 %U https://doi.org/10.2196/50373 %U http://www.ncbi.nlm.nih.gov/pubmed/38145471 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e50865 %T Evaluation of GPT-4’s Chest X-Ray Impression Generation: A Reader Study on Performance and Perception %A Ziegelmayer,Sebastian %A Marka,Alexander W %A Lenhart,Nicolas %A Nehls,Nadja %A Reischl,Stefan %A Harder,Felix %A Sauter,Andreas %A Makowski,Marcus %A Graf,Markus %A Gawlitza,Joshua %+ Department of Diagnostic and Interventional Radiology, School of Medicine & Klinikum rechts der Isar, Technical University of Munich, Ismaninger Straße 22, Munich, 81675, Germany, 49 1759153694, ga89rog@mytum.de %K generative model %K GPT %K medical imaging %K artificial intelligence %K imaging %K radiology %K radiological %K radiography %K diagnostic %K chest %K x-ray %K x-rays %K generative %K multimodal %K impression %K impressions %K image %K images %K AI %D 2023 %7 22.12.2023 %9 Research Letter %J J Med Internet Res %G English %X Exploring the generative capabilities of the multimodal GPT-4, our study uncovered significant differences between radiological assessments and automatic evaluation metrics for chest x-ray impression generation and revealed radiological bias. %M 38133918 %R 10.2196/50865 %U https://www.jmir.org/2023/1/e50865 %U https://doi.org/10.2196/50865 %U http://www.ncbi.nlm.nih.gov/pubmed/38133918 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51302 %T Medical Student Experiences and Perceptions of ChatGPT and Artificial Intelligence: Cross-Sectional Study %A Alkhaaldi,Saif M I %A Kassab,Carl H %A Dimassi,Zakia %A Oyoun Alsoud,Leen %A Al Fahim,Maha %A Al Hageh,Cynthia %A Ibrahim,Halah %+ Department of Medical Science, Khalifa University College of Medicine and Health Sciences, PO Box 127788, Abu Dhabi, United Arab Emirates, 971 23125423, halah.ibrahim@ku.ac.ae %K medical education %K ChatGPT %K artificial intelligence %K large language models %K LLMs %K AI %K medical student %K medical students %K cross-sectional study %K training %K technology %K medicine %K health care professionals %K risk %K technology %K education %D 2023 %7 22.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) has the potential to revolutionize the way medicine is learned, taught, and practiced, and medical education must prepare learners for these inevitable changes. Academic medicine has, however, been slow to embrace recent AI advances. Since its launch in November 2022, ChatGPT has emerged as a fast and user-friendly large language model that can assist health care professionals, medical educators, students, trainees, and patients. While many studies focus on the technology’s capabilities, potential, and risks, there is a gap in studying the perspective of end users. Objective: The aim of this study was to gauge the experiences and perspectives of graduating medical students on ChatGPT and AI in their training and future careers. Methods: A cross-sectional web-based survey of recently graduated medical students was conducted in an international academic medical center between May 5, 2023, and June 13, 2023. Descriptive statistics were used to tabulate variable frequencies. Results: Of 325 applicants to the residency programs, 265 completed the survey (an 81.5% response rate). 
The vast majority of respondents denied using ChatGPT in medical school, with 20.4% (n=54) using it to help complete written assessments and only 9.4% using the technology in their clinical work (n=25). More students planned to use it during residency, primarily for exploring new medical topics and research (n=168, 63.4%) and exam preparation (n=151, 57%). Male students were significantly more likely to believe that AI will improve diagnostic accuracy (n=47, 51.7% vs n=69, 39.7%; P=.001), reduce medical error (n=53, 58.2% vs n=71, 40.8%; P=.002), and improve patient care (n=60, 65.9% vs n=95, 54.6%; P=.007). Previous experience with AI was significantly associated with positive AI perception in terms of improving patient care, decreasing medical errors and misdiagnoses, and increasing the accuracy of diagnoses (P=.001, P<.001, P=.008, respectively). Conclusions: The surveyed medical students had minimal formal and informal experience with AI tools and limited perceptions of the potential uses of AI in health care but had overall positive views of ChatGPT and AI and were optimistic about the future of AI in medical education and health care. Structured curricula and formal policies and guidelines are needed to adequately prepare medical learners for the forthcoming integration of AI in medicine. %M 38133911 %R 10.2196/51302 %U https://mededu.jmir.org/2023/1/e51302 %U https://doi.org/10.2196/51302 %U http://www.ncbi.nlm.nih.gov/pubmed/38133911 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50658 %T Using ChatGPT for Clinical Practice and Medical Education: Cross-Sectional Survey of Medical Students’ and Physicians’ Perceptions %A Tangadulrat,Pasin %A Sono,Supinya %A Tangtrakulwanich,Boonsin %+ Department of Orthopedics, Faculty of Medicine, Prince of Songkla University, Floor 9 Rattanacheewarak Building, 15 Kanchanavanich Rd, Hatyai, 90110, Thailand, 66 74451601, boonsin.b@psu.ac.th %K ChatGPT %K AI %K artificial intelligence %K medical education %K medical students %K student %K students %K intern %K interns %K resident %K residents %K knee osteoarthritis %K survey %K surveys %K questionnaire %K questionnaires %K chatbot %K chatbots %K conversational agent %K conversational agents %K attitude %K attitudes %K opinion %K opinions %K perception %K perceptions %K perspective %K perspectives %K acceptance %D 2023 %7 22.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a well-known large language model–based chatbot. It could be used in the medical field in many aspects. However, some physicians are still unfamiliar with ChatGPT and are concerned about its benefits and risks. Objective: We aim to evaluate the perception of physicians and medical students toward using ChatGPT in the medical field. Methods: A web-based questionnaire was sent to medical students, interns, residents, and attending staff with questions regarding their perception toward using ChatGPT in clinical practice and medical education. Participants were also asked to rate their perception of ChatGPT’s generated response about knee osteoarthritis. Results: Participants included 124 medical students, 46 interns, 37 residents, and 32 attending staff. After reading ChatGPT’s response, 132 of the 239 (55.2%) participants had a positive rating about using ChatGPT for clinical practice. The proportion of positive answers was significantly lower in graduated physicians (48/115, 42%) compared with medical students (84/124, 68%; P<.001). 
Participants listed a lack of a patient-specific treatment plan, updated evidence, and a language barrier as ChatGPT’s pitfalls. Regarding using ChatGPT for medical education, the proportion of positive responses was also significantly lower in graduate physicians (71/115, 62%) compared to medical students (103/124, 83.1%; P<.001). Participants were concerned that ChatGPT’s response was too superficial, might lack scientific evidence, and might need expert verification. Conclusions: Medical students generally had a positive perception of using ChatGPT for guiding treatment and medical education, whereas graduated doctors were more cautious in this regard. Nonetheless, both medical students and graduated doctors positively perceived using ChatGPT for creating patient educational materials. %M 38133908 %R 10.2196/50658 %U https://mededu.jmir.org/2023/1/e50658 %U https://doi.org/10.2196/50658 %U http://www.ncbi.nlm.nih.gov/pubmed/38133908 %0 Journal Article %@ 2291-9694 %I JMIR Publications %V 11 %N %P e53785 %T Introducing the “AI Language Models in Health Care” Section: Actionable Strategies for Targeted and Wide-Scale Deployment %A Castonguay,Alexandre %A Lovis,Christian %+ Faculté des sciences infirmières, Université de Montréal, 2375, chemin de la Côte-Sainte-Catherine, Montréal, QC, H3T1A8, Canada, alexandre.castonguay.2@umontreal.ca %K generative AI %K health care digitalization %K AI in health care %K digital health standards %K AI implementation %K artificial intelligence %D 2023 %7 21.12.2023 %9 Editorial %J JMIR Med Inform %G English %X The realm of health care is on the cusp of a significant technological leap, courtesy of the advancements in artificial intelligence (AI) language models, but ensuring the ethical design, deployment, and use of these technologies is imperative to truly realize their potential in improving health care delivery and promoting human well-being and safety. Indeed, these models have demonstrated remarkable prowess in generating humanlike text, evidenced by a growing body of research and real-world applications. This capability paves the way for enhanced patient engagement, clinical decision support, and a plethora of other applications that were once considered beyond reach. However, the journey from potential to real-world application is laden with challenges ranging from ensuring reliability and transparency to navigating a complex regulatory landscape. There is still a need for comprehensive evaluation and rigorous validation to ensure that these models are reliable, transparent, and ethically sound. This editorial introduces the new section, titled “AI Language Models in Health Care.” This section seeks to create a platform for academics, practitioners, and innovators to share their insights, research findings, and real-world applications of AI language models in health care. The aim is to foster a community that is not only excited about the possibilities but also critically engaged with the ethical, practical, and regulatory challenges that lie ahead. 
%M 38127431 %R 10.2196/53785 %U https://medinform.jmir.org/2023/1/e53785 %U https://doi.org/10.2196/53785 %U http://www.ncbi.nlm.nih.gov/pubmed/38127431 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e47217 %T Evaluation of the Current State of Chatbots for Digital Health: Scoping Review %A Xue,Jia %A Zhang,Bolun %A Zhao,Yaxi %A Zhang,Qiaoru %A Zheng,Chengda %A Jiang,Jielin %A Li,Hanjia %A Liu,Nian %A Li,Ziqian %A Fu,Weiying %A Peng,Yingdong %A Logan,Judith %A Zhang,Jingwen %A Xiang,Xiaoling %+ Factor Inwentash Faculty of Social Work, University of Toronto, 246 bloor street, Toronto, ON, M5S 1V4, Canada, 1 416 946 5429, jia.xue@utoronto.ca %K artificial intelligence %K chatbot %K health %K mental health %K suicide %K suicidal %K conversational capacity %K relational capacity %K personalization %K in-app reviews %K experience %K experiences %K scoping %K review methods %K review methodology %K chatbots %K conversational agent %K conversational agents %D 2023 %7 19.12.2023 %9 Review %J J Med Internet Res %G English %X Background: Chatbots have become ubiquitous in our daily lives, enabling natural language conversations with users through various modes of communication. Chatbots have the potential to play a significant role in promoting health and well-being. As the number of studies and available products related to chatbots continues to rise, there is a critical need to assess product features to enhance the design of chatbots that effectively promote health and behavioral change. Objective: This scoping review aims to provide a comprehensive assessment of the current state of health-related chatbots, including the chatbots’ characteristics and features, user backgrounds, communication models, relational building capacity, personalization, interaction, responses to suicidal thoughts, and users’ in-app experiences during chatbot use. Through this analysis, we seek to identify gaps in the current research, guide future directions, and enhance the design of health-focused chatbots. Methods: Following the scoping review methodology by Arksey and O'Malley and guided by the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews) checklist, this study used a two-pronged approach to identify relevant chatbots: (1) searching the iOS and Android App Stores and (2) reviewing scientific literature through a search strategy designed by a librarian. Overall, 36 chatbots were selected based on predefined criteria from both sources. These chatbots were systematically evaluated using a comprehensive framework developed for this study, including chatbot characteristics, user backgrounds, building relational capacity, personalization, interaction models, responses to critical situations, and user experiences. Ten coauthors were responsible for downloading and testing the chatbots, coding their features, and evaluating their performance in simulated conversations. The testing of all chatbot apps was limited to their free-to-use features. Results: This review provides an overview of the diversity of health-related chatbots, encompassing categories such as mental health support, physical activity promotion, and behavior change interventions. Chatbots use text, animations, speech, images, and emojis for communication. The findings highlight variations in conversational capabilities, including empathy, humor, and personalization. Notably, concerns regarding safety, particularly in addressing suicidal thoughts, were evident. 
Approximately 44% (16/36) of the chatbots effectively addressed suicidal thoughts. User experiences and behavioral outcomes demonstrated the potential of chatbots in health interventions, but evidence remains limited. Conclusions: This scoping review underscores the significance of chatbots in health-related applications and offers insights into their features, functionalities, and user experiences. This study contributes to advancing the understanding of chatbots’ role in digital health interventions, thus paving the way for more effective and user-centric health promotion strategies. This study informs future research directions, emphasizing the need for rigorous randomized control trials, standardized evaluation metrics, and user-centered design to unlock the full potential of chatbots in enhancing health and well-being. Future research should focus on addressing limitations, exploring real-world user experiences, and implementing robust data security and privacy measures. %M 38113097 %R 10.2196/47217 %U https://www.jmir.org/2023/1/e47217 %U https://doi.org/10.2196/47217 %U http://www.ncbi.nlm.nih.gov/pubmed/38113097 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49771 %T Potential and Limitations of ChatGPT 3.5 and 4.0 as a Source of COVID-19 Information: Comprehensive Comparative Analysis of Generative and Authoritative Information %A Wang,Guoyong %A Gao,Kai %A Liu,Qianyang %A Wu,Yuxin %A Zhang,Kaijun %A Zhou,Wei %A Guo,Chunbao %+ Women and Children's Hospital, Chongqing Medical University, No 120 Longshan Road, Longshan Street, Yubei District, Chongqing, 400010, China, 86 023 60354300, guochunbao@foxmail.com %K ChatGPT 3.5 %K ChatGPT 4.0 %K artificial intelligence %K AI %K COVID-19 %K pandemic %K public health %K information retrieval %D 2023 %7 14.12.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: The COVID-19 pandemic, caused by the SARS-CoV-2 virus, has necessitated reliable and authoritative information for public guidance. The World Health Organization (WHO) has been a primary source of such information, disseminating it through a question and answer format on its official website. Concurrently, ChatGPT 3.5 and 4.0, a deep learning-based natural language generation system, has shown potential in generating diverse text types based on user input. Objective: This study evaluates the accuracy of COVID-19 information generated by ChatGPT 3.5 and 4.0, assessing its potential as a supplementary public information source during the pandemic. Methods: We extracted 487 COVID-19–related questions from the WHO’s official website and used ChatGPT 3.5 and 4.0 to generate corresponding answers. These generated answers were then compared against the official WHO responses for evaluation. Two clinical experts scored the generated answers on a scale of 0-5 across 4 dimensions—accuracy, comprehensiveness, relevance, and clarity—with higher scores indicating better performance in each dimension. The WHO responses served as the reference for this assessment. Additionally, we used the BERT (Bidirectional Encoder Representations from Transformers) model to generate similarity scores (0-1) between the generated and official answers, providing a dual validation mechanism. Results: The mean (SD) scores for ChatGPT 3.5–generated answers were 3.47 (0.725) for accuracy, 3.89 (0.719) for comprehensiveness, 4.09 (0.787) for relevance, and 3.49 (0.809) for clarity. 
For ChatGPT 4.0, the mean (SD) scores were 4.15 (0.780), 4.47 (0.641), 4.56 (0.600), and 4.09 (0.698), respectively. All differences were statistically significant (P<.001), with ChatGPT 4.0 outperforming ChatGPT 3.5. The BERT model verification showed mean (SD) similarity scores of 0.83 (0.07) for ChatGPT 3.5 and 0.85 (0.07) for ChatGPT 4.0 compared with the official WHO answers. Conclusions: ChatGPT 3.5 and 4.0 can generate accurate and relevant COVID-19 information to a certain extent. However, compared with official WHO responses, gaps and deficiencies exist. Thus, users of ChatGPT 3.5 and 4.0 should also reference other reliable information sources to mitigate potential misinformation risks. Notably, ChatGPT 4.0 outperformed ChatGPT 3.5 across all evaluated dimensions, a finding corroborated by BERT model validation. %M 38096014 %R 10.2196/49771 %U https://www.jmir.org/2023/1/e49771 %U https://doi.org/10.2196/49771 %U http://www.ncbi.nlm.nih.gov/pubmed/38096014 %0 Journal Article %@ 2291-5222 %I JMIR Publications %V 11 %N %P e43105 %T Effects of User-Reported Risk Factors and Follow-Up Care Activities on Satisfaction With a COVID-19 Chatbot: Cross-Sectional Study %A Singh,Akanksha %A Schooley,Benjamin %A Patel,Nitin %+ IT & Cybersecurity, Department of Electrical and Computer Engineering, Brigham Young University, 240 Engineering Building, Provo, UT, 84602, United States, 1 8014220027, Ben_Schooley@byu.edu %K patient engagement %K chatbot %K population health %K health recommender systems %K conversational recommender systems %K design factors %K COVID-19 %D 2023 %7 14.12.2023 %9 Original Paper %J JMIR Mhealth Uhealth %G English %X Background: The COVID-19 pandemic influenced many to consider methods to reduce human contact and ease the burden placed on health care workers. Conversational agents or chatbots are a set of technologies that may aid with these challenges. They may provide useful interactions for users, potentially reducing the health care worker burden while increasing user satisfaction. Research aims to understand these potential impacts of chatbots and conversational recommender systems and their associated design features. Objective: The objective of this study was to evaluate user perceptions of the helpfulness of an artificial intelligence chatbot that was offered free to the public in response to COVID-19. The chatbot engaged patients and provided educational information and the opportunity to report symptoms, understand personal risks, and receive referrals for care. Methods: A cross-sectional study design was used to analyze 82,222 chats collected from patients in South Carolina seeking services from the Prisma Health system. Chi-square tests and multinomial logistic regression analyses were conducted to assess the relationship between reported risk factors and perceived chat helpfulness using chats started between April 24, 2020, and April 21, 2022. Results: A total of 82,222 chat series were started with at least one question or response on record; 53,805 symptom checker questions with at least one COVID-19–related activity series were completed, with 5191 individuals clicking further to receive a virtual video visit and 2215 clicking further to make an appointment with a local physician. 
Patients who were aged >65 years (P<.001), reported comorbidities (P<.001), had been in contact with a person with COVID-19 in the last 14 days (P<.001), and responded to symptom checker questions that placed them at a higher risk of COVID-19 (P<.001) were 1.8 times more likely to report the chat as helpful than those who reported lower risk factors. Users who engaged with the chatbot to conduct a series of activities were more likely to find the chat helpful (P<.001), including seeking COVID-19 information (3.97-4.07 times), in-person appointments (2.46-1.99 times), telehealth appointments with a nearby provider (2.48-1.9 times), or vaccination (2.9-3.85 times) compared with those who did not perform any of these activities. Conclusions: Chatbots that are designed to target high-risk user groups and provide relevant actionable items may be perceived as a helpful approach to early contact with the health system for assessing communicable disease symptoms and follow-up care options at home before virtual or in-person contact with health care providers. The results identified and validated significant design factors for conversational recommender systems, including triangulating a high-risk target user population and providing relevant actionable items for users to choose from as part of user engagement. %M 38096007 %R 10.2196/43105 %U https://mhealth.jmir.org/2023/1/e43105 %U https://doi.org/10.2196/43105 %U http://www.ncbi.nlm.nih.gov/pubmed/38096007 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 6 %N %P e49889 %T The Accuracy and Appropriateness of ChatGPT Responses on Nonmelanoma Skin Cancer Information Using Zero-Shot Chain of Thought Prompting %A O'Hagan,Ross %A Poplausky,Dina %A Young,Jade N %A Gulati,Nicholas %A Levoska,Melissa %A Ungar,Benjamin %A Ungar,Jonathan %+ Department of Dermatology, Icahn School of Medicine at Mount Sinai, 5th Floor, 5 East 98th Street, New York, NY, 10029, United States, 1 212 241 3288, jonathan.ungar@mountsinai.org %K ChatGPT %K artificial intelligence %K large language models %K nonmelanoma skin %K skin cancer %K cell carcinoma %K chatbot %K dermatology %K dermatologist %K epidermis %K dermis %K oncology %K cancer %D 2023 %7 14.12.2023 %9 Research Letter %J JMIR Dermatol %G English %X %M 38096013 %R 10.2196/49889 %U https://derma.jmir.org/2023/1/e49889 %U https://doi.org/10.2196/49889 %U http://www.ncbi.nlm.nih.gov/pubmed/38096013 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 12 %N %P e53556 %T AI Conversational Agent to Improve Varenicline Adherence: Protocol for a Mixed Methods Feasibility Study %A Minian,Nadia %A Mehra,Kamna %A Earle,Mackenzie %A Hafuth,Sowsan %A Ting-A-Kee,Ryan %A Rose,Jonathan %A Veldhuizen,Scott %A Zawertailo,Laurie %A Ratto,Matt %A Melamed,Osnat C %A Selby,Peter %+ INTREPID Lab, Centre for Addiction and Mental Health, 1025 Queen Street West, Toronto, ON, M6J1H1, Canada, 1 4165358501 ext 77420, nadia.minian2@camh.ca %K evaluation %K health bot %K medication adherence %K smoking cessation %K varenicline %K artificial intelligence %K AI %D 2023 %7 11.12.2023 %9 Protocol %J JMIR Res Protoc %G English %X Background: Varenicline is a pharmacological intervention for tobacco dependence that is safe and effective in facilitating smoking cessation. Enhanced adherence to varenicline augments the probability of prolonged smoking abstinence. However, research has shown that one-third of people who use varenicline are nonadherent by the second week. 
There is evidence showing that behavioral support helps with medication adherence. We have designed an artificial intelligence (AI) conversational agent, or health bot, called “ChatV,” which draws on evidence about effective behavioral support and on information about varenicline to provide this support. ChatV is an evidence-based, patient- and health care provider–informed health bot to improve adherence to varenicline. ChatV has been programmed to provide medication reminders, answer questions about varenicline and smoking cessation, and track medication intake and the number of cigarettes. Objective: This study aims to explore the feasibility of the ChatV health bot, to examine if it is used as intended, and to determine the appropriateness of proceeding with a randomized controlled trial. Methods: We will conduct a mixed methods feasibility study where we will pilot-test ChatV with 40 participants. Participants will be provided with a standard 12-week varenicline regimen and access to ChatV. Passive data collection will include adoption measures (how often participants use the chatbot, which features they use, when they use them, etc). In addition, participants will complete questionnaires (at 1, 4, 8, and 12 weeks) assessing self-reported smoking status and varenicline adherence, as well as questions regarding the acceptability, appropriateness, and usability of the chatbot, and participate in an interview assessing acceptability, appropriateness, fidelity, and adoption. We will use “stop, amend, and go” progression criteria for pilot studies to decide if a randomized controlled trial is a reasonable next step and what modifications are required. A health equity lens will be adopted during participant recruitment and data analysis to understand and address the differences in uptake and use of this digital health solution among diverse sociodemographic groups. The taxonomy of implementation outcomes will be used to assess feasibility, that is, acceptability, appropriateness, fidelity, adoption, and usability. In addition, medication adherence and smoking cessation will be measured to assess the preliminary treatment effect. Interview data will be analyzed using the framework analysis method. Results: Participant enrollment for the study will begin in January 2024. Conclusions: By using predetermined progression criteria, the results of this preliminary study will inform the determination of whether to advance toward a larger randomized controlled trial to test the effectiveness of the health bot. Additionally, this study will explore the acceptability, appropriateness, fidelity, adoption, and usability of the health bot. These insights will be instrumental in refining the intervention and the health bot.
Trial Registration: ClinicalTrials.gov NCT05997901; https://classic.clinicaltrials.gov/ct2/show/NCT05997901 International Registered Report Identifier (IRRID): PRR1-10.2196/53556 %M 38079201 %R 10.2196/53556 %U https://www.researchprotocols.org/2023/1/e53556 %U https://doi.org/10.2196/53556 %U http://www.ncbi.nlm.nih.gov/pubmed/38079201 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e52091 %T The Impact of Generative Conversational Artificial Intelligence on the Lesbian, Gay, Bisexual, Transgender, and Queer Community: Scoping Review %A Bragazzi,Nicola Luigi %A Crapanzano,Andrea %A Converti,Manlio %A Zerbetto,Riccardo %A Khamisy-Farah,Rola %+ Laboratory for Industrial and Applied Mathematics, Department of Mathematics and Statistics, York University, 4700 Keele Street, Toronto, ON, M3J 1P3, Canada, 1 416 736 2100, robertobragazzi@gmail.com %K generative conversational artificial intelligence %K chatbot %K lesbian, gay, bisexual, transgender, and queer community %K LGBTQ %K scoping review %K mobile phone %D 2023 %7 6.12.2023 %9 Review %J J Med Internet Res %G English %X Background: Despite recent significant strides toward acceptance, inclusion, and equality, members of the lesbian, gay, bisexual, transgender, and queer (LGBTQ) community still face alarming mental health disparities, being almost 3 times more likely to experience depression, anxiety, and suicidal thoughts than their heterosexual counterparts. These unique psychological challenges are due to discrimination, stigmatization, and identity-related struggles and can potentially benefit from generative conversational artificial intelligence (AI). As the latest advancement in AI, conversational agents and chatbots can imitate human conversation and support mental health, fostering diversity and inclusivity, combating stigma, and countering discrimination. In contrast, if not properly designed, they can perpetuate exclusion and inequities. Objective: This study aims to examine the impact of generative conversational AI on the LGBTQ community. Methods: This study was designed as a scoping review. Four electronic scholarly databases (Scopus, Embase, Web of Science, and MEDLINE via PubMed) and gray literature (Google Scholar) were consulted from inception without any language restrictions. Original studies focusing on the LGBTQ community or counselors working with this community exposed to chatbots and AI-enhanced internet-based platforms and exploring the feasibility, acceptance, or effectiveness of AI-enhanced tools were deemed eligible. The findings were reported in accordance with the PRISMA-ScR (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for Scoping Reviews). Results: Seven applications (HIVST-Chatbot, TelePrEP Navigator, Amanda Selfie, Crisis Contact Simulator, REALbot, Tough Talks, and Queer AI) were included and reviewed. The chatbots and internet-based assistants identified served various purposes: (1) to identify LGBTQ individuals at risk of suicide or contracting HIV or other sexually transmitted infections, (2) to provide resources to LGBTQ youth from underserved areas, (3) to facilitate HIV status disclosure to sex partners, and (4) to develop training role-play personas encompassing the diverse experiences and intersecting identities of LGBTQ youth to educate counselors. The use of generative conversational AI for the LGBTQ community is still in its early stages.
Initial studies have found that deploying chatbots is feasible and well received, with high ratings for usability and user satisfaction. However, there is room for improvement in terms of the content provided and making conversations more engaging and interactive. Many of these studies used small sample sizes and short-term interventions measuring limited outcomes. Conclusions: Generative conversational AI holds promise, but further development and formal evaluation are needed, including studies with larger samples, longer interventions, and randomized trials to compare different content, delivery methods, and dissemination platforms. In addition, a focus on engagement with behavioral objectives is essential to advance this field. The findings have broad practical implications, highlighting that AI’s impact spans various aspects of people’s lives. Assessing AI’s impact on diverse communities and adopting diversity-aware and intersectional approaches can help shape AI’s positive impact on society as a whole. %M 37864350 %R 10.2196/52091 %U https://www.jmir.org/2023/1/e52091 %U https://doi.org/10.2196/52091 %U http://www.ncbi.nlm.nih.gov/pubmed/37864350 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e52202 %T Performance Comparison of ChatGPT-4 and Japanese Medical Residents in the General Medicine In-Training Examination: Comparison Study %A Watari,Takashi %A Takagi,Soshi %A Sakaguchi,Kota %A Nishizaki,Yuji %A Shimizu,Taro %A Yamamoto,Yu %A Tokuda,Yasuharu %+ Department of Medicine, University of Michigan Medical School, 2215 Fuller Road, Ann Arbor, MI, 48105, United States, 1 734 769 7100, wataritari@gmail.com %K ChatGPT %K artificial intelligence %K medical education %K clinical training %K non-English language %K ChatGPT-4 %K Japan %K Japanese %K Asia %K Asian %K exam %K examination %K exams %K examinations %K NLP %K natural language processing %K LLM %K language model %K language models %K performance %K response %K responses %K answer %K answers %K chatbot %K chatbots %K conversational agent %K conversational agents %K reasoning %K clinical %K GM-ITE %K self-assessment %K residency programs %D 2023 %7 6.12.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The reliability of GPT-4, a state-of-the-art expansive language model specializing in clinical reasoning and medical knowledge, remains largely unverified across non-English languages. Objective: This study aims to compare fundamental clinical competencies between Japanese residents and GPT-4 by using the General Medicine In-Training Examination (GM-ITE). Methods: We used the GPT-4 model provided by OpenAI and the GM-ITE examination questions for the years 2020, 2021, and 2022 to conduct a comparative analysis. This analysis focused on evaluating the performance of individuals who were concluding their second year of residency in comparison to that of GPT-4. Given the current abilities of GPT-4, our study included only single-choice exam questions, excluding those involving audio, video, or image data. The assessment included 4 categories: general theory (professionalism and medical interviewing), symptomatology and clinical reasoning, physical examinations and clinical procedures, and specific diseases. Additionally, we categorized the questions into 7 specialty fields and 3 levels of difficulty, which were determined based on residents’ correct response rates. 
Results: Upon examination of 137 GM-ITE questions in Japanese, GPT-4 scores were significantly higher than the mean scores of residents (residents: 55.8%, GPT-4: 70.1%; P<.001). In terms of specific disciplines, GPT-4 scored 23.5 points higher in the “specific diseases,” 30.9 points higher in “obstetrics and gynecology,” and 26.1 points higher in “internal medicine.” In contrast, GPT-4 scores in “medical interviewing and professionalism,” “general practice,” and “psychiatry” were lower than those of the residents, although this discrepancy was not statistically significant. Upon analyzing scores based on question difficulty, GPT-4 scores were 17.2 points lower for easy problems (P=.007) but were 25.4 and 24.4 points higher for normal and difficult problems, respectively (P<.001). In year-on-year comparisons, GPT-4 scores were 21.7 and 21.5 points higher in the 2020 (P=.01) and 2022 (P=.003) examinations, respectively, but only 3.5 points higher in the 2021 examinations (no significant difference). Conclusions: In the Japanese language, GPT-4 also outperformed the average medical residents in the GM-ITE test, originally designed for them. Specifically, GPT-4 demonstrated a tendency to score higher on difficult questions with low resident correct response rates and those demanding a more comprehensive understanding of diseases. However, GPT-4 scored comparatively lower on questions that residents could readily answer, such as those testing attitudes toward patients and professionalism, as well as those necessitating an understanding of context and communication. These findings highlight the strengths and limitations of artificial intelligence applications in medical education and practice. %M 38055323 %R 10.2196/52202 %U https://mededu.jmir.org/2023/1/e52202 %U https://doi.org/10.2196/52202 %U http://www.ncbi.nlm.nih.gov/pubmed/38055323 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51603 %T How Can the Clinical Aptitude of AI Assistants Be Assayed? %A Thirunavukarasu,Arun James %+ Oxford University Clinical Academic Graduate School, University of Oxford, John Radcliffe Hospital, Level 3, Oxford, OX3 9DU, United Kingdom, 44 1865 289 467, ajt205@cantab.ac.uk %K artificial intelligence %K AI %K validation %K clinical decision aid %K artificial general intelligence %K foundation models %K large language models %K LLM %K language model %K ChatGPT %K chatbot %K chatbots %K conversational agent %K conversational agents %K pitfall %K pitfalls %K pain point %K pain points %K implementation %K barrier %K barriers %K challenge %K challenges %D 2023 %7 5.12.2023 %9 Viewpoint %J J Med Internet Res %G English %X Large language models (LLMs) are exhibiting remarkable performance in clinical contexts, with exemplar results ranging from expert-level attainment in medical examination questions to superior accuracy and relevance when responding to patient queries compared to real doctors replying to queries on social media. The deployment of LLMs in conventional health care settings is yet to be reported, and there remains an open question as to what evidence should be required before such deployment is warranted. Early validation studies use unvalidated surrogate variables to represent clinical aptitude, and it may be necessary to conduct prospective randomized controlled trials to justify the use of an LLM for clinical advice or assistance, as potential pitfalls and pain points cannot be exhaustively predicted. 
This viewpoint states that as LLMs continue to revolutionize the field, there is an opportunity to improve the rigor of artificial intelligence (AI) research to reward innovation, conferring real benefits to real patients. %M 38051572 %R 10.2196/51603 %U https://www.jmir.org/2023/1/e51603 %U https://doi.org/10.2196/51603 %U http://www.ncbi.nlm.nih.gov/pubmed/38051572 %0 Journal Article %@ 1929-0748 %I JMIR Publications %V 12 %N %P e51873 %T Usability and Efficacy of Artificial Intelligence Chatbots (ChatGPT) for Health Sciences Students: Protocol for a Crossover Randomized Controlled Trial %A Veras,Mirella %A Dyer,Joseph-Omer %A Rooney,Morgan %A Barros Silva,Paulo Goberlânio %A Rutherford,Derek %A Kairy,Dahlia %+ Health Sciences, Carleton University, 1125 Colonel By Drive, Ottawa, ON, K1S 5B6, Canada, 1 613 520 2600, mirella.veras@carleton.ca %K artificial intelligence %K AI %K health sciences %K usability %K learning outcomes %K perceptions %K OpenAI %K ChatGPT %K education %K randomized controlled trial %K RCT %K crossover RCT %D 2023 %7 24.11.2023 %9 Protocol %J JMIR Res Protoc %G English %X Background: The integration of artificial intelligence (AI) into health sciences students’ education holds significant importance. The rapid advancement of AI has opened new horizons in scientific writing and has the potential to reshape human-technology interactions. AI in education may impact critical thinking, leading to unintended consequences that need to be addressed. Understanding the implications of AI adoption in education is essential for ensuring its responsible and effective use, empowering health sciences students to navigate AI-driven technologies’ evolving field with essential knowledge and skills. Objective: This study aims to provide details on the study protocol and the methods used to investigate the usability and efficacy of ChatGPT, a large language model. The primary focus is on assessing its role as a supplementary learning tool for improving learning processes and outcomes among undergraduate health sciences students, with a specific emphasis on chronic diseases. Methods: This single-blinded, crossover, randomized, controlled trial is part of a broader mixed methods study, and the primary emphasis of this paper is on the quantitative component of the overall research. A total of 50 students will be recruited for this study. The alternative hypothesis posits that there will be a significant difference in learning outcomes and technology usability between students using ChatGPT (group A) and those using standard web-based tools (group B) to access resources and complete assignments. Participants will be allocated to sequence AB or BA in a 1:1 ratio using computer-generated randomization. Both arms include students’ participation in a writing assignment intervention, with a washout period of 21 days between interventions. The primary outcome is the measure of the technology usability and effectiveness of ChatGPT, whereas the secondary outcome is the measure of students’ perceptions and experiences with ChatGPT as a learning tool. Outcome data will be collected up to 24 hours after the interventions. Results: This study aims to understand the potential benefits and challenges of incorporating AI as an educational tool, particularly in the context of student learning. The findings are expected to identify critical areas that need attention and help educators develop a deeper understanding of AI’s impact on the educational field. 
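As an illustration of the 1:1 computer-generated allocation to the AB and BA crossover sequences described in this protocol, here is a minimal sketch using randomly permuted blocks; the block size and seed are assumptions for illustration, not details taken from the study.

```python
import random

def allocate_sequences(n_participants: int, block_size: int = 4, seed: int = 2024) -> list[str]:
    """Allocate participants 1:1 to crossover sequences AB or BA using permuted blocks."""
    rng = random.Random(seed)
    allocations: list[str] = []
    while len(allocations) < n_participants:
        block = ["AB"] * (block_size // 2) + ["BA"] * (block_size // 2)
        rng.shuffle(block)  # random order within each block keeps groups balanced
        allocations.extend(block)
    return allocations[:n_participants]

print(allocate_sequences(50)[:10])  # first 10 of the 50 planned participants
```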
Conclusions: By exploring the differences in usability and efficacy between ChatGPT and conventional web-based tools, this study seeks to inform educators and students about the responsible integration of AI into academic settings, with a specific focus on health sciences education. Trial Registration: ClinicalTrials.gov NCT05963802; https://clinicaltrials.gov/study/NCT05963802 International Registered Report Identifier (IRRID): PRR1-10.2196/51873 %M 37999958 %R 10.2196/51873 %U https://www.researchprotocols.org/2023/1/e51873 %U https://doi.org/10.2196/51873 %U http://www.ncbi.nlm.nih.gov/pubmed/37999958 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47274 %T The Intersection of ChatGPT, Clinical Medicine, and Medical Education %A Wong,Rebecca Shin-Yee %A Ming,Long Chiau %A Raja Ali,Raja Affendi %+ School of Medical and Life Sciences, Sunway University, No 5, Jalan Universiti, Bandar Sunway, Selangor, 47500, Malaysia, 60 374918622 ext 7452, longchiauming@gmail.com %K ChatGPT %K clinical research %K large language model %K artificial intelligence %K ethical considerations %K AI %K OpenAI %D 2023 %7 21.11.2023 %9 Viewpoint %J JMIR Med Educ %G English %X As we progress deeper into the digital age, the robust development and application of advanced artificial intelligence (AI) technology, specifically generative language models like ChatGPT (OpenAI), have potential implications in all sectors including medicine. This viewpoint article aims to present the authors’ perspective on the integration of AI models such as ChatGPT in clinical medicine and medical education. The unprecedented capacity of ChatGPT to generate human-like responses, refined through Reinforcement Learning with Human Feedback, could significantly reshape the pedagogical methodologies within medical education. Through a comprehensive review and the authors’ personal experiences, this viewpoint article elucidates the pros, cons, and ethical considerations of using ChatGPT within clinical medicine and notably, its implications for medical education. This exploration is crucial in a transformative era where AI could potentially augment human capability in the process of knowledge creation and dissemination, potentially revolutionizing medical education and clinical practice. The importance of maintaining academic integrity and professional standards is highlighted. The relevance of establishing clear guidelines for the responsible and ethical use of AI technologies in clinical medicine and medical education is also emphasized.
%M 37988149 %R 10.2196/47274 %U https://mededu.jmir.org/2023/1/e47274 %U https://doi.org/10.2196/47274 %U http://www.ncbi.nlm.nih.gov/pubmed/37988149 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 6 %N %P e49280 %T Evaluation of ChatGPT Dermatology Responses to Common Patient Queries %A Ferreira,Alana L %A Chu,Brian %A Grant-Kels,Jane M %A Ogunleye,Temitayo %A Lipoff,Jules B %+ Department of Dermatology, Lewis Katz School of Medicine, Temple University, 525 Jamestown Avenue, Suite #206, Philadelphia, PA, 19128, United States, 1 215 482 7546, jules.lipoff@temple.edu %K ChatGPT %K dermatology %K dermatologist %K artificial intelligence %K AI %K medical advice %K GPT-4 %K patient queries %K information resource %K response evaluation %K skin condition %K skin %K tool %K AI tool %D 2023 %7 17.11.2023 %9 Research Letter %J JMIR Dermatol %G English %X %M 37976093 %R 10.2196/49280 %U https://derma.jmir.org/2023/1/e49280 %U https://doi.org/10.2196/49280 %U http://www.ncbi.nlm.nih.gov/pubmed/37976093 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49368 %T A SWOT (Strengths, Weaknesses, Opportunities, and Threats) Analysis of ChatGPT in the Medical Literature: Concise Review %A Gödde,Daniel %A Nöhl,Sophia %A Wolf,Carina %A Rupert,Yannick %A Rimkus,Lukas %A Ehlers,Jan %A Breuckmann,Frank %A Sellmann,Timur %+ Department of Pathology and Molecularpathology, Helios University Hospital Wuppertal, Witten/Herdecke University, Alfred-Herrhausen-Straße 50, Witten, 58455, Germany, 49 202 896 2541, daniel.goedde@helios-gesundheit.de %K ChatGPT %K chatbot %K artificial intelligence %K education technology %K medical education %K machine learning %K chatbots %K concise review %K review methods %K review methodology %K SWOT %D 2023 %7 16.11.2023 %9 Review %J J Med Internet Res %G English %X Background: ChatGPT is a 175-billion-parameter natural language processing model that is already involved in scientific content and publications. Its influence ranges from providing quick access to information on medical topics and assisting in generating medical and scientific articles and papers to performing medical data analyses and even interpreting complex data sets. Objective: The future role of ChatGPT remains uncertain and a matter of debate already shortly after its release. This review aimed to analyze the role of ChatGPT in the medical literature during the first 3 months after its release. Methods: We performed a concise review of literature published in PubMed from December 1, 2022, to March 31, 2023. To find all publications related to ChatGPT or considering ChatGPT, the search term was kept simple (“ChatGPT” in AllFields). All publications available as full text in German or English were included. All accessible publications were evaluated according to specifications by the author team (eg, impact factor, publication modus, article type, publication speed, and type of ChatGPT integration or content). The conclusions of the articles were used for later SWOT (strengths, weaknesses, opportunities, and threats) analysis. All data were analyzed on a descriptive basis. Results: Of 178 studies in total, 160 met the inclusion criteria and were evaluated. The average impact factor was 4.423 (range 0-96.216), and the average publication speed was 16 (range 0-83) days. Among the articles, there were 77 editorials (48.1%), 43 essays (26.9%), 21 studies (13.1%), 6 reviews (3.8%), 6 case reports (3.8%), 6 news (3.8%), and 1 meta-analysis (0.6%).
Of those, 54.4% (n=87) were published as open access, with 5% (n=8) provided on preprint servers. Over 400 quotes with information on strengths, weaknesses, opportunities, and threats were detected. By far, most (n=142, 34.8%) were related to weaknesses. ChatGPT excels in its ability to express ideas clearly and formulate general contexts comprehensibly. It performs so well that even experts in the field have difficulty identifying abstracts generated by ChatGPT. However, the time-limited scope and the need for corrections by experts were mentioned as weaknesses and threats of ChatGPT. Opportunities include assistance in formulating medical issues for nonnative English speakers, as well as the possibility of timely participation in the development of such artificial intelligence tools since it is in its early stages and can therefore still be influenced. Conclusions: Artificial intelligence tools such as ChatGPT are already part of the medical publishing landscape. Despite their apparent opportunities, policies and guidelines must be implemented to ensure benefits in education, clinical practice, and research and protect against threats such as scientific misconduct, plagiarism, and inaccuracy. %M 37865883 %R 10.2196/49368 %U https://www.jmir.org/2023/1/e49368 %U https://doi.org/10.2196/49368 %U http://www.ncbi.nlm.nih.gov/pubmed/37865883 %0 Journal Article %@ 2562-0959 %I JMIR Publications %V 6 %N %P e50409 %T Assessing the Accuracy and Comprehensiveness of ChatGPT in Offering Clinical Guidance for Atopic Dermatitis and Acne Vulgaris %A Lakdawala,Nehal %A Channa,Leelakrishna %A Gronbeck,Christian %A Lakdawala,Nikita %A Weston,Gillian %A Sloan,Brett %A Feng,Hao %+ Department of Dermatology, University of Connecticut Health Center, 21 South Rd, Farmington, CT, 06032, United States, 1 8606794600, haofeng625@gmail.com %K ChatGPT %K artificial intelligence %K dermatology %K clinical guidance %K counseling %K atopic dermatitis %K acne vulgaris %K skin %K acne %K dermatitis %K NLP %K natural language processing %K dermatologic %K dermatological %K recommendation %K recommendations %K guidance %K advise %K counsel %K response %K responses %K chatbot %K chatbots %K conversational agent %K conversational agents %K answer %K answers %K computer generated %K automated %D 2023 %7 14.11.2023 %9 Research Letter %J JMIR Dermatol %G English %X %M 37962920 %R 10.2196/50409 %U https://derma.jmir.org/2023/1/e50409 %U https://doi.org/10.2196/50409 %U http://www.ncbi.nlm.nih.gov/pubmed/37962920 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e49877 %T ChatGPT Interactive Medical Simulations for Early Clinical Education: Case Study %A Scherr,Riley %A Halaseh,Faris F %A Spina,Aidin %A Andalib,Saman %A Rivera,Ronald %+ Irvine School of Medicine, University of California, 1001 Health Sciences Rd, Irvine, CA, 92617, United States, 1 949 824 6119, rscherr@hs.uci.edu %K ChatGPT %K medical school simulations %K preclinical curriculum %K artificial intelligence %K AI %K AI in medical education %K medical education %K simulation %K generative %K curriculum %K clinical education %K simulations %D 2023 %7 10.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The transition to clinical clerkships can be difficult for medical students, as it requires the synthesis and application of preclinical information into diagnostic and therapeutic decisions. 
ChatGPT—a generative language model with many medical applications due to its creativity, memory, and accuracy—can help students in this transition. Objective: This paper models ChatGPT 3.5’s ability to perform interactive clinical simulations and shows this tool’s benefit to medical education. Methods: Simulation starting prompts were refined using ChatGPT 3.5 in Google Chrome. Starting prompts were selected based on assessment format, stepwise progression of simulation events and questions, free-response question type, responsiveness to user inputs, postscenario feedback, and medical accuracy of the feedback. The chosen scenarios were advanced cardiac life support and medical intensive care (for sepsis and pneumonia). Results: Two starting prompts were chosen. Prompt 1 was developed through 3 test simulations and used successfully in 2 simulations. Prompt 2 was developed through 10 additional test simulations and used successfully in 1 simulation. Conclusions: ChatGPT is capable of creating simulations for early clinical education. These simulations let students practice novel parts of the clinical curriculum, such as forming independent diagnostic and therapeutic impressions over an entire patient encounter. Furthermore, the simulations can adapt to user inputs in a way that replicates real life more accurately than premade question bank clinical vignettes. Finally, ChatGPT can create potentially unlimited free simulations with specific feedback, which increases access for medical students with lower socioeconomic status and underresourced medical schools. However, no tool is perfect, and ChatGPT is no exception; there are concerns about simulation accuracy and replicability that need to be addressed to further optimize ChatGPT’s performance as an educational resource. %M 37948112 %R 10.2196/49877 %U https://mededu.jmir.org/2023/1/e49877 %U https://doi.org/10.2196/49877 %U http://www.ncbi.nlm.nih.gov/pubmed/37948112 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51300 %T An AI Dietitian for Type 2 Diabetes Mellitus Management Based on Large Language and Image Recognition Models: Preclinical Concept Validation Study %A Sun,Haonan %A Zhang,Kai %A Lan,Wei %A Gu,Qiufeng %A Jiang,Guangxiang %A Yang,Xue %A Qin,Wanli %A Han,Dongran %+ School of Life Science, Beijing University of Chinese Medicine, Scientific Research Building #542, Beijing, 102401, China, 86 13466590473, 18811570951@163.com %K ChatGPT %K artificial intelligence %K AI %K diabetes %K diabetic %K nutrition %K nutritional %K diet %K dietary %K dietician %K medical nutrition therapy %K ingredient recognition %K digital health %K language model %K image recognition %K machine learning %K deep learning %K NLP %K natural language processing %K meal %K recommendation %K meals %K food %K GPT 4.0 %D 2023 %7 9.11.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Nutritional management for patients with diabetes in China is a significant challenge due to the low supply of registered clinical dietitians. To address this, an artificial intelligence (AI)–based nutritionist program that uses advanced language and image recognition models was created. This program can identify ingredients from images of a patient’s meal and offer nutritional guidance and dietary recommendations. Objective: The primary objective of this study is to evaluate the competence of the models that support this program. 
Methods: The potential of an AI nutritionist program for patients with type 2 diabetes mellitus (T2DM) was evaluated through a multistep process. First, a survey was conducted among patients with T2DM and endocrinologists to identify knowledge gaps in dietary practices. ChatGPT and GPT 4.0 were then tested through the Chinese Registered Dietitian Examination to assess their proficiency in providing evidence-based dietary advice. ChatGPT’s responses to common questions about medical nutrition therapy were compared with expert responses by professional dietitians to evaluate its proficiency. The model’s food recommendations were scrutinized for consistency with expert advice. A deep learning–based image recognition model was developed for food identification at the ingredient level, and its performance was compared with existing models. Finally, a user-friendly app was developed, integrating the capabilities of language and image recognition models to potentially improve care for patients with T2DM. Results: Most patients (182/206, 88.4%) demanded more immediate and comprehensive nutritional management and education. Both ChatGPT and GPT 4.0 passed the Chinese Registered Dietitian examination. ChatGPT’s food recommendations were mainly in line with best practices, except for certain foods like root vegetables and dry beans. Professional dietitians’ reviews of ChatGPT’s responses to common questions were largely positive, with 162 out of 168 providing favorable reviews. The multilabel image recognition model evaluation showed that the Dino V2 model achieved an average F1 score of 0.825, indicating high accuracy in recognizing ingredients. Conclusions: The model evaluations were promising. The AI-based nutritionist program is now ready for a supervised pilot study. %M 37943581 %R 10.2196/51300 %U https://www.jmir.org/2023/1/e51300 %U https://doi.org/10.2196/51300 %U http://www.ncbi.nlm.nih.gov/pubmed/37943581 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47191 %T Assessing the Performance of ChatGPT in Medical Biochemistry Using Clinical Case Vignettes: Observational Study %A Surapaneni,Krishna Mohan %+ Panimalar Medical College Hospital & Research Institute, Varadharajapuram, Poonamallee, Chennai, 600123, India, 91 9789099989, krishnamohan.surapaneni@gmail.com %K ChatGPT %K artificial intelligence %K medical education %K medical Biochemistry %K biochemistry %K chatbot %K case study %K case scenario %K medical exam %K medical examination %K computer generated %D 2023 %7 7.11.2023 %9 Short Paper %J JMIR Med Educ %G English %X Background: ChatGPT has gained global attention recently owing to its high performance in generating a wide range of information and retrieving any kind of data instantaneously. ChatGPT has also been tested for the United States Medical Licensing Examination (USMLE) and has successfully cleared it. Thus, its usability in medical education is now one of the key discussions worldwide. Objective: The objective of this study is to evaluate the performance of ChatGPT in medical biochemistry using clinical case vignettes. Methods: The performance of ChatGPT was evaluated in medical biochemistry using 10 clinical case vignettes. Clinical case vignettes were randomly selected and inputted in ChatGPT along with the response options. We tested the responses for each clinical case twice. The answers generated by ChatGPT were saved and checked using our reference material. Results: ChatGPT generated correct answers for 4 questions on the first attempt. 
For the other cases, there were differences in responses generated by ChatGPT in the first and second attempts. In the second attempt, ChatGPT provided correct answers for 6 questions and incorrect answers for 4 questions out of the 10 cases that were used. But, to our surprise, for case 3, different answers were obtained with multiple attempts. We believe this to have happened owing to the complexity of the case, which involved addressing various critical medical aspects related to amino acid metabolism in a balanced approach. Conclusions: According to the findings of our study, ChatGPT may not be considered an accurate information provider for application in medical education to improve learning and assessment. However, our study was limited by a small sample size (10 clinical case vignettes) and the use of the publicly available version of ChatGPT (version 3.5). Although artificial intelligence (AI) has the capability to transform medical education, we emphasize the validation of such data produced by such AI systems for correctness and dependability before it could be implemented in practice. %M 37934568 %R 10.2196/47191 %U https://mededu.jmir.org/2023/1/e47191 %U https://doi.org/10.2196/47191 %U http://www.ncbi.nlm.nih.gov/pubmed/37934568 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e52865 %T The Impact of Multimodal Large Language Models on Health Care’s Future %A Meskó,Bertalan %+ The Medical Futurist Intitute, Povl Bang-Jensen u. 2/B1. 4/1., Budapest, XI., 1118, Hungary, 36 703807260, berci@medicalfuturist.com %K artificial intelligence %K ChatGPT %K digital health %K future %K GPT-4 %K Generative Pre-Trained Transformer %K large language models %K multimodality %K technology %K AI %K LLM %D 2023 %7 2.11.2023 %9 Viewpoint %J J Med Internet Res %G English %X When large language models (LLMs) were introduced to the public at large in late 2022 with ChatGPT (OpenAI), the interest was unprecedented, with more than 1 billion unique users within 90 days. Until the introduction of Generative Pre-trained Transformer 4 (GPT-4) in March 2023, these LLMs only contained a single mode—text. As medicine is a multimodal discipline, the potential future versions of LLMs that can handle multimodality—meaning that they could interpret and generate not only text but also images, videos, sound, and even comprehensive documents—can be conceptualized as a significant evolution in the field of artificial intelligence (AI). This paper zooms in on the new potential of generative AI, a new form of AI that also includes tools such as LLMs, through the achievement of multimodal inputs of text, images, and speech on health care’s future. We present several futuristic scenarios to illustrate the potential path forward as multimodal LLMs (M-LLMs) could represent the gateway between health care professionals and using AI for medical purposes. It is important to point out, though, that despite the unprecedented potential of generative AI in the form of M-LLMs, the human touch in medicine remains irreplaceable. AI should be seen as a tool that can augment health care professionals rather than replace them. It is also important to consider the human aspects of health care—empathy, understanding, and the doctor-patient relationship—when deploying AI. 
%M 37917126 %R 10.2196/52865 %U https://www.jmir.org/2023/1/e52865 %U https://doi.org/10.2196/52865 %U http://www.ncbi.nlm.nih.gov/pubmed/37917126 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e47532 %T The Accuracy and Potential Racial and Ethnic Biases of GPT-4 in the Diagnosis and Triage of Health Conditions: Evaluation Study %A Ito,Naoki %A Kadomatsu,Sakina %A Fujisawa,Mineto %A Fukaguchi,Kiyomitsu %A Ishizawa,Ryo %A Kanda,Naoki %A Kasugai,Daisuke %A Nakajima,Mikio %A Goto,Tadahiro %A Tsugawa,Yusuke %+ TXP Medical Co Ltd, 41-1 H¹O Kanda 706, Tokyo, 101-0042, Japan, 81 03 5615 8433, tag695@mail.harvard.edu %K GPT-4 %K racial and ethnic bias %K typical clinical vignettes %K diagnosis %K triage %K artificial intelligence %K AI %K race %K clinical vignettes %K physician %K efficiency %K decision-making %K bias %K GPT %D 2023 %7 2.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Whether GPT-4, the conversational artificial intelligence, can accurately diagnose and triage health conditions and whether it presents racial and ethnic biases in its decisions remain unclear. Objective: We aim to assess the accuracy of GPT-4 in the diagnosis and triage of health conditions and whether its performance varies by patient race and ethnicity. Methods: We compared the performance of GPT-4 and physicians, using 45 typical clinical vignettes, each with a correct diagnosis and triage level, in February and March 2023. For each of the 45 clinical vignettes, GPT-4 and 3 board-certified physicians provided the most likely primary diagnosis and triage level (emergency, nonemergency, or self-care). Independent reviewers evaluated the diagnoses as “correct” or “incorrect.” Physician diagnosis was defined as the consensus of the 3 physicians. We evaluated whether the performance of GPT-4 varies by patient race and ethnicity, by adding the information on patient race and ethnicity to the clinical vignettes. Results: The accuracy of diagnosis was comparable between GPT-4 and physicians (the percentage of correct diagnosis was 97.8% (44/45; 95% CI 88.2%-99.9%) for GPT-4 and 91.1% (41/45; 95% CI 78.8%-97.5%) for physicians; P=.38). GPT-4 provided appropriate reasoning for 97.8% (44/45) of the vignettes. The appropriateness of triage was comparable between GPT-4 and physicians (GPT-4: 30/45, 66.7%; 95% CI 51.0%-80.0%; physicians: 30/45, 66.7%; 95% CI 51.0%-80.0%; P=.99). The performance of GPT-4 in diagnosing health conditions did not vary among different races and ethnicities (Black, White, Asian, and Hispanic), with an accuracy of 100% (95% CI 78.2%-100%). P values, compared to the GPT-4 output without incorporating race and ethnicity information, were all .99. The accuracy of triage was not significantly different even if patients’ race and ethnicity information was added. The accuracy of triage was 62.2% (95% CI 46.5%-76.2%; P=.50) for Black patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for White patients; 66.7% (95% CI 51.0%-80.0%; P=.99) for Asian patients, and 62.2% (95% CI 46.5%-76.2%; P=.69) for Hispanic patients. P values were calculated by comparing the outputs with and without conditioning on race and ethnicity. Conclusions: GPT-4’s ability to diagnose and triage typical clinical vignettes was comparable to that of board-certified physicians. The performance of GPT-4 did not vary by patient race and ethnicity. 
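The confidence intervals reported for the proportions in the preceding abstract (for example, 44 of 45 correct diagnoses) are consistent with exact binomial intervals. A minimal sketch, assuming the Clopper-Pearson method via statsmodels rather than the authors' stated software:

```python
from statsmodels.stats.proportion import proportion_confint

correct, total = 44, 45  # GPT-4 diagnostic accuracy reported in the abstract
low, high = proportion_confint(correct, total, alpha=0.05, method="beta")  # Clopper-Pearson
print(f"{correct / total:.1%} (95% CI {low:.1%}-{high:.1%})")  # approximately 97.8% (88.2%-99.9%)
```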
These findings should be informative for health systems looking to introduce conversational artificial intelligence to improve the efficiency of patient diagnosis and triage. %M 37917120 %R 10.2196/47532 %U https://mededu.jmir.org/2023/1/e47532 %U https://doi.org/10.2196/47532 %U http://www.ncbi.nlm.nih.gov/pubmed/37917120 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51421 %T Exploring the Possible Use of AI Chatbots in Public Health Education: Feasibility Study %A Baglivo,Francesco %A De Angelis,Luigi %A Casigliani,Virginia %A Arzilli,Guglielmo %A Privitera,Gaetano Pierpaolo %A Rizzo,Caterina %+ Department of Translational Research and New Technologies in Medicine and Surgery, University of Pisa, Via San Zeno 35, Pisa (PI), 56123, Italy, 39 3288348649, f.baglivo@studenti.unipi.it %K artificial intelligence %K chatbots %K medical education %K vaccination %K public health %K medical students %K large language model %K generative AI %K ChatGPT %K Google Bard %K AI chatbot %K health education %K public health %K health care %K medical training %K educational support tool %K chatbot model %D 2023 %7 1.11.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Artificial intelligence (AI) is a rapidly developing field with the potential to transform various aspects of health care and public health, including medical training. During the “Hygiene and Public Health” course for fifth-year medical students, a practical training session was conducted on vaccination using AI chatbots as an educational supportive tool. Before receiving specific training on vaccination, the students were given a web-based test extracted from the Italian National Medical Residency Test. After completing the test, a critical correction of each question was performed assisted by AI chatbots. Objective: The main aim of this study was to identify whether AI chatbots can be considered educational support tools for training in public health. The secondary objective was to assess the performance of different AI chatbots on complex multiple-choice medical questions in the Italian language. Methods: A test composed of 15 multiple-choice questions on vaccination was extracted from the Italian National Medical Residency Test using targeted keywords and administered to medical students via Google Forms and to different AI chatbot models (Bing Chat, ChatGPT, Chatsonic, Google Bard, and YouChat). The correction of the test was conducted in the classroom, focusing on the critical evaluation of the explanations provided by the chatbot. A Mann-Whitney U test was conducted to compare the performances of medical students and AI chatbots. Student feedback was collected anonymously at the end of the training experience. Results: In total, 36 medical students and 5 AI chatbot models completed the test. The students achieved an average score of 8.22 (SD 2.65) out of 15, while the AI chatbots scored an average of 12.22 (SD 2.77). The results indicated a statistically significant difference in performance between the 2 groups (U=49.5, P<.001), with a large effect size (r=0.69). When divided by question type (direct, scenario-based, and negative), significant differences were observed in direct (P<.001) and scenario-based (P<.001) questions, but not in negative questions (P=.48). The students reported a high level of satisfaction (7.9/10) with the educational experience, expressing a strong desire to repeat the experience (7.6/10). 
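The group comparison reported above (U=49.5; P<.001; r=0.69) is a standard Mann-Whitney U test with an r effect size. A minimal sketch with made-up scores (not the study's data), assuming SciPy and the common r = |Z|/sqrt(N) conversion:

```python
from math import sqrt
from scipy.stats import mannwhitneyu, norm

# Illustrative test scores out of 15 (not the study's raw data)
student_scores = [5, 6, 7, 7, 8, 8, 9, 9, 10, 11, 12, 13]
chatbot_scores = [9, 11, 13, 14, 14]

u_stat, p_value = mannwhitneyu(student_scores, chatbot_scores, alternative="two-sided")

# Effect size r = |Z| / sqrt(N), with |Z| recovered from the two-sided P value
n_total = len(student_scores) + len(chatbot_scores)
z = norm.isf(p_value / 2)
r = z / sqrt(n_total)
print(f"U={u_stat:.1f}, P={p_value:.4f}, r={r:.2f}")
```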
Conclusions: This study demonstrated the efficacy of AI chatbots in answering complex medical questions related to vaccination and providing valuable educational support. Their performance significantly surpassed that of medical students in direct and scenario-based questions. The responsible and critical use of AI chatbots can enhance medical education, making it an essential aspect to integrate into the educational system. %M 37910155 %R 10.2196/51421 %U https://mededu.jmir.org/2023/1/e51421 %U https://doi.org/10.2196/51421 %U http://www.ncbi.nlm.nih.gov/pubmed/37910155 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49385 %T Health Care Trainees’ and Professionals’ Perceptions of ChatGPT in Improving Medical Knowledge Training: Rapid Survey Study %A Hu,Je-Ming %A Liu,Feng-Cheng %A Chu,Chi-Ming %A Chang,Yu-Tien %+ School of Public Health, National Defense Medical Center, No 161, Sec 6, Minquan E Rd, Neihu Dist, Taipei, 114, Taiwan, 886 2 8792 3100 ext 18454, greengarden720925@gmail.com %K ChatGPT %K large language model %K medicine %K perception evaluation %K internet survey %K structural equation modeling %K SEM %D 2023 %7 18.10.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT is a powerful pretrained large language model. It has both demonstrated potential and raised concerns related to knowledge translation and knowledge transfer. To apply and improve knowledge transfer in the real world, it is essential to assess the perceptions and acceptance of the users of ChatGPT-assisted training. Objective: We aimed to investigate the perceptions of health care trainees and professionals on ChatGPT-assisted training, using biomedical informatics as an example. Methods: We used purposeful sampling to include all health care undergraduate trainees and graduate professionals (n=195) from January to May 2023 in the School of Public Health at the National Defense Medical Center in Taiwan. Subjects were asked to watch a 2-minute video introducing 5 scenarios about ChatGPT-assisted training in biomedical informatics and then answer a self-designed online (web- and mobile-based) questionnaire according to the Kirkpatrick model. The survey responses were used to develop 4 constructs: “perceived knowledge acquisition,” “perceived training motivation,” “perceived training satisfaction,” and “perceived training effectiveness.” The study used structural equation modeling (SEM) to evaluate and test the structural model and hypotheses. Results: The online questionnaire response rate was 152 of 195 (78%); 88 of 152 participants (58%) were undergraduate trainees and 90 of 152 participants (59%) were women. The ages ranged from 18 to 53 years (mean 23.3, SD 6.0 years). There was no statistical difference in perceptions of training evaluation between men and women. Most participants were enthusiastic about the ChatGPT-assisted training, while the graduate professionals were more enthusiastic than undergraduate trainees. Nevertheless, some concerns were raised about potential cheating on training assessment. The average scores for knowledge acquisition, training motivation, training satisfaction, and training effectiveness were 3.84 (SD 0.80), 3.76 (SD 0.93), 3.75 (SD 0.87), and 3.72 (SD 0.91), respectively (Likert scale 1-5: strongly disagree to strongly agree). Knowledge acquisition had the highest score and training effectiveness the lowest. 
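The construct-level analysis in the preceding Methods uses structural equation modeling. As a simplified, hedged stand-in (separate ordinary least squares path regressions on toy data, not a full SEM in a dedicated package), the hypothesized paths from knowledge acquisition could be sketched as follows.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Toy data standing in for the four perception constructs (not the study's data)
rng = np.random.default_rng(0)
acquisition = rng.normal(3.8, 0.8, 152)
df = pd.DataFrame({
    "acquisition": acquisition,
    "motivation": 0.9 * acquisition + rng.normal(0, 0.3, 152),
    "satisfaction": 0.85 * acquisition + rng.normal(0, 0.3, 152),
    "effectiveness": 0.8 * acquisition + rng.normal(0, 0.4, 152),
})

# One regression per hypothesized path from knowledge acquisition
for outcome in ["effectiveness", "satisfaction", "motivation"]:
    fit = smf.ols(f"{outcome} ~ acquisition", data=df).fit()
    print(outcome, round(fit.params["acquisition"], 2), f"P={fit.pvalues['acquisition']:.3g}")
```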
In the SEM results, training effectiveness was influenced predominantly by knowledge acquisition and partially met the hypotheses in the research framework. Knowledge acquisition had a direct effect on training effectiveness, training satisfaction, and training motivation, with β coefficients of .80, .87, and .97, respectively (all P<.001). Conclusions: Most health care trainees and professionals perceived ChatGPT-assisted training as an aid in knowledge transfer. However, to improve training effectiveness, it should be combined with empirical experts for proper guidance and dual interaction. In a future study, we recommend using a larger sample size for evaluation of internet-connected large language models in medical knowledge transfer. %M 37851495 %R 10.2196/49385 %U https://www.jmir.org/2023/1/e49385 %U https://doi.org/10.2196/49385 %U http://www.ncbi.nlm.nih.gov/pubmed/37851495 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 10 %N %P e49132 %T A Motivational Interviewing Chatbot With Generative Reflections for Increasing Readiness to Quit Smoking: Iterative Development Study %A Brown,Andrew %A Kumar,Ash Tanuj %A Melamed,Osnat %A Ahmed,Imtihan %A Wang,Yu Hao %A Deza,Arnaud %A Morcos,Marc %A Zhu,Leon %A Maslej,Marta %A Minian,Nadia %A Sujaya,Vidya %A Wolff,Jodi %A Doggett,Olivia %A Iantorno,Mathew %A Ratto,Matt %A Selby,Peter %A Rose,Jonathan %+ The Edward S Rogers Sr Department of Electrical & Computer Engineering, University of Toronto, 10 King's College Rd, Toronto, ON, M5S 3G4, Canada, 1 416 978 6992, jonathan.rose@ece.utoronto.ca %K conversational agents %K chatbots %K behavior change %K smoking cessation %K motivational interviewing %K deep learning %K natural language processing %K transformers %K generative artificial intelligence %K artificial intelligence %K AI %D 2023 %7 17.10.2023 %9 Original Paper %J JMIR Ment Health %G English %X Background: The motivational interviewing (MI) approach has been shown to help move ambivalent smokers toward the decision to quit smoking. There have been several attempts to broaden access to MI through text-based chatbots. These typically use scripted responses to client statements, but such nonspecific responses have been shown to reduce effectiveness. Recent advances in natural language processing provide a new way to create responses that are specific to a client’s statements, using a generative language model. Objective: This study aimed to design, evolve, and measure the effectiveness of a chatbot system that can guide ambivalent people who smoke toward the decision to quit smoking with MI-style generative reflections. Methods: Over time, 4 different MI chatbot versions were evolved, and each version was tested with a separate group of ambivalent smokers. A total of 349 smokers were recruited through a web-based recruitment platform. The first chatbot version only asked questions without reflections on the answers. The second version asked the questions and provided reflections with an initial version of the reflection generator. The third version used an improved reflection generator, and the fourth version added extended interaction on some of the questions. Participants’ readiness to quit was measured before the conversation and 1 week later using an 11-point scale that measured 3 attributes related to smoking cessation: readiness, confidence, and importance. The number of quit attempts made in the week before the conversation and the week after was surveyed; in addition, participants rated the perceived empathy of the chatbot. 
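The generative reflections in the study below were produced, as described in the next passage, by a transformer fine-tuned on high-quality examples. Purely to illustrate the generation step, here is a sketch using a small generic base model (an assumption for illustration, not the authors' fine-tuned network, and unlikely to produce clinically usable reflections without fine-tuning).

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical base model; the study fine-tuned its own transformer on curated reflections.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

client_statement = "I know smoking is bad for me, but it helps me deal with stress at work."
prompt = f"Client: {client_statement}\nReflection:"

inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=40,
    do_sample=True,
    top_p=0.9,
    pad_token_id=tokenizer.eos_token_id,
)
# Decode only the newly generated continuation after the prompt
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```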
The main body of the conversation consists of 5 scripted questions, responses from participants, and (for 3 of the 4 versions) generated reflections. A pretrained transformer-based neural network was fine-tuned on examples of high-quality reflections to generate MI reflections. Results: The increase in average confidence using the nongenerative version was 1.0 (SD 2.0; P=.001), whereas for the 3 generative versions, the increases ranged from 1.2 to 1.3 (SD 2.0-2.3; P<.001). The extended conversation with improved generative reflections was the only version associated with a significant increase in average importance (0.7, SD 2.0; P<.001) and readiness (0.4, SD 1.7; P=.01). The enhanced reflection and extended conversations exhibited significantly better perceived empathy than the nongenerative conversation (P=.02 and P=.004, respectively). The number of quit attempts did not significantly change between the week before the conversation and the week after across all 4 conversations. Conclusions: The results suggest that generative reflections increase the impact of a conversation on readiness to quit smoking 1 week later, although a significant portion of the impact seen so far can be achieved by only asking questions without the reflections. These results support further evolution of the chatbot conversation and can serve as a basis for comparison against more advanced versions. %M 37847539 %R 10.2196/49132 %U https://mental.jmir.org/2023/1/e49132 %U https://doi.org/10.2196/49132 %U http://www.ncbi.nlm.nih.gov/pubmed/37847539 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e42960 %T Building a Chatbot in a Pandemic %A Rambaud,Kimberly %A van Woerden,Simon %A Palumbo,Leonardo %A Salvi,Cristiana %A Smallwood,Catherine %A Rockenschaub,Gerald %A Okoliyski,Michail %A Marinova,Lora %A Fomaidi,Galina %A Djalalova,Malika %A Faruqui,Nabiha %A Melo Bianco,Viviane %A Mosquera,Mario %A Spasov,Ivaylo %A Totskaya,Yekaterina %+ World Health Organization Regional Office for Europe, Marmorvej 51, Copenhagen, Denmark, 45 627164307, kimb.ramb@gmail.com %K COVID-19 %K chatbots %K evidence-based communication channels %K conversational agent %K user-centered %K health promotion %K digital health intervention %K online health information %K digital health tool %K health communication %D 2023 %7 10.10.2023 %9 Viewpoint %J J Med Internet Res %G English %X Easy access to evidence-based information on COVID-19 within an infodemic has been a challenging task. Chatbots have been introduced in times of emergency, when human resources are stretched thin and individuals need a user-centered resource. The World Health Organization Regional Office for Europe and UNICEF (United Nations Children's Fund) Europe and Central Asia came together to build a chatbot, HealthBuddy+, to assist country populations in the region to access accurate COVID-19 information in the local languages, adapted to the country context. Working in close collaboration with thematic technical experts, colleagues and counterparts at the country level allowed the project to be tailored to a diverse range of subtopics. To ensure that HealthBuddy+ was relevant and useful in countries across the region, the 2 regional offices worked closely with their counterparts in country offices, which were essential in partnering with national authorities, engaging communities, promoting the tool, and identifying the most relevant communication channels in which to embed HealthBuddy+. 
Over the past 2 years, the project has expanded from a web-based chatbot in 7 languages to a multistream, multifunction chatbot available in 16 regional languages, and HealthBuddy+ continues to expand and adjust to meet emerging health emergency needs. %M 37074958 %R 10.2196/42960 %U https://www.jmir.org/2023/1/e42960 %U https://doi.org/10.2196/42960 %U http://www.ncbi.nlm.nih.gov/pubmed/37074958 %0 Journal Article %@ 2561-326X %I JMIR Publications %V 7 %N %P e47267 %T Acceptability of a Pain History Assessment and Education Chatbot (Dolores) Across Age Groups in Populations With Chronic Pain: Development and Pilot Testing %A Andrews,Nicole Emma %A Ireland,David %A Vijayakumar,Pranavie %A Burvill,Lyza %A Hay,Elizabeth %A Westerman,Daria %A Rose,Tanya %A Schlumpf,Mikaela %A Strong,Jenny %A Claus,Andrew %+ RECOVER Injury Research Centre, The University of Queensland, Level 7, Surgical Treatment and Rehabilitation Service (STARS), 296 Herston Rd, Herston, 4029, Australia, 61 418762617, n.andrews@uq.edu.au %K chronic pain %K education %K neurophysiology %K neuroscience %K conversation agent %K chatbot %K age %K young adult %K adolescence %K adolescent %K pain %K patient education %K usability %K acceptability %K mobile health %K mHealth %K mobile app %K health app %K youth %K mobile phone %D 2023 %7 6.10.2023 %9 Original Paper %J JMIR Form Res %G English %X Background: The delivery of education on pain neuroscience and the evidence for different treatment approaches has become a key component of contemporary persistent pain management. Chatbots, or more formally conversation agents, are increasingly being used in health care settings due to their versatility in providing interactive and individualized approaches to both capture and deliver information. Research focused on the acceptability of diverse chatbot formats can assist in developing a better understanding of the educational needs of target populations. Objective: This study aims to detail the development and initial pilot testing of a multimodality pain education chatbot (Dolores) that can be used across different age groups and investigate whether acceptability and feedback were comparable across age groups following pilot testing. Methods: Following an initial design phase involving software engineers (n=2) and expert clinicians (n=6), a total of 60 individuals with chronic pain who attended an outpatient clinic at 1 of 2 pain centers in Australia were recruited for pilot testing. The 60 individuals consisted of 20 (33%) adolescents (aged 10-18 years), 20 (33%) young adults (aged 19-35 years), and 20 (33%) adults (aged >35 years) with persistent pain. Participants spent 20 to 30 minutes completing interactive chatbot activities that enabled the Dolores app to gather a pain history and provide education about pain and pain treatments. After the chatbot activities, participants completed a custom-made feedback questionnaire measuring the acceptability constructs pertaining to health education chatbots. To determine the effect of age group on the acceptability ratings and feedback provided, a series of binomial logistic regression models and cumulative odds ordinal logistic regression models with proportional odds were generated. Results: Overall, acceptability was high for the following constructs: engagement, perceived value, usability, accuracy, responsiveness, adoption intention, esthetics, and overall quality. The effect of age group on all acceptability ratings was small and not statistically significant. 
An analysis of open-ended question responses revealed that major frustrations with the app were related to Dolores’ speech, which was explored further through a comparative analysis. With respect to providing negative feedback about Dolores’ speech, a logistic regression model showed that the effect of age group was statistically significant (χ²₂=11.7; P=.003) and explained 27.1% of the variance (Nagelkerke R²). Adults and young adults were less likely to comment on Dolores’ speech compared with adolescent participants (odds ratio 0.20, 95% CI 0.05-0.84 and odds ratio 0.05, 95% CI 0.01-0.43, respectively). Comments were related to both speech rate (too slow) and quality (unpleasant and robotic). Conclusions: This study provides support for the acceptability of pain history and education chatbots across different age groups. Chatbot acceptability for adolescent cohorts may be improved by enabling the self-selection of speech characteristics such as rate and personable tone. %M 37801342 %R 10.2196/47267 %U https://formative.jmir.org/2023/1/e47267 %U https://doi.org/10.2196/47267 %U http://www.ncbi.nlm.nih.gov/pubmed/37801342 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49963 %T A Future of Smarter Digital Health Empowered by Generative Pretrained Transformer %A Miao,Hongyu %A Li,Chengdong %A Wang,Jing %+ College of Nursing, Florida State University, 98 Varsity Way, Tallahassee, FL, 32306, United States, 1 8506443299, JingWang@nursing.fsu.edu %K generative pretrained model %K artificial intelligence %K digital health %K generative pretrained transformer %K ChatGPT %K precision medicine %K AI %K privacy %K ethics %D 2023 %7 26.9.2023 %9 Viewpoint %J J Med Internet Res %G English %X Generative pretrained transformer (GPT) tools have been thriving, as ignited by the remarkable success of OpenAI’s recent chatbot product. GPT technology offers countless opportunities to significantly improve or renovate current health care research and practice paradigms, especially digital health interventions and digital health–enabled clinical care, and a future of smarter digital health can thus be expected. In particular, GPT technology can be incorporated through various digital health platforms in homes and hospitals embedded with numerous sensors, wearables, and remote monitoring devices. In this viewpoint paper, we highlight recent research progress that depicts the future picture of a smarter digital health ecosystem through GPT-facilitated centralized communications, automated analytics, personalized health care, and instant decision-making. 
%M 37751243 %R 10.2196/49963 %U https://www.jmir.org/2023/1/e49963 %U https://doi.org/10.2196/49963 %U http://www.ncbi.nlm.nih.gov/pubmed/37751243 %0 Journal Article %@ 2368-7959 %I JMIR Publications %V 10 %N %P e51232 %T Suicide Risk Assessments Through the Eyes of ChatGPT-3.5 Versus ChatGPT-4: Vignette Study %A Levkovich,Inbar %A Elyoseph,Zohar %+ Department of Psychology and Educational Counseling, The Center for Psychobiological Research, Max Stern Yezreel Valley College, Hatena 14b Kiryat Tivon, Emek Yezreel, 3650414, Israel, 972 54 783 6088, Zohare@yvc.ac.il %K artificial intelligence %K ChatGPT %K diagnosis %K psychological assessment %K psychological %K suicide risk %K risk assessment %K text vignette %K NLP %K natural language processing %K suicide %K suicidal %K risk %K assessment %K vignette %K vignettes %K assessments %K mental %K self-harm %D 2023 %7 20.9.2023 %9 Original Paper %J JMIR Ment Health %G English %X Background: ChatGPT, a linguistic artificial intelligence (AI) model engineered by OpenAI, offers prospective contributions to mental health professionals. Although having significant theoretical implications, ChatGPT’s practical capabilities, particularly regarding suicide prevention, have not yet been substantiated. Objective: The study’s aim was to evaluate ChatGPT’s ability to assess suicide risk, taking into consideration 2 discernable factors—perceived burdensomeness and thwarted belongingness—over a 2-month period. In addition, we evaluated whether ChatGPT-4 more accurately evaluated suicide risk than did ChatGPT-3.5. Methods: ChatGPT was tasked with assessing a vignette that depicted a hypothetical patient exhibiting differing degrees of perceived burdensomeness and thwarted belongingness. The assessments generated by ChatGPT were subsequently contrasted with standard evaluations rendered by mental health professionals. Using both ChatGPT-3.5 and ChatGPT-4 (May 24, 2023), we executed 3 evaluative procedures in June and July 2023. Our intent was to scrutinize ChatGPT-4’s proficiency in assessing various facets of suicide risk in relation to the evaluative abilities of both mental health professionals and an earlier version of ChatGPT-3.5 (March 14 version). Results: During the period of June and July 2023, we found that the likelihood of suicide attempts as evaluated by ChatGPT-4 was similar to the norms of mental health professionals (n=379) under all conditions (average Z score of 0.01). Nonetheless, a pronounced discrepancy was observed regarding the assessments performed by ChatGPT-3.5 (May version), which markedly underestimated the potential for suicide attempts, in comparison to the assessments carried out by the mental health professionals (average Z score of –0.83). The empirical evidence suggests that ChatGPT-4’s evaluation of the incidence of suicidal ideation and psychache was higher than that of the mental health professionals (average Z score of 0.47 and 1.00, respectively). Conversely, the level of resilience as assessed by both ChatGPT-4 and ChatGPT-3.5 (both versions) was observed to be lower in comparison to the assessments offered by mental health professionals (average Z score of –0.89 and –0.90, respectively). Conclusions: The findings suggest that ChatGPT-4 estimates the likelihood of suicide attempts in a manner akin to evaluations provided by professionals. In terms of recognizing suicidal ideation, ChatGPT-4 appears to be more precise. 
However, regarding psychache, there was an observed overestimation by ChatGPT-4, indicating a need for further research. These results have implications regarding ChatGPT-4’s potential to support gatekeepers, patients, and even mental health professionals’ decision-making. Despite the clinical potential, intensive follow-up studies are necessary to establish the use of ChatGPT-4’s capabilities in clinical practice. The finding that ChatGPT-3.5 frequently underestimates suicide risk, especially in severe cases, is particularly troubling. It indicates that ChatGPT may downplay one’s actual suicide risk level. %M 37728984 %R 10.2196/51232 %U https://mental.jmir.org/2023/1/e51232 %U https://doi.org/10.2196/51232 %U http://www.ncbi.nlm.nih.gov/pubmed/37728984 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48780 %T Anki Tagger: A Generative AI Tool for Aligning Third-Party Resources to Preclinical Curriculum %A Pendergrast,Tricia %A Chalmers,Zachary %+ Northwestern University Feinberg School of Medicine, 303 E Chicago Ave, Morton 1-670, Chicago, IL, 60611, United States, 1 3125038194, zachary.chalmers@northwestern.edu %K ChatGPT %K undergraduate medical education %K large language models %K Anki %K flashcards %K artificial intelligence %K AI %D 2023 %7 20.9.2023 %9 Research Letter %J JMIR Med Educ %G English %X Using large language models, we developed a method to efficiently query existing flashcard libraries and select those most relevant to an individual's medical school curricula. %M 37728965 %R 10.2196/48780 %U https://mededu.jmir.org/2023/1/e48780 %U https://doi.org/10.2196/48780 %U http://www.ncbi.nlm.nih.gov/pubmed/37728965 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e47621 %T The Potential of ChatGPT as a Self-Diagnostic Tool in Common Orthopedic Diseases: Exploratory Study %A Kuroiwa,Tomoyuki %A Sarcon,Aida %A Ibara,Takuya %A Yamada,Eriku %A Yamamoto,Akiko %A Tsukamoto,Kazuya %A Fujita,Koji %+ Division of Medical Design Innovations, Open Innovation Center, Institute of Research Innovation, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo, 1138519, Japan, 81 358035279, fujiorth@tmd.ac.jp %K ChatGPT %K generative pretrained transformer %K natural language processing %K artificial intelligence %K chatbot %K diagnosis %K self-diagnosis %K accuracy %K precision %K language model %K orthopedic disease %K AI model %K health information %D 2023 %7 15.9.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence (AI) has gained tremendous popularity recently, especially the use of natural language processing (NLP). ChatGPT is a state-of-the-art chatbot capable of creating natural conversations using NLP. The use of AI in medicine can have a tremendous impact on health care delivery. Although some studies have evaluated ChatGPT’s accuracy in self-diagnosis, there is no research regarding its precision and the degree to which it recommends medical consultations. Objective: The aim of this study was to evaluate ChatGPT’s ability to accurately and precisely self-diagnose common orthopedic diseases, as well as the degree of recommendation it provides for medical consultations. Methods: Over a 5-day course, each of the study authors submitted the same questions to ChatGPT. The conditions evaluated were carpal tunnel syndrome (CTS), cervical myelopathy (CM), lumbar spinal stenosis (LSS), knee osteoarthritis (KOA), and hip osteoarthritis (HOA). 
Answers were categorized as correct, partially correct, incorrect, or a differential diagnosis. The percentage of correct answers and reproducibility were calculated. The reproducibility between days and between raters was calculated using the Fleiss κ coefficient. Answers that recommended that the patient seek medical attention were recategorized according to the strength of the recommendation as defined by the study. Results: The ratios of correct answers were 25/25, 1/25, 24/25, 16/25, and 17/25 for CTS, CM, LSS, KOA, and HOA, respectively. The ratios of incorrect answers were 23/25 for CM and 0/25 for all other conditions. The reproducibility between days was 1.0, 0.15, 0.7, 0.6, and 0.6 for CTS, CM, LSS, KOA, and HOA, respectively. The reproducibility between raters was 1.0, 0.1, 0.64, –0.12, and 0.04 for CTS, CM, LSS, KOA, and HOA, respectively. Among the answers recommending medical attention, the phrases “essential,” “recommended,” “best,” and “important” were used. Specifically, “essential” occurred in 4 out of 125, “recommended” in 12 out of 125, “best” in 6 out of 125, and “important” in 94 out of 125 answers. Additionally, 7 out of the 125 answers did not include a recommendation to seek medical attention. Conclusions: The accuracy and reproducibility of ChatGPT in self-diagnosing five common orthopedic conditions were inconsistent. The accuracy could potentially be improved by adding symptoms that could easily identify a specific location. Only a few answers were accompanied by a strong recommendation to seek medical attention according to our study standards. Although ChatGPT could serve as a potential first step in accessing care, we found variability in accurate self-diagnosis. Given the risk of harm with self-diagnosis without medical follow-up, it would be prudent for an NLP-based chatbot to include clear language alerting patients to seek expert medical opinions. We hope to shed further light on the use of AI in a future clinical study. %M 37713254 %R 10.2196/47621 %U https://www.jmir.org/2023/1/e47621 %U https://doi.org/10.2196/47621 %U http://www.ncbi.nlm.nih.gov/pubmed/37713254 %0 Journal Article %@ 2561-7605 %I JMIR Publications %V 6 %N %P e51776 %T Shaping the Future of Older Adult Care: ChatGPT, Advanced AI, and the Transformation of Clinical Practice %A Fear,Kathleen %A Gleber,Conrad %+ UR Health Lab, University of Rochester Medical Center, 30 Corporate Woods, Suite 180, Rochester, NY, 14623, United States, 1 585 341 4954, kathleen_fear@urmc.rochester.edu %K generative AI %K artificial intelligence %K large language models %K ChatGPT %K Generative Pre-trained Transformer %D 2023 %7 13.9.2023 %9 Guest Editorial %J JMIR Aging %G English %X As the older adult population in the United States grows, new approaches to managing and streamlining clinical work are needed to accommodate their increased demand for health care. Deep learning and generative artificial intelligence (AI) have the potential to transform how care is delivered and how clinicians practice in geriatrics. In this editorial, we explore the opportunities and limitations of these technologies. 
%M 37703085 %R 10.2196/51776 %U https://aging.jmir.org/2023/1/e51776 %U https://doi.org/10.2196/51776 %U http://www.ncbi.nlm.nih.gov/pubmed/37703085 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49240 %T Clinical Accuracy of Large Language Models and Google Search Responses to Postpartum Depression Questions: Cross-Sectional Study %A Sezgin,Emre %A Chekeni,Faraaz %A Lee,Jennifer %A Keim,Sarah %+ Nationwide Children's Hospital, 700 Children's Dr, Columbus, OH, 43205, United States, 1 614 722 3179, emre.sezgin@nationwidechildrens.org %K mental health %K postpartum depression %K health information seeking %K large language model %K GPT %K LaMDA %K Google %K ChatGPT %K artificial intelligence %K natural language processing %K generative AI %K depression %K cross-sectional study %K clinical accuracy %D 2023 %7 11.9.2023 %9 Research Letter %J J Med Internet Res %G English %X %M 37695668 %R 10.2196/49240 %U https://www.jmir.org/2023/1/e49240 %U https://doi.org/10.2196/49240 %U http://www.ncbi.nlm.nih.gov/pubmed/37695668 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46482 %T Artificial Intelligence in Medical Education: Comparative Analysis of ChatGPT, Bing, and Medical Students in Germany %A Roos,Jonas %A Kasapovic,Adnan %A Jansen,Tom %A Kaczmarczyk,Robert %+ Department of Dermatology and Allergy, Technical University of Munich, Biedersteiner Str. 29, Munich, 80802, Germany, 49 08941403033, robert.kaczmarczyk@tum.de %K medical education %K state examinations %K exams %K large language models %K artificial intelligence %K ChatGPT %D 2023 %7 4.9.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models (LLMs) have demonstrated significant potential in diverse domains, including medicine. Nonetheless, there is a scarcity of studies examining their performance in medical examinations, especially those conducted in languages other than English, and in direct comparison with medical students. Analyzing the performance of LLMs in state medical examinations can provide insights into their capabilities and limitations and evaluate their potential role in medical education and examination preparation.  Objective: This study aimed to assess and compare the performance of 3 LLMs, GPT-4, Bing, and GPT-3.5-Turbo, in the German Medical State Examinations of 2022 and to evaluate their performance relative to that of medical students.  Methods: The LLMs were assessed on a total of 630 questions from the spring and fall German Medical State Examinations of 2022. The performance was evaluated with and without media-related questions. Statistical analyses included 1-way ANOVA and independent samples t tests for pairwise comparisons. The relative strength of the LLMs in comparison with that of the students was also evaluated.  Results: GPT-4 achieved the highest overall performance, correctly answering 88.1% of questions, closely followed by Bing (86.0%) and GPT-3.5-Turbo (65.7%). The students had an average correct answer rate of 74.6%. Both GPT-4 and Bing significantly outperformed the students in both examinations. When media questions were excluded, Bing achieved the highest performance of 90.7%, closely followed by GPT-4 (90.4%), while GPT-3.5-Turbo lagged (68.2%). There was a significant decline in the performance of GPT-4 and Bing in the fall 2022 examination, which was attributed to a higher proportion of media-related questions and a potential increase in question difficulty.  
Conclusions: LLMs, particularly GPT-4 and Bing, demonstrate potential as valuable tools in medical education and for pretesting examination questions. Their high performance, even relative to that of medical students, indicates promising avenues for further development and integration into the educational and clinical landscape.  %M 37665620 %R 10.2196/46482 %U https://mededu.jmir.org/2023/1/e46482 %U https://doi.org/10.2196/46482 %U http://www.ncbi.nlm.nih.gov/pubmed/37665620 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e50844 %T AI Is Changing the Landscape of Academic Writing: What Can Be Done? Authors’ Reply to: AI Increases the Pressure to Overhaul the Scientific Peer Review Process. Comment on “Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened” %A Májovský,Martin %A Mikolov,Tomas %A Netuka,David %+ Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, U Vojenské nemocnice 1200, Prague, 16000, Czech Republic, 420 973202963, majovmar@uvn.cz %K artificial intelligence %K AI %K publications %K ethics %K neurosurgery %K ChatGPT %K Chat Generative Pre-trained Transformer %K language models %K fraudulent medical articles %D 2023 %7 31.8.2023 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 37651175 %R 10.2196/50844 %U https://www.jmir.org/2023/1/e50844 %U https://doi.org/10.2196/50844 %U http://www.ncbi.nlm.nih.gov/pubmed/37651175 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e50591 %T AI Increases the Pressure to Overhaul the Scientific Peer Review Process. Comment on “Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened” %A Liu,Nicholas %A Brown,Amy %+ John A Burns School of Medicine, University of Hawai'i at Mānoa, 651 Ilalo St, Honolulu, HI, 96813, United States, 1 808 692 1000, nliu6@hawaii.edu %K artificial intelligence %K AI %K publications %K ethics %K neurosurgery %K ChatGPT %K Chat Generative Pre-trained Transformer %K language models %K fraudulent medical articles %D 2023 %7 31.8.2023 %9 Letter to the Editor %J J Med Internet Res %G English %X %M 37651167 %R 10.2196/50591 %U https://www.jmir.org/2023/1/e50591 %U https://doi.org/10.2196/50591 %U http://www.ncbi.nlm.nih.gov/pubmed/37651167 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e51584 %T Best Practices for Using AI Tools as an Author, Peer Reviewer, or Editor %A Leung,Tiffany I %A de Azevedo Cardoso,Taiane %A Mavragani,Amaryllis %A Eysenbach,Gunther %+ JMIR Publications, Inc, 130 Queens Quay East, Unit 1100, Toronto, ON, M5A 0P6, Canada, 1 416 583 2040, tiffany.leung@jmir.org %K publishing %K open access publishing %K open science %K publication policy %K science editing %K scholarly publishing %K scientific publishing %K research %K scientific research %K editorial %K artificial intelligence %K AI %D 2023 %7 31.8.2023 %9 Editorial %J J Med Internet Res %G English %X The ethics of generative artificial intelligence (AI) use in scientific manuscript content creation has become a serious matter of concern in the scientific publishing community. Generative AI has computationally become capable of elaborating research questions; refining programming code; generating text in scientific language; and generating images, graphics, or figures. However, this technology should be used with caution. 
In this editorial, we outline the current state of editorial policies on generative AI or chatbot use in authorship, peer review, and editorial processing of scientific and scholarly manuscripts. Additionally, we provide JMIR Publications’ editorial policies on these issues. We further detail JMIR Publications’ approach to the applications of AI in the editorial process for manuscripts in review in a JMIR Publications journal. %M 37651164 %R 10.2196/51584 %U https://www.jmir.org/2023/1/e51584 %U https://doi.org/10.2196/51584 %U http://www.ncbi.nlm.nih.gov/pubmed/37651164 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e51494 %T Can AI Mitigate Bias in Writing Letters of Recommendation? %A Leung,Tiffany I %A Sagar,Ankita %A Shroff,Swati %A Henry,Tracey L %+ JMIR Publications, 130 Queens Quay East, Unit 1100, Toronto, ON, M5A 0P6, Canada, 1 416 583 2040, tiffany.leung@jmir.org %K sponsorship %K implicit bias %K gender bias %K bias %K letters of recommendation %K artificial intelligence %K large language models %K medical education %K career advancement %K tenure and promotion %K promotion %K leadership %D 2023 %7 23.8.2023 %9 Editorial %J JMIR Med Educ %G English %X Letters of recommendation play a significant role in higher education and career progression, particularly for women and underrepresented groups in medicine and science. Already, there is evidence to suggest that written letters of recommendation contain language that expresses implicit biases, or unconscious biases, and that these biases occur for all recommenders regardless of the recommender’s sex. Given that all individuals have implicit biases that may influence language use, there may be opportunities to apply contemporary technologies, such as large language models or other forms of generative artificial intelligence (AI), to augment and potentially reduce implicit biases in the written language of letters of recommendation. In this editorial, we provide a brief overview of existing literature on the manifestations of implicit bias in letters of recommendation, with a focus on academia and medical education. We then highlight potential opportunities and drawbacks of applying this emerging technology in augmenting the focused, professional task of writing letters of recommendation. We also offer best practices for integrating their use into the routine writing of letters of recommendation and conclude with our outlook for the future of generative AI applications in supporting this task. 
%M 37610808 %R 10.2196/51494 %U https://mededu.jmir.org/2023/1/e51494 %U https://doi.org/10.2196/51494 %U http://www.ncbi.nlm.nih.gov/pubmed/37610808 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48659 %T Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study %A Rao,Arya %A Pang,Michael %A Kim,John %A Kamineni,Meghana %A Lie,Winston %A Prasad,Anoop K %A Landman,Adam %A Dreyer,Keith %A Succi,Marc D %+ Department of Radiology, Massachusetts General Hospital, 55 Fruit Street, Boston, MA, 02114, United States, 1 617 935 9144, msucci@partners.org %K large language models %K LLMs %K artificial intelligence %K AI %K clinical decision support %K clinical vignettes %K ChatGPT %K Generative Pre-trained Transformer %K GPT %K utility %K development %K usability %K chatbot %K accuracy %K decision-making %D 2023 %7 22.8.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language model (LLM)–based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated. Objective: This study aimed to evaluate ChatGPT’s capacity for ongoing clinical decision support via its performance on standardized clinical vignettes. Methods: We inputted all 36 published clinical vignettes from the Merck Sharpe & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT’s performance on clinical tasks. Results: ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=–15.8%; P<.001) and clinical management (β=–7.4%; P=.02) question types. Conclusions: ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT’s training data set. 
%M 37606976 %R 10.2196/48659 %U https://www.jmir.org/2023/1/e48659 %U https://doi.org/10.2196/48659 %U http://www.ncbi.nlm.nih.gov/pubmed/37606976 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48433 %T Examining Real-World Medication Consultations and Drug-Herb Interactions: ChatGPT Performance Evaluation %A Hsu,Hsing-Yu %A Hsu,Kai-Cheng %A Hou,Shih-Yen %A Wu,Ching-Lung %A Hsieh,Yow-Wen %A Cheng,Yih-Dih %+ Department of Pharmacy, China Medical University Hospital, 2 Yuh-Der Road, Taichung, 404327, Taiwan, 886 4 22052121 ext 12261, yowenhsieh@gmail.com %K ChatGPT %K large language model %K natural language processing %K real-world medication consultation questions %K NLP %K drug-herb interactions %K pharmacist %K LLM %K language models %K chat generative pre-trained transformer %D 2023 %7 21.8.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Since OpenAI released ChatGPT, with its strong capability in handling natural tasks and its user-friendly interface, it has garnered significant attention. Objective: A prospective analysis is required to evaluate the accuracy and appropriateness of medication consultation responses generated by ChatGPT. Methods: A prospective cross-sectional study was conducted by the pharmacy department of a medical center in Taiwan. The test data set comprised retrospective medication consultation questions collected from February 1, 2023, to February 28, 2023, along with common questions about drug-herb interactions. Two distinct sets of questions were tested: real-world medication consultation questions and common questions about interactions between traditional Chinese and Western medicines. We used the conventional double-review mechanism. The appropriateness of each response from ChatGPT was assessed by 2 experienced pharmacists. In the event of a discrepancy between the assessments, a third pharmacist stepped in to make the final decision. Results: Of 293 real-world medication consultation questions, a random selection of 80 was used to evaluate ChatGPT’s performance. ChatGPT exhibited a higher appropriateness rate in responding to public medication consultation questions compared to those asked by health care providers in a hospital setting (31/51, 61% vs 20/51, 39%; P=.01). Conclusions: The findings from this study suggest that ChatGPT could potentially be used for answering basic medication consultation questions. Our analysis of the erroneous information allowed us to identify potential medical risks associated with certain questions; this problem deserves our close attention. %M 37561097 %R 10.2196/48433 %U https://mededu.jmir.org/2023/1/e48433 %U https://doi.org/10.2196/48433 %U http://www.ncbi.nlm.nih.gov/pubmed/37561097 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50945 %T The Role of Large Language Models in Medical Education: Applications and Implications %A Safranek,Conrad W %A Sidamon-Eristoff,Anne Elizabeth %A Gilson,Aidan %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 9th Fl, 100 College St, New Haven, CT, 06510, United States, 1 317 440 0354, david.chartash@yale.edu %K large language models %K ChatGPT %K medical education %K LLM %K artificial intelligence in health care %K AI %K autoethnography %D 2023 %7 14.8.2023 %9 Editorial %J JMIR Med Educ %G English %X Large language models (LLMs) such as ChatGPT have sparked extensive discourse within the medical education community, spurring both excitement and apprehension. 
Written from the perspective of medical students, this editorial offers insights gleaned through immersive interactions with ChatGPT, contextualized by ongoing research into the imminent role of LLMs in health care. Three distinct positive use cases for ChatGPT were identified: facilitating differential diagnosis brainstorming, providing interactive practice cases, and aiding in multiple-choice question review. These use cases can effectively help students learn foundational medical knowledge during the preclinical curriculum while reinforcing the learning of core Entrustable Professional Activities. Simultaneously, we highlight key limitations of LLMs in medical education, including their insufficient ability to teach the integration of contextual and external information, comprehend sensory and nonverbal cues, cultivate rapport and interpersonal interaction, and align with overarching medical education and patient care goals. Through interacting with LLMs to augment learning during medical school, students can gain an understanding of their strengths and weaknesses. This understanding will be pivotal as we navigate a health care landscape increasingly intertwined with LLMs and artificial intelligence. %M 37578830 %R 10.2196/50945 %U https://mededu.jmir.org/2023/1/e50945 %U https://doi.org/10.2196/50945 %U http://www.ncbi.nlm.nih.gov/pubmed/37578830 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e50696 %T Ethical Challenges in AI Approaches to Eating Disorders %A Sharp,Gemma %A Torous,John %A West,Madeline L %+ Department of Neuroscience, Monash University, 99 Commercial Road, Melbourne, 3004, Australia, 61 421253188, gemma.sharp@monash.edu %K eating disorders %K body image %K artificial intelligence %K AI %K chatbot %K ethics %D 2023 %7 14.8.2023 %9 Editorial %J J Med Internet Res %G English %X The use of artificial intelligence (AI) to assist with the prevention, identification, and management of eating disorders and body image concerns is exciting, but it is not without risk. Technology is advancing rapidly, and ensuring that responsible standards are in place to mitigate risk and protect users is vital to the success and safety of technologies and users. %M 37578836 %R 10.2196/50696 %U https://www.jmir.org/2023/1/e50696 %U https://doi.org/10.2196/50696 %U http://www.ncbi.nlm.nih.gov/pubmed/37578836 %0 Journal Article %@ 1929-073X %I JMIR Publications %V 12 %N %P e46900 %T Appropriateness and Comprehensiveness of Using ChatGPT for Perioperative Patient Education in Thoracic Surgery in Different Language Contexts: Survey Study %A Shao,Chen-ye %A Li,Hui %A Liu,Xiao-long %A Li,Chang %A Yang,Li-qin %A Zhang,Yue-juan %A Luo,Jing %A Zhao,Jun %+ Department of Thoracic Surgery, The First Affiliated Hospital of Soochow University, 899 Pinghai Road, Gusu District, Suzhou, 215006, China, 86 15250965957, zhaojia0327@126.com %K patient education %K ChatGPT %K Generative Pre-trained Transformer %K thoracic surgery %K evaluation %K patient %K education %K surgery %K thoracic %K language %K language model %K clinical workflow %K artificial intelligence %K AI %K workflow %K communication %K feasibility %D 2023 %7 14.8.2023 %9 Short Paper %J Interact J Med Res %G English %X Background: ChatGPT, a dialogue-based artificial intelligence language model, has shown promise in assisting clinical workflows and patient-clinician communication. However, there is a lack of feasibility assessments regarding its use for perioperative patient education in thoracic surgery. 
Objective: This study aimed to assess the appropriateness and comprehensiveness of using ChatGPT for perioperative patient education in thoracic surgery in both English and Chinese contexts. Methods: This pilot study was conducted in February 2023. A total of 37 questions focused on perioperative patient education in thoracic surgery were created based on guidelines and clinical experience. Two sets of inquiries were made to ChatGPT for each question, one in English and the other in Chinese. The responses generated by ChatGPT were evaluated separately by experienced thoracic surgical clinicians for appropriateness and comprehensiveness based on a hypothetical draft response to a patient’s question on the electronic information platform. For a response to be qualified, it required at least 80% of reviewers to deem it appropriate and 50% to deem it comprehensive. Statistical analyses were performed using the unpaired chi-square test or Fisher exact test, with a significance level set at P<.05. Results: The set of 37 commonly asked questions covered topics such as disease information, diagnostic procedures, perioperative complications, treatment measures, disease prevention, and perioperative care considerations. In both the English and Chinese contexts, 34 (92%) out of 37 responses were qualified in terms of both appropriateness and comprehensiveness. The remaining 3 (8%) responses were unqualified in these 2 contexts. The unqualified responses primarily involved the diagnosis of disease symptoms and symptoms of surgery-related complications. The reasons for determining the responses as unqualified were similar in both contexts. There was no statistically significant difference (34/37, 92% vs 34/37, 92%; P=.99) in the qualification rate between the 2 language sets. Conclusions: This pilot study demonstrates the potential feasibility of using ChatGPT for perioperative patient education in thoracic surgery in both English and Chinese contexts. ChatGPT is expected to enhance patient satisfaction, reduce anxiety, and improve compliance during the perioperative period. In the future, artificial intelligence, used in conjunction with human review, holds remarkable potential for patient education and health consultation after patients have provided their informed consent. %M 37578819 %R 10.2196/46900 %U https://www.i-jmr.org/2023/1/e46900 %U https://doi.org/10.2196/46900 %U http://www.ncbi.nlm.nih.gov/pubmed/37578819 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48009 %T Ethical Considerations of Using ChatGPT in Health Care %A Wang,Changyu %A Liu,Siru %A Yang,Hao %A Guo,Jiulin %A Wu,Yuxuan %A Liu,Jialin %+ Information Center, West China Hospital, Sichuan University, No 37 Guo Xue Xiang, Chengdu, 610041, China, 86 28 85422306, DLJL8@163.com %K ethics %K ChatGPT %K artificial intelligence %K AI %K large language models %K health care %K artificial intelligence development %K development %K algorithm %K patient safety %K patient privacy %K safety %K privacy %D 2023 %7 11.8.2023 %9 Viewpoint %J J Med Internet Res %G English %X ChatGPT has promising applications in health care, but potential ethical issues need to be addressed proactively to prevent harm. ChatGPT presents potential ethical challenges from legal, humanistic, algorithmic, and informational perspectives. Legal ethics concerns arise from the unclear allocation of responsibility when patient harm occurs and from potential breaches of patient privacy due to data collection. 
Clear rules and legal boundaries are needed to properly allocate liability and protect users. Humanistic ethics concerns arise from the potential disruption of the physician-patient relationship, humanistic care, and issues of integrity. Overreliance on artificial intelligence (AI) can undermine compassion and erode trust. Transparency and disclosure of AI-generated content are critical to maintaining integrity. Algorithmic ethics raise concerns about algorithmic bias, responsibility, transparency and explainability, as well as validation and evaluation. Information ethics include data bias, validity, and effectiveness. Biased training data can lead to biased output, and overreliance on ChatGPT can reduce patient adherence and encourage self-diagnosis. Ensuring the accuracy, reliability, and validity of ChatGPT-generated content requires rigorous validation and ongoing updates based on clinical practice. To navigate the evolving ethical landscape of AI, AI in health care must adhere to the strictest ethical standards. Through comprehensive ethical guidelines, health care professionals can ensure the responsible use of ChatGPT, promote accurate and reliable information exchange, protect patient privacy, and empower patients to make informed decisions about their health care. %M 37566454 %R 10.2196/48009 %U https://www.jmir.org/2023/1/e48009 %U https://doi.org/10.2196/48009 %U http://www.ncbi.nlm.nih.gov/pubmed/37566454 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48978 %T Performance of ChatGPT on the Situational Judgement Test—A Professional Dilemmas–Based Examination for Doctors in the United Kingdom %A Borchert,Robin J %A Hickman,Charlotte R %A Pepys,Jack %A Sadler,Timothy J %+ Department of Radiology, University of Cambridge, Hills Road, Cambridge, CB2 0QQ, United Kingdom, 1 1223 805000, rb729@medschl.cam.ac.uk %K ChatGPT %K language models %K Situational Judgement Test %K medical education %K artificial intelligence %K language model %K exam %K examination %K SJT %K judgement %K reasoning %K communication %K chatbot %D 2023 %7 7.8.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: ChatGPT is a large language model that has performed well on professional examinations in the fields of medicine, law, and business. However, it is unclear how ChatGPT would perform on an examination assessing professionalism and situational judgement for doctors. Objective: We evaluated the performance of ChatGPT on the Situational Judgement Test (SJT): a national examination taken by all final-year medical students in the United Kingdom. This examination is designed to assess attributes such as communication, teamwork, patient safety, prioritization skills, professionalism, and ethics. Methods: All questions from the UK Foundation Programme Office’s (UKFPO’s) 2023 SJT practice examination were inputted into ChatGPT. For each question, ChatGPT’s answers and rationales were recorded and assessed on the basis of the official UK Foundation Programme Office scoring template. Questions were categorized into domains of Good Medical Practice on the basis of the domains referenced in the rationales provided in the scoring sheet. Questions without clear domain links were screened by reviewers and assigned one or multiple domains. ChatGPT's overall performance, as well as its performance across the domains of Good Medical Practice, was evaluated. 
Results: Overall, ChatGPT performed well, scoring 76% on the SJT but scoring full marks on only a few questions (9%), which may reflect possible flaws in ChatGPT’s situational judgement or inconsistencies in the reasoning across questions (or both) in the examination itself. ChatGPT demonstrated consistent performance across the 4 outlined domains in Good Medical Practice for doctors. Conclusions: Further research is needed to understand the potential applications of large language models, such as ChatGPT, in medical education for standardizing questions and providing consistent rationales for examinations assessing professionalism and ethics. %M 37548997 %R 10.2196/48978 %U https://mededu.jmir.org/2023/1/e48978 %U https://doi.org/10.2196/48978 %U http://www.ncbi.nlm.nih.gov/pubmed/37548997 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48498 %T Toward Community-Based Natural Language Processing (CBNLP): Cocreating With Communities %A Pillai,Malvika %A Griffin,Ashley C %A Kronk,Clair A %A McCall,Terika %+ Center for Biomedical Informatics Research, Stanford University School of Medicine, 1265 Welch Rd, Stanford, CA, 94305, United States, 1 650 724 3979, mpillai@stanford.edu %K ChatGPT %K natural language processing %K community-based participatory research %K research design %K artificial intelligence %K participatory %K co-design %K machine learning %K co-creation %K community based %K lived experience %K lived experiences %K collaboration %K collaborative %D 2023 %7 4.8.2023 %9 Viewpoint %J J Med Internet Res %G English %X Rapid development and adoption of natural language processing (NLP) techniques has led to a multitude of exciting and innovative societal and health care applications. These advancements have also generated concerns around perpetuation of historical injustices and that these tools lack cultural considerations. While traditional health care NLP techniques typically include clinical subject matter experts to extract health information or aid in interpretation, few NLP tools involve community stakeholders with lived experiences. In this perspective paper, we draw upon the field of community-based participatory research, which gathers input from community members for development of public health interventions, to identify and examine ways to equitably involve communities in developing health care NLP tools. To realize the potential of community-based NLP (CBNLP), research and development teams must thoughtfully consider mechanisms and resources needed to effectively collaborate with community members for maximal societal and ethical impact of NLP-based tools. 
%M 37540551 %R 10.2196/48498 %U https://www.jmir.org/2023/1/e48498 %U https://doi.org/10.2196/48498 %U http://www.ncbi.nlm.nih.gov/pubmed/37540551 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48966 %T ChatGPT vs Google for Queries Related to Dementia and Other Cognitive Decline: Comparison of Results %A Hristidis,Vagelis %A Ruggiano,Nicole %A Brown,Ellen L %A Ganta,Sai Rithesh Reddy %A Stewart,Selena %+ Department of Computer Science and Engineering, University of California, Riverside, Winston Chung Hall, Room 317, Riverside, CA, 92521, United States, 1 9518272478, vagelis@cs.ucr.edu %K chatbots %K large language models %K ChatGPT %K web search %K language model %K Google %K aging %K cognitive %K cognition %K dementia %K gerontology %K geriatric %K geriatrics %K query %K queries %K information seeking %K search %D 2023 %7 25.7.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: People living with dementia or other cognitive decline and their caregivers (PLWD) increasingly rely on the web to find information about their condition and available resources and services. The recent advancements in large language models (LLMs), such as ChatGPT, provide a new alternative to the more traditional web search engines, such as Google. Objective: This study compared the quality of the results of ChatGPT and Google for a collection of PLWD-related queries. Methods: A set of 30 informational and 30 service delivery (transactional) PLWD-related queries were selected and submitted to both Google and ChatGPT. Three domain experts assessed the results for their currency of information, reliability of the source, objectivity, relevance to the query, and similarity of their response. The readability of the results was also analyzed. Interrater reliability coefficients were calculated for all outcomes. Results: Google had superior currency and higher reliability. ChatGPT results were evaluated as more objective. ChatGPT had a significantly higher response relevance, while Google often drew upon sources that were referral services for dementia care or service providers themselves. The readability was low for both platforms, especially for ChatGPT (mean grade level 12.17, SD 1.94) compared to Google (mean grade level 9.86, SD 3.47). The similarity between the content of ChatGPT and Google responses was rated as high for 13 (21.7%) responses, medium for 16 (26.7%) responses, and low for 31 (51.6%) responses. Conclusions: Both Google and ChatGPT have strengths and weaknesses. ChatGPT rarely includes the source of a result. Google more often provides a date for and a known reliable source of the response compared to ChatGPT, whereas ChatGPT supplies more relevant responses to queries. The results of ChatGPT may be out of date and often do not specify a validity time stamp. Google sometimes returns results based on commercial entities. The readability scores for both indicate that responses are often not appropriate for persons with low health literacy skills. In the future, the addition of both the source and the date of health-related information and availability in other languages may increase the value of these platforms for both nonmedical and medical professionals. 
%M 37490317 %R 10.2196/48966 %U https://www.jmir.org/2023/1/e48966 %U https://doi.org/10.2196/48966 %U http://www.ncbi.nlm.nih.gov/pubmed/37490317 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e50336 %T Authors’ Reply to: Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 100 College Street, 9th Fl, New Haven, CT, 06510, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440299 %R 10.2196/50336 %U https://mededu.jmir.org/2023/1/e50336 %U https://doi.org/10.2196/50336 %U http://www.ncbi.nlm.nih.gov/pubmed/37440299 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48305 %T Variability in Large Language Models’ Responses to Medical Licensing and Certification Examinations. Comment on “How Does ChatGPT Perform on the United States Medical Licensing Examination? The Implications of Large Language Models for Medical Education and Knowledge Assessment” %A Epstein,Richard H %A Dexter,Franklin %+ Department of Anesthesiology, Perioperative Medicine and Pain Management, University of Miami Miller School of Medicine, 1400 NW 12th Ave, Suite 4022F, Miami, FL, 33136, United States, 1 215 896 7850, repstein@med.miami.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K AI %K education technology %K ChatGPT %K Google Bard %K conversational agent %K machine learning %K large language models %K knowledge assessment %D 2023 %7 13.7.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X %M 37440293 %R 10.2196/48305 %U https://mededu.jmir.org/2023/1/e48305 %U https://doi.org/10.2196/48305 %U http://www.ncbi.nlm.nih.gov/pubmed/37440293 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46939 %T Putting ChatGPT’s Medical Advice to the (Turing) Test: Survey Study %A Nov,Oded %A Singh,Nina %A Mann,Devin %+ Department of Technology Management, Tandon School of Engineering, New York University, 5 Metrotech, Brooklyn, New York, NY, 11201, United States, 1 646 207 7864, onov@nyu.edu %K artificial intelligence %K AI %K ChatGPT %K large language model %K patient-provider interaction %K chatbot %K feasibility %K ethics %K privacy %K language model %K machine learning %D 2023 %7 10.7.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Chatbots are being piloted to draft responses to patient questions, but patients’ ability to distinguish between provider and chatbot responses and patients’ trust in chatbots’ functions are not well established. Objective: This study aimed to assess the feasibility of using ChatGPT (Chat Generative Pre-trained Transformer) or a similar artificial intelligence–based chatbot for patient-provider communication. Methods: A survey study was conducted in January 2023. Ten representative, nonadministrative patient-provider interactions were extracted from the electronic health record. 
Patients’ questions were entered into ChatGPT with a request for the chatbot to respond using approximately the same word count as the human provider’s response. In the survey, each patient question was followed by a provider- or ChatGPT-generated response. Participants were informed that 5 responses were provider generated and 5 were chatbot generated. Participants were asked—and incentivized financially—to correctly identify the response source. Participants were also asked about their trust in chatbots’ functions in patient-provider communication, using a Likert scale from 1-5. Results: A US-representative sample of 430 study participants aged 18 and older were recruited on Prolific, a crowdsourcing platform for academic studies. In all, 426 participants filled out the full survey. After removing participants who spent less than 3 minutes on the survey, 392 respondents remained. Overall, 53.3% (209/392) of respondents analyzed were women, and the average age was 47.1 (range 18-91) years. The correct classification of responses ranged between 49% (192/392) to 85.7% (336/392) for different questions. On average, chatbot responses were identified correctly in 65.5% (1284/1960) of the cases, and human provider responses were identified correctly in 65.1% (1276/1960) of the cases. On average, responses toward patients’ trust in chatbots’ functions were weakly positive (mean Likert score 3.4 out of 5), with lower trust as the health-related complexity of the task in the questions increased. Conclusions: ChatGPT responses to patient questions were weakly distinguishable from provider responses. Laypeople appear to trust the use of chatbots to answer lower-risk health questions. It is important to continue studying patient-chatbot interaction as chatbots move from administrative to more clinical roles in health care. %M 37428540 %R 10.2196/46939 %U https://mededu.jmir.org/2023/1/e46939 %U https://doi.org/10.2196/46939 %U http://www.ncbi.nlm.nih.gov/pubmed/37428540 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e47479 %T Reliability of Medical Information Provided by ChatGPT: Assessment Against Clinical Guidelines and Patient Information Quality Instrument %A Walker,Harriet Louise %A Ghani,Shahi %A Kuemmerli,Christoph %A Nebiker,Christian Andreas %A Müller,Beat Peter %A Raptis,Dimitri Aristotle %A Staubli,Sebastian Manuel %+ Royal Free London NHS Foundation Trust, Pond Street, London, NW3 2QG, United Kingdom, 44 20 7794 0500, s.staubli@nhs.net %K artificial intelligence %K internet information %K patient information %K ChatGPT %K EQIP tool %K chatbot %K chatbots %K conversational agent %K conversational agents %K internal medicine %K pancreas %K liver %K hepatic %K biliary %K gall %K bile %K gallstone %K pancreatitis %K pancreatic %K medical information %D 2023 %7 30.6.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT-4 is the latest release of a novel artificial intelligence (AI) chatbot able to answer freely formulated and complex questions. In the near future, ChatGPT could become the new standard for health care professionals and patients to access medical information. However, little is known about the quality of medical information provided by the AI. Objective: We aimed to assess the reliability of medical information provided by ChatGPT. 
Methods: Medical information provided by ChatGPT-4 on the 5 hepato-pancreatico-biliary (HPB) conditions with the highest global disease burden was measured with the Ensuring Quality Information for Patients (EQIP) tool. The EQIP tool is used to measure the quality of internet-available information and consists of 36 items that are divided into 3 subsections. In addition, 5 guideline recommendations per analyzed condition were rephrased as questions and input to ChatGPT, and agreement between the guidelines and the AI answer was measured by 2 authors independently. All queries were repeated 3 times to measure the internal consistency of ChatGPT. Results: Five conditions were identified (gallstone disease, pancreatitis, liver cirrhosis, pancreatic cancer, and hepatocellular carcinoma). The median EQIP score across all conditions was 16 (IQR 14.5-18) for the total of 36 items. Divided by subsection, median scores for content, identification, and structure data were 10 (IQR 9.5-12.5), 1 (IQR 1-1), and 4 (IQR 4-5), respectively. Agreement between guideline recommendations and answers provided by ChatGPT was 60% (15/25). Interrater agreement as measured by the Fleiss κ was 0.78 (P<.001), indicating substantial agreement. Internal consistency of the answers provided by ChatGPT was 100%. Conclusions: ChatGPT provides medical information of comparable quality to available static internet information. Although currently of limited quality, large language models could become the future standard for patients and health care professionals to gather medical information. %M 37389908 %R 10.2196/47479 %U https://www.jmir.org/2023/1/e47479 %U https://doi.org/10.2196/47479 %U http://www.ncbi.nlm.nih.gov/pubmed/37389908 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48002 %T Performance of GPT-3.5 and GPT-4 on the Japanese Medical Licensing Examination: Comparison Study %A Takagi,Soshi %A Watari,Takashi %A Erabi,Ayano %A Sakaguchi,Kota %+ General Medicine Center, Shimane University Hospital, 89-1, Enya, Izumo, 693-8501, Japan, 81 0853 20 2217, wataritari@gmail.com %K ChatGPT %K Chat Generative Pre-trained Transformer %K GPT-4 %K Generative Pre-trained Transformer 4 %K artificial intelligence %K AI %K medical education %K Japanese Medical Licensing Examination %K medical licensing %K clinical support %K learning model %D 2023 %7 29.6.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: The competence of ChatGPT (Chat Generative Pre-Trained Transformer) in non-English languages is not well studied. Objective: This study compared the performances of GPT-3.5 (Generative Pre-trained Transformer) and GPT-4 on the Japanese Medical Licensing Examination (JMLE) to evaluate the reliability of these models for clinical reasoning and medical knowledge in non-English languages. Methods: This study used the default mode of ChatGPT, which is based on GPT-3.5; the GPT-4 model of ChatGPT Plus; and the 117th JMLE in 2023. A total of 254 questions were included in the final analysis, which were categorized into 3 types, namely general, clinical, and clinical sentence questions. Results: The results indicated that GPT-4 outperformed GPT-3.5 in terms of accuracy, particularly for general, clinical, and clinical sentence questions. GPT-4 also performed better on difficult questions and specific disease questions. Furthermore, GPT-4 achieved the passing criteria for the JMLE, indicating its reliability for clinical reasoning and medical knowledge in non-English languages. 
Conclusions: GPT-4 could become a valuable tool for medical education and clinical support in non–English-speaking regions, such as Japan. %M 37384388 %R 10.2196/48002 %U https://mededu.jmir.org/2023/1/e48002 %U https://doi.org/10.2196/48002 %U http://www.ncbi.nlm.nih.gov/pubmed/37384388 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48568 %T Utility of ChatGPT in Clinical Practice %A Liu,Jialin %A Wang,Changyu %A Liu,Siru %+ Department of Biomedical Informatics, Vanderbilt University Medical Center, 2525 West End Ave #1475, Nashville, TN, 37212, United States, 1 615 875 5216, siru.liu@vumc.org %K ChatGPT %K artificial intelligence %K large language models %K clinical practice %K large language model %K natural language processing %K NLP %K doctor-patient %K patient-physician %K communication %K challenges %K barriers %K recommendations %K guidance %K guidelines %K best practices %K risks %D 2023 %7 28.6.2023 %9 Viewpoint %J J Med Internet Res %G English %X ChatGPT is receiving increasing attention and has a variety of application scenarios in clinical practice. In clinical decision support, ChatGPT has been used to generate accurate differential diagnosis lists, support clinical decision-making, optimize clinical decision support, and provide insights for cancer screening decisions. In addition, ChatGPT has been used for intelligent question-answering to provide reliable information about diseases and medical queries. In terms of medical documentation, ChatGPT has proven effective in generating patient clinical letters, radiology reports, medical notes, and discharge summaries, improving efficiency and accuracy for health care providers. Future research directions include real-time monitoring and predictive analytics, precision medicine and personalized treatment, the role of ChatGPT in telemedicine and remote health care, and integration with existing health care systems. Overall, ChatGPT is a valuable tool that complements the expertise of health care providers and improves clinical decision-making and patient care. However, ChatGPT is a double-edged sword. We need to carefully consider and study the benefits and potential dangers of ChatGPT. In this viewpoint, we discuss recent advances in ChatGPT research in clinical practice and suggest possible risks and challenges of using ChatGPT in clinical practice. It will help guide and support future artificial intelligence research similar to ChatGPT in health. 
%M 37379067 %R 10.2196/48568 %U https://www.jmir.org/2023/1/e48568 %U https://doi.org/10.2196/48568 %U http://www.ncbi.nlm.nih.gov/pubmed/37379067 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e48392 %T The ChatGPT (Generative Artificial Intelligence) Revolution Has Made Artificial Intelligence Approachable for Medical Professionals %A Mesko,Bertalan %+ The Medical Futurist Institute, Povl Bang-Jensen u 2/B1 4/1, Budapest, 1118, Hungary, 36 703807260, berci@medicalfuturist.com %K artificial intelligence %K digital health %K future %K technology %K ChatGPT %K medical practice %K large language model %K language model %K generative %K conversational agent %K conversation agents %K chatbot %K generated text %K computer generated %K medical education %K continuing education %K professional development %K curriculum %K curricula %D 2023 %7 22.6.2023 %9 Viewpoint %J J Med Internet Res %G English %X In November 2022, OpenAI publicly launched its large language model (LLM), ChatGPT, and reached the milestone of having over 100 million users in only 2 months. LLMs have been shown to be useful in a myriad of health care–related tasks and processes. In this paper, I argue that attention to, public access to, and debate about LLMs have initiated a wave of products and services using generative artificial intelligence (AI), which had previously found it hard to attract physicians. This paper describes what AI tools have become available since the beginning of the ChatGPT revolution and contemplates how they might change physicians’ perceptions about this breakthrough technology. %M 37347508 %R 10.2196/48392 %U https://www.jmir.org/2023/1/e48392 %U https://doi.org/10.2196/48392 %U http://www.ncbi.nlm.nih.gov/pubmed/37347508 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e47184 %T Investigating the Impact of User Trust on the Adoption and Use of ChatGPT: Survey Analysis %A Choudhury,Avishek %A Shamszare,Hamid %+ Industrial and Management Systems Engineering, Benjamin M. Statler College of Engineering and Mineral Resources, West Virginia University, 321 Engineering Sciences Building, 1306 Evansdale Drive, Morgantown, WV, 26506, United States, 1 304 293 9431, avishek.choudhury@mail.wvu.edu %K ChatGPT %K trust in AI %K artificial intelligence %K technology adoption %K behavioral intention %K chatbot %K human factors %K trust %K adoption %K intent %K survey %K shared accountability %K AI policy %D 2023 %7 14.6.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: ChatGPT (Chat Generative Pre-trained Transformer) has gained popularity for its ability to generate human-like responses. It is essential to note that overreliance or blind trust in ChatGPT, especially in high-stakes decision-making contexts, can have severe consequences. Similarly, lacking trust in the technology can lead to underuse, resulting in missed opportunities. Objective: This study investigated the impact of users’ trust in ChatGPT on their intent and actual use of the technology. Four hypotheses were tested: (1) users’ intent to use ChatGPT increases with their trust in the technology; (2) the actual use of ChatGPT increases with users’ intent to use the technology; (3) the actual use of ChatGPT increases with users’ trust in the technology; and (4) users’ intent to use ChatGPT can partially mediate the effect of trust in the technology on its actual use. 
Methods: This study distributed a web-based survey to adults in the United States who actively use ChatGPT (version 3.5) at least once a month between February and March 2023. The survey responses were used to develop 2 latent constructs: Trust and Intent to Use, with Actual Use being the outcome variable. The study used partial least squares structural equation modeling to evaluate and test the structural model and hypotheses. Results: In the study, 607 respondents completed the survey. The primary uses of ChatGPT were for information gathering (n=219, 36.1%), entertainment (n=203, 33.4%), and problem-solving (n=135, 22.2%), with a smaller number using it for health-related queries (n=44, 7.2%) and other activities (n=6, 1%). Our model explained 50.5% and 9.8% of the variance in Intent to Use and Actual Use, respectively, with path coefficients of 0.711 and 0.221 for Trust on Intent to Use and Actual Use, respectively. The bootstrapped results rejected all 4 null hypotheses, with Trust having a significant direct effect on both Intent to Use (β=0.711, 95% CI 0.656-0.764) and Actual Use (β=0.302, 95% CI 0.229-0.374). The indirect effect of Trust on Actual Use, partially mediated by Intent to Use, was also significant (β=0.113, 95% CI 0.001-0.227). Conclusions: Our results suggest that trust is critical to users’ adoption of ChatGPT. It remains crucial to highlight that ChatGPT was not initially designed for health care applications. Therefore, an overreliance on it for health-related advice could potentially lead to misinformation and subsequent health risks. Efforts must be focused on improving ChatGPT’s ability to distinguish between queries that it can safely handle and those that should be redirected to human experts (health care professionals). Although risks are associated with excessive trust in artificial intelligence–driven chatbots such as ChatGPT, the potential risks can be reduced by advocating for shared accountability and fostering collaboration between developers, subject matter experts, and human factors researchers. %M 37314848 %R 10.2196/47184 %U https://www.jmir.org/2023/1/e47184 %U https://doi.org/10.2196/47184 %U http://www.ncbi.nlm.nih.gov/pubmed/37314848 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48163 %T The Advent of Generative Language Models in Medical Education %A Karabacak,Mert %A Ozkara,Burak Berksu %A Margetis,Konstantinos %A Wintermark,Max %A Bisdas,Sotirios %+ Department of Neuroradiology, The National Hospital for Neurology and Neurosurgery, University College London NHS Foundation Trust, National Hospital for Neurology and Neurosurgery, Queen Square, London, WC1N 3BG, United Kingdom, 44 020 3448 3446, s.bisdas@ucl.ac.uk %K generative language model %K artificial intelligence %K medical education %K ChatGPT %K academic integrity %K AI-driven feedback %K stimulation %K evaluation %K technology %K learning environment %K medical student %D 2023 %7 6.6.2023 %9 Viewpoint %J JMIR Med Educ %G English %X Artificial intelligence (AI) and generative language models (GLMs) present significant opportunities for enhancing medical education, including the provision of realistic simulations, digital patients, personalized feedback, evaluation methods, and the elimination of language barriers. These advanced technologies can facilitate immersive learning environments and enhance medical students’ educational outcomes. However, ensuring content quality, addressing biases, and managing ethical and legal concerns present obstacles. 
To mitigate these challenges, it is necessary to evaluate the accuracy and relevance of AI-generated content, address potential biases, and develop guidelines and policies governing the use of AI-generated content in medical education. Collaboration among educators, researchers, and practitioners is essential for developing best practices, guidelines, and transparent AI models that encourage the ethical and responsible use of GLMs and AI in medical education. By sharing information about the data used for training, obstacles encountered, and evaluation methods, developers can increase their credibility and trustworthiness within the medical community. In order to realize the full potential of AI and GLMs in medical education while mitigating potential risks and obstacles, ongoing research and interdisciplinary collaboration are necessary. By collaborating, medical professionals can ensure that these technologies are effectively and responsibly integrated, contributing to enhanced learning experiences and patient care. %M 37279048 %R 10.2196/48163 %U https://mededu.jmir.org/2023/1/e48163 %U https://doi.org/10.2196/48163 %U http://www.ncbi.nlm.nih.gov/pubmed/37279048 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e48291 %T Large Language Models in Medical Education: Opportunities, Challenges, and Future Directions %A Abd-alrazaq,Alaa %A AlSaad,Rawan %A Alhuwail,Dari %A Ahmed,Arfan %A Healy,Padraig Mark %A Latifi,Syed %A Aziz,Sarah %A Damseh,Rafat %A Alabed Alrazak,Sadam %A Sheikh,Javaid %+ AI Center for Precision Health, Weill Cornell Medicine-Qatar, PO Box 5825, Doha Al Luqta St, Ar-Rayyan, Doha, NA, Qatar, 974 55708549, alaa_alzoubi88@yahoo.com %K large language models %K artificial intelligence %K medical education %K ChatGPT %K GPT-4 %K generative AI %K students %K educators %D 2023 %7 1.6.2023 %9 Viewpoint %J JMIR Med Educ %G English %X The integration of large language models (LLMs), such as those in the Generative Pre-trained Transformers (GPT) series, into medical education has the potential to transform learning experiences for students and elevate their knowledge, skills, and competence. Drawing on a wealth of professional and academic experience, we propose that LLMs hold promise for revolutionizing medical curriculum development, teaching methodologies, personalized study plans and learning materials, student assessments, and more. However, we also critically examine the challenges that such integration might pose by addressing issues of algorithmic bias, overreliance, plagiarism, misinformation, inequity, privacy, and copyright concerns in medical education. As we navigate the shift from an information-driven educational paradigm to an artificial intelligence (AI)–driven educational paradigm, we argue that it is paramount to understand both the potential and the pitfalls of LLMs in medical education. This paper thus offers our perspective on the opportunities and challenges of using LLMs in this context. We believe that the insights gleaned from this analysis will serve as a foundation for future recommendations and best practices in the field, fostering the responsible and effective use of AI technologies in medical education. 
%M 37261894 %R 10.2196/48291 %U https://mededu.jmir.org/2023/1/e48291 %U https://doi.org/10.2196/48291 %U http://www.ncbi.nlm.nih.gov/pubmed/37261894 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e49323 %T Open Science and Software Assistance: Commentary on “Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened” %A Ballester,Pedro L %+ Neuroscience Graduate Program, McMaster University, 1280 Main Street West, Hamilton, ON, L8S 4L8, Canada, 1 905 525 9140, pedballester@gmail.com %K artificial intelligence %K AI %K ChatGPT %K open science %K reproducibility %K software assistance %D 2023 %7 31.5.2023 %9 Commentary %J J Med Internet Res %G English %X Májovský and colleagues have investigated the important issue of ChatGPT being used for the complete generation of scientific works, including fake data and tables. The issues behind why ChatGPT poses a significant concern to research reach far beyond the model itself. Once again, the lack of reproducibility and visibility of scientific works creates an environment where fraudulent or inaccurate work can thrive. What are some of the ways in which we can handle this new situation? %M 37256656 %R 10.2196/49323 %U https://www.jmir.org/2023/1/e49323 %U https://doi.org/10.2196/49323 %U http://www.ncbi.nlm.nih.gov/pubmed/37256656 %0 Journal Article %@ 1438-8871 %I JMIR Publications %V 25 %N %P e46924 %T Artificial Intelligence Can Generate Fraudulent but Authentic-Looking Scientific Medical Articles: Pandora’s Box Has Been Opened %A Májovský,Martin %A Černý,Martin %A Kasal,Matěj %A Komarc,Martin %A Netuka,David %+ Department of Neurosurgery and Neurooncology, First Faculty of Medicine, Charles University, U Vojenské nemocnice 1200, Prague, 16000, Czech Republic, 420 973202963, majovmar@uvn.cz %K artificial intelligence %K publications %K ethics %K neurosurgery %K ChatGPT %K language models %K fraudulent medical articles %D 2023 %7 31.5.2023 %9 Original Paper %J J Med Internet Res %G English %X Background: Artificial intelligence (AI) has advanced substantially in recent years, transforming many industries and improving the way people live and work. In scientific research, AI can enhance the quality and efficiency of data analysis and publication. However, AI has also opened up the possibility of generating high-quality fraudulent papers that are difficult to detect, raising important questions about the integrity of scientific research and the trustworthiness of published papers. Objective: The aim of this study was to investigate the capabilities of current AI language models in generating high-quality fraudulent medical articles. We hypothesized that modern AI models can create highly convincing fraudulent papers that can easily deceive readers and even experienced researchers. Methods: This proof-of-concept study used ChatGPT (Chat Generative Pre-trained Transformer) powered by the GPT-3 (Generative Pre-trained Transformer 3) language model to generate a fraudulent scientific article related to neurosurgery. GPT-3 is a large language model developed by OpenAI that uses deep learning algorithms to generate human-like text in response to prompts given by users. The model was trained on a massive corpus of text from the internet and is capable of generating high-quality text in a variety of languages and on various topics. The authors posed questions and prompts to the model and refined them iteratively as the model generated the responses. 
The goal was to create a completely fabricated article including the abstract, introduction, material and methods, discussion, references, charts, etc. Once the article was generated, it was reviewed for accuracy and coherence by experts in the fields of neurosurgery, psychiatry, and statistics and compared to existing similar articles. Results: The study found that the AI language model could create a highly convincing fraudulent article that resembled a genuine scientific paper in terms of word usage, sentence structure, and overall composition. The AI-generated article included standard sections such as introduction, material and methods, results, and discussion, as well as a data sheet. It consisted of 1992 words and 17 citations, and the whole process of article creation took approximately 1 hour without any special training of the human user. However, there were some concerns and specific mistakes identified in the generated article, specifically in the references. Conclusions: The study demonstrates the potential of current AI language models to generate completely fabricated scientific articles. Although the papers look sophisticated and seemingly flawless, expert readers may identify semantic inaccuracies and errors upon closer inspection. We highlight the need for increased vigilance and better detection methods to combat the potential misuse of AI in scientific research. At the same time, it is important to recognize the potential benefits of using AI language models in genuine scientific writing and research, such as manuscript preparation and language editing. %M 37256685 %R 10.2196/46924 %U https://www.jmir.org/2023/1/e46924 %U https://doi.org/10.2196/46924 %U http://www.ncbi.nlm.nih.gov/pubmed/37256685 %0 Journal Article %@ 2292-9495 %I JMIR Publications %V 10 %N %P e47564 %T User Intentions to Use ChatGPT for Self-Diagnosis and Health-Related Purposes: Cross-sectional Survey Study %A Shahsavar,Yeganeh %A Choudhury,Avishek %+ Industrial and Management Systems Engineering, Benjamin M Statler College of Engineering and Mineral Resources, West Virginia University, 1306 Evansdale Drive, 321 Engineering Sciences Building, Morgantown, WV, 26506, United States, 1 3042934970, avishek.choudhury@mail.wvu.edu %K human factors %K behavioral intention %K chatbots %K health care %K integrated diagnostics %K use %K ChatGPT %K artificial intelligence %K users %K self-diagnosis %K decision-making %K integration %K willingness %K policy %D 2023 %7 17.5.2023 %9 Original Paper %J JMIR Hum Factors %G English %X Background: With the rapid advancement of artificial intelligence (AI) technologies, AI-powered chatbots, such as Chat Generative Pretrained Transformer (ChatGPT), have emerged as potential tools for various applications, including health care. However, ChatGPT is not specifically designed for health care purposes, and its use for self-diagnosis raises concerns regarding the potential risks and benefits of its adoption. Users are increasingly inclined to use ChatGPT for self-diagnosis, necessitating a deeper understanding of the factors driving this trend. Objective: This study aims to investigate the factors influencing users’ perception of decision-making processes and intentions to use ChatGPT for self-diagnosis and to explore the implications of these findings for the safe and effective integration of AI chatbots in health care. Methods: A cross-sectional survey design was used, and data were collected from 607 participants. 
The relationships between performance expectancy, risk-reward appraisal, decision-making, and intention to use ChatGPT for self-diagnosis were analyzed using partial least squares structural equation modeling (PLS-SEM). Results: Most respondents were willing to use ChatGPT for self-diagnosis (n=476, 78.4%). The model demonstrated satisfactory explanatory power, accounting for 52.4% of the variance in decision-making and 38.1% in the intent to use ChatGPT for self-diagnosis. The results supported all 3 hypotheses: The higher performance expectancy of ChatGPT (β=.547, 95% CI 0.474-0.620) and positive risk-reward appraisals (β=.245, 95% CI 0.161-0.325) were positively associated with the improved perception of decision-making outcomes among users, and enhanced perception of decision-making processes involving ChatGPT positively impacted users’ intentions to use the technology for self-diagnosis (β=.565, 95% CI 0.498-0.628). Conclusions: Our research investigated factors influencing users’ intentions to use ChatGPT for self-diagnosis and health-related purposes. Even though the technology is not specifically designed for health care, people are inclined to use ChatGPT in health care contexts. Instead of solely focusing on discouraging its use for health care purposes, we advocate for improving the technology and adapting it for suitable health care applications. Our study highlights the importance of collaboration among AI developers, health care providers, and policy makers in ensuring AI chatbots’ safe and responsible use in health care. By understanding users’ expectations and decision-making processes, we can develop AI chatbots, such as ChatGPT, that are tailored to human needs, providing reliable and verified health information sources. This approach not only enhances health care accessibility but also improves health literacy and awareness. As the field of AI chatbots in health care continues to evolve, future research should explore the long-term effects of using AI chatbots for self-diagnosis and investigate their potential integration with other digital health interventions to optimize patient care and outcomes. In doing so, we can ensure that AI chatbots, including ChatGPT, are designed and implemented to safeguard users’ well-being and support positive health outcomes in health care settings. %M 37195756 %R 10.2196/47564 %U https://humanfactors.jmir.org/2023/1/e47564 %U https://doi.org/10.2196/47564 %U http://www.ncbi.nlm.nih.gov/pubmed/37195756 %0 Journal Article %@ 2373-6658 %I JMIR Publications %V 7 %N %P e48136 %T Impact of ChatGPT on Interdisciplinary Nursing Education and Research %A Miao,Hongyu %A Ahn,Hyochol %+ Florida State University, 98 Varsity Way, Tallahassee, FL, 32306, United States, 1 8506442647, hyochol.ahn@jmir.org %K ChatGPT %K nursing education %K nursing research %K artificial intelligence %K OpenAI %D 2023 %7 24.4.2023 %9 Editorial %J Asian Pac Isl Nurs J %G English %X ChatGPT, a trending artificial intelligence tool developed by OpenAI, was launched in November 2022. The impact of ChatGPT on the nursing and interdisciplinary research ecosystem is profound. 
%M 37093625 %R 10.2196/48136 %U https://apinj.jmir.org/2023/1/e48136 %U https://doi.org/10.2196/48136 %U http://www.ncbi.nlm.nih.gov/pubmed/37093625 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46599 %T Trialling a Large Language Model (ChatGPT) in General Practice With the Applied Knowledge Test: Observational Study Demonstrating Opportunities and Limitations in Primary Care %A Thirunavukarasu,Arun James %A Hassan,Refaat %A Mahmood,Shathar %A Sanghera,Rohan %A Barzangi,Kara %A El Mukashfi,Mohanned %A Shah,Sachin %+ University of Cambridge School of Clinical Medicine, Box 111 Cambridge Biomedical Campus, Cambridge, CB2 0SP, United Kingdom, 44 0 1223 336732 ext 3, ajt205@cantab.ac.uk %K ChatGPT %K large language model %K natural language processing %K decision support techniques %K artificial intelligence %K AI %K deep learning %K primary care %K general practice %K family medicine %K chatbot %D 2023 %7 21.4.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Large language models exhibiting human-level performance in specialized tasks are emerging; examples include Generative Pretrained Transformer 3.5, which underlies the processing of ChatGPT. Rigorous trials are required to understand the capabilities of emerging technology, so that innovation can be directed to benefit patients and practitioners. Objective: Here, we evaluated the strengths and weaknesses of ChatGPT in primary care using the Membership of the Royal College of General Practitioners Applied Knowledge Test (AKT) as a medium. Methods: AKT questions were sourced from a web-based question bank and 2 AKT practice papers. In total, 674 unique AKT questions were inputted to ChatGPT, with the model’s answers recorded and compared to correct answers provided by the Royal College of General Practitioners. Each question was inputted twice in separate ChatGPT sessions, with answers on repeated trials compared to gauge consistency. Subject difficulty was gauged by referring to examiners’ reports from 2018 to 2022. Novel explanations from ChatGPT—defined as information provided that was not inputted within the question or multiple answer choices—were recorded. Performance was analyzed with respect to subject, difficulty, question source, and novel model outputs to explore ChatGPT’s strengths and weaknesses. Results: Average overall performance of ChatGPT was 60.17%, which is below the mean passing mark in the last 2 years (70.42%). Accuracy differed between sources (P=.04 and .06). ChatGPT’s performance varied with subject category (P=.02 and .02), but variation did not correlate with difficulty (Spearman ρ=–0.241 and –0.238; P=.19 and .20). The proclivity of ChatGPT to provide novel explanations did not affect accuracy (P>.99 and .23). Conclusions: Large language models are approaching human expert–level performance, although further development is required to match the performance of qualified primary care physicians in the AKT. Validated high-performance models may serve as assistants or autonomous clinical tools to ameliorate the general practice workforce crisis. 
%M 37083633 %R 10.2196/46599 %U https://mededu.jmir.org/2023/1/e46599 %U https://doi.org/10.2196/46599 %U http://www.ncbi.nlm.nih.gov/pubmed/37083633 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46876 %T ChatGPT in Clinical Toxicology %A Sabry Abdel-Messih,Mary %A Kamel Boulos,Maged N %+ School of Medicine, University of Lisbon, Av Prof Egas Moniz MB, Lisbon, 1649-028, Portugal, 351 92 053 1573, mnkboulos@ieee.org %K ChatGPT %K clinical toxicology %K organophosphates %K artificial intelligence %K AI %K medical education %D 2023 %7 8.3.2023 %9 Letter to the Editor %J JMIR Med Educ %G English %X ChatGPT has recently been shown to pass the United States Medical Licensing Examination (USMLE). We tested ChatGPT (Feb 13, 2023 release) using a typical clinical toxicology case of acute organophosphate poisoning. ChatGPT fared well in answering all of our queries regarding it. %M 36867743 %R 10.2196/46876 %U https://mededu.jmir.org/2023/1/e46876 %U https://doi.org/10.2196/46876 %U http://www.ncbi.nlm.nih.gov/pubmed/36867743 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e46885 %T The Role of ChatGPT, Generative Language Models, and Artificial Intelligence in Medical Education: A Conversation With ChatGPT and a Call for Papers %A Eysenbach,Gunther %+ JMIR Publications, 130 Queens Quay East, Suite 1100-1102, Toronto, ON, M5A 0P6, Canada, 1 416 786 6970, geysenba@gmail.com %K artificial intelligence %K AI %K ChatGPT %K generative language model %K medical education %K interview %K future of education %D 2023 %7 6.3.2023 %9 Editorial %J JMIR Med Educ %G English %X ChatGPT is a generative language model tool launched by OpenAI on November 30, 2022, enabling the public to converse with a machine on a broad range of topics. In January 2023, ChatGPT reached over 100 million users, making it the fastest-growing consumer application to date. This interview with ChatGPT is part 2 of a larger interview with ChatGPT. It provides a snapshot of the current capabilities of ChatGPT and illustrates the vast potential for medical education, research, and practice but also hints at current problems and limitations. In this conversation with Gunther Eysenbach, the founder and publisher of JMIR Publications, ChatGPT generated some ideas on how to use chatbots in medical education. It also illustrated its capabilities to generate a virtual patient simulation and quizzes for medical students; critiqued a simulated doctor-patient communication and attempts to summarize a research article (which turned out to be fabricated); commented on methods to detect machine-generated text to ensure academic integrity; generated a curriculum for health professionals to learn about artificial intelligence (AI); and helped to draft a call for papers for a new theme issue to be launched in JMIR Medical Education on ChatGPT. The conversation also highlighted the importance of proper “prompting.” Although the language generator does make occasional mistakes, it admits these when challenged. The well-known disturbing tendency of large language models to hallucinate became evident when ChatGPT fabricated references. The interview provides a glimpse into the capabilities and limitations of ChatGPT and the future of AI-supported medical education. Due to the impact of this new technology on medical education, JMIR Medical Education is launching a call for papers for a new e-collection and theme issue. 
The initial draft of the call for papers was entirely machine generated by ChatGPT, but will be edited by the human guest editors of the theme issue. %M 36863937 %R 10.2196/46885 %U https://mededu.jmir.org/2023/1/e46885 %U https://doi.org/10.2196/46885 %U http://www.ncbi.nlm.nih.gov/pubmed/36863937 %0 Journal Article %@ 2369-3762 %I JMIR Publications %V 9 %N %P e45312 %T How Does ChatGPT Perform on the United States Medical Licensing Examination (USMLE)? The Implications of Large Language Models for Medical Education and Knowledge Assessment %A Gilson,Aidan %A Safranek,Conrad W %A Huang,Thomas %A Socrates,Vimig %A Chi,Ling %A Taylor,Richard Andrew %A Chartash,David %+ Section for Biomedical Informatics and Data Science, Yale University School of Medicine, 300 George Street, Suite 501, New Haven, CT, 06511, United States, 1 203 737 5379, david.chartash@yale.edu %K natural language processing %K NLP %K MedQA %K generative pre-trained transformer %K GPT %K medical education %K chatbot %K artificial intelligence %K education technology %K ChatGPT %K conversational agent %K machine learning %K USMLE %D 2023 %7 8.2.2023 %9 Original Paper %J JMIR Med Educ %G English %X Background: Chat Generative Pre-trained Transformer (ChatGPT) is a 175-billion-parameter natural language processing model that can generate conversation-style responses to user input. Objective: This study aimed to evaluate the performance of ChatGPT on questions within the scope of the United States Medical Licensing Examination (USMLE) Step 1 and Step 2 exams, as well as to analyze responses for user interpretability. Methods: We used 2 sets of multiple-choice questions to evaluate ChatGPT’s performance, each with questions pertaining to Step 1 and Step 2. The first set was derived from AMBOSS, a commonly used question bank for medical students, which also provides statistics on question difficulty and the performance on an exam relative to the user base. The second set was the National Board of Medical Examiners (NBME) free 120 questions. ChatGPT’s performance was compared to 2 other large language models, GPT-3 and InstructGPT. The text output of each ChatGPT response was evaluated across 3 qualitative metrics: logical justification of the answer selected, presence of information internal to the question, and presence of information external to the question. Results: Of the 4 data sets, AMBOSS-Step1, AMBOSS-Step2, NBME-Free-Step1, and NBME-Free-Step2, ChatGPT achieved accuracies of 44% (44/100), 42% (42/100), 64.4% (56/87), and 57.8% (59/102), respectively. ChatGPT outperformed InstructGPT by 8.15% on average across all data sets, and GPT-3 performed similarly to random chance. The model demonstrated a significant decrease in performance as question difficulty increased (P=.01) within the AMBOSS-Step1 data set. We found that logical justification for ChatGPT’s answer selection was present in 100% of outputs of the NBME data sets. Internal information to the question was present in 96.8% (183/189) of all questions. The presence of information external to the question was 44.5% and 27% lower for incorrect answers relative to correct answers on the NBME-Free-Step1 (P<.001) and NBME-Free-Step2 (P=.001) data sets, respectively. Conclusions: ChatGPT marks a significant improvement in natural language processing models on the tasks of medical question answering. 
By performing above the 60% threshold on the NBME-Free-Step1 data set, we show that the model achieves the equivalent of a passing score for a third-year medical student. Additionally, we highlight ChatGPT’s capacity to provide logic and informational context across the majority of answers. Taken together, these facts make a compelling case for the potential applications of ChatGPT as an interactive medical education tool to support learning. %M 36753318 %R 10.2196/45312 %U https://mededu.jmir.org/2023/1/e45312 %U https://doi.org/10.2196/45312 %U http://www.ncbi.nlm.nih.gov/pubmed/36753318