%0 Journal Article
%@ 1438-8871
%I JMIR Publications
%V 26
%P e60063
%T EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model
%A Chen,Xiaolan
%A Zhao,Ziwei
%A Zhang,Weiyi
%A Xu,Pusheng
%A Wu,Yue
%A Xu,Mingpu
%A Gao,Le
%A Li,Yinwen
%A Shang,Xianwen
%A Shi,Danli
%A He,Mingguang
%+ School of Optometry, The Hong Kong Polytechnic University, 11 Yuk Choi Road, Hung Hom, KLN, Hong Kong, 999077, China, 852 27664825, danli.shi@polyu.edu.hk
%K large language model
%K generative pretrained transformer
%K generative artificial intelligence
%K ophthalmology
%K retrieval-augmented generation
%K medical assistant
%K EyeGPT
%K generative AI
%D 2024
%7 11.12.2024
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: Large language models (LLMs) have the potential to enhance clinical workflows and improve medical education, but they face challenges with the specialized knowledge required in ophthalmology. Objective: This study aims to enhance ophthalmic knowledge by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education. Methods: We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. The EyeGPT variants were evaluated by 4 board-certified ophthalmologists using 120 questions spanning diverse categories in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of the unassisted human physician group and the EyeGPT+human group. We proposed 4 metrics for assessment: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported. Results: The best fine-tuned model significantly outperformed the original Llama2 model in providing informed advice (mean 9.30, SD 4.42 vs mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (97/120, 80.8% vs 53/120, 44.2%; P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's responses compared with the best fine-tuned model alone (mean 13.08, SD 5.43 vs mean 15.14, SD 4.64; P=.001) and reduced hallucinations (71/120, 59.2% vs 57/120, 47.5%; P=.02). Subgroup analysis revealed that EyeGPT performed robustly across common diseases, with consistent results across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by fine-tuning combined with the manual database, standalone fine-tuning, and pure role-playing. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, the performance of the ophthalmologists was notably enhanced. Conclusions: We pioneered EyeGPT by refining a general-domain LLM and conducted a comprehensive comparison and evaluation of different strategies to develop an ophthalmology-specific assistant. Our results highlight EyeGPT’s potential to assist ophthalmologists and patients in medical settings.
%M 39661433
%R 10.2196/60063
%U https://www.jmir.org/2024/1/e60063
%U https://doi.org/10.2196/60063
%U http://www.ncbi.nlm.nih.gov/pubmed/39661433