TY  - JOUR
AU  - Chen, Xiaolan
AU  - Zhao, Ziwei
AU  - Zhang, Weiyi
AU  - Xu, Pusheng
AU  - Wu, Yue
AU  - Xu, Mingpu
AU  - Gao, Le
AU  - Li, Yinwen
AU  - Shang, Xianwen
AU  - Shi, Danli
AU  - He, Mingguang
PY  - 2024
DA  - 2024/12/11
TI  - EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model
JO  - J Med Internet Res
SP  - e60063
VL  - 26
KW  - large language model
KW  - generative pretrained transformer
KW  - generative artificial intelligence
KW  - ophthalmology
KW  - retrieval-augmented generation
KW  - medical assistant
KW  - EyeGPT
KW  - generative AI
AB  - Background: Large language models (LLMs) have the potential to enhance clinical flow and improve medical education, but they encounter challenges related to specialized knowledge in ophthalmology. Objective: This study aims to enhance ophthalmic knowledge by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education. Methods: We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. The efficacy of various EyeGPT variants was evaluated by 4 board-certified ophthalmologists through comprehensive use of 120 diverse category questions in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of the unassisted human physician group and the EyeGPT+human group. We proposed 4 metrics for assessment: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported. Results: The best fine-tuned model significantly outperformed the original Llama2 model at providing informed advice (mean 9.30, SD 4.42 vs mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (97/120, 80.8% vs 53/120, 44.2%, P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's response compared with solely the best fine-tuned model (mean 13.08, SD 5.43 vs mean 15.14, SD 4.64; P=.001) and reduced hallucinations (71/120, 59.2% vs 57/120, 47.4%, P=.02). Subgroup analysis revealed that EyeGPT showed robustness across common diseases, with consistent performance across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by the combination of fine-tuning and the manual database, standalone fine-tuning, and pure role-playing methods. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, the performance of the ophthalmologist was notably enhanced. Conclusions: We pioneered and introduced EyeGPT by refining a general domain LLM and conducted a comprehensive comparison and evaluation of different strategies to develop an ophthalmology-specific assistant. Our results highlight EyeGPT’s potential to assist ophthalmologists and patients in medical settings.
SN  - 1438-8871
UR  - https://www.jmir.org/2024/1/e60063
UR  - https://doi.org/10.2196/60063
UR  - http://www.ncbi.nlm.nih.gov/pubmed/39661433
DO  - 10.2196/60063
ID  - info:doi/10.2196/60063
ER  - 