@Article{info:doi/10.2196/60063,
  author   = "Chen, Xiaolan and Zhao, Ziwei and Zhang, Weiyi and Xu, Pusheng and Wu, Yue and Xu, Mingpu and Gao, Le and Li, Yinwen and Shang, Xianwen and Shi, Danli and He, Mingguang",
  title    = "EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model",
  journal  = "J Med Internet Res",
  year     = "2024",
  month    = "Dec",
  day      = "11",
  volume   = "26",
  pages    = "e60063",
  keywords = "large language model; generative pretrained transformer; generative artificial intelligence; ophthalmology; retrieval-augmented generation; medical assistant; EyeGPT; generative AI",
  abstract = "Background: Large language models (LLMs) have the potential to enhance clinical workflows and improve medical education, but they encounter challenges related to specialized knowledge in ophthalmology. Objective: This study aims to enhance ophthalmic knowledge by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education. Methods: We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. The efficacy of the EyeGPT variants was evaluated by 4 board-certified ophthalmologists using 120 questions spanning diverse categories in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of the unassisted human physician group and the EyeGPT+human group. We proposed 4 metrics for assessment: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported. Results: The best fine-tuned model significantly outperformed the original Llama2 model at providing informed advice (mean 9.30, SD 4.42 vs mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (97/120, 80.8\% vs 53/120, 44.2\%; P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's responses compared with the best fine-tuned model alone (mean 13.08, SD 5.43 vs mean 15.14, SD 4.64; P=.001) and reduced hallucinations (71/120, 59.2\% vs 57/120, 47.4\%; P=.02). Subgroup analysis revealed that EyeGPT showed robustness across common diseases, with consistent performance across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by the combination of fine-tuning and the manual database, standalone fine-tuning, and pure role-playing methods. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, the performance of the ophthalmologists was notably enhanced. Conclusions: We pioneered and introduced EyeGPT by refining a general-domain LLM and conducted a comprehensive comparison and evaluation of different strategies to develop an ophthalmology-specific assistant. Our results highlight EyeGPT's potential to assist ophthalmologists and patients in medical settings.",
  issn     = "1438-8871",
  doi      = "10.2196/60063",
  pmid     = "39661433",
  url      = "https://www.jmir.org/2024/1/e60063"
}