Published on 11.12.2024 in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/60063.
EyeGPT for Patient Inquiries and Medical Education: Development and Validation of an Ophthalmology Large Language Model

Original Paper

1School of Optometry, The Hong Kong Polytechnic University, Hong Kong, China

2State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, China

3Department of Ophthalmology, Shanghai General Hospital (Shanghai First People’s Hospital), School of Medicine, Shanghai Jiao Tong University, Shanghai, China

4National Clinical Research Center for Eye Diseases, Shanghai, China

5Research Centre for SHARP Vision (RCSV), The Hong Kong Polytechnic University, Hong Kong, China

6Centre for Eye and Vision Research (CEVR), 17W Hong Kong Science Park, Hong Kong, China

*these authors contributed equally

Corresponding Author:

Danli Shi, MD, PhD

School of Optometry

The Hong Kong Polytechnic University

11 Yuk Choi Road

Hung Hom, KLN

Hong Kong, 999077

China

Phone: 852 27664825

Email: danli.shi@polyu.edu.hk


Background: Large language models (LLMs) have the potential to enhance clinical flow and improve medical education, but they encounter challenges related to specialized knowledge in ophthalmology.

Objective: This study aims to enhance ophthalmic knowledge by refining a general LLM into an ophthalmology-specialized assistant for patient inquiries and medical education.

Methods: We transformed Llama2 into an ophthalmology-specialized LLM, termed EyeGPT, through the following 3 strategies: prompt engineering for role-playing, fine-tuning with publicly available data sets filtered for eye-specific terminology (83,919 samples), and retrieval-augmented generation leveraging a medical database and 14 ophthalmology textbooks. The efficacy of the various EyeGPT variants was evaluated by 4 board-certified ophthalmologists using 120 questions covering diverse categories in both simple and complex question-answering scenarios. The performance of the best EyeGPT model was then compared with that of the unassisted human physician group and the EyeGPT+human group. We proposed 4 metrics for assessment: accuracy, understandability, trustworthiness, and empathy. The proportion of hallucinations was also reported.

Results: The best fine-tuned model significantly outperformed the original Llama2 model at providing informed advice (original: mean 9.30, SD 4.42 vs best fine-tuned: mean 13.79, SD 5.70; P<.001) and mitigating hallucinations (original: 97/120, 80.8% vs best fine-tuned: 53/120, 44.2%; P<.001). Incorporating information retrieval from reliable sources, particularly ophthalmology textbooks, further improved the model's responses compared with the best fine-tuned model alone (best fine-tuned alone: mean 13.08, SD 5.43 vs with book retrieval: mean 15.14, SD 4.64; P=.001) and reduced hallucinations (71/120, 59.2% vs 57/120, 47.4%; P=.02). Subgroup analysis revealed that EyeGPT showed robustness across common diseases, with consistent performance across different users and domains. Among the variants, the model integrating fine-tuning and book retrieval ranked highest, closely followed by the combination of fine-tuning and the manual database, standalone fine-tuning, and pure role-playing methods. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with human ophthalmologists. With the assistance of EyeGPT, the performance of the ophthalmologists was notably enhanced.

Conclusions: We pioneered and introduced EyeGPT by refining a general domain LLM and conducted a comprehensive comparison and evaluation of different strategies to develop an ophthalmology-specific assistant. Our results highlight EyeGPT’s potential to assist ophthalmologists and patients in medical settings.

J Med Internet Res 2024;26:e60063

doi:10.2196/60063




Introduction

Ophthalmic diseases pose significant concerns for public health [1]. However, shortages of professionals and inefficiencies in primary eye care systems often funnel patients into overcrowded tertiary centers. This results in extended wait times and unaddressed postconsultation questions, frequently requiring additional face-to-face appointments [2]. These challenges can be attributed to the limited ophthalmic knowledge among patients and the limited experience in eye care among primary health care providers [3]. Therefore, there is a pressing need to enhance ophthalmic health education for both patients and primary health care providers. However, relying solely on manpower to address these issues presents further challenges, particularly as the rate of population aging continues to outpace the growth rate of ophthalmologists.

Large language models (LLMs) have recently emerged as powerful tools to alleviate these burdens and streamline clinical flow, given their capability to understand and generate human-like text [4]. In ophthalmology, LLMs show promise both in ophthalmic certification exams [5] and in interpreting imaging reports across various linguistic environments [6,7]. However, existing LLMs have several limitations. First, general LLMs struggle to handle specialized ophthalmology knowledge. Previous research has demonstrated the suboptimal performance of ChatGPT in ophthalmology, with only 15.4% of responses graded as completely accurate for vitreoretinal disease [8]. Even with GPT-4, which currently exhibits the greatest capability, nonnegligible instances of misinformation occur, with only 30.6%, 21.5%, and 55.6% of responses about ocular multimodal images considered accurate, highly usable, and harmless, respectively [9]. A critical factor underlying these shortcomings is the model’s insufficient grasp of specialized knowledge, particularly in handling medical abbreviations and jargon within highly specialized domains [5]. Therefore, there is a need to design a dedicated model trained on clinically relevant domain data. Second, it is widely recognized that LLMs occasionally generate inaccurate and misleading statements (hallucinations), which can potentially lead to medical errors. Fine-tuning with professional data can somewhat mitigate hallucinations, but the model can still produce them when faced with unfamiliar input [10]. Therefore, additional solutions are required. Third, there is a noticeable absence of comprehensive evaluations of LLMs in ophthalmology. Although previous studies have explored the ophthalmic question-answering (QA) capabilities of LLMs, the majority have been limited to multiple-choice formats [5,11-13]. A few studies have used open-ended questions to evaluate the performance of LLMs, but they lack detailed categorization of the questions and primarily focus on scattered aspects such as accuracy, comprehensiveness, or safety [14,15]. Consequently, a comprehensive evaluation framework is urgently needed to test ophthalmology-related LLMs and compare their responses with those provided by certified ophthalmologists.

Recognizing this, we aimed to develop an artificial intelligence (AI) assistant, namely EyeGPT, to meet the specific informational needs in ophthalmic clinical and educational scenarios. By leveraging Llama2, a flexible and scalable open-source LLM known for its impressive performance in medicine [16-18], we infused the model with a granular level of ophthalmic expertise through role-playing, fine-tuning, and retrieval-augmented generation (RAG). The resultant model, EyeGPT, was evaluated for its efficacy in patient consultations and medical education. This work provides valuable insights into building and evaluating ophthalmic assistants, paving the way for the next generation of AI-assisted ophthalmic practice.


Methods

Ethical Considerations

The study overview is presented in Figure 1. Our research protocol adhered to the principles of the Helsinki Declaration. The study was approved by the Institutional Review Board of the Hong Kong Polytechnic University (number: HSEARS20240202004). This research involves publicly available data. We ensured that the data were deidentified and all private information was removed. Informed consent was unnecessary as the publicly available data do not contain identifiable information.

Figure 1. Overview of this study. GPT: generative pre-trained transformer; MCQA: multiple-choice question answering; RAG: retrieval-augmented generation; USMLE: United States Medical Licensing Examination.

Development of EyeGPT

Base Model

We used Meta’s Llama2 as the base model in our study, which was trained on 2 trillion tokens from publicly accessible data [19]. Specifically, we used the Llama2-7b-chat model, which was additionally fine-tuned on publicly available instruction data sets and over 1 million human annotations and thus has basic conversational skills [20]. To inject professional ophthalmic knowledge into the model, we conducted experiments successively under the scenarios described in the following paragraphs.

Role-Playing

In generative AI, the engineering technique known as “role-playing” involves directing LLMs to “embody” or “imitate” specific roles for improved results [21]. To enable the LLM to generate more relevant and empathetic responses, we assigned it the role of an “ophthalmologist” and the user the dual roles of a “patient” and “medical student.” This was achieved by giving the following instructions: “Suppose you are an ophthalmologist, and you need to answer the patient’s question with care/student’s question with patience.”
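To make the setup concrete, the sketch below shows one way such a role-playing instruction could be wrapped around a user query using the Llama2 chat template; the helper function and the wrapper tokens around the quoted instruction are illustrative assumptions rather than the released EyeGPT code.

```python
# Illustrative sketch only: wrapping the role-playing instruction quoted above
# in the Llama2 chat template ([INST] ... <<SYS>> ... <</SYS>> ... [/INST]).
# The function name and template details are assumptions, not the EyeGPT source.

def build_roleplay_prompt(user_query: str, user_role: str = "patient") -> str:
    system_msg = (
        "Suppose you are an ophthalmologist, and you need to answer the "
        f"{user_role}'s question with care and patience."
    )
    return f"<s>[INST] <<SYS>>\n{system_msg}\n<</SYS>>\n\n{user_query} [/INST]"

print(build_roleplay_prompt("Why is my vision blurry after dilation?", "patient"))
```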

Fine-Tuning

To inject domain-specific knowledge and make Llama2 more proficient in capturing ophthalmic terminology and logical reasoning, we trained it on domain-specific data sets, including MedAlpaca [22], GenMedGPT-HealthCareMagic [23], MedMCQA [24], and the United States Medical Licensing Examination (USMLE). Processing of the USMLE data followed the method proposed by Jin et al [25]. The data sets were filtered to remove conversations of little practical significance and responses containing errors. We used instruction tuning [26] to align the model with task-specific user objectives, enhance model controllability, and ensure rapid domain-specific adaptation. For data sets initially designed for multiple-choice QA, we automatically added an instruction at the beginning: “Answer the multiple-choice question.” For our specific task, we used eye-related keywords to filter out nonophthalmology data. Multimedia Appendix 1 presents the characteristics of the filtered data sets, and Multimedia Appendix 2 lists the keywords we used.
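As a rough illustration of this filtering and instruction-formatting step, the snippet below uses a small placeholder keyword set and generic field names; the actual keyword list is given in Multimedia Appendix 2, and the real data sets have their own schemas.

```python
# Hypothetical sketch of the eye-related keyword filter and the instruction
# prefix added to multiple-choice QA samples; keywords and field names are
# placeholders (see Multimedia Appendix 2 for the full keyword list).
EYE_KEYWORDS = {"eye", "ocular", "retina", "cornea", "glaucoma", "cataract", "myopia"}

def is_ophthalmic(sample: dict) -> bool:
    text = (sample["question"] + " " + sample["answer"]).lower()
    return any(keyword in text for keyword in EYE_KEYWORDS)

def to_instruction_sample(sample: dict, is_mcqa: bool) -> dict:
    # MCQA items get the fixed instruction described above; dialogue items do not.
    instruction = "Answer the multiple-choice question." if is_mcqa else ""
    return {"instruction": instruction, "input": sample["question"], "output": sample["answer"]}

filtered = [to_instruction_sample(s, is_mcqa=True)
            for s in [{"question": "Which drug lowers intraocular pressure in glaucoma?",
                       "answer": "Timolol"}]
            if is_ophthalmic(s)]
print(filtered)
```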

The final data set comprised 83,919 samples, with 81,919 used for training and 2000 used for validation. We used low-rank adaptation (LoRA) [27] to fine-tune the Llama2-7B model by adding a low-rank matrix while keeping the original parameters frozen, aiming to complement the original weight matrices of the model. The models were fine-tuned using 3 × V100 GPUs with a batch size of 24, a learning rate of 0.00003, a maximum sequence length of 512 tokens, and a warm-up ratio of 0.03. For the LoRA-specific hyperparameters, the rank of the low-rank factorization was 8, the scaling factor for the rank was 16, and the dropout was 0.05. Specifically, we performed 3 types of fine-tuning: Fine-tune 1 (2000 iterations), Fine-tune 2 (3500 iterations), and Fine-tune 3 (10,000 iterations). The entire training process took approximately 11 hours to complete.
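For readers who want to reproduce a comparable setup, the sketch below maps the reported hyperparameters onto the Hugging Face transformers and peft APIs; it is an assumption-laden outline (the training script itself is not published here), and data loading and tokenization are omitted.

```python
# Sketch of a LoRA fine-tuning configuration matching the reported settings;
# not the released EyeGPT training script. Dataset preparation is omitted.
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-2-7b-chat-hf"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base)

lora_cfg = LoraConfig(
    r=8,              # rank of the low-rank factorization
    lora_alpha=16,    # scaling factor for the rank
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)  # LoRA adapters added; base weights stay frozen

args = TrainingArguments(
    output_dir="eyegpt-lora",
    per_device_train_batch_size=8,  # assumed split: 3 V100 GPUs x 8 = batch size 24
    learning_rate=3e-5,
    warmup_ratio=0.03,
    max_steps=10_000,               # Fine-tune 3; 2000 and 3500 for Fine-tune 1 and 2
)
# Sequences would be truncated/padded to 512 tokens during tokenization.
# Trainer(model=model, args=args, train_dataset=..., eval_dataset=...).train()
```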

Retrieval-Augmented Generation

LLMs may produce inaccurate responses (hallucinations) to questions [28], which is unacceptable in the medical field. However, the accuracy of these models could be significantly improved if they generated responses grounded in a reliable knowledge database. Here, to further improve the performance of EyeGPT, we introduced external knowledge corpora consisting of medical books and a manually built database.

For the medical books, we used 14 specialized ophthalmology textbooks that cover a wide range of comprehensive ophthalmic knowledge, including general ophthalmology, optometry, retinal diseases, and more [29-31]. Please refer to Multimedia Appendix 3 for the specific textbook list.

We manually built a database (sample shown in Multimedia Appendix 4) containing information on diseases, symptoms, medical tests and treatment procedures, and potential medications. This database, sourced from the open-access web and research papers, serves as an external and offline knowledge corpus for EyeGPT. It can be continually updated without model retraining and may provide more up-to-date information than textbooks.

To leverage external knowledge, we adopted the LangChain framework’s information retrieval techniques. The “all-MiniLM-L6-v2” [32] open-source embedding model was used to map text into vector space. We used the “RecursiveCharacterTextSplitter” [33] to segment the text for efficient retrieval, with a chunk size set to 1024 characters. Roughly 2 segments are retrieved from the vector storage for each response. In addition, we constructed a retriever with Facebook AI Similarity Search (FAISS) [34] based on the segmented documents and established a conversational retrieval chain that seamlessly integrated our EyeGPT with the external database through LangChain.
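The sketch below outlines how these LangChain components fit together; the document loader, file path, and LLM wrapper are placeholders, and the import paths follow the legacy LangChain package layout.

```python
# Sketch of the retrieval pipeline described above; the loader, file path, and
# LLM wrapper are placeholders, and import paths follow the legacy LangChain layout.
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.vectorstores import FAISS
from langchain.chains import ConversationalRetrievalChain
from langchain.document_loaders import PyPDFLoader

docs = PyPDFLoader("ophthalmology_textbook.pdf").load()    # placeholder source file
splitter = RecursiveCharacterTextSplitter(chunk_size=1024)  # chunk overlap not reported
chunks = splitter.split_documents(docs)

embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = FAISS.from_documents(chunks, embeddings)
retriever = vectorstore.as_retriever(search_kwargs={"k": 2})  # ~2 segments per query

# `eyegpt_llm` stands in for the fine-tuned EyeGPT model wrapped as a LangChain LLM.
# chain = ConversationalRetrievalChain.from_llm(eyegpt_llm, retriever=retriever)
# result = chain({"question": "How is retinal detachment treated?", "chat_history": []})
```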

Evaluation

Overview of the Evaluation

To assess the professional performance of the various EyeGPT variants, namely (1) original (Llama2); (2) role-play (original plus role-play); (3) fine-tune 1-3 (fine-tuned model versions 1-3 plus role-play); (4) role-play+book (role-play plus book retrieval); (5) role-play+database (role-play plus manual database retrieval); (6) best fine-tune+book (the best fine-tuned model plus book retrieval); and (7) best fine-tune+database (the best fine-tuned model plus manual database retrieval), our ophthalmology expert panel curated a set of 120 ophthalmic care-related questions based on their clinical expertise. We followed the user-centered evaluation approach proposed by Abbasian et al [35], considering the following 3 key factors: disease type, character type, and domain type. Disease type covered a wide range of medical conditions from various subspecialties, including common, specialty, and rare diseases, resulting in 12 disease categories such as myopia, retinal detachment, and Stickler syndrome (refer to Multimedia Appendix 5 for the detailed disease list). Character types included patients and medical students, representing potential EyeGPT users. Domain types were divided into 5 topics: disease description, risk factors, diagnosis, treatment and prevention, and prognosis. We conducted the evaluations manually, including an independent evaluation of the different EyeGPT variants, a best-ranked comparison for evaluating human-machine performance, and an error analysis of the machine-generated answers.

Independent Evaluation

This evaluation was designed to compare the performance of the various optimization strategies of the EyeGPT variants and identify the best-performing one. Two board-certified ophthalmologists independently conducted a manual assessment of each variant’s responses using a 5-point scale. The evaluation focused on the following 4 aspects: accuracy, understandability, trustworthiness, and empathy [35]. The detailed grading scale is presented in Multimedia Appendix 6. The scale ranged from 1 (strongly disagree) to 5 (strongly agree), with the average score from the 2 evaluators recorded as the score for each response aspect. The maximum score for each aspect was 5, and these scores were summed to obtain the final score for each response, with a maximum possible score of 20.

To evaluate the effectiveness of different optimization strategies in mitigating hallucinations, we defined answers with accuracy scores below 4 as containing hallucinations in our study. To ensure the evaluators could not identify the source of the responses, all generated responses were formatted as plain text, concealing any model-specific features. These responses were then randomly shuffled and mixed before being presented to the evaluators.
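For clarity, the scoring and hallucination-flagging scheme just described can be summarized as follows; this is a paraphrase of the rubric in code form, with hypothetical ratings, not part of the study software.

```python
# Paraphrase of the scoring scheme in code form (hypothetical ratings):
# two raters score four aspects on a 1-5 scale, scores are averaged per aspect
# and summed (maximum 20), and an accuracy score below 4 flags a hallucination.
ASPECTS = ("accuracy", "understandability", "trustworthiness", "empathy")

def score_response(rater1: dict, rater2: dict) -> dict:
    per_aspect = {a: (rater1[a] + rater2[a]) / 2 for a in ASPECTS}
    return {
        "per_aspect": per_aspect,
        "total": sum(per_aspect.values()),            # range 4-20
        "hallucination": per_aspect["accuracy"] < 4,  # study definition
    }

print(score_response(
    {"accuracy": 4, "understandability": 5, "trustworthiness": 4, "empathy": 5},
    {"accuracy": 3, "understandability": 4, "trustworthiness": 4, "empathy": 5},
))
```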

The evaluation was conducted in 2 rounds with a 1-month washout period to mitigate residual effects [36,37]. In the first round, we compared models using different fine-tuning approaches, including original, role-play, and fine-tune 1-3. The goal was to determine the best fine-tuning model for the subsequent RAG. In the second round, we compared models using different RAG strategies based on the best-performing fine-tuned model selected from the first round. These models included best fine-tune (the best fine-tuned model from round 1), role-play+database, best fine-tune+database, role-play+book, and best fine-tune+book.

Best-Ranked Comparison

After independently evaluating the different EyeGPT variants, we identified the best-performing system. To assess if EyeGPT can match ophthalmologists’ expertise and offer them assistance, we conducted a human-machine best-ranked comparison. This evaluation method, inspired by that of Tu et al [38], aimed to efficiently assess answers comprehensively, reducing the need for assessors to delve into every detail and thereby minimizing subjectivity.

We invited 2 junior ophthalmologists (with 1-3 years of clinical experience) to answer the 120 questions with and without the aid of EyeGPT. The answers from different groups (EyeGPT, unassisted ophthalmologist, and EyeGPT+ophthalmologist) were evaluated by 2 senior ophthalmologists (with over 3 years of clinical experience) who were unaware of the sources, and the presentation order was randomized. Raters were asked to rank the 3 answers based on their clinical judgment across 4 dimensions, without the option of declaring a tie. In cases of disagreement, an ophthalmology expert (with over 10 years of clinical experience) reviewed the case until consensus was reached. The final result was recorded as the proportion of responses from different sources ranked as the best.

Error Analysis

To further investigate the quality of EyeGPT answers and identify areas for improvement, we conducted an error analysis on the best-performing EyeGPT model. The quality of the EyeGPT-generated QA pairs was evaluated by 2 board-certified ophthalmologists based on their expert judgment. The analysis focused on identifying occurrences of unrelated information, factual errors, incomplete information, and faulty logic [39].

Statistical Analysis

Statistical analyses were conducted using R (Version 4.3.1). The Mann-Whitney U test was used to compare the scores of 2 models in the independent evaluation. When creating the bar charts, we compared the performance of the base model (Llama2 or best fine-tune) with the most competitive optimized model in the same round to display statistically significant differences on the chart. The score for each answer in the independent evaluation was the average score from the 2 raters. The scoring categories used in the bar charts were as follows: strongly disagree (1 to <2), disagree (2 to <3), neutral (3 to <4), agree (4 to <5), and strongly agree (5). For subgroup analyses based on different confounding variables, the Kruskal-Wallis test or the Mann-Whitney U test was used, depending on the number of comparison groups. Cohen kappa was calculated to determine the agreement among raters [40]. P values <.05 were considered statistically significant.
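The analyses were performed in R; for illustration only, equivalent tests are available in Python via scipy and scikit-learn, as sketched below with made-up score arrays.

```python
# Illustrative Python equivalents of the tests described above (the study used
# R 4.3.1); the score arrays here are made up for demonstration.
import numpy as np
from scipy.stats import mannwhitneyu, kruskal
from sklearn.metrics import cohen_kappa_score

scores_model_a = np.array([9, 11, 8, 12, 10, 13])    # e.g., original Llama2 totals
scores_model_b = np.array([14, 13, 15, 12, 16, 15])  # e.g., best fine-tune totals

# Two-model comparison in the independent evaluation.
u_stat, p_two_groups = mannwhitneyu(scores_model_a, scores_model_b, alternative="two-sided")

# Three or more subgroups (e.g., common vs specialty vs rare diseases).
h_stat, p_subgroups = kruskal(scores_model_a, scores_model_b,
                              np.array([11, 12, 13, 10, 14, 12]))

# Inter-rater agreement between two evaluators' categorical ratings.
kappa = cohen_kappa_score([5, 4, 3, 4, 5], [5, 4, 4, 4, 5])

print(p_two_groups, p_subgroups, kappa)
```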


Results

Comparative Study of Model Construction Strategies

Overall Performance

In the first round of evaluation, the total scores for the original, role-play, and fine-tune 1-3 models were 9.30, 12.79, 12.95, 12.83, and 13.79, respectively. All optimized models significantly outperformed the original model in accuracy, understandability, trustworthiness, and empathy, with fine-tune 3 performing the best. For the different fine-tuning variants, we observed that, as the number of iterations increased, the evaluation loss on the test data decreased (refer to Multimedia Appendix 7) and the model performance improved. In the subsequent comparison of RAG strategies, the best fine-tune+book model scored the highest, at 15.14, outperforming other strategies, as elaborated in Multimedia Appendix 8. To ensure reliability, we compared the scores of fine-tune 3 (named best fine-tune in round 2) across 2 rounds. We found no statistically significant difference between the scores of the 2 rounds (P=.11). Inter-rater reliability in 2 rounds of independent evaluation was confirmed, with kappa values ranging from 0.611 to 0.872, indicating substantial agreement among raters (Multimedia Appendix 9). For illustrative examples of the varied grades of responses from the independent evaluation, see Multimedia Appendix 10.

Figure 2 demonstrates that more than one-half (accuracy: 67/120, 55.8%; understandability: 74/120, 61.7%; trustworthiness: 75/120, 62.5%; empathy: 74/120, 61.7%) of the responses from the best fine-tune model were considered “good” responses (rated 4 or above) across all 4 dimensions. Compared with the original model (with an 80.8% [97/120] hallucination rate), the role-play and best fine-tune models reduced hallucinations by 30% (36/120) and 36.7% (44/120), respectively. Figure 3 shows that the best fine-tune+book model further increased the proportion of “good” responses to the highest level among all variants. We compared the performance of the best model in round 1 (fine-tune 3) with the most competitive modified model to check for statistically significant differences; the scores and scoring criteria are the same as in Figure 2. Compared with the best fine-tune model, the best fine-tune+database and best fine-tune+book models further reduced hallucinations by 3.3% (4/120) and 11.7% (14/120), respectively.

Figure 2. Performance in terms of (A) accuracy, (B) understandability, (C) trustworthiness, and (D) empathy of the different models in round 1 of the human evaluation, with the percentage of good responses (strongly agree and agree) indicated by the black numbers, the percentage of hallucinations indicated by the blue numbers, and significance determined using Mann-Whitney U tests.
Figure 3. Performance in terms of (A) accuracy, (B) understandability, (C) trustworthiness, and (D) empathy of the different models in round 2 of the human evaluation, with the percentage of good responses (strongly agree and agree) indicated by the black numbers, the percentage of hallucinations indicated by the blue numbers, and significance determined using Mann-Whitney U tests.
Subgroup Analysis

We also performed subgroup analysis to further evaluate the model performance under different confounding factors, including subspecialty questions of varying difficulty levels, questions raised by different characters, and question domains.

Different Subspecialties

Across all RAG strategies, the models scored higher for common diseases than for specialty and rare conditions (Table 1). For common ophthalmic conditions, the RAG models delivered more precise and contextually relevant information. For more specialized conditions like central serous chorioretinopathy, the best fine-tune model provided general information about treatment options, while the RAG models offered more specialized responses concerning laser treatment and photodynamic therapy depending on the specific circumstances. For rare conditions like morning glory syndrome, the best fine-tune model could not generate a valid response because it mistakenly interpreted the term as “bilateral posterior superior temporal arcade spikes,” whereas the RAG models were able to retrieve relevant information from the external knowledge database and generate accurate responses. The best fine-tune model accurately recognized 43% (13/30) of ophthalmic abbreviations. RAG strategies improved this recognition rate, which ranged from 60% (18/30) to 83% (25/30) across the different models.

Table 1. Subgroup analysis of the performance of EyeGPT by subspecialty.
EyeGPT model             | Common diseases, mean (SD)a | Specialty diseases, mean (SD)a | Rare diseases, mean (SD)a | P value
Best fine-tuneb          | 15.79 (4.15)                | 13.11 (5.40)                   | 10.33 (5.31)              | <.001
Role-play+databasec      | 15.28 (5.17)                | 14.18 (5.11)                   | 12.18 (4.88)              | .01
Best fine-tune+databased | 15.45 (4.57)                | 14.29 (4.89)                   | 12.17 (4.58)              | .01
Role-play+booke          | 15.70 (3.94)                | 12.89 (5.32)                   | 14.66 (4.53)              | .02
Best fine-tune+bookf     | 17.24 (3.02)                | 14.08 (5.49)                   | 14.23 (4.42)              | .003

a: Overall response score (the sum of the 4 rating dimensions, with a maximum score of 20 representing the best performance).

b: The fine-tuned model with 10,000 iterations.

c: Role-play plus manual database retrieval.

d: The best fine-tuned model plus manual database retrieval.

e: Role-play plus book retrieval.

f: The best fine-tuned model plus book retrieval.

Different Role-Play Characters

When comparing the influence of the questioner's assumed identity (patient vs medical student) on model performance, responses to patients consistently scored higher than those to medical students (Table 2). This difference reached statistical significance for the best fine-tune and role-play+database models. However, no significant differences were observed with the best fine-tune+database, role-play+book, and best fine-tune+book models, suggesting that these adjusted models can answer both general patient questions and more specialized queries from medical students.

Table 2. Subgroup analysis of the performance of EyeGPT by role-play character.
EyeGPT model             | Patients, mean (SD)a | Medical students, mean (SD)a | P value
Best fine-tuneb          | 13.45 (5.79)         | 10.99 (6.32)                 | .03
Role-play+databasec      | 14.67 (5.07)         | 11.62 (6.55)                 | .03
Best fine-tune+databased | 14.52 (4.99)         | 12.84 (5.34)                 | .08
Role-play+booke          | 14.85 (4.78)         | 12.65 (5.85)                 | .06
Best fine-tune+bookf     | 14.44 (6.13)         | 13.38 (5.83)                 | .07

a: Overall response score (the sum of the 4 rating dimensions, with a maximum score of 20 representing the best performance).

b: The fine-tuned model with 10,000 iterations.

c: Role-play plus manual database retrieval.

d: The best fine-tuned model plus manual database retrieval.

e: Role-play plus book retrieval.

f: The best fine-tuned model plus book retrieval.

Different Domains

In the subgroup analysis of EyeGPT’s performance across different domains, there were no statistically significant differences in the scores of disease description, risk factors, diagnosis, treatment and prevention, and prognosis across all models (Table 3).

Table 3. Subgroup analysis of the performance of EyeGPT by domain.
EyeGPT model             | Disease description, mean (SD)a | Risk factors, mean (SD)a | Diagnosis, mean (SD)a | Treatment and prevention, mean (SD)a | Prognosis, mean (SD)a | P value
Best fine-tuneb          | 12.92 (6.51) | 12.67 (5.38) | 11.81 (6.45) | 13.21 (5.12) | 9.90 (6.60)  | .35
Role-play+databasec      | 12.98 (6.57) | 12.73 (6.00) | 13.73 (5.38) | 13.53 (4.55) | 11.48 (6.26) | .49
Best fine-tune+databased | 15.14 (4.76) | 12.08 (5.64) | 14.70 (2.98) | 12.78 (4.82) | 11.17 (6.10) | .06
Role-play+booke          | 12.19 (6.15) | 13.78 (5.61) | 13.91 (5.92) | 14.60 (3.31) | 13.00 (5.18) | .80
Best fine-tune+bookf     | 11.70 (7.46) | 13.33 (6.53) | 15.15 (4.64) | 13.64 (5.79) | 12.17 (6.14) | .36

a: Overall response score (the sum of the 4 rating dimensions, with a maximum score of 20 representing the best performance).

b: The fine-tuned model with 10,000 iterations.

c: Role-play plus manual database retrieval.

d: The best fine-tuned model plus manual database retrieval.

e: Role-play plus book retrieval.

f: The best fine-tuned model plus book retrieval.

Performance Comparison: AI Model Versus Human Ophthalmologists

In the human-machine best-ranked comparison, EyeGPT showed competitive capabilities, particularly in understandability and empathy. With the assistance of EyeGPT, the human ophthalmologists’ performance was notably improved. Figure 4 summarizes how frequently the answers generated by EyeGPT, unassisted ophthalmologists, or EyeGPT-assisted ophthalmologists were ranked as the best among the 3 candidate answers across the 4 dimensions. Regarding understandability and empathy, the EyeGPT answers were ranked best for 23 (19.2%) and 41 (34.2%) of the 120 questions, respectively, higher than the unassisted ophthalmologists’ answers, which were ranked best for 12 (10%) and 8 (6.7%) of the 120 questions. The answers provided by EyeGPT-assisted ophthalmologists were most frequently ranked as the best, at 85 (70.8%) and 71 (59.2%) of the 120 questions for understandability and empathy, respectively. However, the accuracy and trustworthiness of the EyeGPT answers were slightly lower than those of the unassisted ophthalmologists (accuracy: 12/120, 10% vs 14/120, 11.7%; trustworthiness: 12/120, 10% vs 15/120, 12.5%), highlighting areas for improvement. With the assistance of EyeGPT, the answers provided by the ophthalmologists excelled, ranking highest in accuracy and trustworthiness for 94 (78.3%) and 93 (77.5%) of the 120 questions, respectively. For illustrative examples of the best-ranked comparison, see Multimedia Appendix 11.

Figure 4. Percentage of answers ranked best, by source (EyeGPT, unassisted ophthalmologists, and EyeGPT-assisted ophthalmologists). EyeGPT(best): best fine-tune+book model.

Error Analysis: Areas for Improvement

The results of the error analysis are shown in Multimedia Appendix 12. Rater 1 identified 5 (5/120, 4.2%) QA pairs as containing unrelated information, 35 (35/120, 29.2%) as containing apparent factual errors, 23 (23/120, 19.2%) as having incomplete information, and 6 (6/120, 5%) exhibiting faulty logic. Rater 2 found 6 (6/120, 5%) QA pairs with unrelated information, 30 (30/120, 25%) with factual errors, 22 (22/120, 18.3%) with incomplete information, and 4 (4/120, 3.3%) demonstrating faulty logic. The inter-rater reliability, assessed using kappa values, was 0.905, 0.895, 0.699, and 0.792, respectively.


Discussion

Principal Findings

In this study, we integrated specialized ophthalmic knowledge into a general LLM using role-play, fine-tuning, and RAG methods, resulting in the development of EyeGPT for ophthalmology. In terms of accuracy, understandability, trustworthiness, and empathy, all fine-tuned models showcased remarkable improvements compared with the original model. Among them, the best fine-tune model exhibited the highest efficacy. Among the RAG strategies, the best fine-tune+book model emerged as the most capable. Subgroup analysis revealed that EyeGPT performed well in the category of common diseases and showed consistent performance across different users and domains. EyeGPT demonstrated competitive capabilities in understandability and empathy when compared with a human ophthalmologist. With the assistance of EyeGPT, the performance of the ophthalmologists was notably enhanced.

Comparison With Prior Work

LLMs in health care raise concerns about inaccurate recommendations and fabricated information (hallucinations), which could lead to severe consequences. Previous studies have assessed the QA capabilities of existing LLMs in ophthalmology [11-13,15], highlighting the significance of augmenting LLMs with ophthalmic expertise. Our study achieved this by using 3 optimization methods: role-play, fine-tuning, and RAG. Role-playing helped position EyeGPT as an ophthalmologist, resulting in more professional responses, as evidenced by significantly increased accuracy, understandability, and trustworthiness. By setting the input role as a patient or student, the LLM’s responses tended to be more compassionate and instructive, as reflected in higher empathy scores than those of the original model. “To Cure Sometimes, To Relieve Often, To Comfort Always” is a well-known saying in medicine reminding us that providing care involves not only treating ailments but also offering relief and comfort to patients. Similarly, AI models should also embody empathy when assisting users, underscoring the importance of role-playing in developing medical AI assistants. Fine-tuning with publicly available real-world patient-doctor interactions further enhanced EyeGPT’s knowledge and performance. In addition, we observed that the reduction in evaluation loss on the validation set was consistent with the improvement in the model’s performance as evaluated by ophthalmologists. RAG is another way to make an LLM more knowledgeable and, in particular, to reduce hallucinations. In previous studies, Zakka et al [10] developed Almanac, an LLM framework augmented with retrieval capabilities from curated medical resources for medical guidelines and treatment recommendations. In ophthalmology, Singer et al [41] used verified ophthalmology textbooks as source material, providing citations to address the trustworthiness and accuracy gaps in LLM responses to Ophthalmic Knowledge Assessment Program style queries. In this study, hallucination mitigation was also observed in the models enhanced by the manual database or books, which reduced hallucinations by at least 3.3% compared with the best fine-tune model. Among them, the best fine-tune+book model demonstrated the greatest reduction in hallucinations and outperformed best fine-tune+database in all 4 aspects, which could potentially be attributed to the books surpassing the manually built database in content richness and reference value.

Interestingly, we found no significant difference in performance between the RAG and fine-tuned models. Fine-tuning is a popular approach but has limitations. One limitation is its dependence on specific formats of medical dialogue data, which are scarce and require validation and curation by medical professionals [2]. RAG overcomes these issues by directly leveraging authoritative external resources such as textbooks, medical literature, or professional websites [42]. However, it is important to note that these optimization methods are not mutually exclusive. Our results demonstrated the combined effectiveness of fine-tuning and RAG, with the best-performing EyeGPT model obtained through their integration. Furthermore, the data used for fine-tuning are publicly available and reliable, and the ophthalmology textbooks used for retrieval are also openly accessible, rendering these strategies valuable references for future specialty-specific LLMs.

The health care environment is complex; therefore, it is essential to assess the performance of health care AI models in different scenarios [28]. Current research has primarily focused on evaluating general questions [15], with limited studies on specialty and rare diseases. Our study validated EyeGPT by analyzing its performance across various disease categories, demonstrating strong performance for common diseases but indicating room for improvement for specialty and rare diseases. Future improvements could be achieved by using higher-quality data sets and specialized external knowledge resources and by exploring low-shot or few-shot learning. Additionally, we found that the solely fine-tuned and RAG models were less informative for specialized medical student inquiries than for simpler patient inquiries. The best-performing EyeGPT performed equally well for patient and student inquiries, suggesting that combining fine-tuning and RAG enhances an LLM's capacity to meet the needs of both groups. Importantly, our evaluation set covers a wide range of question categories, from common to rare diseases; user roles encompassing patients and medical students; and domains including disease descriptions, examinations, treatments, and more. By establishing multiple evaluation dimensions, including accuracy, understandability, trustworthiness, empathy, and hallucination, we aimed to provide a comprehensive reference framework for future ophthalmic specialized models.

Despite a growing global ophthalmologist workforce, limited-resource countries face a severe shortage of specialists [43]. EyeGPT has the potential to help address this gap. Although its accuracy and trustworthiness are lower than those of human ophthalmologists, our findings show competitive capabilities in terms of understandability and empathy. This finding aligns with another study demonstrating the potential advantages of LLMs in enhancing efficiency and empathy in outpatient environments [44]. We attribute this to EyeGPT's ability to patiently process large amounts of information and to initiate and conclude conversations with consistent courtesy, unaffected by fatigue, emotions, or other factors. Although this may not be genuine empathy in the human sense, in high-demand scenarios EyeGPT received higher empathy scores than human doctors, who may, at times, provide brief responses or use complex medical jargon owing to their level of medical knowledge, potentially leading to poor understandability and empathy. However, the LLMs' simplified expressions may overlook certain nuanced yet crucial medical information, leading to decreased accuracy and trustworthiness. The error analysis revealed that the main gap lies in factual inaccuracies and incomplete responses, highlighting the need to integrate more ophthalmic knowledge into the model and to combine it with the professional expertise and experience of human doctors for comprehensive decision-making. Although LLMs cannot replace human professionals, they could serve as auxiliary tools to enhance physicians' performance. In our ideal scenario, EyeGPT acts as a continuous, personalized assistant, providing guidance and clarification to patients throughout their care journey, without relying on physical queues or multiple face-to-face interactions with health care personnel. Additionally, EyeGPT can serve as an educational tool for medical students seeking immediate clarification on complex subjects. For example, EyeGPT may help primary care doctors improve their decision-making ability and reduce diagnosis time.

Limitations

Our study has several limitations. First, the current version of the model focuses on augmenting ophthalmic knowledge at the textual level. Future iterations should prioritize enhancing the model's image interpretation capabilities [7,39,45], which are crucial for ophthalmology given its heavy reliance on multimodal imaging. Second, assessing the appropriateness of medical advice may be subjective and biased by grader opinion. Further improvements could be made in the future, for example, by incorporating a broader spectrum of ophthalmic data and real-world feedback from users, including medical students and patients. Last, a safer application at this stage is using LLMs to assist physicians during their face-to-face consultations. This pilot study has initially validated this potential, and forthcoming research should aim to disseminate these findings more widely among the population.

Conclusions

In conclusion, through role-playing, fine-tuning, and RAG, EyeGPT can potentially improve accuracy and efficiency in patient consultation and medical education. It may also be expected to increase access to high-quality medical consultations, especially for patients in underprivileged regions. We hope our study makes a meaningful contribution to the current literature on ophthalmic AI assistants and provides an effective tool for enhancing health care.

Acknowledgments

The study was supported by the Start-up Fund for RAPs under the Strategic Hiring Scheme (P0048623) from Hong Kong Special Administrative Region (HKSAR), the Global STEM Professorship Scheme (P0046113), and Henry G Leong Endowed Professorship in Elderly Vision Health. The sponsors or funding organizations had no role in the design or conduct of this research.

We thank the InnoHK initiative of the HKSAR Government for providing valuable support.

Data Availability

The data sets generated during this study are available in the figshare repository at [46]. EyeGPT is available for use via [47].

Authors' Contributions

DS and XC contributed to conceptualization. DS and WZ developed the methodology. DS, XC, ZZ, PX, YW, MX, LG, and YL were responsible for data curation, formal analysis of the data, and validation. XC, ZZ, and PX wrote the original draft of the manuscript. All authors commented on the manuscript and approved the current version of the manuscript.

Conflicts of Interest

None declared.

Multimedia Appendix 1

Public datasets used in fine-tuning EyeGPT.

PDF File (Adobe PDF File), 93 KB

Multimedia Appendix 2

The specific list of keywords used in the public datasets filtering process.

PDF File (Adobe PDF File), 82 KB

Multimedia Appendix 3

The specific list of textbooks used in knowledge enhancement.

PDF File (Adobe PDF File), 119 KB

Multimedia Appendix 4

Sample of our manual database.

PDF File (Adobe PDF File), 427 KB

Multimedia Appendix 5

Specific diseases of question lists.

PDF File (Adobe PDF File), 97 KB


Multimedia Appendix 7

TensorBoard training logs of Fine-tune 3.

PDF File (Adobe PDF File), 118 KB

Multimedia Appendix 8

Statistical analysis of independent evaluations of 120 questions on the test set along four dimensions.

PDF File (Adobe PDF File), 129 KB

Multimedia Appendix 9

Inter-rater reliability analysis of 120 questions on the test set along four dimensions in independent evaluation.

PDF File (Adobe PDF File), 102 KB

Multimedia Appendix 10

Examples of generated answers with different grades in the independent evaluation.

PDF File (Adobe PDF File), 152 KB

Multimedia Appendix 11

Examples in best-ranked comparison.

PDF File (Adobe PDF File), 131 KB

Multimedia Appendix 12

Error analysis of EyeGPT.

PDF File (Adobe PDF File), 91 KB

  1. Burton MJ, Ramke J, Marques AP, Bourne RRA, Congdon N, Jones I, et al. The Lancet Global Health Commission on Global Eye Health: vision beyond 2020. Lancet Glob Health. Apr 2021;9(4):e489-e551. [FREE Full text] [CrossRef] [Medline]
  2. Betzler BK, Chen H, Cheng C, Lee CS, Ning G, Song SJ, et al. Large language models and their impact in ophthalmology. The Lancet Digital Health. Dec 2023;5(12):e917-e924. [CrossRef]
  3. Liu Y, Swearingen R. Diabetic eye screening: knowledge and perspectives from providers and patients. Curr Diab Rep. Aug 31, 2017;17(10):94. [FREE Full text] [CrossRef] [Medline]
  4. Thirunavukarasu AJ, Ting DSJ, Elangovan K, Gutierrez L, Tan TF, Ting DSW. Large language models in medicine. Nat Med. Aug 2023;29(8):1930-1940. [CrossRef] [Medline]
  5. Antaki F, Touma S, Milad D, El-Khoury J, Duval R. Evaluating the performance of ChatGPT in ophthalmology: an analysis of its successes and shortcomings. Ophthalmol Sci. Dec 2023;3(4):100324. [FREE Full text] [CrossRef] [Medline]
  6. Liu X, Wu J, Shao A, Shen W, Ye P, Wang Y, et al. Uncovering language disparity of ChatGPT on retinal vascular disease classification: cross-sectional study. J Med Internet Res. Jan 22, 2024;26:e51926. [FREE Full text] [CrossRef] [Medline]
  7. Chen X, Zhang W, Xu P, Zhao Z, Zheng Y, Shi D, et al. FFA-GPT: an automated pipeline for fundus fluorescein angiography interpretation and question-answer. NPJ Digit Med. May 03, 2024;7(1):111. [FREE Full text] [CrossRef] [Medline]
  8. Caranfa JT, Bommakanti NK, Young BK, Zhao PY. Accuracy of vitreoretinal disease information from an artificial intelligence chatbot. JAMA Ophthalmol. Sep 01, 2023;141(9):906-907. [CrossRef] [Medline]
  9. Xu P, Chen X, Zhao Z, Shi D. Unveiling the clinical incapabilities: a benchmarking study of GPT-4V(ision) for ophthalmic multimodal image analysis. Br J Ophthalmol. Sep 20, 2024;108(10):1384-1389. [CrossRef] [Medline]
  10. Zakka C, Shad R, Chaurasia A, Dalal AR, Kim JL, Moor M, et al. Almanac - retrieval-augmented language models for clinical medicine. NEJM AI. Feb 25, 2024;1(2):1. [FREE Full text] [CrossRef] [Medline]
  11. Lin JC, Younessi DN, Kurapati SS, Tang OY, Scott IU. Comparison of GPT-3.5, GPT-4, and human user performance on a practice ophthalmology written examination. Eye (Lond). Dec 08, 2023;37(17):3694-3695. [CrossRef] [Medline]
  12. Mihalache A, Popovic MM, Muni RH. Performance of an artificial intelligence chatbot in ophthalmic knowledge assessment. JAMA Ophthalmol. Jun 01, 2023;141(6):589-597. [FREE Full text] [CrossRef] [Medline]
  13. Cai LZ, Shaheen A, Jin A, Fukui R, Yi JS, Yannuzzi N, et al. Performance of generative large language models on ophthalmology board-style questions. Am J Ophthalmol. Oct 2023;254:141-149. [CrossRef] [Medline]
  14. Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. Oct 02, 2023;6(10):e2336997. [FREE Full text] [CrossRef] [Medline]
  15. Pushpanathan K, Lim ZW, Er Yew SM, Chen DZ, Hui'En Lin HA, Lin Goh JH, et al. Popular large language model chatbots' accuracy, comprehensiveness, and self-awareness in answering ocular symptom queries. iScience. Nov 17, 2023;26(11):108163. [FREE Full text] [CrossRef] [Medline]
  16. Ge J, Sun S, Owens J, Galvez V, Gologorskaya O, Lai JC, et al. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology. Nov 01, 2024;80(5):1158-1168. [CrossRef] [Medline]
  17. Civettini I, Zappaterra A, Granelli BM, Rindone G, Aroldi A, Bonfanti S, et al. Evaluating the performance of large language models in haematopoietic stem cell transplantation decision-making. Br J Haematol. Apr 09, 2024;204(4):1523-1528. [FREE Full text] [CrossRef] [Medline]
  18. Sandmann S, Riepenhausen S, Plagwitz L, Varghese J. Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun. Mar 06, 2024;15(1):2050. [FREE Full text] [CrossRef] [Medline]
  19. Touvron H, Martin L, Stone K, Albert P, Almahairi A, Babaei Y. Llama 2: Open foundation and fine-tuned chat models. arXiv. Preprint posted online on July 19. [CrossRef]
  20. Taori R, Gulrajani I, Zhang T, Dubois Y, Li X, Guestrin C. tatsu-lab / stanford_alpaca. GitHub. URL: https://github.com/tatsu-lab/stanford_alpaca [accessed 2024-11-26]
  21. Kong A, Zhao S, Chen H, Li Q, Qin Y, Sun R, et al. Better Zero-Shot Reasoning with Role-Play Prompting. 2024. Presented at: Annual Conference of the North American Chapter of the Association for Computational Linguistics; June 16–21, 2024; Mexico City, Mexico. [CrossRef]
  22. Han T, Adams L, Papaioannou J, Grundmann P, Oberhauser T, Löser A, et al. MedAlpaca -- an open-source collection of medical conversational AI models and training data. arXiv. Preprint posted online on October 4
  23. Li Y, Li Z, Zhang K, Dan R, Jiang S, Zhang Y. ChatDoctor: a medical chat model fine-tuned on a large language model meta-AI (LLaMA) using medical domain knowledge. Cureus. Jun 2023;15(6):e40895. [FREE Full text] [CrossRef] [Medline]
  24. Pal A, Umapathi L, Sankarasubbu M. MedMCQA: a large-scale multi-subject multi-choice dataset for medical domain question answering. Proceedings of Machine Learning Research. 2022;174:248-260. [FREE Full text]
  25. Jin D, Pan E, Oufattole N, Weng W, Fang H, Szolovits P. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences. Jul 12, 2021;11(14):6421. [CrossRef]
  26. Ouyang L, Wu J, Jiang X, Almeida D, Wainwright C, Mishkin P, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems. 2022:35-44. [FREE Full text]
  27. Hu E, Shen Y, Wallis P, Allen-Zhu Z, Li Y, Wang S, et al. LoRA: Low-Rank Adaptation of Large Language Models. 2021. Presented at: The Ninth International Conference on Learning Representations; May 3-7, 2021; Virtual meeting.
  28. Chen X, Xiang J, Lu S, Liu Y, He M, Shi D. Evaluating large language models in medical applications: a survey. arXiv. Preprint posted online on May 13
  29. Basic and Clinical Science Course: 2014-2015. San Francisco, CA. American Academy of Ophthalmology; 2014.
  30. Denniston A, Murray P. Oxford handbook of ophthalmology, third edition. Oxford, United Kingdom. Oxford University Press; 2014.
  31. Ryan S, Sadda S, Hinton D, Schachat A, Sadda S, Wilkinson C, et al, editors. Retina. Philadelphia, PA. Saunders Elsevier; 2013.
  32. Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. 2019. Presented at: Conference on Empirical Methods in Natural Language Processing and 9th International Joint Conference on Natural Language Processing; November 3–7, 2019; Hong Kong, China. [CrossRef]
  33. RecursiveCharacterTextSplitter. URL: https://api.python.langchain.com/en/latest/character/langchain_text_splitters.character.RecursiveCharacterTextSplitter.html [accessed 2024-12-02]
  34. Douze M, Guzhva A, Deng C, Johnson J, Szilvasy G, Mazaré P-E, et al. The faiss library. arXiv. Preprint posted online on Sep 06
  35. Abbasian M, Khatibi E, Azimi I, Oniani D, Shakeri Hossein Abad Z, Thieme A, et al. Foundation metrics for evaluating effectiveness of healthcare conversations powered by generative AI. NPJ Digit Med. Mar 29, 2024;7(1):82. [FREE Full text] [CrossRef] [Medline]
  36. Lim DSW, Makmur A, Zhu L, Zhang W, Cheng AJL, Sia DSY, et al. Improved productivity using deep learning-assisted reporting for lumbar spine MRI. Radiology. Oct 2022;305(1):160-166. [CrossRef] [Medline]
  37. Hu B, Shi Z, Lu L, Miao Z, Wang H, Zhou Z, et al. China Aneurysm AI Project Group. A deep-learning model for intracranial aneurysm detection on CT angiography images in China: a stepwise, multicentre, early-stage clinical validation study. Lancet Digit Health. Apr 2024;6(4):e261-e271. [FREE Full text] [CrossRef] [Medline]
  38. Tu T, Azizi S, Driess D, Schaekermann M, Amin M, Chang P, et al. Towards generalist biomedical AI. NEJM AI. Feb 22, 2024;1(3):1. [CrossRef]
  39. Chen X, Xu P, Li Y, Zhang W, Song F, He M, et al. ChatFFA: An ophthalmic chat system for unified vision-language understanding and question answering for fundus fluorescein angiography. iScience. Jul 19, 2024;27(7):110021. [FREE Full text] [CrossRef] [Medline]
  40. Mandrekar JN. Measures of interrater agreement. Journal of Thoracic Oncology. Jan 2011;6(1):6-7. [CrossRef]
  41. Singer MB, Fu JJ, Chow J, Teng CC. Development and evaluation of Aeyeconsult: a novel ophthalmology chatbot leveraging verified textbook knowledge and GPT-4. J Surg Educ. Mar 2024;81(3):438-443. [CrossRef] [Medline]
  42. Gao Y, Xiong Y, Gao X, Jia K, Pan J, Bi Y, et al. Retrieval-augmented generation for large language models: A survey. arXiv. Preprint posted online on Mar 27. [CrossRef]
  43. Resnikoff S, Lansingh VC, Washburn L, Felch W, Gauthier T, Taylor HR, et al. Estimated number of ophthalmologists worldwide (International Council of Ophthalmology update): will we meet the needs? Br J Ophthalmol. Apr 2020;104(4):588-592. [FREE Full text] [CrossRef] [Medline]
  44. Wan P, Huang Z, Tang W, Nie Y, Pei D, Deng S, et al. Outpatient reception via collaboration between nurses and a large language model: a randomized controlled trial. Nat Med. Oct 2024;30(10):2878-2885. [CrossRef] [Medline]
  45. Chen X, Zhang W, Zhao Z, Xu P, Zheng Y, Shi D, et al. ICGA-GPT: report generation and question answering for indocyanine green angiography images. Br J Ophthalmol. Sep 20, 2024;108(10):1450-1456. [CrossRef] [Medline]
  46. EyeQA: Evaluation Dataset for Ophthalmic Assistant with Large Language Models. Figshare. URL: https://figshare.com/s/cb0525c72ef467d2d809 [accessed 2024-11-26]
  47. EyeGPT. Hugging Face. URL: https://huggingface.co/spaces/spaceis42/EyeGPT [accessed 2024-11-26]


AI: artificial intelligence
FAISS: Facebook AI Similarity Search
LLM: large language model
LoRA: low-rank adaptation
QA: question-answering
RAG: retrieval-augmented generation
USMLE: United States Medical Licensing Examination


Edited by G Eysenbach, A Coristine; submitted 01.05.24; peer-reviewed by K Jin, ML Chee, W Rojas-Carabali, A Saxena, R Scherer; comments to author 22.08.24; revised version received 01.10.24; accepted 01.11.24; published 11.12.24.

Copyright

©Xiaolan Chen, Ziwei Zhao, Weiyi Zhang, Pusheng Xu, Yue Wu, Mingpu Xu, Le Gao, Yinwen Li, Xianwen Shang, Danli Shi, Mingguang He. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 11.12.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.