%0 Journal Article
%@ 1438-8871
%I JMIR Publications
%V 26
%N 
%P e58158
%T Evaluating and Enhancing Large Language Models’ Performance in Domain-Specific Medicine: Development and Usability Study With DocOA
%A Chen,Xi
%A Wang,Li
%A You,MingKe
%A Liu,WeiZhi
%A Fu,Yu
%A Xu,Jie
%A Zhang,Shaoting
%A Chen,Gang
%A Li,Kang
%A Li,Jian
%+ Sports Medicine Center, West China Hospital, Sichuan University, No. 37, Guoxue Alley, Wuhou District, Chengdu, 610041, China, 86 18980601388, lijian_sportsmed@163.com
%K large language model
%K retrieval-augmented generation
%K domain-specific benchmark framework
%K osteoarthritis management
%D 2024
%7 22.7.2024
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: The efficacy of large language models (LLMs) in domain-specific medicine, particularly for managing complex diseases such as osteoarthritis (OA), remains largely unexplored. Objective: This study focused on evaluating and enhancing the clinical capabilities and explainability of LLMs in specific domains, using OA management as a case study. Methods: A domain-specific benchmark framework was developed to evaluate LLMs across a spectrum from domain-specific knowledge to clinical applications in real-world clinical scenarios. DocOA, a specialized LLM designed for OA management integrating retrieval-augmented generation and instructional prompts, was developed. It can identify the clinical evidence upon which its answers are based through retrieval-augmented generation, thereby demonstrating the explainability of those answers. The study compared the performance of GPT-3.5, GPT-4, and DocOA using objective and human evaluations. Results: General LLMs such as GPT-3.5 and GPT-4 were less effective in the specialized domain of OA management, particularly in providing personalized treatment recommendations. However, DocOA showed significant improvements. Conclusions: This study introduces a novel benchmark framework that assesses the domain-specific abilities of LLMs in multiple aspects, highlights the limitations of generalized LLMs in clinical contexts, and demonstrates the potential of tailored approaches for developing domain-specific medical LLMs.
%M 38833165
%R 10.2196/58158
%U https://www.jmir.org/2024/1/e58158
%U https://doi.org/10.2196/58158
%U http://www.ncbi.nlm.nih.gov/pubmed/38833165