%0 Journal Article %@ 1438-8871 %I JMIR Publications %V 27 %N %P e65547 %T Large Language Models in Summarizing Radiology Report Impressions for Lung Cancer in Chinese: Evaluation Study %A Hu,Danqing %A Zhang,Shanyuan %A Liu,Qing %A Zhu,Xiaofeng %A Liu,Bing %+ Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education), Department of Thoracic Surgery II, Peking University Cancer Hospital and Institute, 52 Fucheng Road, Haidian District, Beijing, 100142, China, 86 15201469501, liubing983811735@126.com %K large language model %K impression summarization %K radiology report %K radiology %K evaluation study %K ChatGPT %K natural language processing %K ultrasound %K radiologist %K thoracic surgeons %D 2025 %7 3.4.2025 %9 Original Paper %J J Med Internet Res %G English %X Background: Large language models (LLMs), such as ChatGPT, have demonstrated impressive capabilities in various natural language processing tasks, particularly in text generation. However, their effectiveness in summarizing radiology report impressions remains uncertain. Objective: This study aims to evaluate the capability of nine LLMs, that is, Tongyi Qianwen, ERNIE Bot, ChatGPT, Bard, Claude, Baichuan, ChatGLM, HuatuoGPT, and ChatGLM-Med, in summarizing Chinese radiology report impressions for lung cancer. Methods: We collected 100 Chinese computed tomography (CT), positron emission tomography (PET)–CT, and ultrasound (US) reports each from Peking University Cancer Hospital and Institute. All these reports were from patients with suspected or confirmed lung cancer. Using these reports, we created zero-shot, one-shot, and three-shot prompts with or without complete example reports as inputs to generate impressions. We used both automatic quantitative evaluation metrics and five human evaluation metrics (completeness, correctness, conciseness, verisimilitude, and replaceability) to assess the generated impressions. 
Two thoracic surgeons (SZ and BL) and one radiologist (QL) compared the generated impressions with reference impressions, scoring them according to the five human evaluation metrics. Results: In the automatic quantitative evaluation, ERNIE Bot, Tongyi Qianwen, and Claude demonstrated the best overall performance in generating impressions for CT, PET-CT, and US reports, respectively. In the human semantic evaluation, ERNIE Bot outperformed the other LLMs in terms of conciseness, verisimilitude, and replaceability on CT impression generation, while its completeness and correctness scores were comparable to those of the other LLMs. Tongyi Qianwen excelled in PET-CT impression generation, with the highest scores for correctness, conciseness, verisimilitude, and replaceability. Claude achieved the best conciseness, verisimilitude, and replaceability scores on US impression generation, and its completeness and correctness scores were close to the best results obtained by the other LLMs. The generated impressions were generally complete and correct but lacked conciseness and verisimilitude. Although one-shot and three-shot prompts improved conciseness and verisimilitude, clinicians noted a significant gap between the generated impressions and those written by radiologists. Conclusions: Current LLMs can produce radiology impressions with high completeness and correctness but fall short in conciseness and verisimilitude, indicating that they cannot yet fully replace impressions written by radiologists. %R 10.2196/65547 %U https://www.jmir.org/2025/1/e65547 %U https://doi.org/10.2196/65547