Research Letter
Abstract
This study demonstrates that GPT-4V outperforms GPT-4 across radiology subspecialties in analyzing 207 cases with 1312 images from the Radiological Society of North America Case Collection.
J Med Internet Res 2024;26:e54948. doi: 10.2196/54948
Introduction
The launch of GPT-4 has generated significant interest in the scientific and medical communities, with the model demonstrating its potential in medicine through notable achievements such as an 83.76% zero-shot accuracy on the United States Medical Licensing Examination (USMLE) [1]. In radiology, applications of GPT have spanned text-based tasks, including answering board examination questions, data mining, and report structuring [2,3]. The recent release of GPT-4’s visual capabilities (GPT-4V) enables the combined analysis of text and visual data [4]. Our study evaluates the diagnostic capabilities of GPT-4V by comparing it to GPT-4 on advanced radiological tasks, benchmarking the potential of this multimodal large language model in medical imaging.
Methods
We sourced 207 cases with 1312 images from the Radiological Society of North America (RSNA) Case Collection (accessible to RSNA members on the RSNA Case Collection website [5]), aiming to cover at least 10 cases for each of the 22 presented subspecialties. The cases within each subspecialty were chosen to present different pathologies. Each case had a varying number of images and was usually labeled for more than 1 subspecialty, so the total number of cases per subspecialty varied between 1 (for “Physics and Basic Science,” no more than 1 case was available) and 43 (for “Gastrointestinal,” 10 cases in this category were chosen, with 33 additional cases from other subspecialties that were also labeled for “Gastrointestinal”).

GPT-4 and GPT-4V were accessed between November 6, 2023, and November 17, 2023. We utilized an application programming interface (API) account, which allowed us to use the models programmatically and ensured a consistent environment for each test. This access level was crucial, as it provided stable and repeatable interactions with the models, unlike the fluctuating conditions that may be experienced with regular account usage. The ground truth was established from the final diagnoses stated in the RSNA case entries. We prompted each model 3 times via the API for the following two tasks: first, the models were asked to identify the diagnosis and 2 differentials (providing the patient history only for GPT-4 or the patient history with images for GPT-4V); second, the models were asked to answer the corresponding multiple-choice questions from the RSNA Case Collection. The GPT-4V assessment used a “chain-of-thought” prompt that guided the model through diagnostic reasoning, in contrast to the text-only assessment of GPT-4.
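To illustrate this programmatic setup, a minimal sketch is shown below, assuming the OpenAI Python SDK (v1) and the model identifiers available during the study period (“gpt-4” and “gpt-4-vision-preview”); the prompt wording, helper functions, and file names are hypothetical and do not reproduce the exact study prompts.

import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def encode_image(path: str) -> str:
    # Return a base64 data URL so a local case image can be passed to the vision model.
    with open(path, "rb") as f:
        return "data:image/png;base64," + base64.b64encode(f.read()).decode()

def ask_model(history: str, image_paths: list[str] | None = None) -> str:
    # Query GPT-4 (history only) or GPT-4V (history plus case images) for a diagnosis.
    prompt = ("Patient history: " + history +
              "\nState the most likely diagnosis and 2 differential diagnoses.")
    if image_paths:  # multimodal call for GPT-4V
        content = [{"type": "text", "text": prompt}] + [
            {"type": "image_url", "image_url": {"url": encode_image(p)}}
            for p in image_paths
        ]
        model = "gpt-4-vision-preview"
    else:  # text-only call for GPT-4
        content = prompt
        model = "gpt-4"
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": content}],
        max_tokens=500,
    )
    return response.choices[0].message.content

# Each case is queried 3 times so that the 2-of-3 agreement rule described below can be applied.
answers = [ask_model("55-year-old patient with acute flank pain", ["case_image_1.png"])
           for _ in range(3)]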
For both tasks, a case was considered correctly diagnosed if the same correct result appeared in at least 2 of the 3 prompts. Cases with no repeated correct diagnosis and cases with only false diagnoses across the 3 prompts were marked as incorrectly diagnosed. Mean accuracies and bootstrapped 95% CIs were calculated, and statistical significance was determined using the McNemar test (P<.001).
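As a minimal sketch of how the 2-of-3 agreement rule, the bootstrapped 95% CIs, and the McNemar test could be computed, the following assumes NumPy and statsmodels; the per-case correctness vectors are hypothetical example data, not the study results.

import numpy as np
from statsmodels.stats.contingency_tables import mcnemar

def majority_correct(prompt_results: list[bool]) -> bool:
    # A case counts as correctly diagnosed if at least 2 of the 3 prompts were correct.
    return sum(prompt_results) >= 2

def bootstrap_ci(correct: np.ndarray, n_boot: int = 10000, seed: int = 0):
    # Percentile bootstrap 95% CI for the mean accuracy over cases.
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(correct), size=(n_boot, len(correct)))
    return tuple(np.percentile(correct[idx].mean(axis=1), [2.5, 97.5]))

# Hypothetical results of the 3 prompts per case (True = correct diagnosis).
gpt4_answers = [[True, False, False], [True, True, False], [False, False, False], [True, True, True]]
gpt4v_answers = [[True, True, False], [True, True, True], [False, True, False], [True, True, True]]

gpt4 = np.array([majority_correct(a) for a in gpt4_answers], dtype=int)
gpt4v = np.array([majority_correct(a) for a in gpt4v_answers], dtype=int)

print("GPT-4 accuracy:", gpt4.mean(), bootstrap_ci(gpt4))
print("GPT-4V accuracy:", gpt4v.mean(), bootstrap_ci(gpt4v))

# Paired comparison of the two models on the same cases with the McNemar test.
table = [[np.sum((gpt4 == 1) & (gpt4v == 1)), np.sum((gpt4 == 1) & (gpt4v == 0))],
         [np.sum((gpt4 == 0) & (gpt4v == 1)), np.sum((gpt4 == 0) & (gpt4v == 0))]]
print("McNemar P value:", mcnemar(table, exact=True).pvalue)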
Results
GPT-4 accurately identified the primary diagnosis in 18% (95% CI 12%-25%) of cases (first task). When including differential diagnoses, this accuracy increased to 28% (95% CI 22%-33%). In contrast, GPT-4V achieved a 27% (95% CI 21%-34%) accuracy rate for the primary diagnosis, which increased to 35% (95% CI 29%-40%) when differential diagnoses were accounted for. When presented with multiple-choice questions that included information about clinical history and presentation (second task), GPT-4 achieved an accuracy of 47% (95% CI 42%-56%), whereas GPT-4V again demonstrated a higher accuracy of 64% (95% CI 59%-72%). The observed difference in performance was statistically significant (P<.001). GPT-4V outperformed GPT-4 across 15 subspecialties, with the sole exception being “Cardiac Imaging.” The accompanying figure summarizes the accuracies across all subspecialties.
Discussion
Our study shows that GPT-4V performs better than GPT-4 in solving complex radiological problems, indicating its potential to detect pathological features in medical images and, thus, a degree of radiological domain knowledge. Because the RSNA Case Collection is aimed at expert-level professional radiologists, this performance highlights the promise of GPT-4V in specialized medical contexts.
However, the use of GPT-4V warrants a cautious approach. At this time, it should be considered, at best, as a supplemental tool to augment—not replace—the comprehensive analyses performed by trained medical professionals.
Extending the initial research by Yang et al [6], our study explores the medical image analysis capabilities of GPT-4V in more complex scenarios and with a wider range of cases. The ongoing development of multimodal models for medical applications, such as Med-Flamingo, signals a growing interest in this area [7].

One challenge is the scarcity of specialized medical data sets. As our study used RSNA member–exclusive cases, it is unlikely that these cases were in GPT-4V’s training data; thus, the risk of data contamination was minimized. However, the images accompanying each case were intended to highlight specific pathologies, and this does not fully replicate clinical practice, where one would have to analyze each image separately to identify potential pathologies, a task that specialized deep learning models would be better suited to perform.
Future efforts should focus on detailed performance comparisons between generalist models (like GPT-4V) and emerging, radiological domain–specialized artificial intelligence diagnostic models to clarify the relevance and applicability of generalist models in clinical practice.
Our results encourage conducting further performance evaluations of multimodal models in different radiologic subdisciplines, as well as using larger data sets, to gain a more holistic understanding of their role in radiology.
Data Availability
The cases analyzed in this study are available from the Radiological Society of North America (RSNA) Case Collection. This repository can be accessed by RSNA members on the RSNA Case Collection website [5], where each case is presented with detailed clinical information, imaging data, questions, multiple-choice answers, and diagnostic conclusions. The data set was utilized under the terms and conditions provided by the RSNA, which permit the use of these cases for educational and research purposes. No additional unpublished data from these cases were utilized in this study. Researchers and readers are encouraged to access the RSNA Case Collection directly for further information.
Conflicts of Interest
None declared.
References
1. Nori H, King N, McKinney SM, Carignan D, Horvitz E. Capabilities of GPT-4 on medical challenge problems. arXiv. Preprint posted online on Apr 12, 2023.
2. Bhayana R, Krishna S, Bleakney RR. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582.
3. Adams LC, Truhn D, Busch F, Kader A, Niehues SM, Makowski MR, et al. Leveraging GPT-4 for post hoc transformation of free-text radiology reports into structured reporting: a multilingual feasibility study. Radiology. 2023;307(4):e230725.
4. GPT-4V(ision) system card. OpenAI. Sep 25, 2023. URL: https://openai.com/research/gpt-4v-system-card [accessed 2023-10-14]
5. RSNA Case Collection. Radiological Society of North America. URL: https://cases.rsna.org/ [accessed 2024-04-24]
6. Yang Z, Li L, Lin K, Wang J, Lin CC, Liu Z, et al. The dawn of LMMs: preliminary explorations with GPT-4V(ision). arXiv. Preprint posted online on Oct 11, 2023.
7. Moor M, Huang Q, Wu S, Yasunaga M, Zakka C, Dalmia Y, et al. Med-Flamingo: a multimodal medical few-shot learner. arXiv. Preprint posted online on Jul 27, 2023.
Abbreviations
API: application programming interface
RSNA: Radiological Society of North America
USMLE: United States Medical Licensing Examination
Edited by G Eysenbach; submitted 28.11.23; peer-reviewed by L Zhu, S Kommireddy, H Younes; comments to author 06.02.24; revised version received 10.02.24; accepted 20.03.24; published 01.05.24.
Copyright © Felix Busch, Tianyu Han, Marcus R Makowski, Daniel Truhn, Keno K Bressem, Lisa Adams. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 01.05.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research, is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.