Review
Abstract
Background: Large language models (LLMs) can understand natural language and generate corresponding text, images, and even videos based on prompts, which holds great potential in medical scenarios. Orthopedics is a significant branch of medicine, and orthopedic diseases contribute to a significant socioeconomic burden, which could be alleviated by the application of LLMs. Several pioneers in orthopedics have conducted research on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies, and a systematic summary of existing research is absent.
Objective: The objective of this review was to comprehensively summarize research findings on the application of LLMs in the field of orthopedics and explore the potential opportunities and challenges.
Methods: PubMed, Embase, and Cochrane Library databases were searched from January 1, 2014, to February 22, 2024, with the language limited to English. The terms, which included variants of “large language model,” “generative artificial intelligence,” “ChatGPT,” and “orthopaedics,” were divided into 2 categories: large language model and orthopedics. After completing the search, the study selection process was conducted according to the inclusion and exclusion criteria. The quality of the included studies was assessed using the revised Cochrane risk-of-bias tool for randomized trials and CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance. Data extraction and synthesis were conducted after the quality assessment.
Results: A total of 68 studies were selected. The application of LLMs in orthopedics involved the fields of clinical practice, education, research, and management. Of these 68 studies, 47 (69%) focused on clinical practice, 12 (18%) addressed orthopedic education, 8 (12%) were related to scientific research, and 1 (1%) pertained to the field of management. Of the 68 studies, only 8 (12%) recruited patients, and only 1 (1%) was a high-quality randomized controlled trial. ChatGPT was the most commonly mentioned LLM tool. There was considerable heterogeneity in the definition, measurement, and evaluation of the LLMs’ performance across the different studies. For diagnostic tasks alone, the accuracy ranged from 55% to 93%. When performing disease classification tasks, ChatGPT with GPT-4’s accuracy ranged from 2% to 100%. With regard to answering questions in orthopedic examinations, the scores ranged from 45% to 73.6% due to differences in models and test selections.
Conclusions: LLMs cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. More high-quality clinical trials are needed in the future, aiming to identify optimal applications of LLMs and advance orthopedics toward higher efficiency and precision.
doi:10.2196/59607
Keywords
Introduction
Background
Large language models (LLMs) typically refer to pretrained language models (PLMs) that have a large number of parameters and are trained on massive amounts of data. In recent years, this area has emerged as one of the most prominent areas of research in artificial intelligence (AI) innovation [
, ]. What makes LLMs different from smaller-scale PLMs is their remarkable emergent abilities to solve complex tasks. Studies have found that LLMs, such as generative pretrained transformer (GPT)-3 with approximately 175 billion parameters, exhibit a significant leap in natural language processing (NLP) capabilities compared to PLMs with fewer parameters, such as GPT-2 with approximately 1.5 billion parameters [ , ]. Generative AI applications developed based on LLMs not only possess the ability to understand natural language but can also generate corresponding text, images, and even videos based on input sources. This human-machine interaction mode holds great potential in medical scenarios.

LLMs have undergone significant advancements in recent years; currently, the most prevalent web-based LLM service is ChatGPT (OpenAI). Launched in November 2022, ChatGPT is a chatbot application developed based on GPT-3.5 or GPT-4 after fine-tuning, and it can quickly respond to questions posed by users. In addition to ChatGPT, applications include Bard (upgraded to Gemini in December 2023) based on Language Model for Dialogue Applications (Google LLC); Med-PaLM 2 (Google LLC); ERNIE Bot (Baidu); and MOSS (Fudan University). GPT-4 can approach or achieve human-level performance in cognitive tasks across various fields, including medical domains [
]. When answering the 2022 United States Medical Licensing Examination questions, without further training or reinforcement, ChatGPT reached or approached a passing level in all 3 examinations [ ]. However, answering examination questions does not directly reflect the performance of LLMs in clinical applications. The value and safety of a chatbot that is already in use are still not fully understood, making clinical research both essential and imperative. Published narrative reviews and editorials have explored the medical applications of LLM technology from 3 perspectives: clinical practice, education, and research [ , - ]. These publications also provide a preliminary assessment of the value and safety of LLMs, offering guidance for exploring their use in specialized medical fields.

Orthopedics is a significant branch of medicine, typically encompassing disciplines such as trauma, spine surgery, joint surgery, sports medicine, hand surgery, and bone oncology. Orthopedic diseases have a broad impact on populations and pose a major global health threat. Low back pain, a common symptom in orthopedics or spine surgery, has been identified as the leading cause of global productivity loss, as measured in years, according to a large-scale epidemiological study covering 195 countries and regions; in 126 countries, low back pain ranks first among the causes of years lived with disability [
]. In traditional health care systems, the annual medical expenditure for low back pain in the United States is estimated to exceed US $100 billion, contributing to a significant socioeconomic burden [ ]. Similarly, osteoarthritis is also a critical global health issue. The global prevalence of knee osteoarthritis in adults aged >40 years is 23%, with approximately 61% of adults aged >45 years showing radiographic evidence of knee osteoarthritis [ ]. Therefore, applying LLMs in orthopedics holds the potential to alleviate the current heavy socioeconomic burden.

It is worth noting that several pioneers in orthopedics have conducted studies on LLMs across various subspecialties to explore their performance in addressing different issues. However, there are currently few reviews and summaries of these studies. The published reviews primarily focus on introducing and popularizing the basic concepts of LLMs in orthopedics [
, ], or they offer forward-looking perspectives by categorizing LLM applications in clinical practice, education, and research [ ]. A systematic summary of existing research is absent. To the best of our knowledge, this review is the first to systematically summarize existing research findings. In contrast to prior works, we place greater emphasis on the quantitative evaluation methods and results of these studies because we believe that these methods and outcomes can help orthopedic and computer science researchers better understand the current state of LLM research and the performance of LLMs. Regarding application categorization, we consider tasks involving NLP in management as another important application area for LLMs in orthopedics. Therefore, this review adds a category for orthopedic management applications to the existing classification framework.

Objectives
The objective of this review was to comprehensively summarize research findings on the application of LLMs in orthopedics; outline their advantages, limitations, and methodological evaluations; and explore the potential opportunities and challenges emerging in this era, thereby facilitating interdisciplinary collaboration and advancement among researchers in computer science and orthopedics. The ultimate goal is to contribute to improved efficiency and quality of orthopedic care as well as a reduction in medical costs and the associated socioeconomic burden.
Methods
Search Strategy
The protocol for this systematic review followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines (checklist can be found in the
) [ ]. PubMed, Embase, and Cochrane Library databases were searched, with the language limited to English. The time frame was set from January 1, 2014, to February 22, 2024. Search terms were divided into 2 categories, with the first category including LLM-related terms and the second containing words related to orthopedics and its subspecialties ( ). Terms within each category were connected using "OR," while terms within different categories were connected using "AND" (a query-construction sketch in Python follows the term lists below). The full search strategy can be found in .

Category 1
- “large language model,” “LLM,” “generative artificial intelligence,” “generative AI,” “ChatGPT,” and “Generative Pre-Trained Transformer”
Category 2
- "orthopedics," "bone," "musculoskeletal," "injury," "wound," "trauma," "articular," "joint," "sports medicine," "hand surgery," "spine," "spinal," "cervical vertebrae," "thoracic vertebrae," "lumbar vertebrae," "sacrum," "coccyx," "spinal canal," "vertebral body," and "intervertebral disc"
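To make the Boolean logic concrete, the following minimal sketch assembles a query of the form (term OR term ...) AND (term OR term ...) from the two term categories listed above. The quoting and field syntax are illustrative assumptions only; the exact strategy used in each database is given in the supplementary search file.

```python
# Minimal sketch: combine the two term categories into a single Boolean query.
# Terms within a category are joined with OR; the two categories are joined with AND.

llm_terms = [
    "large language model", "LLM", "generative artificial intelligence",
    "generative AI", "ChatGPT", "Generative Pre-Trained Transformer",
]

ortho_terms = [
    "orthopedics", "bone", "musculoskeletal", "injury", "wound", "trauma",
    "articular", "joint", "sports medicine", "hand surgery", "spine", "spinal",
    "cervical vertebrae", "thoracic vertebrae", "lumbar vertebrae", "sacrum",
    "coccyx", "spinal canal", "vertebral body", "intervertebral disc",
]

def or_block(terms: list[str]) -> str:
    """Join quoted terms with OR and wrap them in parentheses."""
    return "(" + " OR ".join(f'"{t}"' for t in terms) + ")"

# Full query: category 1 AND category 2
query = f"{or_block(llm_terms)} AND {or_block(ortho_terms)}"
print(query)
```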
Study Selection
The records were downloaded from the databases and imported into EndNote (version 21.2; Clarivate) for article management. The study selection process was conducted independently by 2 investigators (CZ and SL). The inclusion and exclusion criteria are listed in
. The results were cross-checked, and discrepancies were resolved through discussion, with the final determination made by a third investigator (YT).

Inclusion criteria
- Article type
- Original research
- Language
- Articles written in English
- Content
- Studies that use at least 1 large language model (LLM)
- Studies that are relevant to the field of orthopedics
Exclusion criteria
- Article type
- Reviews, editorials, letters, and study protocols
- Language
- Articles written in a language other than English
- Content
- Studies that do not involve LLMs
- Studies that use LLMs for tasks such as code generation, debugging, or text generation without any performance evaluation of the model
Quality Assessment of Studies
Quality assessment was conducted by 2 investigators (CZ and SL) independently. First, the study designs were identified. Studies that involved only posing questions to LLMs, did not recruit participants, and did not report a study design were classified as surveys. Given the diverse nature of the survey types included in the review, quality assessments were conducted only for studies that recruited participants. The revised Cochrane risk-of-bias tool for randomized trials [
] was used to assess randomized controlled trials (RCTs), and the CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance [ ] was used to evaluate prospective or retrospective observational studies. The revised Cochrane risk-of-bias tool (version of August 22, 2019) is designed for assessing RCTs and contains 5 domains: bias arising from the randomization process, bias due to deviations from the intended interventions, bias due to missing outcome data, bias in measurement of the outcome, and bias in selection of the reported result. The CONSORT-AI guidance is a new reporting guideline specifically designed for clinical trials that assess interventions with an AI component. The quality assessment domains under this guidance include a statement of the AI algorithm used, details of how the AI intervention fits within the clinical pathway, inclusion and exclusion criteria for input data, a description of the approaches used to handle unavailable input data, a description of the input data acquisition process for the AI intervention, specifications of human-AI interaction in the collection of input data, the output of the AI algorithm, and explanations of how the AI intervention's outputs contribute to health behavior changes. The results were cross-checked, and discrepancies were resolved through discussion, with the final determination made by another investigator (YT).

Data Extraction and Synthesis
The studies were categorized into 4 groups based on their application areas: clinical practice, education, research, and management. Data extraction and synthesis were conducted by 2 investigators (CZ and SL) independently. In addition to general characteristics, the composition of extracted data varied depending on the specific category. Details of the data extraction strategy can be found in
. In cases where there were inconsistencies in the process, a third investigator (XZ) participated in the discussion and made the final decision. For studies with high heterogeneity, we did not synthesize the parameters for model performance evaluation and instead focused on providing a descriptive analysis of the data. Microsoft Excel 2019 was used for data collection, analysis, and visualization.

Results
Characteristics of Included Studies
A total of 829 studies were identified; after removing duplicates and screening, 68 (8.2%) studies were selected for the literature review. The inclusion process is shown in
.

The application of LLMs in orthopedics involves the fields of clinical practice, education, research, and management. Of the 68 included studies, 47 (69%) focused on clinical practice (
) [ - ], 12 (18%) addressed orthopedic education ( ) [ - ], 8 (12%) were related to scientific research ( ) [ - ], and 1 (1%) pertained to the field of management ( ) [ ]. Of the 68 studies, 55 (81%) were classified as surveys; furthermore, only 8 (12%) recruited patients, only 1 (1%) was a high-quality study (RCT), and only 1 (1%) was a prospective study. Since June 2023, research on the application of LLMs in orthopedics has increased month by month ( ).

Study, year | Study design | Task | LLMa tools | Main evaluation metrics for model performance and their values | Enrolled participants, n | Subjective or objective assessment of the model's performance
Agharia et al [ ], 2024 | Survey | Formulate clinical decisions | GPT-3.5; GPT-4; Bard | Proportion of most popular response: 68% (GPT-4); 40.2% (GPT-3.5); 45.4% (Bard) | —b | Subjective
Anastasio et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratio of responses in different quality grades: bottom-tier rating 4.5%; middle-tier rating 27.3%; top-tier rating 68.2% | — | Subjective
Baker et al [ ], 2024 | RCTc | Assist with writing patient histories | GPT-4 | Mean time: 69.8 (SD 26.2) s; mean word count: 135.8 (SD 40.3); mean PDQI-9d score: 35.6 (SD 3.1); mean overall rating: 3.8 (SD 0.6); ratio of erroneous documents: 36% | 11 | Subjective
Christy et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratio of appropriate responses in total responses: 78%; intraclass correlation coefficient: 0.12 | — | Subjective
Coraci et al [ ], 2023 | Cross-sectional study | Create questionnaire for assessment | GPT-3.5 | Correlation: acceptable correlation with ODIe and QBPDSf; no statistical correlation with RMDQg or NRSh | 20 | Subjective
Crook et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3 | DISCERN score: 58; JAMAi benchmark score: 0/4; FREj score: 34; FKGLk score: 15 | — | Subjective
Daher et al [ ], 2023 | Prospective study | Diagnose and manage patients | GPT-3 | Accuracy of diagnosis: 93%; accuracy of management: 83% | 29 | Objective
Decker et al [ ], 2023 | Cross-sectional study | Generate informed consent documentation | GPT-3.5 | Mean readability, accuracy, and completeness scores (surgeons vs LLMs): readability=15.7 vs 12.9; risks=1.7 vs 1.7; benefits=1.4 vs 2.3; alternatives=1.4 vs 2.7; overall impression=1.9 vs 2.3; composite: 1.6 vs 2.2 | — | Subjective
Draschl et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | 5-point Likert scores, mean: completeness=3.80 (SD 0.63); misleading=4.04 (SD 0.67); errors=4.14 (SD 0.58); up-to-dateness=3.90 (SD 0.45); suitability for patients=3.69 (SD 0.64); suitability for surgeons=3.63 (SD 0.95) | — | Subjective
Dubin et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3 | 25% of the questions were similar when performing a Google web search and a search of ChatGPT for all search terms; 75% of the questions were answered by government websites; 55% of the answers were different between Google web search and ChatGPT in terms of numerical questions | — | Subjective
Duey et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | Accuracy: 33% (GPT-3.5); 92% (GPT-4) | — | Subjective
Fabijan et al [ ], 2023 | Cross-sectional study | Classify cases of single-curve scoliosis | GPT-4; Microsoft Bing with GPTl; Scholar AI Premium | GPT-4 and Scholar AI Premium excelled in classifying single-curve scoliosis with perfect sensitivity (100%) and specificity (100%) | 56 | Objective
Fahy et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | GPT-3.5 vs GPT-4 mean DISCERN score: 55.4 vs 62.09; mean reading grade level score: 18.08 vs 17.90 | — | Subjective
Gianola et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Internal consistency: 49%; accuracy: 33% | — | Subjective
Hurley et al [ ], 2024 | Survey | Generate answers to clinical questions | ChatGPT | DISCERN score: 60; JAMA benchmark score: 0; FRE score: 26.2; FKGL score: considered to be that of a college graduate | — | Subjective
Johns et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Satisfaction rate: 60% | — | Subjective
Johns et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | DISCERN score: 41; FKGL score: 13.4; satisfaction rate: 40% | — | Subjective
Kaarre et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-4 | Average correctness of responses for patients and physicians: 1.69 and 1.66, respectively (on a scale ranging from 0=incorrect, 1=partially correct, and 2=correct) | — | Subjective
Kasthuri et al [ ], 2024 | Survey | Generate answers to clinical questions | Microsoft Bing with GPT-4 | Mean completeness score: 2.03; mean accuracy score: 4.49 | — | Subjective
Kienzle et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Mean DISCERN score in overall quality: 3.675 | — | Subjective
Kirchner et al [ ], 2023 | Survey | Rewrite patient education materials | GPT-3.5 | Mean FKGL score in patient education materials related to herniated lumbar disk, scoliosis, stenosis, TKAm, and THAn: before rewrite=9.5, 12.6, 10.9, 12.0, and 6.3, respectively; after rewrite=5.0, 5.6, 6.9, 11.6, and 6.1, respectively | — | Subjective
Kuroiwa et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | Ratios of correct answers: 25/25, 1/25, 24/25, 16/25, and 17/25 for carpal tunnel syndrome, cervical myelopathy, lumbar spinal stenosis, knee osteoarthritis, and hip osteoarthritis, respectively | — | Objective
Li et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-4 | Mean accuracy score (out of 5): 4.3; mean completeness score (out of 3): 2.8 | — | Subjective
Li et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | 1 response was excellent, requiring no clarification; 4 responses were satisfactory, requiring minimal clarification; 3 responses were satisfactory, requiring moderate clarification; 2 responses were unsatisfactory | — | Subjective
Lower et al [ ], 2023 | Survey | Deliver safe and coherent medical advice | GPT-4 | Mean Likert scale score: 3.2 | — | Subjective
Magruder et al [ ], 2024 | Survey | Generate answers to clinical questions | ChatGPT | Answer grades (from 1 to 5), mean: relevance=4.43 (SD 0.77); clarity=4.22 (SD 0.86); accuracy=4.10 (SD 0.90); evidence based=3.92 (SD 1.01); completeness=3.91 (SD 0.88); consistency=3.54 (SD 1.10) | — | Subjective
Mika et al [ ], 2023 | Survey | Generate answers to clinical questions | GPT-3.5 | 2 responses were excellent, requiring no clarification; 4 responses were satisfactory, requiring minimal clarification; 3 responses were satisfactory, requiring moderate clarification; 1 response was unsatisfactory | — | Subjective
Pagano et al [ ], 2023 | Retrospective observational study | Formulate diagnosis and potential treatment suggestions | GPT-4 | Diagnostic accuracy: 100% for the total cases; concordance in therapeutic recommendations: 83% for the total cases | 100 | Objective
Mejia et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5; GPT-4 | Accuracy: 52% (GPT-3.5); 59% (GPT-4); overconclusiveness: 48% (GPT-3.5); 45% (GPT-4) | — | Subjective
Russe et al [ ], 2023 | Retrospective observational study | Provide accurate fracture classification based on radiology reports | FraCChat; GPT-3.5-Turbo; GPT-4 | Accuracy: GPT 3.5=3%; GPT 4=2%; FraCChat 3.5=48%; FraCChat 4=71% | — | Objective
Schonfeld et al [ ], 2024 | Retrospective cohort study | Predict outcome of adult spinal deformities | Gatortron | AUCo scores: 0.565 (pulmonary complication); 0.559 (neurological complication); 0.557 (sepsis); 0.508 (delirium); F1-scores: 0.545 (pulmonary complication); 0.250 (neurological complication); 0.383 (sepsis); 0.156 (delirium) | 209 | Objective
Seth et al [ ], 2023 | Survey | Generate answers to clinical questions | ChatGPT | Mean Likert scale score: 3.1 | — | Subjective
Shrestha et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Accuracy: 44%-65% for different guideline variations | — | Subjective
Sosa et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-4; Bard; Bing AI | Ratios of appropriate answers to questions related to bone physiology: 83.3% (GPT-4); 23.3% (Bing AI); 16.7% (Bard) | — | Subjective
Stroop et al [ ], 2023 | Survey | Generate answers to clinical questions | ChatGPT | Ratio of medically complete correct answers: 52%; ratio of medically complete and comprehensive answers: 55% | — | Subjective
Suthar et al [ ], 2023 | Retrospective observational study | Generate diagnosis | GPT-4 | Accuracy rate in spine cases: 55% | — | Objective
Taylor et al [ ], 2024 | Survey | Generate answers to clinical questions | ChatGPT | Ratio of surgeons who reported that the questions had been appropriately answered: 91% | — | Subjective
Temel et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Ensuring Quality Information for Patients score: mean 43.02 (SD 6.37); FRE score: mean 26.24 (SD 13.81); FKGL score: mean 14.84 (SD 1.79) | — | Subjective
Tharakan et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3 | Answers provided by ChatGPT cited more academic references than those provided by a Google search (80% vs 50%) | — | Subjective
Truhn et al [ ], 2023 | Retrospective observational study | Prioritize treatment recommendations | GPT-4 | The overall quality of the treatment recommendations was rated as good or better | 20 | Subjective
Warren et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Answers to fact, policy, and value questions (mean scores): DISCERN=51, 53, and 55, respectively; JAMA benchmark=0, 0, and 0, respectively; FRE=48.3, 42.0, and 38.4, respectively; FKGL=10.3, 10.9, and 11.6, respectively | — | Subjective
Wilhelm et al [ ], 2023 | Survey | Generate treatment recommendations | Claude-instant-v1.0; GPT 3.5-Turbo; Command-xlarge-nightly; Bloomz | Mean DISCERN quality scores: 3.4 (Claude-instant-v1.0); 2.8 (GPT 3.5-Turbo); 2.2 (Command-xlarge-nightly); 1.1 (Bloomz) | — | Subjective
Wright et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-3.5 | Mean accuracy score: 4.26; mean comprehensiveness score: 3.79 | — | Subjective
Yang et al [ ], 2024 | Retrospective observational study | Generate diagnosis | GPT-3.5 | Accuracy: 0.87; sensitivity: 0.99; specificity: 0.73 | 1366 | Objective
Yang et al [ ], 2024 | Survey | Generate answers to clinical questions | ChatGPT; Bard | Concordance with the AAOSp Clinical Practice Guidelines: 80% (ChatGPT); 60% (Bard) | — | Subjective
Yapar et al [ ], 2024 | Survey | Generate answers to clinical questions | GPT-4 | Accuracy: 79.8%; applicability: 75.2%; comprehensiveness: 70.6%; communication clarity: 75.6% | — | Subjective
Zhou et al [ ], 2024 | Case study | Generate answers to clinical questions related to the case | GPT-3.5 | No statistical results | — | Subjective
aLLM: large language model.
bNot applicable.
cRCT: randomized controlled trial.
dPDQI-9: Physician Documentation Quality Instrument-9.
eODI: Oswestry Disability Index.
fQBPDS: Quebec Back Pain Disability Scale.
gRMDQ: Roland-Morris Disability Questionnaire.
hNRS: numerical rating scale.
iJAMA: Journal of the American Medical Association.
jFRE: Flesch reading ease.
kFKGL: Flesch-Kincaid grade level.
lGPT: generative pretrained transformer.
mTKA: total knee arthroplasty.
nTHA: total hip arthroplasty.
oAUC: area under the curve.
pAAOS: American Academy of Orthopaedic Surgeons.
Study, year | Study design | Task | LLMa tools | Source | Questions, n | Scores or accuracy (%) |
Cuthbert and Simpson [ ], 2023 | Survey | Examination | GPT-3.5 | UKITEb | 134 | 35.8
Ghanem et al [ ], 2023 | Survey | Examination | GPT-4 | OITEc | 201 | 61.2
Han et al [ ], 2024 | Survey | Examination | GPT-3.5 | ASSHd | 1583 | 36.2
Hofmann et al [ ], 2023 | Survey | Examination | GPT-3.5; GPT-4 | OITE | 410 (GPT-3.5); 396 (GPT-4) | GPT-3.5: 46.3; GPT-4: 63.4
Jain et al [ ], 2024 | Survey | Examination | GPT-3.5 | OITE | 360 | 52.8
Kung et al [ ], 2023 | Survey | Examination | GPT-3.5; GPT-4 | OITE | 360 | GPT-3.5: 54.3; GPT-4: 73.6
Lum [ ], 2023 | Survey | Examination | GPT-3.5 | OITE | 207 | 47
Massey et al [ ], 2023 | Survey | Examination | GPT-3.5; GPT-4 | ResStudy Orthopaedic Examination Question Bank | 180 | GPT-3.5: 29.4; GPT-4: 47.2
Ozdag et al [ ], 2023 | Survey | Examination | GPT-3.5 | OITE | 102 | 45
Rizzo et al [ ], 2024 | Survey | Examination | GPT-3.5-Turbo; GPT-4 | OITE | 2022: 207; 2021: 213; 2020: 215 | 2022: GPT-4=67.63; GPT 3.5-Turbo=50.24; 2021: GPT-4=58.69; GPT 3.5-Turbo=47.42; 2020: GPT-4=59.53; GPT 3.5-Turbo=46.51
Saad et al [ ], 2023 | Survey | Examination | GPT-4 | Mock FRCS Orthe Part A | 240 | 67.5
Traoré et al [ ], 2023 | Survey | Examination | GPT-3.5 | EBHSf diploma examination | 18 | 0
aLLM: large language model.
bUKITE: United Kingdom and Ireland In-Training Examination.
cOITE: Orthopaedic Surgery In-Training Examination.
dASSH: American Society for Surgery of the Hand.
eFRCS Orth: Orthopaedic Fellow of the Royal College of Surgeons.
fEBHS: European Board of Hand Surgery.
Study, year | Study design | Task | LLMa tools | Input | Key findings |
Gill et al [ ], 2024 | Survey | Improve readability | GPT-3.5 | IRBb-approved orthopedic surgery research consent forms | ChatGPT can significantly improve the readability of orthopedic clinical research consent forms; 63.2% of the post-ChatGPT consent forms had at least 1 error
Hakam et al [ ], 2024 | Survey | AIc-generated scientific literature | GPT-3.4; You.com | Five abstracts about meniscal injuries | The AI-generated texts could not be successfully identified
Kacena et al [ ], 2024 | Survey | Write scientific review articles | GPT-4 | Prompts | AI reduced the time for writing but had significant inaccuracies
Lawrence et al [ ], 2024 | Survey | Generate abstract | GPT-3 | A standard set of input commands | Interrater reliability for abstract quality scores was moderate
Lotz et al [ ], 2023 | Survey | Assist new research hypothesis exploration | Toolkit based on GPT-3.5 | Prior studies | LLMs may be useful for analyzing and distinguishing publications, as well as determining the degree to which the literature supports or contradicts emergent hypotheses
Methnani et al [ ], 2023 | Survey | Calculate sample size | GPT-3.5 | All necessary data, such as mean, percentage SD, normal deviations, and study design | In 1 (25%) of the 4 trials, the sample size was correctly calculated
Nazzal et al [ ], 2024 | Survey | Write a review article | GPT-4 | Prompts | The AI-only paper was the most inaccurate, with inappropriate reference use, and the AI-assisted paper had the greatest incidence of plagiarism
Sanii et al [ ], 2023 | Survey | Perform an orthopedic surgery literature review | GPT-3; Perplexity | Standard prompts | The current iteration of ChatGPT cannot perform a reliable literature review, and Perplexity is only able to perform a limited review of the medical literature
Zaidat et al [ ], 2023 | Retrospective cohort study | Predict CPTd codes | GPT-4 | Surgical operative notes | The AUROCe score was 0.87, and the AUPRCf score was 0.67
aLLM: large language model.
bIRB: institutional review board.
cAI: artificial intelligence.
dCPT: current procedural terminology.
eAUROC: area under the receiver operating characteristic curve.
fAUPRC: area under the precision-recall curve.
Quality Assessment of Studies
We conducted quality assessments for the 8 studies that recruited participants (
and ). The RCT study was evaluated using the revised Cochrane risk-of-bias tool for randomized trials, and it was found to have a low risk of bias in all 5 domains—bias arising from the randomization process, bias due to deviations from the intended interventions, bias due to missing outcome data, bias in measurement of the outcome, and bias in selection of the reported result—indicating high study quality. The remaining studies (7/8, 88%; observational studies) were evaluated using the CONSORT-AI guidance and were found to be of good quality.

Distribution of LLM Tools
Among all the LLM tools applied, ChatGPT was the most commonly mentioned. Other LLM tools included Bard, Microsoft Bing, Scholar AI, Perplexity, Gatortron, Claude, Command-xlarge-nightly, and Bloomz, as well as software developed by researchers based on commonly used LLM kernels. Currently, there are 2 main versions of ChatGPT available: GPT-3 (including GPT-3.5 and GPT-3.5-Turbo) and GPT-4. Most of the studies (48/68, 71%) specified the version of the tool used. The majority of the studies (25/48, 52%) only used GPT-3 or 3.5, likely due to publication lag, as these studies were conducted before the release of GPT-4. Given that GPT-4 outperforms GPT-3 in most tasks, future research should primarily use GPT-4.
Model Performance Evaluation
As shown in
- , there is considerable heterogeneity in the definition, measurement, and evaluation of LLM performance across the included studies. Currently, there is no unified research paradigm for the application of LLMs in medicine. Therefore, this review focused on different model performance evaluation metrics according to various application categories. For clinical applications of LLMs, we were particularly concerned with the accuracy of model reasoning and the readability of the generated text; unfortunately, the majority of the studies (39/47, 83%) relied on subjective assessments of the model's performance. In studies with objective evaluations, the heterogeneity in the subtasks performed by the LLMs (including diagnosis, classification, clinical case analysis, and case text generation) prevented us from pooling the data. For diagnostic tasks alone, the accuracy ranged from 55% to 93% [ , ]. When performing disease classification tasks, GPT-4's accuracy ranged from 2% to 100% [ , ] ( ). In studies on readability, the most commonly used metrics are the Flesch Reading Ease and the Flesch-Kincaid Grade Level (FKGL) scores. The FKGL metric correlates reading difficulty with years of education, providing a straightforward reflection of the readability of generated materials. In these studies, the generated texts had FKGL scores ranging from a minimum of 5.0 [ ], indicating primary school reading difficulty, to the maximum required years of education [ ], showing significant variability. This variability is likely due to differences in the research questions, methodologies, prompts, and evaluators. In the educational applications (eg, answering questions from examination papers), the most frequently used test source (7/12, 58%) was the Orthopaedic Surgery In-Training Examination (OITE). The test scores are widely recognized as performance evaluation metrics for the models, with final scores ranging from 45% to 73.6% due to differences in models and test selections ( ). For applications of LLMs in research and management, the flexible and varied nature of the tasks led to substantial differences in performance measurement and evaluation. Therefore, we collected the model inputs and major findings for a descriptive presentation ( ).
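For orientation, the two Flesch measures referenced above are simple functions of average sentence length and average syllables per word. The sketch below implements the standard published formulas; the example counts are hypothetical, and the readability studies summarized here typically obtain the counts from dedicated software rather than manual counting.

```python
# Standard Flesch formulas underlying the readability scores reported in the included studies.

def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    """Flesch reading ease: higher values indicate easier text (roughly 0-100)."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    """Flesch-Kincaid grade level: approximate US school grade needed to read the text."""
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59

# Hypothetical counts for one generated patient-education paragraph.
words, sentences, syllables = 180, 9, 290
print(round(flesch_reading_ease(words, sentences, syllables), 1))
print(round(flesch_kincaid_grade(words, sentences, syllables), 1))
```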
Discussion

Overview
Despite the relatively short time since their introduction and the absence of rigorous and comprehensive performance evaluation in highly specialized fields such as orthopedics, LLMs have already been made accessible to the public. Given their increasing acceptance and widespread adoption, it is imperative for orthopedic surgeons to possess a comprehensive understanding of their operational mechanisms and limitations. Users should also delineate safe application boundaries, harnessing the benefits offered by LLMs while mitigating potential risks in daily clinical practice. This section presents an overview of application examples and model performance across diverse fields and offers practical strategies for working with LLMs based on our findings. In addition, we critically evaluate the research methodologies of the included studies and offer recommendations for future investigations.
Application of LLMs in the Field of Orthopedic Education
LLMs can not only provide answers but also offer explanations and even engage in further discussions on a given topic, demonstrating potential value in orthopedic education. Several studies have evaluated the performance of ChatGPT in answering questions related to orthopedics and further discussed its value in the field of orthopedic education. The source of questions includes the OITE [
, - , , ], the ResStudy Orthopaedic Examination Question Bank [ ], the Fellowship of the Royal College of Surgeons (Trauma and Orthopaedic Surgery) examination [ , ], and hand surgery examinations in the United States and Europe [ , ]. Accuracy (scores) and whether the answers meet the standard are important evaluation criteria for LLM performance. Another educational indicator is the correctness and reasonableness of answer explanations. Studies evaluating OITE questions usually convert accuracy into postgraduate year (PGY) levels for evaluation. Due to differences in software applications and question selection, different studies have reported varying performances of ChatGPT. ChatGPT with GPT-4 performed at an average level ranging from PGY-2 to PGY-5 [ , , , ], while ChatGPT with GPT-3.5 performed slightly better than PGY-1 or below the average level of PGY-1 [ - , , ]. For correct answers, ChatGPT can provide explanations and reasoning processes consistent with those of examiners, which helps students understand the questions and general orthopedic principles [ , ]. However, ChatGPT failed to pass the Fellowship of the Royal College of Surgeons (Trauma and Orthopaedic Surgery) examination and hand surgery examinations in the United States and Europe [ , , , ]. In addition, as a language model, ChatGPT cannot analyze medical images correctly [ ], limiting its role in orthopedic imaging education.

Although LLMs currently cannot fully replace orthopedic instructors, they can still serve as a valuable supplementary tool for learning. Integrating their responses with authoritative resources for verification and using appropriate prompts can optimize their capacity to offer logical explanations and foster critical thinking.
Application of LLMs in Clinical Practice
Medical Consultation and Physician-Patient Communication
One challenge faced by orthopedic physicians is that, unlike in the case of other clinical interventions, LLMs have already been integrated as medical consultation tools in the diagnosis and treatment process of numerous diseases without sufficient clinical evidence and regulatory review from authorities such as the US Food and Drug Administration. LLMs can be considered an alternative approach for patients who have sustained injuries or experience discomfort before seeking guidance from primary care physicians or specialists. When confronted with medical issues, individuals who rely heavily on the internet for problem-solving in their personal and professional lives often exhibit a tendency to seek treatment decisions on the web [
]. Compared to traditional search engines or Wikipedia, LLMs could potentially become a significant source of medical consultation information, especially in cases of nonacute diseases such as lower back pain or joint pain. Meanwhile, many physicians also hope that LLMs can help alleviate their burden of simple medical consultations and repetitive paperwork related to physician-patient communication (such as preoperative consent forms), which is considered 1 of the important factors contributing to physician burnout [ ]. Although LLMs can provide concise, clarified, or simplified responses related to the given topic and deliver high-quality and empathetic answers [ , ], given their imperfect performance in addressing questions related to orthopedics [ - ], caution should be exercised regarding their reliability in orthopedic consultation scenarios.

Studies have evaluated the performance of LLMs in answering questions related to hand surgery [
], spinal cord injuries [ ], joint and sports medicine [ , , , , , ], and preoperative physician-patient communication for lumbar disk herniation [ ] and hip replacement surgery [ ]. In these studies, the evaluation criteria of interest typically encompass the model's answer accuracy, readability, completeness, and information sources. Evaluation methods often involve scale assessments or subjective ratings conducted by researchers. The DISCERN score is commonly used to evaluate answer quality [ , , , ], while FKGL and Flesch Reading Ease scores are commonly used to measure readability [ , , ]. The accuracy of LLMs' responses is closely correlated with the specific topic. Questions in the field of joint and sports medicine often receive high-quality responses, while there are serious issues with the quality of answers regarding spinal cord injuries. There are also significant differences in the evaluation of the readability or comprehensibility of LLMs' answers, with some researchers considering them to be easily understood [ , ], while studies using Flesch-related scales suggest that LLMs' answers require a reading level of at least 10 years of education or even university level for full comprehension [ , , ]. The underlying factors contributing to this phenomenon can be attributed to variations in question topics, prompts, and evaluation methodologies used for answer assessment. Consequently, orthopedic surgeons should exercise caution when interpreting the findings of these studies.

Although LLMs can offer more scholarly health information in comparison to search engines [
, ], they still cannot replace orthopedic physicians in medical consultation and physician-patient communication. Using LLMs as a guiding tool and maintaining communication with physicians during further diagnosis and treatment decisions may be a safer and more effective strategy.

Clinical Workflow
The performance of LLMs in orthopedic examinations suggests that they cannot handle complex tasks independently, but they hold potential to serve as valuable assistants for orthopedic physicians. One possible application is using LLMs to automate simple, repetitive tasks such as writing medical records for common orthopedic diseases [
]. In the context of complex disease management tasks, LLMs can possess a more extensive and specialized knowledge base than less experienced newly graduated physicians and assist them in various aspects of disease management. Some researchers have tested the performance of LLMs using specific clinical questions or guidelines [ , , , ], while others have directly inputted clinical case data into the model, allowing it to summarize and provide corresponding diagnostic or treatment decisions autonomously [ , , , , ]. Currently, there is no further research on introducing LLMs into orthopedic operations, likely because of the limited availability of intelligent terminals and digital scenarios that may combine operative procedures with LLMs. Potential docking scenarios for the LLM model could include intelligent surgical applications such as mixed reality operating rooms [ , ] and autonomous laminectomy robots [ , ].

In the context of clinical practice, apart from the fundamental requirement of accurate response, time consumption and work efficiency also serve as crucial reference indicators for evaluating LLMs' performance. Despite variations in the assessment of model accuracy across the included studies, potentially attributed to differences in research objectives, prompt design, evaluation criteria, and assessment tools, no study has presented evidence indicating that LLMs can independently perform clinical work. Therefore, the current models still require rigorous supervision during their use. An RCT study evaluating the performance of ChatGPT in assisting with orthopedic clinical documentation found that there was no significant efficiency advantage in using ChatGPT: the time taken to complete medical history writing was not superior to voice input, and instances of fabricated content were observed within the ChatGPT-generated medical histories [
].

Although LLMs currently have limitations, they remain valuable tools for orthopedic surgeons in their daily practice. It is important to approach the responses provided by LLMs with caution and to seek additional evidence and explanations from the model when faced with unclear answers. By incorporating evidence-based medicine tools, we can ultimately achieve superior clinical diagnoses and treatment plans, thereby elevating the quality of care delivered by physicians.
Application of LLMs in the Field of Research
Research is generally considered a creative endeavor, and introducing LLMs into the field of research may offer more flexibility. Currently, there are limited attempts to use LLMs in orthopedic research. A study found that lowering the reading threshold of professional texts through LLMs can assist in improving the readability of informed consent forms for orthopedic clinical research, but the forms did not meet the recommended sixth-grade reading level set by the American Medical Association [
]. In addition, the literature summarization and generation capabilities of LLMs can contribute to independent or assisted writing of literature reviews in the orthopedic field [ , ]. On the other side of the coin, concerns about integrity arise when scholars find that the model's output can be deceptively realistic. The abstracts generated by LLMs for studies on meniscal injuries and joint replacement were indistinguishable from those written by human researchers [ , ]. However, web-based LLMs do not perform well in tasks such as literature review or sample size estimation in sports medicine research [ , ]. Possible reasons may include the potential limitations of LLMs in meeting logical reasoning requirements and the inappropriate use of prompts. For more complex tasks, an optimization approach could involve developing task-specific toolkits based on the fundamental architecture of LLMs. The feasibility of this approach has been validated in interdisciplinary research on the management of back pain [ ].
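As a rough illustration of the task-specific toolkit idea, the sketch below wraps a general-purpose chat model in a fixed prompt so that downstream research code always receives output in a known structure. The screen_abstract function, the prompt wording, and the complete callable are hypothetical placeholders for illustration; they are not the design of the cited back-pain toolkit.

```python
# Minimal sketch of a task-specific wrapper around a general chat LLM.
# `complete` is a placeholder for whatever chat-completion client is available;
# the prompt and output schema are illustrative only.
import json
from typing import Callable

SYSTEM_PROMPT = (
    "You screen orthopedic abstracts. For the abstract provided, return JSON with "
    'keys "supports_hypothesis" (true/false/unclear) and "rationale" (one sentence).'
)

def screen_abstract(abstract: str, complete: Callable[[str, str], str]) -> dict:
    """Send one abstract through the fixed task prompt and parse the JSON reply."""
    reply = complete(SYSTEM_PROMPT, abstract)
    try:
        return json.loads(reply)
    except json.JSONDecodeError:
        # Defensive fallback: keep the raw text so a human reviewer can inspect it.
        return {"supports_hypothesis": "unclear", "rationale": reply}

# Usage with a stub standing in for a real model call:
demo = screen_abstract(
    "Randomized trial of exercise therapy for chronic low back pain ...",
    complete=lambda system, user: '{"supports_hypothesis": true, "rationale": "Reports reduced pain scores."}',
)
print(demo)
```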
Application of LLMs in Management

Trained NLP models can convert natural language into structured data and have demonstrated superior performance in tasks involving the current procedural terminology for identifying spinal surgery records [
]. However, ChatGPT, despite its larger parameter count, performs worse than these NLP models in the task of identifying spinal surgery current procedural terminology codes [ ]. One possible reason is that traditional NLP models have been trained on more targeted datasets, whereas researchers cannot fine-tune the backend model of ChatGPT using these data. Although the current model's performance limitations hinder its further application in this field, potential advancements in "fine-tuning" techniques may enable LLMs to assume a more influential role in orthopedic management in the future.
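If fine-tuning on institutional data does become feasible, the first practical step is simply assembling supervised examples. The sketch below writes hypothetical operative-note-to-CPT-code pairs into a JSONL file in a generic chat-style format; the records, file name, and message schema are assumptions for illustration rather than the format required by any specific fine-tuning service.

```python
# Illustrative preparation of a supervised fine-tuning set for CPT code assignment.
# The two example records are hypothetical; a real dataset would come from
# de-identified institutional operative notes with verified codes.
import json

examples = [
    {"note": "Posterior lumbar laminectomy at L4-L5 for spinal stenosis ...", "cpt": "63047"},
    {"note": "Open reduction and internal fixation of distal radius fracture ...", "cpt": "25607"},
]

with open("cpt_finetune.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        record = {
            "messages": [
                {"role": "system", "content": "Return the single most appropriate CPT code."},
                {"role": "user", "content": ex["note"]},
                {"role": "assistant", "content": ex["cpt"]},
            ]
        }
        f.write(json.dumps(record) + "\n")
```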
Current Advantages and Limitations of LLMs in Orthopedic Applications

Overview
In contrast to conventional pretrained machine learning models, LLMs exhibit the advantage of versatility by accurately addressing problems across various domains without necessitating additional training on specific samples. In the field of orthopedics, another advantage of LLMs is their user-friendly and convenient nature. Users do not need to go through the long process of waiting and referral from general practitioners to specialists. By simply accessing apps equipped with LLMs, users can inquire about diverse subspecialties in orthopedics at any time and from anywhere, receiving answers promptly at a minimal cost or even free of charge. This service surpasses the capabilities of current health care systems and is unlikely to be replicated in the foreseeable future.
However, as mentioned previously, these advantages are based on unverified answers and unpredictable risks. The answers provided by LLMs for questions related to orthopedics are less robust than those for everyday common knowledge and have significant limitations in terms of accuracy, readability, reliability, and timeliness, as detailed in the following subsections.
Accuracy
Almost all studies (66/68, 97%) found errors in LLMs’ responses, with more noticeable inaccuracies in specialized areas such as hip and knee joints and hand surgery [
, , ]. Some answers even contradicted fundamental orthopedic knowledge [ ]. Therefore, some researchers argue that current expectations for guidance provided by AI platforms should be tempered by both physicians and patients [ ]. Possible reasons include the limited availability of publicly accessible orthopedic data for training, especially for specialized diseases, as well as privacy concerns that restrict public access to a large amount of data. In the future, besides waiting for more powerful next-generation LLMs, using existing LLMs to learn orthopedic cases and fine-tuning them may be a potential solution to improve accuracy.

Readability
Some of the included studies (3/68, 4%) suggest that the content generated by LLMs is not satisfactory in terms of readability for the general population [
, , ]. The potential reasons for the lack of readability may include not only the limited amount of training data but also its quality. By incorporating more popular science content and common clinical responses, it may be possible to address the issue of readability through fine-tuning the model.

Reliability
Different ways of asking the same question may yield completely different answers [
]. This instability, particularly in response to specific prompts, not only affects users' experience and trust but also greatly interferes with researchers' homogenized evaluations. It is imperative to establish standardized questioning processes and prompt criteria.

Timeliness
Training LLMs from scratch is both costly and time consuming, leading to significant retraining expenses. However, unlike everyday common knowledge, orthopedics is constantly evolving with new diagnostic and treatment approaches as well as surgical techniques. Therefore, outdated information becomes an important risk factor leading to inaccurate answers, necessitating caution in this context.
Methodological Limitations of the Selected Studies
Although there are 47 studies related to clinical issues, only 8 (17%) recruited patients [
, , , , , , , ]. Many of the studies (39/47, 83%) only focus on investigation and evaluation, lacking rigorous methods for clinical research, such as RCTs. Furthermore, there is a lack of research end points directly linked to patient outcomes, such as cure rates or improvements in quality of life, making it difficult to find direct evidence of prognosis. Most of the studies (46/68, 68%) rely on subjective methodologies, such as expert ratings, for model evaluation and lack objective criteria and approaches for assessment, leading to unreliable research results. In addition, the absence of standardized questioning paradigms has led to instability in LLM responses, posing challenges for reproducibility and limiting the reliability and clinical significance of the study findings.

Limitations of This Review
This systematic review has several limitations. First, only English-language articles were included, which may have led to the exclusion of relevant studies published in other languages. Second, due to significant heterogeneity in study designs, model tasks, and evaluation parameters among the included studies, we did not perform a comprehensive synthesis of most of the data, nor did we conduct a meta-analysis. Third, our search was restricted to commonly used medical research databases such as PubMed, Embase, and Cochrane Library, potentially overlooking relevant studies from other sources, including conference papers and gray literature. Finally, given the limited availability of rigorous clinical studies, we included a considerable number of subjective surveys. Although our objective was to provide a broad overview of LLM-related information, this may have introduced bias into the findings. These limitations are expected to be addressed as more standardized, high-quality clinical studies become available in future research.
Conclusions
Due to the current limitations of LLMs, they cannot replace orthopedic professionals in the short term. However, using LLMs as copilots could be a potential approach to effectively enhance work efficiency at present. In addition, developing task-specific downstream tools based on LLMs is also a potential solution to improve model performance for further use.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (82272577). The funding organization had no involvement in the design of the study, data collection, data analysis, data interpretation, writing of the report, or the decision to publish. Its role was strictly limited to providing financial support. We used the generative AI tool ChatGPT with GPT-4 [
] for language refinement purposes. The AI tool was only used to perform grammar checks and assist with enhancing the clarity and fluency of the writing. All intellectual contributions, including research design, data collection, analysis, and interpretation, were made by the authors. The final manuscript has been reviewed and approved by the authors, who take full responsibility for its content.

Data Availability
All data generated or analyzed during this study are included in this published paper and its supplementary information files.
Authors' Contributions
NX, WL, CZ, and SL made contributions to conception and design. XZ, SZ, and YT made contributions to the acquisition, analysis, and interpretation of data. CZ, SL, and XZ drafted the manuscript, and NX, WL, SZ, YT, and SW revised it critically for important intellectual content. All authors approved the version to be published.
Conflicts of Interest
None declared.
PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) checklist.
PDF File (Adobe PDF File), 83 KB
Search strategy.
DOCX File, 25 KB
Data extraction strategy.
DOCX File, 22 KB
Revised Cochrane risk-of-bias tool for randomized trials template for assessment completion.
PDF File (Adobe PDF File), 232 KB
Quality assessment of studies of large language models based on CONSORT-AI (Consolidated Standards of Reporting Trials–Artificial Intelligence) guidance.
DOCX File, 53 KB
References
- Thirunavukarasu AJ, Ting DS, Elangovan K, Gutierrez L, Tan TF, Ting DS. Large language models in medicine. Nat Med. Aug 17, 2023;29(8):1930-1940. [CrossRef] [Medline]
- Zhao WX, Zhou K, Li J, Tang T, Wang X, Hou Y, et al. A survey of large language models. arXiv. Preprint posted online on March 31, 2023. [FREE Full text]
- Wei J, Tay Y, Bommasani R, Raffel C, Zoph B, Borgeaud S, et al. Emergent abilities of large language models. arXiv. Preprint posted online on June 15, 2022. [FREE Full text]
- Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, Aleman FL, et al. GPT-4 technical report. arXiv. Preprint posted online on March 15, 2023. [FREE Full text]
- Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. Feb 2023;2(2):e0000198. [FREE Full text] [CrossRef] [Medline]
- Shah NH, Entwistle D, Pfeffer MA. Creation and adoption of large language models in medicine. JAMA. Sep 05, 2023;330(9):866-869. [CrossRef] [Medline]
- Omiye JA, Gui H, Rezaei SJ, Zou J, Daneshjou R. Large language models in medicine: the potentials and pitfalls. Ann Intern Med. Feb 2024;177(2):210-220. [CrossRef]
- Tang YD, Dong ED, Gao W. LLMs in medicine: the need for advanced evaluation systems for disruptive technologies. Innovation (Camb). May 06, 2024;5(3):100622. [FREE Full text] [CrossRef] [Medline]
- GBD 2017 Disease and Injury Incidence and Prevalence Collaborators. Global, regional, and national incidence, prevalence, and years lived with disability for 354 diseases and injuries for 195 countries and territories, 1990-2017: a systematic analysis for the Global Burden of Disease Study 2017. Lancet. Nov 10, 2018;392(10159):1789-1858. [FREE Full text] [CrossRef] [Medline]
- Knezevic NN, Candido KD, Vlaeyen JW, Van Zundert J, Cohen SP. Low back pain. Lancet. Jul 03, 2021;398(10294):78-92. [CrossRef] [Medline]
- Duong V, Oo WM, Ding C, Culvenor AG, Hunter DJ. Evaluation and treatment of knee pain: a review. JAMA. Oct 24, 2023;330(16):1568-1580. [CrossRef] [Medline]
- Yao JJ, Aggarwal M, Lopez RD, Namdari S. Current concepts review: large language models in orthopaedics: definitions, uses, and limitations. J Bone Joint Surg Am. Jun 19, 2024. [CrossRef] [Medline]
- Fayed AM, Mansur NS, de Carvalho KA, Behrens A, D'Hooghe P, de Cesar Netto C. Artificial intelligence and ChatGPT in orthopaedics and sports medicine. J Exp Orthop. Jul 26, 2023;10(1):74. [FREE Full text] [CrossRef] [Medline]
- Merrell LA, Fisher ND, Egol KA. Large language models in orthopaedic trauma: a cutting-edge technology to enhance the field. J Bone Joint Surg Am. Sep 06, 2023;105(17):1383-1387. [CrossRef] [Medline]
- Page MJ, McKenzie JE, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ. Mar 29, 2021;372:n71. [FREE Full text] [CrossRef] [Medline]
- Sterne JA, Savović J, Page MJ, Elbers RG, Blencowe NS, Boutron I, et al. RoB 2: a revised tool for assessing risk of bias in randomised trials. BMJ. Aug 28, 2019;366:l4898. [FREE Full text] [CrossRef] [Medline]
- Liu X, Rivera SC, Moher D, Calvert MJ, Denniston AK, SPIRIT-AICONSORT-AI Working Group. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. BMJ. Sep 09, 2020;370:m3164. [FREE Full text] [CrossRef] [Medline]
- Agharia S, Szatkowski J, Fraval A, Stevens J, Zhou Y. The ability of artificial intelligence tools to formulate orthopaedic clinical decisions in comparison to human clinicians: an analysis of ChatGPT 3.5, ChatGPT 4, and Bard. J Orthop. Apr 2024;50:1-7. [FREE Full text] [CrossRef] [Medline]
- Anastasio AT, Mills FB4, Karavan MPJ, Adams SBJ. Evaluating the quality and usability of artificial intelligence-generated responses to common patient questions in foot and ankle surgery. Foot Ankle Orthop. Oct 22, 2023;8(4):24730114231209919. [FREE Full text] [CrossRef] [Medline]
- Baker HP, Dwyer E, Kalidoss S, Hynes K, Wolf J, Strelzow JA. ChatGPT's ability to assist with clinical documentation: a randomized controlled trial. J Am Acad Orthop Surg. Feb 01, 2024;32(3):123-129. [CrossRef] [Medline]
- Christy M, Morris MT, Goldfarb CA, Dy CJ. Appropriateness and reliability of an online artificial intelligence platform's responses to common questions regarding distal radius fractures. J Hand Surg Am. Feb 2024;49(2):91-98. [CrossRef] [Medline]
- Coraci D, Maccarone MC, Regazzo G, Accordi G, Papathanasiou JV, Masiero S. ChatGPT in the development of medical questionnaires. The example of the low back pain. Eur J Transl Myol. Dec 15, 2023;33(4):12114. [FREE Full text] [CrossRef] [Medline]
- Crook BS, Park CN, Hurley ET, Richard MJ, Pidgeon TS. Evaluation of online artificial intelligence-generated information on common hand procedures. J Hand Surg Am. Nov 2023;48(11):1122-1127. [CrossRef] [Medline]
- Daher M, Koa J, Boufadel P, Singh J, Fares MY, Abboud JA. Breaking barriers: can ChatGPT compete with a shoulder and elbow specialist in diagnosis and management? JSES Int. Nov 2023;7(6):2534-2541. [FREE Full text] [CrossRef] [Medline]
- Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. Oct 02, 2023;6(10):e2336997. [FREE Full text] [CrossRef] [Medline]
- Draschl A, Hauer G, Fischerauer SF, Kogler A, Leitner L, Andreou D, et al. Are ChatGPT's free-text responses on periprosthetic joint infections of the hip and knee reliable and useful? J Clin Med. Oct 20, 2023;12(20):6655. [FREE Full text] [CrossRef] [Medline]
- Dubin JA, Bains SS, Chen Z, Hameed D, Nace J, Mont MA, et al. Using a Google web search analysis to assess the utility of ChatGPT in total joint arthroplasty. J Arthroplasty. Jul 2023;38(7):1195-1202. [CrossRef] [Medline]
- Duey AH, Nietsch KS, Zaidat B, Ren R, Ndjonko LC, Shrestha N, et al. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J. Nov 2023;23(11):1684-1691. [CrossRef] [Medline]
- Fabijan A, Polis B, Fabijan R, Zakrzewski K, Nowosławska E, Zawadzka-Fabijan A. Artificial intelligence in scoliosis classification: an investigation of language-based models. J Pers Med. Dec 09, 2023;13(12):1695. [FREE Full text] [CrossRef] [Medline]
- Fahy S, Oehme S, Milinkovic D, Jung T, Bartek B. Assessment of quality and readability of information provided by ChatGPT in relation to anterior cruciate ligament injury. J Pers Med. Jan 18, 2024;14(1):104. [FREE Full text] [CrossRef] [Medline]
- Gianola S, Bargeri S, Castellini G, Cook C, Palese A, Pillastrini P, et al. Performance of ChatGPT compared to clinical practice guidelines in making informed decisions for lumbosacral radicular pain: a cross-sectional study. J Orthop Sports Phys Ther. Mar 2024;54(3):222-228. [CrossRef] [Medline]
- Hurley ET, Crook BS, Lorentz SG, Danilkowicz RM, Lau BC, Taylor DC, et al. Evaluation high-quality of information from ChatGPT (artificial intelligence-large language model) artificial intelligence on shoulder stabilization surgery. Arthroscopy. Mar 2024;40(3):726-31.e6. [CrossRef] [Medline]
- Johns WL, Kellish A, Farronato D, Ciccotti MG, Hammoud S. ChatGPT can offer satisfactory responses to common patient questions regarding elbow ulnar collateral ligament reconstruction. Arthrosc Sports Med Rehabil. Apr 2024;6(2):100893. [FREE Full text] [CrossRef] [Medline]
- Johns WL, Martinazzi BJ, Miltenberg B, Nam HH, Hammoud S. ChatGPT provides unsatisfactory responses to frequently asked questions regarding anterior cruciate ligament reconstruction. Arthroscopy. Jul 2024;40(7):2067-79.e1. [CrossRef] [Medline]
- Kaarre J, Feldt R, Keeling LE, Dadoo S, Zsidai B, Hughes JD, et al. Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surg Sports Traumatol Arthrosc. Nov 2023;31(11):5190-5198. [FREE Full text] [CrossRef] [Medline]
- Kasthuri VS, Glueck J, Pham H, Daher M, Balmaceno-Criss M, McDonald CL, et al. Assessing the accuracy and reliability of AI-generated responses to patient questions regarding spine surgery. J Bone Joint Surg Am. Jun 19, 2024;106(12):1136-1142. [CrossRef] [Medline]
- Kienzle A, Niemann M, Meller S, Gwinner C. ChatGPT may offer an adequate substitute for informed consent to patients prior to total knee arthroplasty-yet caution is needed. J Pers Med. Jan 05, 2024;14(1):69. [FREE Full text] [CrossRef] [Medline]
- Kirchner GJ, Kim RY, Weddle JB, Bible JE. Can artificial intelligence improve the readability of patient education materials? Clin Orthop Relat Res. Nov 01, 2023;481(11):2260-2267. [CrossRef] [Medline]
- Kuroiwa T, Sarcon A, Ibara T, Yamada E, Yamamoto A, Tsukamoto K, et al. The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res. Sep 15, 2023;25:e47621. [FREE Full text] [CrossRef] [Medline]
- Li J, Gao X, Dou T, Gao Y, Zhu W. Assessing the performance of GPT-4 in the field of osteoarthritis and orthopaedic case consultation. medRxiv. Preprint posted online on August 09, 2023. [FREE Full text] [CrossRef]
- Li LT, Sinkler MA, Adelstein JM, Voos JE, Calcei JG. ChatGPT responses to common questions about anterior cruciate ligament reconstruction are frequently satisfactory. Arthroscopy. Jul 2024;40(7):2058-2066. [CrossRef] [Medline]
- Lower K, Seth I, Lim B, Seth N. ChatGPT-4: transforming medical education and addressing clinical exposure challenges in the post-pandemic era. Indian J Orthop. Sep 2023;57(9):1527-1544. [CrossRef] [Medline]
- Magruder ML, Rodriguez AN, Wong JCJ, Erez O, Piuzzi NS, Scuderi GR, et al. Assessing ability for ChatGPT to answer total knee arthroplasty-related questions. J Arthroplasty. Aug 2024;39(8):2022-2027. [CrossRef] [Medline]
- Mika AP, Martin JR, Engstrom SM, Polkowski GG, Wilson JM. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. Oct 04, 2023;105(19):1519-1526. [CrossRef] [Medline]
- Pagano S, Holzapfel S, Kappenschneider T, Meyer M, Maderbacher G, Grifka J, et al. Arthrosis diagnosis and treatment recommendations in clinical practice: an exploratory investigation with the generative AI model GPT-4. J Orthop Traumatol. Nov 28, 2023;24(1):61. [FREE Full text] [CrossRef] [Medline]
- Mejia MR, Arroyave JS, Saturno M, Ndjonko LC, Zaidat B, Rajjoub R, et al. Use of ChatGPT for determining clinical and surgical treatment of lumbar disc herniation with radiculopathy: a North American Spine Society guideline comparison. Neurospine. Mar 2024;21(1):149-158. [FREE Full text] [CrossRef] [Medline]
- Russe MF, Fink A, Ngo H, Tran H, Bamberg F, Reisert M, et al. Performance of ChatGPT, human radiologists, and context-aware ChatGPT in identifying AO codes from radiology reports. Sci Rep. Aug 30, 2023;13(1):14215. [FREE Full text] [CrossRef] [Medline]
- Schonfeld E, Pant A, Shah A, Sadeghzadeh S, Pangal D, Rodrigues A, et al. Evaluating computer vision, large language, and genome-wide association models in a limited sized patient cohort for pre-operative risk stratification in adult spinal deformity surgery. J Clin Med. Jan 23, 2024;13(3):656. [FREE Full text] [CrossRef] [Medline]
- Seth I, Xie Y, Rodwell A, Gracias D, Bulloch G, Hunter-Smith DJ, et al. Exploring the role of a large language model on carpal tunnel syndrome management: an observation study of ChatGPT. J Hand Surg Am. Oct 2023;48(10):1025-1033. [CrossRef] [Medline]
- Shrestha N, Shen Z, Zaidat B, Duey AH, Tang JE, Ahmed W, et al. Performance of ChatGPT on NASS clinical guidelines for the diagnosis and treatment of low back pain: a comparison study. Spine (Phila Pa 1976). May 01, 2024;49(9):640-651. [CrossRef] [Medline]
- Sosa BR, Cung M, Suhardi VJ, Morse K, Thomson A, Yang HS, et al. Capacity for large language model chatbots to aid in orthopedic management, research, and patient queries. J Orthop Res. Jun 21, 2024;42(6):1276-1282. [CrossRef] [Medline]
- Stroop A, Stroop T, Zawy Alsofy S, Nakamura M, Möllmann F, Greiner C, et al. Large language models: are artificial intelligence-based chatbots a reliable source of patient information for spinal surgery? Eur Spine J. Oct 11, 2023. (forthcoming). [CrossRef] [Medline]
- Suthar PP, Kounsal A, Chhetri L, Saini D, Dua SG. Artificial intelligence (AI) in radiology: a deep dive into ChatGPT 4.0's accuracy with the American Journal of Neuroradiology's (AJNR) "case of the month". Cureus. Aug 2023;15(8):e43958. [FREE Full text] [CrossRef] [Medline]
- Taylor WL 4th, Cheng R, Weinblatt AI, Bergstein V, Long WJ. An artificial intelligence chatbot is an accurate and useful online patient resource prior to total knee arthroplasty. J Arthroplasty. Aug 2024;39(8S1):S358-S362. [CrossRef] [Medline]
- Temel MH, Erden Y, Bağcıer F. Information quality and readability: ChatGPT's responses to the most common questions about spinal cord injury. World Neurosurg. Jan 2024;181:e1138-e1144. [CrossRef] [Medline]
- Tharakan S, Klein B, Bartlett L, Atlas A, Parada SA, Cohn RM. Do ChatGPT and Google differ in answers to commonly asked patient questions regarding total shoulder and total elbow arthroplasty? J Shoulder Elbow Surg. Aug 2024;33(8):e429-e437. [CrossRef] [Medline]
- Truhn D, Weber CD, Braun BJ, Bressem K, Kather JN, Kuhl C, et al. A pilot study on the efficacy of GPT-4 in providing orthopedic treatment recommendations from MRI reports. Sci Rep. Nov 17, 2023;13(1):20159. [FREE Full text] [CrossRef] [Medline]
- Warren EJ, Hurley ET, Park CN, Crook BS, Lorentz S, Levin JM, et al. Evaluation of information from artificial intelligence on rotator cuff repair surgery. JSES Int. Jan 2024;8(1):53-57. [FREE Full text] [CrossRef] [Medline]
- Wilhelm TI, Roos J, Kaczmarczyk R. Large language models for therapy recommendations across 3 clinical specialties: comparative study. J Med Internet Res. Oct 30, 2023;25:e49324. [FREE Full text] [CrossRef] [Medline]
- Wright BM, Bodnar MS, Moore AD, Maseda MC, Kucharik MP, Diaz CC, et al. Is ChatGPT a trusted source of information for total hip and knee arthroplasty patients? Bone Jt Open. Feb 15, 2024;5(2):139-146. [FREE Full text] [CrossRef] [Medline]
- Yang F, Yan D, Wang Z. Large-scale assessment of ChatGPT's performance in benign and malignant bone tumors imaging report diagnosis and its potential for clinical applications. J Bone Oncol. Feb 2024;44:100525. [FREE Full text] [CrossRef] [Medline]
- Yang J, Ardavanis KS, Slack KE, Fernando ND, Della Valle CJ, Hernandez NM. Chat generative pretrained transformer (ChatGPT) and bard: artificial intelligence does not yet provide clinically supported answers for hip and knee osteoarthritis. J Arthroplasty. May 2024;39(5):1184-1190. [CrossRef] [Medline]
- Yapar D, Demir Avcı Y, Tokur Sonuvar E, Eğerci Ö, Yapar A. ChatGPT's potential to support home care for patients in the early period after orthopedic interventions and enhance public health. Jt Dis Relat Surg. Jan 01, 2024;35(1):169-176. [FREE Full text] [CrossRef] [Medline]
- Zhou Y, Moon C, Szatkowski J, Moore D, Stevens J. Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop Surg Traumatol. Feb 2024;34(2):927-955. [FREE Full text] [CrossRef] [Medline]
- Cuthbert R, Simpson AI. Artificial intelligence in orthopaedics: can Chat Generative Pre-trained Transformer (ChatGPT) pass section 1 of the Fellowship of the Royal College of Surgeons (trauma and orthopaedics) examination? Postgrad Med J. Sep 21, 2023;99(1176):1110-1114. [CrossRef] [Medline]
- Ghanem D, Covarrubias O, Raad M, LaPorte D, Shafiq B. ChatGPT performs at the level of a third-year orthopaedic surgery resident on the orthopaedic in-training examination. JBJS Open Access. 2023;8(4):e23.00103. [FREE Full text] [CrossRef] [Medline]
- Han Y, Choudhry HS, Simon ME, Katt BM. ChatGPT's performance on the hand surgery self-assessment exam: a critical analysis. J Hand Surg Glob Online. Mar 2024;6(2):200-205. [FREE Full text] [CrossRef] [Medline]
- Hofmann HL, Guerra GA, Le JL, Wong AM, Hofmann GH, Mayfield CK, et al. The rapid development of artificial intelligence: GPT-4's performance on orthopedic surgery board questions. Orthopedics. 2024;47(2):e85-e89. [CrossRef] [Medline]
- Jain N, Gottlich C, Fisher J, Campano D, Winston T. Assessing ChatGPT's orthopedic in-service training exam performance and applicability in the field. J Orthop Surg Res. Jan 03, 2024;19(1):27. [FREE Full text] [CrossRef] [Medline]
- Kung J, Marshall C, Gauthier C, Gonzalez T, Jackson JB 3rd. Evaluating ChatGPT performance on the orthopaedic in-training examination. JB JS Open Access. 2023;8(3):e23.00056. [FREE Full text] [CrossRef] [Medline]
- Lum ZC. Can artificial intelligence pass the American Board of Orthopaedic Surgery examination? Orthopaedic residents versus ChatGPT. Clin Orthop Relat Res. Aug 01, 2023;481(8):1623-1630. [CrossRef] [Medline]
- Massey PA, Montgomery C, Zhang AS. Comparison of ChatGPT-3.5, ChatGPT-4, and orthopaedic resident performance on orthopaedic assessment examinations. J Am Acad Orthop Surg. Dec 01, 2023;31(23):1173-1179. [FREE Full text] [CrossRef] [Medline]
- Ozdag Y, Hayes DS, Makar GS, Manzar S, Foster BK, Shultz MJ, et al. Comparison of artificial intelligence to resident performance on upper-extremity orthopaedic in-training examination questions. J Hand Surg Glob Online. Mar 2024;6(2):164-168. [FREE Full text] [CrossRef] [Medline]
- Rizzo MG, Cai N, Constantinescu D. The performance of ChatGPT on orthopaedic in-service training exams: a comparative study of the GPT-3.5 turbo and GPT-4 models in orthopaedic education. J Orthop. Apr 2024;50:70-75. [CrossRef] [Medline]
- Saad A, Iyengar KP, Kurisunkal V, Botchu R. Assessing ChatGPT's ability to pass the FRCS orthopaedic part A exam: a critical analysis. Surgeon. Oct 2023;21(5):263-266. [CrossRef] [Medline]
- Traoré SY, Goetsch T, Muller B, Dabbagh A, Liverneaux PA. Is ChatGPT able to pass the first part of the European Board of Hand Surgery diploma examination? Hand Surg Rehabil. Sep 2023;42(4):362-364. [CrossRef] [Medline]
- Gill B, Bonamer J, Kuechly H, Gupta R, Emmert S, Kurkowski S, et al. ChatGPT is a promising tool to increase readability of orthopedic research consents. J Orthop Trauma Rehab. Jan 22, 2024. [CrossRef]
- Hakam HT, Prill R, Korte L, Lovreković B, Ostojić M, Ramadanov N, et al. Human-written vs AI-generated texts in orthopedic academic literature: comparative qualitative analysis. JMIR Form Res. Feb 16, 2024;8:e52164. [FREE Full text] [CrossRef] [Medline]
- Kacena MA, Plotkin LI, Fehrenbacher JC. The use of artificial intelligence in writing scientific review articles. Curr Osteoporos Rep. Feb 2024;22(1):115-121. [FREE Full text] [CrossRef] [Medline]
- Lawrence KW, Habibi AA, Ward SA, Lajam CM, Schwarzkopf R, Rozell JC. Human versus artificial intelligence-generated arthroplasty literature: a single-blinded analysis of perceived communication, quality, and authorship source. Int J Med Robot. Feb 13, 2024;20(1):e2621. [CrossRef] [Medline]
- Lotz JC, Ropella G, Anderson P, Yang Q, Hedderich MA, Bailey J, et al. An exploration of knowledge-organizing technologies to advance transdisciplinary back pain research. JOR Spine. Dec 2023;6(4):e1300. [FREE Full text] [CrossRef] [Medline]
- Methnani J, Latiri I, Dergaa I, Chamari K, Ben Saad H. ChatGPT for sample-size calculation in sports medicine and exercise sciences: a cautionary note. Int J Sports Physiol Perform. Oct 01, 2023;18(10):1219-1223. [CrossRef] [Medline]
- Nazzal MK, Morris AJ, Parker RS, White FA, Natoli RM, Fehrenbacher JC, et al. Using AI to write a review article examining the role of the nervous system on skeletal homeostasis and fracture healing. Curr Osteoporos Rep. Feb 13, 2024;22(1):217-221. [FREE Full text] [CrossRef] [Medline]
- Sanii RY, Kasto JK, Wines WB, Mahylis JM, Muh SJ. Utility of artificial intelligence in orthopedic surgery literature review: a comparative pilot study. Orthopedics. 2024;47(3):e125-e130. [FREE Full text] [CrossRef] [Medline]
- Zaidat B, Lahoti YS, Yu A, Mohamed KS, Cho SK, Kim JS. Artificially intelligent billing in spine surgery: an analysis of a large language model. Global Spine J. Dec 26, 2023:21925682231224753. [FREE Full text] [CrossRef] [Medline]
- Tajirian T, Stergiopoulos V, Strudwick G, Sequeira L, Sanches M, Kemp J, et al. The influence of electronic health record use on physician burnout: cross-sectional survey. J Med Internet Res. Jul 15, 2020;22(7):e19274. [FREE Full text] [CrossRef] [Medline]
- Ayers JW, Poliak A, Dredze M, Leas EC, Zhu Z, Kelley JB, et al. Comparing physician and artificial intelligence chatbot responses to patient questions posted to a public social media forum. JAMA Intern Med. Jun 01, 2023;183(6):589-596. [FREE Full text] [CrossRef] [Medline]
- Charnock D, Shepperd S, Needham G, Gann R. DISCERN: an instrument for judging the quality of written consumer health information on treatment choices. J Epidemiol Community Health. Feb 1999;53(2):105-111. [FREE Full text] [CrossRef] [Medline]
- Dennler C, Bauer DE, Scheibler AG, Spirig J, Götschi T, Fürnstahl P, et al. Augmented reality in the operating room: a clinical feasibility study. BMC Musculoskelet Disord. May 18, 2021;22(1):451. [FREE Full text] [CrossRef] [Medline]
- Verhey JT, Haglin JM, Verhey EM, Hartigan DE. Virtual, augmented, and mixed reality applications in orthopedic surgery. Int J Med Robot. Apr 2020;16(2):e2067. [CrossRef] [Medline]
- Li Z, Jiang S, Song X, Liu S, Wang C, Hu L, et al. Collaborative spinal robot system for laminectomy: a preliminary study. Neurosurg Focus. Jan 2022;52(1):E11. [CrossRef] [Medline]
- Li Z, Wang C, Song X, Liu S, Zhang Y, Jiang S, et al. Accuracy evaluation of a novel spinal robotic system for autonomous laminectomy in thoracic and lumbar vertebrae: a cadaveric study. J Bone Joint Surg Am. Jun 21, 2023;105(12):943-950. [CrossRef] [Medline]
- Kim JS, Vivas A, Arvind V, Lombardi J, Reidler J, Zuckerman SL, et al. Can natural language processing and artificial intelligence automate the generation of billing codes from operative note dictations? Global Spine J. Sep 2023;13(7):1946-1955. [FREE Full text] [CrossRef] [Medline]
- ChatGPT. URL: https://chatgpt.com [accessed 2024-04-29]
Abbreviations
AI: artificial intelligence
CONSORT-AI: Consolidated Standards of Reporting Trials–Artificial Intelligence
FKGL: Flesch-Kincaid grade level
GPT: generative pretrained transformer
LLM: large language model
NLP: natural language processing
OITE: Orthopaedic Surgery In-Training Examination
PGY: postgraduate year
PLM: pretrained language model
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RCT: randomized controlled trial
Edited by G Eysenbach, A Mavragani; submitted 17.04.24; peer-reviewed by J Wu, X Liu, D-G Chang; comments to author 20.06.24; revised version received 01.08.24; accepted 11.09.24; published 15.11.24.
Copyright © Cheng Zhang, Shanshan Liu, Xingyu Zhou, Siyu Zhou, Yinglun Tian, Shenglin Wang, Nanfang Xu, Weishi Li. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 15.11.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.