Published in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/54676, first published .
Machine Learning and Deep Learning for Diagnosis of Lumbar Spinal Stenosis: Systematic Review and Meta-Analysis

Review

Beijing Chaoyang Hospital, Capital Medical University, Beijing, China

*these authors contributed equally

Corresponding Author:

Lei Zang, MD

Beijing Chaoyang Hospital

Capital Medical University

5 JingYuan Road, Shijingshan District

Beijing, 100043

China

Phone: 86 51718268

Email: zanglei@ccmu.edu.cn


Background: Lumbar spinal stenosis (LSS) is a major cause of pain and disability in older individuals worldwide. Although a growing number of studies have applied traditional machine learning (TML) and deep learning (DL) to the diagnosis of LSS with promising results, the performance of these models has not been analyzed systematically.

Objective: This systematic review and meta-analysis aimed to pool the results and evaluate the heterogeneity of the current studies in using TML or DL models to diagnose LSS, thereby providing more comprehensive information for further clinical application.

Methods: This review was performed under the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines using articles retrieved from the PubMed, Embase, and Cochrane Library databases. Studies that evaluated the diagnostic value of DL or TML algorithms for LSS were included, while those with duplicated or unavailable data were excluded. The Quality Assessment of Diagnostic Accuracy Studies 2 tool was used to estimate the risk of bias in each study. The MIDAS and METAPROP modules of Stata (StataCorp) were used for data synthesis and statistical analyses.

Results: A total of 12 studies with 15,044 patients reported the diagnostic value of TML or DL models for LSS. The risk of bias assessment yielded 4 studies with high risk of bias, 3 with unclear risk of bias, and 5 with completely low risk of bias. The pooled sensitivity and specificity were 0.84 (95% CI 0.82-0.86; I2=99.06%) and 0.87 (95% CI 0.84-0.90; I2=98.7%), respectively. The diagnostic odds ratio was 36 (95% CI 26-49), the positive likelihood ratio (LR+) was 6.6 (95% CI 5.1-8.4), and the negative likelihood ratio (LR–) was 0.18 (95% CI 0.16-0.21). The area under the summary receiver operating characteristic curve for TML or DL models diagnosing LSS was 0.92 (95% CI 0.89-0.94), indicating a high diagnostic value.

Conclusions: This systematic review and meta-analysis emphasize that although artificial intelligence systems have shown generally satisfactory diagnostic performance for LSS in experimental settings, none is yet reliable and practical enough for real clinical practice. Future studies should optimize the balance of models, establish widely accepted objective reference standards, adopt multimodal strategies, use large datasets for training and testing, perform external validation, and report methods and results fully and rigorously to bridge the gap between current TML or DL models and real-life clinical applications.

Trial Registration: PROSPERO CRD42024566535; https://tinyurl.com/msx59x8k

J Med Internet Res 2024;26:e54676

doi:10.2196/54676


Lumbar spinal stenosis (LSS) is a major cause of pain and disability in older individuals [1]. LSS has become a worldwide public health issue: an estimated more than 102 million people are diagnosed with LSS annually, with high incidence in Europe and the United States [2,3]. According to the clinical guideline developed by the North American Spine Society, LSS is a condition of diminished space available for the neural and vascular elements in the lumbar spine, secondary to degenerative changes in the spinal canal [4]. An accurate LSS diagnosis is essential for choosing effective treatment. Currently, clinicians diagnose LSS through a comprehensive evaluation combining the patient’s history, physical examination, and spinal imaging tests such as x-ray, computed tomography (CT), and magnetic resonance imaging (MRI) [1,2]. As a superior radiographic screening tool for soft tissues, MRI plays a crucial role in detecting the presence, classification, and grading of LSS [5-7]. However, documenting the extensive information in spinal MRI is time-consuming and repetitive, creating a laborious clinical workload [7]. Furthermore, existing LSS grading systems are mainly qualitative or semiquantitative; they depend heavily on expertise and suffer from high interobserver variation because of the complexity of the spinal canal and foramen [5,6,8-11]. Therefore, more intelligent radiographic methods for diagnosing and grading LSS are warranted.

Machine learning (ML), a subdiscipline of artificial intelligence (AI), has shown great advantages in analyzing medical images and predicting outcomes [12-14]. ML begins with algorithms trained on a set of data, such as image features, to establish a prediction or diagnosis by extracting and classifying relevant information. More recently, a crucial branch of ML, deep learning (DL), has rapidly come to prominence. DL algorithms are designed with multiple processing layers, which can learn more complex image features than traditional ML methods [15]. Although DL is still challenged by its demand for large-scale datasets and its difficulty of interpretation, it offers the unmatched advantage of automatic feature extraction, minimizing the bias introduced by manual intervention [12,14]. In 2016, He et al [16] used traditional ML (TML) methods based on their newly proposed synchronized superpixel representation model to recognize radiographic lumbar foraminal stenosis (LFS). Subsequently, a growing number of TML and DL studies have addressed the diagnosis and grading of LSS with prominent results [16-34]. However, most of these studies focus on either algorithm development or clinical validation, leading to great variation in experimental settings and incomplete reporting of accuracy and reliability metrics. Hence, a systematic review and meta-analysis is needed to evaluate the heterogeneity of these studies and provide comprehensive pooled results. To our knowledge, however, no systematic review and meta-analysis has previously been conducted to address this issue.

Therefore, this systematic review and meta-analysis aimed to evaluate the heterogeneity and pool the results of the current studies in using ML or DL models to diagnose LSS, thereby providing more comprehensive information for further clinical application.


Study Design and Registration

This systematic literature review was conducted following the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines and flowchart [35,36] and the PRISMA diagnostic test accuracy checklist (Multimedia Appendix 1) [37]. The protocol for this systematic review was registered in PROSPERO (ID: CRD42024566535). Ethical approval was not required because this systematic literature review focused on retrospective studies.

Search Strategy

This review collected the records from 3 major databases up to October 2023. A second search was performed in February 2024 to complement newly published studies. Those databases include PubMed, Embase, and the Cochrane Library (CENTRAL), which are recommended academic search systems for systematic reviews and meta-analyses [38]. We used the MeSH (Medical Subject Headings) and Emtree headings in several combinations and supplemented them with free text to increase sensitivity. In addition, we searched references contained in the included studies to supplement the relevant literature. An experienced librarian designed and implemented the search strategy. The following MeSH terms were used for PubMed: “Spinal Stenosis,” “Intervertebral Disc Degeneration,” “Lumbar Vertebrae,” “Machine Learning,” “Deep Learning,” and “Neural Networks, Computer*.” The details of the search strategy are stated in Multimedia Appendix 2.

Inclusion and Exclusion Criteria

We included studies that evaluated the diagnostic value of DL or TML algorithms for LSS and that were available in English. Studies included in the meta-analysis had to provide a 2×2 confusion matrix or report sensitivity, specificity, and precision from which one could be reconstructed. Purely statistical, non–artificial intelligence, and general AI methods were not considered DL or TML. Articles with duplicated or unavailable data were excluded, as were abstracts from protocols, case reports, editorials, and review articles.
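When a study reported sensitivity, specificity, and precision together with the total sample size rather than raw counts, the 2×2 table can be recovered algebraically. The sketch below is an illustrative Python rendering of that reconstruction (the function name is ours, not from the review; recovered counts should round to integers for a consistent study):

```python
def reconstruct_2x2(sens, spec, prec, n):
    """Recover (TP, FP, FN, TN) from sensitivity, specificity,
    precision, and total sample size n (illustrative sketch)."""
    # Two expressions for FP must agree:
    #   from precision:    FP = TP * (1 - prec) / prec, with TP = sens * P
    #   from specificity:  FP = (1 - spec) * (n - P)
    # Solving for the number of condition-positive cases P:
    p = (1 - spec) * n / (sens * (1 - prec) / prec + (1 - spec))
    tp = sens * p
    fn = p - tp
    fp = tp * (1 - prec) / prec
    tn = n - p - fp
    return round(tp), round(fp), round(fn), round(tn)
```

For example, a study reporting sensitivity 0.8, specificity 0.8, and precision 0.8 on 200 cases corresponds to TP=80, FP=20, FN=20, TN=80.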

Review Process

A total of 2 reviewers (TW and NF) independently screened the titles and abstracts of the articles remaining after duplicate removal to determine potential eligibility. We then reviewed the full texts of the remaining articles and excluded those that did not meet the inclusion criteria. We also screened the reference lists of all relevant studies and systematic reviews for potentially eligible studies. Disagreements were resolved by discussion and, when necessary, by third-party adjudication. For studies included in the systematic review that lacked data usable for meta-analysis, we emailed the corresponding authors to request the necessary data.

Data Extraction

A total of 2 reviewers independently extracted, summarized, and tabulated the following data using a standard form: baseline characteristics of the studies, including publication year, study type, model type, algorithms used, LSS classifications, number of participants, validation strategy, imaging modality, and diagnosis criteria of LSS. Any discrepancies in the extracted data were resolved by discussion. For studies that provided multiple contingency tables based on different classifier algorithms, datasets, LSS types, or label strategies, we assumed these to be independent of each other. For studies that provided multiple contingency tables based on different preprocessing strategies, we selected the best-performing result; if no preprocessing strategy performed significantly better than the others, we enrolled each strategy as an individual study and collected the corresponding results. For repeated test results based on the same classifier algorithms and datasets, we used the average values of the metrics as the final results.

Quality Assessment

A total of 2 reviewers (TW and NF) used the Quality Assessment of Diagnostic Accuracy Studies 2 (QUADAS-2), which is a tool for assessing the quality of primary diagnostic accuracy studies, to independently assess the risk of bias for each eligible study [39]. The QUADAS-2 criteria assessed the risk of bias in 4 domains: patient selection, index test, reference standard, and flow and timing. Any disagreements were resolved by discussion with a third author.

Statistical Analysis

We used the MIDAS and METAPROP modules [40] of Stata (version 17.0; StataCorp) for statistical analysis. Postestimation model diagnostics were used to assess heterogeneity with the I2 statistic, interpreted as follows: 0%-40%, low heterogeneity; 30%-60%, moderate; 50%-90%, substantial; and 75%-100%, considerable. Bivariate mixed-effects logistic regression modeling was conducted, and forest plots were used to compare the sensitivity and specificity of DL or TML models for diagnosing LSS. We used summary receiver operating characteristic (SROC) curves to assess overall diagnostic accuracy and the Fagan nomogram to explore the relationship between pretest probability, likelihood ratio (LR), and posttest probability. LR dot plots were divided into 4 quadrants according to evidence-strength thresholds, which were used to judge whether DL or TML models could confirm or exclude LSS. Finally, subgroup analyses examined whether the estimated sensitivity, specificity, and associated I2 differed by several moderators whenever each subgroup included ≥4 datasets.
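The per-study quantities that enter these pooled analyses are simple functions of each 2×2 table. The sketch below (illustrative Python, not the Stata code used in this review; the function name and the 0.5 continuity correction for zero cells are our assumptions) shows how sensitivity, specificity, likelihood ratios, and the diagnostic odds ratio relate to one another:

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Per-study diagnostic accuracy metrics from a 2x2 table
    (illustrative sketch of the quantities pooled in the meta-analysis)."""
    # Apply a 0.5 continuity correction if any cell is zero,
    # so the likelihood ratios and DOR stay finite.
    if 0 in (tp, fp, fn, tn):
        tp, fp, fn, tn = (x + 0.5 for x in (tp, fp, fn, tn))
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    lr_pos = sens / (1 - spec)          # positive likelihood ratio
    lr_neg = (1 - sens) / spec          # negative likelihood ratio
    dor = lr_pos / lr_neg               # equals (tp * tn) / (fp * fn)
    return {"sens": sens, "spec": spec, "LR+": lr_pos, "LR-": lr_neg, "DOR": dor}
```

For instance, a table with TP=80, FP=20, FN=20, TN=80 gives sensitivity and specificity of 0.8, LR+ of 4, LR– of 0.25, and a DOR of 16.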


Study Selection and Characteristics

The initial search identified 934 titles and abstracts, of which 269 were duplicates. After screening against this study’s inclusion and exclusion criteria, 567 articles were excluded, leaving 98 studies for full-text review, of which 19 and 12 were included in the systematic review and meta-analysis, respectively (Figure 1). Table 1 summarizes the characteristics of the included studies, including study type, model type, algorithms used, LSS classifications, number of participants, validation strategy, imaging modality, and diagnosis criteria of LSS. The 19 studies in the systematic review were published from 2016 to 2024. The 12 studies in the meta-analysis were all retrospective and comprised 21 external tests [17,25-27,30,32,34] and 35 internal tests [17,20,22,24,26-28,33,34], for a total of 56 datasets from distinct data sources. Among the 56 datasets, 32 identified LSS on MRI [17,20,22,24,25,27,28,30,32], 20 on x-ray [26,34], and 4 on CT [33]. Furthermore, 29 datasets were used to develop and internally test DL models [17,20,24,26-28,33,34], 6 to internally test TML models [22,28], and 21 to externally test DL models [17,25-27,30,32,34].

Figure 1. Flowchart depicting PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) search strategy.
Table 1. Characteristics of the included studies in the systemic review and meta-analysis.
| Study | Study type | Model type | Algorithms useda | LSSb type | Participants, n | Validation strategy | Imaging modality | Diagnosis criteria |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| He et al [16] | Development study and internal test | TMLc | KNNd, SVMe, and LDAf | LFSg | 110 | Cross-validation | MRIh | Lee et al [5] |
| Jamaludin et al [18] | Development study and internal test | DLi | CNNj (SpineNetk) | LCSl | 2009 | Hold-out validation | MRI | No reference |
| Zhang et al [19] | Development study and internal test | TML | SVM and decision tree | LCS and LFS | 600 | Hold-out validation | MRI | No reference |
| Lu et al [21] | Development study and internal test | DL | ResNeXt-50 | LCS and LFS | 4075 | Hold-out validation | MRI | No reference |
| Han et al [20]m | Development study and internal test | DL | CNN, FCN, SegNet, DeepLabv3+, and U-Net | LFS | 253 | Cross-validation | MRI | Lee et al [5] |
| Huber et al [22]m | Development study and internal test | TML | Decision tree | LCS | 82 | Cross-validation | MRI | Lee et al [9] and Schizas et al [6] |
| Ishimoto et al [23] | Replication study and internal test | DL | CNN (SpineNetk) | LCS | 971 | Hold-out validation | MRI | Lurie et al [8] |
| Won et al [24]m | Development study and internal test | DL | VGG | LCS | 542 | Cross-validation | MRI | Schizas et al [5] |
| Hallinan et al [17]m | Development study, internal test, and external test | DL | CNN | LCS, LRSn, and LFS | 446/100 | Hold-out validation | MRI | Lurie et al [8] and Bartynski and Lin [10] |
| Lehnen et al [25]m | External test | DL | CNN (CoLumbok) | LCS | 146 | N/Ao | MRI | Lee et al [9] |
| Grob et al [30]m | External test | DL | CNN (SpineNetk) | LCS | 882 | N/Ao | MRI | Lurie et al [8] |
| Kim et al [26]m | Development study, internal test, and external test | DL | VGG19, VGG16, ResNet50, and Efficient1 | LCS | 4644/199 | Cross-validation | X-ray | Lee et al [9] |
| Su et al [27]m | Development study, internal test, and external test | DL | ResNet-50 | LCS | 1015/100 | Hold-out validation | MRI | Lee et al [9] and Park et al [11] |
| Altun et al [28]m | Development study and internal test | TML and DL | RF, SVM, VGG16, ResNet, MobileNet, and InceptionNet | LSS | 1030 | Cross-validation | MRI | No reference |
| Bharadwaj et al [29] | Development study and internal test | TML and DL | Decision tree and BiTCNN | LCS and LFS | 200 | Hold-out validation | MRI | Schizas et al [6] and Lee et al [5] |
| Tumko et al [32]m | Development study and external test | DL | RegNetY32GF | LCS, LRS, and LFS | 1635/150 | N/Ao | MRI | Schizas et al [6] |
| Shahzadi et al [31] | Development study and internal test | DL | CNN | LRS and LFS | 515 | Cross-validation | MRI | No reference |
| Li et al [33]m | Development study and internal test | DL | VGG11 and ResNet-18 | LCS and LRS | 236 | Hold-out validation | CT | Lurie et al [8] and Bartynski and Lin [10] |
| Park et al [34]m | Development study, internal test, extra-internal test, and external test | DL | ResNet50, VGG19, VGG16, and EfficientNet-B1 | LCS | 3831/199/100 | Cross-validation (validation) and hold-out validation (internal test) | X-ray | No reference |

aAlgorithms for only classifiers.

bLSS: lumbar spinal stenosis.

cTML: traditional machine learning.

dKNN: k-nearest neighbors.

eSVM: support vector machine.

fLDA: linear discriminant analysis.

gLFS: lumbar foraminal stenosis.

hMRI: magnetic resonance imaging.

iDL: deep learning.

jCNN: convolutional neural network.

kName of software.

lLCS: lumbar central stenosis.

mStudies included in meta-analysis (confusion matrix available or can be reconstructed).

nLRS: lateral recess stenosis.

oNot applicable.

Methodological Quality

Regarding the QUADAS-2 risk of bias assessment (Figure 2 [17,20,22,24-28,30,32-34]), we identified 4 studies with a high risk of bias [26,28,30,33], 3 with an unclear risk of bias [20,25,34], and 5 with a completely low risk of bias [17,22,24,27,32]. In particular, 2 of the included studies reported no details of patient selection [26,28], causing a high risk of bias in that domain. Furthermore, 1 study provided unclear information on how the index test was performed [28], causing an unclear risk of bias, and 1 study used an improper reference standard that was unlikely to correctly classify the target condition [30], causing a high risk of bias in the reference standard domain. Finally, 1 study showed a high risk of bias with regard to flow and timing [33].

Figure 2. Methodological assessment by Quality Assessment of Diagnostic Accuracy Studies-2 (QUADAS-2). (A) The proportion of risk of bias for all domains and proportion of applicability concerns in three domains. (B) Summary of the risk of bias for each study. Green, yellow, and red circles indicate low, unclear, and high risk of bias, respectively [8,20,22,24-28,30,32-34].

Performance of TML and DL Models for LSS

A total of 12 studies with 15,044 patients reported the assessment value of TML or DL models for diagnosing LSS. The pooled sensitivity was 0.84 (95% CI 0.82-0.86; I2=99.06%), and specificity was 0.87 (95% CI 0.84-0.90; I2=98.7%; Figure 3). The diagnostic odds ratio was 36 (95% CI 26-49). The SROC curve (Figure 4) revealed that the area under the curve of TML or DL models for diagnosing LSS was 0.92 (95% CI 0.89-0.94), indicating a high diagnostic value.

Figure 3. Forest plots in sensitivity and specificity of traditional machine learning (TML) or deep learning (DL) models. The pooled sensitivity and specificity were 0.84 (95% CI 0.82-0.86) and 0.87 (95% CI 0.84-0.90), respectively.
Figure 4. The SROC graph for the studies. The AUC of traditional machine learning (TML) or deep learning (DL) models for diagnosing LSS was 0.92 (95% CI 0.89-0.94). SROC: summary receiver operating characteristic; AUC: area under the curve; SENS: sensitivity; SPEC: specificity.

We set the pretest probability of disease to 50%. At this probability, the posttest probability of LSS was 87% when a patient was classified as positive by the TML or DL model and 15% when the model result was negative (Figure 5). The models showed a positive likelihood ratio (LR+) of 6.6 (95% CI 5.1-8.4) and a negative likelihood ratio (LR–) of 0.18 (95% CI 0.16-0.21), respectively (Figure 5). However, the summary point of the TML or DL models on the likelihood ratio plot fell in the right lower quadrant (LR+<10 and LR–>0.1: no exclusion or confirmation), and the individual points were widely scattered (Figure 6). These results indicate that although the TML or DL models achieved generally acceptable performance, they remain insufficient for confirming or excluding LSS, and the current models suffer from considerable performance variation.

Figure 5. Fagan nomogram of traditional machine learning (TML) or deep learning (DL) models for diagnosing lumbar spinal stenosis (LSS). The first column of this nomogram represents the pretest probability, the second column represents the likelihood ratio, and the third shows the posttest probability.
Figure 6. Likelihood ratio (LR) dot plot of traditional machine learning (TML) or deep learning (DL) prediction models. The summary point of TML or DL models was in the right lower quadrant (LR+<10 and LR–>0.1: no exclusion or confirmation). LRN: negative likelihood ratio; LRP: positive likelihood ratio; LUQ: left upper quadrant; RLQ: right lower quadrant; RUQ: right upper quadrant.
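The Fagan nomogram arithmetic behind these posttest probabilities is a simple odds calculation. The sketch below (illustrative Python; the function name is ours) reproduces the 87% and 15% figures from the pooled LR+ of 6.6 and LR– of 0.18 at a 50% pretest probability:

```python
def posttest_probability(pretest, lr):
    """Fagan nomogram arithmetic: convert the pretest probability to odds,
    multiply by the likelihood ratio, and convert back to a probability."""
    odds = pretest / (1 - pretest) * lr
    return odds / (1 + odds)

# Pooled estimates from this meta-analysis at a 50% pretest probability:
# positive model result -> ~0.87; negative model result -> ~0.15
prob_pos = posttest_probability(0.5, 6.6)
prob_neg = posttest_probability(0.5, 0.18)
```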

In total, 4 studies [17,23,29,32] reported the reliability of both observers and TML or DL models, including 3 studies [17,29,32] that directly compared their reliabilities on the same assessment datasets (Table 2).

Table 2. Characteristics of the studies available for the agreement between models and observers and reference standard.
| Study | Participants, n | Agreement assessment strategy | Control group | Model type | Type of classification | LSSa type | Model results | Control group results |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DLb | Binary | LCSc | 0.96 | 0.98/0.98 |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DL | Binary | LRSd | 0.92 | 0.92/0.95 |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DL | Binary | LFSe | 0.89 | 0.94/0.95 |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DL | Multigrading | LCS | 0.82 | 0.89/0.89 |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DL | Multigrading | LRS | 0.72 | 0.71/0.79 |
| Hallinan et al [17] | 446 | Gwet ĸ | 2 radiologists | DL | Multigrading | LFS | 0.75 | 0.80/0.87 |
| Bharadwaj et al [29] | 200 | Cohen ĸ | 2 radiologists | DL | Multigrading | LCS | 0.54 | 0.80/0.86 |
| Bharadwaj et al [29] | 200 | Cohen ĸ | 2 radiologists | TMLf | Multigrading | LCS | 0.80 | 0.80/0.86 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Binary | LCS | 0.431 | Average 0.372 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Binary | LRS | 0.315 | Average 0.323 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Binary | LFS | 0.672 | Average 0.596 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Multigrading | LCS | 0.310 | Average 0.376 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Multigrading | LRS | 0.199 | Average 0.359 |
| Tumko et al [32] | 150 | Cohen ĸ | 7 radiologists | DL | Multigrading | LFS | 0.637 | Average 0.620 |

aLSS: lumbar spinal stenosis.

bDL: deep learning.

cLCS: lumbar central stenosis.

dLRS: lateral recess stenosis.

eLFS: lumbar foraminal stenosis.

fTML: traditional machine learning.

Subgroup Analysis

We conducted subgroup analyses by data partition (internal or external test), model network (TML or DL), and imaging modality (MRI or x-ray) to understand how these 3 factors affected algorithm performance for LSS assessment (Table 3). The internal test group demonstrated lower sensitivity (P<.001) but higher specificity (P<.001) than the external test group. Similarly, the MRI group showed lower sensitivity (P<.001) but higher specificity (P<.001) than the x-ray group. Sensitivity in the DL group reached 0.85, significantly higher than the 0.80 in the TML group (P<.001). Meanwhile, the DL group showed more stable specificity than the TML group (P=.04).

Table 3. Results of subgroup analysis.
| Categories | Studies, n | Sensitivity (95% CIa) | P value (HBGb of sensitivity) | Specificity (95% CI) | P value (HBG of specificity) |
| --- | --- | --- | --- | --- | --- |
| Data partition | | | <.001 | | <.001 |
| Internal test | 35 | 0.83 (0.80-0.86) | | 0.89 (0.85-0.92) | |
| External test | 21 | 0.86 (0.82-0.90) | | 0.85 (0.79-0.91) | |
| Model networks | | | <.001 | | .04 |
| TMLc | 6 | 0.80 (0.72-0.89) | | 0.87 (0.77-0.97) | |
| DLd | 50 | 0.85 (0.82-0.87) | | 0.87 (0.84-0.91) | |
| Image | | | <.001 | | <.001 |
| MRI | 32 | 0.83 (0.79-0.86) | | 0.91 (0.88-0.93) | |
| X-ray | 20 | 0.85 (0.82-0.89) | | 0.77 (0.70-0.84) | |

aCI: confidence interval.

bHBG: heterogeneity between groups.

cTML: traditional machine learning.

dDL: deep learning.


Principal Findings

In recent years, there has been a boom in diagnosing and grading LSS with TML or DL methods. After systematically reviewing the available evidence, we found that all related studies were published after 2016 and have increased annually, suggesting that TML and DL algorithms show promising potential in this field. To the best of our knowledge, this is the first systematic review and meta-analysis to address this issue. Our pooled results showed an overall sensitivity of 0.84 and a specificity of 0.87 for diagnosing LSS by TML or DL models, and the area under the SROC curve was 0.92, indicating a high diagnostic value. Subgroup analysis revealed better diagnostic performance in internal validation than in external validation, while DL algorithms demonstrated higher sensitivity and specificity than TML algorithms. However, 37% of the studies enrolled in the systematic review could not be used in the meta-analysis, which may have caused a discrepancy between the pooled results and reality. Therefore, the results should be interpreted with caution.

An ongoing debate is whether the diagnostic performance of ML or DL algorithms surpasses that of clinicians. High-level evidence shows that the performance of AI diagnostic systems is equivalent to that of health care professionals and that AI-assistance systems improve clinician diagnostic performance [41-43]. In the field of LSS, however, few studies have directly compared radiologists or orthopedic surgeons with ML or DL algorithms on the same dataset. Hallinan et al [17] developed a DL method for diagnosing different types of LSS and compared its sensitivity and specificity with those of 2 independent clinicians (a neuroradiologist and a musculoskeletal radiologist) with less than 10 years of experience. The study revealed that the sensitivity of DL in detecting LSS was generally on par with the clinicians, and even slightly higher for lumbar central stenosis (LCS) and lateral recess stenosis (LRS), although its specificity was lower. This is reasonable because prioritizing sensitivity to reduce false-negative results, on the premise of maximizing accuracy and AUC, may be a beneficial trade-off for clinical demands [44]. Rather than completely replacing clinicians, AI diagnostic systems are more likely to serve as assisted screening tools in areas with poor medical resources and no experts, or to reduce clinician workload and missed diagnoses, with a high-level medical team then reviewing the images marked positive by the automatic diagnosis [45].

Although the general performance of the diagnostic models was satisfactory, it remained insufficient for confirming or excluding LSS according to the summary likelihood ratio plot. In addition, our systematic review and meta-analysis found that ML or DL models generally showed similar, or even slightly lower, sensitivity compared with specificity, especially in the MRI modality. There may be several reasons. First, the complexity and variety of pathological structures in individuals with LSS mean that no broadly accepted quantitative radiologic criterion for diagnosis exists, even in expert evaluation [46], which makes automatic detection on MRI difficult. Furthermore, we cannot exclude an influence of heterogeneity on the results. Developers should consider optimizing models toward higher sensitivity than specificity for diagnosing and grading LSS, which may be more beneficial to the clinical workflow.

Notably, no consensus on reference standards for determining ML or DL performance in diagnosing LSS has been reached to date. The reference standards in almost all included studies were labeled by qualitative or semiquantitative expert evaluation, which suffers from considerable heterogeneity due to differences in the number, specialty, and experience of the experts. Huber et al [22] combined texture analysis and decision trees to detect LSS using the cross-sectional area (CSA) as a quantitative reference standard. However, a CSA of <130 mm2 is not a widely accepted criterion and applies only to LCS, while quantitative radiological criteria remain unavailable for diagnosing LRS or LFS [46,47]. More comprehensive and rigorous reference standards should be developed in future work. In addition, the diagnosis of LSS should combine imaging findings with history and clinical presentation, because LSS is a clinical syndrome and purely radiographic LSS may be symptom-free [2]. However, the diagnostic criteria in all reviewed studies were based solely on radiographic criteria or reports, which means the current TML or DL models were, strictly speaking, developed to diagnose radiographic LSS. This does not mean that radiographic evaluation is without value for LSS: it provides details of the pathological anatomy that guide treatment options and surgical approaches, and potentially imperceptible relationships between radiographic features and clinical LSS may be explored with the help of AI models. Therefore, we suggest labeling data by clinical LSS as the gold standard, on the premise of model interpretability and the elimination of confounding factors. Furthermore, developers can use multiple data types, such as crucial details of the patient’s history, physical examination, and imaging tests, as inputs to build multimodal models that improve the clinical value of AI-based LSS diagnosis and grading [48].

Overall, our meta-analysis revealed better performance for diagnosing LSS with DL than with TML, although this result should be interpreted with caution because of the limited number of TML studies in the meta-analysis. Only 2 studies in the systematic review directly compared the capability of DL and TML models for diagnosing LSS, and they showed contradictory results. Altun et al [28] found that VGG16 and 3 other DL techniques performed better on binary LSS classification than random forests and support vector machines. Conversely, Bharadwaj et al [29] combined segmentation with DL and TML classifiers for multiclass and binary LSS grading, and accuracy, AUC, and reproducibility were all higher in the TML group. The inconsistency may be attributable to the scale of the training data. DL is generally acknowledged as the most outstanding ML technique for automatic medical image analysis [49], but it is more data-dependent than other ML methods because of its more complex architecture [14,15,50]. In particular, DL must be trained on a sufficiently large sample set, especially given the complexity of spinal MRI. In the 2 studies above, the training sample was more than 5 times larger in Altun et al [28] than in Bharadwaj et al [29] (927 vs 170). Hence, we recommend larger training datasets for both TML and DL models to explore their capabilities in diagnosing and grading LSS, reduce overfitting, and improve model performance.

AI diagnostic systems should be clinically oriented rather than technically oriented, which challenges developers to build ML or DL models suited to clinical practice rather than merely technically impressive ones. Although promising results on accuracy metrics (accuracy, sensitivity, etc) have been widely reported, other aspects of great clinical significance, such as reliability, usability, and safety, are rarely assessed. Good agreement between a model and the reference standard can verify the model’s validity and reliability; however, only 3 included studies directly compared the reliability of models and observers [17,29,32], and clinicians generally achieved higher diagnostic consistency than the ML or DL systems. In addition, external validation is a valuable approach for assessing the generalizability of ML or DL algorithms, testing their ability to adapt to differences in data collection, imaging tests, and image processing between the initial settings and replication or real-world settings [45]. Encouragingly, this meta-analysis showed that external validation performance was generally on par with internal validation, with better sensitivity but worse specificity. However, the results may be inconclusive because only 37% of the studies (7/19) in this systematic review tested their models by external validation (separate datasets used for validation only) [17,25-27,30,32,34].

Nevertheless, a gap remains between current ML or DL algorithms for diagnosing and grading LSS and real clinical applications. A recent review highlighted the importance of large-scale, mixed-source datasets, clinician collaboration, and clear statements of data collection to facilitate DL in clinical applications [15]. Furthermore, although several AI models have been developed in this field, only a few software packages, such as SpineNet (University of Oxford) [18,23,30] and CoLumbo (SmartSoft Ltd) [25], have reached public view. We suggest that further exploration of software design may help extend the application of AI diagnostic models.

Limitations

Several limitations exist in this systematic review and meta-analysis. First, most of the enrolled studies were conducted with small sample sizes; only 6 studies (32%) had a sample size >1000 [18,21,26,27,32,34], although large-scale datasets are warranted for both training and validation of AI diagnostic algorithms, especially DL algorithms [14,15,50]. Second, few models underwent external validation to test their reproducibility and generalizability, so the reported performance should be interpreted with caution. Third, only a few studies provided a contingency table, and the incompleteness of the reported performance metrics made it difficult to conduct a comprehensive meta-analysis, a difficulty also noted in a recent systematic review and meta-analysis in the spine field [51]. This may cause a discrepancy between the pooled results and reality. Finally, the risk of bias in this study was assessed with QUADAS-2, which is better suited to traditional diagnostic models [52]; a more specific and practical guideline for diagnostic AI models remains under development [53].
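For context, the metrics pooled in a diagnostic meta-analysis all derive from the standard 2×2 contingency table (true positives TP, false negatives FN, false positives FP, and true negatives TN); when a study reports only summary metrics without the table, these quantities cannot be reconstructed for pooling:

```latex
\[
\text{sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{specificity} = \frac{TN}{TN + FP},
\]
\[
LR^{+} = \frac{\text{sensitivity}}{1 - \text{specificity}}, \qquad
LR^{-} = \frac{1 - \text{sensitivity}}{\text{specificity}}.
\]
```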

Conclusions

This systematic review and meta-analysis shows that despite the generally satisfactory diagnostic performance of artificial intelligence systems for LSS at the experimental stage, none is yet reliable and practical enough for real clinical practice. Further efforts, including optimizing model balance, adopting widely accepted objective reference standards, using multimodal strategies and large datasets for training and testing, performing external validation, and reporting results sufficiently and scientifically, should be made in future studies to bridge the gap between current TML or DL models and real-life clinical applications.

Authors' Contributions

TW, RC, and NF contributed equally to this work. LZ contributed to the conception and design. TW, RC, and NF contributed to the acquisition of data. TW and NF contributed to the analysis and interpretation of data. TW and NF contributed to drafting the article. RC, NF, SY, PD, QW, AW, JL, XK, and WZ contributed to critically revising the article. LZ reviewed the submitted version of the manuscript.

Conflicts of Interest

None declared.

Multimedia Appendix 1

PRISMA-DTA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses extension for diagnostic test accuracy) checklist.

DOCX File , 32 KB

Multimedia Appendix 2

Details of search strategy.

DOCX File , 13 KB

  1. Katz JN, Zimmerman ZE, Mass H, Makhni MC. Diagnosis and management of lumbar spinal stenosis: a review. JAMA. 2022;327(17):1688-1699. [CrossRef] [Medline]
  2. Lurie J, Tomkins-Lane C. Management of lumbar spinal stenosis. BMJ. 2016;352:h6234. [FREE Full text] [CrossRef] [Medline]
  3. Ravindra VM, Senglaub SS, Rattani A, Dewan MC, Härtl R, Bisson E, et al. Degenerative lumbar spine disease: estimating global incidence and worldwide volume. Global Spine J. 2018;8(8):784-794. [FREE Full text] [CrossRef] [Medline]
  4. Kreiner DS, Shaffer WO, Baisden JL, Gilbert TJ, Summers JT, Toton JF, et al. North American Spine Society. An evidence-based clinical guideline for the diagnosis and treatment of degenerative lumbar spinal stenosis (update). Spine J. 2013;13(7):734-743. [CrossRef] [Medline]
  5. Lee S, Lee JW, Yeom JS, Kim KJ, Kim HJ, Chung SK, et al. A practical MRI grading system for lumbar foraminal stenosis. AJR Am J Roentgenol. 2010;194(4):1095-1098. [CrossRef] [Medline]
  6. Schizas C, Theumann N, Burn A, Tansey R, Wardlaw D, Smith FW, et al. Qualitative grading of severity of lumbar spinal stenosis based on the morphology of the dural sac on magnetic resonance images. Spine (Phila Pa 1976). 2010;35(21):1919-1924. [CrossRef] [Medline]
  7. Sigmundsson FG, Kang XP, Jönsson B, Strömqvist B. Correlation between disability and MRI findings in lumbar spinal stenosis: a prospective study of 109 patients operated on by decompression. Acta Orthop. 2011;82(2):204-210. [FREE Full text] [CrossRef] [Medline]
  8. Lurie JD, Tosteson AN, Tosteson TD, Carragee E, Carrino JA, Kaiser J, et al. Reliability of readings of magnetic resonance imaging features of lumbar spinal stenosis. Spine (Phila Pa 1976). 2008;33(14):1605-1610. [FREE Full text] [CrossRef] [Medline]
  9. Lee GY, Lee JW, Choi HS, Oh K, Kang HS. A new grading system of lumbar central canal stenosis on MRI: an easy and reliable method. Skeletal Radiol. 2011;40(8):1033-1039. [CrossRef] [Medline]
  10. Bartynski WS, Lin L. Lumbar root compression in the lateral recess: MR imaging, conventional myelography, and CT myelography comparison with surgical confirmation. AJNR Am J Neuroradiol. 2003;24(3):348-360. [FREE Full text] [Medline]
  11. Park HJ, Kim SS, Lee YJ, Lee SY, Park NH, Choi YJ, et al. Clinical correlation of a new practical MRI method for assessing central lumbar spinal stenosis. Br J Radiol. 2013;86(1025):20120180. [FREE Full text] [CrossRef] [Medline]
  12. Erickson BJ, Korfiatis P, Akkus Z, Kline TL. Machine learning for medical imaging. Radiographics. 2017;37(2):505-515. [FREE Full text] [CrossRef] [Medline]
  13. Helm JM, Swiergosz AM, Haeberle HS, Karnuta JM, Schaffer JL, Krebs VE, et al. Machine learning and artificial intelligence: definitions, applications, and future directions. Curr Rev Musculoskelet Med. 2020;13(1):69-76. [FREE Full text] [CrossRef] [Medline]
  14. Choi RY, Coyner AS, Kalpathy-Cramer J, Chiang MF, Campbell JP. Introduction to machine learning, neural networks, and deep learning. Transl Vis Sci Technol. 2020;9(2):14. [FREE Full text] [CrossRef] [Medline]
  15. Chen X, Wang X, Zhang K, Fung KM, Thai TC, Moore K, et al. Recent advances and clinical applications of deep learning in medical image analysis. Med Image Anal. 2022;79:102444. [FREE Full text] [CrossRef] [Medline]
  16. He X, Yin Y, Sharma M, Brahm G, Mercado A, Li S. Automated diagnosis of neural foraminal stenosis using synchronized superpixels representation. In: Medical Image Computing and Computer-Assisted Intervention-MICCAI. Cham. Springer International Publishing; 2016.
  17. Hallinan JTPD, Zhu L, Yang K, Makmur A, Algazwi DAR, Thian YL, et al. Deep learning model for automated detection and classification of central canal, lateral recess, and neural foraminal stenosis at lumbar spine MRI. Radiology. 2021;300(1):130-138. [CrossRef] [Medline]
  18. Jamaludin A, Kadir T, Zisserman A. SpineNet: automated classification and evidence visualization in spinal MRIs. Med Image Anal. 2017;41:63-73. [CrossRef] [Medline]
  19. Zhang Q, Bhalerao A, Hutchinson C. Weakly-supervised evidence pinpointing and description. In: Information Processing in Medical Imaging. Cham. Springer International Publishing; 2017.
  20. Han Z, Wei B, Mercado A, Leung S, Li S. Spine-GAN: semantic segmentation of multiple spinal structures. Med Image Anal. 2018;50:23-35. [CrossRef] [Medline]
  21. Lu JT, Pedemonte S, Bizzo B, Doyle S, Andriole KP, Michalski MH, et al. Deep spine: automated lumbar vertebral segmentation, disc-level designation, and spinal stenosis grading using deep learning. 2018. Presented at: Proceedings of the 3rd Machine Learning for Healthcare Conference; August 17-18, 2018; Palo Alto, California. URL: https://proceedings.mlr.press/v85/lu18a.html
  22. Huber FA, Stutz S, Vittoria de Martini I, Mannil M, Becker AS, Winklhofer S, et al. Qualitative versus quantitative lumbar spinal stenosis grading by machine learning supported texture analysis-Experience from the LSOS study cohort. Eur J Radiol. 2019;114:45-50. [CrossRef] [Medline]
  23. Ishimoto Y, Jamaludin A, Cooper C, Walker-Bone K, Yamada H, Hashizume H, et al. Could automated machine-learned MRI grading aid epidemiological studies of lumbar spinal stenosis? Validation within the Wakayama Spine Study. BMC Musculoskelet Disord. 2020;21(1):158. [FREE Full text] [CrossRef] [Medline]
  24. Won D, Lee HJ, Lee SJ, Park SH. Spinal stenosis grading in magnetic resonance imaging using deep convolutional neural networks. Spine (Phila Pa 1976). 2020;45(12):804-812. [CrossRef] [Medline]
  25. Lehnen NC, Haase R, Faber J, Rüber T, Vatter H, Radbruch A, et al. Detection of degenerative changes on MR images of the lumbar spine with a convolutional neural network: a feasibility study. Diagnostics (Basel). 2021;11(5):902. [FREE Full text] [CrossRef] [Medline]
  26. Kim T, Kim YG, Park S, Lee JK, Lee CH, Hyun SJ, et al. Diagnostic triage in patients with central lumbar spinal stenosis using a deep learning system of radiographs. J Neurosurg Spine. 2022;37(1):104-111. [CrossRef] [Medline]
  27. Su ZH, Liu J, Yang MS, Chen ZY, You K, Shen J, et al. Automatic grading of disc herniation, central canal stenosis and nerve roots compression in lumbar magnetic resonance image diagnosis. Front Endocrinol (Lausanne). 2022;13:890371. [FREE Full text] [CrossRef] [Medline]
  28. Altun S, Alkan A, Altun İ. LSS-VGG16: diagnosis of lumbar spinal stenosis with deep learning. Clin Spine Surg. 2023;36(5):E180-E190. [CrossRef] [Medline]
  29. Bharadwaj UU, Christine M, Li S, Chou D, Pedoia V, Link TM, et al. Deep learning for automated, interpretable classification of lumbar spinal stenosis and facet arthropathy from axial MRI. Eur Radiol. 2023;33(5):3435-3443. [FREE Full text] [CrossRef] [Medline]
  30. Grob A, Loibl M, Jamaludin A, Winklhofer S, Fairbank JCT, Fekete T, et al. External validation of the deep learning system "SpineNet" for grading radiological features of degeneration on MRIs of the lumbar spine. Eur Spine J. 2022;31(8):2137-2148. [CrossRef] [Medline]
  31. Shahzadi T, Ali MU, Majeed F, Sana MU, Diaz RM, Samad MA, et al. Nerve root compression analysis to find lumbar spine stenosis on MRI using CNN. Diagnostics (Basel). 2023;13(18):2975. [FREE Full text] [CrossRef] [Medline]
  32. Tumko V, Kim J, Uspenskaia N, Honig S, Abel F, Lebl DR, et al. A neural network model for detection and classification of lumbar spinal stenosis on MRI. Eur Spine J. 2024;33(3):941-948. [CrossRef] [Medline]
  33. Li KY, Weng JJ, Li HL, Ye HB, Xiang JW, Tian NF. Development of a deep-learning model for diagnosing lumbar spinal stenosis based on CT images. Spine (Phila Pa 1976). 2024;49(12):884-891. [CrossRef] [Medline]
  34. Park S, Kim JH, Ahn Y, Lee CH, Kim YG, Yuh WT, et al. Multi-pose-based convolutional neural network model for diagnosis of patients with central lumbar spinal stenosis. Sci Rep. 2024;14(1):203. [FREE Full text] [CrossRef] [Medline]
  35. Page MJ, Moher D, Bossuyt PM, Boutron I, Hoffmann TC, Mulrow CD, et al. PRISMA 2020 explanation and elaboration: updated guidance and exemplars for reporting systematic reviews. BMJ. 2021;372:n160. [FREE Full text] [CrossRef] [Medline]
  36. Haddaway NR, Page MJ, Pritchard CC, McGuinness LA. PRISMA2020: An R package and shiny app for producing PRISMA 2020-compliant flow diagrams, with interactivity for optimised digital transparency and open synthesis. Campbell Syst Rev. 2022;18(2):e1230. [FREE Full text] [CrossRef] [Medline]
  37. McInnes MDF, Moher D, Thombs BD, McGrath TA, Bossuyt PM, the PRISMA-DTA Group, et al. Preferred reporting items for a systematic review and meta-analysis of diagnostic test accuracy studies: the PRISMA-DTA statement. JAMA. 2018;319(4):388-396. [CrossRef] [Medline]
  38. Gusenbauer M, Haddaway NR. Which academic search systems are suitable for systematic reviews or meta-analyses? Evaluating retrieval qualities of Google Scholar, PubMed, and 26 other resources. Res Synth Methods. 2020;11(2):181-217. [FREE Full text] [CrossRef] [Medline]
  39. Whiting PF, Rutjes AWS, Westwood ME, Mallett S, Deeks JJ, Reitsma JB, et al. QUADAS-2 Group. QUADAS-2: a revised tool for the quality assessment of diagnostic accuracy studies. Ann Intern Med. 2011;155(8):529-536. [FREE Full text] [CrossRef] [Medline]
  40. Nyaga VN, Arbyn M, Aerts M. Metaprop: a stata command to perform meta-analysis of binomial data. Arch Public Health. 2014;72(1):39. [FREE Full text] [CrossRef] [Medline]
  41. Liu X, Faes L, Kale AU, Wagner SK, Fu DJ, Bruynseels A, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health. 2019;1(6):e271-e297. [FREE Full text] [CrossRef] [Medline]
  42. Vasey B, Ursprung S, Beddoe B, Taylor EH, Marlow N, Bilbro N, et al. Association of clinician diagnostic performance with machine learning-based decision support systems: a systematic review. JAMA Netw Open. 2021;4(3):e211276. [FREE Full text] [CrossRef] [Medline]
  43. Xue P, Si M, Qin D, Wei B, Seery S, Ye Z, et al. Unassisted clinicians versus deep learning-assisted clinicians in image-based cancer diagnostics: systematic review with meta-analysis. J Med Internet Res. 2023;25:e43832. [FREE Full text] [CrossRef] [Medline]
  44. Leeflang MMG, Deeks JJ, Gatsonis C, Bossuyt PMM, Cochrane Diagnostic Test Accuracy Working Group. Systematic reviews of diagnostic test accuracy. Ann Intern Med. 2008;149(12):889-897. [FREE Full text] [CrossRef] [Medline]
  45. Lex JR, Di Michele J, Koucheki R, Pincus D, Whyne C, Ravi B. Artificial intelligence for hip fracture detection and outcome prediction: a systematic review and meta-analysis. JAMA Netw Open. 2023;6(3):e233391. [FREE Full text] [CrossRef] [Medline]
  46. Mamisch N, Brumann M, Hodler J, Held U, Brunner F, Steurer J, et al. Lumbar Spinal Stenosis Outcome Study Working Group Zurich. Radiologic criteria for the diagnosis of spinal stenosis: results of a delphi survey. Radiology. 2012;264(1):174-179. [CrossRef] [Medline]
  47. Andreisek G, Deyo RA, Jarvik JG, Porchet F, Winklhofer SFX, Steurer J, et al. LSOS working group. Consensus conference on core radiological parameters to describe lumbar stenosis - an initiative for structured reporting. Eur Radiol. 2014;24(12):3224-3232. [FREE Full text] [CrossRef] [Medline]
  48. Ramachandram D, Taylor GW. Deep multimodal learning: a survey on recent advances and trends. IEEE Signal Process. Mag. 2017;34(6):96-108. [CrossRef]
  49. Wu JH, Liu TYA, Hsu WT, Ho JHC, Lee CC. Performance and limitation of machine learning algorithms for diabetic retinopathy screening: meta-analysis. J Med Internet Res. 2021;23(7):e23863. [FREE Full text] [CrossRef] [Medline]
  50. Chan HP, Samala RK, Hadjiiski LM, Zhou C. Deep learning in medical image analysis. Adv Exp Med Biol. 2020;1213:3-21. [FREE Full text] [CrossRef] [Medline]
  51. Compte R, Granville Smith I, Isaac A, Danckert N, McSweeney T, Liantis P, et al. Are current machine learning applications comparable to radiologist classification of degenerate and herniated discs and Modic change? A systematic review and meta-analysis. Eur Spine J. 2023;32(11):3764-3787. [FREE Full text] [CrossRef] [Medline]
  52. Zhang Z, Yang L, Han W, Wu Y, Zhang L, Gao C, et al. Machine learning prediction models for gestational diabetes mellitus: meta-analysis. J Med Internet Res. 2022;24(3):e26634. [FREE Full text] [CrossRef] [Medline]
  53. Collins GS, Dhiman P, Andaur Navarro CL, Ma J, Hooft L, Reitsma JB, et al. Protocol for development of a reporting guideline (TRIPOD-AI) and risk of bias tool (PROBAST-AI) for diagnostic and prognostic prediction model studies based on artificial intelligence. BMJ Open. 2021;11(7):e048008. [FREE Full text] [CrossRef] [Medline]


AI: artificial intelligence
CSA: cross-sectional area
CT: computed tomography
DL: deep learning
LCS: lumbar central stenosis
LFS: lumbar foraminal stenosis
LR: likelihood ratio
LRS: lateral recess stenosis
LSS: lumbar spinal stenosis
MeSH: Medical Subject Headings
ML: machine learning
MRI: magnetic resonance imaging
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
QUADAS-2: Quality Assessment of Diagnostic Accuracy Studies 2
SROC: summary receiver operating characteristic
TML: traditional machine learning


Edited by Q Jin; submitted 18.11.23; peer-reviewed by Z Li, Y Wang, S Mao, A Claus; comments to author 31.05.24; revised version received 23.10.24; accepted 11.11.24; published 23.12.24.

Copyright

©Tianyi Wang, Ruiyuan Chen, Ning Fan, Lei Zang, Shuo Yuan, Peng Du, Qichao Wu, Aobo Wang, Jian Li, Xiaochuan Kong, Wenyi Zhu. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 23.12.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.