Comparison of Severity of Illness Scores and Artificial Intelligence Models That Are Predictive of Intensive Care Unit Mortality: Meta-analysis and Review of the Literature

Background: Severity of illness scores (Acute Physiology and Chronic Health Evaluation, Simplified Acute Physiology Score, and Sequential Organ Failure Assessment) are current risk stratification and mortality prediction tools used in intensive care units (ICUs) worldwide. Developers of artificial intelligence or machine learning (ML) models predictive of ICU mortality use severity of illness scores as a reference point when reporting the performance of these computational constructs.

Objective: This study aimed to perform a literature review and meta-analysis of articles that compared binary classification ML models with the severity of illness scores that predict ICU mortality and to determine which models have superior performance. This review intends to provide actionable guidance to clinicians on the performance and validity of ML models in supporting clinical decision-making compared with the severity of illness score models.

Methods: Between December 15 and 18, 2020, we conducted a systematic search of the PubMed, Scopus, Embase, and IEEE databases and reviewed studies published between 2000 and 2020 that compared the performance of binary ML models predictive of ICU mortality with the performance of severity of illness score models on the same data sets. We assessed the studies' characteristics, synthesized the results, meta-analyzed the discriminative performance of the ML and severity of illness score models, and performed tests of heterogeneity within and among studies.

Results: We screened 461 abstracts and assessed the full text of 66 (14.3%) articles. We included in the review 20 (4.3%) studies that developed 47 ML models based on 7 types of algorithms and compared them with 3 types of severity of illness score models. Of the 20 studies, 4 (20%) had a low risk of bias and applicability in model development, 7 (35%) performed external validation, 9 (45%) reported on calibration, 12 (60%) reported on classification measures, and 4 (20%) addressed explainability. The discriminative performance, reported as the area under the receiver operating characteristic curve (AUROC), ranged between 0.728 and 0.99 for the ML-based models and between 0.58 and 0.86 for the severity of illness score-based models. We noted substantial heterogeneity among the reported models and considerable variation among the AUROC estimates for both model types.

Conclusions: ML-based models can accurately predict ICU mortality as an alternative to traditional scoring models. Although the range of performance of the ML models is superior to that of the severity of illness score models, the results cannot be generalized because of the high degree of heterogeneity. When presented with the option of choosing between severity of illness score and ML models for decision support, clinicians should select models that have been externally validated, tested in the practice environment, and updated to the patient population and practice environment.

Trial Registration: PROSPERO CRD42021203871; https://tinyurl.com/28v2nch8


Background
In the United States, intensive care unit (ICU) care costs account for 1% of the US gross domestic product, underscoring the need to optimize ICU use to attenuate the continued increase in health care expenditures [1]. Models that characterize the severity of illness of patients who are critically ill by predicting complications and ICU mortality risk can guide organizational resource management and planning, implementation and support of critical clinical protocols, and benchmarking and are proxies for resource allocation and clinical performance [2]. Although the medical community values the information provided by such models, they are not consistently used in practice because of their complexity, marginal predictive capacity, and limited internal or external validation [2-5].
Severity of illness score models require periodic updates and customizations to reflect changes in medical care and regional case pathology [6]. Scoring models are prone to high interrater variability, are less accurate for patients with higher severity of illness or for specific clinical subgroups, are not designed for repeated application, and cannot represent trends in patients' status [7]. The Acute Physiology and Chronic Health Evaluation II (APACHE-II) and the Simplified Acute Physiology Score (SAPS), developed in the 1980s, are still in use [8]. The underlying algorithms for APACHE-IV are in the public domain and available at no cost; however, their use is time intensive and is facilitated by software that requires payments for licensing, implementation, and maintenance [9]. Compared with SAPS-III, which uses data obtained exclusively within the first hour of ICU admission [10], APACHE-IV uses data from the first day (24 hours) [11]. Although the Sequential Organ Failure Assessment (SOFA) is an organ dysfunction score that detects differences in the severity of illness and is not designed to predict mortality, it is currently used to estimate mortality risk based on the mean, highest, and time changes accrued in the score during the ICU stay [11].
The availability of machine-readable data from electronic health records enables the analysis of large volumes of medical data using machine learning (ML) methods. ML algorithms enable the exploration of high-dimensional data and the extraction of features to develop models that solve classification or regression problems. These algorithms can fit linear and nonlinear associations and interactions between predictive variables and relate all or some of the predictive variables to an outcome. The increased flexibility of ML models comes with the risk of overfitting the training data; therefore, model testing on external data is essential to ensure adequate performance on previously unseen data. In model development, the balance between the model's accuracy and generalizability, or bias and variance, is achieved through model training on a training set and hyperparameter optimization on a tuning set. Once a few models have been trained, they can be internally validated on a split-sample data set or cross-validated; the chosen candidate model is then validated on an unseen test data set to calculate its performance metrics and out-of-sample error [12]. The choice of algorithm is critical for balancing interpretability, accuracy, and susceptibility to bias and variance [13]. Compared with the severity of illness scores, ML models can incorporate large numbers of covariates and temporal data, nonlinear predictors, trends in measured variables, and complex interactions between variables [14]. Numerous ML algorithms have been integrated into ICU predictive models, such as artificial neural networks (NNs), deep reinforcement learning, support vector machines (SVMs), random forest models, genetic algorithms, clinical trajectory models, gradient boosting models, k-nearest neighbors, naive Bayes, and ensemble approaches [15]. Despite the rapidly growing interest in using ML methods to support clinical care, modeling processes and data sources have been inadequately described [16,17]. Consequently, the ability to validate and generalize the current literature's results is questionable.
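The split-sample workflow described above (training, hyperparameter tuning, and a held-out test set touched only once) can be sketched as follows. This is our own illustration on synthetic data, not code from any reviewed study; the logistic regression fit, the rank-based AUROC estimate, and the two candidate learning rates are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for an ICU cohort (hypothetical data): 1,000 stays,
# 5 predictors, binary mortality label driven by the first two predictors.
X = rng.normal(size=(1000, 5))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

# Split-sample design: training, tuning (validation), and held-out test sets.
idx = rng.permutation(len(y))
train, tune, test = idx[:600], idx[600:800], idx[800:]

def fit_logreg(X, y, lr, steps=500):
    """Minimal logistic regression fit by gradient descent."""
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - y) / len(y)
    return w

def auroc(scores, labels):
    """Rank-based (Mann-Whitney) estimate of the AUROC."""
    order = scores.argsort()
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n1, n0 = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n1 * (n1 + 1) / 2) / (n1 * n0)

# Hyperparameter (learning rate) chosen on the tuning set, never on the test set.
candidates = {lr: fit_logreg(X[train], y[train], lr) for lr in (0.01, 0.5)}
best = max(candidates, key=lambda lr: auroc(X[tune] @ candidates[lr], y[tune]))
test_auc = auroc(X[test] @ candidates[best], y[test])  # out-of-sample estimate
```

The key design point is that the tuning set selects the hyperparameter and the test set is used only once, so `test_auc` estimates out-of-sample discrimination rather than training fit.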

Objectives
This study aims to systematically review and meta-analyze studies that compare binary classification ML models with the severity of illness scores for predicting ICU mortality and to determine which models have superior performance. This review intends to provide actionable guidance to clinicians on the prognostic value and performance of ML models compared with the severity of illness scores in supporting clinical decision-making, in the context of the current guidelines [18] and recommendations for reporting ML analysis in clinical research [19] (Table 1).

Methods
We conducted a systematic review of the relevant literature. The research methods and reporting followed the PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) 2020 statement and guide to review and meta-analysis of prediction models [20,21].

Information Sources and Search Strategy
Between December 15 and 18, 2020, we performed a comprehensive search of the bibliographic databases PubMed, Scopus, Embase, and IEEE for literature published between December 2000 and December 15, 2020. These databases were available free of charge from the university library. We selected PubMed for its significance in biomedical electronic research; Scopus for its wide journal range, keyword search, and citation analysis; Embase for its European Union literature coverage; and IEEE Xplore for its access to engineering and computer science literature.
The search terms included controlled vocabulary terms (Medical Subject Headings and Emtree) and free-text terms. The filters applied during the search of all 4 databases were Humans and Age: Adult. A search of the PubMed database using the terms (AI artificial intelligence) OR (machine learning) AND (intensive care unit) AND (mortality) identified 125 articles. A search of the Scopus database using the terms KEY (machine learning) OR KEY (artificial-intelligence) AND KEY (intensive care unit) AND KEY (mortality) revealed 182 articles. Queries of the Embase database using the terms (AI Artificial Intelligence) OR (machine learning) AND (intensive care unit) AND (mortality) returned 103 articles. A search of the IEEE database using the terms (machine learning) OR (artificial intelligence) AND (intensive care unit) AND (mortality) produced 51 citations.
Two authors (CB and AT) screened titles and abstracts and recorded the reasons for exclusion. The same authors independently reviewed the selected full-text articles to determine their eligibility for quantitative and qualitative assessment. Both authors revisited discrepancies to guarantee database accuracy and checked the references of the identified articles for additional papers. A third researcher (LNM) was available to resolve any disagreements.

Eligibility Criteria and Study Selection
We included studies that compared the predictive performance of newly developed ML classification models predictive of ICU mortality with that of severity of illness score models on the same data sets in the adult population. To be included in the review, the studies had to provide information on the patient cohort, model development and validation, and performance metrics. Both prospective and retrospective studies were eligible for inclusion.

Data Collection Process
Data extraction was performed by CB, reviewed by AT, and guided by the CHARMS (Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies) checklist [22], which is specifically designed for systematic reviews of prognostic prediction models. The methodological quality of the included studies was appraised with guidance from the Prediction model Risk of Bias (ROB) Assessment Tool (PROBAST) [23]. The reported features of the ML models are shown in Table 2.

Assessment of the ROB and Quality of Reviewed Studies
The reviewers used the PROBAST tool to assess the methodological quality of each study for ROB and concerns regarding applicability in 4 domains: study participants, predictors, outcome, and analysis [23]. The reviewers evaluated the applicability of the selected studies by assessing the extent to which the studied outcomes matched the goals of this review in the 4 domains. We evaluated the ROB by assessing the primary study design and conduct, the predictor selection process, the outcome definition, and the performance analysis. The ROB in reporting the models' performance was appraised by exploring the reported measures of calibration (the model's predicted risk of mortality vs the observed risk), discrimination (the model's ability to discriminate between patients who survived and those who died), classification (sensitivity and specificity), and reclassification (net reclassification index). The performance of the models on internal data sets not used for model development (internal validation) and on data sets originating from an external patient population (external validation) was weighted in the ROB assignment. The ROB and applicability were rated as low risk, high risk, or unclear risk according to the PROBAST recommendations [42].

Meta-analysis and Performance Metrics
The C statistic, or area under the receiver operating characteristic curve (AUROC), is the most commonly reported estimate of discriminative performance for binary outcomes [43-46] and the pragmatic performance measure of ML and severity of illness score models previously used in the medical literature to compare models based on different computational methods [21,45-47]. It is generally interpreted as follows: an AUROC of 0.5 suggests no discrimination, 0.7 to 0.8 is considered acceptable performance, 0.8 to 0.9 is considered excellent performance, and >0.9 is considered outstanding performance [48]. We included the performance of models developed using similar algorithms in forest plots and performed heterogeneity diagnostics and investigations without calculating a pooled estimate [49]. The results were pooled only for studies that followed a consistent methodology that included the external validation or benchmarking of the models. Random-effects meta-analyses computed the pooled AUROC for subgroups of ML algorithms (NNs and ensemble models) and subgroups of scoring models (SAPS-II, APACHE-II, and SOFA). The AUROC for each model type was weighted using the inverse of its variance. Pooled AUROC estimates for each model type were meta-analyzed along with the 95% CIs of the estimates and were reported in forest plots together with the associated heterogeneity statistics (I², τ², and Cochran Q).
The Cochran Q statistic (a chi-square test) assesses whether the variation among the study estimates exceeds what would be expected from sampling error alone, τ² estimates the between-study variance, and I² represents the percentage of variability in the AUROC estimates that is not caused by sampling error [36]. The Cochran Q P value is denoted as P. Meta-analyses were conducted in R (version 3.6.1) [37] (see Multimedia Appendix 1 for scripts).
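As a concrete illustration of the inverse-variance weighting and heterogeneity statistics described above, the following sketch implements a DerSimonian-Laird random-effects pooling step. The authors conducted their analyses in R; this Python version, with made-up AUROC values and variances, is ours and only illustrates the arithmetic behind the pooled estimate, Cochran Q, τ², and I².

```python
import math

def random_effects_pool(estimates, variances):
    """DerSimonian-Laird random-effects pooling of per-study estimates (e.g., AUROCs)."""
    w = [1.0 / v for v in variances]            # fixed-effect inverse-variance weights
    fixed = sum(wi * yi for wi, yi in zip(w, estimates)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, estimates))  # Cochran Q
    k = len(estimates)
    c = sum(w) - sum(wi * wi for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)          # between-study variance
    i2 = max(0.0, (q - (k - 1)) / q) * 100 if q > 0 else 0.0  # % beyond sampling error
    w_re = [1.0 / (v + tau2) for v in variances]              # random-effects weights
    pooled = sum(wi * yi for wi, yi in zip(w_re, estimates)) / sum(w_re)
    se = math.sqrt(1.0 / sum(w_re))
    return pooled, (pooled - 1.96 * se, pooled + 1.96 * se), q, tau2, i2

# Hypothetical AUROCs from 3 studies with equal variances; the spread drives I2 toward 97%.
pooled, ci, q, tau2, i2 = random_effects_pool([0.95, 0.72, 0.85], [0.0004, 0.0004, 0.0004])
```

With heterogeneous estimates like these, τ² dominates the within-study variances, the random-effects weights become nearly equal, and I² approaches the >97% values observed in the review, which is precisely why pooling was often inappropriate.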

Selection Process
Of the 461 screened abstracts, we excluded 372 (80.7%) for lack of relevance (models not developed to predict ICU mortality), 9 (2%) as duplicates, 6 (1.3%) as reviews, and 8 (1.7%) as conference proceedings (not intended for clinical application). We assessed the full text of the remaining 66 articles; the most common performance measure reported, which allowed comparison among all models and a meta-analysis, was the C statistic (AUROC). Of the 66 articles, we excluded 12 (18%) because of limited information on model development, 22 (33%) because of a lack of comparison with clinical scoring models, and 12 (18%) because the AUROC was not reported. The search strategy and selection process are illustrated in Figure 1.

Assessment of the Prediction Model Development
The 20 studies reported 47 ML models developed based on 7 types of algorithms and compared them with 3 severity of illness score models. All ML models were developed through retrospective analyses of ICU data sets. Of the 20 studies, 10 (50%) used data from the publicly available Medical Information Mart for Intensive Care database (Beth Israel Deaconess Medical Center in the United States) at different stages of its expansion, and 10 (50%) used national health care databases (Danish, Australia-New Zealand, United Kingdom, and Sweden) or ICU-linked databases (Korea, India, and the United Kingdom). One study included data from >80 ICUs belonging to >40 hospitals [33], and one study's ICU-linked database collected data from 9 European countries [37]. The cohorts generating the data sets used for model development and internal testing ranged from 1571 to 217,289 patients, with a median of 15,789 patients. Of the 20 studies, 10 (50%) used data from patients admitted to general ICUs, whereas 10 (50%) used data from patients who were critically ill with specific pathologies: gastrointestinal bleeds [39], COVID-19 and pneumonia-associated respiratory failure [40], postcardiac arrest [28], postcardiac surgery [29,36], acute renal insufficiency [30,32], sepsis [35,41], or neurological pathology [25]. The lower age thresholds for study inclusion were 12 years [25], 15 years [26,27], 16 years [33,35,38], 18 years [24,29,40], and 19 years [30]. Within the studied cohorts, mortality ranged from 0.08 to 0.5 [29,32,36].
The processes and tools used for the selection of predictive variables were described in 65% (13/20) of the studies and included the least absolute shrinkage and selection operator, stochastic gradient boosting [33,35], genetic algorithms, and particle swarm optimization [33]. Approximately 15% (3/20) of the studies [25,26,35] reported multiple models developed on different sets of predictor variables, which were subsequently tested for the best performance, validation, and calibration. The number of predictive variables used in the final models varied between 1 and 80, with a median of 21. The most common predictive variables are shown in Figure 2, grouped by their frequency of occurrence in the studies. All studies developed models on data from the first 24 hours; in addition, ML models were developed on the first hour of ICU data [34]; the first 48 hours of data [27,38,41]; 3-day data [40]; 5-day data [7]; 10-day data [26]; or on patients' prior medical history collected over 1 month, 3 months, 6 months, 1 year, 2.5 years, 5 years, 7.5 years, 10 years, and 23 years [24]. The frequency of data collection ranged from every 30 minutes [29], 1 hour [1,25,27,37], 3 hours, 6 hours, 12 hours, 15 hours [38], and 24 hours [7,36] to every 27 hours, 51 hours, and 75 hours [40].
The data were normalized using the minimum-maximum normalization technique. The prediction window for hospital mortality was undefined in 45% (9/20) of the studies and varied from 2 or 3 days to 28 days, 30 days [26], 90 days [24], and up to 1 year [24] in the others.
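The minimum-maximum normalization mentioned above maps each predictor to the [0, 1] range. A minimal sketch follows (our own illustration, with made-up vital-sign values; the `eps` guard for constant columns is an assumption of ours, not something the reviewed studies describe):

```python
import numpy as np

def min_max_normalize(X, eps=1e-12):
    """Scale each predictor column to [0, 1]; eps guards against constant columns."""
    X = np.asarray(X, dtype=float)
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.maximum(hi - lo, eps)

# Example: heart rate and lactate columns on very different raw scales.
vitals = np.array([[60.0, 1.0], [100.0, 2.5], [140.0, 9.0]])
scaled = min_max_normalize(vitals)
```

After scaling, both columns span exactly [0, 1], so predictors with large raw magnitudes no longer dominate distance- or gradient-based learners.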
There was a wide range in the prevalence of mortality among studies (0.08-0.56), creating a class imbalance in the data sets.
In studies with a low prevalence of the investigated outcome (mortality), few researchers addressed the problem of class imbalance (survivors vs nonsurvivors), doing so through balanced training [24,37], random resampling [29], undersampling [36], or class penalty and reweighting schemes [38]. A breakdown of the model characteristics is presented in Table 3.
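Of the strategies listed above, random undersampling of the majority class is the simplest to sketch. The following is an illustrative implementation on hypothetical data (a cohort with 10% mortality), not code from any reviewed study:

```python
import numpy as np

def undersample(X, y, seed=0):
    """Randomly drop majority-class rows until both classes are the same size."""
    rng = np.random.default_rng(seed)
    pos, neg = np.flatnonzero(y == 1), np.flatnonzero(y == 0)
    minority, majority = (pos, neg) if len(pos) < len(neg) else (neg, pos)
    keep = rng.choice(majority, size=len(minority), replace=False)
    idx = rng.permutation(np.concatenate([minority, keep]))
    return X[idx], y[idx]

# Hypothetical cohort: 100 stays, 10% mortality (class imbalance as in the reviewed studies).
y = np.array([1] * 10 + [0] * 90)
X = np.arange(100).reshape(100, 1)
Xb, yb = undersample(X, y)
```

The trade-off is that undersampling discards majority-class information; class penalty or reweighting schemes instead keep all rows and scale each class's contribution to the training loss.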

[Table 4 excerpt: reported performance measures by study. Holmgren et al [34]: calibration plot. El-Rashidy et al [36] (Ensemble): reported values include 0.944, 0.937, 0.911, and 0.94. Silva et al [37] (NN): reported values include 0.7921, 0.78, and 0.79. Ryan et al [40] (XGB): reported values include 0.75, 0.378, and 0.801. Column headers are not recoverable from the extraction.]

Principal Findings
This is the first study to critically appraise the literature comparing ML and severity of illness score models for predicting ICU mortality. In the reviewed articles, the AUROC of the ML models demonstrated very good discrimination, and the range of the ML model AUROCs was superior to that of the severity of illness score AUROCs. The meta-analysis demonstrated a high degree of heterogeneity and variability within and among studies; therefore, the AUROC performances of the ML and severity of illness score models cannot be pooled, and the results cannot be generalized. Every I² value was >97.7%, and most of the 95% CIs of the AUROC estimates from the various studies did not overlap within the forest plots, suggesting considerable variation among the AUROC estimates for each model type. The CIs for the AUROC and the statistical significance of the difference in model performance were inconsistently reported within studies. The high heterogeneity arose from the diverse study populations and practice locations, ages of inclusion, primary pathologies, medical management leading to ICU admission, and prediction time windows. The heterogeneous data management (granularity, frequency of data input, number of predictive variables, prediction time frame, time series analysis, and training set imbalance) affected model development and may have resulted in bias, primarily in studies where it was not addressed (Table 2). Generally, the authors reported ML algorithms with predictive power superior to that of the clinical scoring systems (Table 3); the number of ML models with inferior performance that were not reported is unknown, which raises the concern of reporting bias. The classification measures of performance were inconsistently reported and required a predefined probability threshold; therefore, models showed different sensitivity and specificity based on the chosen threshold. The variation in the prevalence of the studied outcome secondary to imbalanced data sets makes the interpretation of accuracy difficult. The models' calibration cannot be interpreted because of limited reporting. The external validation necessary to establish generalization was lacking in 65% (13/20) of the studies (Table 2). The limited and variable performance metrics reported preclude a comprehensive comparison of model performance among studies. The decision curve analysis and model interpretability (explainability) that are necessary to promote transparency and understanding of the model's predictive reasoning were addressed in 25% (5/20) of the studies. Results on the clinical performance of ML mortality prediction models as alternatives to the severity of illness scores are scarce.
The reviewed studies inconsistently and incompletely captured the descriptive characteristics and other method parameters for ML-based predictive model development.Therefore, we cannot fully assess the superiority or inferiority of ML-based ICU mortality prediction compared with traditional models; however, we recognize the advantage that flexibility in model design offers in the ICU setting.

Study Limitations
This review included studies that were retrospective analyses of data sets with known outcome distributions and that incorporated the results of interventions. It is unclear which models were developed exclusively for research purposes and therefore were never validated. We evaluated studies that compared ML-based mortality prediction models with severity of illness score-based models, although these models relied on different statistical development methods, variable collection times, and outcome measurement methodologies (SOFA).
The comparison between the artificial intelligence (AI) and severity of illness score models relies only on AUROC values because measures of calibration, discrimination, and classification were not uniformly reported. The random-effects meta-analysis was limited to externally validated models. Owing to the level of heterogeneity, the performance results for most AI and severity of illness score models could not be pooled. The authors recognize that 25% (5/20) of the articles were published between 2004 and 2015, before the TRIPOD (Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis) recommendations for model development and reporting [18]; thus, they were not aligned with those guidelines.
The reviewers assessed the models' ROB and applicability and were aware of the risk of reporting and publication bias favoring the ML models.However, the high heterogeneity among studies prevents an unambiguous interpretation of the funnel plot.

Conclusions and Recommendations
The results of our analysis show that the reporting methodology is incomplete, nonadherent to the current recommendations, and consistent with previous observations [16,50]. The lack of consistent reporting of measures of calibration (Brier score and calibration or reliability curve), discrimination, and classification of the probabilistic estimates on external data makes comparing the effectiveness of risk prediction models challenging, as has been noted by other authors [43].
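For concreteness, the calibration measures named above can be computed as follows. This is a generic sketch (our own, with hypothetical probabilities), not the authors' code: the Brier score is the mean squared error of the predicted probabilities, and the reliability curve bins predictions and compares mean predicted risk with observed mortality per bin.

```python
import numpy as np

def brier_score(p, y):
    """Mean squared difference between predicted probability and observed outcome."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    return float(np.mean((p - y) ** 2))

def reliability_curve(p, y, bins=10):
    """(Mean predicted, observed) mortality per probability bin: calibration plot data."""
    p, y = np.asarray(p, float), np.asarray(y, float)
    edges = np.linspace(0.0, 1.0, bins + 1)
    which = np.clip(np.digitize(p, edges) - 1, 0, bins - 1)
    return [(p[which == b].mean(), y[which == b].mean())
            for b in range(bins) if (which == b).any()]
```

A perfectly calibrated model has a Brier score bounded by the outcome variance and a reliability curve lying on the diagonal; a model can discriminate well (high AUROC) yet calibrate poorly, which is why the review treats the two measures separately.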
Predictive models of mortality can substantially increase patient safety; by incorporating subtle changes in organ functions that affect outcomes, these models support the early recognition and diagnosis of patients who are deteriorating, thus providing clinicians with additional time to intervene. The heterogeneity of the classification models revealed in detail in this review underlines the importance of recognizing the models' capacity for temporal and geographical generalization or proper adaptation to previously unseen data [51]. These concepts apply to both model types; similar to the ML models, severity of illness scores require periodic updates and customizations to reflect changes in medical care and regional case pathology over time [6].
Our findings lead to the following recommendations for model developers:
1. State whether the developed ML models are intended for clinical practice.
2. If the models are intended for clinical applications, provide full transparency of the clinical setting from which the data are acquired and of all the model development steps; validate the models externally to ensure generalizability.
3. If the models are intended for clinical practice, report the models' performance metrics, including measures of discrimination, calibration, and classification, and attach explainer models to facilitate interpretability.

Before using ML and/or severity of illness score models as decision support systems to guide clinical practice, we make the following recommendations for clinicians:
1. Be cognizant of the similarities or discrepancies between the cohort used for model development and the local practice population, the practice setting, the model's ability to function prospectively, and the model's lead times.
2. Acquire knowledge of the model's performance during testing in the local practice.
3. Ensure that the model is periodically updated to reflect changes in patient characteristics and/or clinical variables and is adjusted to new clinical practices and therapeutics.
4. Confirm that the model's data are monitored and validated and that the model's performance is periodically updated.
5. When both severity of illness score and ML models are available, determine one model's superiority and clinical reliability versus the other through randomized controlled trials.
6. When ML models guide clinical practice, ensure that the model makes the correct recommendation for the right reasons, and consult the explainer model.
7. Identify clinical performance metrics that evaluate the impact of the AI tool on the quality of care, efficiency, productivity, and patient outcomes and account for variability in practice.

AI developers must search for, and clinicians must be cognizant of, the unintended consequences of AI tools; both must understand human-AI tool interactions. Health care organization administrators must be aware of the safety, privacy, causality, and ethical challenges of adopting AI tools and recognize the Food and Drug Administration guiding principles for AI/ML development [52].

Figure 1. Search strategy and selection process. AUROC: area under the receiver operating characteristic curve; ICU: intensive care unit.

Figure 3. Meta-analysis results: pooled AUROC for externally validated Ensemble models. Gray boxes represent the fixed weight estimates of the AUROC value from each study; larger gray boxes represent larger fixed weight estimates. The horizontal line through each gray box illustrates the 95% CI of the AUROC value from that study; black horizontal lines indicate that the CI limits exceed the length of the gray box, whereas white horizontal lines represent CI limits that are within the length of the gray box. The vertical dashed lines are the estimated pooled effect of the AUROC value from the random-effects meta-analysis, and the gray diamonds illustrate the 95% CIs for the random pooled effects. Tests of heterogeneity included I², τ², and the Cochran Q P value (denoted as P). AUROC: area under the receiver operating characteristic curve.

Figure 4. Meta-analysis results: pooled AUROC for externally validated NN models. Forest plot conventions and heterogeneity statistics are as in Figure 3. AUROC: area under the receiver operating characteristic curve; NN: neural network.

Figure 5. Meta-analysis results: pooled AUROC for SAPS-II. Forest plot conventions and heterogeneity statistics are as in Figure 3. AUROC: area under the receiver operating characteristic curve; SAPS-II: Simplified Acute Physiology Score II.

Figure 6. Meta-analysis results: pooled AUROC for SOFA. Forest plot conventions and heterogeneity statistics are as in Figure 3. AUROC: area under the receiver operating characteristic curve; SOFA: Sequential Organ Failure Assessment.

Figure 7. Meta-analysis results: pooled AUROC for APACHE-II. Forest plot conventions and heterogeneity statistics are as in Figure 3. APACHE-II: Acute Physiology and Chronic Health Evaluation-II; AUROC: area under the receiver operating characteristic curve.

Table 1. Recommended structure for reporting ML models. ML: machine learning; N/A: not applicable.

Table 2. CHARMS (Checklist for Critical Appraisal and Data Extraction for Systematic Reviews of Prediction Modeling Studies) checklist.

Table 3. Information on the ML prediction model development, validation, and performance, and on the severity of illness score performance. ML: machine learning.

Table 4. Reported performance measures of the ML models. ML: machine learning.