A Machine Learning–Based Algorithm for the Prediction of Intensive Care Unit Delirium (PRIDE): Retrospective Study

Background: Delirium frequently occurs among patients admitted to the intensive care unit (ICU). There is limited evidence to support interventions to treat or resolve delirium in patients who have already developed it. Therefore, the early recognition and prevention of delirium are important in the management of critically ill patients.
Objective: This study aims to develop and validate a model that predicts delirium within 24 hours of ICU admission using electronic health record data. The algorithm was named the Prediction of ICU Delirium (PRIDE).
Methods: This is a retrospective cohort study performed at a tertiary referral hospital with 120 ICU beds. We only included patients who were 18 years or older at the time of admission and who stayed in the medical or surgical ICU. Patients were excluded if they lacked a Confusion Assessment Method for the ICU record from the day of ICU admission or if they had a positive Confusion Assessment Method for the ICU record at the time of ICU admission. The algorithm to predict delirium was developed using patient data from the first 2 years of the study period and validated using patient data from the last 6 months. Random forest (RF), Extreme Gradient Boosting (XGBoost), deep neural network (DNN), and logistic regression (LR) were used. The algorithms were externally validated using MIMIC-III data, and the algorithm with the largest area under the receiver operating characteristic curve (AUROC) in the external data set was named the PRIDE algorithm.
Results: A total of 37,543 cases were collected. After patient exclusion, 12,409 remained as our study population, of which 3816 (30.8%) patients experienced delirium incidents during the study period. Based on the exclusion criteria, out of the 96,016 ICU admission cases in the MIMIC-III data set, 2061 cases were included, and 272 (13.2%) delirium incidents occurred. The average AUROCs and 95% CIs for internal validation were 0.916 (95% CI 0.916-0.916) for RF, 0.919 (95% CI 0.919-0.919) for XGBoost, 0.881 (95% CI 0.878-0.884) for DNN, and 0.875 (95% CI 0.875-0.875) for LR. Regarding the external validation, the best AUROCs were 0.721 (95% CI 0.72-0.721) for RF, 0.697 (95% CI 0.695-0.699) for XGBoost, 0.655 (95% CI 0.654-0.657) for DNN, and 0.631 (95% CI 0.631-0.631) for LR. The Brier score of the RF model was 0.168, indicating that it is well-calibrated.
Conclusions: A machine learning approach based on electronic health record data can be used to predict delirium within 24 hours of ICU admission. RF, XGBoost, DNN, and LR models were used, and they effectively predicted delirium. Although the algorithm has the potential to advise ICU physicians and help prevent ICU delirium, prospective studies are required to verify its performance.


Introduction
Delirium, defined as acute brain dysfunction characterized by disturbances of awareness, attention, and cognition with a fluctuating course linked with an underlying medical condition, frequently occurs among patients admitted to intensive care units (ICUs) [1]. Delirium affects up to 80% of critically ill patients, and affected patients are at an increased risk of prolonged ventilation, higher hospital and ICU mortality, and long-term cognitive impairment. Caring for these patients also increases medical costs [2][3][4].
There is currently limited evidence to support interventions to treat or resolve delirium in patients who have already developed delirium [5]. Therefore, the early recognition and prevention of delirium are indispensable for patients with a high risk of developing delirium. Previous studies have shown that a proportion of the cases of delirium may be avoidable [6]. Accordingly, several prediction models have been developed to predict delirium in patients who may benefit from delirium prevention [7][8][9]. The models developed thus far focus on predicting delirium during the entire ICU stay using predisposing clinical features obtained within 24 hours of ICU admission or immediately upon ICU admission. Considering that ICU patients experience dynamic changes in medical conditions within the initial 24 hours after ICU admission, these models are limited because they focus on predicting only the long-term occurrence of delirium during the entire ICU stay.
Furthermore, these prediction models only include variables that have already been identified as risk factors for delirium in other studies [7,9,10].
Therefore, we developed a machine learning-based model for the early prediction of delirium among medical and surgical ICU patients using electronic health record (EHR) data. This prediction model uses data obtained within 4 hours of ICU admission to predict delirium within 24 hours after ICU admission.

Study Setting and Population
We conducted a retrospective study of all critically ill patients admitted to the ICUs of the Samsung Medical Center (SMC; a 1989-bed university-affiliated, tertiary referral hospital in Seoul, South Korea) from July 1, 2016, to August 31, 2019. We only included patients who were 18 years or older at the time of admission and who stayed in the medical or surgical ICU. Patients were excluded if they lacked a Confusion Assessment Method for the ICU (CAM-ICU) record from the day of ICU admission or if they had a positive CAM-ICU record at the time of ICU admission. The flow diagram in Figure 1 shows the patient selection process. The study protocol was approved by the SMC Institutional Review Board (IRB No. 2020-02-026); all identifiers were removed from the data. The IRB approval form is presented in Multimedia Appendix 1.

Source of Data
This study used data from the Clinical Data Warehouse Darwin-C database of the SMC and the Medical Information Mart for Intensive Care III (MIMIC-III) database (v1.4). The SMC data set was used for the derivation and validation cohorts, and the MIMIC-III data set was used for the external validation cohort. The MIMIC-III database is a clinical database consisting of data from more than 38,000 ICU patients (medical, surgical, trauma-surgical, coronary, and cardiac-surgery data) admitted to Beth Israel Deaconess Medical Center (Boston, MA) from June 2001 to October 2012 [11]. The MIMIC-III database can be accessed upon obtaining approval from its administrators.

Outcome
To screen for delirium, all ICU patients were assessed with the CAM-ICU [12]. The primary outcome of the study was the prediction of the occurrence of delirium within 24 hours of ICU admission. Delirium was defined as a negative CAM-ICU result obtained within the first 4 hours, and a positive CAM-ICU result obtained between 4 and 24 hours of ICU admission. In our institute, CAM-ICU results were obtained 3 times a day, and a senior nurse rechecked the recorded CAM-ICU scores.
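The outcome definition above can be expressed as a small labeling function. The following is a minimal Python sketch, not the study's code: the record layout (a list of `(timestamp, is_positive)` tuples), the function name `label_delirium`, the treatment of interval boundaries, and the choice to return `None` for cases without a negative result in the first 4 hours are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def label_delirium(admit_time, cam_icu_records):
    """Return True/False for the delirium outcome, or None if the case is not eligible.

    cam_icu_records: list of (timestamp, is_positive) tuples (hypothetical layout).
    """
    early_end = admit_time + timedelta(hours=4)
    window_end = admit_time + timedelta(hours=24)

    # CAM-ICU results recorded in the first 4 hours after admission
    early = [pos for t, pos in cam_icu_records if admit_time <= t <= early_end]
    # CAM-ICU results recorded after 4 hours and up to 24 hours
    later = [pos for t, pos in cam_icu_records if early_end < t <= window_end]

    # Assumption: a case needs at least one early result, and all early results
    # must be negative; otherwise it is treated as not eligible.
    if not early or any(early):
        return None
    # Outcome: any positive CAM-ICU result between 4 and 24 hours
    return any(later)
```

Results recorded after the 24-hour window are ignored in this sketch, so a first positive CAM-ICU at 30 hours yields a negative label.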

Predictor Variables
We used clinical characteristics, ICU admission category (medical or surgical), primary cause of admission (respiratory, cardiovascular, gastrointestinal, neurology, perioperative, nephrology, metabolic, or trauma), primary diagnosis, vital signs, prescription medications, and laboratory test results as the predictor variables. All variables were extracted from the EHR data set.

Feature Selection and Data Processing
We first extracted all relevant variables for the prediction model from other studies. Next, 2 clinical experts (CRC and REK) reviewed the relevant variables and selected the crucial ones based on previous clinical studies and clinical relevance. We then further restricted the variables depending on whether they could be automatically extracted from EHRs and had low missing rates. Finally, for the external validation in the MIMIC-III data set, we selected variables found in both SMC and MIMIC-III. The final variables used as input for model development are listed in Textbox 1, and the missing rate of each variable is presented in Multimedia Appendix 2.

Model Development and Validation
We split the data set into a development data set and a validation data set. For the development data set, we used the data obtained between July 1, 2016, and December 31, 2018. For the validation data set, we used the data obtained between January 1, 2019, and August 31, 2019. Of the 37,543 admitted cases, 12,409 cases were selected in this study. These were divided into the development set (n=9589, 77.3%) and the internal validation set (n=2820, 22.7%). Among the 9589 cases in the development data set, there were 3060 (31.9%) cases of delirium, and among the 2820 cases in the validation data set, there were 756 (26.8%) cases of delirium. We did not apply a specific method (eg, undersampling) to address the outcome imbalance because it was not extreme.
We employed random forest (RF), extreme gradient boosting (XGBoost), deep neural network (DNN), and logistic regression (LR) as the candidate prediction models.
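A date-based split of this kind can be sketched as follows. This is an illustrative Python example rather than the study's code; the dictionary layout with an `admit_date` key and the function name `temporal_split` are hypothetical, while the cut-off dates are taken from the text.

```python
from datetime import date

STUDY_START = date(2016, 7, 1)   # first day of the study period
DEV_END = date(2018, 12, 31)     # last day of the development period
STUDY_END = date(2019, 8, 31)    # last day of the study period


def temporal_split(cases):
    """Split cases into development and internal validation sets by admission date.

    cases: list of dicts with an 'admit_date' key (hypothetical layout).
    Cases admitted outside the study period are dropped.
    """
    dev = [c for c in cases if STUDY_START <= c["admit_date"] <= DEV_END]
    val = [c for c in cases if DEV_END < c["admit_date"] <= STUDY_END]
    return dev, val
```

Splitting by time rather than at random means the validation cases are all admitted later than the development cases, which is closer to how a deployed model would be used.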

Parameter Tuning
We used an automated machine learning tool, the Tree-based Pipeline Optimization Tool (TPOT), for model selection and parameter searching [13].
For the DNN, we used three hidden layers with 512, 256, and 128 neurons, the ReLU activation function in the hidden layers, the sigmoid activation function in the output layer, and binary cross-entropy as the loss function. For XGBoost, we used a tree booster with 100 estimators, a learning rate of 0.1, and a subsample ratio of 0.75.
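To make the reported settings concrete, the sketch below writes them out in plain Python. It is not the study's implementation: the dictionary keys follow the keyword names of XGBoost's scikit-learn wrapper (whether the authors used that wrapper is an assumption), the helper names are hypothetical, and the input size passed to `count_parameters` is an assumption because the number of model inputs after encoding is not stated.

```python
import math

# XGBoost settings from the text; key names assume the scikit-learn-style wrapper.
XGBOOST_PARAMS = {
    "booster": "gbtree",
    "n_estimators": 100,
    "learning_rate": 0.1,
    "subsample": 0.75,
}

HIDDEN_UNITS = (512, 256, 128)  # hidden layer sizes of the DNN


def relu(x):
    """Hidden-layer activation."""
    return max(0.0, x)


def sigmoid(z):
    """Output-layer activation, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; probabilities are clipped to [eps, 1 - eps]."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def count_parameters(n_inputs, hidden=HIDDEN_UNITS):
    """Weights plus biases of fully connected layers ending in one output unit."""
    sizes = (n_inputs,) + tuple(hidden) + (1,)
    return sum(a * b + b for a, b in zip(sizes, sizes[1:]))
```

For example, `count_parameters(n)` gives the size of such a network for a hypothetical input dimension `n`; most of the parameters sit between the first two hidden layers regardless of `n`.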

External Validation
After development and internal validation, we performed the external validation of our delirium prediction model using the MIMIC-III database. The external validation set was extracted from the MIMIC-III database and included patients with at least two CAM-ICU records within 24 hours of ICU admission.
The model with the highest area under the receiver operating characteristic curve (AUROC) in the external validation was named the PRIDE (Prediction of ICU Delirium) algorithm.
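The selection rule can be illustrated with a short Python sketch. The `auroc` function below uses the standard pairwise (Mann-Whitney) formulation and is an illustrative implementation, not the study's code; the AUROC values in `EXTERNAL_AUROC` are the external validation results reported in the abstract.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive case is scored above a
    random negative case (ties count as one half). Quadratic in sample size."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# External validation AUROCs as reported in the abstract
EXTERNAL_AUROC = {"RF": 0.721, "XGBoost": 0.697, "DNN": 0.655, "LR": 0.631}

# The model with the highest external AUROC is the one named PRIDE
best_model = max(EXTERNAL_AUROC, key=EXTERNAL_AUROC.get)
```

With the reported values, `best_model` is `"RF"`, which matches the model identified as best in external validation.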

Statistical Analysis
Continuous variables are presented as means and SDs, and categorical variables are presented as frequencies and percentages. The performances of the different models were compared using the AUROC, sensitivity, specificity, positive predictive value (PPV), and negative predictive value (NPV) at the selected threshold. In the internal validation, model performance was evaluated through the average and 95% CI of the AUROCs. Additionally, we used a calibration curve and the Brier score to test the reliability of our model. To determine the clinically relevant threshold, we used a decision curve.
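The threshold-dependent metrics named above follow directly from the confusion matrix. The sketch below is a minimal Python illustration; the function name `threshold_metrics` and the convention that a predicted probability equal to the threshold counts as positive are assumptions.

```python
def threshold_metrics(y_true, y_prob, threshold):
    """Sensitivity, specificity, PPV, and NPV at a given probability threshold."""
    tp = fp = tn = fn = 0
    for y, p in zip(y_true, y_prob):
        predicted_positive = p >= threshold  # assumption: ties count as positive
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive and y == 0:
            fp += 1
        elif not predicted_positive and y == 0:
            tn += 1
        else:
            fn += 1

    def ratio(num, den):
        # Undefined ratios (empty denominator) are reported as NaN
        return num / den if den else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
    }
```

Unlike the AUROC, all four values change with the threshold, so they should always be reported together with the threshold used.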
We employed the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) statement to report the results of our prediction model. Data processing, statistical analysis, and the development and validation of the machine learning algorithms were performed using R version 3.6.2 [14] and Python version 3.6.8 [15].
The source code has been made available on GitHub [16].

Study Population
During the study period, a total of 37,543 cases were collected. Patients who were 18 years or older at the time of ICU admission were included. Cases with fewer than two CAM-ICU records after admission to the ICU and those with a positive CAM-ICU upon ICU admission were excluded. After patient exclusion, 12,409 remained as our study population. Baseline characteristics of the training and test sets of the SMC and MIMIC-III data sets are shown in Table 1.

External Validation
For the external validation, the average AUROCs and 95% CIs are presented in Table 2, and the ROC curves are shown in Figure 2.
For the external validation with MIMIC-III, we only selected variables that could be found both in SMC and MIMIC-III. As a result, only 59 variables were selected. The variables were categorized into general information, flowsheet, laboratory test results, and prescription of medication. The most important variable was the use of invasive mechanical ventilation, in the general information category. The importance of each final variable used in model development is shown in Figure 3.

Model Assessment
For further model evaluation, calibration and decision curve analyses were performed. A Brier score of 0 indicates perfect calibration, and the closer the value is to 0, the better the model calibration. The Brier score for the XGBoost model with regard to predicting delirium was 0.094 for the internal validation data set, indicating that our model is reliable. The best model for external validation was RF, with a Brier score of 0.168.
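The Brier score is the mean squared difference between the predicted probability and the observed outcome. A minimal Python sketch follows; the function names are illustrative, and the constant-prediction reference is included as an interpretation aid rather than as part of the study's analysis.

```python
def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome (0 or 1)."""
    if len(y_true) != len(y_prob) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def reference_brier(prevalence):
    """Brier score of a model that always predicts the outcome prevalence."""
    return prevalence * (1.0 - prevalence)
```

Because the score depends on how common the outcome is, comparing it with `reference_brier` for the same cohort helps to judge how much a model improves on an uninformative prediction.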
The calibration plot is shown in Multimedia Appendix 3. The decision curve analysis showed that the net benefit was useful for determining the threshold. For the PRIDE algorithm, the threshold for delirium prediction was selected as 0.13, and at this cut-off point, the net benefit was 0.234. The PRIDE model shows a net benefit across a wide range of threshold probabilities and offers reasonable clinical applicability. The decision curve analysis is presented in Multimedia Appendix 4.
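In decision curve analysis, the net benefit at a threshold probability p_t is commonly computed as TP/n minus FP/n weighted by p_t/(1 - p_t). The Python sketch below uses this standard formula; it is an illustration under that assumption and not the study's code, and the function names are hypothetical.

```python
def net_benefit(y_true, y_prob, threshold):
    """Net benefit of treating cases with predicted probability >= threshold."""
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must lie strictly between 0 and 1")
    n = len(y_true)
    tp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 1)
    fp = sum(1 for y, p in zip(y_true, y_prob) if p >= threshold and y == 0)
    # False positives are weighted by the odds of the threshold probability
    return tp / n - (fp / n) * threshold / (1.0 - threshold)


def net_benefit_treat_all(y_true, threshold):
    """Reference strategy: treat every patient as high risk."""
    prevalence = sum(y_true) / len(y_true)
    return prevalence - (1.0 - prevalence) * threshold / (1.0 - threshold)
```

A decision curve plots `net_benefit` over a range of thresholds together with the treat-all and treat-none (net benefit 0) strategies; a model is useful at a threshold where its curve lies above both references.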

Principal Results
We have demonstrated that the proposed delirium prediction model, which employs a machine learning algorithm with EHR data, can predict the development of delirium in medical and surgical ICU patients. In addition to our internal validation, we externally validated our findings using the MIMIC-III patient database. With the PRIDE model, we showed that delirium prediction models could be automated exclusively using risk factors derived from EHR data. The three main results of our study are as follows: (1) the model predicted delirium within the first 24 hours of ICU admission using only data collected within the first 4 hours after ICU admission, (2) all variables were extracted from EHR data obtained from both medical and surgical intensive care patients, and (3) the model showed acceptable performance on the external validation data set.
Among the various departments in a hospital, the incidence of delirium is the highest in the ICU, and it is well-documented that delirium occurs in 25% of critically ill adults in ICUs within the first 24 hours after admission [17][18][19]. These data show that the early prediction of delirium upon initial ICU admission is crucial. Furthermore, the early prediction of the development of delirium can help clinicians make clinical decisions at an optimal time and provide preventive and personalized care with nondrug interventions for high-risk patients. Examples of such care are cognitive stimulation, orientation improvement, and early mobilization [20].

Comparison With Prior Work
Owing to the prevalence of delirium in patients admitted to ICUs, the routine use of preventive measures for delirium is recommended. However, previous studies have shown that clinicians' predictions of the development of delirium are less accurate than those of ICU delirium prediction models [7]. Thus, delirium prediction models developed using machine learning can support clinicians in the early recognition of delirium, thereby benefiting patients at high risk of delirium [21]. Furthermore, although several risk prediction models have been proposed, they are based on the manual evaluation of individual risk factors and thus may be challenging to implement [7,22,23]. Hence, in practice, automated models are preferable and more feasible. For these reasons, the implementation of automated tools for predicting the risk of delirium development using data extracted from EHRs would improve clinical practices with regard to ICU management. Furthermore, an EHR-based prediction model can use a pipeline that automatically extracts variables and computes predictions from a sufficiently large set of variables.
Previous studies have used several risk factors for delirium in ICUs, including age, severity score, cause of admission, usage of sedative agents, and laboratory results. In contrast with previous studies, the PRIDE model includes several additional variables such as vital signs (heart rate and blood pressure) and medication information that is extracted from EHRs. These differences allow our model to predict delirium incidents using only EHR data obtained within 4 hours of ICU admission. Further, the PRIDE model did not include a severity score because this can only be obtained after 24 hours of ICU admission; in addition, since this information is separate from EHR data, using a severity score would require further efforts by the clinician. A few reports have also presented EHR-based machine learning models to predict delirium [24,25]. Whereas the prediction models presented in these reports are for all hospital-admitted patients, in this study, we developed a versatile model specifically for ICU patients at risk of delirium.
The strength of our study is the EHR-driven model that was both internally and externally validated, using SMC and MIMIC-III data, respectively. Although our result showed lower accuracy with external data than with internal data, this result can be improved if the missing rate of key features decreases. In the case of CAM-ICU, 96% was missing in MIMIC-III. In addition, a decrease in accuracy with an external database is not uncommon in the literature [26]. For example, a study predicting serious bacterial infections among children with fever reported that the AUC for the external data was 0.26 lower than that for the internal data [27].
In clinical settings, missing values occur for various reasons.
To handle missing data, we imputed mean values for the numerical variables. For the categorical variables, we left missing values blank so that all of the corresponding dummy variables were equal to 0. Recently, advanced deep learning-based techniques, such as long short-term memory and recurrent neural networks, have also been introduced to impute missing data, and employing these methods could improve model performance [28]. When choosing a method for handling missing data, knowing the missing pattern can improve model performance and make the model work better in clinical applications.
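The two strategies described above can be sketched in a few lines of Python. This is an illustrative example, not the study's code: missing values are represented as `None`, the function names are hypothetical, and the option to pass a precomputed fill value (for instance, a mean taken from the development set and reused on the validation sets) is an assumption about good practice rather than a reported detail.

```python
def mean_impute(values, fill=None):
    """Replace None in a numeric column with `fill`, or with the mean of the
    observed values when no fill value is supplied."""
    if fill is None:
        observed = [v for v in values if v is not None]
        if not observed:
            raise ValueError("cannot compute a mean from an all-missing column")
        fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def dummy_encode(values, categories):
    """One dummy variable per category; a missing value (None) yields all zeros."""
    return [[1 if v == c else 0 for c in categories] for v in values]
```

With this encoding, a missing category is indistinguishable from "none of the listed categories", which is what leaving the value blank implies.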

Limitations
There are potential limitations to our study that should be acknowledged. First, our study was retrospectively performed and validated. Prospective interventional studies are needed to verify the performance of the model and to reconfirm its clinical usefulness. Second, a selection bias might exist because we selected variables available in all cohorts, and this study was conducted in a retrospective manner. Furthermore, we excluded patients without CAM-ICU data (47% of the total number of ICU-admitted patients). In this regard, it should be noted that the purpose of this study was to develop a readily available model; therefore, we only selected the variables that could be used commonly in all cohorts. Finally, although the CAM-ICU tool is regarded as highly sensitive and specific to the detection of ICU delirium, it has critical limitations. As it only has binary labels, we cannot assess the degree of delirium exacerbation. Furthermore, it is recorded in a "point-in-time" manner; thus, there may be some patients whose CAM-ICU tests were missed because they were completed outside the study's time frame [29,30].

Conclusions
We have developed and validated a delirium prediction model that can predict the occurrence of delirium within 24 hours of ICU admission using clinical data obtained in the first 4 hours after ICU admission. The PRIDE algorithm has acceptable AUROC and sensitivity; thus, it has the potential to help advise ICU physicians and prevent ICU delirium.