Federated Learning of Electronic Health Records to Improve Mortality Prediction in Hospitalized Patients With COVID-19: Machine Learning Approach

Background: Machine learning models require large datasets that may be siloed across different health care institutions. Machine learning studies that focus on COVID-19 have been limited to single-hospital data, which limits model generalizability.
Objective: We aimed to use federated learning, a machine learning technique that avoids locally aggregating raw clinical data across multiple institutions, to predict mortality in hospitalized patients with COVID-19 within 7 days.
Methods: Patient data were collected from the electronic health records of 5 hospitals within the Mount Sinai Health System. Logistic regression with L1 regularization/least absolute shrinkage and selection operator (LASSO) and multilayer perceptron (MLP) models were trained by using local data at each site. We developed a pooled model with combined data from all 5 sites, and a federated model that only shared parameters with a central aggregator.
Results: The LASSO federated model outperformed the LASSO local model at 3 hospitals, and the MLP federated model performed better than the MLP local model at all 5 hospitals, as determined by the area under the receiver operating characteristic curve. The LASSO pooled model outperformed the LASSO federated model at all hospitals, and the MLP federated model outperformed the MLP pooled model at 2 hospitals.
Conclusions: The federated learning of COVID-19 electronic health record data shows promise in developing robust predictive models without compromising patient privacy.


Introduction
COVID-19 has led to over 1 million deaths worldwide and other devastating outcomes [1]. The accurate prediction of COVID-19 outcomes requires data from large, diverse patient populations; however, pertinent data are siloed. Although many studies have produced significant findings for COVID-19 outcomes by using single-hospital data, larger representation from additional populations is needed for generalizability, especially for the generalizability of machine learning applications [2-11]. Large-scale initiatives have been combining local meta-analysis and statistics data derived from several hospitals, but this framework does not provide information on patient trajectories and does not allow for the joint modeling of data for predictive analysis [12,13].
In light of patient privacy, federated learning has emerged as a promising strategy, particularly in the context of COVID-19 [14]. Federated learning allows for the decentralized refinement of independently built machine learning models via the iterative exchange of model parameters with a central aggregator, without sharing raw data. Several studies have assessed machine learning models that use federated learning in the context of COVID-19 and have shown promise. Kumar et al. built a blockchain-based federated learning schema and achieved enhanced sensitivity for detecting COVID-19 from lung computed tomography scans [15]. Additionally, Xu et al. used deep learning to identify COVID-19 from computed tomography scans from multiple hospitals in China, and found that models built on data from hospitals in 1 region did not generalize well to hospitals in other regions. However, they were able to achieve considerable performance improvements when they used a federated learning approach [16]. A more detailed background on COVID-19, machine learning in the context of COVID-19, challenges for multi-institutional collaborations, and federated learning can be found in Multimedia Appendices 1-8.
Although federated learning approaches have been proposed, to our knowledge there have been no published studies that implement, or assess the utility of, federated learning to predict key COVID-19 outcomes from electronic health record (EHR) data [17]. The aim of this study was not to compare the performance of various classifiers in a federated learning environment, but to assess whether a federated learning strategy could outperform locally trained models that use 2 common modeling techniques in the context of COVID-19. We are the first to build federated learning models that use EHR data to predict mortality within 7 days of hospital admission in patients diagnosed with COVID-19.

Clinical Data Source and Study Population
Data from patients who tested positive for COVID-19 (N=4029) were derived from the EHRs of 5 Mount Sinai Health System (MSHS) hospitals in New York City. Study inclusion criteria are shown in Figure 1. Further details, as well as cross-hospital demographic and clinical comparisons, are in Multimedia Appendices 1-8.

Study Design
We performed multiple experiments, as outlined in Figure 1. First, we developed classifiers that used, and were tested on, local data from each hospital separately. Second, we built a federated learning model by averaging the model parameters of each individual hospital. Third, we combined all individual hospital data into a superset to develop a pooled model that represented an ideal framework.
Study data included the demographics, past medical history, vital signs, lab test results, and outcomes of all patients (Table 1, Table S1 in Multimedia Appendix 2). Due to the varying prevalence of COVID-19 across hospitals, we assessed multiple class balancing techniques (Table S2 in Multimedia Appendix 3). To simulate federated learning in practice, we also performed experiments with the addition of Gaussian noise (Multimedia Appendix 7). To promote replicability, we used the TRIPOD (Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis) guidelines (Table S3 in Multimedia Appendix 4) and released our code under a general public license (Multimedia Appendices 1-8).
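As an illustration of the Gaussian noise experiments, the following minimal Python sketch perturbs a site's parameters before they would be sent to a central aggregator. The function name `add_gaussian_noise`, the flat list representation of parameters, and the noise scale `sigma` are assumptions made for this example; the noise scale used in the study is not stated in this section.

```python
import random
from typing import List, Optional


def add_gaussian_noise(params: List[float], sigma: float,
                       seed: Optional[int] = None) -> List[float]:
    """Return a copy of `params` with zero-mean Gaussian noise added.

    `sigma` is a hypothetical noise scale chosen by the caller; it is not
    a value taken from the study.
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    rng = random.Random(seed)  # local generator so runs are reproducible
    return [p + rng.gauss(0.0, sigma) for p in params]


if __name__ == "__main__":
    local_params = [0.5, -1.2, 3.0]  # toy values, not study data
    print(add_gaussian_noise(local_params, sigma=0.01, seed=0))
```

Larger values of `sigma` hide more about any single site's update but also distort the averaged model more, so the scale would have to be tuned against model performance.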

Model Development and Selection
The primary outcome was mortality within 7 days of admission. We generated 2 baseline conventional predictive models: a multilayer perceptron (MLP) model and a logistic regression with L1 regularization, or least absolute shrinkage and selection operator (LASSO), model. To maintain consistency and enable direct comparisons, each MLP model was built with the same architecture. We provide more information on model architecture and tuning in Multimedia Appendix 5. MLP and LASSO models were fit on all 5 hospitals.
Our primary model of interest was a federated learning model. Training was performed at different sites, and parameters were sent to a central location (Figure 1). A central aggregator was used to initialize the federated model with random parameters. This model was sent to each site and trained for 1 epoch. Afterward, model parameters were sent back to the central aggregator, where federated averaging was performed. Updated parameters from the central aggregator were then sent back to each site. This cycle was repeated for multiple epochs. Federated averaging scales the parameters of each site according to the number of available data points and sums all parameters by layer. Through this technique, federated models did not receive any raw data.
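The federated averaging step described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's released code: the dictionary-of-lists parameter format and the names `federated_average`, `run_round`, and `local_train_fns` are hypothetical, and each local training function stands in for 1 epoch of training at a site.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical representation: layer name -> flat list of parameter values.
Params = Dict[str, List[float]]


def federated_average(site_params: Sequence[Params],
                      site_sizes: Sequence[int]) -> Params:
    """Scale each site's parameters by its share of the data and sum by layer."""
    if not site_params or len(site_params) != len(site_sizes):
        raise ValueError("need one sample count per site")
    total = sum(site_sizes)
    if total <= 0:
        raise ValueError("total sample count must be positive")
    averaged: Params = {}
    for layer, reference in site_params[0].items():
        merged = [0.0] * len(reference)
        for params, n in zip(site_params, site_sizes):
            weight = n / total  # sites with more data points count more
            for i, value in enumerate(params[layer]):
                merged[i] += weight * value
        averaged[layer] = merged
    return averaged


def run_round(global_params: Params,
              local_train_fns: Sequence[Callable[[Params], Params]],
              site_sizes: Sequence[int]) -> Params:
    """One cycle: send parameters to each site, train locally, aggregate."""
    updates = []
    for train_one_epoch in local_train_fns:
        # Each site receives its own copy; only parameters leave the site.
        site_copy = {k: list(v) for k, v in global_params.items()}
        updates.append(train_one_epoch(site_copy))
    return federated_average(updates, site_sizes)


if __name__ == "__main__":
    toy = federated_average([{"w": [1.0, 2.0]}, {"w": [3.0, 6.0]}], [1, 3])
    print(toy)
```

Repeating `run_round` for several cycles mirrors the multi-epoch loop in the text; only parameter values cross site boundaries in this sketch.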

Experimental Evaluation
All models were trained and evaluated by using 490-fold bootstrapping. Each experiment had a 70%-30% training-testing data split and was initialized with a unique random seed. We used the models' probability scores to calculate average areas under the receiver operating characteristic curve (AUROCs) across 490 iterations.
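One way to read this evaluation procedure is as repeated random 70%-30% splits, each with its own seed, with the AUROC averaged over iterations. The Python sketch below follows that reading. The names `auroc`, `repeated_split_auroc`, and `fit_predict` are hypothetical, the 95% CI uses a normal approximation of the mean (an assumption, since the CI method is not described in this section), and splits whose test set contains only 1 class are skipped as a simplification.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a positive case outscores a negative one (ties = 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both classes")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def repeated_split_auroc(
    rows: Sequence[Sequence[float]],
    labels: Sequence[int],
    fit_predict: Callable[[list, list, list], List[float]],
    n_iter: int = 490,
    test_frac: float = 0.30,
) -> Tuple[float, Tuple[float, float]]:
    """Mean AUROC and an approximate 95% CI over seeded random splits."""
    aucs = []
    for seed in range(n_iter):  # a unique seed per iteration
        idx = list(range(len(rows)))
        random.Random(seed).shuffle(idx)
        cut = int(round(len(idx) * (1 - test_frac)))
        train, test = idx[:cut], idx[cut:]
        y_test = [labels[i] for i in test]
        if len(set(y_test)) < 2:
            continue  # simplification: AUROC is undefined for this split
        scores = fit_predict([rows[i] for i in train],
                             [labels[i] for i in train],
                             [rows[i] for i in test])
        aucs.append(auroc(y_test, scores))
    mean = statistics.mean(aucs)
    half = 1.96 * statistics.stdev(aucs) / len(aucs) ** 0.5
    return mean, (mean - half, mean + half)
```

Here `fit_predict` receives the training rows, training labels, and test rows, and returns one probability score per test row, so any classifier can be plugged in without changing the evaluation loop.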

Intercohort Comparisons
EHR data consisted of patient demographics, past medical history, vitals, and lab test results (Table 1, Table S1 in Multimedia Appendix 2). After performing Bonferroni correction, we found significant differences in the proportions of outcomes across hospitals, specifically mortality within 7 days (Table 1). There were also significant differences in gender, age, ethnicity, race, and the majority of key clinical features (Table S1 in Multimedia Appendix 2).
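For readers unfamiliar with the Bonferroni correction, the adjustment multiplies each raw P value by the number of comparisons and caps the result at 1. The short Python sketch below shows the arithmetic only; the function name `bonferroni_adjust` and the input values are illustrative and are not taken from the study.

```python
from typing import List, Sequence


def bonferroni_adjust(p_values: Sequence[float]) -> List[float]:
    """Multiply each P value by the number of tests, capped at 1.0."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


if __name__ == "__main__":
    raw = [0.001, 0.02, 0.4]  # illustrative values, not study results
    print(bonferroni_adjust(raw))
```

A feature is then called significantly different across hospitals only if its adjusted P value stays below the chosen threshold.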

Classifier Training and Performance
LASSO and MLP models were trained on data from each of the 5 MSHS hospitals separately (ie, local models), data from a combined dataset (ie, pooled models), and data from a federated learning framework (ie, federated models). All 3 training strategies for both models were evaluated for all sites (Figure 1). Training curves and AUROC curves versus the epoch number demonstrate that federated models improve performance after increased passes of training data (Figure 2). The results for model optimization (Figure S2

Learning Framework Comparisons
The performance of all LASSO and MLP models (ie, local, pooled, and federated models) was assessed at each site (Table 2, Figure 3). The LASSO federated model outperformed the LASSO local model at all hospitals except the Mount Sinai Brooklyn and Mount Sinai Queens hospitals; the LASSO federated model achieved AUROCs that ranged from 0.694 (95% CI 0.690-0.698) to 0.801 (95% CI 0.796-0.807). The LASSO pooled model outperformed the LASSO federated model at all hospitals; the LASSO pooled model achieved AUROCs that ranged from 0.734 (95% CI 0.730-0.737) to 0.829 (95% CI 0.824-0.834).

Discussion
This is the first study to evaluate the efficacy of applying federated learning to the prediction of mortality in patients with COVID-19. EHR data from 5 hospitals were used to represent demonstrative use cases. By using disparate patient characteristics from each hospital after performing multiple-hypothesis correction in terms of demographics, outcomes, sample size, and lab values, this study was able to reflect a real-world scenario, in which federated learning could be used for diverse patient populations.
The primary findings of this study show that the MLP federated and LASSO federated models outperformed their respective local models at most hospitals. Differences in MLP model performance may be attributable to the experimental condition, wherein the same underlying architecture was used for all MLP models. Although this framework allowed for consistency in learning strategy comparisons, it may have led to the improper tuning of pooled models. Collectively, our results show the potential of federated learning in overcoming the drawbacks of fragmented, case-specific local models.
Our study shows scenarios in which federated models should either be approached with caution or favored. The Mount Sinai Queens hospital was the only hospital where the LASSO federated model performed worse than the LASSO local model, with a difference of 0.012 in AUROC values. This may be attributable to the hospital having a smaller sample size (n=540) and higher mortality prevalence (23%) than the other sites. However, at the Mount Sinai West hospital, the LASSO local model severely underperformed compared to the LASSO federated model, with an AUROC difference of 0.319. The Mount Sinai West hospital had the lowest sample size (n=485) and the lowest COVID-19 mortality prevalence (5.6%) of all hospitals. This finding emphasizes the benefit of using federated learning for sites with small sample sizes and large class imbalances.
We noted a few limitations in our study. First, data collection was limited to MSHS hospitals. This may limit model generalizability to hospitals in other regions. Second, this study focused on applying federated learning to the prediction of outcomes based on patient EHR data as proof of principle, rather than creating an operational framework for immediate deployment. As such, there are various aspects of the federated learning process that this study does not address, such as load balancing, convergence, and scaling. Third, our models only included clinical data. The models can be enhanced by incorporating other modalities. Fourth, we only implemented 2 widely used classifiers within this framework, but other algorithms may perform better. Finally, although identical MLP architectures were used across all learning strategies for direct comparisons, these architectures could have been further optimized. Future studies should focus on model accessibility and on expanding the analysis of federated models to improve scalability, understand feature importance, and integrate additional data modalities.

Figure 1. Study design and model workflow. (A) Criteria for patient inclusion in this study. (B) An overview of the local and pooled models. Local models only used data from the site itself, whereas pooled models incorporated data from all sites. Both the local and pooled MLP and LASSO models were used. (C) An overview of the federated model. Parameters from a central aggregator are shared with each site, and sites do not have direct access to clinical data from other sites. After the models are locally trained at a site, parameters with and without added noise are sent back to the central aggregator to update federated model parameters. Federated LASSO and MLP models were used. LASSO: least absolute shrinkage and selection operator; MLP: multilayer perceptron; MSB: Mount Sinai Brooklyn; MSH: Mount Sinai Hospital; MSM: Mount Sinai Morningside; MSQ: Mount Sinai Queens; MSW: Mount Sinai West.

Figure 2. Federated model training. The performance of (A) federated MLP and (B) federated LASSO models, as measured by AUROCs versus the number of training epochs. The binary cross-entropy loss of (C) federated MLP and (D) federated LASSO models versus the number of training epochs. AUROC: area under the receiver operating characteristic curve; LASSO: least absolute shrinkage and selection operator; MLP: multilayer perceptron; MSB: Mount Sinai Brooklyn; MSH: Mount Sinai Hospital; MSM: Mount Sinai Morningside; MSQ: Mount Sinai Queens; MSW: Mount Sinai West.
b MLP: multilayer perceptron. c AUROC: area under the receiver operating characteristic curve.

Table 1. Demographic characteristics of all hospitalized patients with COVID-19 included in this study (N=4029) a.
a Interhospital comparisons for categorical data were assessed with chi-square tests. Numerical data were assessed with Kruskal-Wallis tests, and Bonferroni-adjusted P values were reported. Values relating to <10 patients per field were not provided to protect patient privacy (--).
b Not available.
in Multimedia Appendix 8) and class balancing experiments (Table S2 in Multimedia Appendix 3) can be found in Multimedia Appendices 1-8.The final model hyperparameters are listed in Table S4 in Multimedia Appendix 5.

Table 2. Performance of the local, pooled, and federated LASSO a and MLP b models at each site, based on AUROCs c with 95% confidence intervals.