Original Paper
Abstract
Background: The presence of bias in artificial intelligence has garnered increased attention, with inequities in algorithmic performance being exposed across the fields of criminal justice, education, and welfare services. In health care, the inequitable performance of algorithms across demographic groups may widen health inequalities.
Objective: Here, we identify and characterize bias in cardiology algorithms, looking specifically at algorithms used in the management of heart failure.
Methods: Stage 1 involved a literature search of PubMed and Web of Science for key terms relating to cardiac machine learning (ML) algorithms. Papers that built ML models to predict cardiac disease were evaluated for their focus on demographic bias in model performance, and open-source data sets were retained for our investigation. Two open-source data sets were identified: (1) the University of California Irvine Heart Failure data set and (2) the University of California Irvine Coronary Artery Disease data set. We reproduced existing algorithms that have been reported for these data sets, tested them for sex biases in algorithm performance, and assessed a range of remediation techniques for their efficacy in reducing inequities. Particular attention was paid to the false negative rate (FNR), due to the clinical significance of underdiagnosis and missed opportunities for treatment.
Results: In stage 1, our literature search returned 127 papers, with 60 meeting the criteria for a full review and only 3 papers highlighting sex differences in algorithm performance. In the papers that reported sex, there was a consistent underrepresentation of female patients in the data sets. No papers investigated racial or ethnic differences. In stage 2, we reproduced algorithms reported in the literature, achieving mean accuracies of 84.24% (SD 3.51%) for data set 1 and 85.72% (SD 1.75%) for data set 2 (random forest models). For data set 1, the FNR was significantly higher for female patients in 13 out of 16 experiments (male minus female difference of –17.81% to –3.37%; P<.05). A smaller disparity in the false positive rate was significant for male patients in 13 out of 16 experiments (–0.48% to +9.77%; P<.05). We observed an overprediction of disease for male patients (higher false positive rate) and an underprediction of disease for female patients (higher FNR). Sex differences in feature importance suggest that feature selection needs to be demographically tailored.
Conclusions: Our research exposes a significant gap in cardiac ML research, highlighting that the underperformance of algorithms for female patients has been overlooked in the published literature. Our study quantifies sex disparities in algorithmic performance and explores several sources of bias. We found an underrepresentation of female patients in the data sets used to train algorithms, identified sex biases in model error rates, and demonstrated that a series of remediation techniques were unable to address the inequities present.
doi:10.2196/46936
Introduction
Background
Artificial intelligence (AI) has been proposed as an effective solution to many health care challenges and depends on the construction of machine learning (ML) algorithms from health care data. Recent research has drawn attention to the possibility that algorithms may exhibit bias when applied to different demographic groups [ - ]. Such biases may widen health inequalities and negatively impact marginalized patients, such as female patients, minoritized racial and ethnic groups, and other neglected subpopulations [ - ].
Over the past 5 years, an increasing number of studies have quantified disparities in algorithmic performance for underserved populations [ - ]. Daneshjou and colleagues [ ] demonstrated that state-of-the-art dermatology algorithms tend to perform worse on darker skin tones; Seyyed-Kalantari and colleagues [ ] exposed biases in radiology algorithms; and Thompson and colleagues [ ] reported increased false negative errors when classifying opioid misuse disorder for Black patients compared to White patients. Beyond specific diagnoses, infrastructural AI systems used in hospital settings can be subject to referral bias, as shown by Obermeyer and colleagues [ ], who highlighted a hospital treatment allocation algorithm that overlooked the health needs of Black patients. Yet despite the increasing number of papers describing this issue, most of the current uses of biomedical AI technologies do not account for the problem of bias [ - ]. Here, we evaluate algorithmic inequity in ML algorithms used for predicting cardiac disease, focusing on heart failure (HF).
ML for HF
HF is a clinical syndrome in which the heart is unable to maintain a cardiac output adequate to meet the metabolic demands of the body [ ]. Traditionally, algorithmic tools capable of identifying at-risk patients have played a key role in informing decisions on HF management and end-of-life care [ - ]. In recent years, ML algorithms that leverage biochemical data have been proposed as a superior alternative to traditional statistical models for identifying at-risk patients with HF [ ]. A range of ML techniques outperforms traditional risk scores in forecasting HF-related events [ ]. Yet given that existing medical research has described sex differences in both the presentation and management of HF, algorithms trained on existing data may perform differently for male versus female patients [ , ].
Sex Differences in HF
HF presents differently in female patients compared with male patients [ ]. Female patients experience a wider range of symptoms, including higher fluid overload and lower health-related quality of life [ , ]. Moreover, female patients who present with HF are on average older, sustain a higher ejection fraction (EF) throughout later stages of the disease, and have a lower incidence of previous ischemic heart disease [ ]. Furthermore, the biochemical tests used to detect cardiac disease have been demonstrated to perform less well for female patients [ ]. Troponin, a key biomarker used to predict disease, has been demonstrated to be less sensitive in female patients [ ]. Standard troponin criteria fail to detect 1 out of 5 acute myocardial infarcts occurring in female patients [ ]. Historically, the neglect of sex differences in cardiac pathophysiology has disadvantaged female patients, and if not considered during ML development, these inequities may manifest in the novel algorithms being integrated into cardiac care [ - ].
In our research, we scope the published literature reporting algorithms that predict HF and investigate whether existing papers give attention to bias in ML algorithms. Furthermore, we examine the data sets of existing models for demographic representation, evaluate demographic inequities in algorithmic performance, and assess the efficacy of a series of bias-mitigation techniques.
Methods
Study Design
Our analysis consists of two stages: (1) a literature review of papers describing ML models used to predict HF and (2) a quantitative analysis of identified models, evaluating inequities in algorithm performance. The flowchart in
provides an overview of our approach.
Stage 1 Literature Review: Qualitative Evaluation of Published Papers
We searched PubMed and Web of Science between April 1, 2022, and May 22, 2022, to identify ML algorithms used to predict cardiac disease adhering to PRISMA (Preferred Reporting Items for Systematic Reviews and Meta-Analyses) guidelines for systematic reviews (
[ ] and Tables S1 and S2 in [ , ]). All abstracts were reviewed, and papers were included for full-text review if they met the following criteria: (1) the target diagnosis was HF, (2) the model used biochemical markers to predict disease, and (3) the computational methods involved an ML approach (including supervised, unsupervised, and deep learning).
Of the retained papers, full texts were then reviewed to evaluate whether authors (1) reported the demographic make-up of data sets and (2) evaluated demographic inequities in algorithm performance, meaning that the authors specifically examined differences in algorithmic performance by demographic groups defined by protected characteristics [ ].
Throughout the literature review, any identified open-source data sets were retained for use in stage 2.
Stage 2: Quantitative Evaluation of Model Performance
Two open-source data sets were uncovered in our literature review: (1) data set 1: the University of California Irvine Heart Failure Prediction data set [ ] and (2) data set 2: the University of California Irvine Cleveland Heart Disease data set for identifying coronary artery disease (CAD) [ ]. Descriptive statistics were calculated for both data sets, evaluating the mean and variance of each variable separately by sex and by outcome (disease or death) ( and Tables S3-S5 in ).
Variables | Sex and death (target variable)b | |||
Female (sex=0; n=105) | Male (sex=1; n=194) | |||
Survived (HFc death=0) | Death (HF death=1) | Survived (HF death=0) | Death (HF death=1) | |
Total count, n (%) | 71 (67.62) | 34 (32.38) | 132 (68.04) | 62 (31.96) |
Age (years), mean (SD) | 58.6 (10.6) | 62.2 (12.3) | 58.8 (10.7) | 66.9 (13.5) |
Anemia (Boolean), mean (SD) | 0.5 (0.5) | 0.6 (0.5) | 0.4 (0.5) | 0.4 (0.5) |
Creatinine phosphokinase (mcg/L), mean (SD) | 462.0 (517.7) | 507.7 (779.7) | 582.8 (853.2) | 759.3 (1532.3) |
Diabetes mellitus (Boolean), mean (SD) | 0.5 (0.5) | 0.6 (0.5) | 0.4 (0.5) | 0.3 (0.5) |
Ejection fraction (percentage), mean (SD) | 41.9 (11.6) | 37.5 (14.6) | 39.4 (10.4) | 31.2 (10.7) |
High blood pressure (Boolean), mean (SD) | 0.4 (0.5) | 0.5 (0.5) | 0.3 (0.5) | 0.4 (0.5) |
Platelets (kiloplatelets/mL), mean (SD) | 289,757.6 (98,655.9) | 259,512.7 (107,588.6) | 254,232.4 (94,985.6) | 254,663.7 (94,060.8) |
Serum creatinine (mg/dL), mean (SD) | 1.1 (0.6) | 1.9 (1.6) | 1.2 (0.7) | 1.8 (1.4) |
Serum sodium (mEq/L), mean (SD) | 137.4 (3.6) | 135.5 (6.7) | 137.1 (4.2) | 135.3 (3.8) |
Smoking (Boolean), mean (SD) | 0.0 (0.1) | 0.1 (0.3) | 0.5 (0.5) | 0.4 (0.5) |
aFull details of data set variables are available in Tanvir et al [ ].
bFor the death variable, a value of 1 indicates mortality.
cHF: heart failure.
Using these data sets, we rebuilt the ML algorithms described in the published literature and performed an additional analysis exploring inequities in algorithmic performance for demographic subgroups. As the only protected characteristic reported was sex, we focus on sex disparities in performance. Although our initial aim was to focus on HF, we retained the CAD data set uncovered in our search to investigate whether trends identified for HF generalized to patients with CAD [ ]. Tables S3 and S4 in provide details on data set 1 and data set 2, respectively.
Model Reproduction
We rebuilt the models described in the existing literature for these data sets, focusing on random forest (RF) algorithms, which have been widely reported to be the most effective models [ ]. For both data sets, data were split into training and test subsets (0.7:0.3), RF models were built using scikit-learn, and RF hyperparameters were tuned using scikit-learn's GridSearchCV. We adopted a bootstrapping approach to quantify uncertainty, such that models were built, trained, and tested 100 times, from which mean results and SDs were derived.
Statistical Analysis
Across the 100 runs, sex differences in each algorithm evaluation metric (equations 1-10) were calculated and averaged, and statistical tests were performed to evaluate the significance of any identified sex disparities. Our method for examining differences in algorithmic error rates builds on the foundational work from Buolamwini and Gebru [ ], who demonstrated that a range of ML algorithms for facial recognition performed poorly on darker-skinned women. To evaluate for statistical significance, independent 2-tailed t tests were performed where the data were normally distributed, and Mann-Whitney U tests were performed where the data were not normally distributed. Kolmogorov-Smirnov tests were used to assess for normality [ ].
Variations in Model Development
Overview
We then introduced a variety of changes to the model development, to evaluate the impact on the identified sex disparities in performance.
Changes to Model Training Data
One widely proposed bias mitigation technique is to preprocess the training data of a model to account for demographic representation, with previous research highlighting the benefit of training on demographically balanced or demographically stratified data sets [ ]. We therefore created a range of data sets with varied sex representation and assessed the impact on algorithm performance disparities. To form the sex-balanced data set, we used the oversampling function SMOTE(), which has been proposed as an effective method for improving the representation of underserved populations in ML data sets [ ]. SMOTE (synthetic minority oversampling technique) generates new minority data points based on existing minority samples through linear interpolation [ , ]. Models were rebuilt as per the Model Reproduction section, using 4 different training data sets (sex-imbalanced, sex-balanced, and sex-specific; Tables S6 and S7 in ): (1) original sex-imbalanced training data, (2) sex-balanced training data, (3) female-only training data, and (4) male-only training data.
Changes to Feature Selection
To understand why models make certain decisions, researchers in the domain of “explainable AI” have demonstrated how feature evaluation may provide important information regarding model performance for different subpopulations [ , ]. To do this, Shapley values have been widely accepted as a unified measure of feature importance since their proposal for this purpose in 2017 [ ].
In our experiments, we first perform an exploratory analysis, comparing feature importance for models trained on the male versus female data sets. Second, we create 4 feature subsets from the original data sets, to evaluate the impact of changing the feature selection on performance disparities. As described in the introduction, existing clinical research has described demographic differences in the biochemical and clinical markers of HF disease (eg, sex differences in EF and troponin levels) [ ]. Thus, we delineate 4 different feature subsets that vary in this information, to examine whether certain feature subsets perform better for different demographic groups. These 4 feature subsets are described in detail in Tables S8 and S9 in and include (1) features with sex, (2) features without sex, (3) biochemical features, and (4) clinical features.
Our final series of experiments is therefore performed across the 4 training data sets (sex-imbalanced, sex-balanced, female-only, and male-only) and the 4 feature sets, giving 16 total experiments: (1) original sex-imbalanced training data experiments (across 4 feature subsets), (2) sex-balanced training data experiments (across 4 feature subsets), (3) female training data experiments (across 4 feature subsets), and (4) male training data experiments (across 4 feature subsets).
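The 16 experiments are the full cross of the 4 training data sets with the 4 feature subsets. A minimal sketch of that design is given below; the string labels and the function name are hypothetical identifiers chosen for illustration and do not come from the study code.

```python
from itertools import product

# Hypothetical labels paraphrasing the 4 training data sets and 4 feature subsets.
TRAINING_SETS = ["sex_imbalanced", "sex_balanced", "female_only", "male_only"]
FEATURE_SUBSETS = ["with_sex", "without_sex", "biochemical", "clinical"]


def experiment_grid():
    """Return every (training data set, feature subset) pairing."""
    return list(product(TRAINING_SETS, FEATURE_SUBSETS))


experiments = experiment_grid()
print(len(experiments))  # number of experiment configurations
for training_set, feature_subset in experiments[:4]:
    print(training_set, feature_subset)
```

Each pairing would then be passed through the same model-building and evaluation procedure, so that disparities can be compared across a consistent grid.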
Model Evaluation and Identification of Performance Disparities
Models are evaluated using global evaluation metrics (eg, accuracy) and specific error rates (eg, false negative rate [FNR]; equations 1-10). The difference between male and female scores is calculated to give a model’s “sex performance disparity” (equation 10). To evaluate for statistical significance, Kolmogorov-Smirnov tests were used to assess the normality of the data, following which independent 2-tailed t tests were performed where the data were normally distributed, and Mann-Whitney U tests were performed where the data were not normally distributed.
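The test-selection rule described above can be sketched as follows. This is an illustration under stated assumptions, not the study's code: the function name, the 0.05 normality cutoff, and the choice to run the Kolmogorov-Smirnov test against a normal distribution with the sample's own mean and SD are all assumptions, and the input scores are synthetic.

```python
import numpy as np
from scipy import stats


def compare_groups(male_scores, female_scores, alpha=0.05):
    """Choose a t test or Mann-Whitney U test based on a normality check."""
    male = np.asarray(male_scores, dtype=float)
    female = np.asarray(female_scores, dtype=float)

    def looks_normal(x):
        # Kolmogorov-Smirnov test against a normal distribution whose mean
        # and SD are estimated from the sample (an assumption of this sketch).
        _, p = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1)))
        return p > alpha

    if looks_normal(male) and looks_normal(female):
        name = "t-test"
        _, p_value = stats.ttest_ind(male, female)  # independent, 2-tailed
    else:
        name = "mann-whitney-u"
        _, p_value = stats.mannwhitneyu(male, female, alternative="two-sided")

    disparity = male.mean() - female.mean()  # male minus female
    return name, disparity, p_value


# Synthetic per-run FNR values for illustration only (not study data).
rng = np.random.default_rng(0)
male_fnr = rng.normal(0.28, 0.10, size=100)
female_fnr = rng.normal(0.36, 0.17, size=100)
print(compare_groups(male_fnr, female_fnr))
```

Estimating the normal parameters from the same sample makes the Kolmogorov-Smirnov check conservative, which is one reason other normality tests are sometimes preferred; the sketch keeps the test named in the text.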
Our choice of evaluation metrics is guided by the clinical consequence of each of these scores.
The existing research on algorithmic bias has highlighted the importance of examining error rates, particularly in medicine where a false negative clinically translates to missed diagnoses or opportunities for treatment [ - , ]. As described by Afrose and colleagues [ ], focusing on global metrics of performance such as area under the receiver operating characteristic curve scores can neglect subtler disparities arising from differences in error rates affecting subgroups. When selecting a bias assessment metric, previous studies have chosen to focus on FNR and false positive rate (FPR), due to the clinical implications of these errors [ , , ]. Equations 5-8 place the error rates in their clinical context, demonstrating that the FNR represents missed diagnoses and potentially missed treatment. For the error rates, we use the threshold of 0.5, as we are investigating performance inequities in the existing reported models that used these default settings.
Error rate definitions are as follows:
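Assuming the standard confusion-matrix definitions, with TP, FP, TN, and FN denoting the counts of true positives, false positives, true negatives, and false negatives (notation chosen here for illustration, with numbering assumed to mirror equations 5-8), the 4 error rates can be written as:

```latex
\begin{align}
\mathrm{TPR} &= \frac{TP}{TP + FN} \tag{1}\\
\mathrm{FPR} &= \frac{FP}{FP + TN} \tag{2}\\
\mathrm{TNR} &= \frac{TN}{TN + FP} \tag{3}\\
\mathrm{FNR} &= \frac{FN}{FN + TP} \tag{4}
\end{align}
```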
Clinical implications of error rates are as follows:
True Positive Rate = Correct diagnosis that patient has disease (5)
False Positive Rate = Misdiagnosis of disease when patient is healthy (6)
True Negative Rate = Correct diagnosis that patient is healthy (7)
False Negative Rate = Misdiagnosis that patient is healthy when patient has disease (8)
The accuracy evaluation metric is calculated as follows:
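Assuming the standard definition, with TP, TN, FP, and FN denoting the counts of true positives, true negatives, false positives, and false negatives (notation chosen here for illustration), accuracy can be written as:

```latex
\begin{equation}
\mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{9}
\end{equation}
```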
Sex performance disparity is calculated as follows:
Sex performance disparity = Score for male patients (mean) – Score for female patients (mean) (10)
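As a rough illustration of equation 10 applied to error rates, the sketch below computes the FNR and FPR separately for each sex from a single set of predictions and then takes the male minus female difference. The function names and toy labels are hypothetical; the sex coding (0 = female, 1 = male) follows the data set 1 table, and averaging over repeated runs is omitted for brevity.

```python
def error_rates(y_true, y_pred):
    """Return (FNR, FPR) for binary labels where 1 = disease."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return fnr, fpr


def sex_disparity(y_true, y_pred, sex):
    """Return (FNR disparity, FPR disparity) as male minus female."""
    rates = {}
    for group in (0, 1):  # 0 = female, 1 = male
        idx = [i for i, s in enumerate(sex) if s == group]
        rates[group] = error_rates(
            [y_true[i] for i in idx], [y_pred[i] for i in idx]
        )
    fnr_disparity = rates[1][0] - rates[0][0]
    fpr_disparity = rates[1][1] - rates[0][1]
    return fnr_disparity, fpr_disparity


# Toy labels for illustration only (not study data).
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 0, 1, 1, 1, 0]
sex = [0, 0, 0, 0, 1, 1, 1, 1]
print(sex_disparity(y_true, y_pred, sex))  # (-0.5, 0.5)
```

In this toy case, the negative FNR disparity means more missed diagnoses among female patients, and the positive FPR disparity means more overprediction among male patients, matching the sign convention of equation 10.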
Fairness Techniques: Fair Adversarial Gradient Tree Boosting
We implemented a recent fairness technique to evaluate whether such approaches can mitigate bias in HF algorithms. Fair Adversarial Gradient Tree Boosting (FAGTB) is a recent technique proposed by Grari et al [ ] for mitigating bias in decision tree classifiers, and the authors demonstrate the success of their technique on 4 data sets. The authors focus on 2 definitions of fairness: demographic parity and equalized odds [ ]. The equalized odds metric focuses on model FPR and FNR, and hence we focus on it in this paper. A summary of these fairness metrics is provided in Section S1 in for further interest.
The definition of equalized odds is as follows:
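In the usual formulation, assumed here, a classifier satisfies equalized odds when its prediction is independent of the sensitive attribute given the true outcome. Writing the prediction as \hat{Y}, the true label as Y, and the sensitive attribute (sex) as S, with symbols chosen for illustration:

```latex
\begin{equation}
P(\hat{Y} = 1 \mid Y = y, S = 0) = P(\hat{Y} = 1 \mid Y = y, S = 1), \quad y \in \{0, 1\}
\end{equation}
```

For y = 0 this equates the FPR of the 2 groups, and for y = 1 it equates their true positive rates and therefore their FNRs.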
To assess equalized odds, the authors measure disparate mistreatment, which computes the absolute difference in FPR and the absolute difference in FNR between the 2 demographic groups.
The disparate FPR is calculated as follows:
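Assuming the standard disparate mistreatment formulation, with the prediction written as \hat{Y}, the true label as Y, and the sensitive attribute (sex) as S (symbols chosen here for illustration):

```latex
\begin{equation}
D_{\mathrm{FPR}} = \left| P(\hat{Y} = 1 \mid Y = 0, S = 1) - P(\hat{Y} = 1 \mid Y = 0, S = 0) \right|
\end{equation}
```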
The disparate FNR is calculated as follows:
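Assuming the standard disparate mistreatment formulation, with the prediction written as \hat{Y}, the true label as Y, and the sensitive attribute (sex) as S (symbols chosen here for illustration):

```latex
\begin{equation}
D_{\mathrm{FNR}} = \left| P(\hat{Y} = 0 \mid Y = 1, S = 1) - P(\hat{Y} = 0 \mid Y = 1, S = 0) \right|
\end{equation}
```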
We compare the performance of the FAGTB algorithm to a standard gradient tree boosting classifier. As per the original FAGTB paper, we repeat 10 experiments, each randomly splitting the data into training and test subsets (0.8:0.2), and report evaluation metrics for the test set.
Ethical Considerations
Ethical approval was not required for this study as all data used were sourced from publicly available open-source data sets [ , ] under a CC-BY 4.0 license. No direct patient contact or sensitive personal data was involved, ensuring compliance with research standards.
Results
Literature Review Search Results
Our search returned 127 papers, of which 60 met the criteria for full review and 3 highlighted sex differences in model performance. In the papers that reported sex, there was a consistent underrepresentation of female patients. No papers investigated racial or ethnic differences. One further paper focused specifically on female patients with HF: Tison et al [ ] highlighted that HF was more common in those who were older, were White, had a higher mean number of pregnancies, had a higher BMI, and were less likely to have Medicare.
Descriptive Statistics and Feature Importance
Data Set 1 (HF)
The mean descriptive statistics for each feature present in the HF data set are provided in
, which demonstrates subtle sex differences in the presentation of the disease. Among HF deaths, male patients tend to be older than their female counterparts, with a higher creatinine phosphokinase, lower likelihood of diabetes, lower EF, and lower likelihood of high blood pressure.
Our exploratory analysis identified further sex differences on examining feature importance.
compares the rankings of feature importance for ML models predicting HF that were built from the female data set versus the male data set. These differences are important as existing ML algorithms built on mixed-sex cohorts suggest that EF can be used alone for modeling, an approach that may disadvantage female patients [ ].
Data Set 2 (CAD)
Table S5 in
provides details of the CAD data set and demonstrates that female patients with CAD have higher resting blood pressure and higher cholesterol compared to male patients. The categorical variable “resting electrocardiogram” is also higher for female patients, due to a higher incidence of left ventricular hypertrophy.
Model Results and Performance Disparities
We replicated the algorithms described in the existing literature, reproducing the previously reported mean predictive accuracies of 84.24% (SD 3.51%) for data set 1 and 85.72% (SD 1.75%) for data set 2 [ ]. In and , we present the disparity in performance between the sexes, where a positive value indicates a higher value for male patients (see equation 10).
For data set 1,
demonstrates that in 13 out of 16 experiments, the FNR is significantly higher for female patients (mean difference of –17.81% to –3.37%; P<.05). represents this disparity in performance graphically, providing the point estimates of FNR for the sexes separately and highlighting that the disparity in FNR persisted across the variations in training data and selected features.
Disparity in model performance (score for male patients – score for female patients) | Feature subset used in model training | ||||||||
Features with sex | P value | Features without sex | P value | Biochemical features | P value | Clinical features | P value | ||
Sex-imbalanced training data | |||||||||
Accuracy disparity (%) | 1.63 | .03a | –0.72 | .30 | 0.10 | .88 | –0.50 | .49 | |
ROC_AUCb disparity (%) | 3.14 | <.01a | 0.43 | .61 | 1.51 | .09 | 0.47 | .60 | |
FNRc disparity (%) | –7.53 | <.01a | –3.84 | .02a | –5.15 | .01a | –3.49 | .049a | |
FPRd disparity (%) | 1.26 | .07 | 2.97 | <.01a | 2.11 | <.01a | 2.56 | <.01a | |
Sex-balanced training data | |||||||||
Accuracy disparity (%) | –4.78 | <.01a | –7.25 | <.01a | –9.42 | <.01a | –3.63 | <.01a | |
ROC_AUC disparity (%) | 7.0 | <.01a | 4.27 | <.01a | 0.15 | .83 | 8.32 | <.01a | |
FNR disparity (%) | –17.81 | <.01a | –13.91 | <.01a | –3.37 | .04a | –16.09 | <.01a | |
FPR disparity (%) | 3.90 | <.01a | 5.37 | <.01a | 3.07 | <.001a | –0.54 | .24 | |
Female training data | |||||||||
Accuracy disparity (%) | –10.95 | <.01a | –9.75 | <.01a | –12.32 | <.01a | –9.64 | <.01a | |
ROC_AUC disparity (%) | 0.60 | .57 | 0.57 | .23 | –2.92 | <.01a | –0.53 | .07 | |
FNR disparity (%) | –7.42 | <.01a | –10.91 | <.01a | –2.24 | .27 | 1.55 | .01a | |
FPR disparity (%) | 8.61 | <.01a | 9.77 | <.01a | 8.08 | <.01a | –0.48 | .04a | |
Male training data | |||||||||
Accuracy disparity (%) | –5.46 | <.01a | –5.73 | <.01a | –8.73 | <.01a | –2.46 | <.01a | |
ROC_AUC disparity (%) | 4.98 | <.01a | 4.54 | <.01a | –1.59 | .049a | 8.32 | <.01a | |
FNR disparity (%) | –13.96 | <.01a | –13.32 | <.01a | –1.68 | .33 | –16.58 | <.01a | |
FPR disparity (%) | 4.00 | <.01a | 4.24 | <.01a | 4.86 | <.01a | –0.06 | .35 |
aIndicates a statistically significant difference (P<.05) between the model’s performance on male versus female patients.
bROC_AUC: area under the receiver operating characteristic curve.
cFNR: false negative rate.
dFPR: false positive rate.
Disparity in model performance (score for male patients – score for female patients) | Feature subset used in model training | ||||||||
Features with sex | P value | Features without sex | P value | Biochemical features | P value | Clinical features | P value | ||
Sex-imbalanced training data | |||||||||
Accuracy disparity (%) | 0.32 | .50 | 0.64 | .17 | 0.13 | .80 | 0.25 | .61 | |
ROC_AUCb disparity (%) | 3.86 | <.01a | 4.24 | <.01a | 3.05 | <.01a | 3.91 | <.01a | |
FNRc disparity (%) | –11.66 | <.01a | –12.52 | <.01a | –10.81 | <.01a | –12.38 | <.01a | |
FPRd disparity (%) | 3.94 | <.01a | 4.04 | <.01a | 4.71 | <.01a | 4.57 | <.01a | |
Sex-balanced training data | |||||||||
Accuracy disparity (%) | –4.01 | <.01a | –5.12 | <.01a | –7.32 | <.01a | –2.86 | <.01a | |
ROC_AUC disparity (%) | –3.89 | .01a | –4.91 | .01a | –7.18 | <.001a | –2.75 | <.01a | |
FNR disparity (%) | 7.69 | <.01a | 10.54 | <.01a | 15.59 | <.01a | 6.61 | <.01a | |
FPR disparity (%) | 0.10 | .87 | –0.72 | .19 | –1.23 | .29 | –1.11 | .06 | |
Female training data | |||||||||
Accuracy disparity (%) | –9.25 | <.01a | –11.34 | <.01a | –11.49 | <.01a | –8.69 | <.01a | |
ROC_AUC disparity (%) | –8.97 | <.01a | –10.95 | <.01a | –11.10 | <.01a | –8.45 | <.01a | |
FNR disparity (%) | 18.98 | <.01a | 22.60 | <.01a | 27.23 | <.01a | 17.86 | <.01a | |
FPR disparity (%) | –1.04 | .07 | –0.70 | .20 | –5.02 | <.01a | –0.96 | .09 | |
Male training data | |||||||||
Accuracy disparity (%) | 6.38 | <.01a | 5.66 | <.01a | –1.66 | .02a | 6.10 | <.01a | |
ROC_AUC disparity (%) | 6.30 | <.01a | 5.57 | <.01a | 1.52 | .07 | 5.86 | .01a | |
FNR disparity (%) | –10.12 | <.01a | –10.10 | <.001a | 1.67 | .17 | –12.64 | <.01a | |
FPR disparity (%) | –2.48 | <.01a | –1.04 | .07 | 1.38 | .24 | 0.92 | .15 |
aIndicates a statistically significant difference (P<.05) between the model’s performance on male versus female patients. To determine statistical significance, the Kolmogorov-Smirnov tests were first run on the sex-stratified results to determine the distribution of data (normal or not). Independent 2-tailed t tests were used where data were normally distributed, and Mann-Whitney U tests were used when data were not normally distributed.
bROC_AUC: area under the receiver operating characteristic curve.
cFNR: false negative rate.
dFPR: false positive rate.
A smaller disparity in the FPR was statistically significant for male patients in 13 out of 16 experiments (–0.48% to +9.77%; P<.05). The sex performance disparities in accuracy and area under the receiver operating characteristic curve varied depending on the underlying shifts in the error rates for each sex (
and ). On examining the individual error rates, we see consistencies in the sex disparities across feature sets, most notably an overprediction of disease for male patients (higher FPR) and an underprediction of disease for female patients (higher FNR: ).
Our findings for data set 2 were similar to those for data set 1, such that models built on the original sex-imbalanced data set demonstrated a higher FNR for female patients (mean difference of –10.81% to –12.52%; P<.05;
) and a higher FPR for male patients (3.94% to 4.71%; P<.05; ). visualizes the disparity graphically, and demonstrates that, unlike in data set 1, the disparity in error rates reversed when training on sex-balanced data and female-only data ( ). illustrates the disparity in accuracy between the sexes, where we see that the direction of the disparity varies depending on the training data and feature set ( ).
Variations in Training Data
Sex-Balanced Training Data
Training on sex-balanced data led to a fall in mean accuracy for all patients in data set 1 (76%, SD 3.46% vs 84.24%, SD 3.51%), with a more substantial drop in mean accuracy for male patients (73.61%, SD 4.84% vs 84.84%, SD 4.16%;
and ). The opposite trend was seen in data set 2, with models trained on sex-balanced data outperforming models trained on sex-imbalanced data for all patients (87.65%, SD 1.77% vs 85.72%, SD 1.75%) and for female patients (89.66%, SD 2.44% vs 85.48%, SD 4.12%; ). The models trained on sex-balanced data in data set 2 reduced the FNR for both sexes when using the full feature set (female patients 4.79%, SD 2.58% vs 24.86%, SD 11.35%; male patients 12.48%, SD 4.11% vs 13.19%, SD 3.26%; and ). The differences between the data sets may relate to underlying differences in the 2 cardiac conditions. Further, the failure to improve performance with sex-balanced training data may reflect the issues of mixing data that have conflicting indicators for disease.
Results | Data set 1 (heart failure) | Data set 2 (coronary artery disease) | |||||||
Sex-imbalanced training data (n=209) | Sex-balanced training data (n=272) | Female training data (n=136) | Male training data (n=136) | Sex-imbalanced training data (n=522) | Sex-balanced training data (n=715) | Female training data (n=358) | Male training data (n=358) | ||
All patients, mean accuracy (SD) | 84.24 (3.51) | 76.0 (3.46) | 74.68 (3.53) | 75.12 (3.71) | 85.72 (1.75) | 87.65 (1.77) | 86.06 (1.67) | 82.63 (1.94) | |
Female patients, mean accuracy (SD) | 83.21 (6.37) | 78.39 (19.68) | 80.15 (4.43) | 77.85 (5.21) | 85.48 (4.12) | 89.66 (2.44) | 90.69 (2.38) | 79.44 (3.20) | |
Male patients, mean accuracy (SD) | 84.84 (4.16) | 73.61 (4.84) | 69.20 (5.96) | 72.39 (5.32) | 85.80 (2.14) | 85.65 (2.23) | 81.44 (3.02) | 85.82 (2.30) | |
Female patients, mean FNRa (SD) | 35.98 (16.72) | 85.25 (14.58) | 74.04 (17.68) | 78.66 (14.0) | 24.86 (11.35) | 4.79 (2.58) | 4.00 (2.74) | 22.32 (5.25) | |
Male patients, mean FNR (SD) | 28.45 (10.41) | 67.43 (16.6) | 66.62 (17.32) | 64.70 (14.9) | 13.19 (3.26) | 12.48 (4.11) | 22.97 (5.20) | 12.20 (3.41) |
aFNR: false negative rate.
Sex-Specific Training Data
For data set 1, the mean accuracy for all patients when trained on sex-imbalanced data (84.24%, SD 3.51%) falls when training on either female-specific data (74.68%, SD 3.53%) or male-specific data (75.12%, SD 3.71%), likely because of the smaller training sets. For data set 2, the mean accuracy for all patients when trained on sex-imbalanced data (85.72%, SD 1.75%) improves when training on female-specific data (86.06%, SD 1.67%) and falls when training on male-specific data (82.63%, SD 1.94%). The overall improvement seen in the data set 2 models when trained on female data relates to the increase in accuracy for female patients (90.69%, SD 2.38% vs 85.48%, SD 4.12%) co-occurring with a smaller decrease in accuracy for male patients (81.44%, SD 3.02% vs 85.80%, SD 2.14%;
and ).
Unsurprisingly, performance for each sex is lowest when trained on the opposing sex (
, - ). In data set 1, same-sex training was preferable to opposite-sex training; however, this did not improve results compared to the models built from sex-imbalanced and sex-balanced training data, likely relating to the smaller sample size ( ). In contrast, data set 2 had greater training data available and demonstrated that sex-specific training is beneficial to both sexes above the sex-imbalanced models ( ).
Variations in Feature Sets
Models built on the biochemical features subset gave the worst performance in terms of accuracy and FNR (
- ). For data set 2, biochemical features included only cholesterol and fasting blood sugar, so the fall in performance may relate to information loss. Additionally, Table S5 in highlights the different biochemical profiles of male and female patients with disease, with female patients demonstrating a far higher mean cholesterol level than their male counterparts (279.2 vs 247.5).
FAGTB Model
The disparity in false negative rate (DispFNR) was consistently higher than the disparity in false positive rate (
). Compared to the gradient boosting classifier, the FAGTB reduced the DispFNR for both data sets (data set 1: 0.20 vs 0.21; data set 2: 0.19 vs 0.28); however, the DispFNR that disadvantaged female patients persisted. The fall in DispFNR that occurred with FAGTB was associated with a fall in overall accuracy for both data sets (data set 1: 71.2 vs 71.3; data set 2: 82.9 vs 86.3), while the disparity in false positive rate was unchanged at the reported precision.
Results on test set, averaged over 10 experiments | Gradient boosting classifier | FAGTB | |||
Data set 1 (heart failure): experiments run on sex-imbalanced data with all features (averaged over 10 experiments) | |||||
Accuracy | 71.3 | 71.2 | |||
DispFPRa | 0.08 | 0.08 | |||
DispFNRb | 0.21 | 0.20 | |||
Data set 2 (coronary artery disease): experiments run on sex-imbalanced data with all features (averaged over 10 experiments) | |||||
Accuracy | 86.3 | 82.9 | |||
DispFPR | 0.06 | 0.06 | |||
DispFNR | 0.28 | 0.19 |
aDispFPR: disparity in false positive rate.
bDispFNR: disparity in false negative rate.
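To make the two disparity measures concrete, the following is a minimal Python sketch rather than the code used in this study. It assumes that each disparity is the absolute difference between the female and male error rates, which this section does not state explicitly; the function names, the "F"/"M" coding, and the toy labels are hypothetical and serve only as illustration.

```python
from collections import Counter


def error_rates(y_true, y_pred):
    """Return (FNR, FPR) for binary labels, where 1 means disease present."""
    counts = Counter(zip(y_true, y_pred))
    tp, fn = counts[(1, 1)], counts[(1, 0)]
    tn, fp = counts[(0, 0)], counts[(0, 1)]
    fnr = fn / (fn + tp) if (fn + tp) else float("nan")
    fpr = fp / (fp + tn) if (fp + tn) else float("nan")
    return fnr, fpr


def disparities(y_true, y_pred, sex):
    """Assumed definition: absolute gap in FNR and FPR between the sexes."""
    rates = {}
    for g in ("F", "M"):
        idx = [i for i, s in enumerate(sex) if s == g]
        rates[g] = error_rates([y_true[i] for i in idx],
                               [y_pred[i] for i in idx])
    disp_fnr = abs(rates["F"][0] - rates["M"][0])
    disp_fpr = abs(rates["F"][1] - rates["M"][1])
    return disp_fnr, disp_fpr


# Toy labels invented for illustration; not drawn from either data set.
y_true = [1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 0]
sex = ["F"] * 6 + ["M"] * 6

disp_fnr, disp_fpr = disparities(y_true, y_pred, sex)
print(f"DispFNR={disp_fnr:.2f}, DispFPR={disp_fpr:.2f}")
```

Because the absolute difference discards direction, a signed difference would be needed to show which sex is disadvantaged.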
Discussion
Principal Findings
Our study sheds light on an important gap in existing cardiac ML research, with significant implications for digital health equity. We find that the majority of published ML studies predicting HF fail to acknowledge the underrepresentation of female patients in their data sets and do not perform stratified model evaluations, thus failing to assess sex disparities in algorithmic performance. Our secondary evaluation of 2 cardiac data sets exposed a neglected sex disparity in model performance, highlighting the importance of integrating these methods into future studies that use ML methods for cardiac modeling. In our approach, we identified several potential sources of algorithmic bias.
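As an illustration of what a stratified model evaluation might look like in practice, the sketch below reports group size, share of the sample, and accuracy separately for each sex alongside the pooled accuracy. It is a hypothetical example under simple assumptions (binary labels, a single recorded sex variable); the function name and the toy labels are invented and are not taken from the study.

```python
def stratified_accuracy(y_true, y_pred, group):
    """Per-group sample count, share of the sample, and accuracy."""
    total = len(y_true)
    report = {}
    for g in sorted(set(group)):
        idx = [i for i, s in enumerate(group) if s == g]
        correct = sum(y_true[i] == y_pred[i] for i in idx)
        report[g] = {
            "n": len(idx),
            "share": len(idx) / total,
            "accuracy": correct / len(idx),
        }
    return report


# Toy labels invented for illustration; female patients are the minority here.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]
y_pred = [0, 0, 1, 1, 0, 1, 0, 1, 1, 0]
sex = ["F", "F", "F", "M", "M", "M", "M", "M", "M", "M"]

report = stratified_accuracy(y_true, y_pred, sex)
overall = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

print(f"overall accuracy: {overall:.2f}")
for g, row in report.items():
    print(f"{g}: n={row['n']}, share={row['share']:.0%}, "
          f"accuracy={row['accuracy']:.2f}")
```

In this toy example, the pooled accuracy is dominated by the larger group, so a lower accuracy in the smaller group only becomes visible once the results are stratified.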
First, we detected an underrepresentation of female patients in training data sets, which may produce inequalities in model fidelity. Despite introducing oversampling techniques to address this imbalance, the disparities in performance persisted, suggesting that addressing data set representation alone is not sufficient to mitigate bias. Further, our experiments demonstrated that oversampling could reduce overall performance, which may result from the mixing of conflicting data (ie, male vs female feature rankings). In addition, oversampling with synthetic instances drawn solely from the data set at hand does not provide the model with more information; it simply redirects attention and therefore cannot easily compensate for demographic underrepresentation [ ]. When balancing the data set, we did not use undersampling because our data sets were small; however, this may be a potential avenue for future research.
Second, we considered featurization and highlighted sex differences in the biochemical manifestation of disease. In current clinical practice, the diagnostic parameters used for identifying pathology are drawn from research trials dominated by male physiology; it is perhaps unsurprising, therefore, that algorithms built from these data tend to underperform in female disease. There is a growing body of research that critiques the use of unisex thresholds for biochemical tests in medicine; our sex-stratified analysis of the cardiac data sets and the identified sex differences in feature rankings support these proposals [ ].
There are further sources of inequitable performance that our evaluation cannot distinguish between. It may be that sex differences in the physiological expression of disease make the prediction harder to extract from 1 population. As a result, 1 sex may require more complex models than the other, with differing architecture and degrees of flexibility. It may also simply be that 1 group is less predictable than the other, such that if the physiology of 1 group is more opaque, it may ultimately not be possible to resolve the observed disparities. McCradden and colleagues [ ] detail this challenge further in their review, highlighting that differences across groups may not always indicate inequity. There are complex causal relationships between biological, environmental, and social factors that underpin the differences in disease rates seen across population subgroups [ ]. While models must not promote different standards of care according to protected characteristics, differences between groups may not necessarily reflect discriminatory practice [ ].
Our research was limited by the available information in the data sets. The absence of race or ethnicity data precluded the evaluation of their effects. Furthermore, the absence of other demographic data in the studies we identified prevented the investigation of health inequities that might impact the LGBTQ+ (lesbian, gay, bisexual, transgender, queer) community, disadvantaged socioeconomic groups, or other subgroups. Previous research has described historic and institutional biases that contribute to worse health outcomes for these groups, and evolving AI systems require the same scrutiny to ensure these harms do not become embedded within digital systems [ - ].
Throughout this paper, we have used the terms male and female to reference biological sex, so as not to conflate sex and gender. With the ongoing problematic conflation of sex and gender in medicine, stratification of model performance by either sex or gender is often impossible, which was noted in our own work [ - ]. Beyond the features discussed above, there is a wide range of additional factors that we cannot account for. For example, creatine phosphokinase was a key feature in HF modeling, yet existing studies have demonstrated variation in these levels for manual laborers and athletes, illustrating how occupation may impact a patient’s physiology [ ].
To account for the complex interactions that potentiate disease, and the heterogeneous nature of patient cohorts, we require more complex modeling capable of capturing the full range of intersecting factors influencing patient health (eg, sex differences may be mediated by income). Unsupervised high-dimensional representation learning may be the path forward for this purpose [ ]. In addition to improving representation, unsupervised techniques enable us to detect neglected subpopulations without predetermining a characteristic of interest, facilitating the identification of the previously overlooked disadvantaged. In this sense, AI may provide a route forward to uncovering and addressing bias, by deploying more complex modeling that can improve patient representation and by revealing previously neglected disparities in the provision of care.
Conclusions and Limitations
In our paper, we have identified inequities in the performance of cardiac ML algorithms. Our findings are limited by the small size of the uncovered data sets, reducing their potential generalizability, and hence we propose that larger studies focused on this issue are required. These data sets also came from the same source, as we found a limited number of open-access databases due to the confidential nature of patient data and issues of proprietary ownership. In addition, we focused on RF models to replicate the papers uncovered in our literature search; however, ML models may differ in their degrees of performance disparity, and an evaluation across the range of ML model options is an important next step.
In our paper, we did not attempt to solve bias; instead, we highlighted a problem that exists throughout cardiology and requires further attention. The issue we have identified in these ML models is a foundational problem across medical modeling, arising in any instance where an “average” is applied to a diverse population. It is possible that unsupervised ML and complex representational modeling may be a route forward for capturing heterogeneity in a previously unattainable manner and addressing issues of bias [ ]. Our findings demonstrate that examining performance inequities across demographic subgroups is an essential approach for identifying biases in AI and preventing the perpetuation of inequalities in digital health systems.
Acknowledgments
The data sets analyzed during this study are publicly available. Data set 1 is available from the University of California Irvine Machine Learning Repository [ ]. Data set 2 is available from the IEEE Dataport Repository [ ]. This work was supported by UK Research and Innovation (UKRI; EP/S021612/1).
Conflicts of Interest
None declared.
Multimedia Appendix 2
Details of Fair Adversarial Gradient Tree Boosting.
PDF File (Adobe PDF File), 92 KB
References
- O'Neil C. Weapons of math destruction: how big data increases inequality and threatens democracy. New York City, U.S. Crown; 2017.
- Daneshjou R, Vodrahalli K, Novoa RA, Jenkins M, Liang W, Rotemberg V, et al. Disparities in dermatology AI performance on a diverse, curated clinical image set. Sci Adv. 2022;8(32):eabq6147.
- Seyyed-Kalantari L, Liu G, McDermott M, Chen IY, Ghassemi M. CheXclusion: fairness gaps in deep chest X-ray classifiers. Biocomputing. 2021:232-243.
- Thompson HM, Sharma B, Bhalla S, Boley R, McCluskey C, Dligach D, et al. Bias and fairness assessment of a natural language processing opioid misuse classifier: detection and mitigation of electronic health record data disadvantages across racial subgroups. J Am Med Inform Assoc. 2021;28(11):2393-2403.
- Obermeyer Z, Powers B, Vogeli C, Mullainathan S. Dissecting racial bias in an algorithm used to manage the health of populations. Science. 2019;366(6464):447-453.
- Cirillo D, Catuara-Solarz S, Morey C, Guney E, Subirats L, Mellino S, et al. Sex and gender differences and biases in artificial intelligence for biomedicine and healthcare. NPJ Digit Med. 2020;3:81.
- Liu X, Hu P, Yeung W, Zhang Z, Ho V, Liu C, et al. Illness severity assessment of older adults in critical illness using machine learning (ELDER-ICU): an international multicentre study with subgroup bias evaluation. Lancet Digit Health. 2023;5(10):e657-e667.
- Grari V, Ruf B, Lamprier S, Detyniecki M. Fair adversarial gradient tree boosting. 2019. Presented at: 2019 IEEE International Conference on Data Mining (ICDM); November 8-11, 2019:1060-1065; Beijing, China. URL: https://ieeexplore.ieee.org/document/8970941
- Savarese G, Becher PM, Lund LH, Seferovic P, Rosano GMC, Coats AJS. Global burden of heart failure: a comprehensive and updated review of epidemiology. Cardiovasc Res. 2022;118(17):3272-3287.
- Goldraich L, Beck-da-Silva L, Clausell N. Are scores useful in advanced heart failure? Expert Rev Cardiovasc Ther. 2009;7(8):985-997.
- Treece J, Chemchirian H, Hamilton N, Jbara M, Gangadharan V, Paul T, et al. A review of prognostic tools in heart failure. Am J Hosp Palliat Med. 2018;35(3):514-522.
- Thorvaldsen T, Benson L, Ståhlberg M, Dahlström U, Edner M, Lund LH. Triage of patients with moderate to severe heart failure: who should be referred to a heart failure center? J Am Coll Cardiol. 2014;63(7):661-671.
- Escamilla AKG, Hassani AHE, Andres E. A comparison of machine learning techniques to predict the risk of heart failure. In: Machine Learning Paradigms: Applications of Learning and Analytics in Intelligent Systems. Switzerland AG. Springer; 2019:9-26.
- Sullivan K, Doumouras BS, Santema BT, Walsh MN, Douglas PS, Voors AA, et al. Sex-specific differences in heart failure: pathophysiology, risk factors, management, and outcomes. Can J Cardiol. 2021;37(4):560-571.
- Walsh MN, Jessup M, Lindenfeld J. Women with heart failure: unheard, untreated, and unstudied. J Am Coll Cardiol. 2019;73(1):41-43.
- Sobhani K, Castro DKN, Fu Q, Gottlieb RA, Van Eyk JE, Merz CNB. Sex differences in ischemic heart disease and heart failure biomarkers. Biol Sex Differ. 2018;9(1):43.
- Straw I. The automation of bias in medical Artificial Intelligence (AI): decoding the past to create a better future. Artif Intell Med. 2020;110:101965.
- Hamberg K. Gender bias in medicine. Womens Health (Lond). 2008;4(3):237-243.
- Krieger N, Fee E. Man-made medicine and women's health: the biopolitics of sex/gender and race/ethnicity. Int J Health Serv. 1994;24(2):265-283.
- PRISMA flow diagram. PRISMA. URL: https://www.prisma-statement.org/prisma-2020-flow-diagram [accessed 2024-07-09]
- Tanvir AAM, Bhatti SH, Aftab M, Raza MA. Heart failure clinical records data set F. University of California Irvine Machine Learning Repository. 2020. URL: https://archive.ics.uci.edu/dataset/519/heart+failure+clinical+records [accessed 2024-05-17]
- Siddhartha M. Heart disease dataset (comprehensive). IEEE Dataport. 2020. URL: https://ieee-dataport.org/open-access/heart-disease-dataset-comprehensive [accessed 2024-05-17]
- Chicco D, Jurman G. Machine learning can predict survival of patients with heart failure from serum creatinine and ejection fraction alone. BMC Med Inform Decis Mak. 2020;20(1):16.
- Buolamwini J, Gebru T. Gender shades: intersectional accuracy disparities in commercial gender classification. 2018. Presented at: 1st Conference on Fairness, Accountability and Transparency, PMLR 81; February 23-24, 2018:77-91; New York, NY. URL: https://proceedings.mlr.press/v81/buolamwini18a.html
- Mishra P, Pandey CM, Singh U, Gupta A, Sahu C, Keshri A. Descriptive statistics and normality tests for statistical data. Ann Card Anaesth. 2019;22(1):67-72.
- Afrose S, Song W, Nemeroff CB, Lu C, Yao DD. Subpopulation-specific machine learning prognosis for underrepresented patients with double prioritized bias correction. Commun Med (Lond). 2022;2:111.
- Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. J Artif Intell Res. 2002;16(3):321-357.
- Islam SR, Eberle W, Ghafoor SK, Ahmed M. Explainable artificial intelligence approaches: a survey. arXiv. Preprint posted online on January 23, 2021.
- Lundberg SM, Lee SI. A unified approach to interpreting model predictions. 2017. Presented at: NIPS'17: 31st International Conference on Neural Information Processing Systems; December 4-9, 2017:4768-4777; Long Beach, CA. URL: https://dl.acm.org/doi/proceedings/10.5555/3295222
- Borgese M, Joyce C, Anderson EE, Churpek MM, Afshar M. Bias assessment and correction in machine learning algorithms: a use-case in a natural language processing algorithm to identify hospitalized patients with unhealthy alcohol use. AMIA Annu Symp Proc. 2022;2021:247-254.
- Allen A, Mataraso S, Siefkas A, Burdick H, Braden G, Dellinger RP, et al. A racially unbiased, machine learning approach to prediction of mortality: algorithm development study. JMIR Public Health Surveill. 2020;6(4):e22400.
- Tison GH, Avram R, Nah G, Klein L, Howard BV, Allison MA, et al. Predicting incident heart failure in women with machine learning: the women's health initiative cohort. Can J Cardiol. 2021;37(11):1708-1714.
- Pombo G, Gray R, Cardoso MJ, Ourselin S, Rees G, Ashburner J, et al. Equitable modelling of brain imaging by counterfactual augmentation with morphologically constrained 3D deep generative models. Med Image Anal. 2023;84:102723.
- McCradden MD, Joshi S, Mazwi M, Anderson JA. Ethical limitations of algorithmic fairness solutions in health care machine learning. Lancet Digit Health. 2020;2(5):e221-e223.
- Safer JD, Coleman E, Feldman J, Garofalo R, Hembree W, Radix A, et al. Barriers to healthcare for transgender individuals. Curr Opin Endocrinol Diabetes Obes. 2016;23(2):168-171.
- Rutherford L, Stark A, Ablona A, Klassen BJ, Higgins R, Jacobsen H, et al. Health and well-being of trans and non-binary participants in a community-based survey of gay, bisexual, and queer men, and non-binary and two-spirit people across Canada. PLoS One. 2021;16(2):e0246525.
- Beckwith N, McDowell MJ, Reisner SL, Zaslow S, Weiss RD, Mayer KH, et al. Psychiatric epidemiology of transgender and nonbinary adult patients at an urban health center. LGBT Health. 2019;6(2):51-61.
- Vejjajiva A, Teasdale GM. Serum creatine kinase and physical exercise. Br Med J. 1965;1(5451):1653-1654.
- Carruthers R, Straw I, Ruffle JK, Herron D, Nelson A, Bzdok D, et al. Representational ethical model calibration. NPJ Digit Med. 2022;5(1):170.
Abbreviations
AI: artificial intelligence
CAD: coronary artery disease
DispFNR: disparity in false negative rate
EF: ejection fraction
FAGTB: Fair Adversarial Gradient Tree Boosting
FNR: false negative rate
FPR: false positive rate
HF: heart failure
LGBTQ+: lesbian, gay, bisexual, transgender, queer
ML: machine learning
PRISMA: Preferred Reporting Items for Systematic Reviews and Meta-Analyses
RF: random forest
Edited by A Mavragani; submitted 03.03.23; peer-reviewed by J Zeng, S Antani, E van der Velde, L Guo; comments to author 16.06.23; revised version received 13.10.23; accepted 04.05.24; published 26.08.24.
Copyright©Isabel Straw, Geraint Rees, Parashkev Nachev. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 26.08.2024.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.