Columbia Open Health Data for COVID-19 Research: Database Analysis

doi:10.2196/31122

Original Paper

Columbia University, New York, NY, United States

Corresponding Author:

Chunhua Weng, PhD

Columbia University

Ph-20, 622 W 168 ST

New York, NY

United States

Phone: 1 212 304 7907

Email: cw2384@cumc.columbia.edu

Background: COVID-19 has threatened the health of tens of millions of people all over the world. Massive research efforts have been made in response to the COVID-19 pandemic. Utilization of clinical data can accelerate these research efforts to combat the pandemic since important characteristics of the patients are often found by examining the clinical data. Publicly accessible clinical data on COVID-19, however, remain limited despite the immediate need.

Objective: To provide shareable clinical data to catalyze COVID-19 research, we present Columbia Open Health Data for COVID-19 Research (COHD-COVID), a publicly accessible database providing clinical concept prevalence, clinical concept co-occurrence, and clinical symptom prevalence for hospitalized patients with COVID-19. COHD-COVID also provides data on hospitalized patients with influenza and general hospitalized patients as comparator cohorts.

Methods: The data used in COHD-COVID were obtained from NewYork-Presbyterian/Columbia University Irving Medical Center’s electronic health records database. Condition, drug, and procedure concepts were obtained from the visits of identified patients from the cohorts. Rare concepts were excluded, and the true concept counts were perturbed using Poisson randomization to protect patient privacy. Concept prevalence, concept prevalence ratio, concept co-occurrence, and symptom prevalence were calculated using the obtained concepts.

Results: Concept prevalence and concept prevalence ratio analyses showed the clinical characteristics of the COVID-19 cohorts, confirming the well-known characteristics of COVID-19 (eg, acute lower respiratory tract infection and cough). The concepts related to the well-known characteristics of COVID-19 recorded high prevalence and high prevalence ratio in the COVID-19 cohort compared to the hospitalized influenza cohort and general hospitalized cohort. Concept co-occurrence analyses showed potential associations between specific concepts. In case of acute lower respiratory tract infection in the COVID-19 cohort, a high co-occurrence ratio was obtained with COVID-19–related concepts and commonly used drugs (eg, disease due to coronavirus and acetaminophen). Symptom prevalence analysis indicated symptom-level characteristics of the cohorts and confirmed that well-known symptoms of COVID-19 (eg, fever, cough, and dyspnea) showed higher prevalence than the hospitalized influenza cohort and the general hospitalized cohort.

Conclusions: We present COHD-COVID, a publicly accessible database providing useful clinical data for hospitalized patients with COVID-19, hospitalized patients with influenza, and general hospitalized patients. We expect COHD-COVID to provide researchers and clinicians quantitative measures of COVID-19–related clinical features to better understand and combat the pandemic.

J Med Internet Res 2021;23(9):e31122

doi:10.2196/31122

Keywords

COVID-19; open data; electronic health record; data science; research; data; access; database; symptom; cohort; prevalence

COVID-19 has threatened the health of tens of millions of people all over the world. The global pandemic caused by COVID-19 has sparked massive research efforts in the fight against the novel disease, including characterizing the disease and clinical progression, identifying risk factors for hospitalization, and finding drugs that can be repurposed to lessen disease severity [1-3]. Utilization of clinical data from different institutions, hospitals, and nations can accelerate these research efforts since important characteristics of the patients are often found by examining the shared clinical data. Although many studies sharing epidemiological data [4,5], public health [6], and social measures [7] for COVID-19 research have been conducted, publicly accessible clinical data on COVID-19 remain limited despite the immediate need [8], mainly owing to the potential risk to patient privacy that can still exist even after deidentification of the data [9].

Recognizing the need for publicly accessible electronic health record (EHR)–derived data in a broad range of clinical and translational research, we previously developed Columbia Open Health Data (COHD). COHD provides open and easy access to prevalence and co-occurrence statistics on conditions, drugs, procedures, and demographics derived from structured EHR data from NewYork-Presbyterian/Columbia University Irving Medical Center (NYP/CUIMC) [10], which serves the large and diverse population of NYC and its surrounding areas. Since its deployment, COHD has accelerated biomedical research by providing 2 informative resources, prevalence and co-occurrence statistics, and their derived association metrics [10-12].

NYC was one of the first epicenters of COVID-19 in the United States with the first confirmed case on February 29, 2020 [13]. As one of the largest academic medical centers in NYC, NYP/CUIMC has admitted more than 4000 patients as of September 1, 2020. We aim to provide shareable clinical data to catalyze future COVID-19 research by presenting Columbia Open Health Data for COVID-19 Research (COHD-COVID), a publicly accessible database providing clinical concept prevalence, the clinical concept prevalence ratio between cohorts, clinical concept co-occurrence, and clinical symptom prevalence for a cohort of hospitalized patients with COVID-19 and comparator cohorts (a cohort of hospitalized patients with influenza and a general hospitalized patient cohort) derived from NYP/CUIMC’s electronic health records. In addition to providing publicly accessible data files via the Figshare data repository, we also developed the COHD-COVID web application programming interface (API) for easy access and better usability.

Methods Overview

We used the term “concept” to refer to clinical entities and events such as conditions (ie, diagnosis), drugs, and procedures. The concepts and their names are defined by the Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM). When concepts are referenced in this paper, the name of the concept is styled in italics (eg, Disorder of respiratory system) to distinguish the formalized concepts from regular text. We also styled entities in the OMOP CDM (eg, person_id column in condition_occurrence table) in italics.

Figure 1 depicts the overall workflow to create COHD-COVID. Columbia’s clinical data warehouse was converted to the OMOP CDM. We first filtered the EHR data in accordance with the cohort definitions, and then EHR for each patient’s inpatient visits were identified. Condition, drug, and procedure concepts were obtained from the identified visits. Concept prevalence, concept prevalence ratio, concept co-occurrence, and symptom prevalence analyses followed using the obtained concepts. To protect patient privacy, we excluded rare concepts observed in 10 or fewer visits and perturbed the true counts using Poisson randomization. Perturbation of the true counts and exclusion of rare concepts reduce the uniqueness of individual in the data, which can minimize the risk of reidentification [9]. Perturbed counts generated by the Poisson randomization process do not show a significant difference from the true counts [10]. The resulting data were stored in a MySQL database and made publicly available via the COHD-COVID web API [14]. All analyses were conducted using Python 3.5.2. This study received institutional review board approval with a waiver for informed consent.

Figure 1. Overall workflow of Columbia Open Health Data for COVID-19 Research (COHD-COVID). API: application programming interface; NYP/CUIMC: NewYork-Presbyterian/Columbia University Irving Medical Center; OMOP CDM: Observational Medical Outcomes Partnership Common Data Model.

Data Source

We utilized EHR data from the NYP/CUIMC’s clinical data warehouse, where inpatient and outpatient data dating back to 1985 are stored. EHR data were collected during routine clinical care. Patients were notified of potential uses of their data for research at the point of care and data collection. NYP/CUIMC has converted its clinical data warehouse to OMOP CDM on a regular basis. Since NYP/CUIMC covers NYC and the surrounding area, which has a diverse population of 8.2 million people and was an early COVID-19 epicenter, the EHR data from NYP/CUIMC can provide a diverse and large sample of patients with COVID-19.

Three different patient cohorts were used in this study. The COVID-19 cohort was defined as hospitalized patients aged ≥18 years with a COVID-19–related condition diagnosis or a confirmed positive COVID-19 test result during their hospitalization period or within the prior 21 days. Patients identified with the COVID-19 cohort definition from March 1, 2020, to September 1, 2020, were included in the COVID-19 cohort. The influenza cohort was similarly defined as patients aged ≥18 years who had at least 1 occurrence of influenza conditions or precoordinated positive measurements or positive influenza test results during their hospitalization period or within the prior 21 days. The general cohort was defined as all hospitalized patients aged ≥18 years. Patient visits from calendar years 2014 to 2019 were included for the influenza and general cohorts. All cohorts were divided into subcohorts stratified by sex (male vs female) and age (adults aged 18-64 years vs seniors aged >65 years) for further investigation. The COVID-19 and influenza cohort definitions were adapted from the cohort definitions created by the Observational Health Data Science and Informatics’ (OHDSI’s) international network study for COVID-19 [3]. Summary statistics of the cohorts are provided in Table 1.

Patients belonging to the cohorts based on the 3 cohort definitions above were identified using the unique person_id from the person table in the OMOP CDM. Condition, drug, and procedure concepts observed in these patients during inpatient visits were extracted from the condition_occurrence, drug_exposure, and procedure_occurrence tables in the OMOP CDM, respectively. Inpatient visits of patients and the concepts in these visits were identified using person_id along with visit_occurrence_id from the visit_occurrence table. We used visit-based counts instead of patient-based counts in the following analyses for robust comparison between cohorts that had different observation windows. For example, if we use patient-based counts, the patients included in cohorts with longer observation windows are likely to be observed with more concepts than the patients in cohorts with shorter observation windows, which could inject bias into the metrics used in the analyses. Thus, we used visit-based counts to mitigate the effect of different observation window lengths to reduce the bias.

Table 1. Basic statistics of 3 cohorts in the Columbia Open Health Data for COVID-19 Research database. The statistics summarized here are based on the data from NewYork-Presbyterian/Columbia University Irving Medical Center as of September 1, 2020.

	COVID-19 cohort	Influenza cohort	General cohort
Time range	March 1, 2020, to August 31, 2020	January 1, 2014, to December 12, 2019	January 1, 2014, to December 12, 2019
Patients, n	4127	3261	175,930
Total inpatient visits, n	4846	3860	314,680
Male patients; inpatient visits, n (%)	2103 (51.0); 2518 (52.0)	1454 (44.6); 1732 (44.9)	67,662 (38.5); 128,908 (41.0)
Female patients; inpatient visits, n (%)	2024 (49.0); 2328 (48.0)	1807 (55.4); 2128 (55.1)	108,268 (61.5); 185,772 (59.0)
Adult patients (aged 18-64 years); inpatient visits, n (%)	2147 (52.0); 2511 (51.8)	1315 (40.3); 1553 (40.2)	104,020 (59.1); 173,843 (55.2)
Senior patients (>65 years); inpatient visits, n (%)	1980 (48.0); 2335 (48.2)	1946 (59.7); 2307 (59.8)	71,910 (40.9); 140,837 (44.8)

Concept Prevalence and Concept Prevalence Ratio Analysis

We calculated the concept prevalence in each cohort as detected from the EHR. The concept prevalence is defined as in equation (1).

where is the prevalence of concept C in cohort H, is the set of unique inpatient visits of patients in cohort H observed with concept C, and T_H is the set of unique inpatient visits of patients in cohort H. We also calculated hierarchical concept prevalence by defining as the set of unique visits of patients observed with concept C or any of concept C’s descendant concepts as defined in the concept_ancestor table in the OMOP CDM. For example, the hierarchical count for Ibuprofen (OMOP concept ID 1177480) not only includes entries where the specific concept Ibuprofen was used, but also includes entries using other descendant concepts, such as Ibuprofen 600 MG Oral Tablet (OMOP concept ID 19019073). Taking hierarchical relationships into account mitigates some of the issues with coding variations across time and practices as different concepts with minor semantic differences can be aggregated into higher-level concepts.

The concept prevalence ratio indicates how frequently concept C occurs in cohort A relative to cohort B. The natural logarithm of the concept prevalence ratio is defined as in equation (2).

where LR(C_A,B) is the log ratio of the prevalence of concept C for cohort A to cohort B, is prevalence of concept C in cohort A, and is prevalence of concept C in cohort B. Hierarchical concept prevalence ratio can be calculated by using hierarchical concept prevalence and .

Concept Co-occurrence Analysis

Concept co-occurrence represents how frequently a specific concept pair appears in a cohort. We defined concept co-occurrence prevalence as in equation (3).

where is the co-occurrence prevalence of concepts C₁ and C₂ in cohort H, is the set of unique visits of patients observed with concept C₁ and C₂ in cohort H, and T_H is the set of unique visits of patients in the cohort H. We also calculated hierarchical concept co-occurrence using the hierarchy of concepts as described above.

COVID-19 Symptom Prevalence Analysis

Since clinical symptoms often include multiple granular clinical concepts, a set of related concepts for a symptom is required to calculate the prevalence of the symptom. For example, dyspnea, which is one of the major symptoms of COVID-19, can be detected as standard concept Dyspnea or Acute respiratory distress in different patients. The 2 concepts do not have any hierarchical relationship and thus will not be aggregated by the hierarchical prevalence analyses. Thus, a concept set containing both Dyspnea and Acute respiratory distress is needed for accurate calculation of the prevalence of dyspnea. We defined symptom prevalence as in equation (4).

where is prevalence of symptom S in cohort H, is the set of unique visits of patients observed with concept C_i, T_H is the set of unique visits of patients observed in the cohort H, s is the number of the unique concepts in the concept set for symptom S, and ∪ is the union operator. Hierarchy between concepts is not considered in the symptom prevalence analysis since a concept set of a symptom already reflects hierarchy for that symptom.

We used concept sets for 11 major symptoms of COVID-19 (Cough, Chills, Abdominal pain, Diarrhea, Dyspnea, Fatigue, Fever, Myalgia, Nausea and vomiting, Tachypnea, and Throat pain), which have been defined by OHDSI to calculate symptom prevalence. The concepts included in each symptom are available in Figshare [15].

Data Quality Analysis

Assessing the quality of EHR data is critical for its effects on secondary analysis for research in the health care and medical domains [16]. We calculated the sum of nonrandomized counts of all concepts on a monthly basis for the COVID-19 cohort and on a yearly basis for the general and influenza cohorts. For each of the condition, drug, and procedure domains, we examined the total counts of concepts with the number of visits to detect any issues regarding data quality and temporal plausibility of the EHR data we used in the study [17]. We also calculated the annual mean (SD) values of concept prevalence and concept co-occurrence for the general and influenza cohorts to assess the temporal variance of each concept and concept co-occurrence pair.

Results Overview

In this section, we show a sample of the results of analyses using the data from COHD-COVID. Since COHD-COVID contains massive amounts of data covering several thousand concepts, it is worth noting that only a small sample of the results is shown in this section. Users can obtain the results of interest in addition to the results shown in this section by using the COHD-COVID API [14] or by downloading the flat data files from Figshare [15]. The results of all analyses, concept definitions, and concepts included in each symptom are available as tab-delimited flat files in Figshare, except the concept prevalence ratio analyses, since they can be directly computed from concept prevalence data. COHD-COVID API [14] provides all results.

Concept Prevalence Analysis

Figure 2 shows the prevalence of 10 condition (Figure 2A) and drug concepts (Figure 2B) in the COVID-19 cohort, influenza cohort, and general cohort. We chose the 10 most prevalent condition and drug concepts in the COVID-19 cohort for this use case. For drug concepts, we chose the most prevalent drug ingredient concepts using hierarchical analysis to count multiple drugs that have different brand names, dosages, and formulations but are based on the same ingredients together. The condition concepts were chosen without hierarchical analysis to identify the top 10 specific conditions. Fever showed the highest prevalence (0.2619) among all condition concepts for the COVID-19 cohort, followed by Cough (0.2491) and Dyspnea (0.2594). Acetaminophen was the most prevalent (0.7912) drug ingredient used for patients with COVID-19, followed by Enoxaparin (0.5803) and Glucose (0.4424). Figures 3 and 4 show the prevalence of 10 condition and drug concepts, respectively, in the COVID-19 cohort stratified by age (Figures 3A and 4A) and gender (Figure 4A and 4B). The 10 condition and drug concepts were the top 10 most prevalent concepts in the full COVID-19 cohort without stratification.

Figure 2. (A) Condition and (B) drug concept prevalence in the COVID-19 cohort, influenza cohort, and general cohort.

Figure 3. Condition concept prevalence in (A) age and (B) sex sub-cohorts of the COVID-19 cohort. The full COVID-19 cohort indicates original COVID-19 cohort without stratification.

Figure 4. Drug concept prevalence in (A) age and (B) sex subcohorts of the COVID-19 cohort. The full COVID-19 cohort indicates original COVID-19 cohort without stratification.

Concept Prevalence Ratio Analysis

Table 2 shows the top 10 condition concepts that showed the highest concept prevalence ratio for the COVID-19 cohort relative to the comparator cohorts (general cohort and influenza cohort). Disease due to Coronavirus and Acute lower respiratory tract infection showed the highest concept prevalence ratio for the COVID-19 cohort relative to the influenza cohort and the general cohort, respectively.

Table 2. Top 10 condition concepts that show the highest concept prevalence ratio for the COVID-19 cohort relative to the general and influenza cohorts.

Influenza cohort (prevalence ratio)	General cohort (prevalence ratio)
Disease due to Coronavirus (3.28)	Acute lower respiratory tract infection (7.30)
Disorientated (2.44)	Disorder of respiratory system (5.56)
Blood chemistry abnormal (2.25)	General finding of observation of patient (5.17)
Acute respiratory distress syndrome (2.16)	Outcome of delivery – finding (5.13)
Cerebral infarction (1.84)	Chest pain on breathing (5.09)
Cough (1.81)	Disease due to Coronavirus (4.95)
Viral pneumonia (1.61)	Patient status finding (4.60)
Acute respiratory distress (1.52)	Unplanned pregnancy (4.23)
Heart disease (1.32)	Acute respiratory distress syndrome (4.17)
Delivery normal (1.28)	Deliveries by cesarean (3.98)