Tutorial
Abstract
Dynamic predictive modeling using electronic health record data has gained significant attention in recent years. The reliability and trustworthiness of such models depend heavily on the quality of the underlying data, which is, in part, determined by the stages preceding the model development: data extraction from electronic health record systems and data preparation. In this paper, we identified over 40 challenges encountered during these stages and provided actionable recommendations for addressing them. These challenges are organized into 4 categories: cohort definition, outcome definition, feature engineering, and data cleaning. This comprehensive list serves as a practical guide for data extraction engineers and researchers, promoting best practices and improving the quality and real-world applicability of dynamic prediction models in clinical settings.
J Med Internet Res 2025;27:e73987. doi: 10.2196/73987
Background
Predictive modeling using electronic health record (EHR) data has become increasingly important in enhancing patient outcomes through real-time risk detection and timely interventions. However, the effectiveness of these models is heavily reliant on the quality and structure of the underlying data, which are influenced by the processes of data extraction and preparation. While recent advancements in artificial intelligence and machine learning [,] have shown promise in this area [-], significant challenges remain in ensuring that the data used for model development are representative and reliable. The use case that inspired this work involves a recent project on the dynamic prediction of central line-associated bloodstream infections (CLABSI) using hospital-wide data from the University Hospitals Leuven (Belgium) [-]. The multimedia appendix (use case: CLABSI prediction using data from a single-center university hospital) summarizes the main challenges we encountered, together with corresponding solutions.
Data collection is part of the routine hospital workflow. Data extraction entails retrieving and structuring raw EHR data from hospital databases using an extract, transform, load (ETL) process, with standards such as the Observational Medical Outcomes Partnership Common Data Model (OMOP CDM) facilitating consistency in structure and terminology (see the glossary of technical terms in the multimedia appendix). Publicly available intensive care unit (ICU) and emergency admission datasets [-], such as MIMIC [,], support reproducible research, while hospitals also extract private data, not publicly shared, for predictive modeling. Data preparation transforms extracted data into a structured format for predictive modeling, often using R (R Foundation) or Python (Python Software Foundation) pipelines. Open-source frameworks, whether generalizable [,] or specific to MIMIC [,], aim to establish reproducible workflows for EHR data preparation. Data quality issues, such as inconsistencies and artifacts introduced during extraction and preparation, can compromise the performance of predictive models. Understanding the intricacies of data extraction and preparation challenges is essential for developing robust predictive models that can be reliably integrated into clinical practice.
Structured data quality assessments [-] are facilitated by frameworks such as those of Weiskopf and Weng [] and METRIC (Measurement Process, Timeliness, Representativeness, Informativeness, Consistency) []. However, researchers often conduct unstructured assessments, documenting “lessons learned” from their projects [-]. These assessments tend to focus primarily on data cleaning rather than on the entire data extraction and preparation process, including cohort definition, outcome definition, and feature engineering. While literature on data quality assessments exists, mitigation strategies remain largely undocumented [].
Objective
This paper provides a comprehensive list of challenges encountered during data extraction and preparation using EHR data for developing dynamic prediction models for clinical use. It further proposes recommendations with actionable insights, intended to enhance data quality and improve the practical applicability of (dynamic) prediction models in clinical settings. Our insights are drawn from a selective literature review, as well as our experience with various EHR data extractions. This list is intended as a hands-on resource for data extraction engineers (who perform the data extraction) and researchers (who prepare the data for model building) to consult during the extraction and preparation process. By addressing these challenges, we hope to contribute to the development of more reliable and effective predictive modeling frameworks that can ultimately benefit patient care.
We focus on single-hospital structured data, covering both ICU-specific and hospital-wide extractions. We focus on medium- to large-scale extractions, that is, those generated through structured ETL processes for a large number of extracted items spanning diverse clinical domains (laboratory results, medications, demographics, comorbidities, etc), and we cover both standardized (eg, OMOP CDM) and nonstandard, private extractions. We do not address combined data registries, such as national registries (integrating surveys, general practice, insurance data, and EHR extractions), multicenter data extractions, hospital data collected for clinical trials, or the extraction and processing of unstructured data (eg, text notes or images). We cover the broader application of dynamic models (or continuous prediction) to offer the most comprehensive framework, although many of the recommendations are applicable to static prediction models (eg, a single prediction per admission at 24 hours after admission).
Recognizing that EHR software may contain bugs, that human errors will occur, and that hospital processes will evolve and generally improve over time, we do not provide recommendations for EHR vendors or for modifying hospital workflows, clinical practices, or data recording procedures within EHR systems. When data quality issues, such as those arising from data recording procedures, render the extracted data inadequate for a specific prediction task, we leave it to researchers to assess adequacy, without offering guidance on what constitutes suitable data for a given prediction task.
The section “EHR Data Flow from Data Collection to Model Building” explains the typical trajectory of data, for both model building and clinical implementation, setting the stage and providing context for the detailed list of challenges that follows. The section “Challenges and Recommendations” lists the challenges and actionable recommendations, which represent the core of the paper. These are categorized into 4 groups: cohort definition, outcome definition, feature engineering, and data cleaning. This is one possible categorization, inspired by the stages of the data preparation process. The “Discussion and Conclusion” section summarizes the recommendations and reflects on their broader implications.
EHR Data Flow From Data Collection to Model Building
The data flow from the patient’s bed to prediction model building typically follows 3 stages: (1) data collection, (2) data extraction, and (3) data preparation (A). (1) Patient data are either manually entered in EHR software modules or collected by devices and stored in one or multiple databases. The data collected serve multiple purposes, for example, daily bedside clinical work, national benchmarking [], or reimbursing the care delivered. (2) EHR data are then extracted from relational databases in a simplified (denormalized) format and typically stored in a data warehouse as multiple “base tables” per category, each capturing different aspects of patient data and health care events. Examples of base table names as per the OMOP CDM are PERSON, DRUG_EXPOSURE, DEVICE_EXPOSURE, CONDITION_OCCURRENCE, MEASUREMENT, NOTE, and OBSERVATION, but the naming and granularity of extraction can vary for nonstandard extractions. (3) Researchers building a prediction model either have access to the data warehouse or get a copy of all tables or, under specific ethical and legal considerations, a subset of the base tables or a subset of the patients in the data warehouse (eg, admissions during a specified period of interest for patients undergoing mechanical ventilation). Further, they process the extracted data and bring it into a format on which a prediction model can be built.
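As an illustration of step (3), the following minimal Python (pandas) sketch loads OMOP-style base tables and restricts them to a study cohort. The flat-file copies, file paths, and study window are hypothetical assumptions; the table and column names follow the OMOP CDM.

```python
import pandas as pd

# Hypothetical flat-file copies of OMOP-style base tables received from the
# data warehouse.
visit = pd.read_csv(
    "visit_occurrence.csv",
    parse_dates=["visit_start_datetime", "visit_end_datetime"],
)
measurement = pd.read_csv("measurement.csv", parse_dates=["measurement_datetime"])

# Restrict to admissions starting within the study period of interest.
study_start, study_end = pd.Timestamp("2020-01-01"), pd.Timestamp("2023-12-31")
cohort = visit[visit["visit_start_datetime"].between(study_start, study_end)]

# Keep only the measurements belonging to the cohort's admissions.
measurement = measurement.merge(
    cohort[["visit_occurrence_id"]], on="visit_occurrence_id", how="inner"
)
```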
When the model is implemented in clinical practice within the EHR software (B), a trigger (eg, an update of laboratory results) or a scheduled task (eg, every 24 hours) will initiate the request for a prediction. Data are already collected (1) for the patient in the EHR databases. The same logic as for extraction (2) is reproduced (typically by reusing the queries or code used for extraction). Using the data packed in a specific format (eg, JSON or XML), the prediction service is invoked, typically through a RESTful (Representational State Transfer) application programming interface. Data exchange between the EHR and the prediction service is generally performed using the Fast Healthcare Interoperability Resources standard. The prediction service will further prepare the data (3), invoke the model, and return a prediction (and additional information, if foreseen) to the EHR software, which will present it to users in the form of alerts or patient flags within the patient’s chart. The process is typically logged in the EHR system for monitoring purposes. In summary, for real-time prediction, (1) data collection happens implicitly as part of the normal clinical flow, while (2) data extraction and (3) data preparation are reproduced identically as for model building.
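The following simplified Python sketch illustrates the invocation of a prediction service over a RESTful interface. The endpoint URL, payload fields, and response format are hypothetical; a production integration would exchange Fast Healthcare Interoperability Resources resources rather than this ad hoc JSON.

```python
import requests

# Hypothetical payload assembled by the EHR side after a trigger fires.
payload = {
    "patient_id": "12345",            # pseudonymized identifier
    "trigger": "lab_result_update",   # what initiated the prediction request
    "observations": [
        {"code": "718-7", "system": "LOINC", "value": 9.4,
         "unit": "g/dL", "effective": "2025-01-15T08:30:00Z"},
    ],
}

# Invoke the prediction service over its RESTful API (hypothetical URL).
response = requests.post(
    "https://prediction-service.example.org/predict", json=payload, timeout=10
)
response.raise_for_status()
prediction = response.json()  # eg, {"risk": 0.12, "model_version": "1.3.0"}
```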
We provide additional background on particularities of each of the processes of collection, extraction, and preparation, which represent transitions from one data format to another.

Data Collection
The EHR database will not reflect the “true state” of the patient with maximum accuracy. First, it will suffer from incompleteness, as not all possible markers and observations can be collected for all patients at all times. Decisions regarding which data are collected (eg, which laboratory tests are ordered and performed) depend heavily on the patient’s condition and on hospital procedures. Data collected during routine clinical practice are generally documented more carefully when they are also used for national reporting of quality of care indicators []. Doctors, nurses, assistants, and others collect sufficient data in the system during clinical practice to support the patient’s clinical follow-up and treatment. From this perspective, EHR data differ vastly from data collected for clinical trials, where researchers specify the measurements, measurement methods, and collection procedure. Second, nurses and clinicians might have slightly more information than what is recorded in the system, either from patient conversations or from organizational knowledge. Given that we do not cover the extraction of text notes and reports, tabular data will always suffer from some level of incompleteness. Nevertheless, tabular data might prove sufficient for specific prediction tasks. Third, both manual data entry and data collected by devices can be error-prone at times, software can have bugs, and data recording procedures in the system will affect the granularity of observed data and will change over time. These considerations, which can be summarized as data completeness, correctness, and currency, as defined by Weiskopf and Weng [], have to be carefully considered by researchers to assess whether EHR data are fit for the prediction goal [].
Data Extraction
The extracted data might not correctly reflect the EHR database. Although data extraction generally aims for completeness of all clinically relevant data, the extraction process can introduce undesired artifacts or can have its own limitations. Whenever extraction logic bugs are detected, these should be corrected and safeguarded through unit or integration tests. Mature extraction platforms, tested through repeated use and proven reliability, will generally be less error-prone, and researchers can consider the maturity of the extraction platform as a factor influencing the time they will spend on data preparation. The extraction process is typically carried out by a data extraction engineer, data integration developer, data warehouse engineer, or ETL developer representing the EHR vendor or the hospital information technology department. Hereafter, we refer to this role as the data extraction engineer. They work closely with EHR software developers, hospital information technology staff, and clinical personnel to understand database structures and data recording procedures. They perform clinical concept and terminology mapping and document the extracted data.
Data Preparation
The prepared data might not correctly reflect the extracted data. Undesired artifacts can be introduced by feature engineering or data cleaning. Good documentation of the extraction format and close collaboration between researchers, data engineers, and clinical experts will ensure that the data and the features are not misinterpreted. Coding errors can always occur. Time-sensitive data pose an additional challenge: ensuring that no temporal leaks (using future data to predict past events) are introduced by mistake. Outcome leaks (including outcome information in predictors) and test-train leaks (including information from test patients in the training datasets) can also occur if data are not preprocessed carefully. A good practice is to separate the train and test sets first and apply the data preprocessing separately. A modular organization of the code will facilitate unit testing (testing individual functions of a program in isolation to ensure they work as expected) and easy modifications without introducing new errors. The more complex the data preparation, the more error-prone it becomes. Data preparation is performed by a researcher, data scientist, statistician, or machine learning engineer, whom we will refer to as a researcher for the remainder of this paper.
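The practice of splitting first and preprocessing separately can be made concrete with a minimal Python (scikit-learn) sketch. The input file, the binary outcome column, and the use of median imputation on numeric features are illustrative assumptions.

```python
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Hypothetical prepared dataset: one row per admission, numeric features,
# and a binary column `outcome`.
df = pd.read_csv("admission_features.csv")
X, y = df.drop(columns="outcome"), df["outcome"]

# 1. Separate the train and test sets first...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42, stratify=y
)

# 2. ...then fit the preprocessing on the training set only and apply the
# fitted transformation to both sets, avoiding test-train leakage.
imputer = SimpleImputer(strategy="median").fit(X_train)
X_train_prep = imputer.transform(X_train)
X_test_prep = imputer.transform(X_test)
```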
Each stage of processing the data has its own challenges and can introduce new problems, widening the gap between the patient’s state and the data used by the prediction model and ultimately impacting the model building (suboptimal model), model evaluation (misleading performance metrics), or model implementation in clinical practice.
Challenges and Recommendations
We list common problems originating in the (1) data collection process, the (2) data extraction process, and the (3) data preparation process (A). We provide recommendations for mitigation strategies that can be implemented during the (2) data extraction or (3) data preparation. We also focus on problems that can impede the identical reproduction of the extraction and preparation at clinical implementation time. We have categorized the challenges and recommendations into 4 groups: (1) cohort definition (and inclusion or exclusion criteria), (2) outcome definition, (3) feature engineering, and (4) data cleaning; each group contains problems originating in the collection, extraction, or preparation process.
We provide a mapping of the items to both the Weiskopf and Weng framework [] and the METRIC framework [], whenever applicable. Weiskopf and Weng do not include dimensions covering cohort representativeness and the completeness of the extracted features, which are important in prediction settings. We will use the completeness dimension to refer to the completeness of data values, as in the original definition [], as well as to the completeness of the cohort and of the extracted features (our extension).
While we aimed to make these challenges as generic as possible, we recognize that specific issues are often unique to individual projects and may not apply to all prediction tasks. We advise users to assess the impact of each listed challenge in their specific context. Similarly, the recommendations might not always be universally applicable and depend on the project context. Our guidance remains pragmatic; at times, the best approach may be to “leave as is” to avoid the risk of overcorrection, which can backfire. While certain corrections may improve alignment with clinical meaning for a specific patient group, they can introduce errors or biases when applied to all patients or be difficult to reproduce at clinical implementation time, affecting the model’s performance in clinical use. Following the fitness-for-use principle [], we encourage readers to remain pragmatic and address only the issues relevant to their data and prediction task.
The figure presents a high-level schematic overview of the recommendations, without detailing each specific challenge. For a full understanding of the context in which recommendations are made, we encourage hands-on users involved in data extraction and preparation to consult the following sections and tables.

Cohort Definition
Defining the cohort of interest (eg, hospital-wide population or patients with a specific condition) represents a first step in the prediction task definition (-). It can also be of interest during the project planning phase. An inaccurately defined cohort can lead to selection bias, resulting in performance estimates that do not accurately reflect the model’s performance in clinical practice.
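As a hedged illustration of 2 checks from the table below (invalid admissions and duplicated admissions), consider the following Python sketch. The column names, including the test-patient flag, are hypothetical; the flag would typically be agreed with the data extraction engineer.

```python
import pandas as pd

admissions = pd.read_csv("admissions.csv", parse_dates=["admission_datetime"])

# Drop admissions that do not reflect real patients (eg, test or training
# accounts flagged during extraction).
admissions = admissions[~admissions["is_test_patient"].astype(bool)]

# Drop duplicated admissions created by cross-linking several data sources,
# keeping the first occurrence.
admissions = admissions.drop_duplicates(
    subset=["patient_id", "admission_datetime"], keep="first"
)
```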
| Problem | Description and recommendation |
| --- | --- |
| Linking identifiers | |
| Cross-linking patients and admissions across various data sources creates inconsistencies | |
| Invalid admissions, not reflecting real patients, are present in the extraction | |
| Duplicated admissions are present in the extraction | |
| Restricting extraction to hospitalized patients may lead to the omission of useful patient data | |
aICU: intensive care unit.
bMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
cWW: Weiskopf and Weng dimension mapping.
dHIS: hospital information system.
eEHR: electronic health record.
| Problem | Description and recommendation |
| --- | --- |
| Incorrect or incomplete mappings leading to an incorrect cohort definition applied at extraction | |
| Inclusion or exclusion criteria require patient history before the study period | |
| Restrictive inclusion or exclusion criteria applied at the extraction limit the scope of missing data imputation | |
aICD: International Classification of Diseases.
bMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
cWW: Weiskopf and Weng dimension mapping.
| Problem | Description and recommendation |
| --- | --- |
| Discontinuity or fragmentation of the patient stay due to cross-linking across different data sources | |
| Temporal leaks in cohort definition | |
| Definition of episodes of interest | |
| Episode fragmentation | |
aICU: intensive care unit.
bEHR: electronic health record.
cMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
dWW: Weiskopf and Weng dimension mapping.
eICD: International Classification of Diseases.
fCPT: Current Procedural Terminology.
gPEEP: positive end-expiratory pressure.
hFiO2: fraction of inspired oxygen.
Outcome Definition
Prediction models using EHR data usually focus on in-hospital or postdischarge outcomes, including mortality, length of stay, readmission, acute events (such as bacteremia, sepsis, and acute kidney injury), and chronic diseases (such as heart failure, cancer, and cardiovascular disease) []. We focus on outcomes that can be derived solely from structured EHR data, without linkage to external data sources or extraction of text notes, and that are linked to a patient; that is, we exclude resource use and workflow optimization outcomes ( and ). Similar to the cohort definition, a good practice is to assess the feasibility of defining the outcome during the project planning phase.
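As an illustration only, the following Python sketch derives a timestamped outcome from structured microbiology data. It is a simplistic bacteremia proxy, not a validated clinical or surveillance definition (eg, not the CDC CLABSI definition), and the table and column names are hypothetical.

```python
import pandas as pd

micro = pd.read_csv("microbiology.csv", parse_dates=["sample_datetime"])

# Keep positive blood cultures and take the first one per admission as the
# timestamped event; admissions without such a culture act as controls.
positive = micro[
    (micro["specimen"] == "blood") & (micro["result"] == "positive")
]
first_event = (
    positive.sort_values("sample_datetime")
    .groupby("admission_id", as_index=False)
    .first()[["admission_id", "sample_datetime"]]
    .rename(columns={"sample_datetime": "event_datetime"})
)
```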
| Problem | Description and recommendation |
| --- | --- |
| Outcomes derived based on ICDa codes are not timestamped and often unreliable (under- or upcoded) | |
| Outcomes retrospectively assessed by clinical experts are labor-intensive and subject to human error | |
| Outcomes documented in the system are subject to mislabeling and delay in registration | |
| Specific prediction tasks require the extraction of auxiliary outcomes or competing events | |
| Label leakage due to temporal leakage | |
aICD: International Classification of Diseases.
bICD-9: International Classification of Diseases, Ninth Revision.
cICD-10: International Statistical Classification of Diseases, Tenth Revision.
dMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
eWW: Weiskopf and Weng dimension mapping.
fCAM: Confusion Assessment Method.
gDRS: Delirium Rating Scale.
hAKI: acute kidney injury.
| Problem | Description and recommendation |
| --- | --- |
| Outcomes derived using clinical or surveillance definitions often require the extraction of multiple data sources | |
| Clinical or surveillance definitions are complex, and their implementation is prone to variations | |
| Outcomes derived using clinical or surveillance definitions can be affected by missing or erroneous data | |
aICD: International Classification of Diseases.
bAKI: acute kidney injury.
cCLABSI: central line-associated bloodstream infection.
dVAE: ventilator-associated event.
eMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
fWW: Weiskopf and Weng dimension mapping.
gCDC: Centers for Disease Control and Prevention.
Feature Engineering
Feature engineering is the process of selecting, transforming, and creating features (variables) from the extracted EHR data to ensure that the data are brought into a suitable format for modeling and that relevant information is processed in a meaningful way. It includes data mapping (eg, mapping medication brand names to generic drug names) and transformation (eg, grouping medical specialties into meaningful categories), converting timestamped events (eg, laboratory results, medication administration, and vital signs) into snapshot-based features, and data aggregation (summarizing multiple observations or measurements into single features). Feature engineering is typically performed during data preparation and usually happens concomitantly with data cleaning. Sometimes feature engineering is also performed at data extraction time, for example, to reduce the data volume for high-frequency time-series data, especially in ICU settings, or to compute scores that are calculated and displayed in the EHR software but not stored in the system. Clinical concept mapping, such as mapping medications to Anatomical Therapeutic Chemical codes, laboratory tests to LOINC (Logical Observation Identifiers, Names, and Codes) codes, or procedures and clinical observations to Systematized Nomenclature of Medicine Clinical Terms, can occur at different stages, including within the EHR software, during data extraction (eg, OMOP CDM Standardized Vocabularies), or as part of data preparation.
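As a minimal illustration of clinical concept mapping during data preparation, the following Python sketch maps medication brand names to Anatomical Therapeutic Chemical codes through a lookup table. The mapping table shown is hypothetical; in practice, it would come from the EHR vendor, the extraction documentation, or standardized vocabularies.

```python
import pandas as pd

meds = pd.read_csv("medication_administrations.csv")

# Hypothetical brand-name-to-ATC lookup table.
atc_map = pd.DataFrame({
    "brand_name": ["Augmentin", "Zinacef"],
    "atc_code": ["J01CR02", "J01DC02"],  # amoxicillin/clavulanate, cefuroxime
})

meds = meds.merge(atc_map, on="brand_name", how="left")

# Review unmapped brand names with clinical experts rather than silently
# dropping the corresponding rows.
unmapped = meds.loc[meds["atc_code"].isna(), "brand_name"].unique()
```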
We distinguish between generic feature engineering and time-sensitive feature engineering. Generic feature engineering ( and ) deals with the mapping of clinical items. Time-sensitive feature engineering ( and ) deals with data aggregation, for which correct processing of timestamps is critical and which can result in temporal leaks (ie, data available at a specific time for training the model but not available in the system at that timestamp). Time-sensitive feature engineering is critical for dynamic prediction models, can also be useful for some static models (eg, predictions at 24 hours after admission), and has less impact on models with a prediction trigger at the end of the admission (eg, readmission prediction). We acknowledge that some minimal temporal leaks might not be very detrimental to the model performance or its applicability, while others can have a large impact. However, following the “do it right the first time” principle, a good understanding of the extracted timestamps and correct handling of dates or times during data extraction and preparation can safeguard against future problems, big or small.
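A leak-safe construction of snapshot-based features can be sketched as follows in Python. The column names, including `validated_datetime` (the time at which a result became available in the system), are hypothetical.

```python
import pandas as pd

# Hypothetical extraction: `validated_datetime` is the time at which a
# laboratory result became available in the system.
labs = pd.read_csv("lab_results.csv", parse_dates=["validated_datetime"])

def snapshot_features(labs_one_adm: pd.DataFrame, t: pd.Timestamp) -> dict:
    """Aggregate creatinine results available up to prediction time t."""
    past = labs_one_adm[labs_one_adm["validated_datetime"] <= t]  # no future data
    crea = past[past["test"] == "creatinine"].sort_values("validated_datetime")
    last_24h = crea[crea["validated_datetime"] >= t - pd.Timedelta("24h")]
    return {
        "creatinine_last": crea["value"].iloc[-1] if len(crea) else None,
        "creatinine_max_24h": last_24h["value"].max() if len(last_24h) else None,
    }
```

Filtering on the system-availability timestamp, rather than the clinically relevant sampling time, mirrors what will be available at real-time prediction.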
As for the previous 2 groups, certain aspects of feature engineering can be evaluated during project planning, at least for the key features relevant to the prediction task and, at a minimum, for their availability. For instance, if a key feature was only recorded for a limited period before being discontinued, or if timestamps of key features are unavailable for a dynamic prediction task, the project may be jeopardized.
| Problem | Description and recommendation |
| --- | --- |
| Inconsistencies due to the data extraction of the same clinical items from various databases | |
| The same clinical concept is available in different tables in the same database | |
| Clinical concept mapping differs between wards or over time | |
| Partially overlapping data regarding the same clinical concept is available in multiple extracted tables | |
| Changes in hospital processes over time result in gaps in the extracted features | |
aEHR: electronic health record.
bICU: intensive care unit.
cOMOP CDM: Observational Medical Outcomes Partnership Common Data Model.
dMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
eWW: Weiskopf and Weng dimension mapping.
fCVC: central venous catheter.
gETL: extract, transform, load.
hICD: International Classification of Diseases.
iICD-9: International Classification of Diseases, Ninth Revision.
jICD-10: International Statistical Classification of Diseases, Tenth Revision.
kCPT: Current Procedural Terminology.
| Problem | Description and recommendation |
| --- | --- |
| Ward-specific data recording patterns | |
| Weekend effects result in missing data | |
| The feature set definition is too restrictive | |
| Overmapping during extraction can dilute the predictive signal | |
| Use of hospital aggregate features | |
| Use of patient aggregate features | |
aGCS: Glasgow Coma Scale.
bMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
cWW: Weiskopf and Weng dimension mapping.
dICD: International Classification of Diseases.
eCCS: Clinical Classifications Software.
fCCSR: Clinical Classifications Software Refined.
gAPI: application programming interface.
| Problem | Description and recommendation |
| --- | --- |
| Timestamps shifting for anonymization | |
| Timestamps when data are available in the system for real-time prediction differ from the clinically relevant timestamps | |
| Extracted timestamps may reflect the time of the last modification rather than the original creation time of an item | |
| No date or timestamps available | |
aMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
bWW: Weiskopf and Weng dimension mapping.
cAKI: acute kidney injury.
dCLABSI: central line-associated bloodstream infections.
eEHR: electronic health record.
fICD: International Classification of Diseases.
| Problem | Description and recommendation |
| --- | --- |
| The extracted timestamp does not reflect the time when the data are recorded in the system | |
| Exact-hour timestamps often do not reflect the time when data are recorded in the system | |
| Temporal leaks due to extraction errors | |
| Temporal leaks due to incorrect date or time handling, or incorrect linking during data preparation | |
aEHR: electronic health record.
bCPT: Current Procedural Terminology.
cMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
dWW: Weiskopf and Weng dimension mapping.
eETL: extract, transform, load.
fICU: intensive care unit.
Data Cleaning
The data cleaning process consists of identifying and correcting data issues that could negatively impact the performance or applicability of prediction models. Errors in EHR data can arise from manual entry mistakes, improperly connected or malfunctioning devices, or bugs within the EHR software. Such errors are typically uncovered during data exploration and addressed during data preparation. Tools such as the Data Quality Dashboard [] have been developed to support and streamline the data cleaning process. Artifacts can also be introduced during data extraction or data preparation. Unit tests for both the ETL process and the data preparation pipeline can safeguard against introducing additional errors.
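A minimal pytest-style sketch of unit testing a single data preparation function in isolation is shown below. The cleaning function and the plausibility range for adult weight are illustrative assumptions.

```python
import pandas as pd

def clean_weight_kg(s: pd.Series) -> pd.Series:
    """Coerce to numeric and blank out implausible adult weights (in kg)."""
    s = pd.to_numeric(s, errors="coerce")  # nonnumeric entries become NaN
    return s.where(s.between(20, 400))     # values outside 20-400 kg become NaN

def test_clean_weight_kg():
    raw = pd.Series(["82", "eighty", "-1", "7000", "63.5"])
    out = clean_weight_kg(raw)
    assert out.iloc[0] == 82.0
    assert out.isna().tolist() == [False, True, True, True, False]
```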
Overzealous correction of errors, such as manually correcting every error encountered during data preparation, might not prove beneficial for prediction tasks when such corrections cannot be programmatically reproduced during model implementation in clinical settings. This discrepancy can lead to a situation in which training data accurately represent the true patient state, while real-time predictions rely on erroneous data, resulting in reduced predictive performance. Although measures can be taken to address common and documented errors from previous studies or identified during data exploration ( and ), it is impossible to anticipate and guard against all future errors. For example, new EHR software versions and updates may introduce new bugs, even as older bugs are resolved.
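The following Python sketch illustrates programmatic, reproducible cleaning steps of the kind listed in the tables below (unit harmonization and implausible values). The unit labels, conversion factor, and plausibility range are illustrative and should be agreed with clinical experts.

```python
import numpy as np
import pandas as pd

labs = pd.read_csv("lab_results.csv")

# Harmonize creatinine reported in µmol/L to mg/dL (1 mg/dL = 88.4 µmol/L).
umol = labs["unit"].eq("µmol/L") & labs["test"].eq("creatinine")
labs.loc[umol, "value"] = labs.loc[umol, "value"] / 88.4
labs.loc[umol, "unit"] = "mg/dL"

# Set physiologically implausible values to missing instead of dropping rows,
# so that the same correction can be reproduced at implementation time.
crea = labs["test"].eq("creatinine")
labs.loc[crea & ~labs["value"].between(0.1, 20), "value"] = np.nan
```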
| Problem | Description and recommendation |
| --- | --- |
| Nonnumeric values in a numeric field | |
| Values outside a physiologically plausible range (outliers) | |
| Implausible or inconsistent values within a time series | |
| Character encoding problems | |
| Inconsistent or incorrectly specified units | |
|
aMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
bWW: Weiskopf and Weng dimension mapping.
cEHR: electronic health record.
dICU: intensive care unit.
eCPR: cardiopulmonary resuscitation.
| Problem | Description and recommendation |
| --- | --- |
| Clinically implausible combinations of patient characteristics | |
| Syntactic variability for missing values | |
| Missing values that can be fully recovered from other fields | |
| Implausible missingness rates | |
|
aEHR: electronic health record.
bMETRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency Cluster and Dimension Mapping.
cWW: Weiskopf and Weng dimension mapping.
dETL: extract, transform, load.
Discussion and Conclusion
In this work, we highlighted challenges and recommendations for the extraction and preparation of EHR data for predictive modeling. Adhering to the “garbage in, garbage out” principle, prediction models rely heavily on the quality and relevance of the input data to generate meaningful and reliable predictions. High-quality data extraction and preparation processes can support prediction models with clinical utility. While it is often argued that EHR “data are not collected for research purposes” [,-,,], the fact that these data are sufficient for supporting patient care suggests they could be adequate for developing prediction models. Even though the list of challenges may seem extensive, it should not discourage researchers new to working with EHR data. Not all challenges will apply to every project, and as EHR systems continue to evolve, many issues might affect only historical data and not current data at implementation time. Successfully leveraging EHR data requires both an understanding of its limitations and an appreciation of its potential. Despite its complexities, EHR data offer a rich, comprehensive source of real-world clinical information that can drive impactful research and improve patient outcomes. Applied prediction models can exploit the comprehensive clinical information that is not always readily available to all health care professionals, such as nurses, doctors, hygienists, and other therapists, and have proven practical applicability [].
Data extraction and preparation for predictive modeling using EHR data are resource-intensive processes, with time and cost varying depending on the maturity of the extraction framework, the prediction task, and the team’s experience in working with EHR data. It has been estimated that this phase (which generally includes collaboration with data extraction engineers and clinical experts) takes up to 3 months for an entire team [], but such estimates depend heavily on the maturity of the extraction process and on the team’s experience with EHR data. Data quality often varies significantly across EHR systems and extraction processes, as noted by Weiskopf and Weng []. Issues such as data gaps, temporal leaks, incorrect linking, and inconsistencies in clinical concept and terminology mapping can affect the quality of extracted datasets and compromise the model’s performance and applicability. To address such challenges, we proposed a list of practical recommendations informed by our experiences with EHR data and insights from published studies. We organized the challenges and recommendations into 4 groups: cohort definition, outcome definition, feature engineering, and data cleaning; the first 3 categories can also be consulted when planning a project. A clear definition of the prediction task or research question, of the intended use, and of the intended users of a prediction model is the first critical step for defining the outcome, the cohort, and the features of interest [,]. It is not uncommon to deem the EHR data inadequate for the prediction task before proceeding with the model-building phase [].
For cohort definition, we recommend extracting a broader patient context beyond the immediate focus, assessing the completeness of data used for inclusion or exclusion criteria and its availability at prediction time, preventing omissions or duplications, and carefully defining episodes of interest for prediction. Outcomes can be derived in different ways, each with advantages and shortcomings. We generally advise against the use of ICD (International Classification of Diseases) codes (as these can be up- or undercoded and are generally not timestamped), unless carefully assessed as appropriate for the prediction task. A good understanding of the hospital’s processes, thorough verification of outcomes derived in code, manual inspection of labels, and agreement between data sources can also prevent incorrect outcome definition. Mapping terminology to coding systems (eg, ICD, Current Procedural Terminology, LOINC, Systematized Nomenclature of Medicine Clinical Terms, or others) can facilitate feature engineering. However, not all items in the EHR system are aligned with standardized terminologies (eg, LOINC codes for laboratory results are often not used). High-quality data extraction documentation and a good understanding of the underlying clinical concepts and health care processes are essential to support feature engineering. Data exploration can reveal unaddressed problems and further inform the construction of meaningful features. For feature engineering in prediction settings, the timestamp when a clinical item is available in the system is of greatest interest, as this is the time at which predictor values become available for prediction in clinical practice. Good documentation and correct interpretation of the extracted timestamps will prevent temporal leaks. Thorough verification of the extraction and preparation processes (using manual or automated tests), data exploration, and reproducible data cleaning can safeguard against data quality issues.
Clinical assessment of the relevance and sequence of extracted features and outcomes within a patient admission, through manual verification of a random sample of admissions, can further help detect problems []. Solutions to specific issues can be implemented at either the extraction or the preparation stage. Applicable to both data extraction and preparation are a good understanding of the underlying data structure and of current and historical health care processes, collaboration with relevant experts in conducting the work [,], and a high-quality extraction and preparation process supported by unit tests. We recommend maintaining consistency in data extraction and preparation between model training and clinical implementation. Exceptions might exist, though, for correcting historical data problems that are not expected to recur in future data.
We hold the opinion that, in the context of prediction models, the extraction process should not attempt to correct errors residing in the EHR database so as to align the data with clinical reality; rather, it should reflect the information from the EHR, presented in a simplified format. We recommend that data cleaning steps be performed during data preparation, pragmatically and programmatically, so that they can be reproduced at implementation time. Extracting and preparing the training data in a different manner than for clinical implementation (eg, with temporal leaks) poses the risk of overoptimistic evaluation, in which case the model’s performance when implemented in clinical practice will be lower than expected. We acknowledge that there are divergent views on this topic and that, in the context of inferential studies, when the analysis does not need to be reproduced on future data, corrections during data extraction might be preferred. At the same time, multipurpose extractions (for both prediction and inferential studies) pose the challenge of reconciling these divergent views.
We do not provide recommendations for correcting (historical) biases in EHR data (eg, racial, gender, and socioeconomic). Research on preprocessing to enhance fairness has mainly focused on weighting and sampling techniques, while our work focuses on data extraction and feature engineering. Identifying and correcting bias typically requires analyzing model performance and comparison across subgroups, which happens after the data are prepared and the model is built. While the primary focus of this tutorial is methodological, several of our recommendations, such as those concerning cohort definition strategies, implicitly reflect data governance considerations, including the need for ethical approval and proper access control. For example, we encourage defining cohorts during data preparation, provided appropriate ethical approvals are in place to extract broad datasets.
While the level of detail for reporting data extraction and preparation in published prediction studies varies, we do not provide specific recommendations on this aspect. Some researchers advocate for comprehensive documentation of these processes [,,], while others emphasize the importance of sharing the data preparation code to ensure transparency [,]. Data extraction for public datasets is generally documented in a separate publication. This can be complemented by sharing the data preparation code and describing in the main paper the key differences between the raw extracted data and the final prepared dataset []. Each strategy has its advantages, depending on the audience, journal requirements, and the desired trade-off between transparency and conciseness. While we advocate for transparent reporting, this paper does not specifically address reporting guidelines for data extraction and preparation.
Implementing terminology mapping and standardization during the data extraction phase (eg, through OMOP CDM) can significantly reduce the effort required during data preparation and enable research on multisite datasets extracted from different EHR systems. Furthermore, toolsets for OMOP CDM EHR data exploration and quality assessment have been developed as part of the Observational Health Data Sciences and Informatics initiative, such as Achilles [] and Data Quality Dashboard [], which facilitate more than 3500 data quality checks. While we recommend standardization as a solution for some of the listed challenges, we aimed to cover both standard and nonstandard extractions, as the adoption of the OMOP CDM framework may involve higher costs and effort, which can be prohibitive for some hospitals.
Our work has several limitations. First, it is based on our experiences and a selective literature review that cannot be exhaustive. We acknowledge that every project will face use-case-specific challenges. Insights from other research groups working with EHR data would likely highlight additional challenges and recommendations that we may have overlooked, offering a more complete set of recommendations.
Second, our focus was on single-site structured EHR data. Extensions to multicenter datasets or unstructured data are possible. Multicenter datasets pose additional challenges with regard to aligning clinical concepts and following the same patient across different hospitals or general practitioner systems. Patient care happens across multiple systems, and single-site extractions provide only a fragmented view of the entire patient care.
Third, we provide recommendations for problems that can impede model implementation in practice, but we do not explicitly cover postimplementation challenges, although some of our recommendations can inform monitoring checks, which remain a subject for medical device postmarket surveillance.
Fourth, while we emphasize the importance of high-quality data extraction and preparation as the foundation of reliable predictive models, we do not assess the impact of each challenge on the final prediction model. The impact will likely depend on the magnitude of the problem, the subgroups in which it manifests, the prediction task, or even the type of model (eg, tree-based models are generally resilient to outliers). While the impact of dataset size, missing data, and outcome definition on model performance has been studied [], the impact of other steps in the data preparation procedure remains unknown. Researchers typically identify and resolve problems during data preparation without assessing their impact on model performance; we acknowledge having done the same. Research on measurement error [] demonstrates that inconsistencies in predictor definitions between training and test data can affect model calibration. A similar impact may occur if correction strategies differ between model training and clinical implementation.
Finally, although we focused on models with clinical applicability by recommending an implementation-ready process that avoids discrepancies between the training data and future data due to data extraction and preparation, we did not specifically address model generalization to new hospital settings. The requirements for reproducing the data extraction and preparation can vary significantly based on the EHR software in use (whether from the same or a different vendor), the data extraction platform, and the hospital workflow and data registration procedures. Sendak et al [] estimate that “approximately 75% of the effort invested in the initial data preparation for developing prediction models must be reinvested for each hospital.” We argue that this estimate would vary widely, with standardized extractions from the same EHR software potentially requiring less time. Further research could focus on extending the list of challenges and recommendations based on the experience of other research groups with EHR data, extending the scope to larger extraction contexts, or assessing the impact of erroneous or suboptimal extraction and preparation on the final model. We strongly encourage researchers to conduct such impact studies, as this remains a notable gap in the current literature.
Understanding how problems propagate through the pipeline is essential for developing trustworthy and clinically meaningful prediction models. We also recognize that decisions made during data extraction and preparation have direct implications for patient safety. Model transparency, reproducibility, and clinical accountability begin early in the pipeline; poor documentation or inconsistent preprocessing can lead to silent errors that affect downstream predictions. Future work could further explore how data extraction and preparation decisions impact not just model performance but also the safe and ethical use of prediction models in clinical settings.
The extensive list of challenges and practical recommendations for EHR data extraction and preparation presented here is intended to improve the quality of research and the practical applicability of clinical predictive models. As all modeling efforts begin with the underlying data, failing to address data quality issues risks producing unreliable and nongeneralizable models. Our focus extends beyond the initial data curation stage in the artificial intelligence life cycle []; we also address early-stage issues that can ultimately negatively impact model deployment. Recognizing that there is no “one-size-fits-all” solution, our list of challenges and recommendations, though not exhaustive, is comprehensive enough to support many EHR-based prediction projects. Implementing the strategies applicable to each project can ultimately enhance robustness, reproducibility, and real-world impact of EHR-based prediction models.
Acknowledgments
The authors used generative artificial intelligence (ChatGPT) to assist with rephrasing parts of the text. All examples (challenges and recommendations), ideas, and opinions are based on the authors’ research experiences. The authors take full responsibility for the content of this paper.
Authors' Contributions
EA handled the conceptualization, investigation, and writing of the original draft. SG, FER, BCTvB, TC, and TH-B reviewed and edited the writing. PS acquired the funding, managed the resources, worked on the formal analysis, and reviewed and edited the writing. LW and BVC acquired the funding, supervised this study, and reviewed and edited the writing.
Conflicts of Interest
None declared.
Use case and glossary.
DOCX File, 31 KB
References
- Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. 2017. Presented at: NIPS'17: Proceedings of the 31st International Conference on Neural Information Processing Systems; December 4-9, 2017:6000-6010; Long Beach, CA.
- Lee C, Yoon J, Schaar MVD. Dynamic-deepHit: a deep learning approach for dynamic survival analysis with competing risks based on longitudinal data. IEEE Trans Biomed Eng. 2020;67(1):122-133. [CrossRef] [Medline]
- Tomašev N, Harris N, Baur S, Mottram A, Glorot X, Rae JW, et al. Use of deep learning to develop continuous-risk models for adverse event prediction from electronic health records. Nat Protoc. 2021;16(6):2765-2787. [CrossRef] [Medline]
- Tomašev N, Glorot X, Rae JW, Zielinski M, Askham H, Saraiva A, et al. A clinically applicable approach to continuous prediction of future acute kidney injury. Nature. 2019;572(7767):116-119. [FREE Full text] [CrossRef] [Medline]
- Lee TC, Shah NU, Haack A, Baxter SL. Clinical implementation of predictive models embedded within electronic health record systems: a systematic review. Informatics (MDPI). 2020;7(3):25. [FREE Full text] [CrossRef] [Medline]
- Gao S, Albu E, Putter H, Stijnen P, Rademakers FE, Cossey V, et al. A comparison of modeling approaches for static and dynamic prediction of central-line bloodstream infections using electronic health records (part 1): regression models. Diagn Progn Res. Jul 21, 2025;9(1):20. [CrossRef] [Medline]
- Albu E, Gao S, Stijnen P, Rademakers F, Janssens C, Cossey V, et al. A comparison of modeling approaches for static and dynamic prediction of central line-associated bloodstream infections using electronic health records (part 2): random forest models. Diagn Progn Res. Jul 21, 2025;9(1):21. [CrossRef] [Medline]
- Albu E, Gao S, Stijnen P, Rademakers FE, Janssens C, Cossey V, et al. Hospital-wide, dynamic, individualized prediction of central line-associated bloodstream infections-development and temporal evaluation of six prediction models. BMC Infect Dis. Apr 24, 2025;25(1):597. [FREE Full text] [CrossRef] [Medline]
- Pollard TJ, Johnson AEW, Raffa JD, Celi LA, Mark RG, Badawi O. The eICU collaborative research database, a freely available multi-center database for critical care research. Sci Data. 2018;5:180178. [FREE Full text] [CrossRef] [Medline]
- Thoral P, Peppink J, Driessen R, Sijbrands EJG, Kompanje EJO, Kaplan L, et al. Amsterdam University Medical Centers Database (AmsterdamUMCdb) Collaborators and the SCCM/ESICM Joint Data Science Task Force. Sharing ICU patient data responsibly under the society of critical care medicine/European society of intensive care medicine joint data science collaboration: the Amsterdam university medical centers database (AmsterdamUMCdb) example. Crit Care Med. 2021;49(6):e563-e577. [FREE Full text] [CrossRef] [Medline]
- de Kok JWTM, de la Hoz, de Jong Y, Brokke V, Elbers PWG, Thoral P, Collaborator group, et al. A guide to sharing open healthcare data under the general data protection regulation. Sci Data. 2023;10(1):404. [FREE Full text] [CrossRef] [Medline]
- All of Us Research Program Investigators, Denny JC, Rutter JL, Goldstein DB, Philippakis A, Smoller JW, et al. The "All of Us" Research Program. N Engl J Med. 2019;381(7):668-676. [FREE Full text] [CrossRef] [Medline]
- Johnson AE, Pollard TJ, Shen L, Lehman LH, Feng M, Ghassemi M, et al. MIMIC-III, a freely accessible critical care database. Sci Data. 2016;3:160035. [FREE Full text] [CrossRef] [Medline]
- Johnson AEW, Bulgarelli L, Shen L, Gayles A, Shammout A, Horng S, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data. 2023;10(1):1. [FREE Full text] [CrossRef] [Medline]
- Jarrett D, Yoon J, Bica I, Qian Z, Ercole A, et al. Clairvoyance: a pipeline toolkit for medical time series. arXiv. Preprint posted online October 28, 2023. [CrossRef]
- Tang S, Davarmanesh P, Song Y, Koutra D, Sjoding M, Wiens J. Democratizing EHR analyses with FIDDLE: a flexible data-driven preprocessing pipeline for structured clinical data. J Am Med Inform Assoc. 2020;27(12):1921-1934. [FREE Full text] [CrossRef] [Medline]
- Gupta M, Gallamoza B, Cutrona N, Dhakal P, Poulain R, Beheshti R. An extensive data processing pipeline for MIMIC-IV. 2022. Presented at: Machine Learning for Health (ML4H). PMLR; November 28, 2022:311-325; New Orleans, LA.
- Wang S, McDermott M, Chauhan G, Ghassemi M, Hughes M, Naumann T. Mimic-extract: a data extraction, preprocessing, and representation pipeline for MIMIC-III. 2020. Presented at: Proceedings of the ACM Conference on Health, Inference, and Learning; April 8-10, 2021:222-235; Virtual Event. [CrossRef]
- Lewis A, Weiskopf N, Abrams Z, Foraker R, Lai AM, Payne PRO, et al. Electronic health record data quality assessment and tools: a systematic review. J Am Med Inform Assoc. 2023;30(10):1730-1740. [FREE Full text] [CrossRef] [Medline]
- Miao Z, Sealey MD, Sathyanarayanan S, Delen D, Zhu L, Shepherd S. A data preparation framework for cleaning electronic health records and assessing cleaning outcomes for secondary analysis. Inf Syst. 2023;111:102130. [CrossRef]
- Thuraisingam S, Chondros P, Dowsey MM, Spelman T, Garies S, Choong PF, et al. Assessing the suitability of general practice electronic health records for clinical prediction model development: a data quality assessment. BMC Med Inform Decis Mak. 2021;21(1):297. [FREE Full text] [CrossRef] [Medline]
- Nobles AL, Vilankar K, Wu H, Barnes LE. Evaluation of data quality of multisite electronic health record data for secondary analysis. 2015. Presented at: IEEE International Conference on Big Data (Big Data); December 8-11, 2015:2612-2620; Macau SAR, China. [CrossRef]
- Weiskopf NG, Weng C. Methods and dimensions of electronic health record data quality assessment: enabling reuse for clinical research. J Am Med Inform Assoc. 2013;20(1):144-151. [FREE Full text] [CrossRef] [Medline]
- Schwabe D, Becker K, Seyferth M, Klaß A, Schaeffter T. The METRIC-framework for assessing data quality for trustworthy AI in medicine: a systematic review. npj Digit Med. 2024;7(1):203. [FREE Full text] [CrossRef] [Medline]
- Ferrão JC, Oliveira M, Janela F, Martins H. Preprocessing structured clinical data for predictive modeling and decision support. A roadmap to tackle the challenges. Appl Clin Inform. 2016;7(4):1135-1153. [FREE Full text] [CrossRef] [Medline]
- Sauer CM, Chen L, Hyland SL, Girbes A, Elbers P, Celi LA. Leveraging electronic health records for data science: common pitfalls and how to avoid them. Lancet Digit Health. 2022;4(12):e893-e898. [FREE Full text] [CrossRef] [Medline]
- Arbet J, Brokamp C, Meinzen-Derr J, Trinkley KE, Spratt HM. Lessons and tips for designing a machine learning study using EHR data. J Clin Transl Sci. 2020;5(1):e21. [FREE Full text] [CrossRef] [Medline]
- Maletzky A, Böck C, Tschoellitsch T, Roland T, Ludwig H, Thumfart S, et al. Lifting hospital electronic health record data treasures: challenges and opportunities. JMIR Med Inform. 2022;10(10):e38557. [FREE Full text] [CrossRef] [Medline]
- Sendak MP, Balu S, Schulman KA. Barriers to achieving economies of scale in analysis of EHR Data. A cautionary tale. Appl Clin Inform. 2017;8(3):826-831. [FREE Full text] [CrossRef] [Medline]
- de Kok JW, van Bussel BC, Schnabel R, van Herpt TT, Driessen RG, Meijs DA, et al. Table0; documenting the steps to go from clinical database to research dataset. J Clin Epidemiol. 2024;170:111342. [FREE Full text] [CrossRef] [Medline]
- Honeyford K, Expert P, Mendelsohn E, Post B, Faisal A, Glampson B, et al. Challenges and recommendations for high quality research using electronic health records. Front Digit Health. 2022;4:940330. [FREE Full text] [CrossRef] [Medline]
- van de Klundert N, Holman R, Dongelmans DA, de Keizer NF. Data resource profile: the Dutch National Intensive Care Evaluation (NICE) registry of admissions to adult intensive care units. Int J Epidemiol. 2015;44(6):1850-1850h. [CrossRef] [Medline]
- Pungitore S, Subbian V. Assessment of prediction tasks and time window selection in temporal modeling of electronic health record data: a systematic review. J Healthc Inform Res. 2023;7(3):313-331. [CrossRef] [Medline]
- O'Malley KJ, Cook KF, Price MD, Wildes KR, Hurdle JF, Ashton CM. Measuring diagnoses: ICD code accuracy. Health Serv Res. 2005;40(5 Pt 2):1620-1639. [FREE Full text] [CrossRef] [Medline]
- Ghassemi M, Naumann T, Schulam P, Beam A, Chen I, Ranganath R. A review of challenges and opportunities in machine learning for health. AMIA Jt Summits Transl Sci Proc. 2020;2020:191-200. [FREE Full text] [Medline]
- Wong A, Otles E, Donnelly JP, Krumm A, McCullough J, DeTroyer-Cooley O, et al. External validation of a widely implemented proprietary sepsis prediction model in hospitalized patients. JAMA Intern Med. 2021;181(8):1065-1070. [FREE Full text] [CrossRef] [Medline]
- Blacketer C, Defalco F, Ryan P, Rijnbeek P. Increasing trust in real-world evidence through evaluation of observational data quality. J Am Med Inform Assoc. 2021;28(10):2251-2257. [FREE Full text] [CrossRef] [Medline]
- Albert PS. Modeling longitudinal biomarker data from multiple assays that have different known detection limits. Biometrics. 2008;64(2):527-537. [CrossRef] [Medline]
- van der Lei J. Use and abuse of computer-stored medical records. Methods Inf Med. 1991;30(02):79-80. [CrossRef]
- Goldstein BA. Five analytic challenges in working with electronic health records data to support clinical trials with some solutions. Clin Trials. 2020;17(4):370-376. [CrossRef] [Medline]
- Chin MH, Afsar-Manesh N, Bierman AS, Chang C, Colón-Rodríguez CJ, Dullabh P, et al. Guiding principles to address the impact of algorithm bias on racial and ethnic disparities in health and health care. JAMA Netw Open. 2023;6(12):e2345050. [FREE Full text] [CrossRef] [Medline]
- Hernandez-Boussard T, Bozkurt S, Ioannidis J, Shah N. MINIMAR (MINimum Information for Medical AI Reporting): developing reporting standards for artificial intelligence in health care. J Am Med Inform Assoc. 2020;27(12):2011-2015. [FREE Full text] [CrossRef] [Medline]
- Goldstein ND. Improved reporting of selection processes in clinical database research. Response to de Kok et al. J Clin Epidemiol. 2024;172:111373. [CrossRef] [Medline]
- Johnson A, Pollard T, Mark R. Reproducibility in critical care: a mortality prediction case study. 2017. Presented at: Machine Learning for Healthcare Conference. PMLR; August 18-19, 2017:361-376; Boston, MA.
- DeFalco F, Ryan P, Schuemie M, Huser V, Knoll C, Londhe A, et al. Achilles: achilles data source characterization. R package version 1.7.2. 2023. URL: https://cran.r-project.org/web/packages/Achilles/index.html [accessed 2025-08-16]
- van Os HJA, Kanning JP, Wermer MJH, Chavannes NH, Numans ME, Ruigrok YM, et al. Developing clinical prediction models using primary care electronic health record data: the impact of data preparation choices on model performance. Front Epidemiol. 2022;2:871630. [CrossRef] [Medline]
- Luijken K, Wynants L, van Smeden M, Van Calster B, Steyerberg EW, Groenwold RH, et al. Collaborators. Changing predictor measurement procedures affected the performance of prediction models in clinical examples. J Clin Epidemiol. 2020;119:7-18. [FREE Full text] [CrossRef] [Medline]
- Ng MY, Kapur S, Blizinsky KD, Hernandez-Boussard T. The AI life cycle: a holistic approach to creating ethical AI for health decisions. Nat Med. 2022;28(11):2247-2249. [FREE Full text] [CrossRef] [Medline]
Abbreviations
| CLABSI: central line-associated bloodstream infections |
| EHR: electronic health record |
| ETL: extract, transform, load |
| ICD: International Classification of Diseases |
| ICU: intensive care unit |
| LOINC: Logical Observation Identifiers, Names, and Codes |
| METRIC: Measurement Process, Timeliness, Representativeness, Informativeness, Consistency |
| OMOP CDM: Observational Medical Outcomes Partnership Common Data Model |
| RESTful: Representational State Transfer |
Edited by J Sarvestan; submitted 17.Mar.2025; peer-reviewed by A Adekola, O Kehinde, CG Udensi, A Bakare; comments to author 07.Apr.2025; revised version received 15.Jul.2025; accepted 21.Jul.2025; published 17.Oct.2025.
Copyright©Elena Albu, Shan Gao, Pieter Stijnen, Frank E Rademakers, Bas CT van Bussel, Taya Collyer, Tina Hernandez-Boussard, Laure Wynants, Ben Van Calster. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 17.Oct.2025.
This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.

