TY - JOUR
AU - Seinen, Tom M
AU - Kors, Jan A
AU - van Mulligen, Erik M
AU - Rijnbeek, Peter R
PY - 2025
DA - 2025/2/13
TI - Using Structured Codes and Free-Text Notes to Measure Information Complementarity in Electronic Health Records: Feasibility and Validation Study
JO - J Med Internet Res
SP - e66910
VL - 27
KW - natural language processing
KW - named entity recognition
KW - clinical concept extraction
KW - machine learning
KW - electronic health records
KW - EHR
KW - word embeddings
KW - clinical concept similarity
KW - text mining
KW - code
KW - free-text
KW - information
KW - electronic record
KW - data
KW - patient records
KW - framework
KW - structured data
KW - unstructured data
AB - Background: Electronic health records (EHRs) consist of both structured data (eg, diagnostic codes) and unstructured data (eg, clinical notes). It is commonly believed that unstructured clinical narratives provide more comprehensive information. However, this assumption lacks large-scale validation and direct validation methods. Objective: This study aims to quantitatively compare the information in structured and unstructured EHR data and directly validate whether unstructured data offers more extensive information across a patient population. Methods: We analyzed both structured and unstructured data from patient records and visits in a large Dutch primary care EHR database between January 2021 and January 2024. Clinical concepts were identified from free-text notes using an extraction framework tailored for Dutch and compared with concepts from structured data. Concept embeddings were generated to measure semantic similarity between structured and extracted concepts through cosine similarity. A similarity threshold was systematically determined via annotated matches and minimized weighted Gini impurity. We then quantified the concept overlap between structured and unstructured data across various concept domains and patient populations.
Results: In a population of 1.8 million patients, only 13% of extracted concepts from patient records and 7% from individual visits had similar structured counterparts. Conversely, 42% of structured concepts in records and 25% in visits had similar matches in unstructured data. Condition concepts had the highest overlap, followed by measurements and drug concepts. Subpopulation visits, such as those with chronic conditions or psychological disorders, showed different proportions of data overlap, indicating varied reliance on structured versus unstructured data across clinical contexts. Conclusions: Our study demonstrates the feasibility of quantifying the information difference between structured and unstructured data, showing that the unstructured data provides important additional information in the studied database and populations. The annotated concept matches are made publicly available for the clinical natural language processing community. Despite some limitations, our proposed methodology proves versatile, and its application can lead to more robust and insightful observational clinical research.
SN - 1438-8871
UR - https://www.jmir.org/2025/1/e66910
UR - https://doi.org/10.2196/66910
DO - 10.2196/66910
ID - info:doi/10.2196/66910
ER -