TY - JOUR
AU - Šuvalov, Hendrik
AU - Lepson, Mihkel
AU - Kukk, Veronika
AU - Malk, Maria
AU - Ilves, Neeme
AU - Kuulmets, Hele-Andra
AU - Kolde, Raivo
PY - 2025
DA - 2025/3/18
TI - Using Synthetic Health Care Data to Leverage Large Language Models for Named Entity Recognition: Development and Validation Study
JO - J Med Internet Res
SP - e66279
VL - 27
KW - natural language processing
KW - named entity recognition
KW - large language model
KW - synthetic data
KW - LLM
KW - NLP
KW - machine learning
KW - artificial intelligence
KW - language model
KW - NER
KW - medical entity
KW - Estonian
KW - health care data
KW - annotated data
KW - data annotation
KW - clinical decision support
KW - data mining
AB - Background: Named entity recognition (NER) plays a vital role in extracting critical medical entities from health care records, facilitating applications such as clinical decision support and data mining. Developing robust NER models for low-resource languages, such as Estonian, remains a challenge due to the scarcity of annotated data and domain-specific pretrained models. Large language models (LLMs) have proven to be promising in understanding text from any language or domain. Objective: This study addresses the development of medical NER models for low-resource languages, specifically Estonian. We propose a novel approach by generating synthetic health care data and using LLMs to annotate them. These synthetic data are then used to train a high-performing NER model, which is applied to real-world medical texts, preserving patient data privacy. Methods: Our approach to overcoming the shortage of annotated Estonian health care texts involves a three-step pipeline: (1) synthetic health care data are generated using a locally trained GPT-2 model on Estonian medical records, (2) the synthetic data are annotated with LLMs, specifically GPT-3.5-Turbo and GPT-4, and (3) the annotated synthetic data are then used to fine-tune an NER model, which is later tested on real-world medical data. This paper compares the performance of different prompts; assesses the impact of GPT-3.5-Turbo, GPT-4, and a local LLM; and explores the relationship between the amount of annotated synthetic data and model performance. Results: The proposed methodology demonstrates significant potential in extracting named entities from real-world medical texts. Our top-performing setup achieved an F1-score of 0.69 for drug extraction and 0.38 for procedure extraction. These results indicate a strong performance in recognizing certain entity types while highlighting the complexity of extracting procedures. Conclusions: This paper demonstrates a successful approach to leveraging LLMs for training NER models using synthetic data, effectively preserving patient privacy. By avoiding reliance on human-annotated data, our method shows promise in developing models for low-resource languages, such as Estonian. Future work will focus on refining the synthetic data generation and expanding the method’s applicability to other domains and languages.
SN - 1438-8871
UR - https://www.jmir.org/2025/1/e66279
UR - https://doi.org/10.2196/66279
DO - 10.2196/66279
ID - info:doi/10.2196/66279
ER -