%0 Journal Article
%@ 1438-8871
%I JMIR Publications
%V 27
%N
%P e57257
%T Assessing Racial and Ethnic Bias in Text Generation by Large Language Models for Health Care–Related Tasks: Cross-Sectional Study
%A Hanna,John J
%A Wakene,Abdi D
%A Johnson,Andrew O
%A Lehmann,Christoph U
%A Medford,Richard J
%+ Information Services, ECU Health, 2100 Stantonsburg Rd, Greenville, NC, 27834, United States, 1 2528474100, john.hanna@ecuhealth.org
%K sentiment analysis
%K racism
%K bias
%K artificial intelligence
%K reading ease
%K word frequency
%K large language models
%K text generation
%K healthcare
%K task
%K ChatGPT
%K cross sectional
%K consumer-directed
%K human immunodeficiency virus
%D 2025
%7 13.3.2025
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators have implemented safeguards against prompts that overtly seek biased output. Objective: This study aims to investigate potential racial and ethnic bias in 4 popular LLMs (GPT-3.5-turbo [OpenAI], GPT-4 [OpenAI], Gemini-1.0-pro [Google], and Llama3-70b [Meta]) when generating health care consumer–directed text in the absence of overtly biased queries. Methods: In this cross-sectional study, the 4 LLMs were prompted to generate discharge instructions for patients with HIV. Each patient encounter's deidentified metadata, including race/ethnicity, was passed to each model in a table format within a prompt 4 times, altering only the race/ethnicity value (African American, Asian, Hispanic White, and non-Hispanic White) each time while keeping all other information constant. The prompt asked the model to write discharge instructions for each encounter without explicitly mentioning race or ethnicity. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word frequency by race/ethnicity. Results: The only statistically significant difference observed between race/ethnicity groups was in entity count (GPT-4; df=42; P=.047). However, post hoc chi-square analysis of GPT-4's entity counts showed no significant pairwise differences among race/ethnicity categories after Bonferroni correction. Conclusions: All 4 LLMs were relatively invariant to race/ethnicity in terms of linguistic and readability measures. While our study used proxy linguistic and readability measures to investigate racial and ethnic bias in the responses of 4 LLMs on a health care–related task, there is an urgent need to establish universally accepted standards for measuring bias in LLM-generated responses. Further studies are needed to validate these results and assess their implications.
%R 10.2196/57257
%U https://www.jmir.org/2025/1/e57257
%U https://doi.org/10.2196/57257