%0 Journal Article
%@ 1438-8871
%I JMIR Publications
%V 27
%N
%P e57257
%T Assessing Racial and Ethnic Bias in Text Generation by Large Language Models for Health Care–Related Tasks: Cross-Sectional Study
%A Hanna,John J
%A Wakene,Abdi D
%A Johnson,Andrew O
%A Lehmann,Christoph U
%A Medford,Richard J
%+ Information Services, ECU Health, 2100 Stantonsburg Rd, Greenville, NC, 27834, United States, 1 2528474100, john.hanna@ecuhealth.org
%K sentiment analysis
%K racism
%K bias
%K artificial intelligence
%K reading ease
%K word frequency
%K large language models
%K text generation
%K healthcare
%K task
%K ChatGPT
%K cross sectional
%K consumer-directed
%K human immunodeficiency virus
%D 2025
%7 13.3.2025
%9 Original Paper
%J J Med Internet Res
%G English
%X Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators have implemented safeguards against prompts that overtly seek biased output. Objective: This study aims to investigate potential racial and ethnic bias in 4 popular LLMs (GPT-3.5-turbo [OpenAI], GPT-4 [OpenAI], Gemini-1.0-pro [Google], and Llama3-70b [Meta]) when generating health care consumer–directed text in the absence of overtly biased queries. Methods: In this cross-sectional study, the 4 LLMs were prompted to generate discharge instructions for patients with HIV. Each patient encounter's deidentified metadata, including race/ethnicity, was passed to each model in a table format within a prompt 4 times, altering only the race/ethnicity value (African American, Asian, Hispanic White, and non-Hispanic White) each time while keeping all other information constant. The prompt asked the model to write discharge instructions for each encounter without explicitly mentioning race or ethnicity. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word frequency by race/ethnicity. Results: The only statistically significant difference observed between race/ethnicity groups was in entity count (GPT-4; df=42; P=.047). However, post hoc chi-square analysis of GPT-4's entity counts showed no significant pairwise differences among race/ethnicity categories after Bonferroni correction. Conclusions: All 4 LLMs were relatively invariant to race/ethnicity in terms of linguistic and readability measures. While our study used proxy linguistic and readability measures to investigate racial and ethnic bias in the responses of 4 LLMs on a health care–related task, there is an urgent need to establish universally accepted standards for measuring bias in LLM-generated responses. Further studies are needed to validate these results and assess their implications.
%R 10.2196/57257
%U https://www.jmir.org/2025/1/e57257
%U https://doi.org/10.2196/57257