TY - JOUR
AU - Hanna, John J
AU - Wakene, Abdi D
AU - Johnson, Andrew O
AU - Lehmann, Christoph U
AU - Medford, Richard J
PY - 2025
DA - 2025/3/13
TI - Assessing Racial and Ethnic Bias in Text Generation by Large Language Models for Health Care–Related Tasks: Cross-Sectional Study
JO - J Med Internet Res
SP - e57257
VL - 27
KW - sentiment analysis
KW - racism
KW - bias
KW - artificial intelligence
KW - reading ease
KW - word frequency
KW - large language models
KW - text generation
KW - healthcare
KW - task
KW - ChatGPT
KW - cross sectional
KW - consumer-directed
KW - human immunodeficiency virus
AB - Background: Racial and ethnic bias in large language models (LLMs) used for health care tasks is a growing concern, as it may contribute to health disparities. In response, LLM operators have implemented safeguards against prompts that overtly seek biased outputs. Objective: This study aimed to investigate potential racial and ethnic bias among 4 popular LLMs: GPT-3.5-turbo (OpenAI), GPT-4 (OpenAI), Gemini-1.0-pro (Google), and Llama3-70b (Meta) in generating health care consumer–directed text in the absence of overtly biased queries. Methods: In this cross-sectional study, the 4 LLMs were prompted to generate discharge instructions for patients with HIV. Each patient encounter's deidentified metadata, including race/ethnicity as a variable, was passed in table format through a prompt 4 times, altering only the race/ethnicity information (African American, Asian, Hispanic White, and non-Hispanic White) each time while keeping all other information constant. The prompt asked the model to write discharge instructions for each encounter without explicitly mentioning race or ethnicity. The LLM-generated instructions were analyzed for sentiment, subjectivity, reading ease, and word frequency by race/ethnicity. Results: The only statistically significant difference observed between race/ethnicity groups was in entity count (GPT-4, df=42, P=.047). However, post hoc chi-square analysis of GPT-4's entity counts showed no significant pairwise differences among race/ethnicity categories after Bonferroni correction. Conclusions: All 4 LLMs were relatively invariant to race/ethnicity in terms of linguistic and readability measures. While our study used proxy linguistic and readability measures to investigate racial and ethnic bias in the responses of 4 LLMs to a health care–related task, there is an urgent need to establish universally accepted standards for measuring bias in LLM-generated responses. Further studies are needed to validate these results and assess their implications.
SN - 1438-8871
UR - https://www.jmir.org/2025/1/e57257
UR - https://doi.org/10.2196/57257
DO - 10.2196/57257
ID - info:doi/10.2196/57257
ER -