Benchmark Integrity and Reasoning-Trace Errors in Medical Question Answering With Large Language Models: Mixed Methods Study With Sparse Autoencoders

Published on 12.Jun.2026 in Vol 28 (2026)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/90061, first published 20.Dec.2025.

Jialin Liu^{1,2*}, MD; Siru Liu^{3,4*}, PhD; Adam Wright^{3,5}, PhD

¹ Department of Medical Informatics, West China Hospital of Sichuan University, Chengdu, Sichuan, China

² Department of Otolaryngology-Head and Neck Surgery, West China Hospital of Sichuan University, Chengdu, Sichuan, China

³ Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States

⁴ Department of Computer Science, Vanderbilt University, Nashville, TN, United States

⁵ Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States

*these authors contributed equally

Corresponding Author: