Published in Vol 28 (2026)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/90061.

Benchmark Integrity and Reasoning-Trace Errors in Medical Question Answering With Large Language Models: Mixed Methods Study With Sparse Autoencoders

Jialin Liu 1,2*, MD; Siru Liu 3,4*, PhD; Adam Wright 3,5, PhD

1 Department of Medical Informatics, West China Hospital of Sichuan University, Chengdu, Sichuan, China

2 Department of Otolaryngology-Head and Neck Surgery, West China Hospital of Sichuan University, Chengdu, Sichuan, China

3 Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States

4 Department of Computer Science, Vanderbilt University, Nashville, TN, United States

5 Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, United States

*these authors contributed equally

Corresponding Author:

  • Siru Liu, PhD
  • Department of Biomedical Informatics
  • Vanderbilt University Medical Center
  • 2525 West End Ave
  • Nashville, TN 37203
  • United States
  • Phone: 1 615 936 6867
  • Email: siru.liu@vumc.org