TY - JOUR
AU - McBain, Ryan K
AU - Cantor, Jonathan H
AU - Zhang, Li Ang
AU - Baker, Olesya
AU - Zhang, Fang
AU - Halbisen, Alyssa
AU - Kofner, Aaron
AU - Breslau, Joshua
AU - Stein, Bradley
AU - Mehrotra, Ateev
AU - Yu, Hao
PY - 2025
DA - 2025/3/5
TI - Competency of Large Language Models in Evaluating Appropriate Responses to Suicidal Ideation: Comparative Study
JO - J Med Internet Res
SP - e67891
VL - 27
KW - depression
KW - suicide
KW - mental health
KW - large language model
KW - chatbot
KW - digital health
KW - Suicidal Ideation Response Inventory
KW - ChatGPT
KW - suicidologist
KW - artificial intelligence
AB - Background: With suicide rates in the United States at an all-time high, individuals experiencing suicidal ideation are increasingly turning to large language models (LLMs) for guidance and support. Objective: The objective of this study was to assess the competency of 3 widely used LLMs to distinguish appropriate versus inappropriate responses when engaging individuals who exhibit suicidal ideation. Methods: This observational, cross-sectional study evaluated responses to the revised Suicidal Ideation Response Inventory (SIRI-2) generated by ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. Data collection and analyses were conducted in July 2024. A common training module for mental health professionals, the SIRI-2 provides 24 hypothetical scenarios in which a patient exhibits depressive symptoms and suicidal ideation, followed by two clinician responses. Clinician responses were scored from –3 (highly inappropriate) to +3 (highly appropriate). All 3 LLMs were provided with a standardized set of instructions to rate clinician responses. We compared LLM responses to those of expert suicidologists, conducting linear regression analyses and converting LLM responses to z scores to identify outliers (z score >1.96 or <–1.96; P<.05). Furthermore, we compared final SIRI-2 scores to those produced by health professionals in prior studies. Results: All 3 LLMs rated responses as more appropriate than ratings provided by expert suicidologists. The item-level mean difference was 0.86 for ChatGPT (95% CI 0.61-1.12; P<.001), 0.61 for Claude (95% CI 0.41-0.81; P<.001), and 0.73 for Gemini (95% CI 0.35-1.11; P<.001). In terms of z scores, 19% (9 of 48) of ChatGPT responses were outliers when compared to expert suicidologists. Similarly, 11% (5 of 48) of Claude responses were outliers compared to expert suicidologists. Additionally, 36% (17 of 48) of Gemini responses were outliers compared to expert suicidologists. ChatGPT produced a final SIRI-2 score of 45.7, roughly equivalent to that of master's-level counselors in prior studies. Claude produced an SIRI-2 score of 36.7, exceeding the prior performance of mental health professionals after suicide intervention skills training. Gemini produced a final SIRI-2 score of 54.5, equivalent to that of untrained K-12 school staff. Conclusions: Current versions of 3 major LLMs demonstrated an upward bias in their evaluations of appropriate responses to suicidal ideation; however, 2 of the 3 models performed equivalently to or exceeded the performance of mental health professionals.
SN - 1438-8871
UR - https://www.jmir.org/2025/1/e67891
UR - https://doi.org/10.2196/67891
UR - http://www.ncbi.nlm.nih.gov/pubmed/40053817
DO - 10.2196/67891
ID - info:doi/10.2196/67891
ER -