@Article{info:doi/10.2196/70703, author="P{\'e}rez-Esteve, Clara and Guilabert, Mercedes and Matarredona, Valerie and Srulovici, Einav and Tella, Susanna and Strametz, Reinhard and Mira, Jos{\'e} Joaqu{\'i}n", title="AI in Home Care---Evaluation of Large Language Models for Future Training of Informal Caregivers: Observational Comparative Case Study", journal="J Med Internet Res", year="2025", month="Apr", day="28", volume="27", pages="e70703", keywords="large language models; older adults; informal caregiver; error prevention; patient safety; ChatGPT; Microsoft Copilot; training; health literacy", abstract="Background: The aging population represents an achievement for society but also poses significant challenges for governments, health care systems, and caregivers. Elevated rates of functional limitations among older adults, primarily caused by chronic conditions, necessitate adequate and safe care, including in home settings. Traditionally, informal caregiver training has relied on verbal and written instructions. However, the advent of digital resources has introduced videos and interactive platforms, offering more accessible and effective training. Large language models (LLMs) have emerged as potential tools for personalized information delivery. While LLMs exhibit the capacity to mimic clinical reasoning and support decision-making, their potential to serve as alternatives to evidence-based professional instruction remains unexplored. Objective: We aimed to evaluate the appropriateness of home care instructions generated by LLMs (including GPTs) in comparison to a professional gold standard. We also sought to identify specific domains where LLMs show the most promise and where improvements are necessary to optimize their reliability for caregiver training. Methods: An observational, comparative case study evaluated 3 LLMs---GPT-3.5, GPT-4o, and Microsoft Copilot---in 10 home care scenarios. 
A rubric assessed the models against a reference standard (gold standard) created by health care professionals. Independent reviewers evaluated variables including specificity, clarity, and self-efficacy. In addition to comparing each LLM to the gold standard, the models were compared against each other across all study domains to identify relative strengths and weaknesses. Statistical analyses compared LLM performance to the gold standard to ensure consistency and validity and analyzed differences between LLMs across all evaluated domains. Results: The study revealed that while no LLM achieved the precision of the professional gold standard, GPT-4o outperformed GPT-3.5 and Copilot in specificity (4.6 vs 3.7 and 3.6), clarity (4.8 vs 4.1 and 3.9), and self-efficacy (4.6 vs 3.8 and 3.4). However, the models exhibited significant limitations, with GPT-4o and Copilot omitting relevant details in 60{\%} (6/10) of the cases, and GPT-3.5 doing so in 80{\%} (8/10). When compared to the gold standard, only 10{\%} (2/20) of GPT-4o responses were rated as equally specific, 20{\%} (4/20) included comparable practical advice, and just 5{\%} (1/20) provided a justification as detailed as professional guidance. Furthermore, error frequency did not differ significantly across models (P=.65), though Copilot had the highest rate of incorrect information (20{\%}, 2/10 vs 10{\%}, 1/10 for GPT-4o and 0{\%}, 0/10 for GPT-3.5). Conclusions: LLMs, particularly the subscription-based GPT-4o, show potential as tools for training informal caregivers by providing tailored guidance and reducing errors. Although not yet surpassing professional instruction quality, these models offer a flexible and accessible alternative that could enhance home safety and care quality. Further research is necessary to address limitations and optimize their performance. Future implementation of LLMs may alleviate health care system burdens by reducing common caregiver errors. 
", issn="1438-8871", doi="10.2196/70703", url="https://www.jmir.org/2025/1/e70703" }