@Article{info:doi/10.2196/63190, author="Xie, Jiacheng and Zhang, Ziyang and Zeng, Shuai and Hilliard, Joel and An, Guanghui and Tang, Xiaoting and Jiang, Lei and Yu, Yang and Wan, Xiufeng and Xu, Dong", title="Leveraging Large Language Models for Infectious Disease Surveillance---Using a Web Service for Monitoring COVID-19 Patterns From Self-Reporting Tweets: Content Analysis", journal="J Med Internet Res", year="2025", month="Feb", day="20", volume="27", pages="e63190", keywords="COVID-19; self-reporting data; large language model; Twitter; social media analysis; natural language processing; machine learning", abstract="Background: The emergence of new SARS-CoV-2 variants, the resulting reinfections, and post--COVID-19 condition continue to impact many people's lives. Tracking websites like the one at Johns Hopkins University no longer report the daily confirmed cases, posing challenges to accurately determine the true extent of infections. Many COVID-19 cases with mild symptoms are self-assessed at home and reported on social media, which provides an opportunity to monitor and understand the progression and evolving trends of the disease. Objective: We aim to build a publicly available database of COVID-19--related tweets and extracted information about symptoms and recovery cycles from self-reported tweets. We have presented the results of our analysis of infection, reinfection, recovery, and long-term effects of COVID-19 on a visualization website that refreshes data on a weekly basis. Methods: We used Twitter (subsequently rebranded as X) to collect COVID-19--related data, from which 9 native English-speaking annotators annotated a training dataset of COVID-19--positive self-reporters. We then used large language models to identify positive self-reporters from other unannotated tweets. We used the Hilbert transform to calculate the lead of the prediction curve ahead of the reported curve.
Finally, we presented our findings on symptoms, recovery, reinfections, and long-term effects of COVID-19 on the Covlab website. Results: We collected 7.3 million tweets related to COVID-19 between January 1, 2020, and April 1, 2024, including 262,278 self-reported cases. The predicted number of infection cases by our model is 7.63 days ahead of the official report. In addition to common symptoms, we identified some symptoms that were not included in the list from the US Centers for Disease Control and Prevention, such as lethargy and hallucinations. Repeat infections were commonly occurring, with rates of second and third infections at 7.49{\%} (19,644/262,278) and 1.37{\%} (3593/262,278), respectively, whereas 0.45{\%} (1180/262,278) also reported that they had been infected >5 times. We identified 723 individuals who shared detailed recovery experiences through tweets, indicating a substantial reduction in recovery time over the years. Specifically, the average recovery period decreased from around 30 days in 2020 to approximately 12 days in 2023. In addition, geographic information collected from confirmed individuals indicates that the temporal patterns of confirmed cases in states such as California and Texas closely mirror the overall trajectory observed across the United States. Conclusions: Although with some biases and limitations, self-reported tweet data serves as a valuable complement to clinical data, especially in the postpandemic era dominated by mild cases. Our web-based analytic platform can play a significant role in continuously tracking COVID-19, finding new uncommon symptoms, detecting and monitoring the manifestation of long-term effects, and providing necessary insights to the public and decision-makers.", issn="1438-8871", doi="10.2196/63190", url="https://www.jmir.org/2025/1/e63190", url="https://doi.org/10.2196/63190" }