Evaluation of Large Language Model Performance and Reliability for Citations and References in Scholarly Writing: Cross-Disciplinary Study

doi:10.2196/52935

Published on 05.Apr.2024 in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/52935, first published 19.Sep.2023.

Joseph Mugaanyi¹; Liuying Cai²; Sumei Cheng²; Caide Lu¹; Jing Huang¹

Journals

Luo X, Chen F, Zhu D, Wang L, Wang Z, Liu H, Lyu M, Wang Y, Wang Q, Chen Y. Potential Roles of Large Language Models in the Production of Systematic Reviews and Meta-Analyses. Journal of Medical Internet Research 2024;26:e56780
Norberg K, Almoubayyed H, De Ley L, Murphy A, Weldon K, Ritter S. Rewriting Content with GPT-4 to Support Emerging Readers in Adaptive Mathematics Software. International Journal of Artificial Intelligence in Education 2025;35(2):587
Uribe S, Maldupa I. Estimating the use of ChatGPT in dental research publications. Journal of Dentistry 2024;149:105275
Oermann M. You Cannot Search the Literature Using Artificial Intelligence, and This Is Why. Nursing Education Perspectives 2024;45(6):337
Sun S, Huynh K, Cortes G, Hill R, Tran J, Yeh L, Ngo A, Houshyar R, Yaghmai V, Tran M. Testing the Ability and Limitations of ChatGPT to Generate Differential Diagnoses from Transcribed Radiologic Findings. Radiology 2024;313(1)
Kayabaşı M, Köksaldı S, Durmaz Engin C. Evaluating the reliability of the responses of large language models to keratoconus-related questions. Clinical and Experimental Optometry 2025;108(7):784
Chang Y, Yin J, Li J, Liu C, Cao L, Lin S. Applications and Future Prospects of Medical LLMs: A Survey Based on the M-KAT Conceptual Framework. Journal of Medical Systems 2024;48(1)
Luo Z, Qiao Y, Xu X, Li X, Xiao M, Kang A, Wang D, Pang Y, Xie X, Xie S, Luo D, Ding X, Liu Z, Liu Y, Hu A, Ren Y, Xie J. Cross sectional pilot study on clinical review generation using large language models. npj Digital Medicine 2025;8(1)
Jongkind R, Elings E, Joukes E, Broens T, Leopold H, Wiesman F, Meinema J. Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study. MedEdPublish 2025;15:11
Oladokun B, Enakrire R, Emmanuel A, Ajani Y, Adetayo A. Hallucitation in Scientific Writing: Exploring Evidence from ChatGPT Versions 3.5 and 4o in Responses to Selected Questions in Librarianship. Journal of Web Librarianship 2025;19(1):62
Spinellis D. False authorship: an explorative case study around an AI-generated article published under my name. Research Integrity and Peer Review 2025;10(1)
See Y, Lim K, Au W, Chia S, Fan X, Li Z. The Use of Large Language Models in Ophthalmology: A Scoping Review on Current Use-Cases and Considerations for Future Works in This Field. Big Data and Cognitive Computing 2025;9(6):151
Spitsberg T, Kettler T, McKamie J. Large language model AI-guided creative writing co-creation in secondary schools. Theory Into Practice 2025;64(4):374
Taloni A, Sangregorio A, Alessio G, Romeo M, Coco G, Busin L, Sollazzo A, Scorcia V, Giannaccare G. Large language models provide discordant information compared to ophthalmology guidelines. Scientific Reports 2025;15(1)
Tsai C, Lin Y, Hou J, Tsai S, Yeh P, Kao C. Optimizing patient education for radioactive iodine therapy and the role of ChatGPT incorporating chain-of-thought technique: ChatGPT questionnaire. DIGITAL HEALTH 2025;11
Asiri S. Assessing the Reliability of ChatGPT and Gemini in Identifying Relevant Orthodontic Literature. European Journal of General Dentistry 2026;15(02):217
Çamlar M, Sevgi U, Erol G, Karakaş F, Doğruel Y, Güngör A. Comparative performance of neurosurgery-specific, peer-reviewed versus general AI chatbots in bilingual board examinations: evaluating accuracy, consistency, and error minimization strategies. Acta Neurochirurgica 2025;167(1)
Ikhtiar I, Fidiana, Islamiyah W. Development and content validity analysis of artificial intelligence-generated Indonesian language insomnia questionnaire based on the International Classification of Sleep Disorders, Third Edition. Journal of Neurosciences in Rural Practice 2025;16:595
Kim E, Kipchumba F, Min S. Geographic Variation in LLM DOI Fabrication: Cross-Country Analysis of Citation Accuracy Across Four Large Language Models. Publications 2025;13(4):49
Moulaison‐Sandy H, Thach H. The Wicked Problem of AI: Information Avoidance, Uncomfortable Knowledge, and ChatGPT in Scholarly Communication. Proceedings of the Association for Information Science and Technology 2025;62(1):1030
Oermann M, Owens J, Carter-Templeton H, Peterson G, Bailey H. Using Artificial Intelligence for Scholarly Writing. AJN, American Journal of Nursing 2025;125(11):52
Bai J, Ji X, Yu J, Wang Y, Guo Y, Xue C, Zhang W, Zhu J. Assessing the Quality of AI Responses to Patient Concerns About Axial Spondyloarthritis: Delphi-Based Evaluation. JMIR AI 2026;5:e79153
Linardon J, Jarman H, McClure Z, Anderson C, Liu C, Messer M. Influence of Topic Familiarity and Prompt Specificity on Citation Fabrication in Mental Health Research Using Large Language Models: Experimental Study. JMIR Mental Health 2025;12:e80371
Oladokun B, Ogunjimi B, Olatunbosun I, Adefila E, Abdul A, Ebhonu S, Omoniyi Y, Enakrire R. Assessing Metadata Quality: Analysis of Bibliographic Entries in Librarianship Literature Generated by ChatGPT-5. Journal of Library Metadata 2026;26(1):51
Zhou M, Gui F, Ai C, Ling L, Zhang X, Lu Y, Han D, Zhao B, Zhong F, Liu J, Zhu Z, Li J, Huang F, Lin C, Liu W, Xiong J. Comparing the performance of four mainstream large language models on medical literature review generation: a human expert evaluation in SMILE surgery. Graefe's Archive for Clinical and Experimental Ophthalmology 2026;264(5):1481
Thelwall M. Do large language models know basic facts about journal articles? Journal of Documentation 2026;82(2):381
Resnik D, Hosseini M, Hauswald R. Autonomous artificial intelligence, scientific research, and human values. AI and Ethics 2026;6(1)
Jongkind R, Elings E, Joukes E, Broens T, Leopold H, Wiesman F, Meinema J. Is your curriculum GenAI-proof? A method for GenAI impact assessment and a case study. MedEdPublish 2026;15:11
Jiao R. Rethink literature review of design research in an age of AI – from ‘secretary work’ to scholarly synthesis of insight, frameworks, and foresight. Journal of Engineering Design 2026:1
Soyugür B. Spor Bilimlerinde Yapay Zekâ Kullanımında Etik İlkelere Duyulan Gereksinim: Ulusal Bir Çalıştay Önerisi. Spor Bilimleri Dergisi Hacettepe Üniversitesi 2026;37(1):97
Soldatkina M, Tsymbarovich P, Fomin D, Ivanov A. Assessing the quality of crop variety data extraction from unstructured text sources using large language models. Agronomy Journal 2026;118(2)
Lin F, Cho S. Evaluating Source-Based Large Language Models for Preclinical Dermatology Education: Comparative Study. JMIR Formative Research 2026;10:e88008
Rajaratnam V, Omar U, Kee K, Kaliya-Perumal A. Citation Inaccuracies and the Need for Multi-Level Oversight in AI-Assisted Medical Writing. Standards 2026;6(1):10
Cook R, Kahan A, Scharfenberger T, Tasoulas J, Hawks-Ladds N, Chouake R, Jariwala S, Arora S. Comparing Large Language Models' Performances on Otolaryngology Knowledge Assessment Questions. Applied Clinical Informatics 2026;17(02):194
Yotov K, Hadzhikoleva S, Hadzhikolev E, Milev M, Rachovski T. A Conceptual Framework for Simulated Self-Assessment and Meta-Evaluation of Generative AI Models. AI 2026;7(4):134
Goto J, Ramnarain U. Does culture matter in AI adoption? A predictive analysis of cultural dimensions, social influence, and personal innovativeness in the modified UTAUT model. International Journal of Educational Technology in Higher Education 2026;23(1)
Polizzi A, Isola G, Caponio V, Santos‐Silva A, González‐Serrano J, Albuquerque R, Brailo V, Farag A, López Jornet M, Robledo Sierra J, Sollecito T, Dan H, Diniz Freitas M, Bissonnette C, Wiriyakijja P, Hernández G, López‐Pintor R. Quality and Readability of Large Language Models' Responses to Oral Lichen Planus Patients' FAQs. Oral Diseases 2026
Cabezas-Clavijo Á, Sidorenko-Bautista P. Assessing the Performance of 8 AI Chatbots in Bibliographic Reference Retrieval: Grok and DeepSeek Outperform ChatGPT, but None are Entirely Accurate. Journal of Data and Information Science 2026
Santhosh V, Vas R, Roychowdhury B, Sakthi K, Rahaman M. Development and content validation of the CAREFUL-AI framework for evaluating AI-generated scientific manuscripts: an exploratory cross-platform study. Research Evaluation 2026;35
Jafari M, Zare M, Abasi A. Comment on “Comparison of large language models in oral and maxillofacial surgery”. British Journal of Oral and Maxillofacial Surgery 2026
Correia A, Saarela M, Kärkkäinen T. Knowledge graphs and large language models for prompt-based scientometric inquiry. Information Processing & Management 2026;63(7):104882
Koo M, Lu M. Who Owns the Argument? A Practical Test for Author Responsibility in AI‐Assisted Scholarly Writing. Learned Publishing 2026;39(3)
Williamson J. Think like an LLM: testing three AI citation verification prompts. Library Hi Tech News 2026:1
Picazo-Sanchez P, Ortiz-Martin L. Evaluating the Integrity of LLM-Generated Citations: Prevalence and Risks of Fabricated References in Scientific Literature. Data 2026;11(5):122
Shah R, Miranda J, Paustian T. Leveraging generative artificial intelligence errors to teach appropriate citation usage. Journal of Microbiology & Biology Education 2026
Hunt D, Di Miceli M. Evaluating the performance of 3 large language models in higher education essay-like assessments in 2024 and 2026. American Journal of STEM Education 2026;25:279
DeTemple D, Störzer S, Arbabzadah S, Riddermann A, Gronau F, Timrott K, Bektas H, Kleine M. The role of large language models in the writing of surgical reviews: fact or fantasy? Langenbeck's Archives of Surgery 2026;411(1)
Wang P, Cao K, Zhao J. Determinants of Generative AI Adoption Intention in Higher Education: A Multidimensional Integration Framework and Empirical Exploration in Eastern and Western Chinese Universities. Sage Open 2026;16(2)
Ülkir M, Paslı B. Reference Hallucination, Citation Reliability, and Readability of Large Language Models in Anatomy‐Related Question Answering. Clinical Anatomy 2026
Farrahi V. The Promise of Foundational Large Language Models in Analysis and Interpretation of Wearable Data: Implications for Physical Behavior Research. Journal for the Measurement of Physical Behaviour 2026;9(1)

Books/Policy Documents

Tariciotti L, Zohdy Y, Riva M, Levi R, Pessina F, Pradilla G. Neurosurgery's Frontline Role in Gliomas Treatment.

Conference Proceedings

Saarela M, Correia A, Kärkkäinen T. 2025 9th International Symposium on Multidisciplinary Studies and Innovative Technologies (ISMSIT). Explainable and Interactive Scientometrics with Large Language Models and Knowledge Graphs
Overney C, Jiang H, Haider U, Moe C, Mangat J, Pantano F, McMillian E, Riggins P, Gillani N. Proceedings of the 2026 CHI Conference on Human Factors in Computing Systems. Human-AI Narrative Synthesis to Foster Shared Understanding in Civic Decision-Making
