Published on 21.8.2024 in Vol 26 (2024)

Preprints (earlier versions) of this paper are available at https://preprints.jmir.org/preprint/56500.
Comparing GPT-4 and Human Researchers in Health Care Data Analysis: Qualitative Description Study


Original Paper

1Department of Urology, University of California San Francisco, San Francisco, CA, United States

2Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, CA, United States

3Department of Anesthesia and Perioperative Care, University of California San Francisco, San Francisco, CA, United States

4Division of General Internal Medicine, Department of Medicine, University of California San Francisco, San Francisco, CA, United States

5Department of Urology, Icahn School of Medicine at Mount Sinai, New York, NY, United States

Corresponding Author:

Kevin Danis Li, BS

Department of Urology

University of California San Francisco

400 Parnassus Ave

San Francisco, CA

United States

Phone: 1 415 353 2200

Email: kevin.d.li@ucsf.edu


Abstract

Background: Large language models including GPT-4 (OpenAI) have opened new avenues in health care and qualitative research. Traditional qualitative methods are time-consuming and require expertise to capture nuance. Although large language models have demonstrated enhanced contextual understanding and inferencing compared with traditional natural language processing, their performance in qualitative analysis versus that of humans remains unexplored.

Objective: We evaluated the effectiveness of GPT-4 versus human researchers in qualitative analysis of interviews with patients with adult-acquired buried penis (AABP).

Methods: Qualitative data were obtained from semistructured interviews with 20 patients with AABP. Human analysis involved a structured 3-stage process: initial observations, line-by-line coding, and consensus discussions to refine themes. In contrast, artificial intelligence (AI) analysis with GPT-4 proceeded in 2 phases: (1) a naïve phase, where GPT-4 outputs were independently evaluated by a blinded reviewer to identify themes and subthemes, and (2) a comparison phase, where AI-generated themes were compared with human-identified themes to assess agreement. We used a general qualitative description approach.

Results: The study population (N=20) comprised predominantly White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men, with a mean age of 58.8 years and BMI of 41.1 kg/m2. Human qualitative analysis identified “urinary issues” in 95% (19/20) and GPT-4 in 75% (15/20) of interviews, with the subtheme “spray or stream” noted in 60% (12/20) and 35% (7/20), respectively. “Sexual issues” were prominent (19/20, 95% humans vs 16/20, 80% GPT-4), although humans identified a wider range of subthemes, including “pain with sex or masturbation” (7/20, 35%) and “difficulty with sex or masturbation” (4/20, 20%). Both analyses similarly highlighted “mental health issues” (11/20, 55%, both), although humans coded “depression” more frequently (10/20, 50% humans vs 4/20, 20% GPT-4). Humans frequently cited “issues using public restrooms” (12/20, 60%) as impacting social life, whereas GPT-4 emphasized “struggles with romantic relationships” (9/20, 45%). “Hygiene issues” were consistently recognized (14/20, 70% humans vs 13/20, 65% GPT-4). Humans uniquely identified “contributing factors” as a theme in all interviews. There was moderate agreement between human and GPT-4 coding (κ=0.401). Reliability assessments of GPT-4’s analyses showed consistent coding for themes including “body image struggles,” “chronic pain” (10/10, 100%), and “depression” (9/10, 90%). Other themes like “motivation for surgery” and “weight challenges” were reliably coded (8/10, 80%), while less frequent themes were variably identified across multiple iterations.

Conclusions: Large language models including GPT-4 can effectively identify key themes in analyzing qualitative health care data, showing moderate agreement with human analysis. While human analysis provided a richer diversity of subthemes, the consistency of AI suggests its use as a complementary tool in qualitative research. With AI rapidly advancing, future studies should iterate analyses and circumvent token limitations by segmenting data, furthering the breadth and depth of large language model–driven qualitative analyses.

J Med Internet Res 2024;26:e56500

doi:10.2196/56500

Introduction

Recent advancements in artificial intelligence (AI), particularly in large language models, have significantly expanded their applications in health care and academic research. These developments raise critical questions about their potential and ethical use [1-3]. GPT-4, developed by OpenAI, is a large language model that uses deep learning algorithms, specifically the generative pretrained transformer (GPT) architecture, to process and generate human-like text [4]. Its training on diverse internet text sources through unsupervised learning enables it to interpret complex language data, making it a potentially invaluable tool for qualitative research [5]. This is especially important in areas where traditional qualitative data analysis is labor-intensive and requires expertise to understand subtle nuances [6]. Furthermore, it is unknown how AI-driven qualitative analysis may differ from human-driven analysis in research contexts.

Despite its potential, the application of AI and large language models to qualitative data remains underexplored [7,8]. Previous studies in the realm of qualitative data analysis have used traditional natural language processing (NLP) models, which often require benchmark-specific training and hand engineering, leading to more constrained contextual understanding and inferencing abilities. For example, Lennon et al [9] combined human coding with an NLP system trained on internal data, significantly reducing coding time, while Cheligeer et al [10] used a model based on BERT (Bidirectional Encoder Representations from Transformers; Google) for faster keyword analysis. However, such models fall short of the advanced contextual and inferencing abilities exhibited by widely trained large language models like GPT-4, which has been shown to outperform traditional systems on standard NLP benchmarks [11]. Although the field is rapidly evolving, there remains a limited number of studies that directly compare AI-driven qualitative analysis with human-driven approaches [12-17].

In this study, we used GPT-4 to re-examine qualitative data from a previously published study of 20 patients with adult-acquired buried penis (AABP), a urological condition with significant psychosocial consequences, and compare its performance with that of human researchers [18]. Evaluating GPT-4 for qualitative analysis in this patient population is particularly important due to the unique and profound psychosocial distress associated with AABP, including issues related to body image, sexual function, and mental health. Qualitative analysis can provide deeper insight into these patients' lived experiences. To accomplish these objectives, we created a series of generalizable prompts that allow the application of GPT-4 to qualitative analysis without requiring specialized knowledge or skills [19]. Finally, we evaluated the validity of our approach by measuring agreement between GPT-4 and human analysis and its reliability by assessing whether prompts consistently elicited similar outputs from the same data.


Methods

Data Source

Qualitative data were from a convenience sample of 20 patients who presented to urology clinics participating in TURNS (Trauma and Urologic Reconstructive Network of Surgeons), a multi-institutional collaborative research group focused on urologic trauma and reconstruction [18]. We conducted semistructured interviews over Zoom live video conferencing [21], focusing on the impact of AABP on personal relationships, social life, mental health, and physical health. Interviews lasted 15 to 30 minutes, and audio was transcribed electronically using Otter transcription software [20]. For both human and GPT-4 qualitative analyses, only deidentified text transcripts were used, ensuring that the qualitative data were interpreted solely from text and providing a comparable basis for both human- and AI-driven analyses.

Human Analysis

Our human-driven analysis used a general qualitative description approach, which differs from other qualitative methods in that the analytic process stays close to the data, describing informants' experiences using their own language [22-24]. The research team initially reviewed interview transcripts, taking notes to capture observations and ideas and to facilitate a comprehensive understanding of the overall content. This preparatory work informed the subsequent structured coding process. To ensure consistency and reliability, the team convened at 3 key stages: (1) before coding, to share initial text impressions and establish a standardized coding protocol; (2) after initiating line-by-line coding, to discuss applied codes and refine categorization strategies; and (3) to assess interrater reliability using weighted Fleiss κ coefficients [25]. Codes with a κ value below 0.75 were discussed among all authors until a coding consensus was reached. This approach enabled the identification and categorization of relevant subthemes and themes.
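For illustration, the agreement calculation at stage 3 can be sketched as follows. This is a minimal example, not the study's actual code: it assumes Python with statsmodels, uses hypothetical ratings, and computes the unweighted Fleiss κ, whereas the study used a weighted variant [25].

```python
# Illustrative only: interrater agreement for nominal codes.
# Ratings are hypothetical; statsmodels provides the unweighted
# Fleiss kappa, shown here as a stand-in for the weighted variant.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

# Rows = coded text segments, columns = raters; values = assigned code IDs
# (0 = "urinary issues", 1 = "sexual issues", 2 = "mental health issues").
ratings = np.array([
    [0, 0, 0],
    [1, 1, 2],
    [2, 2, 2],
    [1, 1, 1],
    [0, 2, 0],
])

# aggregate_raters converts subject-by-rater labels into the
# subject-by-category count table that fleiss_kappa expects.
table, _ = aggregate_raters(ratings)
kappa = fleiss_kappa(table, method="fleiss")
print(f"Fleiss kappa: {kappa:.3f}")  # values below 0.75 triggered consensus discussion
```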

AI Analysis

Each deidentified transcript underwent text formatting removal before analysis by GPT-4 using a standardized prompt set (Figure 1) [26]. The analysis of the GPT-4–generated output was conducted in 2 phases: the naïve phase and the comparison phase.

Figure 1. Procedure for using GPT-4 for qualitative description.
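As a rough sketch of this procedure, a standardized prompt set can also be applied programmatically. The example below assumes the openai Python client and an illustrative prompt of our own wording, not the study's actual prompts, which are shown in Figure 1 and were run in separate chat instances.

```python
# A minimal sketch (not the study's actual code): submitting one
# deidentified transcript to GPT-4 with a standardized prompt via the
# openai Python client. The prompt wording here is illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are assisting with qualitative description of a patient interview. "
    "Identify codes with supporting quotes, then group them into subthemes."
)

def analyze_transcript(transcript_text: str) -> str:
    """Run one fresh (memoryless) analysis pass over a single transcript."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": transcript_text},
        ],
        temperature=0,  # favor reproducibility across repeated runs
    )
    return response.choices[0].message.content
```

Processing each transcript in a separate call mirrors the study design, in which every interview was analyzed in its own instance.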

In the naïve phase, GPT-4’s outputs for each interview were examined to extract relevant codes and quotes. These were then combined into subthemes, with groupings based on conceptual coherence and content relevance, following a standard qualitative description process [24]. Subsequently, similar subthemes were grouped to form overarching themes. Multiple iterations were conducted to refine the subthemes before synthesizing generalizations that held true across the data. Memo writing was integral to this process, capturing the evolving understanding of the data. Importantly, no discussions with the human-analyst team were conducted during this phase to avoid biasing the process. All interactions and evaluations of GPT-4’s analyses were conducted by a blinded reviewer (KDL) who was not involved in the initial human-driven analysis and kept naïve to its outcomes.

In the comparison phase, AI-identified subthemes and themes were compared against those previously identified through human-driven analysis. This phase focused on identifying parallels and alignments between the 2 analyses to provide a direct comparison.

Interview data were collected in 2021, and human analyses were completed by 2022. All GPT-4 analyses were processed in separate instances on December 1, 2023, using the latest model of GPT-4 available at that time.

Measures to Ensure Rigor

The analytic team included KDL, a medical and data science master's student; NR, a clinical research coordinator with extensive experience in managing and coordinating clinical studies in health care settings; and GMA, a fellowship-trained surgeon specializing in urologic conditions, including adult-acquired buried penis. In addition, we consulted BNB, an expert in urologic reconstruction who frequently treats patients with buried penis, for in-depth clinical insights and to ensure the medical accuracy of our interpretations, and RS, a health services researcher and communication scientist with expertise in qualitative methods, for guidance on appropriate methodologies and to ensure the rigor of our analyses.

To ensure rigor, we implemented several strategies addressing credibility, transferability, dependability, and confirmability [27]. For credibility, we built patient rapport through prolonged engagement, as most patients had existing longitudinal relationships at the urology clinics where they received care, allowing for deeper insights into their experiences. For transferability, we reported clinical characteristics of the study participants to inform the applicability of our findings to other populations with AABP and used a multi-institutional sampling strategy to account for potential geographic or local institutional characteristics, ensuring broader applicability of our results.

Dependability was ensured through methodological documentation, where all codes, subthemes, and themes were documented at each step to provide transparency and replicability of our coding decisions. We also maintained detailed audit trails of raw outputs from GPT-4, processed outputs, and the subsequent organization into subthemes and themes, which the team reviewed to ensure consistency and reliability. Confirmability was achieved by having BNB, an expert in urologic reconstruction, review the study findings and provide critical insights during the design phase, and RS, who provided qualitative methodological support. In addition, data were shared with the entire research team, and feedback from all coauthors was incorporated into subsequent interpretation and analysis.

Comparison of Analyses

Qualitative analyses, including themes and subthemes, were summarized using descriptive statistics, including frequencies and proportions. To visually represent agreement between human- and AI-identified themes (validity), an agreement matrix was constructed. We measured interrater reliability using the Cohen κ coefficient. A separate analysis was performed 10 times on the same interview transcript to assess the reliability of GPT-4's analysis. Themes identified exclusively by GPT-4 were highlighted with exemplar quotes that best represented each theme. All analyses were performed using R statistical software (version 4.3.1; The R Foundation).
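As a sketch of this comparison, assume the per-interview theme assignments are reduced to present or absent flags for each rater; agreement and Cohen κ can then be computed as below. The study used R (version 4.3.1), so this Python translation with made-up flags is illustrative only.

```python
# Illustrative validity comparison (the study's analyses were in R):
# reduce each interview-theme pair to a present/absent flag for both
# raters, then summarize agreement. Flags below are made up.
import numpy as np
from sklearn.metrics import cohen_kappa_score, confusion_matrix

human = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 1])  # 1 = theme coded present
gpt4 = np.array([1, 0, 0, 1, 0, 1, 1, 1, 1, 0])

print(confusion_matrix(human, gpt4))   # 2x2 agreement matrix (cf. Figure 2)
print(cohen_kappa_score(human, gpt4))  # chance-corrected agreement (kappa)
```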

Ethical Considerations

The study was approved by the University of California San Francisco (UCSF) institutional review board (IRB; 20-32062), and consent was obtained from all participants. In addition to the original study’s IRB approval, we obtained an exemption from our institution’s IRB for the secondary analysis using GPT-4, as the data were deidentified. Before analysis, all transcripts were reviewed to ensure that they contained no protected health information or identifiable data to maintain participant confidentiality. We used a private instance of GPT-4, known as Versa, which operates independently of OpenAI’s commercial model and does not retain or learn from the data inputted [28]. This instance was used to develop our AI qualitative analysis methodology. For subsequent analyses, all data were confirmed to be thoroughly deidentified before using the commercial version of GPT-4.


Results

Study Population

Participant characteristics are summarized in Table 1. Participants’ mean age and BMI were 58.8 (SD 13.9) years and 41.1 (SD 9.4) kg/m2, respectively. Most participants were White (17/20, 85%), married (12/20, 60%), heterosexual (19/20, 95%) men residing in the Western region of the United States (10/20, 50%). In total, 55% (11/20) of participants underwent surgical correction of their AABP, with interviews conducted at an average of 497 (SD 666) days after surgery.

Table 1. Participant demographics and characteristics.
Characteristics | Values
Age (years), mean (SD) | 58.8 (13.9)
BMI (kg/m2), mean (SD) | 41.1 (9.4)
Self-identified race, n (%)
  White | 17 (85)
  Black or African American | 1 (5)
  Other | 2 (10)
Hispanic or Latin ethnicity, n (%) | 3 (15)
Relationship status, n (%)
  Married | 12 (60)
  Single | 6 (30)
  In a relationship | 2 (10)
Sexual orientation, n (%)
  Heterosexual | 19 (95)
  Homosexual | 1 (5)
Region, n (%)
  West | 10 (50)
  Northeast | 7 (35)
  Midwest | 2 (10)
  South | 1 (5)
Patients who underwent AABPa surgical correction (n=11, 55%), n (%)
  Escutcheonectomy | 9 (45)
  Excision of penile skin with split-thickness skin graft | 6 (30)
  Ventral slit scrotal flap | 5 (25)

aAABP: adult-acquired buried penis.

Qualitative Description

Table 2 presents a comparative analysis of themes and subthemes identified by human researchers versus GPT-4. “Urinary issues” were common in interviews analyzed by human researchers (19/20, 95%) and GPT-4 (15/20, 75%). Issues with “spray or stream” were a notable subtheme (12/20, 60% humans vs 7/20, 35% GPT-4). “Sexual issues” were prominently coded as well, present in 95% (19/20) of human-analyzed interviews and 80% (16/20) of those analyzed by GPT-4, with “inability to perform intercourse” coded as a subtheme more frequently by human researchers (12/20, 60% vs 6/20, 30%). Humans coded a broader array of sexual function issues, such as “pain with sex or masturbation” (7/20, 35%) and “difficulty with sex or masturbation” (4/20, 20%). “Mental health issues” were similarly recognized by both humans and GPT-4 (11/20, 55%, both), with “depression” more frequently coded by humans compared with GPT-4 (10/20, 50% vs 4/20, 20%, respectively). “Impact on social life” was an additional significant theme, with humans coding “issues using public restrooms” (12/20, 60%), while GPT-4 emphasized “struggles with romantic relationships” (9/20, 45%). Both methods identified “hygiene issues” (14/20, 70% humans vs 13/20, 65% GPT-4), highlighting difficulties in maintaining cleanliness. Human researchers uniquely identified “contributing factors” as a theme in all interviews.

Table 2. Human researchers versus GPT-4 qualitative analysis.
Themes and subthemes | Human researchers, n (%) | GPT-4, n (%)
Urinary issues | 19 (95) | 15 (75)
  Spray or stream | 12 (60) | 7 (35)
  Hovers over toilet | 8 (40) | —a
  Pain with urination | 7 (35) | 3 (15)
  History of urethral stricture disease | 3 (15) | —a
  Incontinence | 3 (15) | 3 (15)
  Incomplete bladder emptying | 3 (15) | 2 (10)
  Sits to urinate | 2 (10) | —a
  Smelly urine | 1 (5) | 1 (5)
  Trouble with catheter | 1 (5) | —a
  Uses shower or tub to urinate | 1 (5) | —a
  Frequent urination | —a | 2 (10)
  Getting up at night to urinate | —a | 1 (5)
Sex issues | 19 (95) | 16 (80)
  Unable to perform intercourse | 12 (60) | 6 (30)
  Unable to get erection | 9 (45) | 3 (15)
  Pain with sex or masturbation | 7 (35) | —a
  Difficulty with sex or masturbation | 4 (20) | —a
  Painful erection | 4 (20) | —a
  Unable to maintain erection | 3 (15) | —a
  Avoids sex | 2 (10) | 4 (20)
  Unable to orgasm | 2 (10) | —a
  Reduced genital sensation | 1 (5) | —a
  Takes longer to orgasm | 1 (5) | —a
  Pain with ejaculation | 1 (5) | —a
  Intercourse not enjoyable | 1 (5) | —a
  Adaptive masturbation techniques | —a | 2 (10)
  Poor cosmetic appearance | —a | 2 (10)
  Painful erection | —a | 2 (10)
  Brittle skin | —a | 1 (5)
  Unable to use condom | —a | 1 (5)
  Overuse of pornography | —a | 1 (5)
Mental health issues | 11 (55) | 11 (55)
  Depression | 10 (50) | 6 (30)
  Feels like less of a man | 7 (35) | 4 (20)
  Anxiety | 4 (20) | 2 (10)
  Decreased self-esteem | 3 (15) | —a
  Stress | 1 (5) | 1 (5)
  Emotional turmoil | —a | 2 (10)
  Loss of confidence | —a | 1 (5)
  Guilt | —a | 1 (5)
Impacts social life | 16 (80) | 15 (75)
  Issues using public restrooms | 12 (60) | 8 (40)
  Avoids travel | 6 (30) | —a
  Struggles with romantic relationships | —a | 9 (45)
  Mobility impairment | —a | 6 (30)
  Spousal support | —a | 3 (15)
  Avoids hobbies | —a | 1 (5)
  Avoids social activities | —a | 1 (5)
  Negative impact on career | —a | 1 (5)
Hygiene issues | 14 (70) | 13 (65)
  Hard or effort to clean | 11 (55) | 11 (55)
  Skin tearing | 7 (35) | —a
  Penile bleeding | 6 (30) | 2 (10)
  Infections | 6 (30) | —a
Contributing factors | 20 (100) | —a
  Worse after weight gain | 14 (70) | —a
  Worse after multiple surgeries | 8 (40) | —a
  Worse after weight loss | 4 (20) | —a
  Improvement after weight loss | 0 (0) | —a
aNot applicable.

Validity and Reliability of GPT-4 Analysis

To further assess the validity of GPT-4 analysis, we generated an agreement matrix comparing themes coded by human researchers and GPT-4 per interview (Figure 2). There were 63 instances where both human and GPT-4 analyses agreed on the presence of a theme and 14 instances of agreement on a theme being absent. There was disagreement in 23 cases: 16 where humans identified a theme that GPT-4 did not and 7 where GPT-4 identified a theme that humans did not (Table 3). The overall Cohen κ coefficient was 0.401, indicating moderate agreement.

We assessed reliability by analyzing the same interview transcript 10 times with the same prompt set (Table 4). There was consistent identification of “body image struggles or disfigurement” and “chronic pain and discomfort,” both appearing in all iterations (10/10, 100%). “Depression” was also frequently coded, appearing in 90% (9/10) of analyses. High reliability was observed for “motivated to have surgery,” “uses shower or tub to urinate,” and “weight challenges,” each occurring in 80% (8/10) of the analyses. Other codes such as “issues using public restrooms,” “unable to perform intercourse,” and “negative health care experiences” were present in 70% (7/10) of iterations. Codes for “hard or effort to clean,” “decreased self-esteem,” and “necrotizing fasciitis diagnosis” were identified 60% (6/10) of the time. Codes were less frequent for “urinary tract infections” (3/10, 30%), “sits to urinate” (2/10, 20%), and a cluster of codes that included “dependency on others for care,” “social isolation and loneliness,” “high frequency of urination,” “anxiety,” “loss of physical autonomy,” “financial burden,” and “hematuria,” each appearing once (1/10, 10%).
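A sketch of this reliability tally is shown below, under the assumption that each of the 10 runs is first reduced to a set of code labels; the code names and run contents are abbreviated for illustration.

```python
# Illustrative reliability tally: given the code labels extracted from
# each of the 10 repeated GPT-4 analyses of one transcript, count how
# often each code appears (each code counted once per run).
from collections import Counter

runs = [
    {"body image struggles or disfigurement", "chronic pain and discomfort", "depression"},
    {"body image struggles or disfigurement", "chronic pain and discomfort", "sits to urinate"},
    # ...8 more runs in practice
]

counts = Counter(code for run in runs for code in run)
for code, n in counts.most_common():
    print(f"{code}: {n}/{len(runs)} ({100 * n / len(runs):.0f}%)")
```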

Figure 2. Themes identified per interview by GPT-4 versus human researchers. Boxes depict interview theme analysis: blue (AI) and yellow (human) squares indicate presence, and green squares reflect agreement on presence or absence.
Table 3. Codes and exemplar quotes identified exclusively by GPT-4.
Interview number | GPT-4 code applied: exemplar quote | Theme
3 | Impact on marital relationship: “I am married? And you know it’s it is... strained or? I wasn’t meeting her needs.” | Impacts social life
3 | Hygiene management efforts: “I try to keep myself pretty clean... I really tried to wash my genitals really well.” | Hygiene issues
6 | Mental health impact and resilience: “Yes in some ways it did affect me but other ways I don’t really don’t think it did.” | Mental health issues
8 | Mental health and self-image concerns: “the preconceived notion you know but the man’s function is supposed to be.” | Mental health issues
9 | Improved hygiene post surgery: “I actually feel that hygiene became a lot easier simply because I didn’t have to dig my finger in and run around the shaft to try and wash everything out.” | Hygiene issues
16 | Day-to-day discontent and social withdrawal: “It’s just I just I would hate for other candidates that going forward thinking there is nothing that can be done need to be here they need to have options on the table.” | Impacts social life
18 | Urinary dysfunction and social anxiety: “I would say they’re abnormal for somebody my age a lot of times it’s needing the needing to push… And that can cause anxiety in a public sort of restroom atmosphere.” | Urinary issues
Table 4. Reliability of GPT-4–generated codes.
Codea | Participants, n (%)
Body image struggles or disfigurement | 10 (100)
Chronic pain and discomfort | 10 (100)
Depression | 9 (90)
Motivated to have surgery | 8 (80)
Uses shower or tub to urinate | 8 (80)
Weight challenges | 8 (80)
Issues using public restrooms | 7 (70)
Unable to perform intercourse | 7 (70)
Negative health care experiences | 7 (70)
Hard or effort to clean | 6 (60)
Decreased self-esteem | 6 (60)
Necrotizing fasciitis diagnosis | 6 (60)
Urinary tract infections | 3 (30)
Sits to urinate | 2 (20)
Dependency on others for care | 1 (10)
Social isolation and loneliness | 1 (10)
High frequency of urination | 1 (10)
Anxiety | 1 (10)
Loss of physical autonomy | 1 (10)
Financial burden | 1 (10)
Hematuria | 1 (10)

aPresence of codes from the same interview analyzed 10 times by GPT-4. Each code was counted only once per analysis, indicating whether it was identified (present) or not (absent) during each separate analysis.


Discussion

Principal Results

In this investigation, we directly compared the performance of AI (GPT-4) with human researchers in conducting a qualitative analysis of interviews with patients affected by AABP. Our study is the first of its kind, to our knowledge, to perform such a direct comparison, highlighting the potential use of AI in qualitative research. By using generalized prompts, our method allows researchers without specialized NLP knowledge to use GPT-4 for rigorous qualitative analysis, significantly reducing the time investment required.

Our results showed moderate alignment between GPT-4 and human analyses in identifying key themes, including urinary challenges, sexual health issues, and mental health impacts. Human analysis identified more subthemes, capturing the data’s complexities more thoroughly than GPT-4. This difference may stem from GPT-4’s token size limitations, which restrict its ability to perform comprehensive analyses as the input length increases [29]. The reliability tests revealed that while GPT-4 consistently recognized key codes, its identification of subtler codes was more variable. This suggests that implementing repeated analysis cycles, similar to the human multirater approach, could refine AI’s analytical reliability. Overall, our findings underscore a complementary role for AI and human collaboration in qualitative research, where each can augment the strengths of the other.

The question of how to evaluate the accuracy and reliability of AI-driven analysis is crucial for future research. We adopted a quantitative approach to directly compare the presence of themes and subthemes in both human and AI analyses. By calculating Cohen κ, a statistic that measures interrater reliability by considering the agreement occurring by chance, we provided an objective assessment of the consistency of themes identified by GPT-4 compared with human analysis, presupposing human analysis as the “gold standard.” In addition, to ensure consistency in GPT-4’s outputs, we conducted multiple iterations of the same interview transcript analysis, analogous to traditional qualitative research methods where multiple analysts and iterative coding processes are used to standardize analyses and minimize biases. It is important to note that while these quantitative metrics offer a clear criterion for comparison, they may not fully capture the depth and richness of qualitative insights. GPT-4 has demonstrated the ability to detect subtle nuances and emotional contexts from text data, suggesting that incorporating more qualitative approaches in AI analysis evaluation could enhance the understanding of its analytical capabilities [30,31].

Limitations

A primary limitation of this study arises from the comparison phase, where themes and subthemes generated by GPT-4 were aligned with those identified by human researchers. Although a blinded reviewer was used to mitigate potential bias, the subjective nature of qualitative analysis means that a degree of bias is likely to remain. This is a common challenge in qualitative research, where analysts’ subjective interpretations inherently influence their analysis. However, it can be argued that the use of a large language model such as GPT-4 may present a more objective method of analysis compared with the potential variability inherent between different human researchers’ analyses, due to the large language model’s consistent application of its transformer model.

We deliberately chose qualitative description as our analytic approach, favoring fidelity to the source material over depth of analysis. Qualitative description involves the systematic categorization and interpretation of qualitative data to uncover patterns and insights while staying close to the original data [22-24]. A more context-based approach, such as thematic analysis, could generate richer themes and subthemes but poses challenges for comparability. More interpretative methods may introduce subjectivity, reducing reproducibility. While our methodological choice ensures that our study remains accessible as a framework for others to build on and develop more interpretative techniques, the need for comparison limited our depth of insights.

Qualitative methods have inherent limitations, such as potential bias and limited generalizability due to smaller, nonrandom samples, and aim to produce in-depth insights and understanding rather than population inferences [32,33]. Consequently, our findings may not capture the full diversity of patient experiences, potentially limiting the generalizability of our results. Nevertheless, our study primarily aims to provide a comparative analysis, focusing on GPT-4 as a suitable tool for qualitative research applications.

As GPT-4 and other large language models advance, their analytical capabilities are expected to become more sophisticated, which may alter their proficiency in qualitative analysis. For example, while GPT-3.5 scored in the bottom 10% on a simulated bar examination, GPT-4 has demonstrated a significant improvement, placing within the top 10% of test takers [11]. The study’s findings are therefore a snapshot of GPT-4’s capabilities at a specific point in time and may not fully represent its future potential in qualitative analysis. Despite this limitation, the current trajectory of AI indicates that the use of GPT-4 and similar large language models in qualitative research is likely to become increasingly robust and refined.

Comparison With Previous Work

While studies applying GPT-4 or other large language models to qualitative research are limited, a growing body of work has compared the performance of OpenAI’s GPT models, including GPT-3, -3.5, and -4, with that of humans in academic research and medical education [12-15]. Wang et al [34] found that while ChatGPT can generate accurate and relevant information, it is not without gaps when compared with official sources, indicating a need for supplementary validation from reliable references. Other studies have shown that ChatGPT can mimic the style of human-written research abstracts, albeit with limitations in quality and accuracy, as indicated by the ability of blinded reviewers to distinguish AI-generated content [35]. In the field of medical education, ChatGPT has been shown to outperform medical students on examinations, suggesting valuable applications in examination preparation [36]. Similarly, ChatGPT’s performance on the United States Medical Licensing Examination (USMLE) further showcases the potential use of AI in medical education, where it achieved scores near the passing threshold without specialized training [37]. These findings emphasize that while advanced large language models such as GPT-4 are becoming increasingly competent in complex tasks, their current role remains complementary to human expertise.

The application of GPT-4 and other large language models to health care is a burgeoning field with substantial promise, resting on the fundamental ability of AI to process qualitative data efficiently. In patient care, large language models can enhance communication by translating complex medical language into more accessible terms for health care providers and patients [38]. The performance of large language models on medical licensing examinations also indicates their potential use in supporting clinical decision-making [39]. In administrative contexts, large language models are particularly valuable for generating concise clinical summaries and synthesizing extensive electronic medical record documentation, tasks that typically consume considerable time for health care professionals. The integration of large language models into administrative workflows may increase efficiency and allow clinicians to allocate more time to direct patient care. Health care companies are already beginning to integrate large language models into electronic health records, such as Epic's recent partnership with Microsoft to embed Azure OpenAI service into its own electronic health record systems [40].

Despite its promise, the integration of large language models in health care raises several ethical concerns that warrant careful consideration [41]. Foremost among these is data privacy, particularly regarding the handling of sensitive patient information, necessitating robust safeguards against data breaches. The opacity of these models, due to the unavailability of public training data sets and model weights, poses another concern as it obscures the understanding of their decision-making processes and challenges their trustworthiness in clinical applications [42]. In addition, the commercialization of large language models by major corporations, such as OpenAI, Microsoft, Meta, and Google, brings into question the potential influence of commercial interests on model development and deployment, possibly overshadowing patient welfare. A crucial concern is the risk of patient harm arising from incorrect or biased models, emphasizing the need for rigorous testing and validation of large language models to ensure their reliability and prevent adverse clinical outcomes [43].

Conclusions

Our research demonstrates that large language models like GPT-4 can discern key themes from qualitative health care data when used with standardized prompts. This “out of the box” approach aligns moderately well with qualitative description analysis by human analysts. Future work should use more context-based prompts for deeper and richer themes. As this may introduce greater subjectivity, researchers should also explore iterative analyses, such as synthesizing output from multiple iterations, to improve large language model output reliability. In addition, researchers should assess the qualitative analytic abilities of other popular models like Gemini (Google), Llama (Meta), and Claude (Anthropic AI), and develop methods to circumvent the token limitations inherent in models such as GPT-4 by segmenting qualitative data inputs, enriching the depth and breadth of qualitative analyses.
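As a sketch of the segmentation idea suggested above, a long transcript can be split into token-bounded chunks before analysis, with each chunk analyzed separately and the resulting codes merged afterward. The example assumes OpenAI's tiktoken tokenizer and an arbitrary chunk size.

```python
# Illustrative workaround for token limits: split a transcript into
# chunks that fit the context window. Uses OpenAI's tiktoken tokenizer;
# the 4000-token chunk size is an arbitrary assumption.
import tiktoken

def segment_transcript(text: str, max_tokens: int = 4000) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens tokens."""
    enc = tiktoken.encoding_for_model("gpt-4")
    tokens = enc.encode(text)
    return [
        enc.decode(tokens[i : i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]
```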

Acknowledgments

ChatGPT was not used in the ideation or writing of this manuscript.

Conflicts of Interest

None declared.

  1. Liu Y, Han T, Ma S, Zhang J, Yang Y, Tian J, et al. Summary of ChatGPT-related research and perspective towards the future of large language models. Meta-Radiology. 2023;1(2):100017. [CrossRef]
  2. Clusmann J, Kolbinger FR, Muti HS, Carrero ZI, Eckardt J, Laleh NG, et al. The future landscape of large language models in medicine. Commun Med (Lond). 2023;3(1):141. [FREE Full text] [CrossRef] [Medline]
  3. Meyer JG, Urbanowicz RJ, Martin PCN, O'Connor K, Li R, Peng P, et al. ChatGPT and large language models in academia: opportunities and challenges. BioData Min. 2023;16(1):20. [FREE Full text] [CrossRef] [Medline]
  4. Ray PP. ChatGPT: a comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. Internet of Things and Cyber-Physical Systems. 2023;3:121-154. [CrossRef]
  5. Schopow N, Osterhoff G, Baur D. Applications of the natural language processing tool ChatGPT in clinical practice: comparative study and augmented systematic review. JMIR Med Inform. 2023;11:e48933. [FREE Full text] [CrossRef] [Medline]
  6. Almeida F, Faria D, Queirós A. Strengths and limitations of qualitative and quantitative research methods. Eur J Educ Stud. 2017;3(9):369-387. [FREE Full text]
  7. Kantor J. Best practices for implementing ChatGPT, large language models, and artificial intelligence in qualitative and survey-based research. JAAD Int. 2024;14:22-23. [FREE Full text] [CrossRef] [Medline]
  8. Hitch D. Artificial intelligence augmented qualitative analysis: the way of the future? Qual Health Res. 2024;34(7):595-606. [FREE Full text] [CrossRef] [Medline]
  9. Lennon RP, Fraleigh R, Van Scoy LJ, Keshaviah A, Hu XC, Snyder BL, et al. Developing and testing an automated qualitative assistant (AQUA) to support qualitative analysis. Fam Med Community Health. 2021;9(Suppl 1):e001287. [FREE Full text] [CrossRef] [Medline]
  10. Cheligeer C, Yang L, Nandi T, Doktorchik C, Quan H, Zeng Y, et al. Natural language processing (NLP) aided qualitative method in health research. J Integr Des Process Sci. 2023;27(1):41-58. [CrossRef]
  11. OpenAI, Achiam J, Adler S, Agarwal S, Ahmad L, Akkaya I, et al. GPT-4 Technical Report. ArXiv. Preprint posted online on March 04, 2024. [FREE Full text]
  12. Zhang H, Wu C, Xie J, Lyu Y, Cai J, Carroll J. Redefining qualitative analysis in the AI era: utilizing ChatGPT for efficient thematic analysis. ArXiv. Preprint posted online on May 28, 2024. [FREE Full text]
  13. Hamilton L, Elliott D, Quick A, Smith S, Choplin V. Exploring the use of AI in qualitative analysis: a comparative study of guaranteed income data. Int J Qual Methods. 2023;22:16094069231201504. [CrossRef]
  14. Morgan DL. Exploring the use of artificial intelligence for qualitative data analysis: the case of ChatGPT. Int J Qual Methods. 2023;22. [CrossRef]
  15. Wachinger J, Bärnighausen K, Schäfer LN, Scott K, McMahon SA. Prompts, pearls, imperfections: comparing ChatGPT and a human researcher in qualitative data analysis. Qual Health Res. 2024:10497323241244669. [FREE Full text] [CrossRef] [Medline]
  16. Fuller KA, Morbitzer KA, Zeeman JM, Persky AM, Savage AC, McLaughlin JE. Exploring the use of ChatGPT to analyze student course evaluation comments. BMC Med Educ. 2024;24(1):423. [FREE Full text] [CrossRef] [Medline]
  17. Amirova A, Fteropoulli T, Ahmed N, Cowie MR, Leibo JZ. Framework-based qualitative analysis of free responses of large language models: algorithmic fidelity. PLoS One. 2024;19(3):e0300024. [FREE Full text] [CrossRef] [Medline]
  18. Amend GM, Holler JT, Sadighian MJ, Rios N, Hakam N, Nabavizadeh B, et al. The lived experience of patients with adult acquired buried penis. J Urol. 2022;208(2):396-405. [CrossRef] [Medline]
  19. Meskó B. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res. 2023;25:e50638. [FREE Full text] [CrossRef] [Medline]
  20. Otter.ai - AI Meeting Note Taker & Real-time AI Transcription. URL: https://otter.ai/ [accessed 2024-01-17]
  21. One platform to connect. Zoom. URL: https://zoom.us/ [accessed 2024-01-17]
  22. Sandelowski M. Whatever happened to qualitative description? Res Nurs Health. 2000;23(4):334-340. [CrossRef] [Medline]
  23. Sandelowski M. What's in a name? Qualitative description revisited. Res Nurs Health. 2010;33(1):77-84. [CrossRef] [Medline]
  24. Neergaard MA, Olesen F, Andersen RS, Sondergaard J. Qualitative description - the poor cousin of health research? BMC Med Res Methodol. 2009;9:52. [FREE Full text] [CrossRef] [Medline]
  25. Zapf A, Castell S, Morawietz L, Karch A. Measuring inter-rater reliability for nominal data - which coefficients and confidence intervals are appropriate? BMC Med Res Methodol. 2016;16:93. [FREE Full text] [CrossRef] [Medline]
  26. Thematic analysis with ChatGPT | PART 1: coding qualitative data with ChatGPT. Research with Dr Kriukow YouTube page; May 19, 2023. URL: https://www.youtube.com/watch?v=8dTs7D42ge0 [accessed 2024-07-25]
  27. Ahmed SK. The pillars of trustworthiness in qualitative research. J Med Surg Public Health. 2024;2:100051. [CrossRef]
  28. Bengfort J. Now Available: Versa, UCSF Generative AI Platform. Office of the Chancellor; 2024. URL: https://chancellor.ucsf.edu/news/now-available-versa-ucsf-generative-ai-platform [accessed 2024-07-25]
  29. Kohn R. Mastering token limits and memory in ChatGPT and other large language models. Medium; 2023. URL: https://medium.com/@russkohn/mastering-ai-token-limits-and-memory-ce920630349a [accessed 2024-07-25]
  30. Baktash JA, Dawodi M. GPT-4: a review on advancements and opportunities in natural language processing. ArXiv. Preprint posted online on May 04, 2023. [FREE Full text]
  31. Elyoseph Z, Hadar-Shoval D, Asraf K, Lvovsky M. ChatGPT outperforms humans in emotional awareness evaluations. Front Psychol. 2023;14:1199058. [FREE Full text] [CrossRef] [Medline]
  32. Borgstede M, Scholz M. Quantitative and qualitative approaches to generalization and replication-A representationalist view. Front Psychol. 2021;12:605191. [FREE Full text] [CrossRef] [Medline]
  33. Tenny S, Brannan J, Brannan G. Qualitative Study. StatPearls [Internet]. URL: https://www.ncbi.nlm.nih.gov/books/NBK470395/ [accessed 2024-07-06]
  34. Wang G, Gao K, Liu Q, Wu Y, Zhang K, Zhou W, et al. Potential and limitations of ChatGPT 3.5 and 4.0 as a source of COVID-19 information: comprehensive comparative analysis of generative and authoritative information. J Med Internet Res. 2023;25:e49771. [FREE Full text] [CrossRef] [Medline]
  35. Cheng SL, Tsai SJ, Bai YM, Ko CH, Hsu CW, Yang FC, et al. Comparisons of quality, correctness, and similarity between ChatGPT-generated and human-written abstracts for basic research: cross-sectional study. J Med Internet Res. 2023;25:e51229. [FREE Full text] [CrossRef] [Medline]
  36. Roos J, Kasapovic A, Jansen T, Kaczmarczyk R. Artificial intelligence in medical education: comparative analysis of ChatGPT, bing, and medical students in Germany. JMIR Med Educ. 2023;9:e46482. [FREE Full text] [CrossRef] [Medline]
  37. Kung TH, Cheatham M, Medenilla A, Sillos C, De Leon L, Elepaño C, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health. 2023;2(2):e0000198. [FREE Full text] [CrossRef] [Medline]
  38. Decker H, Trang K, Ramirez J, Colley A, Pierce L, Coleman M, et al. Large language model-based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open. 2023;6(10):e2336997. [FREE Full text] [CrossRef] [Medline]
  39. Benary M, Wang XD, Schmidt M, Soll D, Hilfenhaus G, Nassir M, et al. Leveraging large language models for decision support in personalized oncology. JAMA Netw Open. 2023;6(11):e2343689. [FREE Full text] [CrossRef] [Medline]
  40. Microsoft and Epic expand strategic collaboration with integration of Azure OpenAI Service. Microsoft News Center; Apr 17, 2023. URL: https://news.microsoft.com/2023/04/17/microsoft-and-epic-expand-strategic-collaboration-with-integration-of-azure-openai-service/ [accessed 2024-07-25]
  41. Meskó B, Topol EJ. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med. 2023;6(1):120. [FREE Full text] [CrossRef] [Medline]
  42. Sanderson K. GPT-4 is here: what scientists think. Nature. 2023;615(7954):773. [CrossRef] [Medline]
  43. Zack T, Lehman E, Suzgun M, Rodriguez JA, Celi LA, Gichoya J, et al. Assessing the potential of GPT-4 to perpetuate racial and gender biases in health care: a model evaluation study. Lancet Digit Health. 2024;6(1):e12-e22. [FREE Full text] [CrossRef] [Medline]


AABP: adult-acquired buried penis
AI: artificial intelligence
BERT: Bidirectional Encoder Representations from Transformers
IRB: Institutional Review Board
NLP: natural language processing
TURNS: Trauma and Urologic Reconstructive Network of Surgeons
UCSF: University of California San Francisco
USMLE: United States Medical Licensing Examination


Edited by T de Azevedo Cardoso, Q Jin; submitted 17.01.24; peer-reviewed by L Zhu, MKK Rony, Y Zhong; comments to author 14.05.24; revised version received 31.05.24; accepted 09.07.24; published 21.08.24.

Copyright

©Kevin Danis Li, Adrian M Fernandez, Rachel Schwartz, Natalie Rios, Marvin Nathaniel Carlisle, Gregory M Amend, Hiren V Patel, Benjamin N Breyer. Originally published in the Journal of Medical Internet Research (https://www.jmir.org), 21.08.2024.

This is an open-access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work, first published in the Journal of Medical Internet Research (ISSN 1438-8871), is properly cited. The complete bibliographic information, a link to the original publication on https://www.jmir.org/, as well as this copyright and license information must be included.