the health strategist
multidisciplinary institute
Joaquim Cardoso MSc.
Chief Research and Strategy Officer (CRSO),
Chief Editor and Senior Advisor
August 23, 2023
What is the message?
A study by Mass General Brigham reveals that ChatGPT, an AI chatbot, achieves a 72% accuracy rate in clinical decision-making.
The research finds that ChatGPT’s performance is comparable to that of an intern or resident who has recently graduated from medical school.
While it excels at final diagnoses, the chatbot’s accuracy is notably lower when suggesting potential (differential) diagnoses, underlining its potential to aid healthcare professionals in decision-making processes.
One-page summary:
Mass General Brigham recently conducted a study to evaluate the clinical decision-making capabilities of ChatGPT, an artificial intelligence (AI) chatbot. According to the findings published on August 22, ChatGPT demonstrated a 72% accuracy rate in clinical decision-making.
The study specifically assessed the chatbot’s ability to work through 36 published clinical scenarios end to end: generating possible (differential) diagnoses, recommending diagnostic testing, reaching final diagnoses, and making clinical management decisions.
In the study, researchers fed ChatGPT with clinical scenarios and requested differential diagnoses, followed by additional information to arrive at a final diagnosis and treatment plan.
The AI chatbot exhibited varying levels of accuracy: 77% in providing final diagnoses, 68% in making clinical management decisions, such as recommending appropriate medications, and 60% in offering differential diagnoses.
The corresponding author of the study, Marc Succi, MD, who is also the associate chair of innovation and commercialization at Mass General Brigham, likened ChatGPT’s performance to that of a recent medical school graduate, such as an intern or resident.
The chatbot’s accuracy rates were notably higher in delivering final diagnoses, indicating its potential as a tool to aid healthcare professionals in decision-making processes.
Adam Landman, MD, CIO and senior vice president of digital at Mass General Brigham, expressed optimism about the future role of large language models like ChatGPT in clinical care.
The health system is currently exploring how such models can assist with clinical documentation and patient communication. Landman highlighted the need for rigorous evaluation, similar to the study conducted, to ensure the accuracy, reliability, safety, and equity of integrating AI tools into clinical practice.
DEEP DIVE
Assessing the Utility of ChatGPT Throughout the Entire Clinical Workflow: Development and Usability Study [excerpt]
Journal of Medical Internet Research
Arya Rao, Michael Pang, John Kim, Meghana Kamineni, Winston Lie, Anoop K Prasad, Adam Landman, Keith Dreyer, Marc D Succi
August 22, 2023
Abstract
Background:
Large language model (LLM)–based artificial intelligence chatbots direct the power of large training data sets toward successive, related tasks as opposed to single-ask tasks, for which artificial intelligence already achieves impressive performance. The capacity of LLMs to assist in the full scope of iterative clinical reasoning via successive prompting, in effect acting as artificial physicians, has not yet been evaluated.
Objective:
This study aimed to evaluate ChatGPT’s capacity for ongoing clinical decision support via its performance on standardized clinical vignettes.
Methods:
We inputted all 36 published clinical vignettes from the Merck Sharpe & Dohme (MSD) Clinical Manual into ChatGPT and compared its accuracy on differential diagnoses, diagnostic testing, final diagnosis, and management based on patient age, gender, and case acuity. Accuracy was measured by the proportion of correct responses to the questions posed within the clinical vignettes tested, as calculated by human scorers. We further conducted linear regression to assess the contributing factors toward ChatGPT’s performance on clinical tasks.
Results:
ChatGPT achieved an overall accuracy of 71.7% (95% CI 69.3%-74.1%) across all 36 clinical vignettes. The LLM demonstrated the highest performance in making a final diagnosis with an accuracy of 76.9% (95% CI 67.8%-86.1%) and the lowest performance in generating an initial differential diagnosis with an accuracy of 60.3% (95% CI 54.2%-66.6%). Compared to answering questions about general medical knowledge, ChatGPT demonstrated inferior performance on differential diagnosis (β=–15.8%; P<.001) and clinical management (β=–7.4%; P=.02) question types.
Conclusions:
ChatGPT achieves impressive accuracy in clinical decision-making, with increasing strength as it gains more clinical information at its disposal. In particular, ChatGPT demonstrates the greatest accuracy in tasks of final diagnosis as compared to initial diagnosis. Limitations include possible model hallucinations and the unclear composition of ChatGPT’s training data set.
Introduction
Despite its relative infancy, artificial intelligence (AI) is transforming health care, with current uses including workflow triage, predictive models of utilization, labeling and interpretation of radiographic images, patient support via interactive chatbots, communication aids for non–English-speaking patients, and more [1–8]. Yet, all of these use cases are limited to a specific part of the clinical workflow and do not provide longitudinal patient or clinician support. An underexplored use of AI in medicine is predicting and synthesizing patient diagnoses, treatment plans, and outcomes. Until recently, AI models have lacked sufficient accuracy and power to engage meaningfully in the clinical decision-making space. However, the advent of large language models (LLMs), which are trained on large amounts of human-generated text such as text from the internet, has motivated further investigation into whether AI can serve as an adjunct in clinical decision-making throughout the entire clinical workflow, from triage to diagnosis to management. In this study, we assessed the performance of a novel LLM, ChatGPT (OpenAI) [9], on comprehensive clinical vignettes (short, hypothetical patient cases used to test clinical knowledge and reasoning).
ChatGPT is a popular chatbot derivative of OpenAI’s Generative Pre-trained Transformer-3.5 (GPT-3.5), an autoregressive LLM released in 2022 [9]. Due to the chatbot’s widespread availability, a small but growing body of preliminary studies has described ChatGPT’s performance on various professional exams (eg, medicine, law, business, and accounting) [10–14] and its ability to generate highly technical texts such as those found in biomedical literature [15]. Recently, there has been great interest in using the nascent but powerful chatbot for clinical decision support [16–20].
Given that LLMs such as ChatGPT have the ability to integrate large amounts of textual information to synthesize responses to human-generated prompts, we speculated that ChatGPT would be able to act as an on-the-ground copilot in clinical reasoning, making use of the wealth of information available during patient care from the electronic health record and other sources. We focused on comprehensive clinical vignettes as a model. Our study is the first to make use of ChatGPT’s ability to integrate information from the earlier portions of a conversation into downstream responses. Thus, this model lends itself well to the iterative nature of clinical medicine, in that the influx of new information requires constant updating of prior hypotheses. In this study, we tested the hypothesis that when provided with clinical vignettes, ChatGPT would be able to recommend diagnostic workup, decide the clinical management course, and ultimately make the diagnosis, thus working through the entire clinical encounter.
Methods
Study Design
We assessed ChatGPT’s accuracy in solving comprehensive clinical vignettes, comparing across patient age, gender, and acuity of clinical presentation. We presented each portion of the clinical workflow as a successive prompt to the model (differential diagnosis, diagnostic testing, final diagnosis, and clinical management questions were presented one after the other; Figure 1A).
Setting
ChatGPT (OpenAI) is a transformer-based language model with the ability to generate human-like text. It captures the context and relationship between words in input sequences through multiple layers of self-attention and feed-forward neural networks. The language model is trained on a variety of text including websites, articles, and books up until 2021. The ChatGPT model is self-contained in that it does not have the ability to search the internet when generating responses. Instead, it predicts the most likely “token” to succeed the previous one based on patterns in its training data. Therefore, it does not explicitly search through existing information, nor does it copy existing information. All ChatGPT model outputs were collected from the January 9, 2023, version of ChatGPT.
Data Sources and Measurement
Clinical vignettes were selected from the Merck Sharpe & Dohme (MSD) Clinical Manual, also referred to as the MSD Manual [21]. These vignettes represent canonical cases that commonly present in health care settings and include components analogous to clinical encounter documentation such as the history of present illness (HPI), review of systems (ROS), physical exam (PE), and laboratory test results. The web-based vignette modules include sequential “select all that apply”–type questions to simulate differential diagnosis, diagnostic workup, and clinical management decisions. They are written by independent experts in the field and undergo a peer review process before being published. At the time of the study, 36 vignette modules were available on the web, 34 of which had already been published as of ChatGPT’s September 2021 training data cutoff date. All 36 modules passed the eligibility criteria of having a primarily textual basis and were included in the ChatGPT model assessment.
Case transcripts were generated by copying MSD Manual vignettes directly into ChatGPT. Questions posed in the MSD Manual vignettes were presented as successive inputs to ChatGPT (Figure 1B). All questions requesting the clinician to analyze images were excluded from our study, as ChatGPT is a text-based AI without the ability to interpret visual information.
ChatGPT’s answers are informed by the context of the ongoing conversation. To avoid the influence of other vignettes’ answers on model output, a new ChatGPT session was instantiated for each vignette. A single session was maintained for each vignette and all associated questions, allowing ChatGPT to take all available vignette information into account as it proceeded to answer new questions. To account for response-by-response variation, each vignette was tested in triplicate, each time by a different user. Prompts were not modified from user to user.
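As an illustration only (added in this excerpt and not part of the original study, which worked manually in the ChatGPT web interface), the R sketch below shows one way this replication protocol could be organized; new_chat_session() and ask_chatgpt() are hypothetical stand-ins for the chat interactions.

```r
# Hypothetical stand-ins for the manual ChatGPT web sessions used in the study;
# replace these with real chat calls to actually reproduce the protocol.
new_chat_session <- function() new.env()
ask_chatgpt <- function(session, prompt) paste("model response to:", prompt)

# One vignette and all of its questions, run in triplicate with a fresh session each time.
run_vignette <- function(vignette_text, questions, n_replicates = 3) {
  lapply(seq_len(n_replicates), function(rep) {
    session <- new_chat_session()               # new session: no carryover from other vignettes
    ask_chatgpt(session, vignette_text)         # present the full case (HPI, ROS, PE, labs) first
    vapply(questions,                           # then pose each question in sequence,
           function(q) ask_chatgpt(session, q), #   keeping all prior context in the same session
           character(1))
  })
}

replies <- run_vignette("Vignette text ...", c("Differential diagnosis?", "Final diagnosis?"))
```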
We awarded points for each correct answer given by ChatGPT and noted the total number of correct decisions possible for each question. For example, for a question asking whether each diagnostic test on a list is appropriate for the patient presented, a point was awarded each time ChatGPT’s answer was concordant with the provided MSD Manual answer.
Two scorers independently calculated an individual score for each output by inputting ChatGPT responses directly into the MSD Manual modules to ensure consensus on all output scores; there were no scoring discrepancies. The final score for each prompt was calculated as an average of the 3 replicate scores. Based on the total possible number of correct decisions per question, we calculated a proportion of correct decisions for each question (“average proportion correct” refers to the average proportion across replicates). A schematic of the workflow is provided in Figure 1A.
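A minimal sketch of this scoring arithmetic, added for this excerpt with illustrative (non-study) numbers, is shown below in base R:

```r
# Illustrative scoring data: one row per (vignette, question, replicate),
# with decisions concordant with the MSD Manual key vs. total decisions possible.
scores <- data.frame(
  vignette  = c(1, 1, 1, 1, 1, 1),
  question  = c("q1", "q1", "q1", "q2", "q2", "q2"),
  replicate = c(1, 2, 3, 1, 2, 3),
  correct   = c(7, 6, 7, 3, 4, 4),
  possible  = c(8, 8, 8, 5, 5, 5)
)

# Proportion of correct decisions for each replicate of each question ...
scores$prop_correct <- scores$correct / scores$possible

# ... and the "average proportion correct" across the 3 replicate scores.
avg_prop <- aggregate(prop_correct ~ vignette + question, data = scores, FUN = mean)
print(avg_prop)
```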
Participants and Variables
The MSD Manual vignettes feature hypothetical patients and include information on the age and gender of each patient. We used this information to assess the effect of age and gender on accuracy. To assess differential performance across the range of clinical acuity, the Emergency Severity Index (ESI) [22] was used to rate the acuity of the MSD Manual clinical vignettes. The ESI is a 5-level triage algorithm used to assign patient priority in the emergency department. Assessment is based on medical urgency and considers the patient’s chief complaint, vital signs, and ability to ambulate. The ESI is an ordinal scale ranging from 1 to 5, corresponding to the highest to lowest acuity, respectively. For each vignette, we fed the HPI into ChatGPT to determine its ESI and cross-validated the result against human ESI scoring. All vignette metadata, including title, age, gender, ESI, and final diagnosis, can be found in Table S1 in Multimedia Appendix 1.
Questions posed by the MSD Manual vignettes fall into several categories: differential diagnoses (diff), which ask the user to determine which of several conditions cannot be eliminated from an initial differential; diagnostic questions (diag), which ask the user to determine appropriate diagnostic steps based on the current hypotheses and information; diagnosis questions (dx), which ask the user for a final diagnosis; management questions (mang), which ask the user to recommend appropriate clinical interventions; and miscellaneous questions (misc), which ask the user medical knowledge questions relevant to the vignette, but not necessarily specific to the patient at hand. We stratified results by question type and the demographic information previously described.
Statistical Methods
Multivariable linear regression was performed using the lm() function with R (version 4.2.1; R Foundation for Statistical Computing) to assess the relationship between ChatGPT vignette performance, question type, demographic variables (age and gender), and clinical acuity (ESI). The outcome variable was the proportion of correct ChatGPT responses for each question and approximated a Gaussian distribution. Age and gender were provided in each vignette and are critical diagnostic information; thus, they were included in the model based on their theoretical importance to model performance. ESI was included to assess the effect of clinical acuity on ChatGPT performance. Question type was dummy-variable encoded to assess the effect of each category independently. The misc question type was chosen as the reference category, as these questions assess general knowledge and not necessarily active clinical reasoning.
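The sketch below, added for this excerpt, illustrates the shape of such a model in R; the exact formula and variable names are assumptions based on the description above, and the data are placeholders rather than study results. Setting misc as the reference level of the question-type factor reproduces the described dummy-variable comparison against general knowledge questions.

```r
set.seed(1)  # placeholder data only; not the study's results

# One row per question: average proportion correct plus vignette-level covariates.
questions <- data.frame(
  prop_correct  = runif(40, 0.4, 1.0),
  question_type = factor(rep(c("diff", "diag", "dx", "mang", "misc"), each = 8)),
  age           = sample(18:80, 40, replace = TRUE),
  gender        = factor(rep(c("female", "male"), 20)),
  esi           = sample(1:5, 40, replace = TRUE)
)

# Use misc (general medical knowledge) as the reference category, so each
# question-type coefficient is interpreted relative to it.
questions$question_type <- relevel(questions$question_type, ref = "misc")

# Multivariable linear regression with lm(), as named in the paper; the exact
# formula here is an assumption based on the variables described.
fit <- lm(prop_correct ~ question_type + age + gender + esi, data = questions)
summary(fit)   # coefficients (beta), P values, and the model R-squared
```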
Results
Overall Performance
Since questions from all vignettes fall into several distinct categories, we were able to assess performance not only on a vignette-by-vignette basis but also on a category-by-category basis. We found that on average, across all vignettes, ChatGPT achieved an accuracy of 71.8% (Figure 2A; Tables S2-S3 in Multimedia Appendix 1). Between categories and across all vignettes, ChatGPT achieved the highest accuracy (76.9%) for questions in the dx category and the lowest accuracy (60.3%) for questions in the diff category (Figure 2B; Table S3 in Multimedia Appendix 1). Trends for between–question type variation in accuracy for each vignette are shown in Figure 2C.
Vignette #28, featuring a right testicular mass in a 28-year-old man (final diagnosis of testicular cancer), showed the highest accuracy overall (83.8%). Vignette #27, featuring recurrent headaches in a 31-year-old woman (final diagnosis of pheochromocytoma), showed the lowest accuracy overall (55.9%; Figure 2A; Table S2 in Multimedia Appendix 1). These findings indicate a possible association between the prevalence of diagnosis and ChatGPT accuracy.
Differential Versus Final Diagnosis
Both diff and dx questions ask the user to generate a broad differential diagnosis followed by a final diagnosis. The key difference between the 2 question types is that answers to diff questions rely solely on the HPI, ROS, and PE, whereas answers to dx questions incorporate results from relevant diagnostic testing and potentially additional clinical context. Therefore, a comparison between the 2 sheds light on whether ChatGPT’s utility in the clinical setting improves with the amount of accurate, patient-specific information it has access to.
We found a statistically significant difference in performance between these 2 question types overall (Figure 2B). Average performance on diff questions was 60.3%, and average performance on dx questions was 76.9%, a 16.6 percentage-point average increase in diagnostic accuracy as more clinical context is provided. We also found statistically significant differences in accuracy between diff and dx questions within the majority of individual vignettes, showing that this is not merely an aggregate phenomenon but one that applies broadly and underscoring the importance of more detailed prompts in determining ChatGPT accuracy, as dx prompt responses incorporate all prior chat session information and relevant clinical context (Figure 2C).
Performance Across Patient Age and Gender
The MSD Manual vignettes specify both the age and gender of patients. We performed a multivariable linear regression analysis to investigate the effect of patient age and gender on ChatGPT accuracy. Regression coefficients for age and gender were both not significant (age: P=.35; gender: P=.59; Table 1). This result suggests that ChatGPT performance is equivalent across the range of ages in this study as well as in a binary definition of gender.
ChatGPT Performance Across Question Types
Diff and mang question types were negatively associated with ChatGPT performance relative to the misc question type (β=–15.8%; P<.001; and β=–7.4%; P=.02, respectively). Diag questions trended toward decreased performance (P=.06); however, the effect was not statistically significant. There was no significant difference in final diagnosis (dx) accuracy relative to the misc reference. The R² value of the model was 0.083, indicating that only 8.3% of the variance in ChatGPT accuracy was explained by the model. This suggests that other factors, such as inherent model stochasticity, may play a role in explaining variation in ChatGPT performance.
ChatGPT Performance Does Not Vary With the Acuity of Clinical Presentation
Case acuity was assessed by asking ChatGPT to provide the ESI for each vignette based only on the HPI. These ratings were validated for accuracy by human scorers. ESI was included as an independent variable in the multivariable linear regression shown in Table 1, but it was not a significant predictor of ChatGPT accuracy (P=.55).
ChatGPT Performance Is Ambiguous With Respect to the Dosing of Medications
A small subset of mang and misc questions demanded that ChatGPT provide numerical answers, such as dosing for particular medications. Qualitative analysis of ChatGPT’s responses indicates that errors in this subset are predisposed toward incorrect dosing rather than incorrect medication (Table S4 in Multimedia Appendix 1). This may indicate that model training data are biased toward verbal as opposed to numerical accuracy; further investigation is needed to assess ChatGPT’s utility for dosing.
Discussion
See the original publication (this is an excerpt version)
Acknowledgments
See the original publication (this is an excerpt version)
References
See the original publication (this is an excerpt version)
Abbreviations
See the original publication (this is an excerpt version)
Authors and Affiliations
Arya Rao 1, 2, 3 *, BA; Michael Pang 1, 2, 3 *, BS; John Kim 1, 2, 3, BA; Meghana Kamineni 1, 2, 3, BS; Winston Lie 1, 2, 3, BA, MSc; Anoop K Prasad 1, 2, 3, MBBS; Adam Landman 1, 4, MD, MHS, MIS, MS; Keith Dreyer 1, 5, DO, PhD; Marc D Succi 1, 2, 3, 6, MD
1 Harvard Medical School, Boston, MA, United States
2 Department of Radiology, Massachusetts General Hospital, Boston, MA, United States
3 Medically Engineered Solutions in Healthcare Incubator, Innovation in Operations Research Center (MESH IO), Massachusetts General Hospital, Boston, MA, United States
4 Department of Radiology, Brigham and Women’s Hospital, Boston, MA, United States
5 Data Science Office, Mass General Brigham, Boston, MA, United States
6 Mass General Brigham Innovation, Mass General Brigham, Boston, MA, United States
*these authors contributed equally
Corresponding Author:
Marc D Succi, MD, Department of Radiology
Article originally published at https://www.beckershospitalreview.com.
Paper originally published at https://www.jmir.org