The Health Strategist
institute for continuous health transformation and digital health
Joaquim Cardoso MSc
Chief Research and Strategy Officer (CRSO)
May 10, 2023
KEY MESSAGES:
- Large language models like ChatGPT-4 have potential applications in medicine, but they should be regarded as tools to support and augment clinical decision-making rather than replace human expertise.
- A balanced approach that combines AI capabilities with human judgment and clinical experience will lead to the effective integration of these technologies in healthcare, prioritizing patient well-being.
ONE PAGE SUMMARY
The article compares the diagnostic performance of large language models, specifically ChatGPT-4, with that of human physicians.
- It highlights the limitations and potential risks associated with relying solely on AI models for medical decision-making.
The paper begins by discussing previous studies that compared AI algorithms to healthcare professionals in diagnosing diseases, particularly using medical imaging.
- These studies found that deep learning models showed equivalent diagnostic performance to clinicians.
- However, the lack of externally validated results and direct comparisons between models and healthcare professionals raised concerns about bias and methodological standards.
The focus then shifts to the performance of ChatGPT-4, a large language model, when presented with a diagnostic case.
- The case involved a patient with chest pain, a history of hypertension and stroke, and an elevated troponin I level.
- ChatGPT-4 identified the symptoms as a potential myocardial infarction (heart attack) but lacked the reasoning skills and clinical judgment of an experienced cardiologist.
The article references a case report describing the same scenario, in which human physicians weighed additional factors, treated the radiating leg pain as a clue, and correctly diagnosed an aortic dissection.
The key takeaway from the article is the distinction between large language models and human physicians.
- While AI models like ChatGPT-4 can provide preliminary insights and warnings about seeking professional medical advice, they cannot substitute for the expertise, reasoning abilities, and critical thinking skills possessed by experienced clinicians.
- The limitations of AI models, which lack true understanding and reasoning capabilities, underscore the potential dangers of relying solely on their conclusions without appropriate medical evaluation.
The article suggests that the integration of AI models in medicine should follow an approach called retrieval-augmented generation (RAG), which combines trusted clinical information from reputable sources with AI capabilities.
- By incorporating knowledge from reliable medical databases, AI models can reduce the risk of misinformation and hallucinations, but it is crucial to establish proper guardrails and ethical guidelines to ensure patient safety.
DEEP DIVE
When Physicians Go Head-to-Head with Large Language Models
What happens when ChatGPT-4 and a human cardiologist are asked to diagnose the same patient? The results are quite revealing.
Mayo Clinic Platform
By Paul Cerrato, senior research analyst and communications specialist, Mayo Clinic Platform, and
John Halamka, M.D., president, Mayo Clinic Platform
May 10, 2023
For decades, innovators have been looking for ways to use computer technology to improve patient care and ease clinicians’ workloads. Some have even suggested that AI-based algorithms are as effective as physicians in diagnosing specific diseases. In 2016, Varun Gulshan at Google and his associates from several medical schools tested that theory, using a convolutional neural network (CNN) to analyze retinal images. Looking for evidence of diabetic retinopathy, they found the CNN’s diagnostic skills were on par with those of several board-certified ophthalmologists. Several additional studies have followed to determine if machine learning tools could augment or replace clinicians.
For example, British and German investigators analyzed 82 studies that compared deep learning algorithms to clinicians to determine their respective abilities to detect diseases in medical images. They “found the diagnostic performance of deep learning models to be equivalent to that of health care professionals. However, a major finding of the review is that few studies presented externally validated results or compared the performance of deep learning models and health care professionals using the same sample.”
A second analysis from British and American researchers also compared AI to clinicians, concluding: “Few prospective deep learning studies and randomized trials exist in medical imaging. Most nonrandomized trials are not prospective, are at high risk of bias, and deviate from existing reporting standards. Data and code availability are lacking in most studies, and human comparator groups are often small.”
Now that ChatGPT and other large language models have surfaced, many thought leaders are also wondering how they might perform when challenged with diagnostic puzzles.
With that in mind, we sent a prompt to ChatGPT-4 that read as follows:
“Mr. Jones, 59 years old, with a history of hypertension, stroke, and elevated lipid levels, arrives in the ED complaining of sudden-onset intense substernal chest pain that radiates to his left leg but does not affect his left arm or jaw. He also has an elevated troponin I level. What is the correct diagnosis?”
ChatGPT replied: “Based on the symptoms and medical history provided, Mr. Jones is presenting with a potential myocardial infarction (MI), also known as a heart attack. The sudden onset of intense substernal chest pain that radiates to his left leg is a common symptom of a heart attack. The fact that the pain does not affect his left arm or jaw is not unusual, as some heart attacks can present with atypical symptoms.
“The elevated troponin I level is also a significant finding, as it indicates damage to the heart muscle. Troponin is a protein released into the bloodstream when heart muscle cells are damaged or die, which is a hallmark of a heart attack. It is important to note that a definitive diagnosis can only be made by a healthcare professional after conducting a thorough medical evaluation, including a physical examination, ECG, and other tests as necessary. Prompt medical attention is essential in suspected cases of heart attack, as early intervention can improve outcomes and reduce the risk of complications.”
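For readers who want to try a similar experiment, the exchange above can be reproduced programmatically. The sketch below is a minimal, hypothetical example using the pre-1.0 OpenAI Python client; the model identifier, temperature setting, and environment variable are our assumptions, not details from the original article.

```python
# Minimal sketch: sending the diagnostic prompt to a GPT-4 model.
# Assumptions: the pre-1.0 openai Python client and an API key stored in
# the OPENAI_API_KEY environment variable; details may differ by version.
import os
import openai

openai.api_key = os.environ["OPENAI_API_KEY"]

PROMPT = (
    "Mr. Jones, 59 years old, with a history of hypertension, stroke, and "
    "elevated lipid levels, arrives in the ED complaining of sudden-onset "
    "intense substernal chest pain that radiates to his left leg but does "
    "not affect his left arm or jaw. He also has an elevated troponin I "
    "level. What is the correct diagnosis?"
)

response = openai.ChatCompletion.create(
    model="gpt-4",                      # assumed model identifier
    messages=[{"role": "user", "content": PROMPT}],
    temperature=0,                      # keep the output as repeatable as possible
)

print(response["choices"][0]["message"]["content"])
```

Running a prompt like this returns free-text output only; nothing in the response indicates how confident the model is or what evidence it weighed, which is part of the problem the case below illustrates.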
Several years ago, a case report published in the Journal of General Internal Medicine described an identical patient scenario in which J. William Schleifer, M.D. and his colleagues at the University of Alabama at Birmingham explained their diagnostic reasoning and the conclusion they reached.
We provided a shortened version of that case in one of our recent books, Reinventing Clinical Decision Support: Data Analysis, Artificial Intelligence, and Diagnostic Reasoning.
Based on a methodical review of all the patient data, Dr. Schleifer et al. questioned the significance of the patient’s radiating left leg pain.
One of the hallmarks of a genuine expert diagnostician is a more fully developed set of disease scripts and the ability to spot inconsistencies that don’t fit those scripts.
The leg pain was one of those clues that might warrant a walk down a different diagnostic path.
They also used a reasoning technique sometimes referred to as pre-mortem examination. Essentially, they asked themselves: What would happen once a specific diagnosis is made and acted upon? What are the consequences, good and bad? In the case of Mr. Jones, if he is treated with the anticoagulants usually indicated for a typical MI, and he actually had another condition such as an aortic dissection, the consequences could prove disastrous.
The pre-mortem analysis and the fact that the patient had radiating left leg pain were enough to postpone treating the alleged MI until additional data was collected.
Once the patient was admitted to the medical floor, the appearance of a systolic murmur plus chest pain strongly suggested aortic dissection, a tear in the wall of the aorta; the diagnosis was finally confirmed with a CT angiogram.
The imaging study also documented that the dissection extended all the way down Mr. Jones’ thoracic descending aorta, which explained the mysterious leg pain.
Their correct diagnosis raises the question: Why didn’t ChatGPT reach the same conclusion?
The scenario dramatically illustrates the difference between LLMs trained on the general content of the internet and the “database” residing within the brain of a veteran cardiologist with decades of clinical experience and expert reasoning skills.
It also highlights that LLMs don’t actually reason, at least not the way humans do; the critical thinking skills clinicians bring to bear are something computers have yet to master.
As Chirag Shah with the University of Washington explains, “Language models are not knowledgeable beyond their ability to capture patterns of strings or words and spit them out in a probabilistic manner… It gives the false impression of intelligence.”
Admittedly, the chatbot did warn users that a definitive diagnosis was not possible and should only be made by a health care professional “after conducting a thorough medical evaluation, including a physical examination, ECG, and other tests as necessary.”
But realistically, many patients, and perhaps a few clinicians, would rely heavily on its conclusion, with potentially life-threatening consequences.
That is not to suggest that LLMs have no place in medicine.
One approach that has merit is retrieval-augmented generation (RAG), which can control model knowledge and reduce hallucinations.
The idea is simple. Clinical information from Mayo Clinic, Duke, Intermountain, and Massachusetts General Brigham is trustworthy.
An LLM based on content from these sources would be far less likely to mislead users with misinformation or hallucinations, provided it has appropriate guardrails inserted by its developers.
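To make the idea concrete, the sketch below shows one way a retrieval-augmented pipeline can be wired together: a small store of vetted clinical passages is searched first, and only the retrieved text is handed to the model as context. The passages, the keyword-overlap scoring, and the prompt wording are illustrative stand-ins for a real embedding-based search over an institution's curated knowledge base, not any organization's actual system.

```python
# Minimal RAG sketch: retrieve vetted clinical text, then ground the LLM's
# answer in it. The keyword-overlap scoring is a stand-in for a real
# embedding-based search, and the passages are illustrative placeholders.
from typing import List, Tuple

TRUSTED_PASSAGES = [
    "Aortic dissection can present with tearing chest pain that radiates, "
    "pulse deficits, and a new murmur; CT angiography confirms the diagnosis.",
    "Acute myocardial infarction typically presents with substernal chest "
    "pain, ECG changes, and an elevated troponin level.",
    "Anticoagulation indicated for myocardial infarction can be catastrophic "
    "if the underlying problem is an aortic dissection.",
]

def score(query: str, passage: str) -> float:
    """Crude relevance score: fraction of query words found in the passage."""
    q_words = set(query.lower().split())
    p_words = set(passage.lower().split())
    return len(q_words & p_words) / max(len(q_words), 1)

def retrieve(query: str, k: int = 2) -> List[str]:
    """Return the k passages most relevant to the query."""
    ranked: List[Tuple[float, str]] = sorted(
        ((score(query, p), p) for p in TRUSTED_PASSAGES), reverse=True
    )
    return [p for _, p in ranked[:k]]

def build_prompt(question: str) -> str:
    """Prepend retrieved, trusted context so the model answers from it."""
    context = "\n".join(f"- {p}" for p in retrieve(question))
    return (
        "Answer using ONLY the context below; say so if it is insufficient.\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )

if __name__ == "__main__":
    q = "Chest pain radiating to the leg with an elevated troponin: diagnosis?"
    print(build_prompt(q))  # this grounded prompt would then be sent to the LLM
```

Constraining the model to retrieved, curated content is what reduces the risk of hallucination; the guardrails mentioned above would sit on top of this, for example by refusing to answer when nothing relevant is retrieved.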
Like most new technologies, LLMs are neither a godsend nor a pending apocalypse.
With the right blend of business skills, ethical principles, and the sincere desire to put patients’ needs first, they’ll eventually become a valuable part of medicine’s digital toolbox.
Originally published at https://www.mayoclinicplatform.org on May 10, 2023.
Names mentioned
- Varun Gulshan: Mentioned in a reference to a study conducted at Google and several medical schools in 2016.
- J. William Schleifer, M.D.: Mentioned in a case report published in the Journal of General Internal Medicine. He and his colleagues were from the University of Alabama at Birmingham.
- Chirag Shah: Mentioned as a representative of the University of Washington, explaining the limitations of language models.