Developing and Incorporating Large Language Models in Medical Practice – proactive engagement is a must-have!

the health strategist
research institute

Joaquim Cardoso MSc.

Chief Research and Strategy Officer (CRSO)
Chief Editor and Senior Advisor

August 14, 2023

What is the message?

The article “Creation and Adoption of Large Language Models in Medicine”, published by JAMA Network, calls for proactive engagement from the medical profession in defining the training data, self-supervision methods, and evaluation processes for LLMs in medicine.

This approach aims to harness the potential of LLMs while ensuring their relevance, efficacy, and responsible integration into medical practice.

One page summary:

  • LLMs are powerful language processing models that have applications in various domains, including healthcare. However, the article highlights the need for active involvement from the medical community to shape how LLMs are created, trained, and deployed to ensure their relevance and efficacy in medical contexts.
  • There is an increasing use of LLM-powered applications in medical tasks, even without training these models on medical records or verifying their benefits for medical use.
  • This lack of active involvement from the medical community in shaping the development and application of LLMs in healthcare risks losing agency over how these tools can be effectively integrated into medical practice.
  • The article emphasizes the significance of two key aspects in shaping the creation and use of LLMs in medicine:

Relevant Training Data and Self-Supervision: The training data used for LLMs are essential for their performance. The article distinguishes between two types of medical LLMs: those trained on medical documents and those trained on sequences of medical codes in patients’ records. While the former predict the next word in textual documents, the latter learn the probability of future medical events based on the sequence and timing of medical codes. The article suggests that the medical community should actively contribute to shaping the training data and self-supervision methods to ensure that LLMs are relevant to medical tasks.

Verification of Benefits: The article highlights the need to verify the benefits of using LLMs in medical settings through real-world evaluations. Current evaluation methods are criticized for their lack of clarity and appropriateness, and the article calls for defining the benefits and conducting rigorous evaluations that demonstrate the impact of LLMs on specific medical workflows. This verification process is essential to determine the actual value that LLMs bring to clinical practice.

  • The medical community should not merely adopt off-the-shelf LLMs but actively participate in shaping their development, training, and deployment.
  • The article emphasizes the importance of evaluating LLMs’ performance in real-world medical scenarios, akin to road tests for cars, to ensure that these models enhance human judgment rather than replacing it.
  • The goal is to achieve better medical care by leveraging the strengths of both LLMs and medical professionals.

DEEP DIVE

Creation and Adoption of Large Language Models in Medicine [excerpt]

JAMA Network

Nigam H. Shah, MBBS, PhD – David Entwistle, BS, MHSA – Michael A. Pfeffer, MD

August 7, 2023

Abstract

Importance  There is increased interest in and potential benefits from using large language models (LLMs) in medicine. However, if the medical profession simply wonders how the LLMs and the applications powered by them will reshape medicine instead of getting actively involved, it loses the agency to shape how these tools can be used in medicine.

Observations  Applications powered by LLMs are increasingly used to perform medical tasks without the underlying language model being trained on medical records and without verifying their purported benefit in performing those tasks.

Conclusions and Relevance  The creation and use of LLMs in medicine need to be actively shaped by provisioning relevant training data, specifying the desired benefits, and evaluating the benefits via testing in real-world deployments.

Introduction

Large language models (LLMs) and the applications built using them, such as ChatGPT, have become popular. Within 2 months of the November 2022 release, ChatGPT surpassed 100 million users. The medical community has been pursuing off-the-shelf LLMs provided by technology companies. New users have been asking how the LLMs and the chatbots powered by them will reshape medicine.1 Perhaps the reverse question should be asked: How can the intended medical use shape the training of the LLMs and the chatbots or the other applications they power?

Language models learn the probabilities of occurrence for sequences of words from the corpus of text. For example, if the corpus had the 2 questions of “where are we going” and “where are we at,” the probability is 0.5 for seeing the word “going” after seeing the 3 words “where are we.” An LLM is essentially learning such probabilities on a massive scale, such that the resulting model has billions of parameters (a glossary appears in the Box). In 2017, Vaswani et al2 demonstrated that a certain kind of deep neural network, called a transformer, could learn language models that later performed amazingly well at language translation tasks. Their insight led to the creation of hundreds of language models that were reviewed by Zhao et al.3
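
To make the toy example concrete, here is a minimal sketch (not from the article) that counts next-word frequencies in that two-sentence corpus; an LLM learns the same kind of conditional probabilities, only over a vastly larger corpus and with billions of parameters.

```python
# Minimal sketch: estimate next-word probabilities from a toy corpus by counting,
# the same idea a language model learns at massive scale.
from collections import Counter, defaultdict

corpus = ["where are we going", "where are we at"]

# Count how often each word follows a given 3-word context.
context_counts = defaultdict(Counter)
for sentence in corpus:
    words = sentence.split()
    for i in range(3, len(words)):
        context = tuple(words[i - 3:i])
        context_counts[context][words[i]] += 1

context = ("where", "are", "we")
total = sum(context_counts[context].values())
for word, count in context_counts[context].items():
    print(f"P({word!r} | 'where are we') = {count / total:.1f}")
# Prints 0.5 for 'going' and 0.5 for 'at', matching the example above.
```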

Although language models are trained to predict the next word in a sentence (basically an advanced autocomplete), new capabilities (such as the ability to summarize text and answer questions posed in natural language) emerge without explicit training for them, allowing the model to perform tasks such as passing medical licensing examinations, simplifying radiology reports, extracting drug names from a physician’s note, replying to patient questions, summarizing medical dialogues, and writing histories and physical assessments.4 ChatGPT, perhaps the most popular application, uses an LLM called a generative pretrained transformer (GPT; version 3.5 or 4.0) underneath to ingest text and output text in response.

The creation of language models capable of such diverse tasks hinges on 2 things. First is the ability to learn generally useful patterns in large amounts of unlabeled data via self-supervision (training and interacting with an LLM in the Figure). For example, a commonly used form of self-supervision is to predict the next word in a sequence conditioned on prior words, which over a large corpus captures which words tend to go together. The GPT-3 model was trained on 45 terabytes of text data comprising roughly 500 billion tokens (1 token is approximately 4 characters or three-fourths of a word for English text) at a cost of approximately $4.6 million.5
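
As a rough illustration of the token arithmetic cited above, the following sketch applies the stated rules of thumb (about 4 characters or three-fourths of a word per token); it is an approximation, not an actual tokenizer.

```python
# Sketch: approximate the token count of English text using the rules of thumb
# stated above (~4 characters per token, ~0.75 words per token). Not a real tokenizer.
def approx_tokens(text: str) -> int:
    by_chars = len(text) / 4
    by_words = len(text.split()) / 0.75
    return round((by_chars + by_words) / 2)  # average the two heuristics

note = "Patient presents with chest pain radiating to the left arm for two hours."
print(approx_tokens(note))  # ~18 tokens by these heuristics
```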

Second is the subsequent tuning of the LLM to generate responses aligned with human expectations via instruction tuning. For example, in response to the request, “explain the moon landing to a 6-year-old in a few sentences,” the GPT-3 model suggested possible completions as “explain the theory of gravity to a 6-year-old” and “explain the big bang theory to a 6-year-old” (instruction tuning an LLM in the Figure). Users helped train GPT-3 by providing the instructions (also called prompts) for which the labelers (hired by OpenAI, the company that built GPT-3) provided demonstrations of the desired output and ranked the outputs from the model. OpenAI used these pairs of instructions and their desired outputs to instruction tune GPT-3.6
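
In other words, instruction tuning data are pairs of prompts and human-approved completions, optionally with rankings of candidate model outputs. The sketch below shows what one such record could look like; the field names and example text are hypothetical, not OpenAI's actual format.

```python
# Illustrative instruction-tuning record (hypothetical field names): a prompt,
# a labeler-written desired output, and candidate model outputs ranked best to worst.
instruction_tuning_examples = [
    {
        "prompt": "Explain the moon landing to a 6-year-old in a few sentences.",
        "desired_output": (
            "A long time ago, astronauts flew a rocket all the way to the moon. "
            "They walked on it, picked up some rocks, and flew back home."
        ),
        "ranked_model_outputs": [  # best to worst, as judged by a human labeler
            "Astronauts rode a big rocket to the moon, walked around, and came back safely.",
            "Explain the theory of gravity to a 6-year-old.",  # the untuned completion described above
        ],
    },
]

print(instruction_tuning_examples[0]["prompt"])
```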

Although general-purpose LLMs can perform many medically relevant tasks, they have not been exposed to medical records during self-supervised training and they are not specifically instruction tuned for any medical task. By not asking how the intended medical use can shape the training of LLMs and the chatbots or other applications they power, technology companies are deciding what is right for medicine. The medical profession has made a mistake in not shaping the creation, design, and adoption of most information technology systems in health care. Given the profound disruption that is possible for such diverse activities as clinical documentation, decision support, information technology operations, medical coding, and patient-physician communication with the use of LLMs (estimated in a McKinsey report to be as high as 1.8%-3.2% of total health care revenues7), the same mistake cannot be repeated. At a minimum, the medical profession should be asking the following questions.

Glossary

See the original publication (this is an excerpt version)

Are the LLMs Being Trained With the Relevant Data and the Right Kind of Self-Supervision?

Medical records can be viewed as consisting of sequences of time-stamped clinical events represented by medical codes and textual documents, which can be the training data for a language model. Wornow et al8 reviewed the training data and the kind of self-supervision used by more than 80 medical language models and found 2 categories.
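
To make the data description above concrete, a single patient's record viewed this way might look like the following sketch; the structure, codes, and values are illustrative, not taken from the cited review.

```python
# Hypothetical sketch of one patient's record as time-stamped coded events plus
# free-text documents; either stream can serve as training data for a language model.
patient_record = {
    "patient_id": "example-001",
    "coded_events": [  # the sequence and timing of codes carry the signal
        {"date": "2021-04-02", "system": "ICD-10", "code": "I10", "label": "Essential hypertension"},
        {"date": "2022-11-19", "system": "ICD-10", "code": "I63.9", "label": "Cerebral infarction, unspecified"},
    ],
    "documents": [  # textual notes, the training data for document-based medical LLMs
        {"date": "2021-04-02", "type": "progress_note",
         "text": "BP 152/94. Started lisinopril 10 mg daily. Follow up in 4 weeks."},
    ],
}

print(len(patient_record["coded_events"]), "coded events,", len(patient_record["documents"]), "document")
```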

First, there are medical LLMs that are trained on documents. The self-supervision is via learning to predict the next word in a textual document, such as a progress note or a PubMed abstract, conditioned on prior words seen. Therefore, these models are similar in their anatomy to general-purpose LLMs (eg, GPT-3), but are trained on clinical or biomedical text. These models can be used for language manipulation tasks such as summarization, translation, and answering questions. Given the increased training and use costs of LLMs, it is necessary to investigate whether smaller language models trained on relevant data may achieve the desired performance at a lower cost. For example, researchers at the Center for Research on Foundation Models at Stanford University created a model called Alpaca, with 4% as many parameters as OpenAI’s text-davinci-003, that matched its performance and cost approximately $600 to create.9

Second, there are medical LLMs that are trained on the sequences of medical codes in a patient’s entire record, taking time into account. Here, the self-supervision is in the form of learning the probability of the next day’s set of codes, or learning how much time elapses until a certain code is seen. As a result, the sequence and timing of medical events in a patient’s entire record are considered. As a concrete example, given the code for “hypertension,” these models learn when a code for a stroke, myocardial infarction, or kidney failure is likely to occur. When provided with a patient’s medical record as input, such models do not output text but instead a machine-understandable “representation” of that patient, referred to as an “embedding,” which is a fixed-length, high-dimensional vector representing the patient’s medical record. Such embeddings can be used to build models for predicting 30-day readmissions, long hospital lengths of stay, and inpatient mortality using less training data (as few as 100 examples).10
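
A minimal sketch of that downstream use: if a pretrained code-sequence model maps each record to a fixed-length embedding, a simple classifier can be fit on roughly 100 labeled examples. Here `embed_patient` is a hypothetical stand-in for such a model, and the labels are placeholders rather than real outcomes.

```python
# Sketch: train a 30-day readmission classifier on patient embeddings produced by a
# pretrained code-sequence model. embed_patient is a hypothetical stand-in, and the
# labels below are placeholders, not real outcomes.
import numpy as np
from sklearn.linear_model import LogisticRegression

def embed_patient(record: str) -> np.ndarray:
    """Placeholder for a pretrained model that returns a fixed-length vector."""
    rng = np.random.default_rng(abs(hash(record)) % (2**32))
    return rng.normal(size=768)  # e.g., a 768-dimensional embedding

records = [f"record-{i}" for i in range(100)]   # as few as ~100 labeled examples
labels = np.array([i % 2 for i in range(100)])  # placeholder: 1 = readmitted within 30 days

X = np.stack([embed_patient(r) for r in records])
clf = LogisticRegression(max_iter=1000).fit(X, labels)
print(clf.predict_proba(X[:1]))                 # predicted readmission risk for one patient
```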

The medical community needs to actively shape the creation of LLMs in medicine. For example, given the importance of instruction tuning, the medical community should be discussing how to create shared instruction tuning datasets with examples of prompts to be fulfilled, such as “summarize the past specialist visits of a patient” with its corresponding valid completion (Figure). Perhaps instead of using GPT-4 at the cost of $0.06 to $0.12 per 1000 tokens (about 750 words), health care systems should be training shared, open-source models using their own data. The technology companies should be asked whether the models being offered have seen any medical data during training and whether the nature of self-supervision used is relevant for the final use of the model.
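
To put that per-token pricing in perspective, a back-of-the-envelope sketch follows; the note volume and average note length are hypothetical assumptions, not figures from the article.

```python
# Back-of-the-envelope API cost using the per-1000-token prices cited above.
# The note volume and average note length are hypothetical assumptions.
prices_per_1k_tokens = (0.06, 0.12)   # USD, low and high ends cited above
words_per_token = 0.75                # rule of thumb stated earlier

notes_per_year = 10_000_000           # hypothetical volume for a large health system
words_per_note = 500                  # hypothetical average note length

total_tokens = notes_per_year * (words_per_note / words_per_token)
for price in prices_per_1k_tokens:
    print(f"${total_tokens / 1000 * price:,.0f} per year at ${price}/1k tokens")
# Roughly $400,000 to $800,000 per year just to pass each note through the model once,
# which is the kind of arithmetic behind training shared, open-source models instead.
```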

Are the Purported Value Propositions of Using LLMs in Medicine Being Verified?

Current evaluations of LLMs do not quantify the benefits of the novel collaboration between humans and artificial intelligence that is at the core of using these models in clinical settings. The methods for evaluating LLMs in the real world remain unclear. Concerns with current evaluations range from training dataset contamination (such as when the evaluation data are included in the training dataset) to the inappropriateness of using standardized examinations designed for humans to evaluate the models. Consider the analogy of evaluating a person for a driver’s license. The person takes a multiple-choice, knowledge-based test. The car, meanwhile, undergoes safety tests during manufacturing, some of which are regulated by the government. Then the person gets in the car for a road test to certify them for a license. The car does not take a multiple-choice test at the department of motor vehicles or get certified for driving, but that is the absurdity tolerated for LLMs when it is declared that they are certified to give medical advice because they passed the US medical licensing examination.

The purported benefits need to be defined and evaluations conducted to verify such benefits.8 Only after these evaluations are completed should it be possible to state that an LLM was used for a defined task in a specific workflow, that a prespecified metric was measured, and that an improvement (or deterioration) in the outcome was observed. Such evaluations also are necessary to clarify the medicolegal risks that might occur with the use of LLMs to guide medical care,11 and to identify mitigation strategies for the models’ tendency to generate factually incorrect outputs that are probabilistically plausible (called hallucinations).
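
A sketch of what such a prespecified, workflow-level claim could look like in structured form; the task, workflow, metric, and outcome values are illustrative placeholders, not results from any study.

```python
# Illustrative structure for a prespecified evaluation claim about an LLM deployment.
# All values are hypothetical placeholders, not reported results.
evaluation_claim = {
    "task": "draft replies to patient portal messages",
    "workflow": "primary care inbox triage",
    "metric": "clinician minutes per reply",
    "baseline": 3.2,        # measured before deployment (hypothetical)
    "with_llm": 2.4,        # measured during the real-world trial (hypothetical)
    "prespecified_outcome": "reduced reply time with no increase in safety events",
}

print(evaluation_claim["metric"], "->", evaluation_claim["with_llm"])
```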

Conclusion

The building of relevant medical LLMs needs to be balanced with verifying the presumed value propositions via testing in real-world deployments akin to road driving tests. If the goal in using such models is to augment human judgment, and not replace it, adopting this driving test mindset is critically important. Otherwise, there is a risk of falling into the trap of automating tasks that individuals already know how to do, and of failing to ask what a person plus such models could do together to yield better medical care.12

Given the highly disruptive potential of these technologies, clinicians cannot afford to be on the sidelines. The adoption of LLMs in medicine needs to be shaped by the medical profession that can identify the right training (and instruction tuning) data and perform the evaluations that verify the purported benefits of using LLMs in medicine.

References

See the original publication (this is an excerpt version)

Article information

See the original publication (this is an excerpt version)

Authors and Affiliations

Nigam H. Shah, MBBS, PhD1,2,3; David Entwistle, BS, MHSA1; Michael A. Pfeffer, MD1,2

  • 1Stanford Health Care, Palo Alto, California
  • 2Department of Medicine, School of Medicine, Stanford University, Stanford, California
  • 3Clinical Excellence Research Center, School of Medicine, Stanford University, Stanford, California
