AI Comes to Medicine: This Time, It’s Serious


Health Strategy Institute

multidisciplinary institute for in-person health and digital health strategy


Joaquim Cardoso MSc

Chief Research and Editor of the Health Strategy Portal;
Chief Strategy Officer (CSO) of the Health Strategy Institute (HSI);
Senior Advisor for Boards and C-Level


June 27, 2023


This is a summary of an interview by Eric J. Topol, MD, with Dr Isaac Kohane from Harvard, who is a pediatric endocrinologist and leads the bioinformatics department there. He is a pioneer in this field.


Key Takeaways:


Dr. Isaac Kohane discusses the progression of AI in medicine and its impact.


  • In 2012, convolutional neural networks (CNNs) showed potential in medical applications, such as detecting retinopathy and predicting patient characteristics.

  • In 2017, the transformer architecture, specifically the attention mechanism, revolutionized deep learning models.

The transformer models, like GPT-4, are not purpose-built but can be fine-tuned for specific tasks.


  • GPT-4 is a multimodal model, capable of processing images, speech, and unstructured text, which is crucial for medical applications.

  • GPT-4’s architecture, alignment process, and large-scale training contribute to its efficiency and improved performance.

  • The model’s ability to generate contextually relevant responses and its alignment with human feedback have advanced its capabilities.

GPT-4’s breakthrough is attributed to its size, alignment, and continuous improvement in performance metrics.


  • The multimodal approach is expected to enhance chatbot capabilities, but GPT-4’s current strength lies in its size and alignment.

  • “It’s obvious to me that the administrative part of this is going to happen almost immediately. And 30% of the medical economy in the United States is administrative overhead, so there’s billions of dollars at stake.”

Superhuman performance in medicine and other domains may be achievable by combining multimodal approaches and leveraging large language models like GPT-4.


  • GPT-4’s performance challenges our understanding of intelligence, as it demonstrates knowledge and diagnostic abilities while also making mistakes.

The debate over whether GPT-4 represents general artificial intelligence (AGI) highlights the complexities and limitations of our understanding of intelligence in machines.


Because medicine is still very much a labor-intensive process, we’re not going to be able to scale it up the way we want to. 


  • So we’re going to have to use more and more paraprofessionals and have these paraprofessionals augmented with a large language model to look over their shoulders and catch errors — to remind them, “Have you thought about this?” — and to summarize the patient. 

  • As you well know, because you’ve written about this as well, primary care doctors don’t know much about genomics.

  • That primary care doctor or PA can have a large language model of genomics to say, “This person has a family history of this kind of cancer. Maybe you should do the test.” 

  • How often does that happen these days? Not often enough.

Now, with these models looking over the shoulder, I think the front door is there.








DEEP DIVE






AI Comes to Medicine: This Time, It’s Serious


Medscape
Eric J. Topol, MD; Isaac S. Kohane, MD, PhD
June 27, 2023


This transcript has been edited for clarity.


Eric J. Topol, MD: Hello. This is Eric Topol with Medscape’s Medicine and the Machine podcast. Today our special guest is Dr Isaac Kohane from Harvard, who is a pediatric endocrinologist and leads the bioinformatics department there. He’s a pioneer in this field and one of the people I look up to. Welcome, Zak.


Isaac S. Kohane, MD, PhD: Thank you, Eric. It’s great to be on your podcast.


Topol: Recently I had an interview with Peter Lee, and we talked about the book you and he wrote with Carey Goldberg: The AI Revolution in Medicine: GPT-4 and Beyond. So you’ve had several months to test-drive this.


Before we get into GPT-4, I want to retrace the history with you because I think some of our audience may not understand this progression. Let’s go back to the old world of convolutional neural networks: AlexNet, 2012, approximately a decade ago. That’s where we were in medicine and healthcare. That changed with the transformer architecture in 2017. Describe what we had with deep learning models and what changed when a new model of transformers came along.


Kohane: For the sake of the audience and for my own sanity, I should say — because I also got a PhD in computer science in the 1980s working on artificial intelligence — that we tried to make an impact in AI 30 years ago. For a variety of reasons, we failed: because we didn’t have the right data on patients, because we didn’t have the right data on medicine, and because the neural network models were super-simple and we didn’t have the compute.


Things began to change in 2012. We were already pretty impressed then, but around 2018, we started seeing, in the medical literature, the consequences of these convolutional neural networks — that they could actually detect changes in images that were perhaps imperceptible to humans, which allowed them not only to diagnose retinopathy in the back of the eye, for example, but also to tell you the sex of the patient, their age, and what other diseases they had.


So, there were a bunch of these mostly image-based applications, but also some applications related to time series and health records, that had impressive performance. What was different, what was characteristic of them, was that they were purpose-built. You train them for a specific purpose: diagnosing retinopathy, predicting time to readmission.


These were programs that not only were purpose-built but, because of that, you could easily evaluate them and assess their accuracy for a specific task. Then, around 2017, as you point out, a paper came out of Google about this transformer architecture, which allowed us to create these models that you could fine-tune for a pretty generic task such as, “tell me the next word I’m going to say or most likely will say after this string of words.”


What was interesting about these models is that they were not built for a given purpose. They were, essentially, built on the joint probabilities across a large corpus — in this case, words — across everything that human beings have said on the internet. Similar kinds of things were done for amino acid sequencing, a very different kind of language.


In both cases, the transformer model allowed us to take these generic models, sometimes called foundation models; they were not built for a specific purpose. You could then train them to do certain tasks that are relatively generic — again, like predicting the next word that would be said or predicting the effect of a mutation on a change in an amino acid. Frankly, we all thought it was interesting.
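To make that “predict the next word” task concrete, here is a minimal sketch of next-token prediction from corpus statistics: a toy bigram counter in Python. This is an illustration of the generic task Kohane describes, not how GPT models are actually built, and the corpus and names are invented.

```python
from collections import Counter, defaultdict

# Toy corpus standing in for "everything human beings have said on the internet".
corpus = "the patient has diabetes . the patient has retinopathy .".split()

# Count how often each token follows each other token (bigram statistics).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

def predict_next(token: str):
    """Return the most likely next token and its estimated probability."""
    counts = following[token]
    word, n = counts.most_common(1)[0]
    return word, n / sum(counts.values())

print(predict_next("patient"))  # ('has', 1.0) in this tiny corpus
```

A transformer plays the same game, but instead of raw counts it learns the conditional distribution over a context of thousands of tokens.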


If you followed the literature around 2018, 2019, 2020, as these models got bigger, we were increasingly impressed. But if you’d asked most of us — certainly if you’d asked me — when these would start becoming useful in medicine, I would have said in about 10 years at the earliest. We were just not data driven.


If we had really been looking at, for example, OpenAI, the original GPT, and then GPT-2 and GPT-3, had we been following the increase in performance, we would have concluded that GPT-4 was going to be impressive. But we could not imagine it.


It’s still hard to understand how something that has not been trained in medicine — for example, something that only knows the next word I’m going to say and the next word after that — that somehow I can have a conversation with it about a case of ambiguous genitalia. It still is hard to understand how that can be. So even though we could have followed that growth from GPT-3 to -4 and could say that this is going to have human-like performance on many tasks, we could not imagine it. Most of us were taken by surprise.


Topol: Before we get to what took the world by storm, ChatGPT — with well over a billion unique users in 90 days, which is incredible — I want to go back to the transformer model, because this attention issue is what’s central, and as you said, you couldn’t have predicted it.


But would you say that with the previous deep learning models, the deep neural networks, the inputs were just not organized? They didn’t have these tokens; they didn’t have a way to make this prediction and appear to be so conversational. What is it about this model’s architecture that made it so uniquely different? It was kind of a sleeping giant waiting to awaken as OpenAI and others built on it.


Kohane: I think it had to do with understanding more about the context of any given word, or two words, or three words; context that could become longer and longer, depending on how much computing power you had. Having that kind of attention, identifying where a word co-occurs in a sentence rather than merely that it co-occurs, made the difference between the transformer model and the other models.


Previous models would have eventually gotten there, but it probably would have taken two or three or four orders of magnitude more growth — all from that simple insight, the attention mechanism that says where these strings of tokens are. We had tokens before. We had recurrent neural networks, which allowed you to predict, to a certain extent, what the next step would be in the emergency room and treatment.


They kind of worked, but they worked like GPT-1, and they were inefficient. So this notion of the context that you get from attention allowed it to be very efficient. You still have to build these models at enormous scale: we don’t know how big GPT-4 is, but it’s at least a trillion-parameter model.


If we’d used the other technologies, we would probably need something 10 or 100 times bigger to work. So that gave us efficiency. Everything has always been improving in machine learning with scale, but this gave us an additional acceleration of several generations.


So yes, the attention model did matter a lot. And of course, the irony is not lost that this was developed at Google; but Microsoft, through its investment in OpenAI, seems, at least temporarily, to be the major beneficiary.
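To make the attention idea concrete, here is a minimal numpy sketch of scaled dot-product self-attention, the core operation of the 2017 transformer paper. It is a simplified illustration (single head, no learned projections, random stand-in embeddings), not production code.

```python
import numpy as np

def self_attention(Q, K, V):
    """Score each position's query against every position's key, so a token's
    representation depends on WHERE other tokens occur, not just that they occur."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])            # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # softmax over positions
    return weights @ V                                 # context-weighted mix of values

rng = np.random.default_rng(0)
tokens = rng.normal(size=(3, 4))   # 3 token embeddings of dimension 4
out = self_attention(tokens, tokens, tokens)
print(out.shape)  # (3, 4): each position now carries context from the others
```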


Topol: It’s a fascinating story, when you look back — not wanting to challenge Google Search as it existed, and now, the resurrection of Bing to a whole new level. So then, we have a nine-orders-of-magnitude increase in floating-point computing power.


We put in the entire internet — Wikipedia, a gazillion books, every input you can imagine — and fed it into what eventually became GPT-4, which was released in mid-March this year, only months after ChatGPT. And now we have an even more powerful large language model. The difference here, if you can comment, is that it’s not just a large language model; it’s multimodal. It works with images. It works with speech. And this is important as it relates to medicine, where we don’t just use images. We can now look at unstructured text, along with ambient voice or speech, along with video.


GPT-4 is a big jump from ChatGPT and other large language models. There are many of them out there, of course. It’s got more parameters and more of everything. Can you develop the multimodal story for us?


Kohane: I’m going to agree and then disagree. I agree that the strength of these multimodal approaches is going to accelerate what chatbots like GPT-4 can do. But I don’t think that’s where its strength comes from. What’s fascinating is that Sam Altman has hinted at this, but the contents, the details, are secret.


I think it goes something like this: There’s a bunch of details about the way it was implemented at that scale that have not been shared publicly. Some of it is engineering. Some of it is babysitting.


I’ve heard through the rumor mill, for example, that these models are not guaranteed to converge. In other words, they’re not guaranteed to have a stable solution. They ran it a large number of times — I don’t know how many times — before it actually converged.


There was no guarantee ahead of time, which is pretty scary, because each one of these models probably cost a few million dollars just in electricity to run. So that’s one part of it.


The other part of it is the alignment process. That’s incredibly important. GPT-4 and GPT-3.5, which powered the original ChatGPT, can generate exactly what they were built to generate — syntactically correct sentences that represent the probabilities correctly — and yet produce phrases that are completely uninteresting to humans.


That problem went away with a surprisingly small number of alignment procedures. The alignment procedure we’ve heard about most is what they call reinforcement learning from human feedback (RLHF), where, basically, you start by giving lots of examples of the right prompt and the right response, and then you start grading its responses to prompts. Apparently, this was done relatively cheaply, with people in Africa hired to create these cases and then to create the responses. We don’t know the numbers, but it was not billions. It was probably tens of thousands, or maybe hundreds of thousands, of examples. So, it’s tiny compared with the size of what’s called the pre-trained model.
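A core ingredient of RLHF is a reward model trained on those human gradings. Below is a minimal PyTorch sketch of the pairwise preference loss commonly used for that step (a Bradley-Terry-style objective). It is a generic illustration with invented numbers, not OpenAI’s actual training code.

```python
import torch
import torch.nn.functional as F

def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor):
    """Push the reward model's score for the human-preferred response above
    the score for the rejected one; the policy is later tuned against this model."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical scalar rewards for a small batch of graded response pairs.
r_chosen = torch.tensor([1.2, 0.3, 2.0])
r_rejected = torch.tensor([0.7, 0.9, -0.5])
print(preference_loss(r_chosen, r_rejected))  # shrinks as chosen outscores rejected
```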


That has made a huge difference. I would say that the breakthrough we seem to see with GPT-4 is probably due only to the size of the model, plus the alignment. There’s an important paper that came out on arXiv, “Are Emergent Abilities of Large Language Models a Mirage?” Everybody was saying that emergent intelligence had appeared in GPT-4, but the authors say, no, that’s not the case. The metrics we have been using — the measurements of GPT performance that we’ve had to date — have basically been all-or-nothing: either it’s right or it’s wrong. If you come close, if you get 3 out of 4 right, you don’t get partial credit.


What happened is that these models kept getting better and better. They didn’t quite get to where they needed to get, but if you watch across the models, as the paper describes, it’s actually a steady increase. It just reached a level where it got things right often enough that the metrics said, hey, it’s winning. And it also reached a level where it was right often enough that human beings say, hey, that’s as smart as I am, maybe.
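That argument is easy to reproduce in a toy simulation: if per-token accuracy improves smoothly with scale, a partial-credit metric rises steadily, while an all-or-nothing exact-match metric sits near zero and then appears to jump. The numbers below are invented purely for illustration.

```python
import numpy as np

scales = np.linspace(0.1, 1.0, 10)   # stand-in for increasing model size
p_token = 0.5 + 0.5 * scales         # per-token accuracy improves smoothly

for s, p in zip(scales, p_token):
    partial = p                      # metric that gives partial credit
    exact = p ** 10                  # all 10 tokens of an answer must be right
    print(f"scale={s:.2f}  partial-credit={partial:.2f}  exact-match={exact:.3f}")
# Partial credit climbs steadily; exact match hugs zero, then seems to "emerge".
```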


Eric, I do believe that the multimodal approach is going to be important. Frankly, I don’t think that’s where the strength of GPT-4 is right now. I do think we will see truly awe-inspiring wizardry when different modalities come together. AlphaFold is in a different domain, around protein structure and its dependence on amino acid sequence, with a little bit of electrodynamics and a little bit of co-evolutionary constraints.


But if you imagine those two things coming together, you’ll have what we’ve been looking for for so long, which is the ability to leap from molecular structure all the way to clinical impact. That’s going to be a superhuman performance. It’s going to be hard to do, but one thing that seems to be true for the past 5 years is that betting against performance increasing in this domain seems foolish.


Topol: I think you’re getting at my next line of questions, which is about this superhuman performance. As you recall, last year a Google employee was fired because he thought the large language model was sentient. And now, you have a Microsoft paper that talks about sparks of artificial general intelligence — the same thing the Google person was fired for.


The fundamental question here — and, of course, as it relates to medicine — is, have we reached a level of intelligence that is beyond the stochastic parrot, the statistical-context stuff we were talking about? Have we reached a point where machines have made a jump, maybe not to the point of all human cognitive tasks, like AGI (artificial general intelligence)? You’ve had all these months to test GPT-4, and it has now been out there for several months. What are your thoughts about going to a new level of machine intelligence?


Kohane: I believe this is an uncomfortable period for us because it’s showing us how uneven our understanding of intelligence is. There are some things that GPT-4 can do diagnostically. I’ve given it hard cases from the Undiagnosed Diseases Network I’m involved in, cases that were not diagnosed, and it was able to diagnose them.


As doctors, we know that there are only fragments of medicine that we know well, and GPT-4 knows a lot of medicine well. It also makes very dumb mistakes, but it knows a lot. So, what do you think of a colleague who is extremely knowledgeable about a lot of medicine but also makes mistakes, and sometimes doesn’t have common sense and sometimes makes stuff up? Is that not intelligent?


We know people like that. The fact that the same thing gives you a Torah interpretation or speculates on stock markets — I don’t know what that is, but it’s a property that we as humans pride ourselves on and hire people for, because they have some of those properties. I think asking, “Is this general intelligence?” sends us down the wrong rabbit hole, but it’s certainly the kind of behavior and performance that, historically and currently, we still value in humans. And the weird part is, it is foreseeable that this is going to get better.


Topol: That’s so critical. The range of applications in healthcare is extraordinary. There are the more mundane things, like taking a conversation from a clinic visit and turning it not just into a note that’s quite good, but also into all the downstream functions, like making the next appointment and follow-up tests, and even nudging patients, as was discussed in the clinic conversation. That one seems near-term, if you will, right?


Kohane: Yes.


Topol: But then there’s obviously the front door for physicians and nurse clinicians, whereby instead of having to talk to a doctor, patients can just get their GPT-X consult, and only when they’re suspicious that it’s not the right answer, or not the one they’re looking for, or they need more, do they see a clinician. What do you think about that as a front door for answering questions? In what we’ve seen, at least in a few reports, there’s some hallucination or confabulation. Sometimes it’s off, but for the most part, it’s giving pretty good quality. The UC San Diego group redid their study to look at accuracy, and it was quite accurate. What do you think about the front-door helper function?


Kohane: For some reason, I’m reminded of something that you wrote in one of your books. I can’t remember which book it was, but I thought it was very insightful because it was so real. It was about how some people end up on boards of hospitals because they want to be able to get the right answer to the right questions.


Topol: Right.


Kohane: Everybody wants that privilege. But because primary care is decimated in this country and it will not improve, the wish to have some authoritative answer and to be able to ask questions that your primary care doctor — assuming you have one — has not had time to answer, or did not answer, is going to drive a lot of that kind of front-door use, whether we like it or not.


Already, we’re using “Dr Google” to do searches that take us to a lot of sometimes not-particularly-reputable websites to get information, just because there’s such a need to get the right answers, because you’re not getting the discussion you need. Sometimes it’s because the doctor doesn’t have time, sometimes it’s because you don’t think of the questions until after you’ve seen the doctor.


Because medicine is still very much a labor-intensive process, we’re not going to be able to scale it up the way we want to. So we’re going to have to use more and more paraprofessionals and have these paraprofessionals augmented with a large language model to look over their shoulders and catch errors — to remind them, “Have you thought about this?” — and to summarize the patient. As you well know, because you’ve written about this as well, primary care doctors don’t know much about genomics.


That primary care doctor or PA can have a large language model of genomics to say, “This person has a family history of this kind of cancer. Maybe you should do the test.” How often does that happen these days? Not often enough.


Now, with these models looking over the shoulder, I think the front door is there. The only question is, are we going to be proactive enough as a medical establishment to make that part of the process? If not, then at least two other things could happen. One, patients will just start using it more and more without doctor supervision, which has its risks. But in the absence of better alternatives, that’s the way they’ll go. Or there’ll be new companies, medical-like companies through Amazon, for example, where they integrate this kind of support into the entire process, both for the doctor’s side and for the patient’s side. The question is, will august, reputable healthcare systems embrace it? We’ll see.
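To ground the “looking over the shoulder” idea, here is a sketch of how such a check might be wired to a large language model, using the pre-1.0 OpenAI Python client that was current at the time of this interview. The helper, prompt, and sample visit summary are illustrative assumptions, not a vetted clinical tool, and any output would need clinician review.

```python
import openai  # pre-1.0 client; assumes openai.api_key is set in the environment

def over_the_shoulder_check(visit_summary: str) -> str:
    """Illustrative sketch only: ask an LLM to flag genomics follow-ups
    (e.g., family-history-driven testing) a busy clinician might miss."""
    response = openai.ChatCompletion.create(
        model="gpt-4",
        temperature=0,  # favor consistency over creativity for clinical prompts
        messages=[
            {"role": "system",
             "content": "You assist a primary care clinician. Given a visit summary, "
                        "list family-history or genomic red flags and suggested tests. "
                        "State uncertainty; you are decision support, not the decision."},
            {"role": "user", "content": visit_summary},
        ],
    )
    return response["choices"][0]["message"]["content"]

print(over_the_shoulder_check(
    "42-year-old; mother and maternal aunt had breast cancer in their 40s; "
    "presents for routine physical."
))  # might, for example, suggest discussing hereditary cancer testing
```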


Topol: That’s a critical question, because in some domains, it’s seen as a threat or as not validated. There are legitimate concerns because of the mistakes that are made, and about the relative inability of patients, let alone some clinicians, to differentiate the great answers, conversations, and outputs from the ones that are contrived or totally erroneous.


But this gets to the fact that already today, we have patients who are going to ChatGPT, or to GPT-4 through Bing’s Creative mode in Microsoft Edge, and are asking about their relatives or themselves and getting answers. It’s not regulated. It’s out there. Does this need to be regulated?


Kohane: I don’t know. I don’t know if we know how to regulate this. We were already hitting the hairy edge with regulating those convolutional neural networks. The FDA could barely do that, because in addition to specific tests, there was the question of how reproducible a model is in a different healthcare system with different patients. Now, we’re talking about the whole of medicine, with these multicapable features. So, it’s not clear how to regulate it.


But the other perspective is, in some sense, this is probably actually more systematic than Google Search. Do we regulate Google Search for medical use? We don’t. And we know that bad things have happened because people see some kooky sites. We don’t regulate textbooks either.


It’s going to be challenging. I also want to point out that as people who live in the United States, we have a very localized view of things. I am sure that China and Europe, for example, will have two very different views on how this should be regulated — different from us and different from each other.


We already see, with privacy legislation around Google Search, for example, that Europe and China are different from us. So, we do need to have a conversation about this. I do believe that it’s good that the public can access these models just so they can see what this is about.


The worst outcome would be to have policymaking without general societal awareness of what these things can do and what their limitations are. I don’t know how else to have that legislation. You know, I’m quite impressed by the fact that, despite what I like to think, I live in a bubble. Everybody around me is talking about GPT, and at this point, it’s almost boring. But I was meeting several reporters who were visiting Harvard as part of an educational mission. They were reporters from very high-profile newspapers. They had heard of GPT but none of them had actually tried to use it.


I do think that making it available for use allows the public and others to be part of a very important conversation: How do you make it safer? And there are some fundamental questions that I think are uncomfortable for some of the companies producing these models, such as, which data were used to train this model?


Topol: Right.


Kohane: OpenAI does not share that with the public. One of the reasons given is that if we give out the recipe to make these large language models, it might allow the end of the world to happen faster. The AI that will then destroy the world could come faster, and it’s seen as a public safety issue.


But do we know that there was no electronic health record data in GPT-4? We don’t. We know it’s not the standard web crawl that has been used in other models. We just don’t know what went into it. That could affect biases and accuracy. So I would start there.


Topol: The lack of transparency of the models is an essential point. This has been a real problem in medicine that’s held back some of the progress and implementation — the serious lack of transparency. Now, you’re also touching on another critical point, which is that the large language models as they exist today were, for the most part, trained generically. We’ll get to specialized fine-tuning, but the ones we’ve been talking about so far had no medical-specific training, as far as we know. Maybe there was some, but we don’t know of it. Now we’re getting into models that are basically piggybacking on the ones that took massive amounts of pre-training on graphics processing unit (GPU) compute.


This, of course, is different, because these are oftentimes startup companies, and they’re saying, we’re putting in millions of electronic records and we’re going to do all this medical stuff. And now, we have a different look here. If these startups are going to succeed, they might have to be more transparent because they are trying to make it into our space.


There are many competitors — tens, if not hundreds — that are trying to have a place in medicine. They go for simpler things, like preauthorization. I don’t know what you think about this, Zak. You can see that we’ll have AIs from doctors talking to AIs from insurance companies, but no humans; they’ll just talk to each other. Are each of us going to have our own AIs that are modeled for our needs, whether our medical professional needs or our personal needs? Where do you see all this going as we get into these specialized, fine-tuned models for our needs?


Kohane: I do think we’re heading toward a society of multiple such models. That’s because even within a single society such as ours, our values are not identical. You could imagine an AI large language model that was started the same as others, but then it was tuned more with some alignment around values that relate to less intervention, more alternative medicine approaches. And who’s to say that that’s wrong for those people?


We can’t prescribe a single type. So I do think you’ll see some — in fact, a lot of — fragmentation, officially, of different products that are derived not only from the basic data but from each other. Increasingly, we’re going to be dealing with this weird information pollution. Right now, most of the content on the web is human. Pretty soon, most of it will be AI-generated.


We’ve already seen a great example of this. There is something called Amazon Mechanical Turk, where if you want a lot of tags or labels on some data, you can hire, literally, thousands of workers around the world who will label something or do some other small task, paid cents per transaction.


The market worked, and a bunch of people started putting large language models, instead of humans, onto those labeling tasks. So now, we have a bunch of machine learning models that are getting their labels not from humans, but from other machine learning models.


Is that going to make it better, or is it just some weird black hole where you have a bunch of AIs talking to each other, similar to what you just mentioned — the billing AIs talking to the reimbursement AIs?
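As a toy illustration of that feedback loop, the sketch below (all names, tasks, and labels invented) shows how an imperfect model used as an annotator bakes its own errors into the training data for the next model.

```python
def model_a(text: str) -> str:
    """Imperfect stand-in annotator: it only flags 'bleeding' as urgent,
    so anything else, including chest pain, is waved through."""
    return "urgent" if "bleeding" in text else "routine"

tasks = ["chest pain at rest", "minor bleeding", "annual checkup"]
auto_labels = [(t, model_a(t)) for t in tasks]  # an LLM doing the Turk work
print(auto_labels)
# ('chest pain at rest', 'routine') is wrong, and a model trained on these
# labels inherits the error with no human in the loop to catch it.
```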


Topol: It’s really interesting. This is a fascinating discussion. I don’t know that I could have this type of conversation with any other physician colleague. Zak, it’s a joy. You mentioned the existential thing. And you also have a very sanguine outlook. You’re not one, as I’ve known you over the years, to be at all hyperbolic about anything. Where do you see the future here? Because obviously, we’re talking about something that’s probably more transformative than we’ve ever seen in our careers.


Kohane: A lot of my friends who, like you, know that I’ve been relatively sober about these things have been taken aback by how excited I am about this. They point out to me that every big change in technology has taken a while to take hold, from the internet to the web to social media.


When I hear that, I say, that’s true, but I want to point out that with every one of those things — telephone, internet, web, social media — the uptake was faster with each transition. So, I think we’re in for an even faster transition.


We’re going to be in for an uncomfortable ride because the rate of adoption is going to be different in different parts of society and, narrowly, in medicine, it’s going to be in different parts of medicine. Either it’s going to end up being applied broadly or perhaps — and this would be a bad outcome — only high-end concierge services will use it, because they want to give the highest customer experience and be plugged into the latest in medical knowledge, and the healthcare systems will be slow in uptake.


I hope that’s not going to happen. I do believe that hospitals are stuck in a difficult place. They’re under multiple constraints. They’re losing money because the cost of ancillary personnel has gone through the roof after COVID. Frankly, the existing IT infrastructure is already incredibly burdensome for them. If you’re the chief information officer of a large hospital system, this feels like another big ask. On the other hand, it is transformative. It’s not going to just change billing; it’s going to change the practice of medicine.


I believe there will be large pressures. But overall, I’m extremely optimistic that this will change medicine for good. I just don’t know how to predict where it will happen most, in this country or in other countries, and in which part of the healthcare system.


Topol: I like that you are thinking globally. As you mentioned, there are two extremes, with Europe being much more conservative and China being far more aggressive.


We’re going to have to come back together for another Medicine and the Machine down the stretch to see how fast this is taking hold. On the one hand, there’s desperate need: doctors and nurses are sick of being data clerks, and here’s a remedy that, if validated, could be a solution for some of that, if not a large proportion.


But then there are the other applications taking this forward — multipotent applications, in many respects. We’ll have to see whether any of these gain a foothold.


Kohane: Eric, I just want to say one thing in closing. It’s obvious to me that the administrative part of this is going to happen almost immediately. And 30% of the medical economy in the United States is administrative overhead, so there’s billions of dollars at stake. In a purely self-serving way, every institution is going to benefit. Insurance companies don’t have to have thousands of doctors reviewing cases. Hospitals will not have to have basements full of people upcoding. So that is something that’s definitely going to happen.


When Willie Sutton was asked why he robbed banks, he said, because that’s where the money is. We know where AI is going to be applied first and most broadly in medicine; it’s in the financial administrative transactions. Our job and responsibility to society is to make sure that it should also work to improve the quality of healthcare.


Topol: I couldn’t agree with you more. We saw how electronic records took hold, just because of the billing issues. And we’re going to see this, as you say. As you know, my dream is that we will free up physicians to have the gift of time with their patients and actually feel like they’re caring, developing that relationship that has been so much more compromised in recent decades. We’ll see how it plays out. This has been fun.


The last point you made is about the financial factors that drive healthcare, which is a very big business. This is a major unmet need. Zak, what a treat to have this conversation. I know the Medscape folks who are listening to or reading it will really enjoy this.


Kohane: Thank you, Eric. Thanks for having me on.


Originally published at https://www.medscape.com on June 27, 2023.
