Not All AI Is Created Equal: Strategies for Safe and Effective Adoption [insights from Johns Hopkins and Bayesian Health's AI leader]


This is a republication of the article “Not All AI Is Created Equal: Strategies for Safe and Effective Adoption”, with the title above.


NEJM Catalyst
Suchi Saria, PhD

March 23, 2022


Site Editor:


Joaquim Cardoso MSc
health transformation institute (HTI) – research, strategy and advisory
October 26, 2022


Summary


  • Emerging best practices for the successful adoption and use of artificial intelligence, and a key missing ingredient: rigorous evaluation.

  • Five keys for using artificial intelligence in care delivery, and

  • five common mistakes made in evaluating AI tools.

Health care is in a precarious position today, with an unstable workforce facing increasing demands to improve quality and safety while also reducing costs. 


The use of artificial intelligence (AI) and its successful integration through human-machine teaming is one of the most exciting untapped opportunities to improve the experience, efficiency, and quality of health care delivery. 

Predictive AI tools for clinical decision support (CDS) show tremendous promise in accelerating more accurate diagnoses and improving the safety and quality of health care.

This promise is fueled by an unprecedented amount of clinical, social, and personal data that is now available in real time from various sources, including electronic medical records (EMR) and monitoring devices.


With significant advances in AI and machine learning (ML) technologies, health care is beginning to move from being reactive to proactive, such as detecting early signs of life-threatening complications like sepsis and leukemia, and forecasting risk of progression in chronic diseases.


We are already seeing an acceleration of AI for automating administrative use cases like optimizing billing and reducing denied claims. 

There is no question that the future of medicine involves intelligent clinical augmentation — which is why it is imperative for care delivery organizations to have an impactful clinical AI strategy.





Key Elements of an Impactful Clinical AI Strategy


From my 2 decades of experience driving AI research, and more recently health AI translation, I offer emerging best practices for the successful adoption and use of AI.


  • 1. Defined Need
  • 2. Seamless Integration into Existing Workflows
  • 3. Continuous Performance Monitoring and Optimization
  • 4. Operation as a Platform Rather Than a Point Solution
  • 5. Clear Evaluation and Measurement of Efficiency and Outcomes

1. Defined Need


AI should be targeted to areas that have a clearly defined problem or need for improvement in care. Resources — both financial and personnel — should be focused on efforts where there is a clear understanding of value to the system and to relevant stakeholders.

Often, AI is applied too broadly, without a clear goal or impact.




2. Seamless Integration into Existing Workflows


Health systems, clinicians, and frontline staff are overburdened. 

They don’t have time for a tool that adds complicated steps to their processes. 

The user experience for a well-designed smartphone is very different from one assembled with a do-it-yourself kit. 

Similarly, it is important that AI tools are designed around the existing information technology infrastructure, clinician workflows, pain points, and time leaks, and then build smart pathways that integrate machine learning insights into these processes. 

To extract the value AI tools can provide, the insights must be translated into actual behavior change triggered through specific workflows.




3. Continuous Performance Monitoring and Optimization


Each interaction between frontline staff and an AI/ML platform provides a chance to capture information that should be used to improve the tool. 


Physicians and nurses should see their feedback being incorporated, helping to build trust and drive adoption and sustained use. 

A predictive AI solution must learn from how users are interacting with the platform and detect opportunities to deepen engagement.




4. Operation as a Platform Rather Than a Point Solution


As health systems look to AI tools to solve multiple different complex problems, they need a platform upon which different solutions can be built as needed, as opposed to trying to address specific use cases with single point solutions. 

  • Having a consistent platform and interface also improves adoption by physicians and nurses, which drives improved outcomes at scale. 
  • Additionally, a platform solution allows for a one-time minimal integration. 

However, a single platform will not solve all use cases because different use cases demand expertise in different types of AI. 

For example, the approaches needed for imaging are different from predictive AI use cases that leverage multimodal clinical and social data. 

The EMR now makes it possible to leverage best-in-class platforms to tackle different problem areas.




5. Clear Evaluation and Measurement of Efficiency and Outcomes


Developing effective models is challenging, and mistakes are common. 

The right metrics must be identified and then evaluated in rigorous studies. 

Measuring a tool’s direct impact on granular metrics across discrete areas such as data science (model efficacy), behavior change (efficiency and adoption), and outcomes (clinical and financial, including return on investment) is critical for success. 

Additionally, the tool’s developer should be transparent about how the model is built and the strategies it employs.





Rigorous Evaluations of AI Tools Are Crucial


Building and deploying AI predictive tools in health care isn’t easy. 


The data are messy and challenging from the start, and creating models that can integrate, adapt, and analyze this type of data requires a deep understanding of the latest AI/ML strategies and an ability to employ these strategies effectively. 

Recent studies and reporting have shown how hard it is to get AI right, and how important it is to be transparent with what’s “under the hood” and the effectiveness of any predictive tool.




A wide gap exists between health AI done right and implementations in practice. 


State-of-the-art tools and best practices exist for performing rigorous evaluations, but awareness and implementation of these practices varies among AI developers. 

While there are many entities and groups (such as the U.S. Food and Drug Administration [FDA] and the newly formed Coalition for Health AI) working on guidelines and regulations to evaluate AI and predictive tools in health care, there is currently no governing body to codify the right way to perform predictive tool evaluations.


As a result, many people are making mistakes when evaluating AI solutions. 


These mistakes can lead to utilization of predictive tools that aren’t effective or appropriate for a given population, and a lot of staff time spent on implementations that ultimately lead to no impact. 




Here are five common mistakes made in the evaluation of AI tools.


  • 1. Absent or Incorrect Quantitative Evaluations 
  • 2. Evaluation of Only the Models or the Workflows 
  • 3. Incorrect Measurement of the Impact on Outcomes 
  • 4. Inability to Detect Data Set Shifts 
  • 5. “Apples to Oranges” Outcome Studies


1. Absent or Incorrect Quantitative Evaluations


Health systems are deploying predictive software either without quantitative evaluations accompanying these tools or with evaluations that have the wrong metrics. 

In some cases, qualitative experience alone is used for determining what software to roll out and what to keep. 

In the absence of evaluations, leaders run the risk of causing frontline burnout and implementing software that may harm patients. 

The metrics for evaluation should be determined based on the mechanism of action for each condition area. 


For example, with sepsis, lead time (the median time an alert provides prior to antibiotics administration) is critical. 

But parameters must be established that avoid alarm fatigue, which could contribute to errors, provider burnout, and overtreatment. 

The key criteria to look for in a sepsis tool are high sensitivity, significant lead time, and low false alerting rate.
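The three criteria above can be made concrete with a small sketch. This is an illustrative computation only; the encounter records and field names are invented, and real evaluations would use retrospective clinical data rather than hand-coded examples.

```python
from statistics import median

# Hypothetical per-encounter records: whether sepsis actually occurred,
# whether the tool alerted, and (for caught cases) the alert's lead time
# before antibiotics administration, in hours. All values are illustrative.
encounters = [
    {"septic": True,  "alerted": True,  "lead_time_h": 5.5},
    {"septic": True,  "alerted": True,  "lead_time_h": 2.0},
    {"septic": True,  "alerted": False, "lead_time_h": None},  # missed case
    {"septic": False, "alerted": False, "lead_time_h": None},
    {"septic": False, "alerted": True,  "lead_time_h": None},  # false alert
]

septic = [e for e in encounters if e["septic"]]
non_septic = [e for e in encounters if not e["septic"]]

# Sensitivity: fraction of true sepsis cases the tool caught.
sensitivity = sum(e["alerted"] for e in septic) / len(septic)

# Median lead time: how far ahead of antibiotics the alert fired.
median_lead_time = median(e["lead_time_h"] for e in septic if e["alerted"])

# False alerting rate: alerts among patients who never developed sepsis,
# the quantity that drives alarm fatigue.
false_alert_rate = sum(e["alerted"] for e in non_septic) / len(non_septic)

print(sensitivity, median_lead_time, false_alert_rate)
```

The point of separating the three numbers is that each can be gamed in isolation: a tool can buy sensitivity by alerting constantly, which is why the false alerting rate must be reported alongside it.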


2. Evaluation of Only the Models or the Workflows


Workflows are as important as the underlying AI models, and it is important to evaluate both. 

Yet often one is considered without the other. 

Workflow evaluation should consider ease of use to drive adoption, overall implementation burden, and how often the tool is misinterpreted or ignored. Models should be highly performant (e.g., with both high sensitivity and high precision). 




Assuming that efficacy can be obtained through optimized workflows alone is analogous to not knowing if a drug will work and changing the label to try to increase effectiveness.


AI/ML algorithms should be given the same rigorous scrutiny as drugs and medical devices undergoing clinical trials.




3. Incorrect Measurement of the Impact on Outcomes


Many studies of AI tools rely on coded data to identify cases and measure outcome impact. 


These are not reliable because coding is highly dependent on documentation practices, and often a surveillance tool itself impacts documentation. 

A common flawed design in a pre/post study is surveillance bias, where the post-intervention period leverages a surveillance tool that dramatically increases the number of coded cases. 

This, in turn, leads to the perception that outcomes have improved because the adverse rate (e.g., sepsis mortality rate on coded cases) has decreased. 

High-quality, rigorous studies should account for these types of issues.
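The surveillance-bias mechanism is easy to see in a toy pre/post comparison. The numbers below are invented purely for illustration: deaths are unchanged, but the post-intervention surveillance tool codes many more mild cases, so the coded mortality rate falls even though no outcome actually improved.

```python
# Pre period: only the most severe cases get coded as sepsis.
pre_coded_cases, pre_deaths = 100, 20

# Post period: the surveillance tool surfaces milder cases that now
# also get coded, doubling the denominator; absolute deaths are flat.
post_coded_cases, post_deaths = 200, 20

pre_rate = pre_deaths / pre_coded_cases
post_rate = post_deaths / post_coded_cases

# Coded mortality appears to halve (0.20 -> 0.10) with zero lives saved.
print(pre_rate, post_rate, pre_deaths == post_deaths)
```

A rigorous study design would fix the case definition independently of the tool (e.g., via chart review against clinical criteria) so the denominator cannot shift between periods.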




4. Inability to Detect Data Set Shifts


If a model doesn’t address the issues of data set shifts and transportability up front, it is at risk of being unsafe. 

Strategies to reduce bias and adapt for data set shift are critical because practice patterns are frequently changing. 

Look for evidence of high performance across diverse populations to see if the solution is detecting and tuning appropriately for shifts. 




A common mistake is to assume any AI model can be made to work with enough rules and configurations added on top. 

The predictive AI tool should come with its own ability to tune and with an understanding of when and how to tune.
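One common heuristic for detecting the data set shift described above is the Population Stability Index (PSI), which compares a feature's binned distribution at training time with its live distribution. The sketch below is a minimal pure-Python illustration, not any vendor's monitoring implementation; the 0.2 threshold is a widely used rule of thumb, not a standard.

```python
import math

def population_stability_index(expected, actual, bins=10):
    """PSI between a training-time sample (`expected`) and a live
    sample (`actual`) of one numeric feature. Larger means more drift;
    values above roughly 0.2 are often treated as meaningful shift."""
    lo = min(min(expected), min(actual))
    hi = max(max(expected), max(actual))
    width = (hi - lo) / bins or 1.0

    def bin_fractions(values):
        counts = [0] * bins
        for v in values:
            counts[min(int((v - lo) / width), bins - 1)] += 1
        # Small epsilon keeps log() finite for empty bins.
        return [(c + 1e-6) / (len(values) + bins * 1e-6) for c in counts]

    e, a = bin_fractions(expected), bin_fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

train = [0.1 * i for i in range(100)]       # training-time distribution
psi_same = population_stability_index(train, list(train))
psi_shifted = population_stability_index(train, [v + 5.0 for v in train])

# An identical distribution scores near zero; a shifted one scores high.
print(round(psi_same, 4), round(psi_shifted, 2))
```

In practice this kind of check would run continuously on live inputs, with alerts routed to the team responsible for retraining or retuning the model.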




5. “Apples to Oranges” Outcome Studies


A common mistake is to overlook the standard of care. 


For example, a 10% improvement in outcomes at a high-reliability organization may be just as impressive as, or more so than, a similar improvement at a different organization with historically poor outcomes. 

Understanding the populations for which the studies were done and the standard of care in those environments will help leaders understand how and why an AI tool improved outcomes.





Despite the technical hurdles in deploying and evaluating AI in practice, there have been several examples of evaluations done well. 


Kaiser Permanente deployed and evaluated a clinical program for automated detection of impending deterioration that showed outcomes of decreased mortality. 


A multisite study evaluating a machine learning-based early recognition system for sepsis deployed at Johns Hopkins University performed both qualitative evaluations of provider experience and quantitative evaluations, demonstrating significant provider adoption coupled with significant reductions in time to antibiotics, mortality, and length of stay.

Another example is the evaluation of a tool for diagnosing diabetic retinopathy within primary care offices, demonstrating improved sensitivity and specificity compared to review by specialists alone.



Given the significant potential for AI to transform a struggling health care industry, and the risk associated with poor execution, it is essential to establish guardrails and guidance. 

The FDA recently released the “Artificial Intelligence and Machine Learning (AI/ML) Software as a Medical Device Action Plan,” which outlines its direction, but more work needs to be done.



High-quality, rigorous evaluations are necessary to avoid ineffective CDS tools that could cause harmful consequences, such as alarm fatigue, worse outcomes, or overtreatment. 


This would ultimately break provider trust and delay adoption of AI tools. 

Acceleration and widespread adoption of AI tools will not happen until leaders tackle the need for evaluation. 

The first randomized controlled trial (RCT) in medicine was published in 1948, and it didn’t take long for RCTs to be considered the gold standard in Western medicine. 




Health AI needs to graduate from 19th- and 20th-century practices to 21st-century medicine.


The massive digitization of health care offers an opportunity to mature the partnership between humans and machines, and to address the many challenges and imperatives of clinical work.




About the author & affiliations


Suchi Saria, PhD

John C. Malone Endowed Chair and Director, Machine Learning, AI, and Healthcare Lab, Johns Hopkins University, Baltimore, Maryland, USA; Founder and Chief Executive Officer, Bayesian Health, New York, New York, USA

Originally published at https://catalyst.nejm.org on March 23, 2022.


References


See the original publication
