What is the message?
The accuracy and efficacy of algorithms heavily rely on the quality of the data they process, particularly in the era of data-driven technologies like artificial intelligence (AI) and automation tools.
EXECUTIVE SUMMARY
What are the key points?
Data Quality Concerns: Recent research from MIT found errors in the test sets of 10 widely used AI datasets, at an average rate of 3.3%. Errors like these undermine the reliability of AI-driven solutions, especially in high-stakes fields like healthcare.
Challenges in Healthcare Data: Healthcare data, often unstructured and varied, poses unique challenges for accuracy and standardization. Over 50% of clinically relevant data is only available in free-text format, leading to difficulties in aggregating and interpreting information accurately.
Complexities in Data Collection: Inconsistencies and inaccuracies in healthcare data stem from the lack of standardized data collection practices across different organizations. Additionally, patient self-reporting introduces further complexities, as seen in discrepancies in racial and ethnic identification.
Impact of Data Quality Issues: Data quality issues extend beyond inaccuracies to include fraud and abuse, costing the healthcare system billions annually. Understanding these limitations is crucial when relying on data-driven technologies for decision-making.
Technological Advancements: Despite challenges, advancements in technologies like Natural Language Processing (NLP) offer promising solutions. NLP can address data quality pitfalls by understanding unstructured text and reconciling conflicting data points.
Progress in AI: State-of-the-art AI models demonstrate significant improvements in accuracy benchmarks over the past few years. Libraries like Spark NLP achieve over 90% accuracy in clinical text understanding tasks, enhancing the reliability of AI-driven healthcare solutions.
What are the key statistics?
An average of 3.3% of the examples in the test sets of 10 widely used computer vision, NLP and audio datasets contain errors, which calls small reported accuracy gains into question.
Over 50% of clinically relevant healthcare data is only available in free-text format, complicating data analysis and interpretation.
The National Health Care Anti-Fraud Association estimates healthcare fraud costs the US about $68 billion annually, with other estimates as high as $230 billion, highlighting the financial implications of data quality issues.
Libraries like Spark NLP achieve over 90% accuracy in clinical text understanding tasks, showcasing advancements in AI technology.
Conclusion
While advancements in AI and NLP offer promising solutions to address data quality challenges, it’s essential to acknowledge and mitigate the fundamental limitations of data accuracy.
As technology continues to evolve, maintaining vigilance regarding data quality remains paramount, especially in critical domains like healthcare.
DEEP DIVE
The Accuracy Limits Of Data-Driven Healthcare
Forbes
David Talby
February 16, 2022
Algorithms are only as good as the quality of data they’re being fed.
This is not a new concept, but as we begin to rely more heavily on data-driven technologies, such as artificial intelligence (AI) and other automation tools and applications, it’s becoming a more important one.
Recent research from MIT found a high number of errors in publicly available datasets that are widely used for training models.
An average of 3.3% errors were found in the test sets of 10 of the most widely used computer vision, natural language processing (NLP) and audio datasets.
Given that accuracy baselines are often at or above 90%, this means that a lot of research innovation amounts to chance — or overfitting to errors.
Data science practitioners should exercise caution when choosing which models to deploy based on small accuracy gains on such datasets.
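To make this caution concrete, here is a minimal Python sketch of a worst-case bound. It is not taken from the MIT study. It assumes that a model's measured accuracy can differ from its accuracy against corrected labels by at most the share of mislabeled test examples. The function names and the 90%, 92% and 98% accuracies are hypothetical; only the 3.3% error rate comes from the research cited above.

```python
def true_accuracy_bounds(measured_accuracy, label_error_rate):
    """Worst-case range of a model's accuracy against corrected labels.

    Assumption: the model can only gain or lose credit on the mislabeled
    examples, so the true value lies within +/- label_error_rate of the
    measured one (clipped to the valid range 0..1).
    """
    low = max(0.0, measured_accuracy - label_error_rate)
    high = min(1.0, measured_accuracy + label_error_rate)
    return low, high


def ranking_is_uncertain(acc_a, acc_b, label_error_rate):
    """True if the two worst-case ranges overlap, so either model could be better."""
    low_a, high_a = true_accuracy_bounds(acc_a, label_error_rate)
    low_b, high_b = true_accuracy_bounds(acc_b, label_error_rate)
    return low_a <= high_b and low_b <= high_a


LABEL_ERROR_RATE = 0.033  # average test-set error rate reported in the article

# Hypothetical models: a 2-point gain falls inside the label-noise margin,
# an 8-point gain does not.
print(ranking_is_uncertain(0.90, 0.92, LABEL_ERROR_RATE))
print(ranking_is_uncertain(0.90, 0.98, LABEL_ERROR_RATE))
```

The bound is deliberately pessimistic. In practice, re-scoring both models on a corrected test set would give a sharper answer than this interval check.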
These findings are particularly concerning when it comes to AI applications in high-stakes industries like healthcare.
Outcomes in this field have the ability to prevent disease, accelerate the development of life-saving medicine and help us understand the spread of disease and other critical health trends.
Accuracy is vital in healthcare, but the field is rife with complexities that make it extremely hard to achieve.
One of the reasons for this is the data source.
More than half of the clinically relevant data for applications like recommending a course of treatment, finding actionable genomic biomarkers or matching patients to clinical trials is only found in free-text.
This includes physicians’ notes, diagnostic imaging, pathology reports, lab reports and other sources not available as structured data within electronic health records (EHR).
These information sources include nuances and data quality issues that make it hard to connect the dots and get a full picture of a patient.
Another barrier exists in the limitations of what’s in the data itself.
Because there are no shared standards for data collection across hospitals and healthcare systems, inconsistencies and inaccuracies are common.
Between different organizations collecting different information and records not being updated on a consistent basis, it’s difficult to know how accurate the data is — especially if it’s being moved and updated among different providers.
It’s not just providers to blame, either — inaccuracies come directly from the patients themselves.
A recent study from The Journal of General Internal Medicine shows just how prevalent this can be.
When exploring the accuracy of race, ethnicity and language preference in EHRs, the study found that 30% of whites self-reported identification with at least one other racial or ethnic group, as did 37% of Hispanics and 41% of African Americans.
Patients were also less likely to complete the survey in Spanish than the language preference noted in the EHR would have suggested.
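A discrepancy of this kind can be measured once both sources are available side by side. The Python sketch below is only an illustration: the record layout, the `ehr` and `self_report` keys, and the sample values are all made up and are not data from the study.

```python
def disagreement_rate(records, field):
    """Share of comparable records where the EHR and the patient disagree.

    Records missing the field in either source are skipped rather than
    counted as mismatches.
    """
    compared = [
        r for r in records
        if r["ehr"].get(field) is not None
        and r["self_report"].get(field) is not None
    ]
    if not compared:
        return 0.0
    mismatches = sum(
        1 for r in compared
        if r["ehr"][field].strip().lower() != r["self_report"][field].strip().lower()
    )
    return mismatches / len(compared)


# Made-up records for illustration only.
records = [
    {"ehr": {"language": "Spanish"}, "self_report": {"language": "English"}},
    {"ehr": {"language": "English"}, "self_report": {"language": "english "}},
    {"ehr": {"language": "Spanish"}, "self_report": {"language": "Spanish"}},
    {"ehr": {"language": "English"}, "self_report": {"language": "Spanish"}},
    {"ehr": {"language": None}, "self_report": {"language": "English"}},
]

print(disagreement_rate(records, "language"))
```

Exact string comparison after trimming and lowercasing is the simplest possible matching rule. Multi-valued fields such as race and ethnicity would need a set-based comparison instead.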
There’s clearly a need for better data collection practices in healthcare and beyond.
Accurate information can help the medical community understand more about social determinants of health, patient risk prediction, clinical trial matching and more.
Standardizing how this data is collected and recorded can ensure the clean data gets shared and analyzed correctly.
This is both a medical and social challenge. For example, what is the “correct” race to fill in? When exactly is someone considered a smoker? This is also partly a technology challenge, as we’re already way beyond the limit of what’s reasonable to ask providers and patients to manually input.
There are also data quality issues outside our direct control, such as fraud and abuse.
The National Health Care Anti-Fraud Association estimates that “healthcare fraud costs the nation about $68 billion annually — about 3% of the nation’s $2.26 trillion in healthcare spending.
Other estimates range as high as 10% of annual healthcare expenditure, or $230 billion.”
While we can account for error rates within the data, it’s an imperfect science at the end of the day, and it’s important to understand its limitations.
That said, it’s not all doom and gloom when it comes to quality data or the algorithms we use.
Technology that can automatically understand the nuances of unstructured text and images, as well as reconcile conflicting and missing data points, is gradually maturing.
NLP, for example, can address many pitfalls of data quality, such as uncovering disparities in an EHR versus a doctor’s transcript or what a patient self-reports.
In recent years, newer algorithms and models can apply the context, medium and intent of each data source to infer useful semantic answers.
This is especially useful when you consider how specific clinical language is.
Take how we indicate triple-negative breast cancer (TNBC), for instance.
While the acronym TNBC isn’t hard to identify, the condition can also be denoted as “Er-/pr-/h2-”, “(er pr her2) negative”, “tested negative for the following: er, pr, h2” and “triple negative neoplasm of the upper left breast”, to name a few.
NLP can identify variations of these terms when they are in context — and healthcare-specific deep learning models have gotten very good at this.
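For contrast, here is what a hand-written rule for these variants might look like. This Python sketch is a naive baseline written for illustration, not how Spark NLP or any deep learning model works; the function name `mentions_tnbc` and the patterns are assumptions chosen to cover only the four phrasings quoted above.

```python
import re

# One pattern per receptor; "h2" and "her2" are treated as the same marker.
RECEPTORS = {
    "er": r"\ber\b",
    "pr": r"\bpr\b",
    "her2": r"\bh(?:er)?2\b",
}


def mentions_tnbc(text):
    """Naive rule: does this snippet appear to describe triple-negative breast cancer?"""
    t = text.lower()
    # Direct mentions: the acronym or the spelled-out phrase.
    if re.search(r"\btnbc\b", t) or re.search(r"\btriple[- ]negative\b", t):
        return True
    # Otherwise all three receptors must be named.
    if not all(re.search(p, t) for p in RECEPTORS.values()):
        return False
    # Any "positive" result rules the snippet out under this crude rule.
    if "positive" in t:
        return False
    # Accept either a trailing minus sign on every receptor or the word "negative".
    all_minus = all(re.search(p + r"-", t) for p in RECEPTORS.values())
    return all_minus or "negative" in t


examples = [
    "Er-/pr-/h2-",
    "(er pr her2) negative",
    "tested negative for the following: er, pr, h2",
    "triple negative neoplasm of the upper left breast",
]
for example in examples:
    print(example, "->", mentions_tnbc(example))
```

Rules like this break as soon as the phrasing shifts, for example when a negation appears in a different sentence or a receptor is abbreviated in a new way. That brittleness is why models that read the terms in context are the better fit here.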
State-of-the-art, peer-reviewed, publicly reproducible accuracy results on both competitive academic benchmarks and real-world production deployments have been steadily improving over the last five years.
Libraries like Spark NLP surpass 90% accuracy on a variety of clinical and biomedical text understanding tasks.
Reproducibility of results, consistency of applying clinical guidelines at scale and the ability to easily tune models to a specific clinical use case or setting are three keys to successful implementations and to building broader trust in AI technology.
The healthcare industry is varied and complex and so, too, is the information collected.
The technology that helps make data-driven decisions in this field will keep improving.
But it’s critical to remember the fundamental limitations of data quality and accuracy that power these algorithms.
Simply put, it’s not safe to assume that a piece of data is correct because someone typed it into a computer.
About the author
PhD, MBA, CTO at John Snow Labs. Making AI & NLP solve real-world problems in healthcare, life science and related fields.
Originally published at https://www.forbes.com.