January 3, 2022
Ziad Obermeyer, a physician and machine learning scientist at the University of California, Berkeley, launched Nightingale Open Science last month — a treasure trove of unique medical data sets, each curated around an unsolved medical mystery that artificial intelligence could help to solve.
The data sets, released after the project received $2m of funding from former Google chief executive Eric Schmidt, could help to train computer algorithms to predict medical conditions earlier, triage better and save lives.
The data include 40 terabytes of medical imagery, such as X-rays, electrocardiogram waveforms and pathology specimens, from patients with a range of conditions, including high-risk breast cancer, sudden cardiac arrest, fractures and Covid-19.
Each image is labelled with the patient’s medical outcomes, such as the stage of breast cancer and whether it resulted in death, or whether a Covid patient needed a ventilator.
The data sets are free to use. Obermeyer worked mainly with hospitals in the US and Taiwan to build them over two years.
He plans to expand this to Kenya and Lebanon in the coming months to reflect as much medical diversity as possible.
“Nothing exists like it,” said Obermeyer, who announced the new project in December alongside colleagues at NeurIPS, the global academic conference for artificial intelligence.
“What sets this apart from anything available online is the data sets are labelled with the ‘ground truth’, which means with what really happened to a patient and not just a doctor’s opinion.”
This means that data sets on cardiac arrest ECGs, for example, have not been labelled depending on whether a cardiologist detected something suspicious, but with whether that patient eventually had a heart attack.
“We can learn from actual patient outcomes, rather than replicate flawed human judgment,” Obermeyer said.
In the past year, the AI community has undergone a sector-wide shift from collecting "big data", meaning as much data as possible, to meaningful data: information that is more curated and relevant to a specific problem. Such data can be used to address issues such as ingrained human biases in healthcare, image recognition or natural language processing.
Many healthcare algorithms have been shown to amplify existing health disparities.
For instance, Obermeyer found that an AI system used by hospitals treating up to 70m Americans, which allocated extra medical support for patients with chronic illnesses, was prioritising healthier white patients over sicker black patients who needed help.
It assigned risk scores based on data that included an individual's total healthcare costs in a year, in effect using healthcare costs as a proxy for healthcare needs.
The crux of this problem, which was reflected in the model’s underlying data, is that not everyone generates healthcare costs in the same way.
Minorities and other underserved populations may lack access to and resources for healthcare, be less able to get time off work for doctors’ visits, or experience discrimination within the system by receiving fewer treatments or tests, which can lead to them being classed as less costly in data sets.
This doesn’t necessarily mean they have been less sick.
The researchers calculated that nearly 47 per cent of black patients should have been referred for extra care, but the algorithmic bias meant that only 17 per cent were.
“Your costs are going to be lower even though your needs are the same. And that was the root of the bias that we found,” Obermeyer said.
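The proxy problem Obermeyer describes can be illustrated with a small sketch (the patients and figures below are invented for illustration, not drawn from the study): ranking patients by observed spending rather than by underlying need demotes anyone whose access barriers suppress their costs.

```python
# Hypothetical illustration of cost-as-proxy bias.
# Patients A and B have identical underlying need, but B faces access
# barriers (less time off work, fewer tests ordered), so B's recorded
# healthcare spending is lower.
patients = [
    {"name": "A", "need": 0.9, "cost": 12000},  # high need, full access
    {"name": "B", "need": 0.9, "cost": 4000},   # same need, barriers to care
    {"name": "C", "need": 0.3, "cost": 7000},   # moderate need, full access
]

# A model trained to predict cost effectively ranks patients by spending:
by_cost = sorted(patients, key=lambda p: p["cost"], reverse=True)
# ...whereas ranking by actual medical need gives a different order:
by_need = sorted(patients, key=lambda p: p["need"], reverse=True)

print([p["name"] for p in by_cost])  # ['A', 'C', 'B']: B demoted despite high need
print([p["name"] for p in by_need])  # ['A', 'B', 'C']: B ranks with A
```

If extra care is allocated to the top two patients, the cost-based ranking skips patient B entirely, which is the mechanism behind the disparity the researchers measured.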
He found that several other similar AI systems also used cost as a proxy, a decision that he estimates is affecting the lives of about 200m patients.
Unlike widely used computer vision data sets such as ImageNet, which were built from internet photos that do not necessarily reflect the diversity of the real world, a spate of new data sets include information that is more representative of the population. This results not just in wider applicability and greater accuracy of the algorithms, but also in expanded scientific knowledge.
These new diverse and high-quality data sets could be used to root out underlying biases “that are discriminatory in terms of people who are underserved and not represented” in healthcare systems, such as women and minorities, said Schmidt, whose foundation has funded the Nightingale Open Science project.
“You can use AI to understand what’s really going on with the human, rather than what a doctor thinks.”
The Nightingale data sets were among dozens proposed this year at NeurIPS.
Other projects included a speech data set of Mandarin and eight subdialects recorded by 27,000 speakers in 34 cities in China; the largest audio data set of Covid respiratory sounds, such as breathing, coughing and voice recordings, from more than 36,000 participants to help screen for the disease; and a data set of satellite images covering the entire country of South Africa from 2006 to 2017, divided and labelled by neighbourhood, to study the social effects of spatial apartheid.
Elaine Nsoesie, a computational epidemiologist at the Boston University School of Public Health, said new types of data could also help with studying the spread of diseases in diverse locations, as people from different cultures react differently to illnesses.
She said her grandmother in Cameroon, for example, might think differently than Americans do about health.
“If someone had an influenza-like illness in Cameroon, they may be looking for traditional, herbal treatments or home remedies, compared to drugs or different home remedies in the US.”
Computer scientists Serena Yeung and Joaquin Vanschoren, who proposed that research on building new data sets be shared at NeurIPS, pointed out that the vast majority of the AI community still cannot find good data sets to evaluate their algorithms.
This meant that AI researchers were still turning to data that were potentially “plagued with bias”, they said. “There are no good models without good data.”