The Real Deal About Synthetic Data — with a Healthcare Example from NIH

It’s often difficult to access the real-world data needed to train AI models or gain insights, but new techniques for generating look-alike data sets can help.

The NIH (National Institutes of Health) is working with the startup Syntegra, whose synthetic data engine is being used to generate and validate a nonidentifiable replica of the NIH’s database of COVID-19 patient records, which comprises more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients.


MIT Sloan Management Review
Fernando Lucini
Magazine Winter 2022 Issue — October 20, 2021



Data is the essential fuel driving organizations’ advanced analytics and machine learning initiatives, but between privacy concerns and process issues, it’s not always easy for researchers to get their hands on what they need. 



A promising new avenue to explore is synthetic data, which can be shared and used in ways real-world data can’t. 

However, this emerging approach isn’t without risks or drawbacks, and it’s essential that organizations carefully explore where and how they invest their resources.



Structure of the article

  • What Is Synthetic Data?
  • The Value for Business: Security, Speed, and Scale
  • Why Isn’t Everybody Using It?
  • What Could Go Wrong?
  • What It Takes to Move Forward

What Is Synthetic Data?


Synthetic data is artificially generated by an AI algorithm that has been trained on a real data set. 


It has the same predictive power as the original data but replaces it rather than disguising or modifying it. 

The goal is to reproduce the statistical properties and patterns of an existing data set by modeling its probability distribution and sampling from it. 

The algorithm essentially creates new data that has all of the same characteristics as the original data — leading to the same answers. 

However, crucially, it’s virtually impossible to reconstruct the original data (think personally identifiable information) from either the algorithm or the synthetic data it has created.
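
To make the model-then-sample idea concrete, here is a minimal sketch in Python, assuming a small numeric table and an off-the-shelf Gaussian mixture as the density model. The column names and the choice of model are illustrative assumptions, not the approach any particular vendor uses.

```python
# Minimal sketch: learn a real table's joint distribution, then sample
# brand-new rows from it. Column names and the mixture model are illustrative.
import numpy as np
import pandas as pd
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.normal(45, 12, 5_000),
    "systolic_bp": rng.normal(125, 15, 5_000),
})

# Fit a density model to the real data...
model = GaussianMixture(n_components=5, random_state=0).fit(real.values)

# ...then draw synthetic rows from it. No synthetic row is a copy of a real one.
samples, _ = model.sample(n_samples=5_000)
synthetic = pd.DataFrame(samples, columns=real.columns)

# Sanity check: the summary statistics should roughly match.
print(real.describe().loc[["mean", "std"]])
print(synthetic.describe().loc[["mean", "std"]])
```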




Synthetic data is a boon for researchers. 


Consider what the National Institutes of Health in the U.S. is doing with Syntegra, an IT services startup. 

Syntegra is using its synthetic data engine to generate and validate a nonidentifiable replica of the NIH’s database of COVID-19 patient records, which comprises more than 2.7 million screened individuals and more than 413,000 COVID-19-positive patients. 

The synthetic data set, which precisely duplicates the original data set’s statistical properties but with no links to the original information, can be shared and used by researchers across the globe to learn more about the disease and accelerate progress in treatments and vaccines.



The technology has potential across a range of industries. 


In financial services, where restrictions around data usage and customer privacy are particularly limiting, companies are starting to use synthetic data to help them identify and eliminate bias in how they treat customers — without contravening data privacy regulations. 

And retailers are seeing the potential for new revenue streams derived from selling synthetic data on customers’ purchasing behavior without revealing personal information.



The Value for Business: Security, Speed, and Scale


Synthetic data’s most obvious benefit is that it eliminates the risk of exposing critical data and compromising the privacy and security of companies and customers. 


Techniques such as encryption, anonymization, and advanced privacy preservation (for example, homomorphic encryption or secure multiparty computation) focus on protecting the original data and the information the data contains that could be traced back to an individual. 

But as long as the original data is in play, there’s always a risk of compromising or exposing it in some way.


By eliminating the time-consuming roadblocks of privacy and security protocols, synthetic data also allows organizations to gain access to data more quickly. 


Consider one financial institution that had a cache of rich data that could help decision makers solve a variety of business problems. 

The data was so highly protected that gaining access to it was an arduous process, even for purely internal use. 

In one case, it took six months to get just a small amount of data, and another six months to receive an update. 

Now that the company is generating synthetic data based on the original data, the team can continuously update and model it and generate ongoing insights into how to improve business performance.


Furthermore, with synthetic data, a company can quickly train machine learning models on large data sets, accelerating the processes of training, testing, and deploying an AI solution. 


This addresses a real challenge many companies face: the lack of enough data to train a model. 

Access to a large set of synthetic data gives machine learning engineers and data scientists more confidence in the results they’re getting at the different stages of model development — and that means getting to market more quickly with new products and services.
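
One common way to build that confidence is a "train on synthetic, test on real" comparison: fit a model on the synthetic rows, score it on held-out real rows, and compare against a model trained on real data. The sketch below assumes a simple tabular classification task with a binary label column; the data frames and column name are hypothetical.

```python
# Hedged sketch: "train on synthetic, test on real" (TSTR). Assumes real_df and
# synthetic_df share the same columns, including a binary "label" column.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

def tstr_auc(real_df, synthetic_df, label="label"):
    # Hold out a slice of real data purely for testing.
    _, real_test = train_test_split(real_df, test_size=0.3, random_state=0)

    X_syn, y_syn = synthetic_df.drop(columns=label), synthetic_df[label]
    X_real, y_real = real_test.drop(columns=label), real_test[label]

    model = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
    # An AUC close to that of a model trained on real data suggests the
    # synthetic set has retained the original's predictive signal.
    return roc_auc_score(y_real, model.predict_proba(X_real)[:, 1])
```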




Security and speed also enable scale, enlarging the amount of data available for analysis. 


While companies can currently purchase third-party data, it’s often prohibitively expensive. 


Buying synthetic data sets from third parties should make it easy and inexpensive for companies to bring more data to bear on the problem they’re trying to solve and get more-accurate answers.

For example, every bank has obligations to identify and eliminate fraud. 

That’s a solitary and resource-intensive quest for each bank, because regulators allow a bank to examine only its own data for suspicious activity. 

If banks pooled their synthetic data sets, they could get a holistic picture of all the people interacting with banks in a particular country, not just their own organization, which would help streamline and speed up the detection process and, ultimately, eliminate more fraud using fewer resources.



Why Isn’t Everybody Using It?


While the benefits of synthetic data are compelling, realizing them can be difficult. 


Generating synthetic data is an extremely complex process, and to do it right, an organization needs to do more than just plug in an AI tool to analyze its data sets. 

The task requires people with specialized skills and truly advanced knowledge of AI. 

A company also needs very specific, sophisticated frameworks and metrics that enable it to validate that it created what it set out to create. This is where things become especially difficult.




Evaluating synthetic data is complicated by the many different potential use cases. 


Specific types of synthetic data are necessary for different tasks (such as prediction or statistical analysis), and those come with different performance metrics, requirements, and privacy constraints. 

Furthermore, different data modalities dictate their own unique requirements and challenges.

A simple example: Let’s say you’re evaluating data that includes a date and a place. 

These two variables behave in different ways and require different metrics to track them. 

Now imagine data that includes hundreds of different variables, all of which need to be assessed with very specific metrics, and you can begin to see the extent of the complexity and challenge. 

We are just in the beginning stages of creating the tools, frameworks, and metrics needed to assess and “guarantee” the accuracy of synthetic data. Getting to an industrialized, repeatable approach is critical to creating accurate synthetic data via a standard process that’s accepted — and trusted — by everyone.
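
To illustrate why each variable type calls for its own metric, the sketch below checks a numeric (or date-like) column with a two-sample Kolmogorov-Smirnov statistic and a categorical "place" column with total variation distance. The column names, and the thresholds a team would actually apply, are assumptions made for illustration.

```python
# Hedged sketch: per-column fidelity checks. A date column (compared on a
# numeric scale) and a categorical "place" column need different metrics.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp

def numeric_fidelity(real: pd.Series, synth: pd.Series) -> float:
    # Two-sample Kolmogorov-Smirnov statistic: 0 means identical distributions.
    return float(ks_2samp(real, synth).statistic)

def categorical_fidelity(real: pd.Series, synth: pd.Series) -> float:
    # Total variation distance between category frequencies: 0 means identical.
    cats = sorted(set(real) | set(synth))
    p = real.value_counts(normalize=True).reindex(cats, fill_value=0)
    q = synth.value_counts(normalize=True).reindex(cats, fill_value=0)
    return 0.5 * float(np.abs(p - q).sum())

# Illustrative usage, assuming both tables have "admit_date" and "place" columns:
# numeric_fidelity(real_df["admit_date"].astype("int64"),
#                  synth_df["admit_date"].astype("int64"))
# categorical_fidelity(real_df["place"], synth_df["place"])
```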


Also holding back the concept of synthetic data is the cultural resistance it meets at many companies: “It won’t work in our company.” 


“I don’t trust it — it doesn’t sound secure.” 
“The regulators will never go for it.” 

Educating C-suite executives, as well as risk and legal teams, and convincing them that synthetic data works will be critical to adoption.




What Could Go Wrong?


Proving the veracity of synthetic data is a critical point. 


The team working on the effort must be able to demonstrate that the artificial data it created truly represents the original data — but can’t be tied to or expose the original data set in any way. 

That’s really hard to do. If it doesn’t match precisely, the synthetic data set isn’t truly valid, which opens up a host of potential problems.


For example, let’s say you’ve created a synthetic data set to inform the development of a new product. 

If the synthetic set doesn’t truly represent the original customer data set, it might contain the wrong buying signals regarding what customers are interested in or are inclined to buy. 

As a result, you could end up spending a lot of money creating a product nobody wants.


Creating incorrect synthetic data also can get a company in hot water with regulators. 


If the use of such data leads to a compliance or legal issue — such as creating a product that harmed someone or didn’t work as advertised — it could mean substantial financial penalties and, possibly, closer scrutiny in the future. 

Regulators are just beginning to assess how synthetic data is created and measured, not to mention shared, and will undoubtedly have a role to play in guiding this exercise.




A distant, but still real, ramification of improperly created synthetic data is the possibility of what’s known as membership inference attacks.


The whole concept of synthetic data is that it’s not in any way tied to the original data. 

But if it isn’t created exactly right, malicious actors might be able to find a vulnerability that enables them to trace some data point back to the original data set and infer who a particular person is. 

The actors can then use this knowledge to continually probe and question the synthetic set and eventually figure out the rest — exposing the entire original data set. 

Technically, this is extremely difficult to do. But with the right resources, it’s not impossible — and, if successful, the implications could be dire.
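
As a rough illustration of how such a vulnerability might be probed for defensively, privacy audits often compare how close synthetic records sit to the records the generator was trained on versus records it never saw. The sketch below is a simplified, hypothetical distance check in that spirit, not a full membership inference attack.

```python
# Hedged sketch: a crude leakage audit in the spirit of membership inference.
# Compares nearest-neighbor distances from synthetic rows to (a) the generator's
# training rows and (b) held-out rows it never saw. Assumes numeric, scaled data.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def leakage_ratio(train: np.ndarray, holdout: np.ndarray, synthetic: np.ndarray) -> float:
    d_train, _ = NearestNeighbors(n_neighbors=1).fit(train).kneighbors(synthetic)
    d_holdout, _ = NearestNeighbors(n_neighbors=1).fit(holdout).kneighbors(synthetic)
    # A ratio well below 1.0 means synthetic rows hug the training data much more
    # tightly than unseen data: a warning sign that real records may be traceable.
    return float(d_train.mean() / d_holdout.mean())
```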


One potential problem with synthetic data that can result even if the data set was created correctly is bias, which can easily creep into AI models that have been trained on human-created data sets that contain inherent, historical biases. 

Synthetic data can be used to generate data sets that conform to a pre-agreed definition of fairness. 

When this metric is used as a constraint in an optimizing model, the new data set will not only accurately reflect the original one but will do so in a way that meets that specific definition of fairness. 

But if a company doesn’t make complex adjustments to AI models to account for bias and simply copies the pattern of the original, the synthetic data will have all the same biases — and, in some cases, could even amplify those biases.
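
As a hypothetical illustration of applying such a fairness definition, the sketch below resamples a synthetic table so every group ends up with the same positive-outcome rate (demographic parity). In practice the constraint is usually built into the generator's objective rather than applied afterward, and the column names here are assumptions.

```python
# Hedged sketch: enforce demographic parity on an already-generated synthetic
# table by resampling within each group. Column names ("group", "outcome") are
# illustrative; assumes every group contains both outcome values.
import pandas as pd

def rebalance_parity(synth: pd.DataFrame, group_col: str = "group",
                     outcome_col: str = "outcome", random_state: int = 0) -> pd.DataFrame:
    target_rate = synth[outcome_col].mean()  # overall positive-outcome rate
    pieces = []
    for _, grp in synth.groupby(group_col):
        pos, neg = grp[grp[outcome_col] == 1], grp[grp[outcome_col] == 0]
        n_pos = int(round(target_rate * len(grp)))
        pieces.append(pos.sample(n=n_pos, replace=True, random_state=random_state))
        pieces.append(neg.sample(n=len(grp) - n_pos, replace=True, random_state=random_state))
    return pd.concat(pieces).reset_index(drop=True)
```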



What It Takes to Move Forward


With the relevant skills, frameworks, metrics, and technologies maturing, companies will be hearing a lot more about synthetic data in the coming years. 

As they weigh whether it makes sense for them, companies should consider the following four questions:

  1. Do the right people know what we’re getting into? 
  2. Do we have access to the necessary skills?
  3. Do we have a clear purpose? 
  4. What’s the scale of our ambitions?

1. Do the right people know what we’re getting into? 


Synthetic data is a new and complicated concept for most people. Before any synthetic data program is rolled out, it’s important that the entire C-suite, as well as the risk and legal teams, fully understand what it is, how it will be used, and how it could benefit the enterprise.


2. Do we have access to the necessary skills? 


Creating synthetic data is a very complex process, so organizations need to determine whether their data scientists and engineers are capable of learning how to do it. 

They should consider how often they will create such data, which will influence whether they should spend the time and money building this capability or contract for external expertise as needed.


3. Do we have a clear purpose? 


Synthetic data must be generated with a particular purpose in mind, because the intended use affects how it’s generated and which of the original data’s properties are retained. 

And if one potential use is to sell it to create a new revenue stream, planning for this potential new business model is key.


4. What’s the scale of our ambitions? 


Creating synthetic data isn’t for the faint of heart. The sheer complexity associated with doing it right — and the potential pitfalls of doing it wrong — means organizations should be sure it will deliver sufficient value in return.

Although synthetic data is still at the cutting edge of data science, more organizations are experimenting with how to get it out of the lab and apply it to real-world business challenges. 

How this evolution unfolds and the timeline it will follow remain to be seen. 

But leaders of data-driven organizations should have it on their radar and be ready to consider applying it when the time is right for them.

About the Author

Fernando Lucini is global data science and machine learning engineering lead at Accenture Applied Intelligence.


Originally published at https://sloanreview.mit.edu on October 20, 2021.
