Synthetic Data

“Synthetic data is utilized in use cases where the available data is limited, incomplete or cannot be sourced easily. Simulation and generative techniques can be used to increase the available training data.”

Gartner

As the name suggests, Synthetic Data is generated programmatically, mainly to support the training needs of AI/ML models. Different types of Synthetic Data can be generated, using simple or advanced AI approaches, to meet specific needs.

Gartner defines it as: Synthetic data is generated by applying a sampling technique to real-world data or by creating simulation scenarios where models and processes interact to create completely new data not directly taken from the real world. Gartner included it in the ‘On The Rise’ category of its Hype Cycle for Data Science and Machine Learning, 2019.
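To make the sampling idea in the definition concrete, here is a minimal sketch in Python. It assumes the real values are roughly normally distributed, fits a mean and standard deviation to them, and draws new values from that fitted distribution. The function name synthesize_from_sample and the sample ages are hypothetical and used only for illustration; real tools typically model the data far more carefully.

```python
import random
import statistics


def synthesize_from_sample(real_values, n, seed=0):
    """Draw n synthetic values from a normal distribution fitted to real_values.

    Assumption: a normal distribution is a reasonable model for the data.
    """
    rng = random.Random(seed)  # seeded so the run is repeatable
    mu = statistics.mean(real_values)
    sigma = statistics.stdev(real_values)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" data: a small sample of ages.
real_ages = [23, 35, 41, 29, 52, 38, 47, 31]

# Generate far more synthetic values than the real sample contains.
synthetic_ages = synthesize_from_sample(real_ages, n=1000)
```

The synthetic values are new numbers, not copies of the originals, but they follow the overall shape of the real sample. That is how a small dataset can be expanded for training purposes.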

Synthetic Data usage goes beyond training AI/ML models. It can also be used to work around the security and privacy constraints of real datasets. This is especially useful for sharing data with third parties without compromising security and privacy.

There are various tools and applications that generate either fully Synthetic Data or partially Synthetic Data derived from real data.
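The difference between fully and partially Synthetic Data can be illustrated with a small sketch. In the partial case, the non-sensitive fields of each real record are kept and only a sensitive field is replaced with generated values. The function partially_synthesize, the field names and the records below are all hypothetical, and the normal-distribution model is an assumption made for brevity.

```python
import random
import statistics


def partially_synthesize(records, sensitive_field, seed=0):
    """Return copies of records with sensitive_field replaced by synthetic values.

    Assumption: the sensitive field is numeric and roughly normally distributed.
    """
    rng = random.Random(seed)
    values = [r[sensitive_field] for r in records]
    mu = statistics.mean(values)
    sigma = statistics.stdev(values)
    # Copy each record and overwrite only the sensitive field.
    return [{**r, sensitive_field: round(rng.gauss(mu, sigma), 2)} for r in records]


# Hypothetical real records: region is kept, salary is treated as sensitive.
real_records = [
    {"region": "north", "salary": 52000.0},
    {"region": "south", "salary": 61000.0},
    {"region": "east", "salary": 48000.0},
    {"region": "west", "salary": 57000.0},
]

partial_records = partially_synthesize(real_records, "salary")
```

A simple replacement like this does not guarantee privacy on its own, since the retained fields may still identify individuals. It only shows the mechanics of mixing real and generated fields.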

While Synthetic Data offers many benefits, it also carries risks. Proper care and governance are required to ensure that the generated data, and the results obtained from it, stay as close to reality as possible.