Generating Synthetic Data

AI and ML models need data for training, testing, and validation. That holds not just for LLMs but for all models, including those working on more mundane tasks such as modeling IoT devices, identifying fraud in financial transactions, detecting anomalies, and even predicting the weather. Building an AI/ML model requires your Data Science team to find and understand data. Most often this is data you have accumulated over the years and stored in your EDW or Data Lake. But what happens when there is not enough data to properly train and test a model? This is where synthetic data comes into the picture.

Wikipedia defines synthetic data this way: “Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data.”

Synthetic data can be created with what’s known as a Generative Adversarial Network, or GAN model. GAN models were initially described by Ian Goodfellow and his colleagues in their paper Generative Adversarial Nets from June of 2014. Since that time, GAN models have proliferated into many variants, each with unique properties and capabilities. GAN models can be found generating data in areas as diverse as image synthesis and generation, image-to-image translation, anomaly detection, drug discovery, and molecular generation.

A GAN model consists of two opposing deep learning models, often convolutional neural networks. The generator creates new data, while the discriminator attempts to classify each sample as ‘real’ or ‘fake’. The discriminator’s verdicts are fed back into training for both models, so that over time the generator produces fake data that looks and acts like real data.
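
To make the feedback loop concrete, here is a minimal sketch of how the discriminator's scores can be turned into training signals for both models. It assumes the discriminator outputs a probability that a sample is real and uses binary cross-entropy with the common non-saturating generator loss; the function names and the example scores are hypothetical, and a real GAN would compute these losses inside a deep learning framework.

```python
import math

def bce(predictions, label):
    """Mean binary cross-entropy of predicted probabilities against one label."""
    eps = 1e-12  # keeps log() away from zero
    total = 0.0
    for p in predictions:
        p = min(max(p, eps), 1 - eps)
        total += -(label * math.log(p) + (1 - label) * math.log(1 - p))
    return total / len(predictions)

def discriminator_loss(d_real, d_fake):
    # The discriminator wants real samples scored 1 and generated samples scored 0.
    return bce(d_real, 1) + bce(d_fake, 0)

def generator_loss(d_fake):
    # The generator wants its samples to be scored as real.
    return bce(d_fake, 1)

# Hypothetical discriminator scores (probability that a sample is real).
d_real = [0.9, 0.8, 0.95]   # scores on real samples
d_fake = [0.1, 0.3, 0.2]    # scores on generated samples

print("discriminator loss:", discriminator_loss(d_real, d_fake))
print("generator loss:", generator_loss(d_fake))
```

In this example the discriminator is winning, so its loss is low and the generator's loss is high. Training alternates updates so that each model lowers its own loss, which pushes the generator toward more realistic output.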

Tabular data can be challenging for a GAN model to synthesize. This is due in part to the heterogeneous nature of tabular data (think of data that looks like a spreadsheet). A single table will often include continuous data, categorical data, text, and even Boolean values. Each data type presents unique issues a GAN model must resolve to synthesize data accurately.

Example: Continuous data can follow any of a variety of distributions (Normal, Uniform, Exponential, and others). Each distribution has its own shape and properties. To complicate matters further, a table with continuous data can have different distributions from one column to the next.
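
As a small illustration of that last point, the sketch below builds a toy table whose continuous columns each follow a different distribution. The column names and distribution parameters are made up for the example, and only the Python standard library is used.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable
n = 1000

# Hypothetical table: one continuous column per distribution.
table = {
    "normal": [random.gauss(50.0, 5.0) for _ in range(n)],
    "uniform": [random.uniform(0.0, 1.0) for _ in range(n)],
    "exponential": [random.expovariate(0.5) for _ in range(n)],
}

# Each column has a different center, spread, and range.
for name, column in table.items():
    print(
        name,
        "mean=%.2f" % statistics.mean(column),
        "stdev=%.2f" % statistics.stdev(column),
        "min=%.2f" % min(column),
        "max=%.2f" % max(column),
    )
```

A generator that treats every column the same way would struggle to reproduce all three shapes at once, which is part of what makes tabular data hard.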

Categorical data is often defined as the possible results of a random variable that can take on one of K possible categories, with each category's probability separately specified. Categorical data can be imbalanced or sparse, or can even have too many categories to label easily. To convert categorical values into numbers a GAN model can use, the Data Scientist can apply a variety of encoding techniques. One such technique, one-hot encoding, represents each category as a binary vector that has a single high value (1) with all other values low (0). Almost by definition, a one-hot encoded categorical variable results in a sparse data set.
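
One-hot encoding as described above can be sketched in a few lines of plain Python. The helper name `one_hot_encode` and the sample column are hypothetical; in practice a Data Scientist would more likely reach for an existing encoder from a data library.

```python
def one_hot_encode(values):
    """Return the sorted category list and one binary vector per value."""
    categories = sorted(set(values))
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)  # all positions low (0)
        row[index[value]] = 1        # single high value (1)
        vectors.append(row)
    return categories, vectors

# Hypothetical categorical column.
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot_encode(colors)

print(categories)  # ['blue', 'green', 'red']
for color, row in zip(colors, encoded):
    print(color, row)

# Share of zeros in the encoded data: this grows with the number of categories.
zeros = sum(row.count(0) for row in encoded)
print("fraction of zeros:", zeros / (len(encoded) * len(categories)))
```

With only three categories, two thirds of the encoded values are already zero. A column with hundreds of categories would be almost entirely zeros, which is the sparsity the text refers to.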

A Conditional Tabular Generative Adversarial Network, or CTGAN model, resolves most of the challenges tabular data presents when generating synthetic data. It handles the heterogeneous data problem while preserving the characteristics of the original data in the new (synthetic) data.

Data plays a key role in increasing the performance and accuracy of any AI/ML model, including GAN-type models. More data will (usually) be better than less, and high-quality, reliable data is required for explainable, repeatable results.

The use of GAN models to generate pixels in an image, music, voices, synthetic data, or just about anything for that matter is both exciting and humbling. A GAN model can generate a new symphony, create a new image, or enhance an existing one. Generating synthetic data allows the Data Scientist to train and test models even when real data is scarce. In short, this is a powerful tool for Data Scientists to keep in their toolbox.

