Generating Synthetic Data
AI and ML models need data for training, testing, and validation. This is true not just of LLMs but of all models, even those working on more mundane tasks such as modeling IoT devices, identifying fraud in financial transactions, detecting anomalies, and predicting the weather. Building an AI/ML model requires your Data Science team to find and understand data. Most often this is the data you have accumulated over the years and stored in your EDW or Data Lake. But what happens when there is not enough data to properly train and test a model? This is where synthetic data comes into the picture.
Wikipedia defines synthetic data as: “Synthetic data is information that is artificially generated rather than produced by real-world events. Typically created using algorithms, synthetic data can be deployed to validate mathematical models and to train machine learning models. Data generated by a computer simulation can be seen as synthetic data.”
Synthetic data can be created with what is known as a Generative Adversarial Network, or GAN. GANs were first described by Ian Goodfellow and his colleagues in their paper Generative Adversarial Nets, from June of 2014. Since then, GANs have proliferated into many variants, each with unique properties and capabilities. GAN models can be found generating data in areas as diverse as image synthesis, image-to-image translation, anomaly detection, drug discovery, and molecular generation.
A GAN is composed of two opposing deep learning models, often convolutional neural networks. The generator model creates new data, while the discriminator model attempts to classify each sample as ‘real’ or ‘fake’. The discriminator's classifications are fed back to train both models continuously: the discriminator gets better at spotting fakes and the generator gets better at fooling it, resulting in fake data that looks and acts like real data.
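The tug-of-war between the two models can be written as a single objective. The formulation below is the minimax value function from the original Generative Adversarial Nets paper. The symbols do not appear in this article and are introduced here only for illustration: G is the generator, D is the discriminator, x is a real sample, and z is the random noise the generator turns into a fake sample.

```latex
\min_{G} \max_{D} V(D, G) =
  \mathbb{E}_{x \sim p_{\text{data}}(x)}\bigl[\log D(x)\bigr]
  + \mathbb{E}_{z \sim p_{z}(z)}\bigl[\log\bigl(1 - D(G(z))\bigr)\bigr]
```

The discriminator tries to push this value up by scoring real samples near 1 and generated samples near 0. The generator tries to push it down by producing samples that the discriminator scores as real.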
Tabular data (think data that looks like a spreadsheet) can be challenging for a GAN model to synthesize. This is due in part to its heterogeneous nature: a single table will often include continuous data, categorical data, text, and even Boolean values. Each data type presents unique issues that a GAN model must resolve to synthesize data accurately.
Example: Continuous data can follow any of a variety of distributions (Normal, Uniform, Exponential, and others). Each distribution has its own shape and properties. To complicate matters further, a table with continuous data can have a different distribution in each column.
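To make the point concrete, here is a minimal Python sketch that builds a small table whose columns each follow a different continuous distribution and then prints a few summary statistics per column. The column names, parameters, and row count are hypothetical and chosen only for illustration. It uses nothing beyond the Python standard library.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

N_ROWS = 1000  # hypothetical table size

# Hypothetical columns, each drawn from a different continuous distribution.
table = {
    "sensor_temp": [random.gauss(20.0, 2.0) for _ in range(N_ROWS)],     # Normal
    "dial_position": [random.uniform(0.0, 1.0) for _ in range(N_ROWS)],  # Uniform
    "wait_seconds": [random.expovariate(0.5) for _ in range(N_ROWS)],    # Exponential
}


def summarize(values):
    """Return a few statistics that hint at the shape of one column."""
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.fmean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
    }


summary = {name: summarize(column) for name, column in table.items()}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The three columns differ in range, spread, and symmetry. The exponential column, for instance, is skewed, so its mean sits above its median. A generator has to reproduce each of these shapes at the same time.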
Categorical data is often defined as the possible results of a random variable that can take on one of K possible categories, with each category's probability separately specified. Categorical data can be imbalanced, sparse, or have too many categories to be labeled easily. To convert these categorical elements into numbers a GAN model can use, the Data Scientist applies one of a variety of encoding techniques. One such technique, one-hot encoding, represents each category as a binary vector that has a single high value (1) and low values (0) everywhere else. Almost by definition, a one-hot encoded categorical variable results in a sparse data set.
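A minimal sketch of one-hot encoding in plain Python follows. The helper name one_hot_encode and the payment_types column are hypothetical and used only for illustration. In practice a Data Scientist would more likely reach for an encoder from a data library.

```python
def one_hot_encode(values):
    """Map each category to a binary vector containing a single 1."""
    categories = sorted(set(values))  # sorted for a repeatable column order
    index = {category: i for i, category in enumerate(categories)}
    vectors = []
    for value in values:
        row = [0] * len(categories)   # all positions low (0)
        row[index[value]] = 1         # one position high (1)
        vectors.append(row)
    return categories, vectors


# Hypothetical categorical column.
payment_types = ["card", "cash", "wire", "card", "check", "card"]
categories, encoded = one_hot_encode(payment_types)

print(categories)
for value, row in zip(payment_types, encoded):
    print(value, row)

# Fraction of cells that are zero, a rough measure of sparsity.
zeros = sum(row.count(0) for row in encoded)
sparsity = zeros / (len(encoded) * len(categories))
print(f"fraction of zeros: {sparsity:.2f}")
```

With K categories, every row holds exactly one 1 and K - 1 zeros, so the fraction of zeros grows as the number of categories grows. This is why a one-hot encoded column with many categories becomes very sparse.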
A Conditional Tabular Generative Adversarial Network, or CTGAN, model resolves most of the challenges tabular data presents when generating synthetic data. The CTGAN model handles the heterogeneous data problem while preserving the characteristics of the original data in the new (synthetic) data.
Data plays a key role in increasing the performance and accuracy of any AI/ML model, including GAN-type models. More data will (usually) be better than less, and high-quality, reliable data is required for explainable, repeatable results.
The use of GAN models to generate pixels in an image, music, voices, synthetic data, or just about anything for that matter is both exciting and humbling. A GAN model can generate a new symphony, create a new image, or enhance an existing one. The generation of synthetic data allows the Data Scientist to train and test models even when there is not enough real data. In short, this is a powerful tool for Data Scientists to keep in their toolbox.