
Real-world data is often expensive, messy, and limited by privacy rules. Synthetic data offers a solution, and it is already widely used:
- LLMs train on AI-generated text
- Fraud systems simulate edge cases
- Vision models train on fake images
SDV (Synthetic Data Vault) is an open-source Python library that generates realistic tabular data using machine learning. It learns patterns from real data and creates high-quality synthetic data for safe sharing, testing, and model training.
In this tutorial, we will use SDV to generate synthetic data step by step.
First, install the SDV library with `pip install sdv`. Then we load our data:
from sdv.io.local import CSVHandler
connector = CSVHandler()
FOLDER_NAME = '.' # If the data is in the same directory
data = connector.read(folder_name=FOLDER_NAME)
salesDf = data['data']
Next, we import the required modules and connect to the local folder containing our dataset files. The CSVHandler reads the CSV files from the specified folder and stores them as pandas DataFrames. In this case, we work with the main dataset, data['data'].
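Conceptually, CSVHandler just scans a folder for CSV files and loads each one as a table keyed by its file name. Here is a rough standard-library sketch of that idea (an illustration only, not SDV's implementation: SDV returns pandas DataFrames, this toy version returns lists of dicts, and the column names are made up):

```python
import csv
import tempfile
from pathlib import Path

def read_csv_folder(folder_name):
    """Load every *.csv in a folder into a dict keyed by file name (no extension)."""
    tables = {}
    for path in sorted(Path(folder_name).glob("*.csv")):
        with open(path, newline="") as f:
            tables[path.stem] = list(csv.DictReader(f))
    return tables

# Usage: create a small 'data.csv' in a temp folder and read the folder back.
folder = tempfile.mkdtemp()
with open(Path(folder) / "data.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["Transaction ID", "Date", "Sales"])
    writer.writerow(["T000001", "15-01-2023", "120.5"])

tables = read_csv_folder(folder)
print(list(tables))  # ['data'] -- same key as data['data'] above
```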
from sdv.metadata import Metadata
metadata = Metadata.load_from_json('metadata.json')
Now we import the metadata for our dataset. This metadata is stored in a JSON file and tells SDV how to interpret your data. It includes:
- the table name
- the primary key
- the data type of each column
- optional column formats, such as a datetime pattern or an ID regex
- table relationships (for multi-table setups)
Here is a sample metadata.json format:
{
  "METADATA_SPEC_VERSION": "V1",
  "tables": {
    "your_table_name": {
      "primary_key": "your_primary_key_column",
      "columns": {
        "your_primary_key_column": { "sdtype": "id", "regex_format": "T[0-9]{6}" },
        "date_column": { "sdtype": "datetime", "datetime_format": "%d-%m-%Y" },
        "category_column": { "sdtype": "categorical" },
        "numeric_column": { "sdtype": "numerical" }
      },
      "column_relationships": []
    }
  }
}
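If you write metadata.json by hand, a quick standard-library round-trip can catch structural typos before SDV ever sees the file. A minimal sketch, reusing the placeholder table and column names from the sample above:

```python
import json
import tempfile
from pathlib import Path

# The same placeholder structure as the sample metadata.json above.
metadata_spec = {
    "METADATA_SPEC_VERSION": "V1",
    "tables": {
        "your_table_name": {
            "primary_key": "your_primary_key_column",
            "columns": {
                "your_primary_key_column": {"sdtype": "id", "regex_format": "T[0-9]{6}"},
                "date_column": {"sdtype": "datetime", "datetime_format": "%d-%m-%Y"},
                "category_column": {"sdtype": "categorical"},
                "numeric_column": {"sdtype": "numerical"},
            },
            "column_relationships": [],
        }
    },
}

path = Path(tempfile.mkdtemp()) / "metadata.json"
path.write_text(json.dumps(metadata_spec, indent=2))

# Reload and check that every table's primary key is declared as a column.
loaded = json.loads(path.read_text())
for name, table in loaded["tables"].items():
    pk = table["primary_key"]
    assert pk in table["columns"], f"{name}: primary key {pk!r} has no column entry"
print("metadata.json looks structurally sound")
```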
from sdv.metadata import Metadata
metadata = Metadata.detect_from_dataframes(data)
Alternatively, we can let SDV detect the metadata automatically. However, the results may not always be accurate or complete, so you should review the detected metadata and correct any anomalies.
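Automatic detection essentially inspects each column's values and guesses an sdtype. Here is a crude standard-library illustration of the idea (not SDV's actual detection logic, and the datetime format is an assumption): try to parse the values as dates, then as numbers, and fall back to categorical.

```python
from datetime import datetime

def guess_sdtype(values, datetime_format="%d-%m-%Y"):
    """Crude sdtype guess: datetime first, then numerical, else categorical."""
    try:
        for v in values:
            datetime.strptime(v, datetime_format)
        return "datetime"
    except (ValueError, TypeError):
        pass
    try:
        for v in values:
            float(v)
        return "numerical"
    except (ValueError, TypeError):
        return "categorical"

print(guess_sdtype(["15-01-2023", "16-01-2023"]))  # datetime
print(guess_sdtype(["120.5", "99"]))               # numerical
print(guess_sdtype(["North", "South"]))            # categorical
```

Even this toy version shows why review matters: a column of IDs like "123456" would be guessed as numerical, which is exactly the kind of anomaly worth correcting in the detected metadata.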
from sdv.single_table import GaussianCopulaSynthesizer
synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(data=salesDf)
synthetic_data = synthesizer.sample(num_rows=10000)
With the metadata and the original dataset prepared, we can now use SDV to train a model and generate synthetic data. The model learns the structure and patterns of your actual dataset and uses that knowledge to create synthetic records.
You can control how many rows to generate with the num_rows argument.
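The GaussianCopulaSynthesizer models each column's marginal distribution separately and captures the dependence between columns with a Gaussian copula. As a rough intuition, here is a toy two-column version using only the standard library (an illustration of the idea, not SDV's implementation; the column values are made up):

```python
import random
import statistics

def gaussian_copula_sample(col_a, col_b, n, seed=0):
    """Toy 2-column Gaussian copula: empirical marginals per column,
    dependence captured as one correlation in normal-score space."""
    rng = random.Random(seed)
    nd = statistics.NormalDist()
    m = len(col_a)

    # 1. Ranks -> uniforms -> standard-normal scores for each column.
    def normal_scores(col):
        order = sorted(range(m), key=lambda i: col[i])
        scores = [0.0] * m
        for rank, i in enumerate(order):
            scores[i] = nd.inv_cdf((rank + 0.5) / m)
        return scores

    za, zb = normal_scores(col_a), normal_scores(col_b)

    # 2. Pearson correlation of the normal scores.
    ma, mb = sum(za) / m, sum(zb) / m
    cov = sum((a - ma) * (b - mb) for a, b in zip(za, zb))
    var_a = sum((a - ma) ** 2 for a in za)
    var_b = sum((b - mb) ** 2 for b in zb)
    rho = cov / (var_a * var_b) ** 0.5

    # 3. Sample correlated normals (2x2 Cholesky), then
    # 4. map back through each column's empirical quantiles.
    sa, sb = sorted(col_a), sorted(col_b)
    out_a, out_b = [], []
    for _ in range(n):
        u1, u2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = u1
        y = rho * u1 + (1 - rho ** 2) ** 0.5 * u2
        out_a.append(sa[min(m - 1, int(nd.cdf(x) * m))])
        out_b.append(sb[min(m - 1, int(nd.cdf(y) * m))])
    return out_a, out_b

# Usage: two positively correlated toy columns.
real_a = [10, 20, 30, 40, 50, 60, 70, 80]
real_b = [1, 2, 3, 4, 5, 6, 7, 8]
syn_a, syn_b = gaussian_copula_sample(real_a, real_b, n=5)
```

SDV's real synthesizer additionally fits parametric marginal distributions and handles categorical, datetime, and ID columns, but the separation of "per-column shape" from "cross-column dependence" is the same core design.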
from sdv.evaluation.single_table import evaluate_quality
quality_report = evaluate_quality(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    metadata=metadata
)
The SDV library also provides tools to evaluate synthetic data quality by comparing it against the original dataset. A good starting point is generating a quality report.
You can also visualize how the synthetic data compares with the actual data using SDV's built-in plotting tools. For example, import get_column_plot from sdv.evaluation.single_table to create comparison plots for specific columns:
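Under the hood, the "column shapes" part of such a report compares the distribution of each column in the real and synthetic data. Roughly in the spirit of the KSComplement metric that SDMetrics uses for numerical columns, here is a standard-library sketch: one minus the Kolmogorov-Smirnov statistic, so 1.0 means identical shapes and 0.0 means completely disjoint (a simplified illustration, not the library's exact implementation):

```python
def ks_statistic(real, synthetic):
    """Maximum gap between the two empirical CDFs (0 = identical shapes)."""
    r, s = sorted(real), sorted(synthetic)
    max_gap = 0.0
    for v in sorted(set(r) | set(s)):
        cdf_r = sum(1 for x in r if x <= v) / len(r)
        cdf_s = sum(1 for x in s if x <= v) / len(s)
        max_gap = max(max_gap, abs(cdf_r - cdf_s))
    return max_gap

def column_shape_score(real, synthetic):
    """KSComplement-style score: 1.0 = identical, 0.0 = disjoint."""
    return 1.0 - ks_statistic(real, synthetic)

print(column_shape_score([1, 2, 3, 4], [1, 2, 3, 4]))      # 1.0
print(column_shape_score([1, 2, 3, 4], [10, 11, 12, 13]))  # 0.0
```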
from sdv.evaluation.single_table import get_column_plot
fig = get_column_plot(
    real_data=salesDf,
    synthetic_data=synthetic_data,
    column_name="Sales",
    metadata=metadata
)
fig.show()
We can see that the distribution of the 'Sales' column in the real and synthetic data is very similar. To dig deeper, we can use matplotlib to create a more detailed comparison, such as plotting the average monthly sales trend in both datasets.
import pandas as pd
import matplotlib.pyplot as plt
# Ensure 'Date' columns are datetime
salesDf['Date'] = pd.to_datetime(salesDf['Date'], format="%d-%m-%Y")
synthetic_data['Date'] = pd.to_datetime(synthetic_data['Date'], format="%d-%m-%Y")
# Extract 'Month' as year-month string
salesDf['Month'] = salesDf['Date'].dt.to_period('M').astype(str)
synthetic_data['Month'] = synthetic_data['Date'].dt.to_period('M').astype(str)
# Group by 'Month' and calculate average sales
actual_avg_monthly = salesDf.groupby('Month')['Sales'].mean().rename('Actual Average Sales')
synthetic_avg_monthly = synthetic_data.groupby('Month')['Sales'].mean().rename('Synthetic Average Sales')
# Merge the two series into a DataFrame
avg_monthly_comparison = pd.concat([actual_avg_monthly, synthetic_avg_monthly], axis=1).fillna(0)
# Plot
plt.figure(figsize=(10, 6))
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Actual Average Sales'], label="Actual Average Sales", marker="o")
plt.plot(avg_monthly_comparison.index, avg_monthly_comparison['Synthetic Average Sales'], label="Synthetic Average Sales", marker="o")
plt.title('Average Monthly Sales Comparison: Actual vs Synthetic')
plt.xlabel('Month')
plt.ylabel('Average Sales')
plt.xticks(rotation=45)
plt.grid(True)
plt.legend()
plt.ylim(bottom=0) # y-axis starts at 0
plt.tight_layout()
plt.show()
This chart also shows that the average monthly sales in both datasets are very similar, with only minor differences.
In this tutorial, we demonstrated how to prepare your data and metadata for synthetic data generation using the SDV library. By training a model on the original dataset, SDV can create high-quality synthetic data that closely mirrors the patterns and distributions of the real data. We also explored how to evaluate and visualize the synthetic data, confirming that key metrics such as the sales distribution and monthly trends match. Synthetic data offers a powerful way to overcome privacy and availability challenges, enabling robust data analysis and machine learning workflows.
See the notebook on GitHub. All credit for this research goes to the researchers of this project.
I hold a Bachelor's degree in Civil Engineering (2022) from Jamia Millia Islamia, New Delhi, and I have a keen interest in data science, especially neural networks and their applications in various fields.