Synthetic data with SDV and Gaussian copulas

The Synthetic Data Vault (SDV) provides a collection of models for generating synthetic tabular data. The GaussianCopulaSynthesizer uses copulas to learn each column's distribution and the correlations between columns from real data, then generates synthetic rows that approximate these statistical properties rather than copying real records. This reduces direct disclosure of individual rows, but it is not by itself a formal privacy guarantee.

This notebook demonstrates synthetic data generation using the Gaussian Copula model from SDV. We’ll generate 1000 synthetic samples and compare them visually with the original data to assess how well the synthetic data captures the underlying patterns.

import pandas as pd

data = pd.read_csv("_data/sdv/svm-hyperparameters-train-features.csv")
data.head()
   Pclass  Sex     Age  SibSp  Parch    Fare
0       3    1  31.093      0      0  23.543
1       1    0  24.970      1      0   2.891
2       2    1  23.686      0      0  44.785
3       1    0  29.087      0      0   5.743
4       3    1  34.438      0      0   6.241
data.describe(include="all")
           Pclass         Sex         Age       SibSp       Parch         Fare
count  891.000000  891.000000  891.000000  891.000000  891.000000   891.000000
mean     2.332211    0.657688   29.841903    0.151515    0.072952    34.329742
std      0.827658    0.474750   12.695561    0.422067    0.288853    61.412245
min      1.000000    0.000000    0.420000    0.000000    0.000000     0.291000
25%      2.000000    0.000000   21.027000    0.000000    0.000000     7.145000
50%      3.000000    1.000000   30.077000    0.000000    0.000000    15.688000
75%      3.000000    1.000000   38.204000    0.000000    0.000000    38.715500
max      3.000000    1.000000   72.943000    3.000000    2.000000  1035.184000

We start by detecting the table metadata from our dataset with SingleTableMetadata, then pass it to a GaussianCopulaSynthesizer. The Gaussian Copula model works by:

  1. Modelling univariate distributions: Each column (Age, Fare, etc.) is fitted to an appropriate distribution
  2. Capturing correlations: A Gaussian copula models the dependencies between variables
  3. Generating samples: New synthetic data is drawn from the learned model
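The three steps above can be sketched in a few lines of NumPy. This is an illustration of the general technique, not SDV's implementation: it assumes empirical (rank-based) marginals in place of the parametric distributions SDV fits to each column, and the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical.

```python
from statistics import NormalDist

import numpy as np

_STD_NORMAL = NormalDist()  # standard normal, stdlib only


def fit_gaussian_copula(X):
    """Return the correlation matrix of the normal scores of each column."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Step 1: map each column to (0, 1) through its empirical CDF.
    # Ties receive arbitrary distinct ranks, which is fine for a sketch.
    ranks = X.argsort(axis=0).argsort(axis=0) + 1
    u = ranks / (n + 1)
    # Convert the uniforms to standard normal scores.
    z = np.vectorize(_STD_NORMAL.inv_cdf)(u)
    # Step 2: the Gaussian copula is described by the correlation of the scores.
    return np.corrcoef(z, rowvar=False)


def sample_gaussian_copula(X, corr, num_rows, seed=0):
    """Draw rows whose marginals follow X and whose dependence follows corr."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 3: correlated normals -> uniforms -> inverse empirical CDF per column.
    z = rng.multivariate_normal(np.zeros(X.shape[1]), corr, size=num_rows)
    u = np.vectorize(_STD_NORMAL.cdf)(z)
    columns = [np.quantile(X[:, j], u[:, j]) for j in range(X.shape[1])]
    return np.column_stack(columns)
```

Because the inverse step interpolates between observed values, this sketch can never produce a value outside the observed range of a column. A fitted parametric marginal, as used by SDV, behaves differently in the tails.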
import warnings
warnings.filterwarnings('ignore')

from sdv.single_table import GaussianCopulaSynthesizer
from sdv.metadata import SingleTableMetadata

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)

model = GaussianCopulaSynthesizer(metadata=metadata)
model.fit(data)
N_SAMPLES = 1000
new_df = model.sample(num_rows=N_SAMPLES)
new_df.head()
   Pclass  Sex     Age  SibSp  Parch    Fare
0       1    0  11.885      0      0  63.659
1       2    1  21.457      1      0  49.929
2       1    1  20.368      0      0   7.978
3       2    1  46.439      0      0  33.400
4       2    1  13.356      0      0   9.036
new_df.describe()
            Pclass          Sex          Age        SibSp        Parch         Fare
count  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000  1000.000000
mean      2.333000     0.650000    30.549068     0.149000     0.097000    35.136762
std       0.809176     0.477208    11.927531     0.411057     0.331211    37.531116
min       1.000000     0.000000     0.420000     0.000000     0.000000     0.296000
25%       2.000000     0.000000    22.349000     0.000000     0.000000     8.063000
50%       3.000000     1.000000    30.835000     0.000000     0.000000    23.365500
75%       3.000000     1.000000    39.257500     0.000000     0.000000    48.249250
max       3.000000     1.000000    68.072000     3.000000     2.000000   254.054000
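The summary statistics only describe each column on its own. To check how well the dependencies between columns were captured, one option is to compare the pairwise correlation matrices of the two tables. The helper below is a minimal sketch using pandas only; `correlation_gap` is a hypothetical name, not part of SDV, and it assumes all shared columns are numeric.

```python
import pandas as pd


def correlation_gap(real: pd.DataFrame, synthetic: pd.DataFrame) -> pd.DataFrame:
    """Absolute difference between the Pearson correlation matrices of two tables."""
    # Only compare columns present in both tables, in the order of the real table.
    cols = [c for c in real.columns if c in synthetic.columns]
    return (real[cols].corr() - synthetic[cols].corr()).abs()
```

In this notebook it could be called as `correlation_gap(data, new_df).round(3)`. Entries close to zero indicate that a pair of columns is correlated to a similar degree in the real and synthetic data.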

Pclass and Sex are stored as integer codes, so automatic detection may treat them as numerical columns. Here we explicitly mark them as categorical in the metadata before fitting again.

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
metadata.update_column('Pclass', sdtype='categorical')
metadata.update_column('Sex', sdtype='categorical')

model = GaussianCopulaSynthesizer(metadata=metadata)
model.fit(data)
new_df = model.sample(num_rows=N_SAMPLES)
new_df.head()
   Pclass  Sex     Age  SibSp  Parch    Fare
0       1    0  11.885      0      0  63.659
1       2    1  21.457      1      0  49.929
2       1    1  20.368      0      0   7.978
3       2    1  46.439      0      0  33.400
4       2    1  13.356      0      0   9.036

data.Fare.describe()
count     891.000000
mean       34.329742
std        61.412245
min         0.291000
25%         7.145000
50%        15.688000
75%        38.715500
max      1035.184000
Name: Fare, dtype: float64

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(data)
metadata.update_column('Pclass', sdtype='categorical')
metadata.update_column('Sex', sdtype='categorical')

model = GaussianCopulaSynthesizer(metadata=metadata)
model.fit(data)
new_df = model.sample(num_rows=N_SAMPLES)
new_df.Fare.describe()
count    1000.000000
mean       35.136762
std        37.531116
min         0.296000
25%         8.063000
50%        23.365500
75%        48.249250
max       254.054000
Name: Fare, dtype: float64
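
In the outputs above, the synthetic Fare has a similar mean to the real one (35.1 vs 34.3) but a much smaller spread: the standard deviation is 37.5 vs 61.4 and the maximum is 254 vs 1035. A side-by-side view of the upper quantiles makes this kind of tail mismatch easier to see than `describe()` alone. The helper below is a sketch using pandas only; `compare_quantiles` and its default quantile levels are hypothetical choices, not part of SDV.

```python
import pandas as pd


def compare_quantiles(real: pd.Series, synthetic: pd.Series,
                      qs=(0.5, 0.9, 0.99, 1.0)) -> pd.DataFrame:
    """Side-by-side quantiles of a real and a synthetic column, indexed by level."""
    levels = list(qs)
    return pd.DataFrame({
        "real": real.quantile(levels),
        "synthetic": synthetic.quantile(levels),
    })
```

In this notebook it could be called as `compare_quantiles(data.Fare, new_df.Fare)`.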