The Synthetic Data Vault (SDV) provides a collection of models for generating high-quality synthetic tabular data. The GaussianCopulaSynthesizer uses statistical techniques based on copulas to learn the marginal distributions and correlations from real data, then generates synthetic data that preserves these statistical properties whilst aiming to protect the privacy of the original records.
This notebook demonstrates synthetic data generation using the Gaussian Copula model from SDV. We’ll generate 1000 synthetic samples and compare them visually with the original data to assess how well the synthetic data captures the underlying patterns.
import pandas as pd

data = pd.read_csv("_data/sdv/svm-hyperparameters-train-features.csv")
data.head()
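|   | Pclass | Sex | Age    | SibSp | Parch | Fare   |
|---|--------|-----|--------|-------|-------|--------|
| 0 | 3      | 1   | 31.093 | 0     | 0     | 23.543 |
| 1 | 1      | 0   | 24.970 | 1     | 0     | 2.891  |
| 2 | 2      | 1   | 23.686 | 0     | 0     | 44.785 |
| 3 | 1      | 0   | 29.087 | 0     | 0     | 5.743  |
| 4 | 3      | 1   | 34.438 | 0     | 0     | 6.241  |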
data.describe(include="all")
|       | Pclass     | Sex        | Age        | SibSp      | Parch      | Fare        |
|-------|------------|------------|------------|------------|------------|-------------|
| count | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000 | 891.000000  |
| mean  | 2.332211   | 0.657688   | 29.841903  | 0.151515   | 0.072952   | 34.329742   |
| std   | 0.827658   | 0.474750   | 12.695561  | 0.422067   | 0.288853   | 61.412245   |
| min   | 1.000000   | 0.000000   | 0.420000   | 0.000000   | 0.000000   | 0.291000    |
| 25%   | 2.000000   | 0.000000   | 21.027000  | 0.000000   | 0.000000   | 7.145000    |
| 50%   | 3.000000   | 1.000000   | 30.077000  | 0.000000   | 0.000000   | 15.688000   |
| 75%   | 3.000000   | 1.000000   | 38.204000  | 0.000000   | 0.000000   | 38.715500   |
| max   | 3.000000   | 1.000000   | 72.943000  | 3.000000   | 2.000000   | 1035.184000 |
We start by creating a GaussianCopulaSynthesizer, using metadata detected automatically from our dataset. The Gaussian Copula model works by:

1. Modelling univariate distributions: Each column (Age, Fare, etc.) is fitted to an appropriate distribution.
2. Capturing correlations: A Gaussian copula models the dependencies between variables.
3. Generating samples: New synthetic data is drawn from the learned model.
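
To make the three steps above concrete, here is a minimal sketch of the Gaussian copula idea using only NumPy and the standard library. It is not SDV's implementation: the function names `fit_gaussian_copula` and `sample_gaussian_copula` are hypothetical, the marginals are taken to be plain empirical distributions rather than fitted parametric ones, every column is assumed to be continuous, and the two-column dataset is made-up toy data rather than the dataset loaded above.

```python
from statistics import NormalDist

import numpy as np

# Element-wise standard-normal CDF and inverse CDF from the standard library.
_norm = NormalDist()
_cdf = np.vectorize(_norm.cdf)
_inv_cdf = np.vectorize(_norm.inv_cdf)


def fit_gaussian_copula(real):
    """Learn empirical marginals and a correlation matrix (hypothetical helper)."""
    real = np.asarray(real, dtype=float)
    n_rows = real.shape[0]
    # Step 1: univariate distributions. Each column's marginal is kept as its
    # sorted values, i.e. its empirical distribution (an assumption of this sketch).
    sorted_cols = np.sort(real, axis=0)
    # Rank-based probabilities strictly inside (0, 1), then standard-normal scores.
    ranks = real.argsort(axis=0).argsort(axis=0) + 1
    normal_scores = _inv_cdf(ranks / (n_rows + 1))
    # Step 2: correlations are measured between the normal scores.
    corr = np.corrcoef(normal_scores, rowvar=False)
    return sorted_cols, corr


def sample_gaussian_copula(sorted_cols, corr, num_rows, seed=0):
    """Draw synthetic rows from the fitted copula (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    n_cols = sorted_cols.shape[1]
    # Step 3: draw correlated normals, map them to uniforms, then push each
    # uniform through the inverse of the matching empirical marginal.
    z = rng.multivariate_normal(np.zeros(n_cols), corr, size=num_rows)
    u = _cdf(z)
    return np.column_stack(
        [np.quantile(sorted_cols[:, j], u[:, j]) for j in range(n_cols)]
    )


# Toy data: two positively related continuous columns (made up for illustration).
toy_rng = np.random.default_rng(42)
first = toy_rng.normal(30, 12, size=500)
second = np.exp(0.05 * first + toy_rng.normal(0, 0.5, size=500))
real = np.column_stack([first, second])

sorted_cols, corr = fit_gaussian_copula(real)
synthetic = sample_gaussian_copula(sorted_cols, corr, num_rows=1000)
print(synthetic.shape)
```

Because the marginals here are empirical, the synthetic values never fall outside the observed range of each column. Fitting a parametric distribution per column instead, as the first step in the list describes, allows values beyond that range.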