Synthetic Data Generation
Generating synthetic data
Synthetic data will be generated mainly for these two scenarios:
- Regression
- Classification
Here we will mainly look at the methods provided by scikit-learn to generate synthetic datasets. For more advanced methods, such as those in the SDV library, please check the SDV page; it supports approaches such as Gaussian copulas, CTGAN and CopulaGAN.
Regression data
A regression dataset consists of a set of numerical input features and a continuous target that depends on them. For this section we will mainly use scikit-learn’s make_regression method, and for reproducibility we will set a random_state. We will create a dataset using make_regression’s random linear regression model, with input features $x=(f_1,f_2,f_3,f_4)$ and an output $y$.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from scipy.stats import linregress

# Fix the seed so the generated data is reproducible (the value itself is arbitrary)
random_state = 42

N_FEATURES = 4
N_TARGETS = 1
N_SAMPLES = 100

dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)
print(dataset[0][:10])
print(dataset[1][:10])
[[ 0.87305874 -1.63096187 0.52538404 -0.19035824]
[ 1.00698671 0.79834941 -0.04057655 -0.31358605]
[-0.61464273 1.65110321 0.75791487 -0.0039844 ]
[-1.08536678 1.82337823 0.4612592 -1.72325306]
[-1.67774847 -0.54401341 0.86347869 -0.30250463]
[-0.02427254 0.75537599 -0.04644972 -0.85153564]
[-0.48085576 0.82100952 -0.9390196 -0.25870492]
[-0.66772841 -2.46244005 -0.19855095 -1.85756579]
[-0.29810663 -0.02239635 0.25363492 -1.22688366]
[ 1.48146924 0.38269965 -1.18208819 -1.31062148]]
[ 20.00449025 -30.41054677 52.65371365 -119.26376184 33.78805456
-78.12189078 -88.41673748 -177.21674804 -90.13920313 -197.90799195]
Let’s turn this dataset into a Pandas DataFrame:
df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
df.head()
| | f1 | f2 | f3 | f4 | y |
|---|---|---|---|---|---|
| 0 | 0.873 | -1.631 | 0.525 | -0.190 | 20.004 |
| 1 | 1.007 | 0.798 | -0.041 | -0.314 | -30.411 |
| 2 | -0.615 | 1.651 | 0.758 | -0.004 | 52.654 |
| 3 | -1.085 | 1.823 | 0.461 | -1.723 | -119.264 |
| 4 | -1.678 | -0.544 | 0.863 | -0.303 | 33.788 |
Let’s plot the data:
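A minimal plotting sketch, assuming matplotlib is available; the layout and styling are illustrative. It scatters each feature against the target:
import matplotlib.pyplot as plt

# Scatter each input feature against the target y
fig, axes = plt.subplots(1, N_FEATURES, figsize=(16, 4), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(df[f"f{i+1}"], df["y"], s=20)
    ax.set_xlabel(f"$f_{i+1}$")
axes[0].set_ylabel("$y$")
plt.tight_layout()
plt.show()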
Changing the Gaussian noise level
The noise parameter in make_regression sets the standard deviation of the zero-centred Gaussian noise added to the output.
dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=2.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)
df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
Visualising increasing noise
Let’s set the noise to $10^i$ for $i=0, 1, 2$ (i.e. noise levels of 1, 10 and 100) and see what the data looks like.
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))


def create_noisy_data(noise):
    return make_regression(
        n_samples=N_SAMPLES,
        n_features=1,
        n_informative=1,
        n_targets=1,
        bias=0.0,
        effective_rank=None,
        tail_strength=0.5,
        noise=noise,
        shuffle=True,
        coef=False,
        random_state=random_state,
    )


for i in range(3):
    data = create_noisy_data(10 ** i)
    df[f"f{i+1}"] = data[0].ravel()  # single feature, flattened to 1-D
    df[f"y{i+1}"] = data[1]
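One way to visualise the three noise levels side by side (a sketch; it assumes matplotlib was imported as above and uses the column names created in the loop):
plt.figure(figsize=(15, 4))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(df[f"f{i+1}"], df[f"y{i+1}"], s=20)
    plt.title(f"noise $= 10^{i}$")
    plt.xlabel("$x$")
    plt.ylabel("$y$")
plt.tight_layout()
plt.show()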
Classification data
To generate data for classification we will use the make_classification method.
from sklearn.datasets import make_classification

N = 4
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
df.head()
| | f1 | f2 | f3 | f4 | y |
|---|---|---|---|---|---|
| 0 | -3.216 | -0.416 | -1.295 | -1.882 | 0 |
| 1 | -1.426 | -1.257 | -1.734 | -1.804 | 0 |
| 2 | 2.798 | -3.010 | -1.085 | -3.134 | 1 |
| 3 | 0.633 | 2.502 | -1.553 | 1.625 | 1 |
| 4 | 1.494 | 0.912 | -1.887 | -1.457 | 1 |
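A quick look at the two classes in feature space (a sketch; it plots f1 against f2 with the same grey styling used later in this section):
plt.figure(figsize=(6, 5))
plt.scatter(df["f1"], df["f2"], s=50, c=df["y"], cmap="gray", edgecolor="gray")
plt.xlabel("$f_1$")
plt.ylabel("$f_2$")
plt.show()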
Cluster separation
According to the docs1, class_sep is the factor multiplying the hypercube size.
Larger values spread out the clusters/classes and make the classification task easier.
N_FEATURES = 4
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=3.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
We can make the clusters harder to separate by decreasing the value of class_sep.
N_FEATURES = 4
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=0.5,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
Noise level
According to the documentation2, flip_y is the fraction of samples whose class is assigned randomly.
Larger values introduce noise in the labels and make the classification task harder.
N_FEATURES = 4
for i in range(6):
    data = make_classification(
        n_samples=N_SAMPLES,
        n_features=N_FEATURES,
        n_informative=4,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=1,
        weights=None,
        flip_y=0.1 * i,
        class_sep=1.0,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=False,
        random_state=random_state,
    )
    df = pd.DataFrame(data[0], columns=[f"f{j+1}" for j in range(N_FEATURES)])
    df["y"] = data[1]
    # Plot the first two features, coloured by class label
    plt.subplot(2, 3, i + 1)
    plt.title(f"$flip_y={round(0.1*i, 2)}$")
    plt.scatter(
        df["f1"],
        df["f2"],
        s=50,
        c=df["y"],
        cmap="gray",
        edgecolor="gray",
    )
    plt.xlabel("$f_1$")
    plt.ylabel("$f_2$")
    ax = plt.gca()
    ax.set_facecolor((247.0 / 255.0, 239.0 / 255.0, 217.0 / 255.0))
    plt.tight_layout()
plt.tight_layout(pad=3.0)
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))
for i in range(3):
    data = make_classification(
        n_samples=N_SAMPLES,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=1,
        weights=None,
        flip_y=0,
        class_sep=i + 0.5,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=False,
        random_state=random_state,
    )
    df[f"f{i+1}1"] = data[0][:, 0]
    df[f"f{i+1}2"] = data[0][:, 1]
    df[f"t{i+1}"] = data[1]
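To see the effect of class_sep, a plotting sketch (assumptions as before: matplotlib is available and the column names follow the loop above):
plt.figure(figsize=(15, 4))
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plt.scatter(
        df[f"f{i+1}1"],
        df[f"f{i+1}2"],
        s=30,
        c=df[f"t{i+1}"],
        cmap="gray",
        edgecolor="gray",
    )
    plt.title(f"class_sep $= {i + 0.5}$")
    plt.xlabel("$f_1$")
    plt.ylabel("$f_2$")
plt.tight_layout()
plt.show()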
It is noteworthy that many parameters of scikit-learn’s synthetic data generators accept per-feature or per-class values: we simply pass an array instead of a scalar. In make_classification, for instance, weights takes one value per class, while shift and scale take one value per feature; a sketch with array values follows the call below.
N = 4
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
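For example, a sketch of the same call with array-valued parameters (the particular values are illustrative): weights introduces a 90/10 class imbalance and shift moves each feature by a different amount.
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=[0.9, 0.1],                     # one proportion per class
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=np.array([0.0, 1.0, -1.0, 2.0]),  # one shift per feature
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]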
Separability
To generate isotropic Gaussian clusters we can use the make_blobs method.
from sklearn.datasets import make_blobs
N_FEATURES = 4
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=1.0,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
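As a rough quantitative check of how separable the clusters are, one option (a sketch) is scikit-learn’s silhouette_score; values closer to 1 indicate well-separated clusters.
from sklearn.metrics import silhouette_score

# Silhouette score of the generated features against their cluster labels
print(f"silhouette score: {silhouette_score(data[0], data[1]):.3f}")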
To make the clusters more separable we can decrease cluster_std.
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=0.3,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
By increasing cluster_std we make them less separable.
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=2.5,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
Anisotropic data
data = make_blobs(n_samples=50, n_features=2, centers=3, cluster_std=1.5)
# Apply a linear transformation to stretch and rotate the blobs
transformation = [[0.5, -0.5], [-0.4, 0.8]]
data_0 = np.dot(data[0], transformation)
df = pd.DataFrame(data_0, columns=[f"f{i}" for i in range(1, 3)])
df["y"] = data[1]
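A sketch comparing the blobs before and after the transformation (layout and styling are illustrative):
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
axes[0].scatter(data[0][:, 0], data[0][:, 1], c=data[1], cmap="gray", edgecolor="gray")
axes[0].set_title("original blobs")
axes[1].scatter(df["f1"], df["f2"], c=df["y"], cmap="gray", edgecolor="gray")
axes[1].set_title("after the linear transformation")
plt.tight_layout()
plt.show()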
Concentric clusters
Sometimes we might be interested in creating clusters that are not linearly separable. The simplest way is to create concentric clusters with the make_circles method.
from sklearn.datasets import make_circles
data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
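A sketch of the two concentric clusters (the equal axis scaling is an added assumption so the circles keep their shape):
plt.figure(figsize=(5, 5))
plt.scatter(df["f1"], df["f2"], s=30, c=df["y"], cmap="gray", edgecolor="gray")
plt.gca().set_aspect("equal")
plt.xlabel("$f_1$")
plt.ylabel("$f_2$")
plt.show()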
Adding noise
The noise parameter (the standard deviation of the Gaussian noise added to the data) lets us create a noisy version of the concentric dataset.
data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=0.15, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Moon clusters
A shape that can be useful for other methods (Counterfactuals, for instance) is the one generated by the make_moons method.
from sklearn.datasets import make_moons
data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Adding noise
As usual, the noise parameter controls the amount of Gaussian noise added to the data.
data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=0.1, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Time-series data
Random walk
See [Random walk].
Simple periodic
No trend
Generate a simple periodic signal: a sine wave whose amplitude is modulated by random noise:
import numpy as np


def generate_sine(period, n):
    cycles = n / period
    length = np.pi * 2 * cycles
    return np.sin(np.arange(0, length, length / n))
We will now generate a set of $n=1000$ observations with a period of $p=10$:
N = 1000
data = generate_sine(10, N) * np.random.uniform(10, size=N)
Trend
N = 1000
data = (generate_sine(10, N) * np.random.uniform(10, size=N)) + np.arange(N) / 200.0
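A quick line plot of the trending series generated above (a sketch; matplotlib assumed as before):
plt.figure(figsize=(12, 3))
plt.plot(data)
plt.xlabel("$t$")
plt.ylabel("value")
plt.tight_layout()
plt.show()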
Univariate data
Using the streamad library:
from streamad.util.dataset import CustomDS
from streamad.util import StreamGenerator, plot
ds = CustomDS("../../data/streamad/uniDS.csv")
stream = StreamGenerator(ds.data)
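A sketch of how the stream would typically be consumed, one observation at a time; this assumes StreamGenerator exposes an iter_item() generator, as in the streamad examples.
for x in stream.iter_item():
    # x is a single observation from the stream; in practice it would be
    # passed to a streaming anomaly detector here
    pass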