# Synthetic Data Generation

## Generating synthetic data

Synthetic data will be used mainly in these scenarios:

- Regression
- Classification

Here we will mainly look at the methods provided by scikit-learn to generate synthetic datasets. For more advanced methods, such as those in the SDV library, please check the SDV page. It supports methods such as Gaussian copulas, CTGAN and CopulaGAN.

## Regression data

What does a regression dataset consist of? A set of input features and a continuous target variable that depends on them.

For this section we will mainly use scikit-learn’s `make_regression` method.

For reproducibility, we will set a fixed `random_state`.

We will create a dataset using `make_regression`’s random linear regression model with input features $x=(f_1,f_2,f_3,f_4)$ and an output $y$.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from scipy.stats import linregress

# Fixed seed for reproducibility (any integer works; the printed values depend on it)
random_state = 42

N_FEATURES = 4
N_TARGETS = 1
N_SAMPLES = 100

dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)

print(dataset[0][:10])
print(dataset[1][:10])
```

```
[[ 0.87305874 -1.63096187  0.52538404 -0.19035824]
 [ 1.00698671  0.79834941 -0.04057655 -0.31358605]
 [-0.61464273  1.65110321  0.75791487 -0.0039844 ]
 [-1.08536678  1.82337823  0.4612592  -1.72325306]
 [-1.67774847 -0.54401341  0.86347869 -0.30250463]
 [-0.02427254  0.75537599 -0.04644972 -0.85153564]
 [-0.48085576  0.82100952 -0.9390196  -0.25870492]
 [-0.66772841 -2.46244005 -0.19855095 -1.85756579]
 [-0.29810663 -0.02239635  0.25363492 -1.22688366]
 [ 1.48146924  0.38269965 -1.18208819 -1.31062148]]
[  20.00449025  -30.41054677   52.65371365 -119.26376184   33.78805456
  -78.12189078  -88.41673748 -177.21674804  -90.13920313 -197.90799195]
```


Let’s turn this dataset into a Pandas DataFrame:

```python
df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
```



|   | f1 | f2 | f3 | f4 | y |
|---|----|----|----|----|---|
| 0 | 0.873 | -1.631 | 0.525 | -0.190 | 20.004 |
| 1 | 1.007 | 0.798 | -0.041 | -0.314 | -30.411 |
| 2 | -0.615 | 1.651 | 0.758 | -0.004 | 52.654 |
| 3 | -1.085 | 1.823 | 0.461 | -1.723 | -119.264 |
| 4 | -1.678 | -0.544 | 0.863 | -0.303 | 33.788 |

Let’s plot the data:
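One way to do this (a sketch, assuming matplotlib is available) is to scatter each feature against the target and overlay a least-squares fit computed with the `linregress` import from earlier:

```python
import matplotlib.pyplot as plt

fig, axes = plt.subplots(1, N_FEATURES, figsize=(16, 4), sharey=True)
for i, ax in enumerate(axes):
    feature = f"f{i+1}"
    ax.scatter(df[feature], df["y"], s=20)
    # Overlay a least-squares fit to highlight the informative features
    fit = linregress(df[feature], df["y"])
    xs = np.linspace(df[feature].min(), df[feature].max(), 100)
    ax.plot(xs, fit.intercept + fit.slope * xs, color="red")
    ax.set_xlabel(f"$f_{i+1}$")
axes[0].set_ylabel("$y$")
plt.tight_layout()
plt.show()
```

Since `n_informative=2`, only two of the four features should show a clear linear relationship with $y$.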

### Changing the Gaussian noise level

The `noise` parameter in `make_regression` sets the standard deviation of the zero-centred Gaussian noise applied to the output.

```python
dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=2.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)

df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
```


### Visualising increasing noise

Let’s set the noise to $10^i$, for $i = 0, 1, 2$, and see what the data looks like.

```python
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))

def create_noisy_data(noise):
    return make_regression(
        n_samples=N_SAMPLES,
        n_features=1,
        n_informative=1,
        n_targets=1,
        bias=0.0,
        effective_rank=None,
        tail_strength=0.5,
        noise=noise,
        shuffle=True,
        coef=False,
        random_state=random_state,
    )

for i in range(3):
    data = create_noisy_data(10 ** i)
    # Flatten the (N_SAMPLES, 1) feature matrix into a single column
    df[f"f{i+1}"] = data[0].ravel()
    df[f"y{i+1}"] = data[1]
```


## Classification data

To generate data for classification we will use the `make_classification` method.

```python
from sklearn.datasets import make_classification

N = 4
data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
```


|   | f1 | f2 | f3 | f4 | y |
|---|----|----|----|----|---|
| 0 | -3.216 | -0.416 | -1.295 | -1.882 | 0 |
| 1 | -1.426 | -1.257 | -1.734 | -1.804 | 0 |
| 2 | 2.798 | -3.010 | -1.085 | -3.134 | 1 |
| 3 | 0.633 | 2.502 | -1.553 | 1.625 | 1 |
| 4 | 1.494 | 0.912 | -1.887 | -1.457 | 1 |
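To see the two classes, we can scatter the first two features coloured by label (a sketch, matplotlib assumed):

```python
import matplotlib.pyplot as plt

plt.scatter(df["f1"], df["f2"], c=df["y"], s=50, edgecolor="gray")
plt.xlabel("$f_1$")
plt.ylabel("$f_2$")
plt.show()
```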

### Cluster separation

According to the docs, `class_sep` is the factor multiplying the hypercube size.

Larger values spread out the clusters/classes and make the classification task easier.

```python
N_FEATURES = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=3.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)

df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
```


We can make the classes harder to separate by decreasing the value of `class_sep`.

```python
N_FEATURES = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=0.5,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)

df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
```


### Noise level

According to the documentation, `flip_y` is the fraction of samples whose class is assigned randomly.

Larger values introduce noise in the labels and make the classification task harder.

```python
import matplotlib.pyplot as plt

N_FEATURES = 4

plt.figure(figsize=(12, 8))
for i in range(6):
    data = make_classification(
        n_samples=N_SAMPLES,
        n_features=N_FEATURES,
        n_informative=4,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=1,
        weights=None,
        flip_y=0.1 * i,
        class_sep=1.0,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=False,
        random_state=random_state,
    )
    df = pd.DataFrame(data[0], columns=[f"f{j+1}" for j in range(N_FEATURES)])
    df["y"] = data[1]
    plt.subplot(2, 3, i + 1)
    plt.title(rf"$flip\_y={round(0.1 * i, 2)}$")
    plt.scatter(
        df["f1"],
        df["f2"],
        s=50,
        c=df["y"],
        cmap="gray",
        edgecolor="gray",
    )
    plt.xlabel("$f_1$")
    plt.ylabel("$f_2$")
    ax = plt.gca()
    ax.set_facecolor((247.0 / 255.0, 239.0 / 255.0, 217.0 / 255.0))
plt.tight_layout()
plt.show()
```


```python
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))

for i in range(3):
    data = make_classification(
        n_samples=N_SAMPLES,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=1,
        weights=None,
        flip_y=0,
        class_sep=i + 0.5,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=False,
        random_state=random_state,
    )
    df[f"f{i+1}1"] = data[0][:, 0]
    df[f"f{i+1}2"] = data[0][:, 1]
    df[f"t{i+1}"] = data[1]
```


It is noteworthy that many parameters in scikit-learn’s synthetic data generators accept inputs per feature, class, or cluster. To do so, we simply pass the parameter value as an array. For instance, to scale each feature differently we can pass one multiplier per feature to `scale` (we also set `shuffle=False` so that the columns keep their order):

```python
N = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=[1.0, 2.0, 10.0, 100.0],  # one multiplier per feature
    shuffle=False,  # keep sample and feature order so columns match the scales
    random_state=random_state,
)

df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
```


## Separability

To explore how separable the generated clusters are, we will use the `make_blobs` method.

```python
from sklearn.datasets import make_blobs

N_FEATURES = 4
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=1.0,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
```


To make the clusters more separable we can decrease `cluster_std`.

```python
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=0.3,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
```


Conversely, by increasing `cluster_std` we make them less separable.

```python
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=2.5,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
```


### Anisotropic data

We can generate anisotropic clusters by applying a linear transformation to blob data:

```python
data = make_blobs(n_samples=50, n_features=2, centers=3, cluster_std=1.5)

# A linear transformation that stretches and rotates the blobs
transformation = [[0.5, -0.5], [-0.4, 0.8]]

data_0 = np.dot(data[0], transformation)
df = pd.DataFrame(data_0, columns=[f"f{i}" for i in range(1, 3)])
df["y"] = data[1]
```


## Concentric clusters

Sometimes we might be interested in creating clusters that are not linearly separable. The simplest way is to create concentric clusters with the `make_circles` method.

```python
from sklearn.datasets import make_circles

data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
```


The `noise` parameter allows us to create a noisy concentric dataset.

```python
data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=0.15, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
```


## Moon clusters

A shape that can be useful for other methods (counterfactuals, for instance) is the one generated by the `make_moons` method.

```python
from sklearn.datasets import make_moons

data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
```


As usual, the `noise` parameter allows us to control the noise level.

```python
data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=0.1, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
```


## Time-series data

### Random walk

See [Random walk].
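For completeness, a minimal one-dimensional random walk can be generated with NumPy alone (a sketch; the linked page covers the topic in depth):

```python
import numpy as np

rng = np.random.default_rng(42)  # assumed seed, for reproducibility

# A random walk is the cumulative sum of random unit steps
steps = rng.choice([-1, 1], size=1000)
walk = np.cumsum(steps)
```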

### Simple periodic

#### No trend

Generate a simple periodic signal: a deterministic sine state with random noise applied to the observations. First, a helper to generate the sine wave:

```python
import numpy as np

def generate_sine(period, n):
    # n points covering n / period full cycles
    cycles = n / period
    length = np.pi * 2 * cycles
    return np.sin(np.arange(0, length, length / n))
```


We will now generate a set of $n=1000$ observations with period $p=10$:

```python
N = 1000
# Random amplitude in [0, 10) applied pointwise to the sine wave
data = generate_sine(10, N) * np.random.uniform(0, 10, size=N)
```


#### Trend

Adding a linear trend term to the same signal:

```python
N = 1000
data = (generate_sine(10, N) * np.random.uniform(0, 10, size=N)) + np.arange(N) / 200.0
```
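A sketch (matplotlib assumed) to plot the trended series together with the trend component:

```python
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 4))
plt.plot(data, label="sine + noise + trend")
plt.plot(np.arange(N) / 200.0, label="trend")
plt.legend()
plt.show()
```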


### Univariate data

Using the streamad library:

```python
from streamad.util.dataset import CustomDS
```