Generating synthetic data
Synthetic data generation methods can be broadly categorised into two approaches:
Statistical/Tabular Methods: Using statistical distributions and predefined structures (e.g., scikit-learn methods)
Generative AI Methods: Using deep learning and LLMs to learn and replicate data distributions (e.g., SDV, CTGAN, LLMs)
This page covers tabular synthetic data generation using scikit-learn. For advanced generative AI methods, such as those in the SDV library, please check the SDV page, which covers Gaussian copulas, CTGAN and CopulaGAN.
Tabular data generation
Regression data
What does a regression consist of?
For this section we will mainly use scikit-learn’s make_regression method.
For reproducibility, we will set a random_state.
We will create a dataset using make_regression’s random linear regression model, with input features x = (f_1, f_2, f_3, f_4) and an output y that is a linear combination of the informative features (plus optional noise).
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from scipy.stats import linregress

N_FEATURES = 4
N_TARGETS = 1
N_SAMPLES = 100
random_state = 42

dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=0.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)
print(dataset[0][:10])
print(dataset[1][:10])
[[-1.91877122 -0.02651388 -0.07444592 0.25755039]
[ 1.58601682 -1.2378155 0.66213067 0.11351735]
[-1.51936997 -0.48423407 1.03246526 2.1221562 ]
[ 0.81351722 -1.23086432 -0.32206152 -0.78325329]
[ 1.6324113 -1.43014138 -1.24778318 -0.25256815]
[ 1.15811087 0.79166269 -1.21418861 -1.00601738]
[-2.6197451 0.8219025 1.56464366 -0.03582604]
[-1.26088395 0.91786195 0.40498171 1.76545424]
[-0.60170661 1.85227818 -0.29169375 -0.60063869]
[ 0.34175598 1.87617084 0.15039379 -0.75913266]]
[ 8.64828851 22.81204176 116.25950813 -41.29626371 -44.59229411
-74.92833829 41.29233008 84.30697131 -32.89077765 -27.37843081]
Let’s turn this dataset into a Pandas DataFrame:
df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
df.head()
|   | f1        | f2        | f3        | f4        | y          |
|---|-----------|-----------|-----------|-----------|------------|
| 0 | -1.918771 | -0.026514 | -0.074446 | 0.257550  | 8.648289   |
| 1 | 1.586017  | -1.237815 | 0.662131  | 0.113517  | 22.812042  |
| 2 | -1.519370 | -0.484234 | 1.032465  | 2.122156  | 116.259508 |
| 3 | 0.813517  | -1.230864 | -0.322062 | -0.783253 | -41.296264 |
| 4 | 1.632411  | -1.430141 | -1.247783 | -0.252568 | -44.592294 |
Let’s plot the data:
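The original page shows a figure at this point; a minimal matplotlib sketch that reproduces the idea (the layout and styling are my assumptions, not the original plot) could be:
import matplotlib.pyplot as plt

# Scatter each feature against the target to see which features are informative.
fig, axes = plt.subplots(1, N_FEATURES, figsize=(16, 4), sharey=True)
for i, ax in enumerate(axes):
    ax.scatter(df[f"f{i+1}"], df["y"], s=10)
    ax.set_xlabel(f"f{i+1}")
axes[0].set_ylabel("y")
plt.tight_layout()
plt.show()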
Changing the Gaussian noise level
The noise parameter in make_regression lets us adjust the standard deviation of the zero-centred Gaussian noise applied to the output.
dataset = make_regression(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=2,
    n_targets=N_TARGETS,
    bias=0.0,
    effective_rank=None,
    tail_strength=0.5,
    noise=2.0,
    shuffle=True,
    coef=False,
    random_state=random_state,
)
df = pd.DataFrame(data=dataset[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = dataset[1]
Visualising increasing noise
Let’s scale the noise by 10^i for i = 0, 1, 2 (noise levels of 1, 10 and 100) and see what the data looks like.
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))

def create_noisy_data(noise):
    return make_regression(
        n_samples=N_SAMPLES,
        n_features=1,
        n_informative=1,
        n_targets=1,
        bias=0.0,
        effective_rank=None,
        tail_strength=0.5,
        noise=noise,
        shuffle=True,
        coef=False,
        random_state=random_state,
    )

for i in range(3):
    data = create_noisy_data(10 ** i)
    df[f"f{i+1}"] = data[0]
    df[f"y{i+1}"] = data[1]
Classification data
To generate data for classification we will use the make_classification method.
from sklearn.datasets import make_classification

N = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
df.head()
|   | f1        | f2        | f3        | f4        | y |
|---|-----------|-----------|-----------|-----------|---|
| 0 | -0.385042 | 0.633662  | 1.049620  | 1.196063  | 1 |
| 1 | -1.944212 | -2.993269 | -4.524798 | -0.957576 | 1 |
| 2 | -1.184505 | -1.493694 | -2.188634 | -0.308396 | 0 |
| 3 | 1.221447  | 2.673526  | 1.088408  | 0.242603  | 1 |
| 4 | -2.694502 | -3.098131 | -2.710986 | -2.322962 | 0 |
Cluster separation
According to the docs, class_sep is the factor multiplying the hypercube size.
Larger values spread out the clusters/classes and make the classification task easier.
N_FEATURES = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=3.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
We can make the classes harder to separate by decreasing the value of class_sep.
N_FEATURES = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N_FEATURES,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=0.5,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
Noise level
According to the documentation, flip_y is the fraction of samples whose class is assigned randomly.
Larger values introduce noise in the labels and make the classification task harder.
df = pd.DataFrame(data=np.zeros((N_SAMPLES, 1)))

for i in range(3):
    data = make_classification(
        n_samples=N_SAMPLES,
        n_features=2,
        n_informative=2,
        n_redundant=0,
        n_repeated=0,
        n_classes=2,
        n_clusters_per_class=1,
        weights=None,
        flip_y=i * 0.2,  # increasing label noise: 0.0, 0.2, 0.4
        class_sep=1.0,
        hypercube=True,
        shift=0.0,
        scale=1.0,
        shuffle=False,
        random_state=random_state,
    )
    df[f"f{i+1}1"] = data[0][:, 0]
    df[f"f{i+1}2"] = data[0][:, 1]
    df[f"t{i+1}"] = data[1]
It is noteworthy that many scikit-learn parameters for synthetic data generation accept per-feature or per-cluster inputs. To do so, we simply pass the parameter value as an array. For instance, weights in make_classification can be given as a list to control the proportion of samples assigned to each class (see the sketch after the next code block).
N = 4

data = make_classification(
    n_samples=N_SAMPLES,
    n_features=N,
    n_informative=4,
    n_redundant=0,
    n_repeated=0,
    n_classes=2,
    n_clusters_per_class=1,
    weights=None,
    flip_y=0.01,
    class_sep=1.0,
    hypercube=True,
    shift=0.0,
    scale=1.0,
    shuffle=True,
    random_state=random_state,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N)])
df["y"] = data[1]
Cluster separability
The make_blobs method generates isotropic Gaussian clusters, which is convenient for clustering examples.
from sklearn.datasets import make_blobs

N_FEATURE = 4

data = make_blobs(
    n_samples=60,
    n_features=N_FEATURE,
    centers=3,
    cluster_std=1.0,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURE)])
df["y"] = data[1]
To make the clusters more separable we can decrease cluster_std.
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=0.3,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
By increasing cluster_std we make the clusters less separable.
data = make_blobs(
    n_samples=60,
    n_features=N_FEATURES,
    centers=3,
    cluster_std=2.5,
    center_box=(-5.0, 5.0),
    shuffle=True,
    random_state=None,
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(N_FEATURES)])
df["y"] = data[1]
Anisotropic data
Applying a linear transformation to blob data stretches and skews the clusters, producing anisotropically distributed data.
data = make_blobs(n_samples=50, n_features=2, centers=3, cluster_std=1.5)
transformation = [[0.5, -0.5], [-0.4, 0.8]]
data_0 = np.dot(data[0], transformation)
df = pd.DataFrame(data_0, columns=[f"f{i}" for i in range(1, 3)])
df["y"] = data[1]
Concentric clusters
Sometimes we might be interested in creating clusters that are not linearly separable. The simplest way is to create concentric clusters with the make_circles method.
from sklearn.datasets import make_circles

data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Adding noise
The noise parameter allows us to add Gaussian noise to the concentric circles.
data = make_circles(
    n_samples=N_SAMPLES, shuffle=True, noise=0.15, random_state=random_state, factor=0.6
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Moon clusters
A shape that can be useful to other methods (such as Counterfactuals, for instance) is the one generated by the make_moons method.
from sklearn.datasets import make_moons

data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=None, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Adding noise
As usual, the noise parameter allows us to control the noise level.
data = make_moons(
    n_samples=N_SAMPLES, shuffle=True, noise=0.1, random_state=random_state
)
df = pd.DataFrame(data[0], columns=[f"f{i+1}" for i in range(2)])
df["y"] = data[1]
Generative AI for synthetic data
For more advanced synthetic data generation using deep learning and generative models, please see:
SDV (Synthetic Data Vault): A comprehensive library for synthetic data generation
Gaussian Copulas: Statistical method for modelling multivariate distributions
CTGAN: Conditional Tabular GAN for high-quality synthetic tabular data
CopulaGAN: GAN combined with copulas for improved data quality
These methods learn the underlying data distribution from real datasets and generate new synthetic samples that preserve statistical properties and relationships between features.
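As a taste of what the SDV page covers, here is a minimal sketch assuming SDV 1.x and its GaussianCopulaSynthesizer API (see the SDV page for the full treatment):
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# Learn the distribution of a real (or synthetic) table, then sample new rows from it.
metadata = SingleTableMetadata()
metadata.detect_from_dataframe(df)

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(df)
synthetic_df = synthesizer.sample(num_rows=100)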
Text generation
For generating synthetic text data, Large Language Models (LLMs) are commonly used:
LLM-based generation: Using models like GPT, Claude, or Llama to generate synthetic text samples
Fine-tuning: Training language models on domain-specific datasets to generate more targeted synthetic text
Prompt engineering: Designing prompts to guide LLMs to generate specific types of text (e.g., medical notes, legal documents, customer reviews)
Data augmentation: Using LLMs to paraphrase, expand, or diversify existing text datasets
These approaches can generate realistic text that maintains the style, tone, and domain characteristics of the training data, useful for privacy-preserving data sharing and increasing dataset diversity.
Time-series data
Simple periodic
No trend
Generate a simple sine wave, which we will then perturb with random noise to obtain the observations:
import numpy as np

def generate_sine(period, n):
    cycles = n / period
    length = np.pi * 2 * cycles
    return np.sin(np.arange(0, length, length / n))
We will now generate a set of n=1000 observations with a period of p=10:
N = 1000
data = generate_sine(10, N) * np.random.uniform(10, size=N)
Trend
N = 1000
data = (generate_sine(10, N) * np.random.uniform(10, size=N)) + np.arange(N) / 200.0
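A quick matplotlib sketch (my addition) to compare the two series; both are regenerated here so they can be plotted side by side:
import matplotlib.pyplot as plt

no_trend = generate_sine(10, N) * np.random.uniform(10, size=N)
with_trend = no_trend + np.arange(N) / 200.0

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 6), sharex=True)
ax1.plot(no_trend)
ax1.set_title("Periodic, no trend")
ax2.plot(with_trend)
ax2.set_title("Periodic, with linear trend")
plt.tight_layout()
plt.show()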
Univariate data
Using the streamad library:
from streamad.util.dataset import CustomDS
from streamad.util import StreamGenerator, plot

ds = CustomDS("_data/streamad/uniDS.csv")
stream = StreamGenerator(ds.data)
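The stream can then be consumed one observation at a time. A hedged sketch, assuming streamad's iter_item() generator and its SpotDetector model (names as given in the streamad documentation; adapt to the detector you need):
from streamad.model import SpotDetector

# Score each incoming point for anomalies as it arrives on the stream.
detector = SpotDetector()
scores = []
for x in stream.iter_item():
    scores.append(detector.fit_score(x))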