15  Synthetic Data Generation

Real datasets come with inconvenient constraints. Privacy regulations may prevent sharing customer records across teams or with vendors. Rare events — fraud transactions, equipment failures, disease diagnoses — appear too infrequently to train robust models. Class imbalance distorts classifiers toward the majority class. And in early development, the data we need may not exist yet.

Synthetic data addresses all of these. Depending on the method, we can generate records that preserve the statistical relationships of a real dataset without exposing individual records, oversample minority classes while respecting feature correlations, or create datasets with arbitrary properties to test a pipeline before production data arrives.

We cover simple statistical generation, controlled synthetic datasets, class imbalance oversampling with SMOTE, preserving joint distributions with the Synthetic Data Vault, generative model approaches, basic privacy concepts, and how to evaluate whether synthetic data is actually useful.

Code
import numpy as np
import pandas as pd
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import make_classification, make_regression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.preprocessing import StandardScaler

sns.set_style("whitegrid")
np.random.seed(42)
print("Libraries loaded.")

15.1 Generating from Distributions

The simplest form of synthetic data is sampling from parametric distributions. We define the marginal distribution of each feature, generate independently, and optionally impose correlation structure using a multivariate normal or a copula.

This approach works well when we know the rough shape of each variable — which we can read from the real data summary statistics — and when the correlations between variables are modest. It breaks down when variables have complex, non-linear dependencies or when the joint distribution has multimodal structure that a normal copula cannot capture.

Code
# Simulate a customer transaction dataset from scratch
n = 2000

# Correlated features via Cholesky decomposition
corr = np.array([
    [1.0,  0.55, -0.3],   # age
    [0.55, 1.0,  -0.2],   # income
    [-0.3,-0.2,   1.0],   # num_complaints
])
L = np.linalg.cholesky(corr)
Z = np.random.randn(n, 3) @ L.T

df = pd.DataFrame({
    "age":           np.clip(30 + 12 * Z[:, 0], 18, 75).astype(int),
    "annual_income": np.clip(60000 + 25000 * Z[:, 1], 20000, 200000).astype(int),
    "num_complaints":np.clip(np.round(0.5 + 1.2 * Z[:, 2]), 0, 8).astype(int),
    "tenure_years":  np.random.exponential(3.5, n).clip(0, 20).round(1),
    "channel":       np.random.choice(["web","mobile","branch"], n, p=[0.5,0.35,0.15]),
})
churn_logit = -3.2 + 0.4*df["num_complaints"] - 0.05*df["tenure_years"]  # churn driven by complaints and tenure
df["churned"] = (np.random.rand(n) < 1/(1+np.exp(-churn_logit))).astype(int)

print(df.head())
print()
print(df.describe().T.round(2))
print()
print(f"Churn rate: {df.churned.mean():.1%}")
print(f"Age-Income correlation: {df['age'].corr(df['annual_income']):.3f}  (target: 0.55)")
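The copula route mentioned above can be sketched from scratch: draw correlated standard normals, push them through the normal CDF to get correlated uniforms, then push those through each target marginal's inverse CDF. The distributions and parameters below are illustrative, not tied to the transaction dataset.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 5000

# Target correlation between the two latent normals
corr = np.array([[1.0, 0.6],
                 [0.6, 1.0]])
L = np.linalg.cholesky(corr)

# Step 1: correlated standard normals
Z = rng.standard_normal((n, 2)) @ L.T

# Step 2: normal CDF -> correlated uniforms in (0, 1)
U = stats.norm.cdf(Z)

# Step 3: inverse CDFs impose arbitrary marginals
income = stats.lognorm.ppf(U[:, 0], s=0.5, scale=55000)  # right-skewed
tenure = stats.expon.ppf(U[:, 1], scale=3.5)             # exponential

# Marginals are non-normal, but the rank correlation survives
rho, _ = stats.spearmanr(income, tenure)
print(f"Spearman correlation: {rho:.2f}")
```

Because the inverse-CDF step is monotone, rank correlation passes through essentially intact even though the marginals are heavily skewed; this is exactly the property that makes copulas useful for mixed-shape tabular data.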

15.2 Controlled Synthetic Datasets with scikit-learn

When the goal is to test a modeling pipeline rather than to mimic a specific real dataset, scikit-learn’s make_classification and make_regression offer precise control over the properties of the data: number of informative features, class overlap, cluster structure, noise level, and class weight.

These functions are particularly useful for understanding how an algorithm behaves under different conditions — how performance degrades with more noise, how many features are needed for good accuracy, what happens with extreme class imbalance — before committing to a real dataset where those factors are confounded with each other.

Code
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Vary class imbalance and measure effect on AUC
weights = [(0.5,0.5), (0.8,0.2), (0.9,0.1), (0.95,0.05), (0.99,0.01)]
results = []

for w in weights:
    X, y = make_classification(
        n_samples=2000, n_features=10, n_informative=5,
        n_redundant=2, weights=list(w), flip_y=0.01, random_state=42
    )
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, stratify=y)
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train, y_train)
    auc = roc_auc_score(y_test, model.predict_proba(X_test)[:,1])
    minority_pct = f"{w[1]:.0%}"
    results.append({"minority_%": minority_pct, "AUC": round(auc,3),
                    "minority_n_train": int(y_train.sum())})

print(pd.DataFrame(results).to_string(index=False))
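The same sweep idea applies to regression. A short sketch varying make_regression's noise parameter and watching test R² degrade (noise values chosen for illustration):

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Sweep the noise level and measure its effect on held-out R^2
rows = []
for noise in [0, 10, 50, 100]:
    X, y = make_regression(n_samples=1000, n_features=8, n_informative=4,
                           noise=noise, random_state=42)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)
    model = LinearRegression().fit(X_tr, y_tr)
    rows.append({"noise": noise,
                 "R2": round(r2_score(y_te, model.predict(X_te)), 3)})

print(pd.DataFrame(rows).to_string(index=False))
```

With noise=0 the relationship is exactly linear and R² is 1.0; each increase in noise lowers the ceiling any model can reach, which is the kind of controlled degradation a real dataset never lets you observe in isolation.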

15.3 Handling Class Imbalance with SMOTE

When the minority class is rare, a classifier can achieve high accuracy simply by always predicting the majority class. SMOTE (Synthetic Minority Over-sampling Technique) addresses this by generating synthetic minority-class examples along the line segments connecting existing minority examples in feature space.

Unlike simple random oversampling (which just duplicates existing records), SMOTE creates new interpolated points, which tends to produce smoother decision boundaries and better generalization. Several variants exist:

  • SMOTE: baseline — interpolate between a minority point and one of its k-nearest minority neighbors
  • BorderlineSMOTE: focuses on minority examples near the class boundary, where the model is most uncertain
  • ADASYN: weights generation by difficulty — more synthetic samples near regions where the model struggles

Install: pip install imbalanced-learn

An important caution: SMOTE should be applied only to the training set, after the train/test split. Applying it before splitting causes data leakage and inflates test performance.

Code
# Baseline imbalanced vs SMOTE-resampled comparison
try:
    from imblearn.over_sampling import SMOTE
    smote_available = True
except ImportError:
    smote_available = False
    print("pip install imbalanced-learn to run this cell")

if smote_available:
    X, y = make_classification(
        n_samples=3000, n_features=10, n_informative=5,
        weights=[0.92, 0.08], flip_y=0.01, random_state=42
    )
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, stratify=y, random_state=0)

    # Baseline: train on imbalanced data
    clf_base = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_base.fit(X_tr, y_tr)
    auc_base = roc_auc_score(y_te, clf_base.predict_proba(X_te)[:,1])

    # SMOTE: resample training set only
    sm = SMOTE(random_state=42)
    X_res, y_res = sm.fit_resample(X_tr, y_tr)
    clf_sm = RandomForestClassifier(n_estimators=100, random_state=42)
    clf_sm.fit(X_res, y_res)
    auc_sm = roc_auc_score(y_te, clf_sm.predict_proba(X_te)[:,1])

    print(f"Original minority class: {y_tr.sum()} / {len(y_tr)} ({y_tr.mean():.1%})")
    print(f"After SMOTE:             {y_res.sum()} / {len(y_res)} ({y_res.mean():.1%})")
    print()
    print(f"AUC — baseline:      {auc_base:.4f}")
    print(f"AUC — with SMOTE:    {auc_sm:.4f}")
    print()
    print("Classification report (SMOTE model):")
    print(classification_report(y_te, clf_sm.predict(X_te)))
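The interpolation step at the heart of SMOTE fits in a few lines of NumPy. The sketch below is a from-scratch illustration of the idea only, not a substitute for imbalanced-learn, which also handles the variants above and many edge cases; the function name and test data are invented for the example.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic points by interpolating between a random
    minority point and one of its k nearest minority neighbors."""
    rng = np.random.default_rng(seed)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)                      # idx[:, 0] is the point itself
    base = rng.integers(0, len(X_min), n_new)          # anchor points
    nbr = idx[base, rng.integers(1, k + 1, n_new)]     # a random neighbor of each anchor
    lam = rng.random((n_new, 1))                       # interpolation factor in [0, 1)
    return X_min[base] + lam * (X_min[nbr] - X_min[base])

rng = np.random.default_rng(42)
X_min = rng.normal(loc=2.0, scale=0.5, size=(40, 3))   # 40 minority points
X_new = smote_sample(X_min, n_new=200)
print(X_new.shape)  # (200, 3)
```

Because every synthetic point is a convex combination of two existing minority points, no coordinate ever leaves the range spanned by the original minority sample, which is why SMOTE smooths the decision boundary rather than inventing outliers.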

15.4 Preserving Statistical Structure with SDV

Simple distribution sampling and SMOTE both ignore the joint distribution of the full dataset — the correlations, conditional distributions, and categorical-continuous relationships that make data realistic. The Synthetic Data Vault (SDV) library models the full joint distribution and generates records that preserve these relationships.

The GaussianCopulaSynthesizer fits a copula to model dependencies between variables, transforming each marginal to a standard normal and modeling the correlation structure of the resulting normal vectors. It handles mixed datatypes (continuous, categorical, datetime) and can enforce constraints (e.g., age > 0, start_date < end_date).

For more complex distributions, the CTGANSynthesizer uses a conditional GAN architecture specifically designed for tabular data, and TVAESynthesizer uses a variational autoencoder. Both handle multimodal distributions and complex interactions that the Gaussian copula misses.

Install: pip install sdv

Code
try:
    from sdv.single_table import GaussianCopulaSynthesizer
    from sdv.metadata import SingleTableMetadata
    sdv_available = True
except ImportError:
    sdv_available = False
    print("pip install sdv to run this cell")

if sdv_available:
    metadata = SingleTableMetadata()
    metadata.detect_from_dataframe(df)

    synth = GaussianCopulaSynthesizer(metadata)
    synth.fit(df)
    synthetic_df = synth.sample(num_rows=2000)

    print("Real data stats:")
    print(df[["age","annual_income","num_complaints"]].describe().T[["mean","std"]].round(1))
    print()
    print("Synthetic data stats:")
    print(synthetic_df[["age","annual_income","num_complaints"]].describe().T[["mean","std"]].round(1))
    print()
    print("Real correlation (age, income):     ", df["age"].corr(df["annual_income"]).round(3))
    print("Synthetic correlation (age, income):", synthetic_df["age"].corr(synthetic_df["annual_income"]).round(3))
else:
    # Show what the output looks like
    print("SDV preserves marginal distributions and pairwise correlations.")
    print("After fitting, synth.sample(num_rows=N) returns a DataFrame")
    print("with the same schema and statistical structure as the original.")

15.5 Privacy Considerations

Synthetic data is not automatically private. A model trained on sensitive data can memorize individual records, allowing an adversary to reconstruct them from synthetic samples. This is especially true of generative models trained on small datasets or to very high fidelity.

Differential privacy (DP) provides a formal guarantee: the output distribution of the generator changes by only a bounded amount when any single real record is added, removed, or modified, so no individual's presence in the training data can be confidently inferred from the synthetic output. The privacy parameter \(\varepsilon\) controls the tradeoff: smaller \(\varepsilon\) means stronger privacy but lower data quality.

In practice, DP-synthetic data is implemented by adding calibrated noise to the sufficient statistics of the generative model before sampling. The diffprivlib library (IBM) provides DP-aware ML algorithms. The smartnoise-sdk package provides DP synthetic data generation.

Even without formal DP, several practical measures reduce re-identification risk: capping extreme values, adding small amounts of noise, suppressing rare combinations of quasi-identifiers, and not generating synthetic records for groups smaller than k individuals (k-anonymity).

Code
# Demonstrate the Laplace mechanism: adding DP noise to a statistic.
# For pure epsilon-DP, Laplace noise with scale b = sensitivity / epsilon
# is the standard choice; Gaussian noise requires the (epsilon, delta) variant.
# (A toy illustration; real DP libraries handle sensitivity bounds and composition)

true_mean_income = df["annual_income"].mean()
n_records        = len(df)

# Sensitivity of the mean: max change one record can cause
# For data bounded in [lo, hi], sensitivity = (hi - lo) / n
lo, hi      = 20000, 200000
sensitivity = (hi - lo) / n_records

epsilon_values = [0.01, 0.1, 1.0, 10.0]
print(f"True mean income: ${true_mean_income:,.0f}")
print(f"Sensitivity:      ${sensitivity:.2f}")
print()
print("{:>8}  {:>16}  {:>18}  {:>12}".format("epsilon","noise scale b","DP mean estimate","abs error"))
for eps in epsilon_values:
    b        = sensitivity / eps                       # Laplace scale for epsilon-DP
    dp_mean  = true_mean_income + np.random.laplace(0, b)
    err      = abs(dp_mean - true_mean_income)
    print(f"{eps:>8.2f}  {b:>16,.1f}  ${dp_mean:>16,.0f}  ${err:>10,.0f}")
print()
print("Smaller epsilon = more privacy = more noise = less accurate estimate.")
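The k-anonymity-style suppression mentioned above is a short pandas operation: count each combination of quasi-identifiers and drop rows whose combination occurs fewer than k times. The column names and toy data below are invented for the example; choosing the right quasi-identifiers is domain-specific.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
toy = pd.DataFrame({
    "age_band": rng.choice(["18-30", "31-50", "51+"], 500, p=[0.2, 0.5, 0.3]),
    "zip3":     rng.choice(["021", "100", "606", "941"], 500),
    "channel":  rng.choice(["web", "mobile", "branch"], 500),
})

k = 10
quasi = ["age_band", "zip3", "channel"]

# Count rows per quasi-identifier combination, then keep only
# rows belonging to combinations with at least k members
counts = toy.groupby(quasi).size().reset_index(name="n")
safe = toy.merge(counts, on=quasi)
safe = safe[safe["n"] >= k].drop(columns="n")

print(f"Kept {len(safe)} of {len(toy)} rows; "
      f"smallest remaining group: {safe.groupby(quasi).size().min()}")
```

After suppression, every remaining combination of quasi-identifiers describes at least k individuals, so no released row can be narrowed to a group smaller than k.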

15.6 Evaluating Synthetic Data

Two properties matter: fidelity (does the synthetic data look like the real data statistically?) and utility (is a model trained on synthetic data useful for predicting on real data?).

Fidelity is measured by comparing marginal and joint distributions — KS tests per feature, chi-square tests for categoricals, correlation matrix comparison. SDV’s evaluate_quality function automates this.

Utility is measured by the Train on Synthetic, Test on Real (TSTR) protocol: train a downstream model on synthetic data, evaluate it on a held-out real test set, and compare its performance to a model trained on real data. A smaller gap between the two indicates better synthetic data utility.

Code
# Train on Synthetic, Test on Real (TSTR) evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

feature_cols = ["age","annual_income","num_complaints","tenure_years"]
target_col   = "churned"

X_real = df[feature_cols].values
y_real = df[target_col].values
X_tr_r, X_te_r, y_tr_r, y_te_r = train_test_split(X_real, y_real, test_size=0.25, random_state=0)

# Model trained on real data (upper bound)
scaler = StandardScaler()
lr_real = LogisticRegression(max_iter=1000)
lr_real.fit(scaler.fit_transform(X_tr_r), y_tr_r)
auc_real = roc_auc_score(y_te_r, lr_real.predict_proba(scaler.transform(X_te_r))[:,1])

# Model trained on a crude synthetic stand-in: bootstrap TRAINING rows
# (features and labels together, so their relationship is preserved)
# plus small feature noise. Random labels here would make the comparison meaningless.
rng_syn = np.random.RandomState(7)
idx   = rng_syn.randint(0, len(X_tr_r), 1000)
X_syn = X_tr_r[idx] + rng_syn.normal(0, X_tr_r.std(axis=0) * 0.05, (1000, X_tr_r.shape[1]))
y_syn = y_tr_r[idx]

scaler_syn = StandardScaler()   # separate scaler: fit on the synthetic training data
lr_syn = LogisticRegression(max_iter=1000)
lr_syn.fit(scaler_syn.fit_transform(X_syn), y_syn)
auc_syn = roc_auc_score(y_te_r, lr_syn.predict_proba(scaler_syn.transform(X_te_r))[:,1])

print("TSTR Evaluation:")
print(f"  AUC (trained on real data):      {auc_real:.4f}  <-- upper bound")
print(f"  AUC (trained on synthetic data): {auc_syn:.4f}")
print()
print("The closer the synthetic AUC to the real-data AUC, the more useful")
print("the synthetic data is as a substitute for the real thing.")

# Per-feature KS tests for fidelity
print("Kolmogorov-Smirnov fidelity tests (large p: no detectable difference):")
for col in feature_cols:
    real_vals = df[col].values
    # Bootstrap resample with small noise stands in for synthetic data here
    syn_vals  = df[col].sample(1000, replace=True, random_state=1).values \
                + np.random.normal(0, df[col].std() * 0.05, 1000)
    ks, p = stats.ks_2samp(real_vals, syn_vals)
    print(f"  {col:<20} KS={ks:.3f}  p={p:.3f}")
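Beyond per-feature tests, the correlation-matrix comparison mentioned earlier reduces the joint structure to a single worst-case number. A self-contained sketch, with a bootstrap-plus-noise resample standing in for synthetic data as in the cell above (the columns are simulated for the example):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 2000
z = rng.standard_normal(n)
real = pd.DataFrame({
    "age":    35 + 10 * z + rng.normal(0, 4, n),
    "income": 60000 + 20000 * z + rng.normal(0, 8000, n),
    "tenure": rng.exponential(3.5, n),
})

# Stand-in "synthetic" data: bootstrap resample plus 5% noise
syn = real.sample(n, replace=True, random_state=1)
syn = syn + rng.normal(0, syn.std().values * 0.05, syn.shape)

# Fidelity score: worst-case disagreement between correlation matrices
diff = (real.corr() - syn.corr()).abs()
print(f"Max correlation error: {diff.values.max():.3f}")
```

A small maximum error means every pairwise relationship in the synthetic data tracks its real counterpart; a single large entry pinpoints exactly which pair of features the generator failed to capture.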

15.7 Key Takeaways

  • Synthetic data serves four purposes: privacy preservation, handling rare events, correcting class imbalance, and early-stage development
  • Sampling from distributions is quick and controllable but ignores joint structure; use it when marginal distributions are all that matter
  • Apply SMOTE only to the training set, after the train/test split; never before
  • SDV’s GaussianCopulaSynthesizer preserves correlations and mixed datatypes; CTGAN handles multimodal and complex joint distributions
  • Formal differential privacy provides mathematical guarantees but at a fidelity cost; practical measures (noise, suppression) reduce risk without formal guarantees
  • Evaluate synthetic data on both fidelity (KS/chi-square) and utility (TSTR AUC gap)

Recommended reading:

  • SDV documentation: docs.sdv.dev
  • Synthetic Data for Machine Learning — Yoon, Drummond, Ghassemi (NeurIPS 2020)
  • diffprivlib documentation (IBM): diffprivlib.readthedocs.io