Synthetic Data Generation

SynthGen trains deep generative models (CTGAN, CopulaGAN, TVAE, Gaussian Copula) on your data and produces privacy-safe synthetic datasets that preserve the statistical properties of the original.

Overview

Synthetic data is useful when:

  • Original data contains PII and cannot be shared

  • You need more training data for imbalanced classes

  • Testing and development require realistic data

  • Regulatory constraints prevent using production data

The workflow is:

  1. Train a SynthGen model on a dataset version

  2. Generate synthetic rows from the trained model

  3. Download or use the synthetic data


Training a Model

Available model types:

| Type | Description |
|---|---|
| ctgan | Conditional Tabular GAN. Best for datasets with mixed column types. |
| copulagan | CopulaGAN. Captures complex column correlations. |
| tvae | Tabular Variational Autoencoder. Faster training, good for large datasets. |
| gaussian_copula | Gaussian Copula. Lightweight, interpretable, fast. |

Choosing a Model Type

| Type | Training Speed | Quality | When to Use |
|---|---|---|---|
| ctgan | Slow (hours) | Highest | Production synthetic data with mixed column types. Best overall quality. |
| copulagan | Slow (hours) | High | Datasets where column correlations are critical (e.g., financial data with correlated features). |
| tvae | Medium (minutes) | Medium-high | Large datasets where training speed matters. Good quality with less training time. |
| gaussian_copula | Fast (seconds) | Medium | Quick prototyping, baselines, or when you need synthetic data immediately. Not suitable for complex distributions. |

Training Configuration

Common configuration options for GAN-based models (CTGAN, CopulaGAN):

config = {
    "epochs": 300,              # Training iterations (more = better quality, slower)
    "batch_size": 500,          # Samples per training step
    "embedding_dim": 128,       # Data embedding size
    "generator_dim": [256, 256],     # Generator network layers
    "discriminator_dim": [256, 256], # Discriminator network layers
    "pac": 10,                  # Packing size (must divide batch_size)
}

For TVAE, use epochs, batch_size, and embedding_dim. For Gaussian Copula, only seed is configurable.
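The pac constraint above is easy to violate when tuning batch_size. A small helper like the following (illustrative only, not part of the SynthGen API) can catch invalid configs before a training job is submitted:

```python
def validate_gan_config(config):
    """Sanity-check a CTGAN/CopulaGAN config before submitting a training job.

    Illustrative helper, not part of the SynthGen API. Defaults mirror the
    values shown in the example config above.
    """
    epochs = config.get("epochs", 300)
    batch_size = config.get("batch_size", 500)
    pac = config.get("pac", 10)
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    if batch_size % pac != 0:
        raise ValueError(f"pac ({pac}) must divide batch_size ({batch_size})")
    return config
```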

resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "name": "Customer Synth CTGAN",
    "model_type": "ctgan",
    "config": {
        "epochs": 300,
        "batch_size": 500,
        "generator_dim": [256, 256],
        "discriminator_dim": [256, 256],
    },
})
synth_model_id = resp.json()["id"]

Training runs as a background job. Poll for completion:

import time

while True:
    r = requests.get(
        f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
        headers=HEADERS,
    )
    model = r.json()
    print(f"Status: {model['status']}")
    if model["status"] in ("ready", "failed"):
        break
    time.sleep(10)
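The loop above blocks forever if a job never reaches a terminal status. One way to bound the wait is to wrap the polling in a helper with a timeout; this sketch takes any zero-argument callable that returns the model dict (for example, a wrapper around the GET request shown above), so it is not tied to a particular client:

```python
import time


def wait_for_model(fetch, timeout=3600.0, interval=10.0):
    """Poll until the model reaches a terminal status or the timeout expires.

    `fetch` is any zero-argument callable returning the model dict.
    Illustrative helper, not part of the SynthGen API or SDK.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        model = fetch()
        if model["status"] in ("ready", "failed"):
            return model
        time.sleep(interval)
    raise TimeoutError("model did not reach a terminal status in time")
```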

Generating Synthetic Data

Once the model is trained, generate rows:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 5000,
        "seed": 42,
    },
)
result = resp.json()
print(f"Generated {result.get('num_rows', 'N/A')} rows")

Conditional Generation

Generate data matching specific conditions:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 1000,
        "conditions": {
            "churned": "Yes",
            "contract": "Month-to-month",
        },
    },
)
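If you want to confirm that the returned rows actually satisfy the requested conditions, a simple check over the row dicts works regardless of how the data was fetched (illustrative helper, assuming rows are plain dicts keyed by column name):

```python
def violating_rows(rows, conditions):
    """Return the rows that do not satisfy every requested column value."""
    return [
        row for row in rows
        if any(row.get(col) != val for col, val in conditions.items())
    ]
```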

Streaming Large Datasets

For large synthetic datasets, use the streaming endpoint:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/stream",
    headers=HEADERS,
    json={
        "num_rows": 100000,
        "batch_size": 5000,
    },
    stream=True,
)
with open("synthetic_data.csv", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
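Once the stream has been written to disk, the file can be loaded for a quick sanity check with the standard library (or pandas, if you prefer a DataFrame):

```python
import csv


def load_csv_rows(path):
    """Load a streamed CSV into a list of dicts, one dict per synthetic row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```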

Downloading Synthetic Data

Download the most recent generation as a CSV:

resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/download",
    headers=HEADERS,
)
with open("synthetic_output.csv", "wb") as f:
    f.write(resp.content)

Quality Metrics

After generation, inspect quality metrics comparing synthetic vs. real data distributions:

resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
    headers=HEADERS,
)
model = resp.json()
if "quality_metrics" in model:
    metrics = model["quality_metrics"]
    print(f"Overall quality score: {metrics.get('overall_score', 'N/A')}")
    for col, score in metrics.get("column_scores", {}).items():
        print(f"  {col}: {score:.3f}")

SynthGen Report

Generate a PDF report with quality analysis:

resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "synthgen",
    "entity_id": synth_model_id,
    "options": {"llm_insights": True},
})

SDK Example

from coreplexml import CorePlexMLClient

client = CorePlexMLClient(base_url=BASE_URL, api_key=API_KEY)

# Train
model = client.synthgen.create(
    project_id=project_id,
    dataset_version_id=version_id,
    name="Customer Synth",
    model_type="ctgan",
)

# Wait for training
# (poll model status or use client.synthgen.get(model["id"]))

# Generate
result = client.synthgen.generate(model["id"], num_rows=5000, seed=42)

See also