Synthetic Data Generation

SynthGen trains deep generative models (CTGAN, CopulaGAN, TVAE, Gaussian Copula) on your data and produces privacy-safe synthetic datasets that preserve the statistical properties of the original.

Overview

Synthetic data is useful when:

  • Original data contains PII and cannot be shared

  • You need more training data for imbalanced classes

  • Testing and development require realistic data

  • Regulatory constraints prevent using production data

The workflow is:

  1. Train a SynthGen model on a dataset version

  2. Generate synthetic rows from the trained model

  3. Download or use the synthetic data


Training a Model

Available model types:

| Type | Description |
|---|---|
| ctgan | Conditional Tabular GAN. Best for datasets with mixed column types. |
| copulagan | CopulaGAN. Captures complex column correlations. |
| tvae | Tabular Variational Autoencoder. Faster training, good for large datasets. |
| gaussian_copula | Gaussian Copula. Lightweight, interpretable, fast. |

Choosing a Model Type

| Type | Training Speed | Quality | When to Use |
|---|---|---|---|
| ctgan | Slow (hours) | Highest | Production synthetic data with mixed column types. Best overall quality. |
| copulagan | Slow (hours) | High | Datasets where column correlations are critical (e.g., financial data with correlated features). |
| tvae | Medium (minutes) | Medium-high | Large datasets where training speed matters. Good quality with less training time. |
| gaussian_copula | Fast (seconds) | Medium | Quick prototyping, baselines, or when you need synthetic data immediately. Not suitable for complex distributions. |

Training Configuration

Common configuration options for GAN-based models (CTGAN, CopulaGAN):

config = {
    "epochs": 300,              # Training iterations (more = better quality, slower)
    "batch_size": 500,          # Samples per training step
    "embedding_dim": 128,       # Data embedding size
    "generator_dim": [256, 256],     # Generator network layers
    "discriminator_dim": [256, 256], # Discriminator network layers
    "pac": 10,                  # Packing size (must divide batch_size)
}

For TVAE, use epochs, batch_size, and embedding_dim. For Gaussian Copula, only seed is configurable.
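The pac constraint above is easy to violate when tuning batch_size. A small helper like the following (illustrative only, not part of the SynthGen API) can catch invalid configs before a training job is submitted:

```python
def validate_gan_config(config):
    """Sanity-check a CTGAN/CopulaGAN config before submitting a training job.

    Illustrative helper, not part of the SynthGen API. Defaults mirror the
    values shown in the example config above.
    """
    epochs = config.get("epochs", 300)
    batch_size = config.get("batch_size", 500)
    pac = config.get("pac", 10)
    if epochs <= 0:
        raise ValueError("epochs must be positive")
    if batch_size % pac != 0:
        raise ValueError(f"pac ({pac}) must divide batch_size ({batch_size})")
    return config
```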

resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "name": "Customer Synth CTGAN",
    "model_type": "ctgan",
    "config": {
        "epochs": 300,
        "batch_size": 500,
        "generator_dim": [256, 256],
        "discriminator_dim": [256, 256],
    },
})
synth_model_id = resp.json()["id"]

Training runs as a background job. Poll for completion:

import time

while True:
    r = requests.get(
        f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
        headers=HEADERS,
    )
    model = r.json()
    print(f"Status: {model['status']}")
    if model["status"] in ("ready", "failed"):
        break
    time.sleep(10)
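The loop above blocks forever if a job never reaches a terminal status. One way to bound the wait is to wrap the polling in a helper with a timeout; this sketch takes any zero-argument callable that returns the model dict (for example, a wrapper around the GET request shown above), so it is not tied to a particular client:

```python
import time


def wait_for_model(fetch, timeout=3600.0, interval=10.0):
    """Poll until the model reaches a terminal status or the timeout expires.

    `fetch` is any zero-argument callable returning the model dict.
    Illustrative helper, not part of the SynthGen API or SDK.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        model = fetch()
        if model["status"] in ("ready", "failed"):
            return model
        time.sleep(interval)
    raise TimeoutError("model did not reach a terminal status in time")
```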

Generating Synthetic Data

Once the model is trained, generate rows:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 5000,
        "seed": 42,
    },
)
result = resp.json()
print(f"Generated {result.get('num_rows', 'N/A')} rows")

Conditional Generation

Generate data matching specific conditions:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 1000,
        "conditions": {
            "churned": "Yes",
            "contract": "Month-to-month",
        },
    },
)
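If you want to confirm that the returned rows actually satisfy the requested conditions, a simple check over the row dicts works regardless of how the data was fetched (illustrative helper, assuming rows are plain dicts keyed by column name):

```python
def violating_rows(rows, conditions):
    """Return the rows that do not satisfy every requested column value."""
    return [
        row for row in rows
        if any(row.get(col) != val for col, val in conditions.items())
    ]
```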

Streaming Large Datasets

For large synthetic datasets, use the streaming endpoint:

resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/stream",
    headers=HEADERS,
    json={
        "num_rows": 100000,
        "batch_size": 5000,
    },
    stream=True,
)
with open("synthetic_data.csv", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
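Once the stream has been written to disk, the file can be loaded for a quick sanity check with the standard library (or pandas, if you prefer a DataFrame):

```python
import csv


def load_csv_rows(path):
    """Load a streamed CSV into a list of dicts, one dict per synthetic row."""
    with open(path, newline="") as f:
        return list(csv.DictReader(f))
```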

Downloading Synthetic Data

Download the most recent generation as a CSV:

resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/download",
    headers=HEADERS,
)
with open("synthetic_output.csv", "wb") as f:
    f.write(resp.content)

Quality Metrics

After generation, inspect quality metrics comparing synthetic vs. real data distributions:

resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
    headers=HEADERS,
)
model = resp.json()
if "quality_metrics" in model:
    metrics = model["quality_metrics"]
    print(f"Overall quality score: {metrics.get('overall_score', 'N/A')}")
    for col, score in metrics.get("column_scores", {}).items():
        print(f"  {col}: {score:.3f}")

SynthGen Report

Generate a PDF report with quality analysis:

resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "synthgen",
    "entity_id": synth_model_id,
    "options": {"llm_insights": True},
})

SDK Example

from coreplexml import CorePlexMLClient

client = CorePlexMLClient(base_url=BASE_URL, api_key=API_KEY)

# Train
model = client.synthgen.create(
    project_id=project_id,
    dataset_version_id=version_id,
    name="Customer Synth",
    model_type="ctgan",
)

# Wait for training
# (poll model status or use client.synthgen.get(model["id"]))

# Generate
result = client.synthgen.generate(model["id"], num_rows=5000, seed=42)

See also