Synthetic Data Generation
SynthGen trains deep generative models (CTGAN, CopulaGAN, TVAE) on your data and produces privacy-safe synthetic datasets that preserve the statistical properties of the original.
Overview
Synthetic data is useful when:

- Original data contains PII and cannot be shared
- You need more training data for imbalanced classes
- Testing and development require realistic data
- Regulatory constraints prevent using production data
The workflow is:

1. Train a SynthGen model on a dataset version
2. Generate synthetic rows from the trained model
3. Download or use the synthetic data
Training a Model
Available model types:
| Type | Description |
|---|---|
| `ctgan` | Conditional Tabular GAN. Best for datasets with mixed column types. |
| `copulagan` | CopulaGAN. Captures complex column correlations. |
| `tvae` | Tabular Variational Autoencoder. Faster training, good for large datasets. |
| `gaussian_copula` | Gaussian Copula. Lightweight, interpretable, fast. |
Choosing a Model Type
| Type | Training Speed | Quality | When to Use |
|---|---|---|---|
| `ctgan` | Slow (hours) | Highest | Production synthetic data with mixed column types. Best overall quality. |
| `copulagan` | Slow (hours) | High | Datasets where column correlations are critical (e.g., financial data with correlated features). |
| `tvae` | Medium (minutes) | Medium-High | Large datasets where training speed matters. Good quality with less training time. |
| `gaussian_copula` | Fast (seconds) | Medium | Quick prototyping, baselines, or when you need synthetic data immediately. Not suitable for complex distributions. |
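The trade-offs in the table above can be folded into a simple selection helper. A minimal sketch — the row-count threshold and the priority order are illustrative assumptions, not part of SynthGen:

```python
def suggest_model_type(num_rows: int, need_fast: bool = False,
                       correlations_critical: bool = False) -> str:
    """Pick a SynthGen model type from the rough guidance in the table.

    The thresholds here are illustrative; tune them for your data.
    """
    if need_fast:
        return "gaussian_copula"   # seconds to train; fine for prototyping
    if correlations_critical:
        return "copulagan"         # best at preserving column correlations
    if num_rows > 1_000_000:
        return "tvae"              # minutes instead of hours on large data
    return "ctgan"                 # best overall quality for mixed types
```
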
Training Configuration
Common configuration options for GAN-based models (CTGAN, CopulaGAN):
```python
config = {
    "epochs": 300,                    # Training iterations (more = better quality, slower)
    "batch_size": 500,                # Samples per training step
    "embedding_dim": 128,             # Data embedding size
    "generator_dim": [256, 256],      # Generator network layers
    "discriminator_dim": [256, 256],  # Discriminator network layers
    "pac": 10,                        # Packing size (must divide batch_size)
}
```
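Because `pac` must evenly divide `batch_size`, a quick client-side check can catch a bad config before submitting a training job. A sketch — the server presumably performs its own validation:

```python
def validate_gan_config(config: dict) -> None:
    """Raise ValueError for combinations CTGAN/CopulaGAN would reject."""
    batch_size = config.get("batch_size", 500)
    pac = config.get("pac", 10)
    if batch_size % pac != 0:
        raise ValueError(f"pac ({pac}) must evenly divide batch_size ({batch_size})")
    if config.get("epochs", 300) < 1:
        raise ValueError("epochs must be at least 1")
```
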
For TVAE, only `epochs`, `batch_size`, and `embedding_dim` apply. For Gaussian Copula, only `seed` is configurable.
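For reference, minimal configs for the other model types might look like this (the values shown are illustrative, not server-mandated defaults):

```python
tvae_config = {
    "epochs": 300,       # TVAE supports epochs, batch_size, embedding_dim
    "batch_size": 500,
    "embedding_dim": 128,
}

gaussian_copula_config = {
    "seed": 42,          # the only configurable option for Gaussian Copula
}
```
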
Create the model with your chosen configuration:

```python
import requests

resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "name": "Customer Synth CTGAN",
    "model_type": "ctgan",
    "config": {
        "epochs": 300,
        "batch_size": 500,
        "generator_dim": [256, 256],
        "discriminator_dim": [256, 256],
    },
})
synth_model_id = resp.json()["id"]
```
Training runs as a background job. Poll for completion:
```python
import time

while True:
    r = requests.get(
        f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
        headers=HEADERS,
    )
    model = r.json()
    print(f"Status: {model['status']}")
    if model["status"] in ("ready", "failed"):
        break
    time.sleep(10)
```
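The loop above polls forever; in practice you may want a timeout. A generic helper along these lines — a sketch, not part of the SynthGen API:

```python
import time

def wait_for_status(fetch, terminal=("ready", "failed"),
                    interval=10, timeout=3600):
    """Poll fetch() until its "status" is terminal or timeout expires.

    fetch: a zero-argument callable returning the model dict.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        model = fetch()
        if model["status"] in terminal:
            return model
        time.sleep(interval)
    raise TimeoutError("model did not reach a terminal status in time")
```
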
Generating Synthetic Data
Once the model is trained, generate rows:
```python
resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 5000,
        "seed": 42,
    },
)
result = resp.json()
print(f"Generated {result.get('num_rows', 'N/A')} rows")
```
Conditional Generation
Generate data matching specific conditions:
```python
resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
    headers=HEADERS,
    json={
        "num_rows": 1000,
        "conditions": {
            "churned": "Yes",
            "contract": "Month-to-month",
        },
    },
)
```
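Conditional generation is a natural fit for the imbalanced-class use case mentioned earlier: compute how many synthetic rows each minority class needs, then request that many rows with the class label as a condition. A sketch with illustrative class counts:

```python
def rows_needed_to_balance(class_counts: dict) -> dict:
    """Synthetic rows per class needed to match the largest class."""
    target = max(class_counts.values())
    return {label: target - count
            for label, count in class_counts.items()
            if count < target}

# e.g. 9,000 "No" vs 1,000 "Yes" churn labels:
# rows_needed_to_balance({"No": 9000, "Yes": 1000}) -> {"Yes": 8000}
```

Each entry then becomes one generate request, with the label in `conditions` and the deficit as `num_rows`.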
Streaming Large Datasets
For large synthetic datasets, use the streaming endpoint:
```python
resp = requests.post(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/stream",
    headers=HEADERS,
    json={
        "num_rows": 100000,
        "batch_size": 5000,
    },
    stream=True,
)
with open("synthetic_data.csv", "wb") as f:
    for chunk in resp.iter_content(chunk_size=8192):
        f.write(chunk)
```
Downloading Synthetic Data
Download the most recent generation as a CSV:
```python
resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}/download",
    headers=HEADERS,
)
with open("synthetic_output.csv", "wb") as f:
    f.write(resp.content)
```
Quality Metrics
After generation, inspect quality metrics comparing synthetic vs. real data distributions:
```python
resp = requests.get(
    f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
    headers=HEADERS,
)
model = resp.json()
if "quality_metrics" in model:
    metrics = model["quality_metrics"]
    print(f"Overall quality score: {metrics.get('overall_score', 'N/A')}")
    for col, score in metrics.get("column_scores", {}).items():
        print(f"  {col}: {score:.3f}")
```
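Column scores can also drive an automated acceptance gate before synthetic data is released downstream. A minimal sketch — the 0.8 threshold is an assumption, not a SynthGen default:

```python
def weak_columns(column_scores: dict, threshold: float = 0.8) -> list:
    """Columns whose synthetic/real similarity score falls below threshold."""
    return sorted(col for col, score in column_scores.items()
                  if score < threshold)
```

If the list is non-empty, consider more epochs, a different model type, or reviewing those columns' distributions.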
SynthGen Report
Generate a PDF report with quality analysis:
```python
resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "synthgen",
    "entity_id": synth_model_id,
    "options": {"llm_insights": True},
})
```
SDK Example
```python
from coreplexml import CorePlexMLClient

client = CorePlexMLClient(base_url=BASE_URL, api_key=API_KEY)

# Train
model = client.synthgen.create(
    project_id=project_id,
    dataset_version_id=version_id,
    name="Customer Synth",
    model_type="ctgan",
)

# Wait for training
# (poll model status or use client.synthgen.get(model["id"]))

# Generate
result = client.synthgen.generate(model["id"], num_rows=5000, seed=42)
```
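The "wait for training" step above can be written as a small polling loop around `client.synthgen.get`, the call mentioned in the comment. A sketch — the `status` field and terminal values are assumed to match the REST responses:

```python
import time

def wait_for_model(client, model_id, interval=10):
    """Poll client.synthgen.get until the model is ready or failed."""
    while True:
        model = client.synthgen.get(model_id)
        if model["status"] in ("ready", "failed"):
            return model
        time.sleep(interval)
```
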
See also
- SynthGen API – Full SynthGen API reference
- Privacy Suite – Anonymize data before generating synthetic versions