=========================
Synthetic Data Generation
=========================

SynthGen trains deep generative models (CTGAN, CopulaGAN, TVAE) on your data and
produces privacy-safe synthetic datasets that preserve the statistical properties
of the original.

.. contents:: Sections
   :local:
   :depth: 2

Overview
--------

Synthetic data is useful when:

- Original data contains PII and cannot be shared
- You need more training data for imbalanced classes
- Testing and development require realistic data
- Regulatory constraints prevent using production data

The workflow is:

1. **Train a SynthGen model** on a dataset version
2. **Generate synthetic rows** from the trained model
3. **Download or use** the synthetic data

----

Training a Model
----------------

Available model types:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Type
     - Description
   * - ``ctgan``
     - Conditional Tabular GAN. Best for datasets with mixed column types.
   * - ``copulagan``
     - CopulaGAN. Captures complex column correlations.
   * - ``tvae``
     - Tabular Variational Autoencoder. Faster training, good for large datasets.
   * - ``gaussian_copula``
     - Gaussian Copula. Lightweight, interpretable, fast.

Choosing a Model Type
^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 50

   * - Type
     - Training Speed
     - Quality
     - When to Use
   * - ``ctgan``
     - Slow (hours)
     - Highest
     - Production synthetic data with mixed column types. Best overall quality.
   * - ``copulagan``
     - Slow (hours)
     - High
     - Datasets where column correlations are critical (e.g., financial data
       with correlated features).
   * - ``tvae``
     - Medium (minutes)
     - Medium-High
     - Large datasets where training speed matters. Good quality with less
       training time.
   * - ``gaussian_copula``
     - Fast (seconds)
     - Medium
     - Quick prototyping, baselines, or when you need synthetic data
       immediately. Not suitable for complex distributions.

Training Configuration
^^^^^^^^^^^^^^^^^^^^^^

Common configuration options for the GAN-based models (CTGAN, CopulaGAN):

.. code-block:: python

   config = {
       "epochs": 300,                    # Training iterations (more = better quality, slower)
       "batch_size": 500,                # Samples per training step
       "embedding_dim": 128,             # Data embedding size
       "generator_dim": [256, 256],      # Generator network layers
       "discriminator_dim": [256, 256],  # Discriminator network layers
       "pac": 10,                        # Packing size (must divide batch_size)
   }

For TVAE, use ``epochs``, ``batch_size``, and ``embedding_dim``. For Gaussian
Copula, only ``seed`` is configurable.

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "name": "Customer Synth CTGAN",
       "model_type": "ctgan",
       "config": {
           "epochs": 300,
           "batch_size": 500,
           "generator_dim": [256, 256],
           "discriminator_dim": [256, 256],
       },
   })
   synth_model_id = resp.json()["id"]

Training runs as a background job. Poll for completion:

.. code-block:: python

   import time

   while True:
       r = requests.get(
           f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
           headers=HEADERS,
       )
       model = r.json()
       print(f"Status: {model['status']}")
       if model["status"] in ("ready", "failed"):
           break
       time.sleep(10)

Generating Synthetic Data
-------------------------

Once the model is trained, generate rows:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
       headers=HEADERS,
       json={
           "num_rows": 5000,
           "seed": 42,
       },
   )
   result = resp.json()
   print(f"Generated {result.get('num_rows', 'N/A')} rows")

Conditional Generation
^^^^^^^^^^^^^^^^^^^^^^

Generate data matching specific conditions:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
       headers=HEADERS,
       json={
           "num_rows": 1000,
           "conditions": {
               "churned": "Yes",
               "contract": "Month-to-month",
           },
       },
   )

Streaming Large Datasets
^^^^^^^^^^^^^^^^^^^^^^^^

For large synthetic datasets, use the streaming endpoint:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/stream",
       headers=HEADERS,
       json={
           "num_rows": 100000,
           "batch_size": 5000,
       },
       stream=True,
   )
   with open("synthetic_data.csv", "wb") as f:
       for chunk in resp.iter_content(chunk_size=8192):
           f.write(chunk)

Downloading Synthetic Data
--------------------------

Download the most recent generation as a CSV:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/download",
       headers=HEADERS,
   )
   with open("synthetic_output.csv", "wb") as f:
       f.write(resp.content)

Quality Metrics
---------------

After generation, inspect quality metrics comparing synthetic vs. real data
distributions:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
       headers=HEADERS,
   )
   model = resp.json()
   if "quality_metrics" in model:
       metrics = model["quality_metrics"]
       print(f"Overall quality score: {metrics.get('overall_score', 'N/A')}")
       for col, score in metrics.get("column_scores", {}).items():
           print(f"  {col}: {score:.3f}")

SynthGen Report
---------------

Generate a PDF report with quality analysis:

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
       "project_id": project_id,
       "kind": "synthgen",
       "entity_id": synth_model_id,
       "options": {"llm_insights": True},
   })

SDK Example
-----------

.. code-block:: python

   from coreplexml import CorePlexMLClient

   client = CorePlexMLClient(base_url=BASE_URL, api_key=API_KEY)

   # Train
   model = client.synthgen.create(
       project_id=project_id,
       dataset_version_id=version_id,
       name="Customer Synth",
       model_type="ctgan",
   )

   # Wait for training
   # (poll model status or use client.synthgen.get(model["id"]))

   # Generate
   result = client.synthgen.generate(model["id"], num_rows=5000, seed=42)

----

.. seealso::

   - :doc:`/api-reference/synthgen` -- Full SynthGen API reference
   - :doc:`privacy-suite` -- Anonymize data before generating synthetic versions
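Validating the Config Locally
-----------------------------

The GAN training configuration above notes that ``pac`` must evenly divide
``batch_size``. A minimal client-side check can surface that mistake before the
training request is even submitted (``validate_gan_config`` is a hypothetical
helper sketched here, not part of the SynthGen API):

.. code-block:: python

   def validate_gan_config(config):
       """Raise ValueError if a CTGAN/CopulaGAN config violates the
       'pac must divide batch_size' constraint. Defaults mirror the
       example config above."""
       batch_size = config.get("batch_size", 500)
       pac = config.get("pac", 10)
       if batch_size % pac != 0:
           raise ValueError(
               f"pac ({pac}) must evenly divide batch_size ({batch_size})"
           )
       return config

   validate_gan_config({"batch_size": 500, "pac": 10})  # OK: 500 % 10 == 0

Passing, say, ``{"batch_size": 500, "pac": 7}`` raises ``ValueError``, which is
a clearer failure than a rejected or failed training job.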
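Polling with a Timeout
----------------------

The polling loop in *Training a Model* runs until the model reaches ``ready`` or
``failed``; since GAN training can take hours, a production client usually wants
a timeout as well. A generic sketch (the helper name and defaults are our
invention, not part of the API):

.. code-block:: python

   import time

   def wait_for_terminal_status(get_status, timeout_s=7200, interval_s=10):
       """Call get_status() until it returns 'ready' or 'failed', or give
       up after roughly timeout_s seconds."""
       deadline = time.monotonic() + timeout_s
       while True:
           status = get_status()
           if status in ("ready", "failed"):
               return status
           if time.monotonic() >= deadline:
               raise TimeoutError(f"model still '{status}' after {timeout_s}s")
           time.sleep(interval_s)

With ``requests``, ``get_status`` would be a small lambda around the model
endpoint, e.g.
``wait_for_terminal_status(lambda: requests.get(f"{BASE_URL}/api/synthgen/models/{synth_model_id}", headers=HEADERS).json()["status"])``.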
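Interpreting Column Scores
--------------------------

The ``column_scores`` returned under *Quality Metrics* are per-column similarity
scores between real and synthetic data, where higher is better. One common
convention for numeric columns (used, for example, by SDMetrics' KSComplement)
is one minus the two-sample Kolmogorov-Smirnov statistic. The sketch below
illustrates that idea locally; it is our illustration of the general technique,
not necessarily SynthGen's exact formula:

.. code-block:: python

   import bisect

   def ks_statistic(a, b):
       """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
       the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
       a, b = sorted(a), sorted(b)
       d = 0.0
       for v in sorted(set(a) | set(b)):
           cdf_a = bisect.bisect_right(a, v) / len(a)
           cdf_b = bisect.bisect_right(b, v) / len(b)
           d = max(d, abs(cdf_a - cdf_b))
       return d

   def column_score(real, synthetic):
       """Higher is better: 1.0 means identical empirical distributions."""
       return 1.0 - ks_statistic(real, synthetic)

   print(column_score([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 -- identical samples

Under this convention, a column score near 1.0 means the synthetic column's
distribution closely tracks the real one, while a score near 0.0 means the two
distributions barely overlap.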