=========================
Synthetic Data Generation
=========================

SynthGen trains deep generative models (CTGAN, CopulaGAN, TVAE) on your data and
produces privacy-safe synthetic datasets that preserve the statistical properties
of the original.

.. contents:: Sections
   :local:
   :depth: 2

Overview
--------

Synthetic data is useful when:

- Original data contains PII and cannot be shared
- You need more training data for imbalanced classes
- Testing and development require realistic data
- Regulatory constraints prevent using production data

The workflow is:

1. **Train a SynthGen model** on a dataset version
2. **Generate synthetic rows** from the trained model
3. **Download or use** the synthetic data

----

Training a Model
----------------

Available model types:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Type
     - Description
   * - ``ctgan``
     - Conditional Tabular GAN. Best for datasets with mixed column types.
   * - ``copulagan``
     - CopulaGAN. Captures complex column correlations.
   * - ``tvae``
     - Tabular Variational Autoencoder. Faster training, good for large datasets.
   * - ``gaussian_copula``
     - Gaussian Copula. Lightweight, interpretable, fast.

Choosing a Model Type
^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 50

   * - Type
     - Training Speed
     - Quality
     - When to Use
   * - ``ctgan``
     - Slow (hours)
     - Highest
     - Production synthetic data with mixed column types. Best overall quality.
   * - ``copulagan``
     - Slow (hours)
     - High
     - Datasets where column correlations are critical (e.g., financial data
       with correlated features).
   * - ``tvae``
     - Medium (minutes)
     - Medium-High
     - Large datasets where training speed matters. Good quality with less
       training time.
   * - ``gaussian_copula``
     - Fast (seconds)
     - Medium
     - Quick prototyping, baselines, or when you need synthetic data
       immediately. Not suitable for complex distributions.

Training Configuration
^^^^^^^^^^^^^^^^^^^^^^

Common configuration options for the GAN-based models (CTGAN, CopulaGAN):

.. code-block:: python

   config = {
       "epochs": 300,                    # Training iterations (more = better quality, slower)
       "batch_size": 500,                # Samples per training step
       "embedding_dim": 128,             # Data embedding size
       "generator_dim": [256, 256],      # Generator network layers
       "discriminator_dim": [256, 256],  # Discriminator network layers
       "pac": 10,                        # Packing size (must divide batch_size)
   }

For TVAE, use ``epochs``, ``batch_size``, and ``embedding_dim``. For Gaussian
Copula, only ``seed`` is configurable.

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "name": "Customer Synth CTGAN",
       "model_type": "ctgan",
       "config": {
           "epochs": 300,
           "batch_size": 500,
           "generator_dim": [256, 256],
           "discriminator_dim": [256, 256],
       },
   })
   synth_model_id = resp.json()["id"]

Training runs as a background job. Poll for completion:

.. code-block:: python

   import time

   while True:
       r = requests.get(
           f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
           headers=HEADERS,
       )
       model = r.json()
       print(f"Status: {model['status']}")
       if model["status"] in ("ready", "failed"):
           break
       time.sleep(10)

Generating Synthetic Data
-------------------------

Once the model is trained, generate rows:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
       headers=HEADERS,
       json={
           "num_rows": 5000,
           "seed": 42,
       },
   )
   result = resp.json()
   print(f"Generated {result.get('num_rows', 'N/A')} rows")

Conditional Generation
^^^^^^^^^^^^^^^^^^^^^^

Generate data matching specific conditions:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/generate",
       headers=HEADERS,
       json={
           "num_rows": 1000,
           "conditions": {
               "churned": "Yes",
               "contract": "Month-to-month",
           },
       },
   )

Streaming Large Datasets
^^^^^^^^^^^^^^^^^^^^^^^^

For large synthetic datasets, use the streaming endpoint:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/stream",
       headers=HEADERS,
       json={
           "num_rows": 100000,
           "batch_size": 5000,
       },
       stream=True,
   )
   with open("synthetic_data.csv", "wb") as f:
       for chunk in resp.iter_content(chunk_size=8192):
           f.write(chunk)

Downloading Synthetic Data
--------------------------

Download the most recent generation as a CSV:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}/download",
       headers=HEADERS,
   )
   with open("synthetic_output.csv", "wb") as f:
       f.write(resp.content)

Quality Metrics
---------------

After generation, inspect quality metrics comparing synthetic vs. real data
distributions:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/synthgen/models/{synth_model_id}",
       headers=HEADERS,
   )
   model = resp.json()
   if "quality_metrics" in model:
       metrics = model["quality_metrics"]
       print(f"Overall quality score: {metrics.get('overall_score', 'N/A')}")
       for col, score in metrics.get("column_scores", {}).items():
           print(f"  {col}: {score:.3f}")

SynthGen Report
---------------

Generate a PDF report with quality analysis:

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
       "project_id": project_id,
       "kind": "synthgen",
       "entity_id": synth_model_id,
       "options": {"llm_insights": True},
   })

SDK Example
-----------

.. code-block:: python

   from coreplexml import CorePlexMLClient

   client = CorePlexMLClient(base_url=BASE_URL, api_key=API_KEY)

   # Train
   model = client.synthgen.create(
       project_id=project_id,
       dataset_version_id=version_id,
       name="Customer Synth",
       model_type="ctgan",
   )

   # Wait for training
   # (poll model status or use client.synthgen.get(model["id"]))

   # Generate
   result = client.synthgen.generate(model["id"], num_rows=5000, seed=42)

----

.. seealso::

   - :doc:`/api-reference/synthgen` -- Full SynthGen API reference
   - :doc:`privacy-suite` -- Anonymize data before generating synthetic versions
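Validating the Config Locally
-----------------------------

The GAN training configuration above notes that ``pac`` must evenly divide
``batch_size``. A minimal client-side check can surface that mistake before the
training request is even submitted (``validate_gan_config`` is a hypothetical
helper sketched here, not part of the SynthGen API):

.. code-block:: python

   def validate_gan_config(config):
       """Raise ValueError if a CTGAN/CopulaGAN config violates the
       'pac must divide batch_size' constraint. Defaults mirror the
       example config above."""
       batch_size = config.get("batch_size", 500)
       pac = config.get("pac", 10)
       if batch_size % pac != 0:
           raise ValueError(
               f"pac ({pac}) must evenly divide batch_size ({batch_size})"
           )
       return config

   validate_gan_config({"batch_size": 500, "pac": 10})  # OK: 500 % 10 == 0

Passing, say, ``{"batch_size": 500, "pac": 7}`` raises ``ValueError``, which is
a clearer failure than a rejected or failed training job.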
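Polling with a Timeout
----------------------

The polling loop in *Training a Model* runs until the model reaches ``ready`` or
``failed``; since GAN training can take hours, a production client usually wants
a timeout as well. A generic sketch (the helper name and defaults are our
invention, not part of the API):

.. code-block:: python

   import time

   def wait_for_terminal_status(get_status, timeout_s=7200, interval_s=10):
       """Call get_status() until it returns 'ready' or 'failed', or give
       up after roughly timeout_s seconds."""
       deadline = time.monotonic() + timeout_s
       while True:
           status = get_status()
           if status in ("ready", "failed"):
               return status
           if time.monotonic() >= deadline:
               raise TimeoutError(f"model still '{status}' after {timeout_s}s")
           time.sleep(interval_s)

With ``requests``, ``get_status`` would be a small lambda around the model
endpoint, e.g.
``wait_for_terminal_status(lambda: requests.get(f"{BASE_URL}/api/synthgen/models/{synth_model_id}", headers=HEADERS).json()["status"])``.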
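Interpreting Column Scores
--------------------------

The ``column_scores`` returned under *Quality Metrics* are per-column similarity
scores between real and synthetic data, where higher is better. One common
convention for numeric columns (used, for example, by SDMetrics' KSComplement)
is one minus the two-sample Kolmogorov-Smirnov statistic. The sketch below
illustrates that idea locally; it is our illustration of the general technique,
not necessarily SynthGen's exact formula:

.. code-block:: python

   import bisect

   def ks_statistic(a, b):
       """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
       the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
       a, b = sorted(a), sorted(b)
       d = 0.0
       for v in sorted(set(a) | set(b)):
           cdf_a = bisect.bisect_right(a, v) / len(a)
           cdf_b = bisect.bisect_right(b, v) / len(b)
           d = max(d, abs(cdf_a - cdf_b))
       return d

   def column_score(real, synthetic):
       """Higher is better: 1.0 means identical empirical distributions."""
       return 1.0 - ks_statistic(real, synthetic)

   print(column_score([1, 2, 3, 4], [1, 2, 3, 4]))  # 1.0 -- identical samples

Under this convention, a column score near 1.0 means the synthetic column's
distribution closely tracks the real one, while a score near 0.0 means the two
distributions barely overlap.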