============
SynthGen API
============

SynthGen generates synthetic tabular data that preserves the statistical
properties of your original dataset. It supports four model architectures:
CTGAN, CopulaGAN, TVAE, and Gaussian Copula.

All endpoints are prefixed with ``/api/synthgen``.

.. contents:: Endpoints
   :local:
   :depth: 1

----

Create Model
------------

.. code-block:: text

   POST /api/synthgen/models

Create a new SynthGen model and enqueue a ``synthgen_train`` background
job. Training begins automatically after the job is picked up by the
worker.

**Request Body**

.. list-table::
   :header-rows: 1
   :widths: 25 10 10 55

   * - Field
     - Type
     - Required
     - Description
   * - ``project_id``
     - string
     - Yes
     - UUID of the project.
   * - ``dataset_version_id``
     - string
     - Yes
     - UUID of the dataset version to train on.
   * - ``name``
     - string
     - Yes
     - Model name.
   * - ``model_type``
     - string
     - No
     - Model architecture: ``ctgan`` (default), ``copulagan``, ``tvae``,
       or ``gaussian_copula``.
   * - ``config``
     - object
     - No
     - Architecture-specific hyperparameters (e.g. ``epochs``,
       ``batch_size``, ``embedding_dim``).

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/synthgen/models" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
       "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
       "name": "Transactions CTGAN",
       "model_type": "ctgan",
       "config": {
         "epochs": 300,
         "batch_size": 500
       }
     }'

.. code-block:: python

   import time

   import requests

   BASE_URL = "http://localhost:8888"  # same base URL as the full workflow example
   HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

   # Create the model; training is enqueued as a background job
   resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
       "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
       "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
       "name": "Transactions CTGAN",
       "model_type": "ctgan",
       "config": {"epochs": 300, "batch_size": 500},
   })
   resp.raise_for_status()
   model_id = resp.json()["model_id"]
   print("Model training started:", model_id)

   # Poll until the model reaches a terminal status
   while True:
       r = requests.get(f"{BASE_URL}/api/synthgen/models/{model_id}", headers=HEADERS)
       r.raise_for_status()
       status = r.json()["model"]["status"]
       print(f"  Status: {status}")
       if status in ("ready", "failed"):
           break
       time.sleep(10)

**Response** ``201 Created``

.. code-block:: json

   {
     "id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
     "model_id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
     "job_id": "6f7a8b9c-0d1e-2345-f678-901234567890",
     "status": "pending"
   }
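
The polling loop in the example above never exits if the model stays in
``pending``. The following is a minimal sketch of a bounded poll, not part
of the SynthGen API: ``wait_for_status`` and all of its parameters are
hypothetical names, and the terminal statuses are the ones shown in this
document.

.. code-block:: python

   import time


   def wait_for_status(fetch_status, terminal=("ready", "failed"),
                       interval=10.0, timeout=3600.0, sleep=time.sleep):
       """Call fetch_status() until it returns a terminal status.

       Hypothetical helper: fetch_status is any zero-argument callable
       that returns the current status string.
       """
       waited = 0.0
       while True:
           status = fetch_status()
           if status in terminal:
               return status
           if waited >= timeout:
               raise TimeoutError(f"still {status!r} after {waited:.0f}s")
           sleep(interval)
           waited += interval

With ``requests``, ``fetch_status`` could be a small function that calls
``GET /api/synthgen/models/{model_id}`` and returns
``r.json()["model"]["status"]``.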

Model Configuration Reference
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Each model type supports specific configuration options passed in the
``config`` object:

**CTGAN (Conditional Tabular GAN)**

.. list-table::
   :header-rows: 1
   :widths: 25 10 15 50

   * - Option
     - Type
     - Default
     - Description
   * - ``epochs``
     - integer
     - 300
     - Number of training epochs.
   * - ``batch_size``
     - integer
     - 500
     - Training batch size.
   * - ``embedding_dim``
     - integer
     - 128
     - Dimensionality of data embeddings.
   * - ``generator_dim``
     - array[int]
     - [256, 256]
     - Hidden layer sizes for the generator network.
   * - ``discriminator_dim``
     - array[int]
     - [256, 256]
     - Hidden layer sizes for the discriminator network.
   * - ``pac``
     - integer
     - 10
     - PAC (Packing) size. Must evenly divide ``batch_size``.
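
Because ``pac`` must evenly divide ``batch_size``, it can help to check a
``config`` object before submitting it. The sketch below is a client-side
check only; ``check_pac`` is a hypothetical helper, the defaults are copied
from the table above, and this document does not say how the server reacts
to an invalid combination.

.. code-block:: python

   # Defaults copied from the CTGAN options table
   CTGAN_DEFAULTS = {"batch_size": 500, "pac": 10}


   def check_pac(config):
       """Raise ValueError unless pac evenly divides batch_size.

       Hypothetical helper; missing keys fall back to the documented defaults.
       """
       batch_size = config.get("batch_size", CTGAN_DEFAULTS["batch_size"])
       pac = config.get("pac", CTGAN_DEFAULTS["pac"])
       if pac <= 0 or batch_size % pac != 0:
           raise ValueError(
               f"pac={pac} does not evenly divide batch_size={batch_size}"
           )
       return config


   check_pac({"epochs": 300, "batch_size": 500})    # 500 % 10 == 0, accepted
   check_pac({"batch_size": 256, "pac": 8})         # 256 % 8 == 0, accepted

Note that changing ``batch_size`` alone can break the rule: ``256`` is not
divisible by the default ``pac`` of ``10``.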

**CopulaGAN**

Accepts the same options as CTGAN. CopulaGAN additionally uses copula
functions to capture complex correlations between columns.

**TVAE (Tabular VAE)**

.. list-table::
   :header-rows: 1
   :widths: 25 10 15 50

   * - Option
     - Type
     - Default
     - Description
   * - ``epochs``
     - integer
     - 300
     - Number of training epochs.
   * - ``batch_size``
     - integer
     - 500
     - Training batch size.
   * - ``embedding_dim``
     - integer
     - 128
     - Dimensionality of data embeddings.

**Gaussian Copula**

The fastest model type. It uses statistical copula fitting rather than deep
learning and needs minimal configuration:

.. list-table::
   :header-rows: 1
   :widths: 25 10 15 50

   * - Option
     - Type
     - Default
     - Description
   * - ``seed``
     - integer
     - None
     - Random seed for reproducibility.

**Model Type Comparison**

.. list-table::
   :header-rows: 1
   :widths: 20 15 15 15 35

   * - Type
     - Speed
     - Quality
     - Best For
     - Limitations
   * - ``ctgan``
     - Slow
     - High
     - Mixed column types, general use
     - Requires GPU for large datasets
   * - ``copulagan``
     - Slow
     - High
     - Complex correlations
     - Higher memory usage
   * - ``tvae``
     - Medium
     - Medium-High
     - Large datasets, fast iteration
     - May miss complex correlations
   * - ``gaussian_copula``
     - Fast
     - Medium
     - Quick prototyping, baselines
     - Assumes a Gaussian dependence structure between columns

----

List Models
-----------

.. code-block:: text

   GET /api/synthgen/models

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``project_id``
     - string
     - --
     - Filter by project.
   * - ``limit``
     - integer
     - 50
     - Maximum number of items to return.
   * - ``offset``
     - integer
     - 0
     - Pagination offset.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/synthgen/models?project_id=d4e5f6a7-b8c9-0123-def4-567890123456" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
         "name": "Transactions CTGAN",
         "model_type": "ctgan",
         "status": "ready",
         "dataset_name": "Transactions Q4",
         "dataset_version": 0,
         "created_at": "2026-02-25T10:00:00Z"
       }
     ],
     "total": 1,
     "limit": 50,
     "offset": 0
   }
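
The ``limit`` and ``offset`` parameters can be combined to walk through
every page. The following sketch assumes that ``total`` is the number of
all matching items rather than the size of the current page, which this
document does not state explicitly. ``iter_items`` and ``fetch_page`` are
hypothetical names.

.. code-block:: python

   def iter_items(fetch_page, limit=50):
       """Yield every item across pages.

       Hypothetical helper: fetch_page(limit=..., offset=...) must return a
       dict shaped like the list response above ("items" and "total").
       """
       offset = 0
       while True:
           page = fetch_page(limit=limit, offset=offset)
           items = page["items"]
           yield from items
           offset += len(items)
           # Stop on an empty page or once all reported items were seen
           if not items or offset >= page["total"]:
               break

``fetch_page`` would typically wrap ``GET /api/synthgen/models`` and return
the decoded JSON body.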

----

Get Model Detail
----------------

.. code-block:: text

   GET /api/synthgen/models/{model_id}

Return a model's metadata and its associated training and generation jobs.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/synthgen/models/5e6f7a8b-9c0d-1234-ef56-789012345678" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "model": {
       "id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
       "name": "Transactions CTGAN",
       "model_type": "ctgan",
       "status": "ready",
       "config": {"epochs": 300, "batch_size": 500},
       "dataset_name": "Transactions Q4",
       "dataset_version": 0,
       "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
       "created_at": "2026-02-25T10:00:00Z"
     },
     "jobs": [
       {
         "id": "7a8b9c0d-1e2f-3456-7890-123456789012",
         "job_type": "train",
         "status": "completed",
         "created_at": "2026-02-25T10:00:00Z"
       }
     ]
   }
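
Since the response mixes training and generation jobs in one ``jobs``
array, a client usually filters by ``job_type``. A minimal sketch, assuming
the response shape shown above and that all ``created_at`` values use the
same UTC timestamp format; ``latest_job`` is a hypothetical helper.

.. code-block:: python

   def latest_job(detail, job_type):
       """Return the most recent job of the given type, or None.

       Hypothetical helper operating on the decoded model detail response.
       """
       jobs = [j for j in detail["jobs"] if j["job_type"] == job_type]
       # Timestamps share one ISO 8601 UTC format, so string order is time order
       return max(jobs, key=lambda j: j["created_at"], default=None)


   detail = {
       "model": {"id": "5e6f7a8b-9c0d-1234-ef56-789012345678", "status": "ready"},
       "jobs": [
           {"id": "7a8b9c0d-1e2f-3456-7890-123456789012", "job_type": "train",
            "status": "completed", "created_at": "2026-02-25T10:00:00Z"},
       ],
   }
   train_job = latest_job(detail, "train")        # the single training job
   generate_job = latest_job(detail, "generate")  # None, nothing generated yet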

----

Generate Synthetic Data
-----------------------

.. code-block:: text

   POST /api/synthgen/models/{model_id}/generate

Generate synthetic data from a trained model. Enqueues a
``synthgen_generate`` background job. The model must be in ``ready``
status.

**Request Body**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Field
     - Type
     - Required
     - Description
   * - ``num_rows``
     - integer
     - No
     - Number of synthetic rows to generate (default 1000).
   * - ``seed``
     - integer
     - No
     - Random seed for reproducibility.
   * - ``conditions``
     - object
     - No
     - Conditional generation constraints (column-value pairs).

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/synthgen/models/5e6f7a8b-9c0d-1234-ef56-789012345678/generate" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -H "Content-Type: application/json" \
     -d '{
       "num_rows": 10000,
       "seed": 42
     }'

.. code-block:: python

   import time

   import requests

   BASE_URL = "http://localhost:8888"  # same base URL as the full workflow example
   HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}
   MODEL_ID = "5e6f7a8b-9c0d-1234-ef56-789012345678"

   # Enqueue a generation job
   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{MODEL_ID}/generate",
       headers=HEADERS,
       json={"num_rows": 10000, "seed": 42},
   )
   resp.raise_for_status()
   synthgen_job_id = resp.json()["synthgen_job_id"]

   # Poll until generation reaches a terminal status
   while True:
       r = requests.get(
           f"{BASE_URL}/api/synthgen/jobs/{synthgen_job_id}",
           headers=HEADERS,
       )
       r.raise_for_status()
       status = r.json()["status"]
       if status in ("completed", "failed"):
           break
       time.sleep(5)

   # A failed job has no output to download
   if status == "failed":
       raise RuntimeError(f"Generation job {synthgen_job_id} failed")

   # Download the generated data
   dl = requests.get(
       f"{BASE_URL}/api/synthgen/jobs/{synthgen_job_id}/download",
       headers=HEADERS,
   )
   dl.raise_for_status()
   with open("synthetic_data.csv", "wb") as f:
       f.write(dl.content)

**Response** ``200 OK``

.. code-block:: json

   {
     "job_id": "8b9c0d1e-2f3a-4567-8901-234567890123",
     "synthgen_job_id": "9c0d1e2f-3a4b-5678-9012-345678901234",
     "status": "pending"
   }
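
The examples above do not use ``conditions``. The sketch below shows one
way to assemble a request body that includes it, following the
"column-value pairs" description in the field table.
``build_generate_payload`` is a hypothetical helper and ``country`` is an
invented column name; substitute a column from your own dataset.

.. code-block:: python

   def build_generate_payload(num_rows=1000, seed=None, conditions=None):
       """Build a JSON-serializable body for the generate endpoint.

       Hypothetical helper; optional fields are omitted when not set.
       """
       if num_rows < 1:
           raise ValueError("num_rows must be a positive integer")
       payload = {"num_rows": num_rows}
       if seed is not None:
           payload["seed"] = seed
       if conditions:
           payload["conditions"] = dict(conditions)
       return payload


   # "country" is an assumed column name, used only for illustration
   payload = build_generate_payload(500, seed=42, conditions={"country": "DE"})

The resulting dictionary can be passed as ``json=payload`` to
``requests.post``.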

----

Delete Model
------------

.. code-block:: text

   DELETE /api/synthgen/models/{model_id}

Permanently delete a SynthGen model and its associated jobs.

**Example**

.. code-block:: bash

   curl -X DELETE "$BASE_URL/api/synthgen/models/5e6f7a8b-9c0d-1234-ef56-789012345678" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "ok": true,
     "model_id": "5e6f7a8b-9c0d-1234-ef56-789012345678"
   }

----

List Jobs
---------

.. code-block:: text

   GET /api/synthgen/jobs

Return SynthGen jobs (training and generation) with optional filters.

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``project_id``
     - string
     - --
     - Filter by project.
   * - ``model_id``
     - string
     - --
     - Filter by SynthGen model.
   * - ``status``
     - string
     - --
     - Filter by status: ``pending``, ``running``, ``completed``,
       ``failed``.
   * - ``limit``
     - integer
     - 50
     - Maximum number of items to return.
   * - ``offset``
     - integer
     - 0
     - Pagination offset.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/synthgen/jobs?model_id=5e6f7a8b-9c0d-1234-ef56-789012345678" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "9c0d1e2f-3a4b-5678-9012-345678901234",
         "model_id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
         "model_name": "Transactions CTGAN",
         "model_type": "ctgan",
         "job_type": "generate",
         "status": "completed",
         "config": {"num_rows": 10000, "seed": 42},
         "created_at": "2026-02-25T11:00:00Z"
       }
     ],
     "total": 1,
     "limit": 50,
     "offset": 0
   }
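
When several filters are optional, it is convenient to build the query
string from only the ones that are set. A minimal sketch using the
parameters and status values from the table above; ``jobs_query`` is a
hypothetical helper.

.. code-block:: python

   from urllib.parse import urlencode

   # Status values listed in the query parameter table
   JOB_STATUSES = ("pending", "running", "completed", "failed")


   def jobs_query(project_id=None, model_id=None, status=None,
                  limit=50, offset=0):
       """Return a URL query string containing only the filters that are set.

       Hypothetical helper, not part of the SynthGen API.
       """
       if status is not None and status not in JOB_STATUSES:
           raise ValueError(f"unknown status: {status!r}")
       params = {
           "project_id": project_id,
           "model_id": model_id,
           "status": status,
           "limit": limit,
           "offset": offset,
       }
       return urlencode({k: v for k, v in params.items() if v is not None})


   query = jobs_query(model_id="5e6f7a8b-9c0d-1234-ef56-789012345678",
                      status="failed")

The result is appended to ``/api/synthgen/jobs`` after a ``?``.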

----

Get Job Detail
--------------

.. code-block:: text

   GET /api/synthgen/jobs/{job_id}

Return details for a specific SynthGen job.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/synthgen/jobs/9c0d1e2f-3a4b-5678-9012-345678901234" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "id": "9c0d1e2f-3a4b-5678-9012-345678901234",
     "model_id": "5e6f7a8b-9c0d-1234-ef56-789012345678",
     "model_name": "Transactions CTGAN",
     "model_type": "ctgan",
     "project_name": "Fraud Detection v2",
     "job_type": "generate",
     "status": "completed",
     "config": {"num_rows": 10000, "seed": 42},
     "output_artifact_id": "0d1e2f3a-4b5c-6789-0123-456789012345",
     "created_at": "2026-02-25T11:00:00Z"
   }

----

Download Synthetic Data
-----------------------

.. code-block:: text

   GET /api/synthgen/jobs/{job_id}/download

Download the output artifact from a completed generation job.

Returns ``409 Conflict`` if the job output is not available yet.

**Example**

.. code-block:: bash

   curl -o synthetic_data.csv \
     "$BASE_URL/api/synthgen/jobs/9c0d1e2f-3a4b-5678-9012-345678901234/download" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

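   import requests

   BASE_URL = "http://localhost:8888"  # same base URL as the full workflow example

   resp = requests.get(
       f"{BASE_URL}/api/synthgen/jobs/9c0d1e2f-3a4b-5678-9012-345678901234/download",
       headers={"Authorization": "Bearer YOUR_API_KEY"},
   )
   resp.raise_for_status()  # raises on 409 if the output is not available yet
   with open("synthetic_data.csv", "wb") as f:
       f.write(resp.content)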

**Response** ``200 OK``

Binary file download.
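
Because the endpoint answers ``409 Conflict`` until the output exists, a
client may want to retry a few times before giving up. The following is a
sketch under that assumption; ``download_when_ready`` and its parameters
are hypothetical, and ``fetch`` stands in for whatever HTTP call you use.

.. code-block:: python

   import time


   def download_when_ready(fetch, attempts=5, delay=5.0, sleep=time.sleep):
       """Retry while the server answers 409, then return the body.

       Hypothetical helper: fetch() must return (status_code, body_bytes).
       """
       for attempt in range(attempts):
           status_code, body = fetch()
           if status_code == 200:
               return body
           if status_code != 409:
               raise RuntimeError(f"unexpected status {status_code}")
           if attempt < attempts - 1:
               sleep(delay)  # output not available yet, wait and retry
       raise TimeoutError(f"output still unavailable after {attempts} attempts")

With ``requests``, ``fetch`` could return ``(r.status_code, r.content)`` for
a ``GET`` on the download URL.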

----

List Model Types
----------------

.. code-block:: text

   GET /api/synthgen/model-types

Return available model architectures and their descriptions.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/synthgen/model-types" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "model_types": [
       {
         "id": "ctgan",
         "name": "CTGAN",
         "description": "Conditional Tabular GAN - best for general tabular data"
       },
       {
         "id": "copulagan",
         "name": "CopulaGAN",
         "description": "Copula-based GAN - good for capturing correlations"
       },
       {
         "id": "tvae",
         "name": "TVAE",
         "description": "Tabular VAE - faster training, good for large datasets"
       },
       {
         "id": "gaussian_copula",
         "name": "Gaussian Copula",
         "description": "Statistical copula model - fastest, good baseline"
       }
     ]
   }
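
One use of this response is to validate a ``model_type`` value before
creating a model. A minimal sketch, assuming the response shape shown
above; ``supported_model_ids`` is a hypothetical helper.

.. code-block:: python

   def supported_model_ids(response):
       """Return the set of model type ids from a model-types response.

       Hypothetical helper operating on the decoded JSON body.
       """
       return {entry["id"] for entry in response["model_types"]}


   response = {
       "model_types": [
           {"id": "ctgan", "name": "CTGAN"},
           {"id": "copulagan", "name": "CopulaGAN"},
           {"id": "tvae", "name": "TVAE"},
           {"id": "gaussian_copula", "name": "Gaussian Copula"},
       ]
   }
   ids = supported_model_ids(response)
   is_valid = "tvae" in ids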

----

Full Workflow Example
---------------------

Train a CTGAN model and generate 10,000 synthetic rows:

.. code-block:: python

   import requests
   import time

   BASE_URL = "http://localhost:8888"
   HEADERS = {"Authorization": "Bearer YOUR_API_KEY"}

   # 1. Train a CTGAN model
   resp = requests.post(f"{BASE_URL}/api/synthgen/models", headers=HEADERS, json={
       "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
       "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
       "name": "Transactions CTGAN v1",
       "model_type": "ctgan",
       "config": {"epochs": 300},
   })
   model_id = resp.json()["model_id"]

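   # 2. Wait for training to reach a terminal status
   while True:
       r = requests.get(f"{BASE_URL}/api/synthgen/models/{model_id}", headers=HEADERS)
       model_status = r.json()["model"]["status"]
       if model_status in ("ready", "failed"):
           break
       time.sleep(15)
   # Generation requires a model in "ready" status
   if model_status == "failed":
       raise RuntimeError(f"Training failed for model {model_id}")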

   # 3. Generate synthetic data
   num_rows = 10000
   resp = requests.post(
       f"{BASE_URL}/api/synthgen/models/{model_id}/generate",
       headers=HEADERS,
       json={"num_rows": num_rows, "seed": 42},
   )
   job_id = resp.json()["synthgen_job_id"]

   # 4. Wait for generation to reach a terminal status
   while True:
       r = requests.get(f"{BASE_URL}/api/synthgen/jobs/{job_id}", headers=HEADERS)
       job_status = r.json()["status"]
       if job_status in ("completed", "failed"):
           break
       time.sleep(5)
   # A failed job has no output to download
   if job_status == "failed":
       raise RuntimeError(f"Generation job {job_id} failed")

   # 5. Download the result
   resp = requests.get(
       f"{BASE_URL}/api/synthgen/jobs/{job_id}/download",
       headers=HEADERS,
   )
   resp.raise_for_status()
   with open("synthetic_transactions.csv", "wb") as f:
       f.write(resp.content)
   print(f"Generated {num_rows} synthetic rows.")

----

.. seealso::

   - :doc:`datasets` -- Uploading real datasets for training.
   - :doc:`reports` -- Generating SynthGen quality reports.