============
Datasets API
============

Datasets are versioned tabular data files (CSV or Parquet) stored within a
project. Each upload creates a dataset version with schema metadata and a
data quality report.

.. contents:: Endpoints
   :local:
   :depth: 1

----

Upload Dataset
--------------

.. code-block:: text

   POST /api/datasets/upload

Upload a CSV or Parquet file as a new dataset. The request body is
``multipart/form-data``. The maximum file size is 500 MB by default and can
be changed through ``MAX_DATASET_UPLOAD_BYTES``.

**Form Fields**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Field
     - Type
     - Required
     - Description
   * - ``project_id``
     - string
     - Yes
     - UUID of the owning project.
   * - ``name``
     - string
     - Yes
     - Human-readable dataset name.
   * - ``description``
     - string
     - No
     - Optional description.
   * - ``file``
     - file
     - Yes
     - The CSV or Parquet file to upload.

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/datasets/upload" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -F "project_id=d4e5f6a7-b8c9-0123-def4-567890123456" \
     -F "name=Transactions Q4" \
     -F "description=Credit card transactions from Q4 2025" \
     -F "file=@transactions.csv"

.. code-block:: python

   import os

   import requests

   # BASE_URL is read from the environment, mirroring $BASE_URL in the curl examples.
   BASE_URL = os.environ["BASE_URL"]

   with open("transactions.csv", "rb") as f:
       resp = requests.post(f"{BASE_URL}/api/datasets/upload", headers={
           "Authorization": "Bearer YOUR_API_KEY",
       }, data={
           "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
           "name": "Transactions Q4",
           "description": "Credit card transactions from Q4 2025",
       }, files={"file": ("transactions.csv", f, "text/csv")})

   resp.raise_for_status()  # raise on 4xx/5xx instead of parsing an error body
   data = resp.json()
   print("Dataset ID:", data["dataset_id"])
   print("Version ID:", data["dataset_version_id"])

**Response** ``201 Created``

.. code-block:: json

   {
     "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
     "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
     "quality": {
       "overall_score": 92.5,
       "issues": []
     }
   }

Supported File Formats
^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Format
     - Extension
     - Notes
   * - CSV
     - ``.csv``
     - Comma-separated values. Auto-detects delimiter, encoding, and
       header row.
   * - Parquet
     - ``.parquet``
     - Apache Parquet columnar format. Preserves column types.
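
A client can reject unsupported files before sending them. The sketch below
checks the extension and size locally. ``check_upload_candidate`` is a
hypothetical helper, not part of the API. It assumes that the server still
uses the documented 500 MB default, that "500 MB" means 500 * 1024 * 1024
bytes, and that extensions are matched case-insensitively.

.. code-block:: python

   from pathlib import Path

   # Assumption: the documented 500 MB default equals 500 * 1024 * 1024 bytes.
   DEFAULT_MAX_UPLOAD_BYTES = 500 * 1024 * 1024
   SUPPORTED_EXTENSIONS = {".csv", ".parquet"}


   def check_upload_candidate(path, size_bytes=None, max_bytes=DEFAULT_MAX_UPLOAD_BYTES):
       """Return a list of problems; an empty list means the file looks uploadable."""
       p = Path(path)
       problems = []
       # Assumption: extension matching is case-insensitive.
       if p.suffix.lower() not in SUPPORTED_EXTENSIONS:
           problems.append(f"unsupported extension: {p.suffix or '(none)'}")
       # Fall back to the on-disk size when the caller does not supply one.
       if size_bytes is None:
           size_bytes = p.stat().st_size
       if size_bytes > max_bytes:
           problems.append(f"file is {size_bytes} bytes, limit is {max_bytes}")
       return problems

The server remains the authority on what it accepts; a local check like this
only saves a round trip for obviously invalid files.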

Auto-Detected Column Types
^^^^^^^^^^^^^^^^^^^^^^^^^^

The platform automatically detects column data types during upload:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Type
     - Description
   * - ``int64``
     - Integer values (whole numbers).
   * - ``float64``
     - Floating-point values (decimals).
   * - ``object``
     - String/text values, or mixed types.
   * - ``bool``
     - Boolean values (true/false).
   * - ``datetime64``
     - Date and time values (auto-parsed from common formats).
   * - ``category``
     - Categorical values (low-cardinality strings).

Download/Export Formats
^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Format
     - Description
   * - CSV
     - Default export format for dataset downloads.
   * - XLSX
     - Excel format (available for smaller datasets).

----

List Datasets
-------------

.. code-block:: text

   GET /api/datasets

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``project_id``
     - string
     - --
     - Filter by project. If omitted, returns datasets across all
       accessible projects.
   * - ``limit``
     - integer
     - 50
     - Max items (1--500).
   * - ``offset``
     - integer
     - 0
     - Pagination offset.
   * - ``search``
     - string
     - --
     - Search in name and description.
   * - ``sort_field``
     - string
     - ``created_at``
     - One of ``created_at``, ``name``, ``rows``, ``columns``,
       ``project_name``.
   * - ``sort_direction``
     - string
     - ``desc``
     - ``asc`` or ``desc``.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets?project_id=d4e5f6a7-b8c9-0123-def4-567890123456&limit=20" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   # BASE_URL is the API root, as in the curl examples ($BASE_URL).
   resp = requests.get(f"{BASE_URL}/api/datasets", headers={
       "Authorization": "Bearer YOUR_API_KEY",
   }, params={"project_id": "d4e5f6a7-b8c9-0123-def4-567890123456"})
   resp.raise_for_status()
   for ds in resp.json()["items"]:
       print(ds["name"], ds["rows"], "rows")

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "name": "Transactions Q4",
         "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
         "project_name": "Fraud Detection v2",
         "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "rows": 50000,
         "columns": 15,
         "created_at": "2026-02-12T08:00:00Z"
       }
     ],
     "total": 1,
     "limit": 20,
     "offset": 0
   }
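
Because results are paged with ``limit`` and ``offset``, listing every dataset
takes a loop. The following is a minimal sketch of one way to write it.
``iter_datasets`` and its ``fetch_page`` callback are hypothetical names; the
callback is assumed to return the parsed JSON body shown above.

.. code-block:: python

   def iter_datasets(fetch_page, limit=50):
       """Yield every dataset by advancing ``offset`` until ``total`` is reached.

       ``fetch_page(limit, offset)`` must return the parsed JSON body of
       ``GET /api/datasets`` (a dict with ``items`` and ``total``).
       """
       offset = 0
       while True:
           page = fetch_page(limit, offset)
           items = page["items"]
           yield from items
           offset += len(items)
           # Also stop on an empty page, so a shrinking total cannot loop forever.
           if not items or offset >= page["total"]:
               break

With ``requests``, ``fetch_page`` could be a small wrapper that calls
``GET /api/datasets`` with ``params={"limit": limit, "offset": offset}`` and
returns ``resp.json()``.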

----

Get Dataset Detail
------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}

Return dataset metadata and a preview of the first 20 rows.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "dataset": {
       "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
       "name": "Transactions Q4",
       "rows": 50000,
       "columns": 15,
       "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
       "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456"
     },
     "preview": [
       {"amount": 29.99, "merchant": "ACME Corp", "is_fraud": 0},
       {"amount": 1250.00, "merchant": "Luxury Ltd", "is_fraud": 1}
     ],
     "preview_error": null
   }

----

Get Column Schema
-----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/columns

Return column names and data types for the latest dataset version.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/columns" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "columns": [
       {"name": "amount", "dtype": "float64"},
       {"name": "merchant", "dtype": "object"},
       {"name": "is_fraud", "dtype": "int64"}
     ]
   }
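
A common next step is to separate numeric columns from the rest, for example
to pick candidate features. The sketch below does this for the response shape
shown above. ``split_columns`` is a hypothetical helper, and treating only
``int64`` and ``float64`` as numeric is an assumption of this example.

.. code-block:: python

   # Assumption: only these two dtype strings count as numeric here.
   NUMERIC_DTYPES = {"int64", "float64"}


   def split_columns(schema):
       """Split a ``/columns`` response into numeric and non-numeric column names."""
       numeric, other = [], []
       for col in schema["columns"]:
           target = numeric if col["dtype"] in NUMERIC_DTYPES else other
           target.append(col["name"])
       return numeric, other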

----

Download Dataset
----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/download

Download the latest version of a dataset as a file.

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``format``
     - string
     - ``csv``
     - ``csv`` or ``xlsx``.

**Example**

.. code-block:: bash

   # Download as CSV
   curl -o dataset.csv \
     "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download" \
     -H "Authorization: Bearer YOUR_API_KEY"

   # Download as Excel
   curl -o dataset.xlsx \
     "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download?format=xlsx" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

Binary file download.
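
For large exports it can help to stream the response to disk rather than hold
it in memory. This is a sketch under stated assumptions: ``download_dataset``
is a hypothetical helper, and ``session`` is expected to behave like a
``requests.Session`` (the ``stream=True``, ``raise_for_status()``, and
``iter_content()`` calls are from the ``requests`` library).

.. code-block:: python

   def download_dataset(session, base_url, dataset_id, dest_path,
                        fmt="csv", api_key="YOUR_API_KEY"):
       """Stream a dataset export to ``dest_path`` and return the bytes written."""
       if fmt not in ("csv", "xlsx"):
           raise ValueError("fmt must be 'csv' or 'xlsx'")
       url = f"{base_url}/api/datasets/{dataset_id}/download"
       written = 0
       with session.get(
           url,
           params={"format": fmt},
           headers={"Authorization": f"Bearer {api_key}"},
           stream=True,  # do not load the whole export into memory
       ) as resp:
           resp.raise_for_status()
           with open(dest_path, "wb") as out:
               # Write the body in 1 MiB chunks as it arrives.
               for chunk in resp.iter_content(chunk_size=1024 * 1024):
                   written += out.write(chunk)
       return written

A call might look like
``download_dataset(requests.Session(), BASE_URL, dataset_id, "dataset.xlsx", fmt="xlsx")``.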

----

Statistical Analysis
--------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/analyze

Compute per-column statistics including mean, standard deviation,
missing values, percentiles, skewness, kurtosis, outlier counts,
histograms (numeric columns), and top values (categorical columns).
Also returns a correlation matrix for numeric columns.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   resp = requests.get(
       f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze",
       headers={"Authorization": "Bearer YOUR_API_KEY"},
   )
   resp.raise_for_status()
   stats = resp.json()["statistics"]
   # Print the detected type and missing-value count for each column.
   for col, info in stats["columns"].items():
       print(f"{col}: type={info['type']}, missing={info['missing']}")

**Response** ``200 OK``

.. code-block:: json

   {
     "statistics": {
       "columns": {
         "amount": {
           "type": "float64",
           "unique": 4520,
           "missing": 12,
           "missing_percent": 0.00024,
           "min": 0.01,
           "max": 25000.0,
           "mean": 88.35,
           "std": 340.12,
           "percentiles": {"p25": 9.99, "p50": 29.99, "p75": 69.99},
           "skewness": 15.234,
           "kurtosis": 312.5,
           "iqr": 60.0,
           "outlier_count": 482,
           "histogram": {
             "counts": [42000, 5000, 1500, 800, 300, 200, 100, 50, 30, 20],
             "bin_edges": [0.01, 2500.0, 5000.0, 7500.0, 10000.0, 12500.0, 15000.0, 17500.0, 20000.0, 22500.0, 25000.0]
           }
         },
         "merchant": {
           "type": "object",
           "unique": 120,
           "missing": 0,
           "missing_percent": 0.0,
           "top_values": {"ACME Corp": 5200, "Luxury Ltd": 3100}
         }
       },
       "row_count": 50000,
       "duplicate_rows": 45,
       "memory_usage": 5.8,
       "correlations": {
         "amount": {"amount": 1.0, "is_fraud": 0.15}
       }
     },
     "error": null
   }
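
In the sample above, ``histogram.bin_edges`` has one more entry than
``histogram.counts``, so bin ``i`` spans ``bin_edges[i]`` to
``bin_edges[i + 1]``. The sketch below pairs them up and converts counts to
shares. ``histogram_shares`` is a hypothetical helper that assumes this
layout holds for every numeric column.

.. code-block:: python

   def histogram_shares(column_stats):
       """Return ``(low_edge, high_edge, share)`` for each histogram bin."""
       hist = column_stats["histogram"]
       counts, edges = hist["counts"], hist["bin_edges"]
       if len(edges) != len(counts) + 1:
           raise ValueError("expected one more bin edge than counts")
       total = sum(counts)
       return [
           (edges[i], edges[i + 1], counts[i] / total if total else 0.0)
           for i in range(len(counts))
       ]

Applied to the ``amount`` column in the sample response, the first bin
(0.01 to 2500.0) holds 42000 of the 50000 counted values.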

----

Data Quality Report
-------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/quality

Return the data quality report computed during upload (stored in the
dataset version).

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/quality" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "quality": {
       "overall_score": 92.5,
       "checks": [
         {"name": "missing_values", "passed": true, "details": "Max 0.02% missing"},
         {"name": "duplicate_rows", "passed": true, "details": "45 duplicates (0.09%)"}
       ]
     }
   }
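
A pipeline might gate further processing on this report. The sketch below
collects failed checks and compares the score against a threshold.
``quality_problems`` and the ``min_score`` default of 80 are hypothetical;
the example also assumes ``overall_score`` is on a 0 to 100 scale, which the
sample value suggests but this page does not state.

.. code-block:: python

   def quality_problems(report, min_score=80.0):
       """Return human-readable problems found in a ``/quality`` response."""
       quality = report["quality"]
       problems = [
           f"check failed: {check['name']} ({check['details']})"
           for check in quality["checks"]
           if not check["passed"]
       ]
       # min_score is an illustrative threshold, not a platform default.
       if quality["overall_score"] < min_score:
           problems.append(
               f"overall score {quality['overall_score']} is below {min_score}"
           )
       return problems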

----

Dataset Lineage
---------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/lineage

Trace the full lineage of a dataset: which experiments used it, which
models were trained from those experiments, and which deployments serve
those models.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/lineage" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "experiments": [
       {
         "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "name": "AutoML Run 1",
         "target_column": "is_fraud",
         "problem_type": "classification",
         "status": "succeeded",
         "model_count": 8
       }
     ],
     "models": [
       {
         "id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
         "name": "GBM_1_AutoML",
         "algorithm": "GBM",
         "experiment_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345"
       }
     ],
     "deployments": [
       {
         "id": "b8c9d0e1-f2a3-4567-8901-bcdef1234567",
         "name": "GBM_1_AutoML",
         "model_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
         "stage": "production",
         "is_active": true
       }
     ]
   }
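
The three lists are linked by ``experiment_id`` and ``model_id``. As a rough
illustration, the sketch below follows those links to find which active
deployments ultimately depend on each experiment.
``active_deployments_by_experiment`` is a hypothetical helper built only on
the fields shown in the sample response.

.. code-block:: python

   def active_deployments_by_experiment(lineage):
       """Map each experiment id to the names of active deployments serving its models."""
       model_to_experiment = {m["id"]: m["experiment_id"] for m in lineage["models"]}
       result = {e["id"]: [] for e in lineage["experiments"]}
       for dep in lineage["deployments"]:
           exp_id = model_to_experiment.get(dep["model_id"])
           # Skip inactive deployments and models whose experiment is not listed.
           if dep["is_active"] and exp_id in result:
               result[exp_id].append(dep["name"])
       return result

A non-empty list for any experiment is a signal to think twice before deleting
the dataset.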

----

Delete Dataset
--------------

.. code-block:: text

   DELETE /api/datasets/{dataset_id}

Permanently delete a dataset, all of its versions, and their artifact files.

**Example**

.. code-block:: bash

   curl -X DELETE "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "ok": true
   }

----

List Versions
-------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/versions

Return all versions of a dataset ordered by version number.

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``limit``
     - integer
     - 50
     - Max items (1--100).
   * - ``offset``
     - integer
     - 0
     - Pagination offset.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "version": 0,
         "row_count": 50000,
         "column_count": 15,
         "description": "Initial upload",
         "created_by_name": "Alice Chen",
         "created_at": "2026-02-12T08:00:00Z"
       },
       {
         "id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
         "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "version": 1,
         "row_count": 75000,
         "column_count": 15,
         "description": "Added January transactions",
         "created_by_name": "Alice Chen",
         "created_at": "2026-02-20T11:00:00Z"
       }
     ],
     "total": 2,
     "limit": 50,
     "offset": 0
   }
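
To compare the newest upload against its predecessor, a client first needs
the two highest version numbers. The sketch below extracts them from the
response above. ``latest_two_versions`` is a hypothetical helper, and it
assumes the page passed in already contains every version (that is,
``total`` does not exceed ``limit``).

.. code-block:: python

   def latest_two_versions(versions_page):
       """Return ``(previous, latest)`` version numbers, or ``None`` if fewer than two exist."""
       # Sort locally rather than relying on the order of the response.
       numbers = sorted(item["version"] for item in versions_page["items"])
       if len(numbers) < 2:
           return None
       return numbers[-2], numbers[-1]

The resulting pair can be passed as ``v1`` and ``v2`` to the compare endpoint.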

----

Upload New Version
------------------

.. code-block:: text

   POST /api/datasets/{dataset_id}/versions

Upload a new version of an existing dataset. The request body is
``multipart/form-data``.

**Form Fields**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Field
     - Type
     - Required
     - Description
   * - ``description``
     - string
     - Yes
     - Description of what changed in this version.
   * - ``file``
     - file
     - Yes
     - The CSV or Parquet file.

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -F "description=Added January transactions" \
     -F "file=@transactions_v2.csv"

**Response** ``201 Created``

.. code-block:: json

   {
     "dataset_version_id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
     "version": 1,
     "row_count": 75000,
     "column_count": 15
   }

----

Compare Versions
----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/compare

Compare two versions of a dataset. Returns schema diffs (added/removed
columns, type changes) and statistical drift metrics (PSI for numeric
columns, category changes for categorical columns).

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Required
     - Description
   * - ``v1``
     - integer
     - Yes
     - First version number.
   * - ``v2``
     - integer
     - Yes
     - Second version number.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare?v1=0&v2=1" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   resp = requests.get(
       f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare",
       headers={"Authorization": "Bearer YOUR_API_KEY"},
       params={"v1": 0, "v2": 1},
   )
   resp.raise_for_status()
   summary = resp.json()["summary"]
   # "+d" prints an explicit sign, e.g. +25000 for added rows.
   print(f"Row change: {summary['rows_diff']:+d}")

**Response** ``200 OK``

.. code-block:: json

   {
     "summary": {
       "v1": 0,
       "v2": 1,
       "v1_rows": 50000,
       "v2_rows": 75000,
       "rows_diff": 25000,
       "v1_columns": 15,
       "v2_columns": 15,
       "columns_added": [],
       "columns_removed": [],
       "columns_type_changed": []
     },
     "column_diffs": {
       "amount": {
         "type": "numeric",
         "v1_mean": 88.35,
         "v2_mean": 92.10,
         "v1_std": 340.12,
         "v2_std": 355.40,
         "missing_v1": 0.0002,
         "missing_v2": 0.0001,
         "mean_shift_pct": 4.25,
         "psi": 0.023,
         "shift_level": "low"
       }
     }
   }

PSI (Population Stability Index) interpretation:

- ``< 0.1`` -- **low** drift, distributions are similar.
- ``0.1 -- 0.25`` -- **moderate** drift, investigation recommended.
- ``> 0.25`` -- **high** drift, significant distribution change.
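
For readers who want to reproduce the metric, the sketch below computes PSI
with the standard formula, the sum over bins of
``(actual - expected) * ln(actual / expected)`` on bin proportions, and maps
the result onto the bands above. The function names, the ``eps`` clamp for
empty bins, and the choice to reuse one set of bins for both versions are
assumptions of this example; the server's binning may differ, so values need
not match ``psi`` in the response exactly.

.. code-block:: python

   import math


   def psi_from_counts(expected_counts, actual_counts, eps=1e-6):
       """Population Stability Index for two histograms that share the same bins."""
       if len(expected_counts) != len(actual_counts):
           raise ValueError("both histograms must use the same bins")
       e_total, a_total = sum(expected_counts), sum(actual_counts)
       psi = 0.0
       for e, a in zip(expected_counts, actual_counts):
           # Clamp proportions away from zero so the logarithm stays defined.
           e_pct = max(e / e_total, eps)
           a_pct = max(a / a_total, eps)
           psi += (a_pct - e_pct) * math.log(a_pct / e_pct)
       return psi


   def shift_level(psi):
       """Map a PSI value onto the bands listed above."""
       if psi < 0.1:
           return "low"
       if psi <= 0.25:
           return "moderate"
       return "high"

Identical histograms give a PSI of zero, and the sample value ``0.023`` falls
in the low band.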

----

Get Dataset Version
-------------------

.. code-block:: text

   GET /api/dataset-versions/{dataset_version_id}

Return metadata for a specific dataset version including its column
schema.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/dataset-versions/f6a7b8c9-d0e1-2345-6789-0abcdef12345" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
     "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "version": 0,
     "row_count": 50000,
     "column_count": 15,
     "columns": [
       {"name": "amount", "dtype": "float64"},
       {"name": "merchant", "dtype": "object"},
       {"name": "is_fraud", "dtype": "int64"}
     ],
     "created_at": "2026-02-12T08:00:00Z"
   }
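
Since each version carries its own ``columns`` list, two version payloads can
also be diffed on the client. The sketch below is one way to do that.
``schema_diff`` is a hypothetical helper; its keys mirror the ``summary``
fields of the compare endpoint, but reporting type changes as bare column
names is an assumption, because the sample response only shows an empty list.

.. code-block:: python

   def schema_diff(old_version, new_version):
       """Compare the ``columns`` lists of two dataset-version payloads."""
       old = {c["name"]: c["dtype"] for c in old_version["columns"]}
       new = {c["name"]: c["dtype"] for c in new_version["columns"]}
       return {
           "columns_added": sorted(new.keys() - old.keys()),
           "columns_removed": sorted(old.keys() - new.keys()),
           # Columns present in both versions whose dtype string differs.
           "columns_type_changed": sorted(
               name for name in old.keys() & new.keys() if old[name] != new[name]
           ),
       }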

----

.. seealso::

   - :doc:`experiments` -- Training models on a dataset.
   - :doc:`privacy` -- Scanning datasets for PII.