============
Datasets API
============

Datasets are versioned tabular data files (CSV or Parquet) stored within a project.
Each upload creates a dataset version with schema metadata and a data quality report.

.. contents:: Endpoints
   :local:
   :depth: 1

----

Upload Dataset
--------------

.. code-block:: text

   POST /api/datasets/upload

Upload a CSV or Parquet file as a new dataset. Uses ``multipart/form-data``.

Maximum file size: 500 MB (configurable via ``MAX_DATASET_UPLOAD_BYTES``).

**Form Fields**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Field
     - Type
     - Required
     - Description
   * - ``project_id``
     - string
     - Yes
     - UUID of the owning project.
   * - ``name``
     - string
     - Yes
     - Human-readable dataset name.
   * - ``description``
     - string
     - No
     - Optional description.
   * - ``file``
     - file
     - Yes
     - The CSV or Parquet file to upload.

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/datasets/upload" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -F "project_id=d4e5f6a7-b8c9-0123-def4-567890123456" \
     -F "name=Transactions Q4" \
     -F "description=Credit card transactions from Q4 2025" \
     -F "file=@transactions.csv"

.. code-block:: python

   import requests

   with open("transactions.csv", "rb") as f:
       resp = requests.post(f"{BASE_URL}/api/datasets/upload", headers={
           "Authorization": "Bearer YOUR_API_KEY",
       }, data={
           "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
           "name": "Transactions Q4",
           "description": "Credit card transactions from Q4 2025",
       }, files={"file": ("transactions.csv", f, "text/csv")})

   data = resp.json()
   print("Dataset ID:", data["dataset_id"])
   print("Version ID:", data["dataset_version_id"])

**Response** ``201 Created``

.. code-block:: json

   {
     "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
     "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
     "quality": {
       "overall_score": 92.5,
       "issues": []
     }
   }

Supported File Formats
^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 15 20 65

   * - Format
     - Extension
     - Notes
   * - CSV
     - ``.csv``
     - Comma-separated values. Auto-detects delimiter, encoding, and header row.
   * - Parquet
     - ``.parquet``
     - Apache Parquet columnar format. Preserves column types.

Auto-Detected Column Types
^^^^^^^^^^^^^^^^^^^^^^^^^^

The platform automatically detects column data types during upload:

.. list-table::
   :header-rows: 1
   :widths: 20 80

   * - Type
     - Description
   * - ``int64``
     - Integer values (whole numbers).
   * - ``float64``
     - Floating-point values (decimals).
   * - ``object``
     - String/text values, or mixed types.
   * - ``bool``
     - Boolean values (true/false).
   * - ``datetime64``
     - Date and time values (auto-parsed from common formats).
   * - ``category``
     - Categorical values (low-cardinality strings).

Download/Export Formats
^^^^^^^^^^^^^^^^^^^^^^^

.. list-table::
   :header-rows: 1
   :widths: 15 85

   * - Format
     - Description
   * - CSV
     - Default export format for dataset downloads.
   * - XLSX
     - Excel format (available for smaller datasets).

----

List Datasets
-------------

.. code-block:: text

   GET /api/datasets

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``project_id``
     - string
     - --
     - Filter by project. If omitted, returns datasets across all accessible projects.
   * - ``limit``
     - integer
     - 50
     - Max items (1--500).
   * - ``offset``
     - integer
     - 0
     - Pagination offset.
   * - ``search``
     - string
     - --
     - Search in name and description.
   * - ``sort_field``
     - string
     - ``created_at``
     - One of ``created_at``, ``name``, ``rows``, ``columns``, ``project_name``.
   * - ``sort_direction``
     - string
     - ``desc``
     - ``asc`` or ``desc``.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets?project_id=d4e5f6a7-b8c9-0123-def4-567890123456&limit=20" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   resp = requests.get(f"{BASE_URL}/api/datasets", headers={
       "Authorization": "Bearer YOUR_API_KEY",
   }, params={"project_id": "d4e5f6a7-b8c9-0123-def4-567890123456"})

   for ds in resp.json()["items"]:
       print(ds["name"], ds["rows"], "rows")

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "name": "Transactions Q4",
         "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
         "project_name": "Fraud Detection v2",
         "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "rows": 50000,
         "columns": 15,
         "created_at": "2026-02-12T08:00:00Z"
       }
     ],
     "total": 1,
     "limit": 20,
     "offset": 0
   }

----

Get Dataset Detail
------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}

Return dataset metadata and a preview of the first 20 rows.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "dataset": {
       "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
       "name": "Transactions Q4",
       "rows": 50000,
       "columns": 15,
       "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
       "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456"
     },
     "preview": [
       {"amount": 29.99, "merchant": "ACME Corp", "is_fraud": 0},
       {"amount": 1250.00, "merchant": "Luxury Ltd", "is_fraud": 1}
     ],
     "preview_error": null
   }

----

Get Column Schema
-----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/columns

Return column names and data types for the latest dataset version.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/columns" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "columns": [
       {"name": "amount", "dtype": "float64"},
       {"name": "merchant", "dtype": "object"},
       {"name": "is_fraud", "dtype": "int64"}
     ]
   }

----

Download Dataset
----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/download

Download the latest version of a dataset as a file.

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``format``
     - string
     - ``csv``
     - ``csv`` or ``xlsx``.

**Example**

.. code-block:: bash

   # Download as CSV
   curl -o dataset.csv \
     "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download" \
     -H "Authorization: Bearer YOUR_API_KEY"

   # Download as Excel
   curl -o dataset.xlsx \
     "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download?format=xlsx" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

Binary file download.

----

Statistical Analysis
--------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/analyze

Compute per-column statistics including mean, standard deviation, missing values,
percentiles, skewness, kurtosis, outlier counts, histograms (numeric columns), and
top values (categorical columns). Also returns a correlation matrix for numeric columns.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   resp = requests.get(
       f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze",
       headers={"Authorization": "Bearer YOUR_API_KEY"},
   )

   stats = resp.json()["statistics"]
   for col, info in stats["columns"].items():
       print(f"{col}: type={info['type']}, missing={info['missing']}")

**Response** ``200 OK``

.. code-block:: json

   {
     "statistics": {
       "columns": {
         "amount": {
           "type": "float64",
           "unique": 4520,
           "missing": 12,
           "missing_percent": 0.00024,
           "min": 0.01,
           "max": 25000.0,
           "mean": 88.35,
           "std": 340.12,
           "percentiles": {"p25": 9.99, "p50": 29.99, "p75": 69.99},
           "skewness": 15.234,
           "kurtosis": 312.5,
           "iqr": 60.0,
           "outlier_count": 482,
           "histogram": {
             "counts": [42000, 5000, 1500, 800, 300, 200, 100, 50, 30, 20],
             "bin_edges": [0.01, 2500.0, 5000.0, 7500.0, 10000.0, 12500.0,
                           15000.0, 17500.0, 20000.0, 22500.0, 25000.0]
           }
         },
         "merchant": {
           "type": "object",
           "unique": 120,
           "missing": 0,
           "missing_percent": 0.0,
           "top_values": {"ACME Corp": 5200, "Luxury Ltd": 3100}
         }
       },
       "row_count": 50000,
       "duplicate_rows": 45,
       "memory_usage": 5.8,
       "correlations": {
         "amount": {"amount": 1.0, "is_fraud": 0.15}
       }
     },
     "error": null
   }

----

Data Quality Report
-------------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/quality

Return the data quality report computed during upload (stored in the dataset version).

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/quality" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "quality": {
       "overall_score": 92.5,
       "checks": [
         {"name": "missing_values", "passed": true, "details": "Max 0.02% missing"},
         {"name": "duplicate_rows", "passed": true, "details": "45 duplicates (0.09%)"}
       ]
     }
   }

----

Dataset Lineage
---------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/lineage

Trace the full lineage of a dataset: which experiments used it, which models were
trained from those experiments, and which deployments serve those models.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/lineage" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "experiments": [
       {
         "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "name": "AutoML Run 1",
         "target_column": "is_fraud",
         "problem_type": "classification",
         "status": "succeeded",
         "model_count": 8
       }
     ],
     "models": [
       {
         "id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
         "name": "GBM_1_AutoML",
         "algorithm": "GBM",
         "experiment_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345"
       }
     ],
     "deployments": [
       {
         "id": "b8c9d0e1-f2a3-4567-8901-bcdef1234567",
         "name": "GBM_1_AutoML",
         "model_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
         "stage": "production",
         "is_active": true
       }
     ]
   }

----

Delete Dataset
--------------

.. code-block:: text

   DELETE /api/datasets/{dataset_id}

Permanently delete a dataset and all its versions and artifact files.

**Example**

.. code-block:: bash

   curl -X DELETE "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "ok": true
   }

----

List Versions
-------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/versions

Return all versions of a dataset ordered by version number.

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Default
     - Description
   * - ``limit``
     - integer
     - 50
     - Max items (1--100).
   * - ``offset``
     - integer
     - 0
     - Pagination offset.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "items": [
       {
         "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
         "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "version": 0,
         "row_count": 50000,
         "column_count": 15,
         "description": "Initial upload",
         "created_by_name": "Alice Chen",
         "created_at": "2026-02-12T08:00:00Z"
       },
       {
         "id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
         "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
         "version": 1,
         "row_count": 75000,
         "column_count": 15,
         "description": "Added January transactions",
         "created_by_name": "Alice Chen",
         "created_at": "2026-02-20T11:00:00Z"
       }
     ],
     "total": 2,
     "limit": 50,
     "offset": 0
   }

----

Upload New Version
------------------

.. code-block:: text

   POST /api/datasets/{dataset_id}/versions

Upload a new version of an existing dataset. Uses ``multipart/form-data``.

**Form Fields**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Field
     - Type
     - Required
     - Description
   * - ``description``
     - string
     - Yes
     - Description of what changed in this version.
   * - ``file``
     - file
     - Yes
     - The CSV or Parquet file.

**Example**

.. code-block:: bash

   curl -X POST "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
     -H "Authorization: Bearer YOUR_API_KEY" \
     -F "description=Added January transactions" \
     -F "file=@transactions_v2.csv"

**Response** ``201 Created``

.. code-block:: json

   {
     "dataset_version_id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
     "version": 1,
     "row_count": 75000,
     "column_count": 15
   }

----

Compare Versions
----------------

.. code-block:: text

   GET /api/datasets/{dataset_id}/compare?v1=0&v2=1

Compare two versions of a dataset. Returns schema diffs (added/removed columns,
type changes) and statistical drift metrics (PSI for numeric columns, category
changes for categorical columns).

**Query Parameters**

.. list-table::
   :header-rows: 1
   :widths: 20 10 10 60

   * - Parameter
     - Type
     - Required
     - Description
   * - ``v1``
     - integer
     - Yes
     - First version number.
   * - ``v2``
     - integer
     - Yes
     - Second version number.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare?v1=0&v2=1" \
     -H "Authorization: Bearer YOUR_API_KEY"

.. code-block:: python

   import requests

   resp = requests.get(
       f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare",
       headers={"Authorization": "Bearer YOUR_API_KEY"},
       params={"v1": 0, "v2": 1},
   )

   summary = resp.json()["summary"]
   print(f"Row change: {summary['rows_diff']:+d}")

**Response** ``200 OK``

.. code-block:: json

   {
     "summary": {
       "v1": 0,
       "v2": 1,
       "v1_rows": 50000,
       "v2_rows": 75000,
       "rows_diff": 25000,
       "v1_columns": 15,
       "v2_columns": 15,
       "columns_added": [],
       "columns_removed": [],
       "columns_type_changed": []
     },
     "column_diffs": {
       "amount": {
         "type": "numeric",
         "v1_mean": 88.35,
         "v2_mean": 92.10,
         "v1_std": 340.12,
         "v2_std": 355.40,
         "missing_v1": 0.0002,
         "missing_v2": 0.0001,
         "mean_shift_pct": 4.25,
         "psi": 0.023,
         "shift_level": "low"
       }
     }
   }

PSI (Population Stability Index) interpretation:

- ``< 0.1`` -- **low** drift, distributions are similar.
- ``0.1 -- 0.25`` -- **moderate** drift, investigation recommended.
- ``> 0.25`` -- **high** drift, significant distribution change.

----

Get Dataset Version
-------------------

.. code-block:: text

   GET /api/dataset-versions/{dataset_version_id}

Return metadata for a specific dataset version including its column schema.

**Example**

.. code-block:: bash

   curl "$BASE_URL/api/dataset-versions/f6a7b8c9-d0e1-2345-6789-0abcdef12345" \
     -H "Authorization: Bearer YOUR_API_KEY"

**Response** ``200 OK``

.. code-block:: json

   {
     "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
     "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
     "version": 0,
     "row_count": 50000,
     "column_count": 15,
     "columns": [
       {"name": "amount", "dtype": "float64"},
       {"name": "merchant", "dtype": "object"},
       {"name": "is_fraud", "dtype": "int64"}
     ],
     "created_at": "2026-02-12T08:00:00Z"
   }

----

.. seealso::

   - :doc:`experiments` -- Training models on a dataset.
   - :doc:`privacy` -- Scanning datasets for PII.
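
The PSI thresholds listed under Compare Versions can be checked locally with a rough sketch like the one below. It is not the platform's implementation: how the API bins values is not documented here, so the code assumes ten equal-width bins over the baseline range and a small ``eps`` floor for empty bins. The function names ``psi`` and ``shift_level`` are hypothetical.

```python
import math


def psi(baseline, current, bins=10, eps=1e-6):
    """Population Stability Index between two numeric samples.

    Assumption: equal-width bins over the baseline range. The API's
    actual binning strategy is not documented and may differ.
    """
    lo, hi = min(baseline), max(baseline)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values):
        counts = [0] * bins
        for v in values:
            # Clamp out-of-range values into the first or last bin.
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1
        total = len(values)
        # The eps floor keeps empty bins from producing log(0).
        return [max(c / total, eps) for c in counts]

    expected = proportions(baseline)
    actual = proportions(current)
    return sum((a - e) * math.log(a / e) for e, a in zip(expected, actual))


def shift_level(value):
    """Map a PSI value to the documented low / moderate / high labels."""
    if value < 0.1:
        return "low"
    if value <= 0.25:
        return "moderate"
    return "high"


if __name__ == "__main__":
    v1 = list(range(100))        # stand-in for a column in version 0
    v2 = [x + 50 for x in v1]    # same column, strongly shifted
    print(shift_level(psi(v1, v1)))
    print(shift_level(psi(v1, v2)))
```

Because the binning is an assumption, values from this sketch may differ from the ``psi`` field returned by the compare endpoint. Only the mapping from PSI value to label follows the documented thresholds.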