Datasets API

Datasets are versioned tabular data files (CSV or Parquet) stored within a project. Each upload creates a dataset version with schema metadata and a data quality report.


Upload Dataset

POST /api/datasets/upload

Upload a CSV or Parquet file as a new dataset. Uses multipart/form-data. Maximum file size: 500 MB (configurable via MAX_DATASET_UPLOAD_BYTES).

Form Fields

Field        Type    Required  Description
-----------  ------  --------  ----------------------------------
project_id   string  Yes       UUID of the owning project.
name         string  Yes       Human-readable dataset name.
description  string  No        Optional description.
file         file    Yes       The CSV or Parquet file to upload.

Example

curl -X POST "$BASE_URL/api/datasets/upload" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "project_id=d4e5f6a7-b8c9-0123-def4-567890123456" \
  -F "name=Transactions Q4" \
  -F "description=Credit card transactions from Q4 2025" \
  -F "file=@transactions.csv"

# Equivalent request in Python (BASE_URL holds the same value as $BASE_URL)
import requests

with open("transactions.csv", "rb") as f:
    resp = requests.post(f"{BASE_URL}/api/datasets/upload", headers={
        "Authorization": "Bearer YOUR_API_KEY",
    }, data={
        "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
        "name": "Transactions Q4",
        "description": "Credit card transactions from Q4 2025",
    }, files={"file": ("transactions.csv", f, "text/csv")})

resp.raise_for_status()  # fail early on 4xx/5xx instead of parsing an error body
data = resp.json()
print("Dataset ID:", data["dataset_id"])
print("Version ID:", data["dataset_version_id"])

Response 201 Created

{
  "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
  "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
  "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
  "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
  "quality": {
    "overall_score": 92.5,
    "issues": []
  }
}

Supported File Formats

Format   Extension  Notes
-------  ---------  -------------------------------------------------------------------------
CSV      .csv       Comma-separated values. Auto-detects delimiter, encoding, and header row.
Parquet  .parquet   Apache Parquet columnar format. Preserves column types.
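
A client can catch the most common rejections before sending any bytes. The sketch below checks a file against the two documented formats and the default 500 MB limit. The names check_upload_candidate, MAX_UPLOAD_BYTES, and ALLOWED_EXTENSIONS are illustrative and not part of the API, and the limit is assumed to mean 500 × 1024 × 1024 bytes; a server configured through MAX_DATASET_UPLOAD_BYTES may differ.

```python
from pathlib import Path

# Assumption: mirrors the documented default; the server-side limit is
# configurable through MAX_DATASET_UPLOAD_BYTES and may not match.
MAX_UPLOAD_BYTES = 500 * 1024 * 1024
ALLOWED_EXTENSIONS = {".csv", ".parquet"}


def check_upload_candidate(filename: str, size_bytes: int) -> list[str]:
    """Return a list of problems; an empty list means the file looks uploadable."""
    problems = []
    extension = Path(filename).suffix.lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported extension: {extension or '(none)'}")
    if size_bytes > MAX_UPLOAD_BYTES:
        problems.append(f"file is {size_bytes} bytes, limit is {MAX_UPLOAD_BYTES}")
    return problems
```

For a file on disk, the size can be taken from Path(filename).stat().st_size. The server remains the authority; this only avoids a round trip for obvious mistakes.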

Auto-Detected Column Types

The platform automatically detects column data types during upload:

Type        Description
----------  -------------------------------------------------------
int64       Integer values (whole numbers).
float64     Floating-point values (decimals).
object      String/text values, or mixed types.
bool        Boolean values (true/false).
datetime64  Date and time values (auto-parsed from common formats).
category    Categorical values (low-cardinality strings).

Download/Export Formats

Format  Description
------  ----------------------------------------------
CSV     Default export format for dataset downloads.
XLSX    Excel format (available for smaller datasets).


List Datasets

GET /api/datasets

Query Parameters

Parameter       Type     Default     Description
--------------  -------  ----------  -------------------------------------------------------------------------------
project_id      string   (none)      Filter by project. If omitted, returns datasets across all accessible projects.
limit           integer  50          Max items (1–500).
offset          integer  0           Pagination offset.
search          string   (none)      Search in name and description.
sort_field      string   created_at  One of created_at, name, rows, columns, project_name.
sort_direction  string   desc        asc or desc.

Example

curl "$BASE_URL/api/datasets?project_id=d4e5f6a7-b8c9-0123-def4-567890123456&limit=20" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Equivalent request in Python (BASE_URL holds the same value as $BASE_URL)
import requests

resp = requests.get(f"{BASE_URL}/api/datasets", headers={
    "Authorization": "Bearer YOUR_API_KEY",
}, params={"project_id": "d4e5f6a7-b8c9-0123-def4-567890123456", "limit": 20})
for ds in resp.json()["items"]:
    print(ds["name"], ds["rows"], "rows")

Response 200 OK

{
  "items": [
    {
      "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
      "name": "Transactions Q4",
      "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
      "project_name": "Fraud Detection v2",
      "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
      "rows": 50000,
      "columns": 15,
      "created_at": "2026-02-12T08:00:00Z"
    }
  ],
  "total": 1,
  "limit": 20,
  "offset": 0
}
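
Because the response carries total, limit, and offset, a client can walk every page until the offset reaches the total. The following is a minimal sketch of that loop. fetch_page is a hypothetical callable, not part of the API: it is assumed to perform the GET with the given limit and offset and return the decoded JSON body.

```python
from typing import Callable, Iterator


def iter_all_items(fetch_page: Callable[[int, int], dict], limit: int = 50) -> Iterator[dict]:
    """Yield every item from a limit/offset paginated endpoint.

    fetch_page(limit, offset) is a hypothetical callable that performs the
    request and returns the decoded JSON body with "items" and "total".
    """
    offset = 0
    while True:
        page = fetch_page(limit, offset)
        items = page["items"]
        yield from items
        offset += len(items)
        # Stop on an empty page too, so a shrinking total cannot loop forever.
        if not items or offset >= page["total"]:
            break
```

With requests, fetch_page could be a small lambda that calls requests.get on /api/datasets with params={"limit": limit, "offset": offset} and returns resp.json(). The same helper applies to the versions listing, which uses the same envelope.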

Get Dataset Detail

GET /api/datasets/{dataset_id}

Return dataset metadata and a preview of the first 20 rows.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "dataset": {
    "id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
    "name": "Transactions Q4",
    "rows": 50000,
    "columns": 15,
    "dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
    "artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456"
  },
  "preview": [
    {"amount": 29.99, "merchant": "ACME Corp", "is_fraud": 0},
    {"amount": 1250.00, "merchant": "Luxury Ltd", "is_fraud": 1}
  ],
  "preview_error": null
}

Get Column Schema

GET /api/datasets/{dataset_id}/columns

Return column names and data types for the latest dataset version.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/columns" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "columns": [
    {"name": "amount", "dtype": "float64"},
    {"name": "merchant", "dtype": "object"},
    {"name": "is_fraud", "dtype": "int64"}
  ]
}
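
Downstream code often only needs to know whether a column is numeric, text, or something else. The sketch below groups the columns array by a coarse kind derived from the dtype names listed under Auto-Detected Column Types. The kind labels and the function name are illustrative. It also assumes that a dtype may arrive with a bracketed suffix such as "datetime64[ns]", which the examples in this document do not confirm.

```python
from collections import defaultdict

# Coarse kinds for the documented dtype names; the labels are illustrative.
DTYPE_KINDS = {
    "int64": "numeric",
    "float64": "numeric",
    "object": "text",
    "bool": "boolean",
    "datetime64": "datetime",
    "category": "categorical",
}


def group_columns_by_kind(columns: list[dict]) -> dict[str, list[str]]:
    """Map each kind to the column names of that kind, preserving column order."""
    groups = defaultdict(list)
    for col in columns:
        # Assumption: strip any "[...]" suffix before looking up the dtype.
        base = col["dtype"].split("[", 1)[0]
        groups[DTYPE_KINDS.get(base, "other")].append(col["name"])
    return dict(groups)
```

Applied to the response above, this puts amount and is_fraud under "numeric" and merchant under "text".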

Download Dataset

GET /api/datasets/{dataset_id}/download

Download the latest version of a dataset as a file.

Query Parameters

Parameter  Type    Default  Description
---------  ------  -------  ------------
format     string  csv      csv or xlsx.

Example

# Download as CSV
curl -o dataset.csv \
  "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Download as Excel
curl -o dataset.xlsx \
  "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download?format=xlsx" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

Binary file download.
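
Exports can be large, so it is usually better to stream the body to disk than to hold it in memory. The sketch below does this with the Python standard library (urllib rather than requests, so it has no third-party dependency). The function names are illustrative, and the format check reflects only the two values documented above.

```python
import shutil
import urllib.parse
import urllib.request

EXPORT_FORMATS = {"csv", "xlsx"}  # the two documented values of `format`


def build_download_url(base_url: str, dataset_id: str, fmt: str = "csv") -> str:
    """Build the download URL, rejecting formats the endpoint does not document."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"format must be one of {sorted(EXPORT_FORMATS)}, got {fmt!r}")
    path = f"/api/datasets/{urllib.parse.quote(dataset_id, safe='')}/download"
    query = urllib.parse.urlencode({"format": fmt})
    return f"{base_url.rstrip('/')}{path}?{query}"


def download_dataset(base_url: str, dataset_id: str, api_key: str,
                     dest_path: str, fmt: str = "csv") -> None:
    """Stream the export to dest_path."""
    request = urllib.request.Request(
        build_download_url(base_url, dataset_id, fmt),
        headers={"Authorization": f"Bearer {api_key}"},
    )
    # copyfileobj reads in chunks, so the file is never fully held in memory.
    with urllib.request.urlopen(request) as response, open(dest_path, "wb") as out:
        shutil.copyfileobj(response, out)
```

urlopen raises urllib.error.HTTPError on 4xx and 5xx responses, so a failed download does not silently write an error body to the destination file's caller; note that the destination file is still created before the error surfaces if the open succeeds first, so callers may want to remove it on failure.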


Statistical Analysis

GET /api/datasets/{dataset_id}/analyze

Compute per-column statistics including mean, standard deviation, missing values, percentiles, skewness, kurtosis, outlier counts, histograms (numeric columns), and top values (categorical columns). Also returns a correlation matrix for numeric columns.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Equivalent request in Python (BASE_URL holds the same value as $BASE_URL)
import requests

resp = requests.get(
    f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
stats = resp.json()["statistics"]
for col, info in stats["columns"].items():
    print(f"{col}: type={info['type']}, missing={info['missing']}")

Response 200 OK

{
  "statistics": {
    "columns": {
      "amount": {
        "type": "float64",
        "unique": 4520,
        "missing": 12,
        "missing_percent": 0.00024,
        "min": 0.01,
        "max": 25000.0,
        "mean": 88.35,
        "std": 340.12,
        "percentiles": {"p25": 9.99, "p50": 29.99, "p75": 69.99},
        "skewness": 15.234,
        "kurtosis": 312.5,
        "iqr": 60.0,
        "outlier_count": 482,
        "histogram": {
          "counts": [42000, 5000, 1500, 800, 300, 200, 100, 50, 30, 20],
          "bin_edges": [0.01, 2500.0, 5000.0, 7500.0, 10000.0, 12500.0, 15000.0, 17500.0, 20000.0, 22500.0, 25000.0]
        }
      },
      "merchant": {
        "type": "object",
        "unique": 120,
        "missing": 0,
        "missing_percent": 0.0,
        "top_values": {"ACME Corp": 5200, "Luxury Ltd": 3100}
      }
    },
    "row_count": 50000,
    "duplicate_rows": 45,
    "memory_usage": 5.8,
    "correlations": {
      "amount": {"amount": 1.0, "is_fraud": 0.15}
    }
  },
  "error": null
}
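
In the histogram object, bin_edges has one more entry than counts: bin i spans bin_edges[i] to bin_edges[i + 1]. The sketch below pairs them up and adds each bin's share of the total, which is often the first step before plotting. The function name is illustrative, and the one-more-edge layout is inferred from the example response rather than stated by the API.

```python
def histogram_bins(histogram: dict) -> list[tuple[float, float, int, float]]:
    """Return (low, high, count, share) for each bin of a histogram object."""
    counts = histogram["counts"]
    edges = histogram["bin_edges"]
    # Assumption taken from the example response: N counts come with N + 1 edges.
    if len(edges) != len(counts) + 1:
        raise ValueError("expected one more bin edge than counts")
    total = sum(counts)
    return [
        (low, high, count, count / total if total else 0.0)
        for low, high, count in zip(edges[:-1], edges[1:], counts)
    ]
```

For the amount column above, the first tuple covers 0.01 to 2500.0 and holds 42000 of the 50000 counted values, a share of 0.84, which matches the strong right skew reported in skewness.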

Data Quality Report

GET /api/datasets/{dataset_id}/quality

Return the data quality report computed during upload (stored in the dataset version).

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/quality" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "quality": {
    "overall_score": 92.5,
    "checks": [
      {"name": "missing_values", "passed": true, "details": "Max 0.02% missing"},
      {"name": "duplicate_rows", "passed": true, "details": "45 duplicates (0.09%)"}
    ]
  }
}
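
A pipeline can use this report as a gate before training. The sketch below lists the checks that did not pass and combines them with a minimum score. Both function names and the 90.0 threshold are hypothetical; the API does not define what score counts as acceptable.

```python
def failed_checks(report: dict) -> list[str]:
    """Return "name: details" for every check that did not pass."""
    checks = report.get("quality", {}).get("checks", [])
    return [
        f"{check['name']}: {check.get('details', '')}"
        for check in checks
        if not check.get("passed", False)
    ]


def is_acceptable(report: dict, min_score: float = 90.0) -> bool:
    """True when the score meets min_score (a hypothetical threshold) and no check failed."""
    score = report.get("quality", {}).get("overall_score", 0.0)
    return score >= min_score and not failed_checks(report)
```

Note that the upload response shows the quality object with an issues list rather than checks; this helper reads only the checks shape returned by this endpoint.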

Dataset Lineage

GET /api/datasets/{dataset_id}/lineage

Trace the full lineage of a dataset: which experiments used it, which models were trained from those experiments, and which deployments serve those models.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/lineage" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "experiments": [
    {
      "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
      "name": "AutoML Run 1",
      "target_column": "is_fraud",
      "problem_type": "classification",
      "status": "succeeded",
      "model_count": 8
    }
  ],
  "models": [
    {
      "id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
      "name": "GBM_1_AutoML",
      "algorithm": "GBM",
      "experiment_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345"
    }
  ],
  "deployments": [
    {
      "id": "b8c9d0e1-f2a3-4567-8901-bcdef1234567",
      "name": "GBM_1_AutoML",
      "model_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
      "stage": "production",
      "is_active": true
    }
  ]
}
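
The three arrays are flat, but they link to each other through experiment_id and model_id. The sketch below joins them into a nested structure, one entry per experiment, which is convenient for printing or for checking whether a dataset feeds anything in production. The function name and the output shape are illustrative.

```python
from collections import defaultdict


def lineage_tree(lineage: dict) -> list[dict]:
    """Nest models under their experiment and deployments under their model."""
    models_by_experiment = defaultdict(list)
    for model in lineage.get("models", []):
        models_by_experiment[model["experiment_id"]].append(model)

    deployments_by_model = defaultdict(list)
    for deployment in lineage.get("deployments", []):
        deployments_by_model[deployment["model_id"]].append(deployment)

    return [
        {
            "experiment": experiment["name"],
            "models": [
                {
                    "model": model["name"],
                    "deployments": [d["name"] for d in deployments_by_model[model["id"]]],
                }
                for model in models_by_experiment[experiment["id"]]
            ],
        }
        for experiment in lineage.get("experiments", [])
    ]
```

For the response above, this yields a single experiment, "AutoML Run 1", containing the model "GBM_1_AutoML" with one deployment of the same name.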

Delete Dataset

DELETE /api/datasets/{dataset_id}

Permanently delete a dataset and all its versions and artifact files.

Example

curl -X DELETE "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "ok": true
}

List Versions

GET /api/datasets/{dataset_id}/versions

Return all versions of a dataset ordered by version number.

Query Parameters

Parameter  Type     Default  Description
---------  -------  -------  ------------------
limit      integer  50       Max items (1–100).
offset     integer  0        Pagination offset.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "items": [
    {
      "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
      "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
      "version": 0,
      "row_count": 50000,
      "column_count": 15,
      "description": "Initial upload",
      "created_by_name": "Alice Chen",
      "created_at": "2026-02-12T08:00:00Z"
    },
    {
      "id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
      "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
      "version": 1,
      "row_count": 75000,
      "column_count": 15,
      "description": "Added January transactions",
      "created_by_name": "Alice Chen",
      "created_at": "2026-02-20T11:00:00Z"
    }
  ],
  "total": 2,
  "limit": 50,
  "offset": 0
}
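
Two small helpers cover the usual questions asked of this list: which version is newest, and how much the dataset grew between consecutive versions. This is a sketch over the items array only; the function names are illustrative, and both helpers sort by version themselves rather than relying on the response order.

```python
def latest_version(items: list[dict]) -> dict:
    """Return the entry with the highest version number."""
    return max(items, key=lambda item: item["version"])


def row_growth(items: list[dict]) -> list[tuple[int, int, int]]:
    """Return (from_version, to_version, row_count change) for consecutive versions."""
    ordered = sorted(items, key=lambda item: item["version"])
    return [
        (older["version"], newer["version"], newer["row_count"] - older["row_count"])
        for older, newer in zip(ordered, ordered[1:])
    ]
```

For the response above, latest_version returns the version 1 entry, and row_growth reports a single step from version 0 to version 1 with 25000 additional rows.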

Upload New Version

POST /api/datasets/{dataset_id}/versions

Upload a new version of an existing dataset. Uses multipart/form-data.

Form Fields

Field        Type    Required  Description
-----------  ------  --------  --------------------------------------------
description  string  Yes       Description of what changed in this version.
file         file    Yes       The CSV or Parquet file.

Example

curl -X POST "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "description=Added January transactions" \
  -F "file=@transactions_v2.csv"

Response 201 Created

{
  "dataset_version_id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
  "version": 1,
  "row_count": 75000,
  "column_count": 15
}

Compare Versions

GET /api/datasets/{dataset_id}/compare

Compare two versions of a dataset. Returns schema diffs (added/removed columns, type changes) and statistical drift metrics (PSI for numeric columns, category changes for categorical columns).

Query Parameters

Parameter  Type     Required  Description
---------  -------  --------  ----------------------
v1         integer  Yes       First version number.
v2         integer  Yes       Second version number.

Example

curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare?v1=0&v2=1" \
  -H "Authorization: Bearer YOUR_API_KEY"

# Equivalent request in Python (BASE_URL holds the same value as $BASE_URL)
import requests

resp = requests.get(
    f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"v1": 0, "v2": 1},
)
summary = resp.json()["summary"]
print(f"Row change: {summary['rows_diff']:+d}")

Response 200 OK

{
  "summary": {
    "v1": 0,
    "v2": 1,
    "v1_rows": 50000,
    "v2_rows": 75000,
    "rows_diff": 25000,
    "v1_columns": 15,
    "v2_columns": 15,
    "columns_added": [],
    "columns_removed": [],
    "columns_type_changed": []
  },
  "column_diffs": {
    "amount": {
      "type": "numeric",
      "v1_mean": 88.35,
      "v2_mean": 92.10,
      "v1_std": 340.12,
      "v2_std": 355.40,
      "missing_v1": 0.0002,
      "missing_v2": 0.0001,
      "mean_shift_pct": 4.25,
      "psi": 0.023,
      "shift_level": "low"
    }
  }
}

PSI (Population Stability Index) interpretation:

  • < 0.1: low drift, distributions are similar.

  • 0.1 to 0.25: moderate drift, investigation recommended.

  • > 0.25: high drift, significant distribution change.
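
For readers who want to reproduce the metric locally, the sketch below computes PSI with its standard definition, the sum over bins of (actual share − expected share) × ln(actual share / expected share), and then applies the thresholds listed above. How the platform bins values and handles empty bins is not documented here, so the eps floor is an assumption, and the "moderate" and "high" labels are taken from the list above rather than from an observed shift_level value.

```python
import math


def psi(expected_counts: list[float], actual_counts: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index between two histograms that share the same bins."""
    if len(expected_counts) != len(actual_counts):
        raise ValueError("both histograms must use the same number of bins")
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    if expected_total <= 0 or actual_total <= 0:
        raise ValueError("both histograms must contain at least one observation")
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Assumption: floor each share at eps so an empty bin does not hit log(0).
        expected_share = max(expected / expected_total, eps)
        actual_share = max(actual / actual_total, eps)
        total += (actual_share - expected_share) * math.log(actual_share / expected_share)
    return total


def shift_level(psi_value: float) -> str:
    """Apply the documented thresholds: < 0.1 low, 0.1 to 0.25 moderate, > 0.25 high."""
    if psi_value < 0.1:
        return "low"
    if psi_value <= 0.25:
        return "moderate"
    return "high"
```

Identical distributions give a PSI of 0, and the value grows as mass moves between bins. With the example response above, shift_level(0.023) returns "low", matching the reported shift_level.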


Get Dataset Version

GET /api/dataset-versions/{dataset_version_id}

Return metadata for a specific dataset version including its column schema.

Example

curl "$BASE_URL/api/dataset-versions/f6a7b8c9-d0e1-2345-6789-0abcdef12345" \
  -H "Authorization: Bearer YOUR_API_KEY"

Response 200 OK

{
  "id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
  "dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
  "version": 0,
  "row_count": 50000,
  "column_count": 15,
  "columns": [
    {"name": "amount", "dtype": "float64"},
    {"name": "merchant", "dtype": "object"},
    {"name": "is_fraud", "dtype": "int64"}
  ],
  "created_at": "2026-02-12T08:00:00Z"
}
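
Fetching two versions this way gives enough information to compute a schema diff on the client. The sketch below compares two columns arrays and reports added, removed, and retyped columns. The result keys reuse the names from the Compare Versions summary, but the server's exact element format for those lists is not shown in this document, so plain column names are an assumption.

```python
def schema_diff(old_columns: list[dict], new_columns: list[dict]) -> dict[str, list[str]]:
    """Compare two "columns" arrays as returned for a dataset version."""
    old = {col["name"]: col["dtype"] for col in old_columns}
    new = {col["name"]: col["dtype"] for col in new_columns}
    shared = old.keys() & new.keys()
    return {
        "columns_added": sorted(new.keys() - old.keys()),
        "columns_removed": sorted(old.keys() - new.keys()),
        "columns_type_changed": sorted(name for name in shared if old[name] != new[name]),
    }
```

Passing the same columns array twice returns three empty lists, which corresponds to the unchanged schema in the comparison example.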

See also