Datasets API
Datasets are versioned tabular data files (CSV or Parquet) stored within a project. Each upload creates a dataset version with schema metadata and a data quality report.
Upload Dataset
POST /api/datasets/upload
Upload a CSV or Parquet file as a new dataset. Uses multipart/form-data.
Maximum file size: 500 MB (configurable via MAX_DATASET_UPLOAD_BYTES).
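A client can check the file size locally before attempting an upload. The following is a minimal sketch, not part of the API: `fits_upload_limit` is a hypothetical helper, and it assumes "500 MB" means 500 × 1024 × 1024 bytes (the documentation does not say whether the limit is binary or decimal). If your deployment overrides MAX_DATASET_UPLOAD_BYTES, pass that value as `max_bytes`.

```python
import os

# Assumption: the default "500 MB" limit is 500 * 1024 * 1024 bytes.
DEFAULT_MAX_UPLOAD_BYTES = 500 * 1024 * 1024


def fits_upload_limit(path, max_bytes=DEFAULT_MAX_UPLOAD_BYTES):
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes
```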
Form Fields
| Field | Type | Required | Description |
|---|---|---|---|
| project_id | string | Yes | UUID of the owning project. |
| name | string | Yes | Human-readable dataset name. |
| description | string | No | Optional description. |
| file | file | Yes | The CSV or Parquet file to upload. |
Example
curl -X POST "$BASE_URL/api/datasets/upload" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "project_id=d4e5f6a7-b8c9-0123-def4-567890123456" \
-F "name=Transactions Q4" \
-F "description=Credit card transactions from Q4 2025" \
-F "file=@transactions.csv"
import os
import requests

BASE_URL = os.environ["BASE_URL"]  # same value as $BASE_URL in the curl example

with open("transactions.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/upload",
        headers={"Authorization": "Bearer YOUR_API_KEY"},
        data={
            "project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
            "name": "Transactions Q4",
            "description": "Credit card transactions from Q4 2025",
        },
        files={"file": ("transactions.csv", f, "text/csv")},
    )
resp.raise_for_status()  # fail early on a non-2xx response
data = resp.json()
print("Dataset ID:", data["dataset_id"])
print("Version ID:", data["dataset_version_id"])
Response 201 Created
{
"id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
"quality": {
"overall_score": 92.5,
"issues": []
}
}
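The upload response already carries the quality summary, so a client can decide right away whether the new dataset needs a closer look. A minimal sketch under stated assumptions: `upload_needs_review` and the `min_score` threshold of 80 are hypothetical, and the score is assumed to be on a 0 to 100 scale, as the example value 92.5 suggests.

```python
def upload_needs_review(payload, min_score=80.0):
    """Return True if the upload response reports issues or a low score.

    `payload` is the parsed JSON body of the upload response.
    `min_score` is an illustrative threshold, not a platform default.
    """
    quality = payload.get("quality") or {}
    score = quality.get("overall_score")
    issues = quality.get("issues") or []
    # A missing score is treated as "needs review" to stay on the safe side.
    return score is None or score < min_score or len(issues) > 0
```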
Supported File Formats
| Format | Extension | Notes |
|---|---|---|
| CSV | .csv | Comma-separated values. Auto-detects delimiter, encoding, and header row. |
| Parquet | .parquet | Apache Parquet columnar format. Preserves column types. |
Auto-Detected Column Types
The platform automatically detects column data types during upload:
| Type | Description |
|---|---|
|  | Integer values (whole numbers). |
|  | Floating-point values (decimals). |
|  | String/text values, or mixed types. |
|  | Boolean values (true/false). |
|  | Date and time values (auto-parsed from common formats). |
|  | Categorical values (low-cardinality strings). |
Download/Export Formats
| Format | Description |
|---|---|
| CSV | Default export format for dataset downloads. |
| XLSX | Excel format (available for smaller datasets). |
List Datasets
GET /api/datasets
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| project_id | string | – | Filter by project. If omitted, returns datasets across all accessible projects. |
| limit | integer | 50 | Max items (1–500). |
| offset | integer | 0 | Pagination offset. |
|  | string | – | Search in name and description. |
|  | string |  | One of |
|  | string |  |  |
Example
curl "$BASE_URL/api/datasets?project_id=d4e5f6a7-b8c9-0123-def4-567890123456&limit=20" \
-H "Authorization: Bearer YOUR_API_KEY"
resp = requests.get(
    f"{BASE_URL}/api/datasets",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"project_id": "d4e5f6a7-b8c9-0123-def4-567890123456", "limit": 20},
)
for ds in resp.json()["items"]:
    print(ds["name"], ds["rows"], "rows")
Response 200 OK
{
"items": [
{
"id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"name": "Transactions Q4",
"project_id": "d4e5f6a7-b8c9-0123-def4-567890123456",
"project_name": "Fraud Detection v2",
"dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"rows": 50000,
"columns": 15,
"created_at": "2026-02-12T08:00:00Z"
}
],
"total": 1,
"limit": 20,
"offset": 0
}
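Because the response reports `total`, `limit`, and `offset`, a client can walk the full list page by page. A minimal sketch: `iter_items` is a hypothetical helper, and `fetch_page` is a callable you supply that performs the HTTP request and returns the parsed JSON body.

```python
def iter_items(fetch_page, limit=50):
    """Yield every item by advancing `offset` until `total` is reached.

    `fetch_page(limit=..., offset=...)` must return a dict shaped like the
    list response: {"items": [...], "total": int, ...}.
    """
    offset = 0
    while True:
        page = fetch_page(limit=limit, offset=offset)
        items = page["items"]
        yield from items
        offset += len(items)
        # Stop on an empty page as well, so a changing total cannot loop forever.
        if not items or offset >= page["total"]:
            break
```

With `requests`, `fetch_page` could wrap a GET to /api/datasets that passes `limit` and `offset` as query parameters and returns `resp.json()`.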
Get Dataset Detail
GET /api/datasets/{dataset_id}
Return dataset metadata and a preview of the first 20 rows.
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"dataset": {
"id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"name": "Transactions Q4",
"rows": 50000,
"columns": 15,
"dataset_version_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"artifact_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456"
},
"preview": [
{"amount": 29.99, "merchant": "ACME Corp", "is_fraud": 0},
{"amount": 1250.00, "merchant": "Luxury Ltd", "is_fraud": 1}
],
"preview_error": null
}
Get Column Schema
GET /api/datasets/{dataset_id}/columns
Return column names and data types for the latest dataset version.
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/columns" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"columns": [
{"name": "amount", "dtype": "float64"},
{"name": "merchant", "dtype": "object"},
{"name": "is_fraud", "dtype": "int64"}
]
}
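The schema can be used to separate numeric columns from the rest, for example before choosing features. A minimal sketch: `split_columns` is a hypothetical helper, and it assumes dtype strings follow the NumPy/pandas naming seen in the example response (`int64`, `float64`, `object`).

```python
# Assumption: numeric dtypes are reported with these name prefixes.
NUMERIC_PREFIXES = ("int", "uint", "float")


def split_columns(schema):
    """Split the columns response into (numeric_names, other_names)."""
    numeric, other = [], []
    for col in schema["columns"]:
        is_numeric = col["dtype"].startswith(NUMERIC_PREFIXES)
        (numeric if is_numeric else other).append(col["name"])
    return numeric, other
```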
Download Dataset
GET /api/datasets/{dataset_id}/download
Download the latest version of a dataset as a file.
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| format | string | csv | Export format: csv or xlsx. |
Example
# Download as CSV
curl -o dataset.csv \
"$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download" \
-H "Authorization: Bearer YOUR_API_KEY"
# Download as Excel
curl -o dataset.xlsx \
"$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/download?format=xlsx" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
Binary file download.
Statistical Analysis
GET /api/datasets/{dataset_id}/analyze
Compute per-column statistics including mean, standard deviation, missing values, percentiles, skewness, kurtosis, outlier counts, histograms (numeric columns), and top values (categorical columns). Also returns a correlation matrix for numeric columns.
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze" \
-H "Authorization: Bearer YOUR_API_KEY"
resp = requests.get(
    f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/analyze",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
)
stats = resp.json()["statistics"]
for col, info in stats["columns"].items():
    print(f"{col}: type={info['type']}, missing={info['missing']}")
Response 200 OK
{
"statistics": {
"columns": {
"amount": {
"type": "float64",
"unique": 4520,
"missing": 12,
"missing_percent": 0.00024,
"min": 0.01,
"max": 25000.0,
"mean": 88.35,
"std": 340.12,
"percentiles": {"p25": 9.99, "p50": 29.99, "p75": 69.99},
"skewness": 15.234,
"kurtosis": 312.5,
"iqr": 60.0,
"outlier_count": 482,
"histogram": {
"counts": [42000, 5000, 1500, 800, 300, 200, 100, 50, 30, 20],
"bin_edges": [0.01, 2500.0, 5000.0, 7500.0, 10000.0, 12500.0, 15000.0, 17500.0, 20000.0, 22500.0, 25000.0]
}
},
"merchant": {
"type": "object",
"unique": 120,
"missing": 0,
"missing_percent": 0.0,
"top_values": {"ACME Corp": 5200, "Luxury Ltd": 3100}
}
},
"row_count": 50000,
"duplicate_rows": 45,
"memory_usage": 5.8,
"correlations": {
"amount": {"amount": 1.0, "is_fraud": 0.15}
}
},
"error": null
}
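The percentiles in the response are enough to derive outlier bounds on the client side. The documentation does not say how `outlier_count` is computed; the sketch below uses Tukey's 1.5 × IQR rule, a common convention, purely as an assumption. `tukey_fences` is a hypothetical helper.

```python
def tukey_fences(col_stats, k=1.5):
    """Return (lower, upper) outlier bounds from a numeric column's stats.

    Assumption: outliers lie outside [p25 - k*IQR, p75 + k*IQR]. The
    platform may use a different rule for its own outlier_count.
    """
    pct = col_stats["percentiles"]
    iqr = pct["p75"] - pct["p25"]
    return pct["p25"] - k * iqr, pct["p75"] + k * iqr
```

For the `amount` column above (p25 = 9.99, p75 = 69.99), the IQR is 60.0, so the bounds are 9.99 - 90 and 69.99 + 90.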
Data Quality Report
GET /api/datasets/{dataset_id}/quality
Return the data quality report computed during upload (stored in the dataset version).
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/quality" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"quality": {
"overall_score": 92.5,
"checks": [
{"name": "missing_values", "passed": true, "details": "Max 0.02% missing"},
{"name": "duplicate_rows", "passed": true, "details": "45 duplicates (0.09%)"}
]
}
}
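A pipeline can gate on the report by collecting the checks that did not pass. A minimal sketch over the response shape shown above; `failed_checks` is a hypothetical helper.

```python
def failed_checks(report):
    """Return the names of quality checks whose `passed` flag is false."""
    checks = report["quality"]["checks"]
    return [check["name"] for check in checks if not check["passed"]]
```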
Dataset Lineage
GET /api/datasets/{dataset_id}/lineage
Trace the full lineage of a dataset: which experiments used it, which models were trained from those experiments, and which deployments serve those models.
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/lineage" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"experiments": [
{
"id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"name": "AutoML Run 1",
"target_column": "is_fraud",
"problem_type": "classification",
"status": "succeeded",
"model_count": 8
}
],
"models": [
{
"id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
"name": "GBM_1_AutoML",
"algorithm": "GBM",
"experiment_id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345"
}
],
"deployments": [
{
"id": "b8c9d0e1-f2a3-4567-8901-bcdef1234567",
"name": "GBM_1_AutoML",
"model_id": "a7b8c9d0-e1f2-3456-7890-abcdef123456",
"stage": "production",
"is_active": true
}
]
}
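The three lists are linked by ID: each model references its experiment through `experiment_id`, and each deployment references its model through `model_id`. A minimal sketch that follows those links to group active deployments by experiment name; `active_deployments_by_experiment` is a hypothetical helper.

```python
def active_deployments_by_experiment(lineage):
    """Map experiment name -> names of active deployments derived from it."""
    model_to_experiment = {m["id"]: m["experiment_id"] for m in lineage["models"]}
    experiment_names = {e["id"]: e["name"] for e in lineage["experiments"]}

    result = {}
    for deployment in lineage["deployments"]:
        if not deployment["is_active"]:
            continue
        experiment_id = model_to_experiment.get(deployment["model_id"])
        # Fall back to a placeholder if a link is missing from the response.
        name = experiment_names.get(experiment_id, "unknown")
        result.setdefault(name, []).append(deployment["name"])
    return result
```

Running this before a delete shows which live deployments trace back to the dataset.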
Delete Dataset
DELETE /api/datasets/{dataset_id}
Permanently delete a dataset and all its versions and artifact files.
Example
curl -X DELETE "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"ok": true
}
List Versions
GET /api/datasets/{dataset_id}/versions
Return all versions of a dataset ordered by version number.
Query Parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| limit | integer | 50 | Max items (1–100). |
| offset | integer | 0 | Pagination offset. |
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"items": [
{
"id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"version": 0,
"row_count": 50000,
"column_count": 15,
"description": "Initial upload",
"created_by_name": "Alice Chen",
"created_at": "2026-02-12T08:00:00Z"
},
{
"id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
"dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"version": 1,
"row_count": 75000,
"column_count": 15,
"description": "Added January transactions",
"created_by_name": "Alice Chen",
"created_at": "2026-02-20T11:00:00Z"
}
],
"total": 2,
"limit": 50,
"offset": 0
}
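The version numbers in this response are what the compare endpoint takes as `v1` and `v2`. A minimal sketch that picks the two most recent versions from one page of results; `last_two_versions` is a hypothetical helper, and it only sees the versions on the page it is given.

```python
def last_two_versions(page):
    """Return (previous, latest) version numbers, or None if fewer than two."""
    numbers = sorted(item["version"] for item in page["items"])
    if len(numbers) < 2:
        return None
    return numbers[-2], numbers[-1]
```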
Upload New Version
POST /api/datasets/{dataset_id}/versions
Upload a new version of an existing dataset. Uses multipart/form-data.
Form Fields
| Field | Type | Required | Description |
|---|---|---|---|
| description | string | Yes | Description of what changed in this version. |
| file | file | Yes | The CSV or Parquet file. |
Example
curl -X POST "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/versions" \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "description=Added January transactions" \
-F "file=@transactions_v2.csv"
Response 201 Created
{
"dataset_version_id": "a8b9c0d1-e2f3-4567-8901-abcdef234567",
"version": 1,
"row_count": 75000,
"column_count": 15
}
Compare Versions
GET /api/datasets/{dataset_id}/compare?v1=0&v2=1
Compare two versions of a dataset. Returns schema diffs (added/removed columns, type changes) and statistical drift metrics (PSI for numeric columns, category changes for categorical columns).
Query Parameters
| Parameter | Type | Required | Description |
|---|---|---|---|
| v1 | integer | Yes | First version number. |
| v2 | integer | Yes | Second version number. |
Example
curl "$BASE_URL/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare?v1=0&v2=1" \
-H "Authorization: Bearer YOUR_API_KEY"
resp = requests.get(
    f"{BASE_URL}/api/datasets/e5f6a7b8-c9d0-1234-5678-90abcdef1234/compare",
    headers={"Authorization": "Bearer YOUR_API_KEY"},
    params={"v1": 0, "v2": 1},
)
summary = resp.json()["summary"]
print(f"Row change: {summary['rows_diff']:+d}")
Response 200 OK
{
"summary": {
"v1": 0,
"v2": 1,
"v1_rows": 50000,
"v2_rows": 75000,
"rows_diff": 25000,
"v1_columns": 15,
"v2_columns": 15,
"columns_added": [],
"columns_removed": [],
"columns_type_changed": []
},
"column_diffs": {
"amount": {
"type": "numeric",
"v1_mean": 88.35,
"v2_mean": 92.10,
"v1_std": 340.12,
"v2_std": 355.40,
"missing_v1": 0.0002,
"missing_v2": 0.0001,
"mean_shift_pct": 4.25,
"psi": 0.023,
"shift_level": "low"
}
}
}
PSI (Population Stability Index) interpretation:
< 0.1 – low drift, distributions are similar.
0.1 to 0.25 – moderate drift, investigation recommended.
> 0.25 – high drift, significant distribution change.
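To make the thresholds concrete, the sketch below computes PSI with the standard formula, the sum over bins of (actual − expected) × ln(actual / expected), where both are bin fractions. It is an illustration under stated assumptions, not the platform's implementation: the binning, the `eps` floor for empty bins, and the label strings "moderate" and "high" are assumptions (only "low" appears in the example response).

```python
import math


def psi(expected_counts, actual_counts, eps=1e-6):
    """Population Stability Index over two histograms with identical bins."""
    expected_total = sum(expected_counts)
    actual_total = sum(actual_counts)
    total = 0.0
    for expected, actual in zip(expected_counts, actual_counts):
        # Floor each fraction at eps so empty bins do not produce log(0).
        e_frac = max(expected / expected_total, eps)
        a_frac = max(actual / actual_total, eps)
        total += (a_frac - e_frac) * math.log(a_frac / e_frac)
    return total


def shift_level(psi_value):
    """Map a PSI value to the drift bands listed above."""
    if psi_value < 0.1:
        return "low"
    if psi_value <= 0.25:
        return "moderate"  # assumed label
    return "high"  # assumed label
```

The example value `psi: 0.023` falls in the first band, which matches the `shift_level` of "low" in the response.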
Get Dataset Version
GET /api/dataset-versions/{dataset_version_id}
Return metadata for a specific dataset version including its column schema.
Example
curl "$BASE_URL/api/dataset-versions/f6a7b8c9-d0e1-2345-6789-0abcdef12345" \
-H "Authorization: Bearer YOUR_API_KEY"
Response 200 OK
{
"id": "f6a7b8c9-d0e1-2345-6789-0abcdef12345",
"dataset_id": "e5f6a7b8-c9d0-1234-5678-90abcdef1234",
"version": 0,
"row_count": 50000,
"column_count": 15,
"columns": [
{"name": "amount", "dtype": "float64"},
{"name": "merchant", "dtype": "object"},
{"name": "is_fraud", "dtype": "int64"}
],
"created_at": "2026-02-12T08:00:00Z"
}
See also
Experiments API – Training models on a dataset.
Privacy Suite API – Scanning datasets for PII.