End-to-End ML Pipeline

This guide covers the full machine learning lifecycle through the API: data preparation, model training, evaluation, deployment, and prediction serving.

Overview

A typical CorePlexML workflow follows these steps:

  1. Create a Project – organizational container

  2. Upload Datasets – CSV data with automatic profiling

  3. Run Experiments – AutoML training with H2O

  4. Evaluate Models – metrics, feature importance, predictions

  5. Deploy – serve predictions via REST endpoints

  6. Generate Reports – PDF summaries with charts

Each step is a simple API call. Experiments and report generation run asynchronously as background jobs.
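All examples in this guide assume `requests` plus a base URL and auth headers defined once up front. The values below are placeholders; your instance URL and exact auth scheme may differ:

```python
import requests  # HTTP client used by every example in this guide

BASE_URL = "https://coreplexml.example.com"  # placeholder: your CorePlexML instance
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",  # placeholder token
    "Accept": "application/json",
}
```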


Working with Projects

Creating a Project

resp = requests.post(f"{BASE_URL}/api/projects", headers=HEADERS, json={
    "name": "Credit Scoring",
    "description": "Consumer credit risk model",
})
project_id = resp.json()["project_id"]

Listing Projects

resp = requests.get(f"{BASE_URL}/api/projects", headers=HEADERS, params={
    "search": "credit",
    "sort_field": "created_at",
    "sort_direction": "desc",
    "limit": 20,
})
for p in resp.json()["items"]:
    print(f"{p['id']}: {p['name']} ({p['model_count']} models)")

Sharing a Project

Add team members with specific roles:

requests.post(
    f"{BASE_URL}/api/projects/{project_id}/members",
    headers=HEADERS,
    json={"email": "analyst@company.com", "role": "editor"},
)

Roles: viewer (read-only), editor (read/write), admin (full access), owner (transfer ownership).
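The role hierarchy above can be mirrored client-side, e.g. to gate actions in a UI before making an API call. This is a sketch; the capability names are illustrative, not part of the API:

```python
# Capabilities per role, following the descriptions above (illustrative names)
ROLE_CAPABILITIES = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "manage_members"},
    "owner": {"read", "write", "manage_members", "transfer_ownership"},
}

def can(role: str, capability: str) -> bool:
    """Check whether a role grants a given capability."""
    return capability in ROLE_CAPABILITIES.get(role, set())

print(can("editor", "write"))   # → True
print(can("viewer", "write"))   # → False
```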


Data Ingestion

Uploading a CSV

with open("credit_data.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/upload",
        headers=HEADERS,
        files={"file": ("credit_data.csv", f, "text/csv")},
        data={
            "project_id": project_id,
            "name": "Credit Applications Q1",
            "description": "25,000 loan applications",
        },
    )
ds = resp.json()
dataset_id = ds["id"]
version_id = ds["version_id"]

The platform automatically:

  • Detects column types (numeric, categorical, text, datetime)

  • Computes descriptive statistics per column

  • Identifies missing values and duplicates

  • Creates an immutable dataset version

Inspecting Column Schema

resp = requests.get(
    f"{BASE_URL}/api/dataset-versions/{version_id}",
    headers=HEADERS,
)
version = resp.json()
print(f"Rows: {version['row_count']}, Columns: {version['column_count']}")
for col in version.get("columns", []):
    print(f"  {col['name']}: {col['dtype']} (missing: {col.get('missing_pct', 0):.1f}%)")
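As a quick client-side sanity check before training, the schema payload can be scanned for columns with heavy missingness. This is a sketch against the response shape shown above; the sample payload is illustrative:

```python
def columns_with_missing(version, threshold=5.0):
    """Return names of columns whose missing-value percentage exceeds threshold."""
    return [
        col["name"]
        for col in version.get("columns", [])
        if col.get("missing_pct", 0) > threshold
    ]

# Illustrative payload matching the schema fields shown above
version = {
    "row_count": 25000,
    "column_count": 3,
    "columns": [
        {"name": "income", "dtype": "numeric", "missing_pct": 0.2},
        {"name": "credit_score", "dtype": "numeric", "missing_pct": 12.5},
        {"name": "purpose", "dtype": "categorical", "missing_pct": 41.0},
    ],
}
print(columns_with_missing(version))  # → ['credit_score', 'purpose']
```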

Downloading Data

Retrieve the raw CSV for a dataset version:

resp = requests.get(
    f"{BASE_URL}/api/datasets/{dataset_id}/download",
    headers=HEADERS,
)
with open("downloaded.csv", "wb") as f:
    f.write(resp.content)

AutoML Training

Supported Algorithms

CorePlexML uses H2O AutoML to automatically train models from six algorithm families. All are enabled by default:

  • GBM (Gradient Boosting) – Sequential decision trees correcting previous errors. Strong default performer on most tabular data.

  • XGBoost – Optimized gradient boosting with L1/L2 regularization. Often achieves top accuracy.

  • DRF (Random Forest) – Ensemble of trees using random feature subsets. Robust to overfitting.

  • DeepLearning (Deep Learning) – Multi-layer neural network. Best for complex nonlinear patterns.

  • GLM (Linear Model) – Logistic/linear regression. Fast, interpretable baseline.

  • StackedEnsemble (Stacked Ensemble) – Meta-model combining all other models. Typically the best accuracy, but SHAP contributions are not available.

Use exclude_algos to skip specific algorithms:

config = {
    "exclude_algos": ["DeepLearning", "StackedEnsemble"],
    # ...
}

Evaluation Metrics

Models are ranked by the sort_metric. Use AUTO (default) to let the platform choose based on problem type.

Classification: AUC (default), AUCPR, logloss, mean_per_class_error, accuracy, MCC

Regression: RMSE (default), MSE, MAE, RMSLE, mean_residual_deviance, R2

See Experiments API for the full metric reference.

Experiment Configuration

An experiment runs H2O AutoML to train, tune, and rank candidate models. Key configuration options:

  • target_column (required) – Column to predict.

  • problem_type (default: auto-detect) – classification, regression, or multiclass.

  • max_models (default: 20) – Maximum number of candidate models.

  • max_runtime_secs (default: 600) – Time budget in seconds.

  • nfolds (default: 5) – Cross-validation folds.

  • seed (default: 1) – Random seed for reproducibility.

  • balance_classes (default: false) – Oversample the minority class.

  • stopping_metric (default: auto) – Early stopping metric.

  • stopping_rounds (default: 3) – Early stopping patience.

  • exclude_algos (default: []) – Algorithms to skip (e.g., ["DeepLearning"]).

Classification Example

resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "default",
    "problem_type": "classification",
    "name": "Credit Default v1",
    "config": {
        "max_models": 15,
        "max_runtime_secs": 600,
        "nfolds": 5,
        "balance_classes": True,
        "stopping_metric": "AUC",
        "exclude_algos": ["DeepLearning"],
    },
})
experiment_id = resp.json()["id"]

Regression Example

resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "loan_amount",
    "problem_type": "regression",
    "name": "Loan Amount Estimator",
    "config": {
        "max_models": 10,
        "max_runtime_secs": 300,
        "stopping_metric": "RMSE",
    },
})

Waiting for Completion

Experiments are background jobs. Poll until complete:

import time

while True:
    resp = requests.get(
        f"{BASE_URL}/api/experiments/{experiment_id}",
        headers=HEADERS,
    )
    exp = resp.json()["experiment"]
    print(f"Status: {exp['status']} | Progress: {exp.get('progress', 0)}%")
    if exp["status"] in ("completed", "failed"):
        break
    time.sleep(10)
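The polling loop above generalizes to a small reusable helper with a timeout. The `fetch` callable is injected (it would wrap the GET request shown above), which keeps the helper independent of HTTP details:

```python
import time

def wait_for_experiment(fetch, interval=10.0, timeout=3600.0):
    """Poll fetch() until the experiment reaches a terminal status, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        exp = fetch()  # e.g. lambda: requests.get(...).json()["experiment"]
        if exp["status"] in ("completed", "failed"):
            return exp
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment still {exp['status']} after {timeout}s")
        time.sleep(interval)
```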

With the SDK, this is simpler:

result = client.experiments.wait(experiment_id, interval=5.0, timeout=3600.0)

Model Evaluation

Leaderboard

After training, list the ranked models:

resp = requests.get(
    f"{BASE_URL}/api/experiments/{experiment_id}/models",
    headers=HEADERS,
)
models = resp.json()["items"]
for i, m in enumerate(models[:5], 1):
    metrics = m.get("metrics", {})
    auc = metrics.get("auc")
    auc_str = f"{auc:.4f}" if auc is not None else "N/A"  # avoid formatting a missing value
    print(f"  #{i} {m['algorithm']}: AUC={auc_str}")

Model Details

resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}",  # model_id from the leaderboard, e.g. models[0]["id"]
    headers=HEADERS,
)
model = resp.json()["model"]
print(f"Algorithm: {model['algorithm']}")
print(f"Metrics: {model['metrics']}")

Feature Importance

resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}/feature-importance",
    headers=HEADERS,
)
for feat in resp.json()["features"][:10]:
    print(f"  {feat['feature']}: {feat['importance']:.4f}")
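Raw importance scores are easiest to compare after normalizing them to shares of the total. This is a client-side sketch over the response shape shown above; the sample values are illustrative:

```python
def importance_shares(features):
    """Convert raw importance scores into fractions that sum to 1.0."""
    total = sum(f["importance"] for f in features) or 1.0  # guard against empty input
    return {f["feature"]: f["importance"] / total for f in features}

# Illustrative entries matching the feature-importance response shape
features = [
    {"feature": "credit_score", "importance": 6.0},
    {"feature": "debt_ratio", "importance": 3.0},
    {"feature": "income", "importance": 1.0},
]
print(importance_shares(features))  # → {'credit_score': 0.6, 'debt_ratio': 0.3, 'income': 0.1}
```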

Ad-hoc Predictions

Test a model before deploying:

resp = requests.post(
    f"{BASE_URL}/api/models/{model_id}/predict",
    headers=HEADERS,
    json={
        "inputs": {
            "income": 55000,
            "debt_ratio": 0.35,
            "credit_score": 680,
            "employment_years": 5,
        },
    },
)
print(resp.json())  # {"prediction": "no_default", "probabilities": {...}}
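Because the response includes class probabilities, a decision threshold can be applied client-side instead of relying on the default label. The response shape below follows the comment above but is otherwise illustrative:

```python
def decide(result, positive="default", cutoff=0.5):
    """Return the positive label when its probability meets the cutoff,
    otherwise fall back to the model's own prediction."""
    prob = result["probabilities"].get(positive, 0.0)
    return positive if prob >= cutoff else result["prediction"]

# A stricter cutoff flags more applications for review
result = {"prediction": "no_default",
          "probabilities": {"default": 0.32, "no_default": 0.68}}
print(decide(result, cutoff=0.5))   # → 'no_default'
print(decide(result, cutoff=0.30))  # → 'default'
```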

Report Generation

Generate a PDF report summarizing experiment or model results:

# Experiment report
resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "experiment",
    "entity_id": experiment_id,
    "options": {
        "include_leaderboard": True,
        "include_feature_importance": True,
        "llm_insights": True,  # optional AI-powered suggestions
    },
})
report_id = resp.json()["id"]

# Wait for generation
while True:
    r = requests.get(f"{BASE_URL}/api/reports/{report_id}", headers=HEADERS)
    if r.json()["report"]["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Download PDF
r = requests.get(
    f"{BASE_URL}/api/reports/{report_id}/download",
    headers=HEADERS,
)
with open("experiment_report.pdf", "wb") as f:
    f.write(r.content)

Available report kinds: experiment, model, project, synthgen, privacy, deployment.


See also