End-to-End ML Pipeline

This guide covers the full machine learning lifecycle through the API: data preparation, model training, evaluation, deployment, and prediction serving.

Overview

A typical CorePlexML workflow follows these steps:

  1. Create a Project – organizational container

  2. Upload Datasets – CSV data with automatic profiling

  3. Run Experiments – AutoML training with H2O

  4. Evaluate Models – metrics, feature importance, predictions

  5. Deploy – serve predictions via REST endpoints

  6. Generate Reports – PDF summaries with charts

Each step is a simple API call. Experiments and report generation run asynchronously as background jobs.
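All examples in this guide assume `requests` plus a base URL and auth headers defined once up front. The values below are placeholders; your instance URL and exact auth scheme may differ:

```python
import requests  # HTTP client used by every example in this guide

BASE_URL = "https://coreplexml.example.com"  # placeholder: your CorePlexML instance
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",  # placeholder token
    "Accept": "application/json",
}
```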


Working with Projects

Creating a Project

resp = requests.post(f"{BASE_URL}/api/projects", headers=HEADERS, json={
    "name": "Credit Scoring",
    "description": "Consumer credit risk model",
})
project_id = resp.json()["project_id"]

Listing Projects

resp = requests.get(f"{BASE_URL}/api/projects", headers=HEADERS, params={
    "search": "credit",
    "sort_field": "created_at",
    "sort_direction": "desc",
    "limit": 20,
})
for p in resp.json()["items"]:
    print(f"{p['id']}: {p['name']} ({p['model_count']} models)")

Sharing a Project

Add team members with specific roles:

requests.post(
    f"{BASE_URL}/api/projects/{project_id}/members",
    headers=HEADERS,
    json={"email": "analyst@company.com", "role": "editor"},
)

Roles: viewer (read-only), editor (read/write), admin (full access), owner (transfer ownership).
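The role hierarchy above can be mirrored client-side, e.g. to gate actions in a UI before making an API call. This is a sketch; the capability names are illustrative, not part of the API:

```python
# Capabilities per role, following the descriptions above (illustrative names)
ROLE_CAPABILITIES = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "manage_members"},
    "owner": {"read", "write", "manage_members", "transfer_ownership"},
}

def can(role: str, capability: str) -> bool:
    """Check whether a role grants a given capability."""
    return capability in ROLE_CAPABILITIES.get(role, set())

print(can("editor", "write"))   # → True
print(can("viewer", "write"))   # → False
```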


Data Ingestion

Uploading a CSV

with open("credit_data.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/upload",
        headers=HEADERS,
        files={"file": ("credit_data.csv", f, "text/csv")},
        data={
            "project_id": project_id,
            "name": "Credit Applications Q1",
            "description": "25,000 loan applications",
        },
    )
ds = resp.json()
dataset_id = ds["id"]
version_id = ds["version_id"]

The platform automatically:

  • Detects column types (numeric, categorical, text, datetime)

  • Computes descriptive statistics per column

  • Identifies missing values and duplicates

  • Creates an immutable dataset version

Inspecting Column Schema

resp = requests.get(
    f"{BASE_URL}/api/dataset-versions/{version_id}",
    headers=HEADERS,
)
version = resp.json()
print(f"Rows: {version['row_count']}, Columns: {version['column_count']}")
for col in version.get("columns", []):
    print(f"  {col['name']}: {col['dtype']} (missing: {col.get('missing_pct', 0):.1f}%)")
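As a quick client-side sanity check before training, the schema payload can be scanned for columns with heavy missingness. This is a sketch against the response shape shown above; the sample payload is illustrative:

```python
def columns_with_missing(version, threshold=5.0):
    """Return names of columns whose missing-value percentage exceeds threshold."""
    return [
        col["name"]
        for col in version.get("columns", [])
        if col.get("missing_pct", 0) > threshold
    ]

# Illustrative payload matching the schema fields shown above
version = {
    "row_count": 25000,
    "column_count": 3,
    "columns": [
        {"name": "income", "dtype": "numeric", "missing_pct": 0.2},
        {"name": "credit_score", "dtype": "numeric", "missing_pct": 12.5},
        {"name": "purpose", "dtype": "categorical", "missing_pct": 41.0},
    ],
}
print(columns_with_missing(version))  # → ['credit_score', 'purpose']
```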

Downloading Data

Retrieve the raw CSV for a dataset version:

resp = requests.get(
    f"{BASE_URL}/api/datasets/{dataset_id}/download",
    headers=HEADERS,
)
with open("downloaded.csv", "wb") as f:
    f.write(resp.content)

AutoML Training

Supported Algorithms

CorePlexML uses H2O AutoML to automatically train models from six algorithm families. All are enabled by default:

  • GBM (Gradient Boosting) – Sequential decision trees correcting previous errors. Strong default performer on most tabular data.

  • XGBoost – Optimized gradient boosting with L1/L2 regularization. Often achieves top accuracy.

  • DRF (Random Forest) – Ensemble of trees using random feature subsets. Robust to overfitting.

  • DeepLearning (Deep Learning) – Multi-layer neural network. Best for complex nonlinear patterns.

  • GLM (Linear Model) – Logistic/linear regression. Fast, interpretable baseline.

  • StackedEnsemble (Stacked Ensemble) – Meta-model combining all other models. Typically the best accuracy, but SHAP contributions are not available.

Use exclude_algos to skip specific algorithms:

config = {
    "exclude_algos": ["DeepLearning", "StackedEnsemble"],
    # ...
}

Evaluation Metrics

Models are ranked by the sort_metric. Use AUTO (default) to let the platform choose based on problem type.

Classification: AUC (default), AUCPR, logloss, mean_per_class_error, accuracy, MCC

Regression: RMSE (default), MSE, MAE, RMSLE, mean_residual_deviance, R2

See Experiments API for the full metric reference.

Experiment Configuration

An experiment runs H2O AutoML to train, tune, and rank candidate models. Key configuration options:

  • target_column (required) – Column to predict.

  • problem_type (default: auto-detect) – classification, regression, or multiclass.

  • max_models (default: 20) – Maximum number of candidate models.

  • max_runtime_secs (default: 600) – Time budget in seconds.

  • nfolds (default: 5) – Cross-validation folds.

  • seed (default: 1) – Random seed for reproducibility.

  • balance_classes (default: false) – Oversample the minority class.

  • stopping_metric (default: auto) – Early stopping metric.

  • stopping_rounds (default: 3) – Early stopping patience.

  • exclude_algos (default: []) – Algorithms to skip (e.g., ["DeepLearning"]).

Classification Example

resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "default",
    "problem_type": "classification",
    "name": "Credit Default v1",
    "config": {
        "max_models": 15,
        "max_runtime_secs": 600,
        "nfolds": 5,
        "balance_classes": True,
        "stopping_metric": "AUC",
        "exclude_algos": ["DeepLearning"],
    },
})
experiment_id = resp.json()["id"]

Regression Example

resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "loan_amount",
    "problem_type": "regression",
    "name": "Loan Amount Estimator",
    "config": {
        "max_models": 10,
        "max_runtime_secs": 300,
        "stopping_metric": "RMSE",
    },
})

Waiting for Completion

Experiments are background jobs. Poll until complete:

import time

while True:
    resp = requests.get(
        f"{BASE_URL}/api/experiments/{experiment_id}",
        headers=HEADERS,
    )
    exp = resp.json()["experiment"]
    print(f"Status: {exp['status']} | Progress: {exp.get('progress', 0)}%")
    if exp["status"] in ("completed", "failed"):
        break
    time.sleep(10)
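The polling loop above generalizes to a small reusable helper with a timeout. The `fetch` callable is injected (it would wrap the GET request shown above), which keeps the helper independent of HTTP details:

```python
import time

def wait_for_experiment(fetch, interval=10.0, timeout=3600.0):
    """Poll fetch() until the experiment reaches a terminal status, or time out."""
    deadline = time.monotonic() + timeout
    while True:
        exp = fetch()  # e.g. lambda: requests.get(...).json()["experiment"]
        if exp["status"] in ("completed", "failed"):
            return exp
        if time.monotonic() >= deadline:
            raise TimeoutError(f"experiment still {exp['status']} after {timeout}s")
        time.sleep(interval)
```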

With the SDK, this is simpler:

result = client.experiments.wait(experiment_id, interval=5.0, timeout=3600.0)

Model Evaluation

Leaderboard

After training, list the ranked models:

resp = requests.get(
    f"{BASE_URL}/api/experiments/{experiment_id}/models",
    headers=HEADERS,
)
models = resp.json()["items"]
for i, m in enumerate(models[:5], 1):
    metrics = m.get("metrics", {})
    auc = metrics.get("auc")
    auc_str = f"{auc:.4f}" if auc is not None else "N/A"  # avoid formatting a missing value
    print(f"  #{i} {m['algorithm']}: AUC={auc_str}")

Model Details

resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}",  # model_id from the leaderboard, e.g. models[0]["id"]
    headers=HEADERS,
)
model = resp.json()["model"]
print(f"Algorithm: {model['algorithm']}")
print(f"Metrics: {model['metrics']}")

Feature Importance

resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}/feature-importance",
    headers=HEADERS,
)
for feat in resp.json()["features"][:10]:
    print(f"  {feat['feature']}: {feat['importance']:.4f}")
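Raw importance scores are easiest to compare after normalizing them to shares of the total. This is a client-side sketch over the response shape shown above; the sample values are illustrative:

```python
def importance_shares(features):
    """Convert raw importance scores into fractions that sum to 1.0."""
    total = sum(f["importance"] for f in features) or 1.0  # guard against empty input
    return {f["feature"]: f["importance"] / total for f in features}

# Illustrative entries matching the feature-importance response shape
features = [
    {"feature": "credit_score", "importance": 6.0},
    {"feature": "debt_ratio", "importance": 3.0},
    {"feature": "income", "importance": 1.0},
]
print(importance_shares(features))  # → {'credit_score': 0.6, 'debt_ratio': 0.3, 'income': 0.1}
```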

Ad-hoc Predictions

Test a model before deploying:

resp = requests.post(
    f"{BASE_URL}/api/models/{model_id}/predict",
    headers=HEADERS,
    json={
        "inputs": {
            "income": 55000,
            "debt_ratio": 0.35,
            "credit_score": 680,
            "employment_years": 5,
        },
    },
)
print(resp.json())  # {"prediction": "no_default", "probabilities": {...}}
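Because the response includes class probabilities, a decision threshold can be applied client-side instead of relying on the default label. The response shape below follows the comment above but is otherwise illustrative:

```python
def decide(result, positive="default", cutoff=0.5):
    """Return the positive label when its probability meets the cutoff,
    otherwise fall back to the model's own prediction."""
    prob = result["probabilities"].get(positive, 0.0)
    return positive if prob >= cutoff else result["prediction"]

# A stricter cutoff flags more applications for review
result = {"prediction": "no_default",
          "probabilities": {"default": 0.32, "no_default": 0.68}}
print(decide(result, cutoff=0.5))   # → 'no_default'
print(decide(result, cutoff=0.30))  # → 'default'
```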

Report Generation

Generate a PDF report summarizing experiment or model results:

# Experiment report
resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "experiment",
    "entity_id": experiment_id,
    "options": {
        "include_leaderboard": True,
        "include_feature_importance": True,
        "llm_insights": True,  # optional AI-powered suggestions
    },
})
report_id = resp.json()["id"]

# Wait for generation
while True:
    r = requests.get(f"{BASE_URL}/api/reports/{report_id}", headers=HEADERS)
    if r.json()["report"]["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Download PDF
r = requests.get(
    f"{BASE_URL}/api/reports/{report_id}/download",
    headers=HEADERS,
)
with open("experiment_report.pdf", "wb") as f:
    f.write(r.content)

Available report kinds: experiment, model, project, synthgen, privacy, deployment.


See also