End-to-End ML Pipeline
This guide covers the full machine learning lifecycle through the API: data preparation, model training, evaluation, deployment, and prediction serving.
Overview
A typical CorePlexML workflow follows these steps:
Create a Project – organizational container
Upload Datasets – CSV data with automatic profiling
Run Experiments – AutoML training with H2O
Evaluate Models – metrics, feature importance, predictions
Deploy – serve predictions via REST endpoints
Generate Reports – PDF summaries with charts
Each step is a simple API call. Experiments and report generation run asynchronously as background jobs.
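The snippets in this guide assume a configured base URL and authorization headers. A minimal setup sketch (the environment-variable names and the bearer-token header shape are assumptions; check your deployment's auth scheme):

```python
import os

import requests  # HTTP client used by every snippet in this guide

# Hypothetical environment variables; the names are illustrative.
BASE_URL = os.environ.get("COREPLEXML_URL", "https://coreplexml.example.com")
API_KEY = os.environ.get("COREPLEXML_API_KEY", "your-api-key")
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
```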
Working with Projects
Creating a Project
resp = requests.post(f"{BASE_URL}/api/projects", headers=HEADERS, json={
    "name": "Credit Scoring",
    "description": "Consumer credit risk model",
})
project_id = resp.json()["project_id"]
Listing Projects
resp = requests.get(f"{BASE_URL}/api/projects", headers=HEADERS, params={
    "search": "credit",
    "sort_field": "created_at",
    "sort_direction": "desc",
    "limit": 20,
})
for p in resp.json()["items"]:
    print(f"{p['id']}: {p['name']} ({p['model_count']} models)")
Data Ingestion
Uploading a CSV
with open("credit_data.csv", "rb") as f:
    resp = requests.post(
        f"{BASE_URL}/api/datasets/upload",
        headers=HEADERS,
        files={"file": ("credit_data.csv", f, "text/csv")},
        data={
            "project_id": project_id,
            "name": "Credit Applications Q1",
            "description": "25,000 loan applications",
        },
    )
ds = resp.json()
dataset_id = ds["id"]
version_id = ds["version_id"]
The platform automatically:
Detects column types (numeric, categorical, text, datetime)
Computes descriptive statistics per column
Identifies missing values and duplicates
Creates an immutable dataset version
Inspecting Column Schema
resp = requests.get(
    f"{BASE_URL}/api/dataset-versions/{version_id}",
    headers=HEADERS,
)
version = resp.json()
print(f"Rows: {version['row_count']}, Columns: {version['column_count']}")
for col in version.get("columns", []):
    print(f"  {col['name']}: {col['dtype']} (missing: {col.get('missing_pct', 0):.1f}%)")
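The profiling output can drive simple data-quality checks before training. A sketch using the column shape shown above (the helper name and the 20% cutoff are illustrative, not part of the platform):

```python
def columns_with_gaps(columns, max_missing_pct=20.0):
    """Return names of columns whose missing-value percentage exceeds the cutoff."""
    return [
        c["name"]
        for c in columns
        if c.get("missing_pct", 0.0) > max_missing_pct
    ]

# Hand-written schema in the shape returned by the endpoint above.
sample = [
    {"name": "income", "dtype": "numeric", "missing_pct": 1.2},
    {"name": "employer", "dtype": "text", "missing_pct": 41.7},
]
print(columns_with_gaps(sample))  # → ['employer']
```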
Downloading Data
Retrieve the raw CSV for a dataset version:
resp = requests.get(
    f"{BASE_URL}/api/datasets/{dataset_id}/download",
    headers=HEADERS,
)
with open("downloaded.csv", "wb") as f:
    f.write(resp.content)
AutoML Training
Supported Algorithms
CorePlexML uses H2O AutoML to automatically train models from six algorithm families. All are enabled by default:
| ID | Name | Description |
|---|---|---|
| GBM | Gradient Boosting | Sequential decision trees correcting previous errors. Strong default performer on most tabular data. |
| XGBoost | XGBoost | Optimized gradient boosting with L1/L2 regularization. Often achieves top accuracy. |
| DRF | Random Forest | Ensemble of trees using random feature subsets. Robust to overfitting. |
| DeepLearning | Deep Learning | Multi-layer neural network. Best for complex nonlinear patterns. |
| GLM | Linear Model | Logistic/linear regression. Fast, interpretable baseline. |
| StackedEnsemble | Stacked Ensemble | Meta-model combining all other models. Typically the best accuracy, but SHAP contributions are not available. |
Use exclude_algos to skip specific algorithms:
config = {
    "exclude_algos": ["DeepLearning", "StackedEnsemble"],
    # ...
}
Evaluation Metrics
Models are ranked by the sort_metric. Use AUTO (the default) to let the platform choose based on problem type.
Classification: AUC (default), AUCPR, logloss, mean_per_class_error, accuracy, MCC
Regression: RMSE (default), MSE, MAE, RMSLE, mean_residual_deviance, R2
See Experiments API for the full metric reference.
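When a specific ranking metric is needed instead of AUTO, sort_metric can be set alongside the other tuning fields. A sketch (placing sort_metric inside the experiment's config object is an assumption based on the other config fields shown in this guide):

```python
# Hypothetical config fragment; pairing sort_metric with a matching
# stopping_metric is optional but keeps ranking and early stopping aligned.
config = {
    "sort_metric": "AUCPR",      # rank classifiers by area under the PR curve
    "stopping_metric": "AUCPR",
    "max_models": 15,
}
print(config["sort_metric"])  # → AUCPR
```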
Experiment Configuration
An experiment runs H2O AutoML to train, tune, and rank candidate models. Key configuration options:
| Field | Default | Description |
|---|---|---|
| target_column | – | Column to predict (required). |
| problem_type | auto-detect | classification or regression; inferred from the target column when omitted. |
| max_models | 20 | Maximum number of candidate models. |
| max_runtime_secs | 600 | Time budget in seconds. |
| nfolds | 5 | Cross-validation folds. |
| seed | 1 | Random seed for reproducibility. |
| balance_classes | false | Oversample the minority class. |
| stopping_metric | auto | Early stopping metric. |
| stopping_rounds | 3 | Early stopping patience. |
| exclude_algos | [] | Algorithms to skip (e.g., DeepLearning). |
Classification Example
resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "default",
    "problem_type": "classification",
    "name": "Credit Default v1",
    "config": {
        "max_models": 15,
        "max_runtime_secs": 600,
        "nfolds": 5,
        "balance_classes": True,
        "stopping_metric": "AUC",
        "exclude_algos": ["DeepLearning"],
    },
})
experiment_id = resp.json()["id"]
Regression Example
resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
    "project_id": project_id,
    "dataset_version_id": version_id,
    "target_column": "loan_amount",
    "problem_type": "regression",
    "name": "Loan Amount Estimator",
    "config": {
        "max_models": 10,
        "max_runtime_secs": 300,
        "stopping_metric": "RMSE",
    },
})
Waiting for Completion
Experiments are background jobs. Poll until complete:
import time

while True:
    resp = requests.get(
        f"{BASE_URL}/api/experiments/{experiment_id}",
        headers=HEADERS,
    )
    exp = resp.json()["experiment"]
    print(f"Status: {exp['status']} | Progress: {exp.get('progress', 0)}%")
    if exp["status"] in ("completed", "failed"):
        break
    time.sleep(10)
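The loop above can be folded into a reusable helper with a timeout guard; a sketch (the helper and its parameters are illustrative, not part of the API):

```python
import time

def wait_for_job(fetch_status, interval=10.0, timeout=3600.0):
    """Poll fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        status = fetch_status()
        if status in ("completed", "failed"):
            return status
        time.sleep(interval)
    raise TimeoutError("job did not finish in time")

# Usage against the experiments endpoint:
# status = wait_for_job(lambda: requests.get(
#     f"{BASE_URL}/api/experiments/{experiment_id}", headers=HEADERS,
# ).json()["experiment"]["status"])
```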
With the SDK, this is simpler:
result = client.experiments.wait(experiment_id, interval=5.0, timeout=3600.0)
Model Evaluation
Leaderboard
After training, list the ranked models:
resp = requests.get(
    f"{BASE_URL}/api/experiments/{experiment_id}/models",
    headers=HEADERS,
)
models = resp.json()["items"]
for i, m in enumerate(models[:5], 1):
    auc = m.get("metrics", {}).get("auc")
    # Only format real numbers; applying :.4f to a missing metric would raise.
    auc_str = f"{auc:.4f}" if auc is not None else "N/A"
    print(f"  #{i} {m['algorithm']}: AUC={auc_str}")
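The examples that follow reference a model_id. Assuming the leaderboard is returned best-first (consistent with the ranking described above), the top entry can supply it; shown here against a hand-written leaderboard in the same shape (IDs and metric values are illustrative):

```python
# Stand-in for resp.json()["items"] from the leaderboard call above.
models = [
    {"id": "mdl_123", "algorithm": "XGBoost", "metrics": {"auc": 0.91}},
    {"id": "mdl_456", "algorithm": "GLM", "metrics": {"auc": 0.84}},
]
model_id = models[0]["id"]  # top-ranked model
print(model_id)  # → mdl_123
```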
Model Details
resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}",
    headers=HEADERS,
)
model = resp.json()["model"]
print(f"Algorithm: {model['algorithm']}")
print(f"Metrics: {model['metrics']}")
Feature Importance
resp = requests.get(
    f"{BASE_URL}/api/models/{model_id}/feature-importance",
    headers=HEADERS,
)
for feat in resp.json()["features"][:10]:
    print(f"  {feat['feature']}: {feat['importance']:.4f}")
Ad-hoc Predictions
Test a model before deploying:
resp = requests.post(
    f"{BASE_URL}/api/models/{model_id}/predict",
    headers=HEADERS,
    json={
        "inputs": {
            "income": 55000,
            "debt_ratio": 0.35,
            "credit_score": 680,
            "employment_years": 5,
        },
    },
)
print(resp.json()) # {"prediction": "no_default", "probabilities": {...}}
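For classification, the returned probabilities allow a custom decision threshold instead of the default label. A sketch assuming the response shape shown above (the class names and the 0.30 cutoff are illustrative):

```python
def decide(probabilities, positive="default", threshold=0.30):
    """Flag the positive class whenever its probability clears the threshold."""
    return positive if probabilities.get(positive, 0.0) >= threshold else "no_default"

# Hand-written response body in the shape shown above.
resp_body = {
    "prediction": "no_default",
    "probabilities": {"default": 0.42, "no_default": 0.58},
}
print(decide(resp_body["probabilities"]))  # → default
```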
Report Generation
Generate a PDF report summarizing experiment or model results:
# Experiment report
resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
    "project_id": project_id,
    "kind": "experiment",
    "entity_id": experiment_id,
    "options": {
        "include_leaderboard": True,
        "include_feature_importance": True,
        "llm_insights": True,  # optional AI-powered suggestions
    },
})
report_id = resp.json()["id"]

# Wait for generation
while True:
    r = requests.get(f"{BASE_URL}/api/reports/{report_id}", headers=HEADERS)
    if r.json()["report"]["status"] in ("completed", "failed"):
        break
    time.sleep(5)

# Download PDF
r = requests.get(
    f"{BASE_URL}/api/reports/{report_id}/download",
    headers=HEADERS,
)
with open("experiment_report.pdf", "wb") as f:
    f.write(r.content)
Available report kinds: experiment, model, project, synthgen, privacy, deployment.
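The other kinds use the same endpoint with the corresponding entity ID; for example, a model report. A sketch of the payload (the placeholder IDs are illustrative, and reusing the include_feature_importance option for this kind is an assumption):

```python
# Placeholder IDs; in a real workflow these come from earlier steps.
project_id = "prj_example"
model_id = "mdl_example"

model_report_payload = {
    "project_id": project_id,
    "kind": "model",             # one of the kinds listed above
    "entity_id": model_id,       # the model to report on
    "options": {"include_feature_importance": True},
}
# resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS,
#                      json=model_report_payload)
```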
See also
Projects API – Full projects API reference
Datasets API – Full datasets API reference
Experiments API – Full experiments API reference
Models API – Full models API reference
MLOps & Model Serving – Deploying and monitoring models