========================
End-to-End ML Pipeline
========================

This guide covers the full machine learning lifecycle through the API: data
preparation, model training, evaluation, deployment, and prediction serving.

.. contents:: Sections
   :local:
   :depth: 2

Overview
--------

A typical CorePlexML workflow follows these steps:

1. **Create a Project** -- organizational container
2. **Upload Datasets** -- CSV data with automatic profiling
3. **Run Experiments** -- AutoML training with H2O
4. **Evaluate Models** -- metrics, feature importance, predictions
5. **Deploy** -- serve predictions via REST endpoints
6. **Generate Reports** -- PDF summaries with charts

Each step is a simple API call. Experiments and report generation run
asynchronously as background jobs.

----

Working with Projects
---------------------

Creating a Project
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   import requests

   resp = requests.post(f"{BASE_URL}/api/projects", headers=HEADERS, json={
       "name": "Credit Scoring",
       "description": "Consumer credit risk model",
   })
   project_id = resp.json()["project_id"]

Listing Projects
^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(f"{BASE_URL}/api/projects", headers=HEADERS, params={
       "search": "credit",
       "sort_field": "created_at",
       "sort_direction": "desc",
       "limit": 20,
   })
   for p in resp.json()["items"]:
       print(f"{p['id']}: {p['name']} ({p['model_count']} models)")

Sharing a Project
^^^^^^^^^^^^^^^^^

Add team members with specific roles:

.. code-block:: python

   requests.post(
       f"{BASE_URL}/api/projects/{project_id}/members",
       headers=HEADERS,
       json={"email": "analyst@company.com", "role": "editor"},
   )

Roles: ``viewer`` (read-only), ``editor`` (read/write), ``admin`` (full
access), ``owner`` (transfer ownership).

----

Data Ingestion
--------------

Uploading a CSV
^^^^^^^^^^^^^^^
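Every snippet in this guide, including the upload call below, assumes a shared
session setup. A minimal sketch (the URL and token values are placeholders,
not real endpoints):

.. code-block:: python

   # Shared setup assumed throughout this guide. Substitute your own
   # deployment URL and API token -- both values here are placeholders.
   BASE_URL = "https://coreplexml.example.com"
   API_TOKEN = "YOUR_API_TOKEN"

   HEADERS = {
       "Authorization": f"Bearer {API_TOKEN}",
       "Accept": "application/json",
   }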
.. code-block:: python

   with open("credit_data.csv", "rb") as f:
       resp = requests.post(
           f"{BASE_URL}/api/datasets/upload",
           headers=HEADERS,
           files={"file": ("credit_data.csv", f, "text/csv")},
           data={
               "project_id": project_id,
               "name": "Credit Applications Q1",
               "description": "25,000 loan applications",
           },
       )
   ds = resp.json()
   dataset_id = ds["id"]
   version_id = ds["version_id"]

The platform automatically:

- Detects column types (numeric, categorical, text, datetime)
- Computes descriptive statistics per column
- Identifies missing values and duplicates
- Creates an immutable dataset version

Inspecting Column Schema
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/dataset-versions/{version_id}",
       headers=HEADERS,
   )
   version = resp.json()
   print(f"Rows: {version['row_count']}, Columns: {version['column_count']}")
   for col in version.get("columns", []):
       print(f"  {col['name']}: {col['dtype']} (missing: {col.get('missing_pct', 0):.1f}%)")

Downloading Data
^^^^^^^^^^^^^^^^

Retrieve the raw CSV for a dataset version:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/datasets/{dataset_id}/download",
       headers=HEADERS,
   )
   with open("downloaded.csv", "wb") as f:
       f.write(resp.content)

----

AutoML Training
---------------

Supported Algorithms
^^^^^^^^^^^^^^^^^^^^

CorePlexML uses H2O AutoML to automatically train models from six algorithm
families. All are enabled by default:

.. list-table::
   :header-rows: 1
   :widths: 15 15 70

   * - ID
     - Name
     - Description
   * - ``GBM``
     - Gradient Boosting
     - Sequential decision trees correcting previous errors. Strong default
       performer on most tabular data.
   * - ``XGBoost``
     - XGBoost
     - Optimized gradient boosting with L1/L2 regularization. Often achieves
       top accuracy.
   * - ``DRF``
     - Random Forest
     - Ensemble of trees using random feature subsets. Robust to overfitting.
   * - ``DeepLearning``
     - Deep Learning
     - Multi-layer neural network. Best for complex nonlinear patterns.
   * - ``GLM``
     - Linear Model
     - Logistic/linear regression. Fast, interpretable baseline.
   * - ``StackedEnsemble``
     - Stacked Ensemble
     - Meta-model combining all other models. Typically the best accuracy,
       but SHAP contributions are not available.

Use ``exclude_algos`` to skip specific algorithms:

.. code-block:: python

   config = {
       "exclude_algos": ["DeepLearning", "StackedEnsemble"],
       # ...
   }

Evaluation Metrics
^^^^^^^^^^^^^^^^^^

Models are ranked by the ``sort_metric``. Use ``AUTO`` (default) to let the
platform choose based on problem type.

**Classification:** ``AUC`` (default), ``AUCPR``, ``logloss``,
``mean_per_class_error``, ``accuracy``, ``MCC``

**Regression:** ``RMSE`` (default), ``MSE``, ``MAE``, ``RMSLE``,
``mean_residual_deviance``, ``R2``

See :doc:`/api-reference/experiments` for the full metric reference.

Experiment Configuration
^^^^^^^^^^^^^^^^^^^^^^^^

An experiment runs H2O AutoML to train, tune, and rank candidate models. Key
configuration options:

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Field
     - Default
     - Description
   * - ``target_column``
     - --
     - Column to predict (required).
   * - ``problem_type``
     - auto-detect
     - ``classification``, ``regression``, or ``multiclass``.
   * - ``max_models``
     - 20
     - Maximum number of candidate models.
   * - ``max_runtime_secs``
     - 600
     - Time budget in seconds.
   * - ``nfolds``
     - 5
     - Cross-validation folds.
   * - ``seed``
     - 1
     - Random seed for reproducibility.
   * - ``balance_classes``
     - false
     - Oversample minority classes.
   * - ``stopping_metric``
     - auto
     - Early stopping metric.
   * - ``stopping_rounds``
     - 3
     - Early stopping patience.
   * - ``exclude_algos``
     - []
     - Algorithms to skip (e.g., ``["DeepLearning"]``).

Classification Example
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "target_column": "default",
       "problem_type": "classification",
       "name": "Credit Default v1",
       "config": {
           "max_models": 15,
           "max_runtime_secs": 600,
           "nfolds": 5,
           "balance_classes": True,
           "stopping_metric": "AUC",
           "exclude_algos": ["DeepLearning"],
       },
   })
   experiment_id = resp.json()["id"]

Regression Example
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "target_column": "loan_amount",
       "problem_type": "regression",
       "name": "Loan Amount Estimator",
       "config": {
           "max_models": 10,
           "max_runtime_secs": 300,
           "stopping_metric": "RMSE",
       },
   })

Waiting for Completion
^^^^^^^^^^^^^^^^^^^^^^

Experiments are background jobs. Poll until complete:

.. code-block:: python

   import time

   while True:
       resp = requests.get(
           f"{BASE_URL}/api/experiments/{experiment_id}",
           headers=HEADERS,
       )
       exp = resp.json()["experiment"]
       print(f"Status: {exp['status']} | Progress: {exp.get('progress', 0)}%")
       if exp["status"] in ("completed", "failed"):
           break
       time.sleep(10)

With the SDK, this is simpler:

.. code-block:: python

   result = client.experiments.wait(experiment_id, interval=5.0, timeout=3600.0)

----

Model Evaluation
----------------

Leaderboard
^^^^^^^^^^^

After training, list the ranked models:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/experiments/{experiment_id}/models",
       headers=HEADERS,
   )
   models = resp.json()["items"]
   for i, m in enumerate(models[:5], 1):
       metrics = m.get("metrics", {})
       auc = metrics.get("auc")
       auc_text = f"{auc:.4f}" if auc is not None else "N/A"
       print(f"  #{i} {m['algorithm']}: AUC={auc_text}")

Model Details
^^^^^^^^^^^^^
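The detail, feature-importance, and prediction calls below all take a
``model_id``. One way to choose it, assuming the leaderboard response shape
shown above (items sorted best-first, each with an ``id`` field), is to take
the top-ranked entry:

.. code-block:: python

   def best_model_id(leaderboard_items):
       """Return the id of the top-ranked model.

       Assumes the leaderboard is sorted best-first, as returned by
       GET /api/experiments/{experiment_id}/models.
       """
       if not leaderboard_items:
           raise ValueError("experiment produced no models")
       return leaderboard_items[0]["id"]

Then ``model_id = best_model_id(models)``, with ``models`` taken from the
leaderboard snippet above.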
.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/models/{model_id}",
       headers=HEADERS,
   )
   model = resp.json()["model"]
   print(f"Algorithm: {model['algorithm']}")
   print(f"Metrics: {model['metrics']}")

Feature Importance
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/models/{model_id}/feature-importance",
       headers=HEADERS,
   )
   for feat in resp.json()["features"][:10]:
       print(f"  {feat['feature']}: {feat['importance']:.4f}")

Ad-hoc Predictions
^^^^^^^^^^^^^^^^^^

Test a model before deploying:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/models/{model_id}/predict",
       headers=HEADERS,
       json={
           "inputs": {
               "income": 55000,
               "debt_ratio": 0.35,
               "credit_score": 680,
               "employment_years": 5,
           },
       },
   )
   print(resp.json())
   # {"prediction": "no_default", "probabilities": {...}}

----

Report Generation
-----------------

Generate a PDF report summarizing experiment or model results:

.. code-block:: python

   import time

   # Experiment report
   resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
       "project_id": project_id,
       "kind": "experiment",
       "entity_id": experiment_id,
       "options": {
           "include_leaderboard": True,
           "include_feature_importance": True,
           "llm_insights": True,  # optional AI-powered suggestions
       },
   })
   report_id = resp.json()["id"]

   # Wait for generation
   while True:
       r = requests.get(f"{BASE_URL}/api/reports/{report_id}", headers=HEADERS)
       if r.json()["report"]["status"] in ("completed", "failed"):
           break
       time.sleep(5)

   # Download PDF
   r = requests.get(
       f"{BASE_URL}/api/reports/{report_id}/download",
       headers=HEADERS,
   )
   with open("experiment_report.pdf", "wb") as f:
       f.write(r.content)

Available report kinds: ``experiment``, ``model``, ``project``, ``synthgen``,
``privacy``, ``deployment``.

----
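The experiment and report polling loops above share the same shape: fetch,
unwrap an envelope field, check for a terminal status, sleep. They can be
factored into one generic helper. This is a sketch, assuming only the response
shapes shown in this guide (the helper name and signature are not part of the
API):

.. code-block:: python

   import time

   def wait_for(fetch, extract_key=None, interval=5.0, timeout=3600.0):
       """Poll until a job reaches a terminal status ("completed"/"failed").

       ``fetch`` returns the decoded JSON payload; ``extract_key`` names the
       envelope field holding the job dict (e.g. "experiment", "report"),
       or None when the payload is the job itself.
       """
       deadline = time.monotonic() + timeout
       while True:
           payload = fetch()
           job = payload[extract_key] if extract_key else payload
           if job["status"] in ("completed", "failed"):
               return job
           if time.monotonic() >= deadline:
               raise TimeoutError("job did not finish within the time budget")
           time.sleep(interval)

For example, the experiment loop becomes
``wait_for(lambda: requests.get(f"{BASE_URL}/api/experiments/{experiment_id}", headers=HEADERS).json(), extract_key="experiment")``,
and the report loop is the same call with the report URL and
``extract_key="report"``.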
.. seealso::

   - :doc:`/api-reference/projects` -- Full projects API reference
   - :doc:`/api-reference/datasets` -- Full datasets API reference
   - :doc:`/api-reference/experiments` -- Full experiments API reference
   - :doc:`/api-reference/models` -- Full models API reference
   - :doc:`mlops` -- Deploying and monitoring models