========================
End-to-End ML Pipeline
========================

This guide covers the full machine learning lifecycle through the API: data
preparation, model training, evaluation, deployment, and prediction serving.

.. contents:: Sections
   :local:
   :depth: 2

Overview
--------

A typical CorePlexML workflow follows these steps:

1. **Create a Project** -- organizational container
2. **Upload Datasets** -- CSV data with automatic profiling
3. **Run Experiments** -- AutoML training with H2O
4. **Evaluate Models** -- metrics, feature importance, predictions
5. **Deploy** -- serve predictions via REST endpoints
6. **Generate Reports** -- PDF summaries with charts

Each step is a simple API call. Experiments and report generation run
asynchronously as background jobs.

----

Working with Projects
---------------------

Creating a Project
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   import requests

   resp = requests.post(f"{BASE_URL}/api/projects", headers=HEADERS, json={
       "name": "Credit Scoring",
       "description": "Consumer credit risk model",
   })
   project_id = resp.json()["project_id"]

Listing Projects
^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(f"{BASE_URL}/api/projects", headers=HEADERS, params={
       "search": "credit",
       "sort_field": "created_at",
       "sort_direction": "desc",
       "limit": 20,
   })
   for p in resp.json()["items"]:
       print(f"{p['id']}: {p['name']} ({p['model_count']} models)")

Sharing a Project
^^^^^^^^^^^^^^^^^

Add team members with specific roles:

.. code-block:: python

   requests.post(
       f"{BASE_URL}/api/projects/{project_id}/members",
       headers=HEADERS,
       json={"email": "analyst@company.com", "role": "editor"},
   )

Roles: ``viewer`` (read-only), ``editor`` (read/write), ``admin`` (full
access), ``owner`` (transfer ownership).

----

Data Ingestion
--------------

Uploading a CSV
^^^^^^^^^^^^^^^
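Every snippet in this guide, including the upload call below, assumes a shared
session setup. A minimal sketch (the URL and token values are placeholders,
not real endpoints):

.. code-block:: python

   # Shared setup assumed throughout this guide. Substitute your own
   # deployment URL and API token -- both values here are placeholders.
   BASE_URL = "https://coreplexml.example.com"
   API_TOKEN = "YOUR_API_TOKEN"

   HEADERS = {
       "Authorization": f"Bearer {API_TOKEN}",
       "Accept": "application/json",
   }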
.. code-block:: python

   with open("credit_data.csv", "rb") as f:
       resp = requests.post(
           f"{BASE_URL}/api/datasets/upload",
           headers=HEADERS,
           files={"file": ("credit_data.csv", f, "text/csv")},
           data={
               "project_id": project_id,
               "name": "Credit Applications Q1",
               "description": "25,000 loan applications",
           },
       )
   ds = resp.json()
   dataset_id = ds["id"]
   version_id = ds["version_id"]

The platform automatically:

- Detects column types (numeric, categorical, text, datetime)
- Computes descriptive statistics per column
- Identifies missing values and duplicates
- Creates an immutable dataset version

Inspecting Column Schema
^^^^^^^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/dataset-versions/{version_id}",
       headers=HEADERS,
   )
   version = resp.json()
   print(f"Rows: {version['row_count']}, Columns: {version['column_count']}")
   for col in version.get("columns", []):
       print(f"  {col['name']}: {col['dtype']} (missing: {col.get('missing_pct', 0):.1f}%)")

Downloading Data
^^^^^^^^^^^^^^^^

Retrieve the raw CSV for a dataset version:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/datasets/{dataset_id}/download",
       headers=HEADERS,
   )
   with open("downloaded.csv", "wb") as f:
       f.write(resp.content)

----

AutoML Training
---------------

Supported Algorithms
^^^^^^^^^^^^^^^^^^^^

CorePlexML uses H2O AutoML to automatically train models from six algorithm
families. All are enabled by default:

.. list-table::
   :header-rows: 1
   :widths: 15 15 70

   * - ID
     - Name
     - Description
   * - ``GBM``
     - Gradient Boosting
     - Sequential decision trees correcting previous errors. Strong default
       performer on most tabular data.
   * - ``XGBoost``
     - XGBoost
     - Optimized gradient boosting with L1/L2 regularization. Often achieves
       top accuracy.
   * - ``DRF``
     - Random Forest
     - Ensemble of trees using random feature subsets. Robust to overfitting.
   * - ``DeepLearning``
     - Deep Learning
     - Multi-layer neural network. Best for complex nonlinear patterns.
   * - ``GLM``
     - Linear Model
     - Logistic/linear regression. Fast, interpretable baseline.
   * - ``StackedEnsemble``
     - Stacked Ensemble
     - Meta-model combining all other models. Typically the best accuracy,
       but SHAP contributions are not available.

Use ``exclude_algos`` to skip specific algorithms:

.. code-block:: python

   config = {
       "exclude_algos": ["DeepLearning", "StackedEnsemble"],
       # ...
   }

Evaluation Metrics
^^^^^^^^^^^^^^^^^^

Models are ranked by the ``sort_metric``. Use ``AUTO`` (default) to let the
platform choose based on problem type.

**Classification:** ``AUC`` (default), ``AUCPR``, ``logloss``,
``mean_per_class_error``, ``accuracy``, ``MCC``

**Regression:** ``RMSE`` (default), ``MSE``, ``MAE``, ``RMSLE``,
``mean_residual_deviance``, ``R2``

See :doc:`/api-reference/experiments` for the full metric reference.

Experiment Configuration
^^^^^^^^^^^^^^^^^^^^^^^^

An experiment runs H2O AutoML to train, tune, and rank candidate models. Key
configuration options:

.. list-table::
   :header-rows: 1
   :widths: 25 15 60

   * - Field
     - Default
     - Description
   * - ``target_column``
     - --
     - Column to predict (required).
   * - ``problem_type``
     - auto-detect
     - ``classification``, ``regression``, or ``multiclass``.
   * - ``max_models``
     - 20
     - Maximum number of candidate models.
   * - ``max_runtime_secs``
     - 600
     - Time budget in seconds.
   * - ``nfolds``
     - 5
     - Cross-validation folds.
   * - ``seed``
     - 1
     - Random seed for reproducibility.
   * - ``balance_classes``
     - false
     - Oversample minority classes.
   * - ``stopping_metric``
     - auto
     - Early stopping metric.
   * - ``stopping_rounds``
     - 3
     - Early stopping patience.
   * - ``exclude_algos``
     - []
     - Algorithms to skip (e.g., ``["DeepLearning"]``).

Classification Example
^^^^^^^^^^^^^^^^^^^^^^
.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "target_column": "default",
       "problem_type": "classification",
       "name": "Credit Default v1",
       "config": {
           "max_models": 15,
           "max_runtime_secs": 600,
           "nfolds": 5,
           "balance_classes": True,
           "stopping_metric": "AUC",
           "exclude_algos": ["DeepLearning"],
       },
   })
   experiment_id = resp.json()["id"]

Regression Example
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.post(f"{BASE_URL}/api/experiments", headers=HEADERS, json={
       "project_id": project_id,
       "dataset_version_id": version_id,
       "target_column": "loan_amount",
       "problem_type": "regression",
       "name": "Loan Amount Estimator",
       "config": {
           "max_models": 10,
           "max_runtime_secs": 300,
           "stopping_metric": "RMSE",
       },
   })

Waiting for Completion
^^^^^^^^^^^^^^^^^^^^^^

Experiments are background jobs. Poll until complete:

.. code-block:: python

   import time

   while True:
       resp = requests.get(
           f"{BASE_URL}/api/experiments/{experiment_id}",
           headers=HEADERS,
       )
       exp = resp.json()["experiment"]
       print(f"Status: {exp['status']} | Progress: {exp.get('progress', 0)}%")
       if exp["status"] in ("completed", "failed"):
           break
       time.sleep(10)

With the SDK, this is simpler:

.. code-block:: python

   result = client.experiments.wait(experiment_id, interval=5.0, timeout=3600.0)

----

Model Evaluation
----------------

Leaderboard
^^^^^^^^^^^

After training, list the ranked models:

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/experiments/{experiment_id}/models",
       headers=HEADERS,
   )
   models = resp.json()["items"]
   for i, m in enumerate(models[:5], 1):
       metrics = m.get("metrics", {})
       auc = metrics.get("auc")
       auc_text = f"{auc:.4f}" if auc is not None else "N/A"
       print(f"  #{i} {m['algorithm']}: AUC={auc_text}")

Model Details
^^^^^^^^^^^^^
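The detail, feature-importance, and prediction calls below all take a
``model_id``. One way to choose it, assuming the leaderboard response shape
shown above (items sorted best-first, each with an ``id`` field), is to take
the top-ranked entry:

.. code-block:: python

   def best_model_id(leaderboard_items):
       """Return the id of the top-ranked model.

       Assumes the leaderboard is sorted best-first, as returned by
       GET /api/experiments/{experiment_id}/models.
       """
       if not leaderboard_items:
           raise ValueError("experiment produced no models")
       return leaderboard_items[0]["id"]

Then ``model_id = best_model_id(models)``, with ``models`` taken from the
leaderboard snippet above.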
.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/models/{model_id}",
       headers=HEADERS,
   )
   model = resp.json()["model"]
   print(f"Algorithm: {model['algorithm']}")
   print(f"Metrics: {model['metrics']}")

Feature Importance
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   resp = requests.get(
       f"{BASE_URL}/api/models/{model_id}/feature-importance",
       headers=HEADERS,
   )
   for feat in resp.json()["features"][:10]:
       print(f"  {feat['feature']}: {feat['importance']:.4f}")

Ad-hoc Predictions
^^^^^^^^^^^^^^^^^^

Test a model before deploying:

.. code-block:: python

   resp = requests.post(
       f"{BASE_URL}/api/models/{model_id}/predict",
       headers=HEADERS,
       json={
           "inputs": {
               "income": 55000,
               "debt_ratio": 0.35,
               "credit_score": 680,
               "employment_years": 5,
           },
       },
   )
   print(resp.json())
   # {"prediction": "no_default", "probabilities": {...}}

----

Report Generation
-----------------

Generate a PDF report summarizing experiment or model results:

.. code-block:: python

   import time

   # Experiment report
   resp = requests.post(f"{BASE_URL}/api/reports", headers=HEADERS, json={
       "project_id": project_id,
       "kind": "experiment",
       "entity_id": experiment_id,
       "options": {
           "include_leaderboard": True,
           "include_feature_importance": True,
           "llm_insights": True,  # optional AI-powered suggestions
       },
   })
   report_id = resp.json()["id"]

   # Wait for generation
   while True:
       r = requests.get(f"{BASE_URL}/api/reports/{report_id}", headers=HEADERS)
       if r.json()["report"]["status"] in ("completed", "failed"):
           break
       time.sleep(5)

   # Download PDF
   r = requests.get(
       f"{BASE_URL}/api/reports/{report_id}/download",
       headers=HEADERS,
   )
   with open("experiment_report.pdf", "wb") as f:
       f.write(r.content)

Available report kinds: ``experiment``, ``model``, ``project``, ``synthgen``,
``privacy``, ``deployment``.

----
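The experiment and report polling loops above share the same shape: fetch,
unwrap an envelope field, check for a terminal status, sleep. They can be
factored into one generic helper. This is a sketch, assuming only the response
shapes shown in this guide (the helper name and signature are not part of the
API):

.. code-block:: python

   import time

   def wait_for(fetch, extract_key=None, interval=5.0, timeout=3600.0):
       """Poll until a job reaches a terminal status ("completed"/"failed").

       ``fetch`` returns the decoded JSON payload; ``extract_key`` names the
       envelope field holding the job dict (e.g. "experiment", "report"),
       or None when the payload is the job itself.
       """
       deadline = time.monotonic() + timeout
       while True:
           payload = fetch()
           job = payload[extract_key] if extract_key else payload
           if job["status"] in ("completed", "failed"):
               return job
           if time.monotonic() >= deadline:
               raise TimeoutError("job did not finish within the time budget")
           time.sleep(interval)

For example, the experiment loop becomes
``wait_for(lambda: requests.get(f"{BASE_URL}/api/experiments/{experiment_id}", headers=HEADERS).json(), extract_key="experiment")``,
and the report loop is the same call with the report URL and
``extract_key="report"``.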
.. seealso::

   - :doc:`/api-reference/projects` -- Full projects API reference
   - :doc:`/api-reference/datasets` -- Full datasets API reference
   - :doc:`/api-reference/experiments` -- Full experiments API reference
   - :doc:`/api-reference/models` -- Full models API reference
   - :doc:`mlops` -- Deploying and monitoring models