Advanced Usage
==============

This page covers advanced SDK patterns including job polling, batch predictions, dataset versions, Privacy Suite, SynthGen, What-If Analysis, error handling, and timeout configuration.

.. contents::
   :local:
   :depth: 2

Job Polling Patterns
--------------------

Several CorePlexML operations run as background jobs: AutoML training, report generation, and synthetic data model training. The SDK provides built-in ``wait`` methods that poll until completion.

Experiment Polling
^^^^^^^^^^^^^^^^^^

.. code-block:: python

   exp = client.experiments.create(
       project_id=project_id,
       dataset_version_id=version_id,
       target_column="target",
       problem_type="classification",
       config={"max_models": 10},
   )

   # Block until complete (up to 2 hours, polling every 10 seconds)
   result = client.experiments.wait(
       exp["id"],
       interval=10.0,
       timeout=7200.0,
   )

   if result["status"] == "succeeded":
       print("Training completed")
   elif result["status"] in ("failed", "error"):
       print(f"Training failed: {result.get('error', 'unknown')}")

Report Polling
^^^^^^^^^^^^^^

.. code-block:: python

   report = client.reports.create(
       project_id=project_id,
       kind="experiment",
       entity_id=experiment_id,
   )

   # Reports typically finish in under a minute
   status = client.reports.wait(report["id"], interval=2.0, timeout=120.0)

   if status["report"]["status"] == "succeeded":
       client.reports.download(report["id"], "report.pdf")

Custom Polling Loop
^^^^^^^^^^^^^^^^^^^

For finer control (e.g., progress logging), write your own loop:

.. code-block:: python

   import time

   from coreplexml import CorePlexMLError


   def wait_with_progress(client, experiment_id, timeout=3600.0):
       """Poll an experiment and print progress updates."""
       start = time.time()
       last_status = None
       while time.time() - start < timeout:
           data = client.experiments.get(experiment_id)
           exp = data.get("experiment", {})
           status = exp.get("status", "unknown")
           if status != last_status:
               elapsed = int(time.time() - start)
               print(f" [{elapsed}s] Status: {status}")
               last_status = status
           if status in ("succeeded", "failed", "error"):
               return exp
           time.sleep(5.0)
       raise CorePlexMLError(f"Timeout after {timeout}s")


   result = wait_with_progress(client, exp["id"])

Batch Predictions
-----------------

Both ``models.predict`` and ``deployments.predict`` accept a list of dicts for batch inference.

.. code-block:: python

   import csv

   # Load rows from a CSV file
   rows = []
   with open("new_customers.csv") as f:
       reader = csv.DictReader(f)
       for row in reader:
           rows.append(row)

   # Send batch (the SDK serializes the list to JSON)
   results = client.deployments.predict(deployment_id, inputs=rows)

   # Process results
   for i, pred in enumerate(results["predictions"]):
       print(f"Row {i}: {pred['prediction']} (confidence={pred.get('probability', 'N/A')})")

For very large datasets, process in chunks to stay within server payload limits:

.. code-block:: python

   def chunked_predict(client, deployment_id, rows, chunk_size=500):
       """Predict in batches to avoid payload limits."""
       all_predictions = []
       for i in range(0, len(rows), chunk_size):
           chunk = rows[i : i + chunk_size]
           result = client.deployments.predict(deployment_id, inputs=chunk)
           all_predictions.extend(result["predictions"])
           print(f" Processed {min(i + chunk_size, len(rows))}/{len(rows)} rows")
       return all_predictions


   predictions = chunked_predict(client, deployment_id, rows)

Dataset Versions Workflow
-------------------------

Datasets in CorePlexML can contain multiple versions. The SDK upload helper creates a new dataset with an initial version; additional versions can be created via the REST endpoint ``POST /api/datasets/{dataset_id}/versions``. You can list versions and train experiments on a specific version ID.

.. code-block:: python

   # Upload initial dataset
   ds = client.datasets.upload(
       project_id=project_id,
       file_path="data_v1.csv",
       name="Sales Data",
   )
   dataset_id = ds["id"]
   v1_id = ds["version_id"]

   # (Optional) Create an additional version via the REST endpoint.
   # This endpoint is not yet wrapped by a high-level SDK resource method.
   v2 = client._http.upload(f"/api/datasets/{dataset_id}/versions", "data_v2.csv")
   v2_id = v2["version_id"]

   # List all versions
   versions = client.datasets.versions(dataset_id)
   for v in versions["items"]:
       print(f" Version {v['version']}: {v['row_count']} rows")

   # Train on the latest version
   exp = client.experiments.create(
       project_id=project_id,
       dataset_version_id=v2_id,
       target_column="revenue",
       problem_type="regression",
   )

Privacy Suite Workflow
----------------------

The Privacy Suite detects PII in datasets and applies configurable transformations (masking, hashing, redaction, generalization, etc.) to produce anonymized data that meets compliance requirements.

Full Workflow
^^^^^^^^^^^^^

.. code-block:: python

   # 1. Create a HIPAA compliance policy
   policy = client.privacy.create_policy(
       project_id=project_id,
       name="Patient Data HIPAA Policy",
       profile="hipaa",
       description="Scan and transform PHI in patient records",
   )
   policy_id = policy["id"]

   # 2. Create a session linking the policy to a dataset
   session = client.privacy.create_session(
       policy_id=policy_id,
       dataset_id=dataset_id,
   )
   session_id = session["id"]

   # 3. Run PII detection
   detection = client.privacy.detect(session_id)
   print(f"Found {len(detection.get('findings', []))} PII columns:")
   for finding in detection.get("findings", []):
       print(f" {finding['column']}: {finding['pii_type']} ({finding['count']} occurrences)")

   # 4. Apply transformations
   transform_result = client.privacy.transform(session_id)
   print(f"Transformations applied: {transform_result.get('transformations_applied', 0)}")

   # 5. Get full results
   results = client.privacy.results(session_id)
   print(f"Session status: {results.get('status')}")

Compliance Profiles
^^^^^^^^^^^^^^^^^^^

CorePlexML supports four built-in compliance profiles, each pre-configured with rules for the relevant PII types:

.. list-table::
   :header-rows: 1

   * - Profile
     - Description
     - PII Types
   * - ``hipaa``
     - HIPAA Safe Harbor
     - Names, SSN, MRN, dates, addresses, phone, email, etc.
   * - ``gdpr``
     - EU General Data Protection Regulation
     - Personal identifiers, IP addresses, biometric data, etc.
   * - ``pci_dss``
     - Payment Card Industry Data Security Standard
     - Credit card numbers, CVVs, cardholder names, etc.
   * - ``ccpa``
     - California Consumer Privacy Act
     - Broad personal information categories

SynthGen Workflow
-----------------

SynthGen trains deep generative models on real datasets to produce statistically similar synthetic data. This is useful for privacy-preserving data sharing, test data generation, and data augmentation.

.. code-block:: python

   # 1. Train a CTGAN model
   synth = client.synthgen.create_model(
       project_id=project_id,
       dataset_version_id=version_id,
       name="Customer Data Generator",
       model_type="ctgan",
       config={"epochs": 300, "batch_size": 500},
   )
   synth_id = synth["id"]
   print(f"SynthGen model created: {synth_id}")

   # 2. Wait for training to complete (poll manually)
   import time

   while True:
       status = client.synthgen.get_model(synth_id)
       if status.get("status") == "succeeded":
           print("SynthGen model training complete")
           break
       elif status.get("status") == "failed":
           print(f"Training failed: {status.get('error')}")
           break
       print(f" Status: {status.get('status')}...")
       time.sleep(10)

   # 3. Generate synthetic data
   result = client.synthgen.generate(
       model_id=synth_id,
       num_rows=5000,
       seed=42,
   )
   print(f"Generated {result.get('num_rows', 0)} synthetic rows")

Model Types
^^^^^^^^^^^

.. list-table::
   :header-rows: 1

   * - Model Type
     - Description
     - Best For
   * - ``ctgan``
     - Conditional Tabular GAN
     - General-purpose tabular data with mixed types
   * - ``copulagan``
     - Copula-based GAN
     - Data with complex multivariate relationships
   * - ``tvae``
     - Tabular Variational Autoencoder
     - Faster training, simpler distributions
   * - ``gaussian_copula``
     - Gaussian Copula synthesizer
     - Fast baseline for mostly continuous tabular data

What-If Analysis (Studio) Workflow
----------------------------------

ML Studio allows you to explore model behavior by defining scenarios with different input values and comparing the resulting predictions.

.. code-block:: python

   # 1. Create a session with a baseline input
   session = client.studio.create_session(
       project_id=project_id,
       deployment_id=deployment_id,
       baseline_input={
           "age": 35,
           "income": 75000,
           "credit_score": 680,
           "loan_amount": 250000,
           "employment_years": 5,
       },
   )
   session_id = session["id"]

   # 2. Define alternative scenarios
   scenarios = [
       ("Higher Credit Score", {"credit_score": 780}),
       ("Lower Income", {"income": 45000}),
       ("Larger Loan", {"loan_amount": 500000}),
       ("Senior Applicant", {"age": 55, "employment_years": 25}),
   ]

   for name, changes in scenarios:
       s = client.studio.create_scenario(session_id, name=name, changes=changes)
       # Execute the scenario
       client.studio.run_scenario(s["id"])

   # 3. Compare all scenarios
   comparison = client.studio.compare(session_id)
   print("Scenario Comparison:")
   for sc in comparison.get("scenarios", []):
       print(f" {sc['name']}: prediction={sc['prediction']}")

Error Handling Patterns
-----------------------

Structured Error Inspection
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Every exception carries ``message``, ``status_code``, and ``detail``:

.. code-block:: python

   from coreplexml import CorePlexMLError

   try:
       client.datasets.upload(project_id, "missing.csv", "test")
   except CorePlexMLError as e:
       print(f"HTTP {e.status_code}")
       print(f"Message: {e.message}")
       print(f"Detail: {e.detail}")

Safe Resource Fetching
^^^^^^^^^^^^^^^^^^^^^^

Use ``NotFoundError`` for safe lookups:

.. code-block:: python

   from coreplexml import NotFoundError


   def get_project_or_none(client, project_id):
       """Return project dict or None if not found."""
       try:
           return client.projects.get(project_id)
       except NotFoundError:
           return None


   project = get_project_or_none(client, some_id)
   if project is None:
       print("Project not found, creating...")
       project = client.projects.create("New Project")

Timeout Configuration
---------------------

The ``timeout`` parameter on the client controls the HTTP request timeout (how long to wait for the server to respond). This is separate from job polling timeouts.

.. code-block:: python

   # Short timeout for fast endpoints
   client = CorePlexMLClient(
       base_url="https://ml.example.com",
       api_key="your-key",
       timeout=10,  # 10 seconds
   )

   # Longer timeout for large file uploads
   upload_client = CorePlexMLClient(
       base_url="https://ml.example.com",
       api_key="your-key",
       timeout=300,  # 5 minutes
   )

For long-running operations, adjust the polling timeout separately:

.. code-block:: python

   # Default: poll every 5s, give up after 1 hour
   result = client.experiments.wait(exp_id, interval=5.0, timeout=3600.0)

   # Large training: poll every 30s, give up after 6 hours
   result = client.experiments.wait(exp_id, interval=30.0, timeout=21600.0)

Working with Large Datasets
---------------------------

Upload Performance
^^^^^^^^^^^^^^^^^^

For large CSV files, increase the client timeout to accommodate upload time:

.. code-block:: python

   client = CorePlexMLClient(
       base_url="https://ml.example.com",
       api_key="your-key",
       timeout=600,  # 10 minutes for large uploads
   )

   ds = client.datasets.upload(
       project_id=project_id,
       file_path="large_dataset.csv",  # e.g., 500 MB
       name="Large Training Data",
   )

Streaming Downloads
^^^^^^^^^^^^^^^^^^^

Dataset and report downloads stream data to disk in 8 KB chunks, so they work with files of any size without loading the entire file into memory:

.. code-block:: python

   # Download large dataset
   client.datasets.download(dataset_id, "/data/export.csv")

   # Download report
   client.reports.download(report_id, "/reports/analysis.pdf")

Logging
-------

The SDK uses Python's standard ``logging`` module under the ``coreplexml`` logger. Enable debug logging to see HTTP requests and responses:

.. code-block:: python

   import logging

   logging.basicConfig(level=logging.DEBUG)
   logger = logging.getLogger("coreplexml")
   logger.setLevel(logging.DEBUG)

   # All SDK HTTP calls will now be logged
   client = CorePlexMLClient(base_url="https://ml.example.com", api_key="your-key")
   client.projects.list()
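Because ``logging.basicConfig(level=logging.DEBUG)`` raises the level for every library in the process, you may prefer to scope a handler to the ``coreplexml`` logger alone. The sketch below uses only the standard library; the ``coreplexml.http`` child-logger name is illustrative (standard ``logging`` namespace behavior), not a documented SDK logger:

.. code-block:: python

   import io
   import logging

   # Capture SDK log output in a buffer instead of flooding stderr.
   buffer = io.StringIO()
   handler = logging.StreamHandler(buffer)
   handler.setFormatter(logging.Formatter("%(name)s %(levelname)s %(message)s"))

   # Configure only the "coreplexml" logger; the root logger keeps its
   # default WARNING level, so debug noise from other libraries is suppressed.
   sdk_logger = logging.getLogger("coreplexml")
   sdk_logger.setLevel(logging.DEBUG)
   sdk_logger.addHandler(handler)
   sdk_logger.propagate = False  # keep SDK records out of the root logger

   # Any logger under the "coreplexml" namespace inherits this configuration
   logging.getLogger("coreplexml.http").debug("GET /api/projects")
   print(buffer.getvalue())

Child loggers (``coreplexml.http`` and the like) propagate their records up to handlers attached to ``coreplexml``, so one handler covers the whole SDK namespace.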