
New use case: MLOps lifecycle automation — model-registry promotion, drift remediation, training-pipeline failure triage, and scheduled fairness audits via agent-driven workflows #1037

@kelos-bot


Summary

Kelos ships strong primitives for engineering-team automation (GitHub/Linear/Jira sources, generic webhooks, Prometheus alerts, K8s events, cron), but no example, template, or proposal addresses the MLOps / ML model lifecycle. That is a notable gap given that Kelos is Kubernetes-native and the K8s ecosystem is the de facto control plane for ML: MLflow, KServe, Kubeflow Pipelines, Argo Workflows, Flyte, Seldon, BentoML, Ray, and Feast all run on K8s and emit events Kelos can already consume via the webhook, cron, and prometheusAlerts (#775, merged) sources.

ML platforms have an unusually high autonomous-agent payoff: lifecycle events (promotions, drift alerts, pipeline failures, scheduled audits) are frequent and machine-readable, remediation is often formulaic (manifest bumps, config changes, retraining triggers), and compliance work demands a written evidence trail.

This issue proposes four concrete TaskSpawner patterns for the MLOps lifecycle, identifies the ecosystem they target, and notes which existing Kelos primitives already cover them vs. what minor gaps could be closed in follow-ups.

Target Audience

  • ML platform engineers running MLflow / KServe / Kubeflow / Seldon on K8s, owning the serving stack and registry plumbing.
  • ML engineers and data scientists owning model code, training pipelines, evaluation harnesses, and notebooks.
  • SRE-for-ML / on-call rotations that today catch drift alerts and fairness regressions manually.

Proposed TaskSpawner Patterns

All four patterns work with Kelos's current API; no CRD changes are required to land the examples. The first three use the existing webhook (GenericWebhook) source merged via #687; the fourth uses cron.

Pattern 1 — MLflow Model Registry promotion → update KServe InferenceService manifest

Trigger: MLflow's model-registry webhook fires on MODEL_VERSION_TRANSITIONED_STAGE (e.g., Staging → Production).
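
For reference, a hypothetical payload shape matching the `fieldMapping` in the TaskSpawner below. MLflow's actual webhook schema may differ, so treat the field names as an assumption and verify them against the MLflow webhook docs before deploying:

```json
{
  "id": "evt-7f3a",
  "event": "MODEL_VERSION_TRANSITIONED_STAGE",
  "model_name": "churn-classifier",
  "version": "12",
  "from_stage": "Staging",
  "to_stage": "Production",
  "run_id": "a1b2c3d4"
}
```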

Agent task: Open a PR updating the corresponding InferenceService manifest in the GitOps repo with the new model URI, runtime version, and resource requests pulled from the registry's tags. Generate a model card delta from MLflow run metadata. Re-run smoke tests against the staging endpoint.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: mlflow-promotion-responder
spec:
  when:
    webhook:
      source: mlflow
      fieldMapping:
        id: "$.id"
        modelName: "$.model_name"
        version: "$.version"
        toStage: "$.to_stage"
        runId: "$.run_id"
      filters:
        - field: "$.to_stage"
          value: "Production"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: gitops-inference-manifests }
    branch: "model-promotion/{{.modelName}}-v{{.version}}"
    promptTemplate: |
      Model `{{.modelName}}` version `{{.version}}` was promoted to **Production** in MLflow
      (run: {{.runId}}).

      Please:
      1. Locate the corresponding KServe InferenceService manifest in this repo
         (search by name `{{.modelName}}` under `inference/`).
      2. Update `spec.predictor.model.storageUri` to the new artifact URI from MLflow.
      3. Update resource requests from the new run's tags (`gpu_type`, `replica_min`, `replica_max`).
      4. Append a model-card entry under `model-cards/{{.modelName}}.md` summarizing the
         run's eval metrics (accuracy, calibration, fairness deltas) compared to the
         currently-deployed version.
      5. Open a PR with a checklist for the on-call to verify the staging smoke test
         and approve rollout.

Pattern 2 — Drift detector webhook (Evidently / NannyML / Arize / Fiddler / WhyLabs) → open retraining PR

Trigger: Generic webhook from any drift-monitoring platform. All five vendors support outbound webhooks, and their payloads typically carry the model, feature, severity, metric, and timestamp fields this pattern needs — but the exact JSON paths vary by vendor, so adjust the fieldMapping accordingly.

Agent task: Investigate the flagged feature, propose a retraining plan in a draft PR (config bumps to the training pipeline definition, dataset window adjustment, retraining trigger), or open an issue if root cause looks like an upstream data-pipeline bug.
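
As a reference for the `fieldMapping` in the TaskSpawner below, a hypothetical Evidently-style drift-alert payload. The real payload shape differs per vendor and test suite; these field names are an illustrative assumption to adapt:

```json
{
  "test_id": "drift-2024-05-13-001",
  "model_name": "churn-classifier",
  "feature": "days_since_last_login",
  "severity": "high",
  "metric_name": "population_stability_index",
  "metric_value": 0.31
}
```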

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: drift-remediation
spec:
  when:
    webhook:
      source: evidently
      fieldMapping:
        id: "$.test_id"
        model: "$.model_name"
        feature: "$.feature"
        severity: "$.severity"
        metric: "$.metric_name"
        metricValue: "$.metric_value"
      filters:
        - field: "$.severity"
          pattern: "^(high|critical)$"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: ml-training-pipelines }
    branch: "drift/{{.model}}-{{.feature}}-{{.id}}"
    promptTemplate: |
      Evidently flagged drift on model **{{.model}}**, feature **{{.feature}}**:
      `{{.metric}} = {{.metricValue}}` (severity: {{.severity}}).

      Investigate by:
      1. Locating this model's training pipeline (likely under `pipelines/{{.model}}/`).
      2. Checking whether the upstream feature pipeline has had recent schema changes
         (`git log` on `features/{{.feature}}*`).
      3. Producing one of:
         - **Draft retraining PR**: bump dataset window, adjust feature transforms, bump
           pipeline parameters file. Add a checklist for evaluation thresholds.
         - **Issue (label `data-quality`)**: if the root cause looks like an upstream
           pipeline bug, not a model-decay issue.
      Quote the relevant metric thresholds from the pipeline's config so the human
      reviewer can audit your decision.
    ttlSecondsAfterFinished: 86400

Pattern 3 — Training-pipeline failure (Argo Workflows / Kubeflow Pipelines / Flyte) → root-cause analysis PR

Trigger: Kubernetes Events on WorkflowFailed (this would benefit from the proposed #872 kubernetesEvents source; until then, Argo's outbound webhook notifications or a Prometheus alert via #775 work).

Agent task: Pull the failed workflow's logs and pod statuses, classify the failure (OOM / image pull / data missing / actual code bug / GPU resource), and either open an issue tagging the pipeline owner with a remediation suggestion, or open a PR for clear-cut fixes (resource bumps, image digest pins, retry policy).

This pattern is identical in shape to #946's CI/CD failure auto-remediation, just specialized to ML workflow CRDs.
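
For completeness, a minimal sketch of this pattern using the existing webhook source, assuming Argo Workflows is configured to POST a failure payload from an exit handler. The `argo-workflows` source label and the payload field names are illustrative assumptions, not a fixed Argo schema:

```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: training-failure-triage
spec:
  when:
    webhook:
      source: argo-workflows
      fieldMapping:
        id: "$.workflow_name"
        pipeline: "$.pipeline"
        failedNode: "$.failed_node"
        phase: "$.phase"
      filters:
        - field: "$.phase"
          value: "Failed"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: ml-training-pipelines }
    branch: "pipeline-failure/{{.pipeline}}-{{.id}}"
    promptTemplate: |
      Training workflow `{{.id}}` for pipeline `{{.pipeline}}` failed at node
      `{{.failedNode}}`.

      1. Pull the failed node's logs and pod statuses.
      2. Classify the failure: OOM, image pull, missing data, code bug, or GPU starvation.
      3. For clear-cut infra fixes (resource bumps, image digest pins, retry policy),
         open a PR; otherwise open an issue tagging the pipeline owner with a
         remediation suggestion.
```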

Pattern 4 — Scheduled fairness / bias audit and model-card refresh

Trigger: cron on a weekly or monthly cadence.

Agent task: For each registered production model, run a templated fairness sweep against the eval dataset, regenerate the model card, and open a PR if metrics have shifted by more than a configured threshold. This produces the evidence trail demanded by EU AI Act / NIST AI RMF audits without pulling humans in every cycle.

apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: weekly-model-card-refresh
spec:
  when:
    cron:
      schedule: "0 6 * * 1"   # Mondays 06:00 UTC
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: model-cards-repo }
    branch: "model-card-refresh-{{.Time.Format \"2006-01-02\"}}"
    promptTemplate: |
      Weekly model-card and fairness-audit refresh.

      For each production-stage model registered in MLflow (read via the
      `MLFLOW_TRACKING_URI` env var):
      1. Pull the latest eval-set metrics including disaggregated fairness slices.
      2. Diff against the metrics in `model-cards/<model>.md`.
      3. Update model cards whose drift exceeds the thresholds defined in
         `model-cards/POLICY.md`.
      4. Open a single PR titled "Weekly model-card refresh — <date>" listing
         every changed card, with a digest of metric deltas in the PR body.

Why this is differentiated from existing issues

| Existing issue | Why MLOps is distinct |
|---|---|
| #920 security vuln auto-remediation | ML triggers are registry / drift / training-failure events, not GitHub security advisories |
| #946 CI/CD failure auto-remediation | ML pipelines fail with ML-specific signatures (OOM on data, eval-set regressions, GPU starvation) |
| #967 perf regression | Inference latency is one slice; drift / fairness / accuracy regression are different metrics |
| #981 supply-chain compliance | Model lineage / model-card / data provenance are governed by AI-specific frameworks (NIST AI RMF) |
| #992 data-privacy compliance | Overlaps lightly (PII in training data), but the ML-eval and registry pipelines are separate |

Existing Kelos primitives this builds on

Minor gaps worth tracking (for follow-up issues, not this one)

These are already proposed and tracked. This issue does not ask for new CRDs — only the four reference TaskSpawner patterns, examples folders, and an MLOps section in the docs.

Proposed deliverables

  1. New examples directory examples/mlops-mlflow-promotion/ with a runnable TaskSpawner + Workspace + README walking through pattern 1.
  2. New examples directory examples/mlops-drift-remediation/ for pattern 2.
  3. New examples directory examples/mlops-fairness-audit-cron/ for pattern 4.
  4. A new docs page docs/use-cases/mlops.md linking the patterns and explaining where each ecosystem partner (MLflow, KServe, Kubeflow, Argo, Evidently, Arize) fits.
  5. A short addition to the main README.md "Use Cases" section calling MLOps out as a first-class lifecycle.

Acceptance criteria

  • Each example directory follows the structure of examples/10-taskspawner-github-webhook/ (yaml + README).
  • Examples reference existing CRD fields only; no schema changes.
  • Docs page links to the upstream MLflow webhook spec, Evidently webhook spec, KServe InferenceService spec, and the relevant CNCF / Linux Foundation project pages.
  • Examples are listed in examples/README.md index.

/kind feature
