Summary
Kelos ships strong primitives for engineering-team automation (GitHub/Linear/Jira sources, generic webhooks, Prometheus alerts, K8s events, cron), but no example, template, or proposal addresses the MLOps / ML model lifecycle. This is a notable gap given that Kelos is Kubernetes-native and the K8s ecosystem is the de facto control plane for ML — MLflow, KServe, Kubeflow Pipelines, Argo Workflows, Flyte, Seldon, BentoML, Ray, and Feast all run on K8s and emit events Kelos can already consume via the `webhook`, `cron`, and `prometheusAlerts` (#775, merged) sources.
ML platforms have an unusually high autonomous-agent payoff: the triggering events are machine-readable (registry stage transitions, drift alerts, workflow failures, scheduled audit windows), and the remediation almost always lands as a reviewable PR against declarative configs that already live in Git.
This issue proposes four concrete TaskSpawner patterns for the MLOps lifecycle, identifies the ecosystem they target, and notes which existing Kelos primitives already cover them vs. what minor gaps could be closed in follow-ups.
Target Audience
- ML platform engineers running MLflow / KServe / Kubeflow / Seldon on K8s, owning the serving stack and registry plumbing.
- ML engineers and data scientists owning model code, training pipelines, evaluation harnesses, and notebooks.
- SRE-for-ML / on-call rotations that today catch drift alerts and fairness regressions manually.
Proposed TaskSpawner Patterns
All four patterns work with Kelos's current API; no CRD changes are required to land the examples. The first three use the existing `webhook` (GenericWebhook) source merged via #687; the fourth uses `cron`.
Pattern 1 — MLflow Model Registry promotion → update KServe InferenceService manifest
Trigger: MLflow's model-registry webhook fires on `MODEL_VERSION_TRANSITIONED_STAGE` (e.g., Staging → Production).
Agent task: Open a PR updating the corresponding `InferenceService` manifest in the GitOps repo with the new model URI, runtime version, and resource requests pulled from the registry's tags. Generate a model-card delta from MLflow run metadata. Re-run smoke tests against the staging endpoint.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: mlflow-promotion-responder
spec:
  when:
    webhook:
      source: mlflow
      fieldMapping:
        id: "$.id"
        modelName: "$.model_name"
        version: "$.version"
        toStage: "$.to_stage"
        runId: "$.run_id"
      filters:
        - field: "$.to_stage"
          value: "Production"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: gitops-inference-manifests }
    branch: "model-promotion/{{.modelName}}-v{{.version}}"
    promptTemplate: |
      Model `{{.modelName}}` version `{{.version}}` was promoted to **Production** in MLflow
      (run: {{.runId}}).

      Please:
      1. Locate the corresponding KServe InferenceService manifest in this repo
         (search by name `{{.modelName}}` under `inference/`).
      2. Update `spec.predictor.model.storageUri` to the new artifact URI from MLflow.
      3. Update resource requests from the new run's tags (`gpu_type`, `replica_min`, `replica_max`).
      4. Append a model-card entry under `model-cards/{{.modelName}}.md` summarizing the
         run's eval metrics (accuracy, calibration, fairness deltas) compared to the
         currently-deployed version.
      5. Open a PR with a checklist for the on-call to verify the staging smoke test
         and approve rollout.
```
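For context, the manifest the agent edits in step 2 would look roughly like the following. This is a minimal, hypothetical KServe `InferenceService` sketch; the name, namespace, storage URI, and resource values are placeholders for illustration, not part of this proposal.

```yaml
# Illustrative PR target for pattern 1 (all values are placeholders).
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: churn-predictor          # matches {{.modelName}}
  namespace: inference
spec:
  predictor:
    minReplicas: 2               # updated from the run's `replica_min` tag
    maxReplicas: 6               # updated from the run's `replica_max` tag
    model:
      modelFormat:
        name: sklearn
      # The agent bumps this to the newly promoted MLflow artifact URI.
      storageUri: s3://models/churn-predictor/v12/
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
```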
Pattern 2 — Drift detector webhook (Evidently / NannyML / Arize / Fiddler / WhyLabs) → open retraining PR
Trigger: Generic webhook from any drift-monitoring platform. These vendors generally support outbound webhook notifications that identify the model, feature, severity, metric, and timestamp; field names vary per vendor, so the `fieldMapping` below (shown for Evidently) must be adapted per source.
Agent task: Investigate the flagged feature, propose a retraining plan in a draft PR (config bumps to the training pipeline definition, dataset window adjustment, retraining trigger), or open an issue if the root cause looks like an upstream data-pipeline bug.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: drift-remediation
spec:
  when:
    webhook:
      source: evidently
      fieldMapping:
        id: "$.test_id"
        model: "$.model_name"
        feature: "$.feature"
        severity: "$.severity"
        metric: "$.metric_name"
        metricValue: "$.metric_value"
      filters:
        - field: "$.severity"
          pattern: "^(high|critical)$"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: ml-training-pipelines }
    branch: "drift/{{.model}}-{{.feature}}-{{.id}}"
    promptTemplate: |
      Evidently flagged drift on model **{{.model}}**, feature **{{.feature}}**:
      `{{.metric}} = {{.metricValue}}` (severity: {{.severity}}).

      Investigate by:
      1. Locating this model's training pipeline (likely under `pipelines/{{.model}}/`).
      2. Checking whether the upstream feature pipeline has had recent schema changes
         (`git log` on `features/{{.feature}}*`).
      3. Producing one of:
         - **Draft retraining PR**: bump dataset window, adjust feature transforms, bump
           pipeline parameters file. Add a checklist for evaluation thresholds.
         - **Issue (label `data-quality`)**: if the root cause looks like an upstream
           pipeline bug, not a model-decay issue.

      Quote the relevant metric thresholds from the pipeline's config so the human
      reviewer can audit your decision.
  ttlSecondsAfterFinished: 86400
```
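The `fieldMapping` above assumes the monitor POSTs a body shaped roughly like the one below. This is an illustrative payload, not Evidently's documented schema; adjust the JSONPath expressions to whatever your drift monitor actually sends.

```yaml
# Hypothetical drift-alert body (a JSON document, shown here in YAML form for readability).
test_id: "drift-2024-0142"
model_name: "churn-predictor"
feature: "days_since_last_login"
severity: "high"
metric_name: "population_stability_index"
metric_value: 0.31
timestamp: "2024-06-03T07:15:00Z"
```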
Pattern 3 — Training-pipeline failure (Argo Workflows / Kubeflow Pipelines / Flyte) → root-cause analysis PR
Trigger: Kubernetes Events on `WorkflowFailed` (this would benefit from the #872 `kubernetesEvents` source; until then, an outbound webhook from Argo or a Prometheus alert via #775 works — see the sketch below).
Agent task: Pull the failed workflow's logs and pod statuses, classify the failure (OOM / image pull / missing data / actual code bug / GPU resource starvation), and either open an issue tagging the pipeline owner with a remediation suggestion, or open a PR for clear-cut fixes (resource bumps, image digest pins, retry policy).
This pattern is identical in shape to #946's CI/CD failure auto-remediation, just specialized to ML workflow CRDs.
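Until #872 lands, a rough sketch of pattern 3 on today's generic `webhook` source could look like this. The `source` name, payload field paths, and workspace are assumptions for illustration; they depend on how the cluster's Argo (or Kubeflow/Flyte) installation is configured to emit failure notifications.

```yaml
# Sketch only: assumes the workflow engine POSTs a small JSON body on failure.
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: training-pipeline-failure-triage
spec:
  when:
    webhook:
      source: argo-workflows
      fieldMapping:
        workflowName: "$.workflow_name"
        namespace: "$.namespace"
        phase: "$.phase"
        failureMessage: "$.message"
      filters:
        - field: "$.phase"
          value: "Failed"
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: ml-training-pipelines }
    branch: "workflow-failure/{{.workflowName}}"
    promptTemplate: |
      Training workflow `{{.workflowName}}` in namespace `{{.namespace}}` failed:
      "{{.failureMessage}}".

      Pull the workflow's logs and pod statuses, classify the failure
      (OOM / image pull / missing data / code bug / GPU resource starvation),
      then either open an issue tagging the pipeline owner with a remediation
      suggestion, or open a PR for clear-cut fixes (resource bumps, image
      digest pins, retry policy).
```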
Pattern 4 — Scheduled fairness / bias audit and model-card refresh
Trigger: `cron` on a weekly or monthly cadence.
Agent task: For each registered production model, run a templated fairness sweep against the eval dataset, regenerate the model card, and open a PR if metrics have shifted by more than a configured threshold. This produces the evidence trail demanded by EU AI Act / NIST AI RMF audits without pulling humans into every cycle.
```yaml
apiVersion: kelos.dev/v1alpha1
kind: TaskSpawner
metadata:
  name: weekly-model-card-refresh
spec:
  when:
    cron:
      schedule: "0 6 * * 1"  # Mondays 06:00 UTC
  taskTemplate:
    type: claude-code
    credentials:
      type: api-key
      secretRef: { name: claude-credentials }
    workspaceRef: { name: model-cards-repo }
    branch: "model-card-refresh-{{.Time.Format \"2006-01-02\"}}"
    promptTemplate: |
      Weekly model-card and fairness-audit refresh.

      For each production-stage model registered in MLflow (read via the
      `MLFLOW_TRACKING_URI` env var):
      1. Pull the latest eval-set metrics including disaggregated fairness slices.
      2. Diff against the metrics in `model-cards/<model>.md`.
      3. Update model cards whose drift exceeds the thresholds defined in
         `model-cards/POLICY.md`.
      4. Open a single PR titled "Weekly model-card refresh — <date>" listing
         every changed card, with a digest of metric deltas in the PR body.
```
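The policy file the prompt reads in step 3 is out of scope for this issue; one hypothetical shape for the thresholds in `model-cards/POLICY.md` (say, an embedded YAML block) might be:

```yaml
# Hypothetical refresh policy consumed by the weekly audit (illustrative names and values).
refresh_thresholds:
  accuracy_delta: 0.01            # refresh the card if top-line accuracy moves by more than 1 pp
  calibration_delta: 0.02
  fairness_slice_delta: 0.02      # any disaggregated slice metric moving by more than 2 pp
open_pr_even_if_unchanged: false  # skip the PR when nothing crosses a threshold
```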
Why this is differentiated from existing issues
| Existing issue | Why MLOps is distinct |
| --- | --- |
| #920 security vuln auto-remediation | ML triggers are registry / drift / training-failure events, not GitHub security advisories |
| #946 CI/CD failure auto-remediation | ML pipelines fail with ML-specific signatures (OOM on data, eval-set regressions, GPU starvation) |
| #967 perf regression | Inference latency is one slice; drift / fairness / accuracy regression are different metrics |
| #981 supply-chain compliance | Model lineage / model-card / data provenance are governed by AI-specific frameworks (NIST AI RMF) |
| #992 data-privacy compliance | Overlaps lightly (PII in training data) but the ML-eval and registry pipelines are separate |
Existing Kelos primitives this builds on
- `webhook` source (GenericWebhook, #687) — already covers MLflow, Evidently, NannyML, Arize, Fiddler, WhyLabs, BentoML
- `prometheusAlerts` source (#775) — covers inference-side regressions
- `cron` source — covers scheduled audits
- `linearWebhook` / `githubWebhook` — covers human-in-the-loop ChatOps for ML reviewers

Minor gaps worth tracking (for follow-up issues, not this one)
- `kubernetesEvents` source (#872, open) would let pattern 3 listen directly on `WorkflowFailed` / `PipelineRunFailed` Custom Resource events without a webhook bridge.
- `contextSources` on TaskTemplate (#881, open) would let agents pull MLflow run metadata or eval reports as input context, reducing the agent's reliance on tool-calling out of the workspace.
- CloudEvents support in kelos-webhook-server (#914, open) would give a single canonical mapping for Vertex AI / SageMaker / Kubeflow Notifications, all of which speak CloudEvents natively.
These are already proposed and tracked. This issue does not ask for new CRDs — only the four reference TaskSpawner patterns, examples folders, and an MLOps section in the docs.
Proposed deliverables
- New examples directory `examples/mlops-mlflow-promotion/` with a runnable TaskSpawner + Workspace + README walking through pattern 1.
- New examples directory `examples/mlops-drift-remediation/` for pattern 2.
- New examples directory `examples/mlops-fairness-audit-cron/` for pattern 4.
- A new docs page `docs/use-cases/mlops.md` linking the patterns and explaining where each ecosystem partner (MLflow, KServe, Kubeflow, Argo, Evidently, Arize) fits.
- A short addition to the main `README.md` "Use Cases" section calling out MLOps as a first-class lifecycle.
Acceptance criteria
- Each example directory follows the structure of `examples/10-taskspawner-github-webhook/` (YAML + README).
- Examples reference existing CRD fields only; no schema changes.
- Docs page links to the upstream MLflow webhook spec, Evidently webhook spec, KServe InferenceService spec, and the relevant CNCF / Linux Foundation project pages.
- Examples are listed in the `examples/README.md` index.
/kind feature