🚨 Breaking Change - Dataplane - Sane defaults #210
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts: # tests/generated/dataplane.additional-podlabels.yaml # tests/generated/dataplane.aws.eks-automode.yaml # tests/generated/dataplane.aws.with-ingress.yaml # tests/generated/dataplane.aws.yaml # tests/generated/dataplane.azure.yaml # tests/generated/dataplane.cost.yaml # tests/generated/dataplane.dcgm-exporter.yaml # tests/generated/dataplane.fully-selfhosted.yaml # tests/generated/dataplane.low-priv.yaml # tests/generated/dataplane.nodeobserver.yaml # tests/generated/dataplane.oci.yaml
Static Webhook
Current Aviator status
This PR was merged using Aviator.
# Conflicts: # charts/dataplane/templates/_helpers.tpl # charts/dataplane/values.yaml # tests/generated/dataplane.additional-podlabels.yaml # tests/generated/dataplane.aws.eks-automode.yaml # tests/generated/dataplane.aws.with-ingress.yaml # tests/generated/dataplane.aws.yaml # tests/generated/dataplane.azure.yaml # tests/generated/dataplane.cost.yaml # tests/generated/dataplane.dcgm-exporter.yaml # tests/generated/dataplane.fully-selfhosted.yaml # tests/generated/dataplane.low-priv.yaml # tests/generated/dataplane.nodeobserver.yaml # tests/generated/dataplane.oci.yaml
Get the webhook secret name
*/}}
{{- define "flytepropellerwebhook.secretName" -}}
flyte-pod-webhook
Suggest changing flyte-pod-webhook to union-pod-webhook, or something else that doesn't clash with the Flyte OSS one. Whenever a customer needs to run Union and Flyte OSS in the same namespace, the current name will make the deployment fail.
Why can't these be installed in separate namespaces?
They can, but we have customers who only have permissions for a single namespace.
Same here: this conflicts with the Flyte OSS one.
Ditto: separate namespaces? The only cluster-wide object is the MutatingWebhookConfiguration, for which I added "-org" to the name.
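One way to avoid the name clash raised in this thread, even when both charts share a namespace, would be to derive the Secret name from the Helm release instead of hardcoding it. The sketch below is only an illustration: the helper name `union.webhookSecretName` and the `-pod-webhook` suffix are assumptions, not what this PR implements.

```yaml
{{/*
Hypothetical helper (not from this PR): build the webhook Secret name from the
release name so two charts installed in one namespace do not collide.
The result is truncated to 63 characters, following the common Helm chart convention.
*/}}
{{- define "union.webhookSecretName" -}}
{{- printf "%s-pod-webhook" .Release.Name | trunc 63 | trimSuffix "-" -}}
{{- end -}}
```

With a release named `union-dp`, this helper would render `union-dp-pod-webhook`.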
name: union-pod-webhook
I just tested this without overriding the default config for webhook certs and got this from a V2 execution that calls a secret
# Conflicts: # tests/generated/dataplane.additional-podlabels.yaml # tests/generated/dataplane.additional-templates.yaml # tests/generated/dataplane.aws.eks-automode.yaml # tests/generated/dataplane.aws.with-ingress.yaml # tests/generated/dataplane.aws.yaml # tests/generated/dataplane.azure-custom-storage-prefix.yaml # tests/generated/dataplane.azure.yaml # tests/generated/dataplane.cost.yaml # tests/generated/dataplane.dcgm-exporter.yaml # tests/generated/dataplane.fully-selfhosted.yaml # tests/generated/dataplane.low-priv.yaml # tests/generated/dataplane.monitoring.yaml # tests/generated/dataplane.nodeobserver.yaml # tests/generated/dataplane.oci.yaml
davidmirror-ops left a comment
Testing isn't finished, but at least in the singleNamespace and low_privilege modes, the base components look and behave healthy.
# Conflicts: # charts/dataplane/templates/operator/configmap.yaml # tests/generated/dataplane.additional-podlabels.yaml # tests/generated/dataplane.additional-templates.yaml # tests/generated/dataplane.aws.eks-automode.yaml # tests/generated/dataplane.aws.with-ingress.yaml # tests/generated/dataplane.aws.yaml # tests/generated/dataplane.azure-custom-storage-prefix.yaml # tests/generated/dataplane.azure.yaml # tests/generated/dataplane.cost.yaml # tests/generated/dataplane.dcgm-exporter.yaml # tests/generated/dataplane.fully-selfhosted.yaml # tests/generated/dataplane.low-priv.yaml # tests/generated/dataplane.monitoring.yaml # tests/generated/dataplane.nodeobserver.yaml # tests/generated/dataplane.oci.yaml
# Conflicts: # charts/controlplane/Chart.yaml # charts/dataplane/Chart.yaml # tests/generated/controlplane.aws.billing-enable.yaml # tests/generated/controlplane.aws.yaml # tests/generated/controlplane.external-authz.yaml # tests/generated/controlplane.userclouds.yaml # tests/generated/dataplane.additional-podlabels.yaml # tests/generated/dataplane.additional-templates.yaml # tests/generated/dataplane.aws.eks-automode.yaml # tests/generated/dataplane.aws.with-ingress.yaml # tests/generated/dataplane.aws.yaml # tests/generated/dataplane.azure-custom-storage-prefix.yaml # tests/generated/dataplane.azure.yaml # tests/generated/dataplane.cost.yaml # tests/generated/dataplane.dcgm-exporter.yaml # tests/generated/dataplane.fully-selfhosted.yaml # tests/generated/dataplane.low-priv.yaml # tests/generated/dataplane.monitoring.yaml # tests/generated/dataplane.nodeobserver.yaml # tests/generated/dataplane.oci.yaml
## Overview

Fixes multiple monitoring and metrics regressions introduced by PR #210 (sane defaults), plus adds missing dataproxy taskMetrics config from cloud #15357 migration.

### Dataplane monitoring fixes

1. **ServiceMonitor and PrometheusRule gates**: Remove `(not .Values.low_privilege)` dependency. These are namespace-scoped CRDs that don't require elevated permissions. Gate on `monitoring.serviceMonitors.enabled` + CRD capability check instead.
2. **Prometheus scrape config label mismatch**: The `union-services` job selected on `platform.union.ai/service-group` but executor/leaseworker services only have `platform.union.ai/prometheus-group`. Fixed to use `prometheus-group`.
3. **Restore kubernetes-cadvisor scrape job**: PR #210 dropped the cadvisor job from the union-features Prometheus. This job scrapes `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes`, which power the Infrastructure dashboard panels and TLM metrics.
4. **Fix duplicate kube-state-metrics key**: Two `kube-state-metrics:` entries under `prometheus:` (one said `enabled: false`, the other `enabled: true`). Merged into a single entry with metric relabelings.
5. **Move prometheus-rbac.yaml** to `templates/prometheus/rbac.yaml`.

### Controlplane fixes

6. **Add taskMetrics config to dataproxy** (FAB-306): Cloud #15357 migrated `GetActionAttemptMetrics` from usage to dataproxy, but the helm chart never included the query templates. Console CPU/memory charts failed with "key not found." Added all 20 PromQL templates + agentQuery mappings under `services.dataproxy.configMap.dataproxy.taskMetrics`.
7. **Remove duplicate imagePullSecrets key**: Orphaned top-level key conflicted with `controlplane.imagePullSecrets`.
8. **Use UNION_HOST in ingress tls.hosts**: Replace hardcoded `localhost` with `{{ .Values.global.UNION_HOST }}` so the TLS cert covers the actual domain. Allows Terraform overlays to drop their `tls.hosts` override.

### Documentation

9. **Document low_privilege tradeoffs**: Added known limitations when `low_privilege: true`: TLM unavailable (cadvisor needs node SD), KSM limited to release namespace, node-level metrics missing, cost calculation reduced.

## Upgrade notes for `low_privilege: false`

When switching from `low_privilege: true` to `low_privilege: false`, the prometheus subchart and its kube-state-metrics dependency need RBAC enabled to create ClusterRole/ClusterRoleBinding. Add to your values override:

```yaml
prometheus:
  rbac:
    create: true
  kube-state-metrics:
    rbac:
      create: true
```

Without this, the union-features Prometheus cannot scrape cadvisor (node service discovery) and KSM cannot list pods across namespaces, so TLM metrics will show "No data" in the console.

## Test plan

- [x] Verified all 42 Apple doc metrics exist in helm-charts dashboards/PrometheusRules
- [x] Live validation on `mike-apple-aws` selfhosted environment
- [x] DP Prometheus scraping operator, executor, propeller metrics via ServiceMonitor
- [x] cadvisor metrics (container_cpu, container_memory) flowing (558 series)
- [x] KSM metrics (kube_pod_container_resource_requests/limits, kube_pod_status_phase) flowing (124+ series)
- [x] TLM console metrics tab shows CPU/memory data for tasks > 30s
- [x] All 15 dataplane + 5 controlplane test snapshots passing
- [x] `helm template` renders dataproxy configmap with taskMetrics under correct config section

## Issues

Fixes FAB-306

🤖 Generated with [Claude Code](https://claude.com/claude-code)
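Item 1 above describes gating ServiceMonitors on `monitoring.serviceMonitors.enabled` plus a CRD capability check. A minimal sketch of such a gate follows; the resource name, selector label value, and port name are placeholders, and the actual template in the chart may differ.

```yaml
{{/* Render only when enabled and the Prometheus Operator API group is available. */}}
{{- if and .Values.monitoring.serviceMonitors.enabled (.Capabilities.APIVersions.Has "monitoring.coreos.com/v1") }}
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-union-services   # placeholder name
  namespace: {{ .Release.Namespace }}
spec:
  selector:
    matchLabels:
      platform.union.ai/prometheus-group: example   # placeholder label value
  endpoints:
    - port: metrics   # assumed port name
{{- end }}
```

When rendering offline with `helm template`, the capability check only passes if the API version is supplied explicitly, for example with `--api-versions monitoring.coreos.com/v1`.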
[:rotating_light: Breaking Change]
Major overhaul of the dataplane chart to establish sane, low-privilege defaults that work out of the box for new deployments while preserving backward compatibility via values-legacy.yaml.
Default mode changes
Prometheus consolidation
Knative operator improvements
Image builder auto-configuration
<account_id>.dkr.ecr.<region>.amazonaws.com/<registryName>
<region>-docker.pkg.dev/<projectId>/<registryName>
<registryName>.azurecr.io
Other changes
Webhook certificate management
🚨 Preserving current behavior
To preserve current behavior, dataplane/values-legacy.yaml is added to revert all defaults to their current values. It can be used in conjunction with any other values file.
Test plan