🚨 Breaking Change - Dataplane - Sane defaults #210

Merged: aviator-app[bot] merged 95 commits into main from enghabu/sane-defaults on Apr 9, 2026

@EngHabu EngHabu commented Jan 26, 2026

[:rotating_light: Breaking Change]

Major overhaul of the dataplane chart to establish sane, low-privilege defaults that work out of the box for new deployments while preserving backward compatibility via values-legacy.yaml.

Default mode changes

  • Low-privilege mode (low_privilege: true) is now the default — namespace-scoped RBAC, no cluster-wide permissions except where strictly required (fluentbit DaemonSet, knative-operator)
  • V2 executor is the default — flytepropeller.enabled: false, executor.idl2Executor: true, flyteconnector.enabled: true
  • Cluster resource sync disabled by default (clusterresourcesync.enabled: false)
  • Common service account (union-system) shared across all components by default
  • values-legacy.yaml created to restore all previous defaults with a single overlay file
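
For reference, the defaults listed above correspond roughly to the values below. This is a sketch assembled only from the keys named in this description; the chart's values.yaml remains authoritative for exact nesting and for any keys not mentioned here.

```yaml
# Sketch of the new dataplane defaults, built from the keys named above.
low_privilege: true        # namespace-scoped RBAC by default
flytepropeller:
  enabled: false           # disabled because the V2 executor is the default
executor:
  idl2Executor: true       # V2 executor
flyteconnector:
  enabled: true
clusterresourcesync:
  enabled: false           # cluster resource sync is off by default
```

Deployments that rely on the previous behavior would not edit these keys one by one but layer values-legacy.yaml instead.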

Prometheus consolidation

  • Replaced the static prometheus deployment (templates/prometheus/) and the prometheus-simple subchart with a single community prometheus chart aliased as prometheus
  • Removed standalone kube-state-metrics dependency (now a subchart of prometheus)
  • Namespace-scoped RBAC for prometheus and kube-state-metrics in low-privilege mode via prometheus-rbac.yaml
  • Scrape configs for flytepropeller, serving-envoy, and dcgm-exporter are now unconditional (no-op when targets don't exist)
  • cAdvisor scraping moved to values-legacy.yaml (requires ClusterRole)

Knative operator improvements

  • Split CRDs into knative-operator-crds subchart — solves chicken-and-egg CRD validation error on fresh installs. CRDs are in the crds/ directory so Helm installs them before templates
  • CRDs chart conditioned on knative-operator-crds.enabled
  • Removed single_namespace support — the operator requires cluster-scoped RBAC and can't be namespace-restricted
  • Fixed _example key in operator ConfigMaps that caused webhook validation failures on helm upgrade
  • Added tpl support in knative-operator.namespace helper for namespaceOverride
  • Fixed CRD conversion webhook namespace to use the helper instead of .Release.Namespace
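
Since the CRDs chart is conditioned on knative-operator-crds.enabled, a cluster where the Knative operator CRDs are already managed outside this release could presumably switch the subchart off. A minimal sketch, assuming the CRDs are installed by some other means:

```yaml
knative-operator-crds:
  enabled: false   # skip the CRDs subchart; the CRDs must already exist in the cluster
```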

Image builder auto-configuration

  • imageBuilder.defaultRepository auto-generates from cloud provider:
    • AWS: <account_id>.dkr.ecr.<region>.amazonaws.com/<registryName>
    • GCP: <region>-docker.pkg.dev/<projectId>/<registryName>
    • Azure: <registryName>.azurecr.io
  • imageBuilder.authenticationType auto-detects (aws, google, azure, noop)
  • New imageBuilder.registryName value (default: union-dataplane)
  • Added global.AWS_ACCOUNT_ID for ECR URL generation
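
As an illustration of the auto-configuration above, an AWS install could set only the account ID and registry name and leave the repository for the chart to derive. The account ID is a placeholder, and the commented URL shows the pattern from the list above rather than verified chart output.

```yaml
global:
  AWS_ACCOUNT_ID: "123456789012"   # placeholder account ID
imageBuilder:
  registryName: union-dataplane    # the stated default, shown here for clarity
  # defaultRepository and authenticationType are intentionally left unset so
  # the chart can derive them, following the AWS pattern:
  #   <account_id>.dkr.ecr.<region>.amazonaws.com/<registryName>
```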

Other changes

  • Fixed fluentbit.serviceAccount.name to use common SA (union-system) by default
  • Fixed union-serviceaccount.yaml missing tpl calls (pre-existing bug)
  • Webhook templates moved from nodeexecutor/ to webhook/ directory with Helm-managed certificates
  • Added GCP test case (dataplane.gcp.yaml)
  • Removed values-low-privilege.yaml and values.v2.yaml (superseded by new defaults)

Webhook certificate management

  • Helm-managed MutatingWebhookConfiguration — the webhook configuration is now created by Helm (with flytepropellerwebhook.managedConfig: true) instead of self-registered by the webhook binary at runtime. This removes the need for the webhook to have cluster-scoped RBAC for mutatingwebhookconfigurations
  • Configurable certificate providers (flytepropellerwebhook.certificate.provider):
    • helm (default) — generates self-signed certs using Helm's crypto functions, preserved on upgrade if the secret exists
    • certManager — uses cert-manager to provision and manage certificates
    • external — user-provided certificates via caCert, tlsCrt, tlsKey values
    • legacy — original behavior where the webhook binary generates its own certs via init container
  • Static test certificates (values-test-certs.yaml) for deterministic test snapshot generation
  • Webhook templates restructured from nodeexecutor/ to webhook/ directory
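
A values override that selects a non-default certificate provider might look like the sketch below. The managedConfig and certificate.provider paths come from the description; placing caCert, tlsCrt, and tlsKey directly under flytepropellerwebhook.certificate is an assumption, since the description does not give their full path or expected encoding.

```yaml
flytepropellerwebhook:
  managedConfig: true          # Helm creates the MutatingWebhookConfiguration
  certificate:
    provider: external         # one of: helm, certManager, external, legacy
    # Assumed location for the user-provided material; confirm in values.yaml.
    caCert: "<CA certificate>"
    tlsCrt: "<server certificate>"
    tlsKey: "<server private key>"
```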

🚨 Preserving current behavior

To preserve the previous behavior, dataplane/values-legacy.yaml is added; it reverts all defaults to their pre-change values. Pass it before any other values file, since Helm gives later -f files precedence:

helm upgrade ..... -f values-legacy.yaml -f myvalues.yaml

Test plan

  • Verify helm template renders correctly for default values
  • Verify helm template with low_privilege: true produces namespace-scoped resources and no cluster-wide permissions
  • Verify build-image-config configmap is created only in single-namespace mode
  • Verify depot-token imagePullSecret appears in task pod template when Depot is enabled
  • Verify webhook certificates render correctly for all providers (helm, certManager, external, legacy)
  • Verify generated test fixtures match expected output
  • Test upgrade path from existing deployments (webhook secret reuse, resource renames)

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>

aviator-app Bot commented Jan 26, 2026

Current Aviator status


This PR was merged using Aviator.



Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	charts/dataplane/templates/_helpers.tpl
#	charts/dataplane/values.yaml
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu EngHabu marked this pull request as ready for review February 6, 2026 21:30
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
Comment thread charts/dataplane/templates/_helpers.tpl Outdated
{{/*
Get the webhook secret name
*/}}
{{- define "flytepropellerwebhook.secretName" -}}
flyte-pod-webhook
{{- end -}}

Contributor

Suggested change: replace flyte-pod-webhook with union-pod-webhook, or something else that doesn't clash with the Flyte OSS name. Whenever a customer needs to run Union and Flyte OSS in the same namespace, the clash will make the deployment fail.

Contributor Author

Why can't these be installed in separate namespaces?

Contributor

They can, but we have customers who only have permissions for a single namespace.

Contributor

Same here: this conflicts with the Flyte OSS one.

Contributor Author

Ditto: why not separate namespaces? The only cluster-wide object is the MutatingWebhookConfiguration, for which I added "-org" to the name.

Contributor

Suggested change
name: union-pod-webhook

@davidmirror-ops
Contributor

I just tested this without overriding the default config for webhook certs and got this error from a V2 execution that uses a secret:

failed to execute node: failed at Node[a0]. RuntimeExecutionError: failed during plugin execution, caused by: failed to execute handle for plugin [container]: [InternalError] failed to create resource, caused by: Internal error occurred: failed calling webhook "flyte-pod-webhook.flyte.org": failed to call webhook: Post "https://flyte-pod-webhook.union.svc:443/mutate--v1-pod/secrets?timeout=30s": tls: failed to verify certificate: x509: certificate signed by unknown authority

Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
@github-actions github-actions Bot mentioned this pull request Apr 3, 2026
EngHabu added 3 commits April 3, 2026 18:31
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.additional-templates.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure-custom-storage-prefix.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.monitoring.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu EngHabu requested a review from davidmirror-ops April 6, 2026 19:22
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>

@davidmirror-ops davidmirror-ops left a comment

Testing is not complete, but at least with the singleNamespace and low_privilege modes, the base components look and behave healthy.

EngHabu added 11 commits April 6, 2026 18:19
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.additional-templates.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure-custom-storage-prefix.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.monitoring.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.additional-templates.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure-custom-storage-prefix.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.monitoring.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
# Conflicts:
#	charts/dataplane/templates/operator/configmap.yaml
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.additional-templates.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure-custom-storage-prefix.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.monitoring.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@EngHabu EngHabu changed the title from "Dataplane - Sane defaults" to "🚨 Breaking Change - Dataplane - Sane defaults" on Apr 9, 2026
EngHabu added 3 commits April 9, 2026 11:15
# Conflicts:
#	charts/controlplane/Chart.yaml
#	charts/dataplane/Chart.yaml
#	tests/generated/controlplane.aws.billing-enable.yaml
#	tests/generated/controlplane.aws.yaml
#	tests/generated/controlplane.external-authz.yaml
#	tests/generated/controlplane.userclouds.yaml
#	tests/generated/dataplane.additional-podlabels.yaml
#	tests/generated/dataplane.additional-templates.yaml
#	tests/generated/dataplane.aws.eks-automode.yaml
#	tests/generated/dataplane.aws.with-ingress.yaml
#	tests/generated/dataplane.aws.yaml
#	tests/generated/dataplane.azure-custom-storage-prefix.yaml
#	tests/generated/dataplane.azure.yaml
#	tests/generated/dataplane.cost.yaml
#	tests/generated/dataplane.dcgm-exporter.yaml
#	tests/generated/dataplane.fully-selfhosted.yaml
#	tests/generated/dataplane.low-priv.yaml
#	tests/generated/dataplane.monitoring.yaml
#	tests/generated/dataplane.nodeobserver.yaml
#	tests/generated/dataplane.oci.yaml
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
Signed-off-by: Haytham Abuelfutuh <haytham@afutuh.com>
@aviator-app aviator-app Bot merged commit 42d9277 into main Apr 9, 2026
5 checks passed
@aviator-app aviator-app Bot deleted the enghabu/sane-defaults branch April 9, 2026 18:22
aviator-app Bot pushed a commit that referenced this pull request May 2, 2026
## Overview

Fixes multiple monitoring and metrics regressions introduced by PR #210 (sane defaults), plus adds missing dataproxy taskMetrics config from cloud #15357 migration.

### Dataplane monitoring fixes

1. **ServiceMonitor and PrometheusRule gates** — Remove `(not .Values.low_privilege)` dependency. These are namespace-scoped CRDs that don't require elevated permissions. Gate on `monitoring.serviceMonitors.enabled` + CRD capability check instead.

2. **Prometheus scrape config label mismatch** — The `union-services` job selected on `platform.union.ai/service-group` but executor/leaseworker services only have `platform.union.ai/prometheus-group`. Fixed to use `prometheus-group`.

3. **Restore kubernetes-cadvisor scrape job** — PR #210 dropped the cadvisor job from the union-features Prometheus. This job scrapes `container_cpu_usage_seconds_total` and `container_memory_working_set_bytes` which power the Infrastructure dashboard panels and TLM metrics.

4. **Fix duplicate kube-state-metrics key** — Two `kube-state-metrics:` entries under `prometheus:` (one said `enabled: false`, the other `enabled: true`). Merged into single entry with metric relabelings.

5. **Move prometheus-rbac.yaml** to `templates/prometheus/rbac.yaml`.
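
With the gate change in item 1, ServiceMonitors and PrometheusRules can be requested even in low-privilege mode. A sketch using only the keys named above; the Prometheus Operator CRDs still need to be installed in the cluster for the capability check to pass:

```yaml
low_privilege: true
monitoring:
  serviceMonitors:
    enabled: true   # no longer blocked by low_privilege after this fix
```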

### Controlplane fixes

6. **Add taskMetrics config to dataproxy** (FAB-306) — Cloud #15357 migrated `GetActionAttemptMetrics` from usage to dataproxy, but the helm chart never included the query templates. Console CPU/memory charts failed with "key not found." Added all 20 PromQL templates + agentQuery mappings under `services.dataproxy.configMap.dataproxy.taskMetrics`.

7. **Remove duplicate imagePullSecrets key** — Orphaned top-level key conflicted with `controlplane.imagePullSecrets`.

8. **Use UNION_HOST in ingress tls.hosts** — Replace hardcoded `localhost` with `{{ .Values.global.UNION_HOST }}` so the TLS cert covers the actual domain. Allows Terraform overlays to drop their `tls.hosts` override.

### Documentation

9. **Document low_privilege tradeoffs** — Added known limitations when `low_privilege: true`: TLM unavailable (cadvisor needs node SD), KSM limited to release namespace, node-level metrics missing, cost calculation reduced.

## Upgrade notes for `low_privilege: false`

When switching from `low_privilege: true` to `low_privilege: false`, the prometheus subchart and its kube-state-metrics dependency need RBAC enabled to create ClusterRole/ClusterRoleBinding. Add to your values override:

```yaml
prometheus:
  rbac:
    create: true
  kube-state-metrics:
    rbac:
      create: true
```

Without this, the union-features Prometheus cannot scrape cadvisor (node service discovery) and KSM cannot list pods across namespaces — TLM metrics will show "No data" in the console.

## Test plan

- [x] Verified all 42 Apple doc metrics exist in helm-charts dashboards/PrometheusRules
- [x] Live validation on `mike-apple-aws` selfhosted environment
- [x] DP Prometheus scraping operator, executor, propeller metrics via ServiceMonitor
- [x] cadvisor metrics (container_cpu, container_memory) flowing — 558 series
- [x] KSM metrics (kube_pod_container_resource_requests/limits, kube_pod_status_phase) flowing — 124+ series
- [x] TLM console metrics tab shows CPU/memory data for tasks > 30s
- [x] All 15 dataplane + 5 controlplane test snapshots passing
- [x] `helm template` renders dataproxy configmap with taskMetrics under correct config section

## Issues

Fixes FAB-306

🤖 Generated with [Claude Code](https://claude.com/claude-code)