Update authorizer dashboard: standardized backend metrics#353
Update authorizer dashboard: standardized backend metrics#353aviator-app[bot] merged 1 commit intomainfrom
Conversation
There was a problem hiding this comment.
Pull request overview
Updates the union-controlplane-overview Grafana dashboard to modernize the Authorizer row (standardized backend metrics, identity_type breakdowns, improved units/thresholds/zero-state behavior) and refreshes generated Helm snapshot fixtures to match the new dashboard output.
Changes:
- Overhaul Authorizer row panels/queries (mode, allow/deny, deny %, latency, backend errors, attribution, fail-open) and remove “(V1 + V2)” row-title suffixes.
- Add identity_type breakdowns to decision-rate panels and adjust legends/units/thresholds.
- Regenerate
tests/generated/*snapshots to reflect updated rendered outputs.
Reviewed changes
Copilot reviewed 1 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
charts/controlplane/dashboards/union-controlplane-overview.json |
Dashboard JSON updates: row titles + reworked Authorizer panels/PromQL. |
tests/generated/controlplane.external-authz.yaml |
Updated expected Helm-render output (includes embedded dashboard JSON changes). |
tests/generated/controlplane.custom-oidc.yaml |
Updated expected Helm-render output for custom OIDC fixture (includes embedded dashboard JSON changes). |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| { | ||
| "expr": "rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]) / (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]) + rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))", | ||
| "legendFormat": "Deny %", | ||
| "expr": "(sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)) / clamp_min((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval])) + sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))), 1e-10)", |
There was a problem hiding this comment.
The or vector(0) fallback in this PromQL will always add an extra unlabeled {} series (because vector(0) has no labels and won’t match the sum by (identity_type) output). It also doesn’t reliably fix the zero-state for the ratio because the denominator can still be empty, producing no result. Use a label-matching fallback (e.g., ... or on() vector(0) / ... or on() vector(1e-10)), or restructure the expression so both numerator and denominator have compatible zero-state behavior without introducing an always-present extra series.
| "expr": "(sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)) / clamp_min((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval])) + sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))), 1e-10)", | |
| "expr": "((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))) or on(identity_type) (0 * ((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))) + (sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])))))) / clamp_min(((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))) + (sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])))), 1e-10)", |
| { | ||
| "expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval])", | ||
| "legendFormat": "Fail-Open", | ||
| "expr": "sum by (error_type) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_errors{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)", | ||
| "legendFormat": "{{error_type}}", | ||
| "refId": "A" |
There was a problem hiding this comment.
... or vector(0) will always include an additional unlabeled {} time series alongside the real error_type series, because the labelsets don’t match. Prefer or on() vector(0) (or remove the fallback and rely on Grafana’s no-value handling) so the zero fallback only appears when the metric is absent.
| { | ||
| "expr": "sum by (action) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))", | ||
| "legendFormat": "allowed: {{ action }}", | ||
| "expr": "sum by (error_source) (rate(authorizer:authorizer:cloudauthorizer:connect:authorize_errors_total{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)", | ||
| "legendFormat": "{{error_source}}", | ||
| "refId": "A" |
There was a problem hiding this comment.
Same issue here: or vector(0) adds a permanent unlabeled {} series (since sum by (error_source) won’t match vector(0)). Use or on() vector(0) if you want a zero-state fallback only when the metric is absent.
| { | ||
| "expr": "sum by (error_source) (rate(authorizer:authorizer:cloudauthorizer:connect:authorize_errors_total{namespace=\"$namespace\"}[$__rate_interval]))", | ||
| "legendFormat": "{{ error_source }}", | ||
| "expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or vector(0)", |
There was a problem hiding this comment.
rate(...) or vector(0) will always add an extra unlabeled {} series even when fail_open_activated is present (labelsets won’t match). If the goal is just to show 0 when the metric is missing, switch to or on() vector(0) so the fallback doesn’t appear when data exists.
| "expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or vector(0)", | |
| "expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or on() vector(0)", |
| { | ||
| "expr": "histogram_quantile(0.50, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:external:authorize_duration_bucket{namespace=\"$namespace\"}[$__rate_interval])))", | ||
| "expr": "histogram_quantile(0.50, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_duration_ms_bucket{namespace=\"$namespace\"}[$__rate_interval])))", | ||
| "legendFormat": "p50", | ||
| "refId": "A" | ||
| }, | ||
| { | ||
| "expr": "histogram_quantile(0.95, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:external:authorize_duration_bucket{namespace=\"$namespace\"}[$__rate_interval])))", | ||
| "expr": "histogram_quantile(0.95, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_duration_ms_bucket{namespace=\"$namespace\"}[$__rate_interval])))", | ||
| "legendFormat": "p95", |
There was a problem hiding this comment.
PR description/migration notes mention panels referencing authz_backend_* metrics, but this dashboard queries ...:backend_authorize_duration_ms_bucket / ...:backend_authorize_errors instead. Please either update the dashboard queries to the intended authz_backend_* names (if that’s the standardized metric prefix) or adjust the PR description/migration notes so operators know exactly which metrics are required.
b33ebdc to
446a6c6
Compare
2acf6a8 to
a404489
Compare
## Overview Adds authorizer configuration defaults for selfhosted deployments and fixes dashboard panel issues. - **defaultIdentityToSubject**: Config for non-Okta IdPs (FAB-189) that don't include `identitytype` claim natively. When enabled, subjects without identitytype are treated as users. - **UserClouds client defaults**: Pre-configured connection settings in the controlplane values so terraform doesn't need deep-merge overrides. - **Dashboard fixes**: Authorizer Mode panel value mappings for case sensitivity, metric name mismatches. ## Migration Notes No migration required. Additive defaults only — existing deployments are not affected. ## Test Plan - [x] `helm template` renders correctly - [x] Verified on identity-testing environment ref FAB-189 ref FAB-178 🤖 Generated with [Claude Code](https://claude.com/claude-code) - `main` <!-- branch-stack --> - **Add selfhosted authorizer defaults and dashboard fixes** :point\_left: - \#349 - \#350 - \#351 - \#352 - \#353 - \#354
## Overview Adds OIDC/OAuth2 globals to support identity providers beyond Okta (Entra ID, Keycloak, Authentik). **New globals** (all empty by default — Okta behavior unchanged): - `OIDC_METADATA_URL` — metadata discovery endpoint (Okta: `.well-known/oauth-authorization-server`, Entra: `.well-known/openid-configuration`) - `OIDC_APP_SCOPE` — resource scope for CLI/SDK and task pods (Entra: `api://my-app/.default`) - `OIDC_APP_AUDIENCE` — audience parameter for PKCE flows (Entra: `api://my-app`) - `OIDC_S2S_SCOPE` — service-to-service scope (Entra: `api://my-app/.default`) - Dataplane S2S scope support via `OIDC_S2S_SCOPE` Includes inline documentation with Okta/Entra ID examples, OAuth app flow references, and test fixture for non-Okta configuration. ## Migration Notes No migration required. All new globals default to empty which preserves existing Okta behavior. Non-Okta IdPs can now be configured without terraform deep-merge overrides. ## Test Plan - [x] Added `tests/values/controlplane.custom-oidc.yaml` test fixture - [x] `helm template` renders correctly with both empty and non-empty globals - [x] Verified on identity-testing environment with Entra ID ref FAB-178 🤖 Generated with [Claude Code](https://claude.com/claude-code) - `main` <!-- branch-stack --> - **Add OIDC globals for multi-IdP support (Okta, Entra ID, Keycloak)** :point\_left: - \#350 - \#351 - \#352 - \#353 - \#354
446a6c6 to
8c91953
Compare
a404489 to
be6eb97
Compare
…350) ## Summary - **server-alias annotation**: Allows DP→CP intra-cluster traffic through nginx ingress by matching the internal service DNS name - **gRPC configuration-snippet**: Forwards identity headers (X-User-Subject, X-User-Claim-Identitytype) from auth subrequest to gRPC backends. Required for BYOC since Oct 2024, now ported to selfhosted chart. - **Organizations service**: New controlplane service for settings/org APIs. Required by CreateRun since #15107. - **Fix organizations connectPort** in configmap Stacked on #349 → #348 → main ## Migration Notes **New behavior:** Intra-cluster DP→CP gRPC traffic now goes through nginx auth subrequests (via server-alias). This ensures consistent identity resolution for all callers. Previously, intra-cluster traffic could bypass auth if the `:authority` header didn't match the ingress host. **Action required:** None for new deployments. Existing deployments will get the server-alias annotation on next helm upgrade. ## Risk Assessment **Medium** — Changes ingress behavior. Test: DP→CP traffic, browser login. ## Test Plan - [ ] Verify DP→CP gRPC traffic still works through ingress - [ ] Verify browser login flow unchanged - [ ] Verify organizations service endpoints respond - [ ] `helm template` produces expected manifests ref FAB-178, FAB-195 🤖 Generated with [Claude Code](https://claude.com/claude-code) - `main` <!-- branch-stack --> - **Add selfhosted ingress identity forwarding and organizations service** :point\_left: - \#351 - \#352 - \#353 - \#354
8c91953 to
26251ed
Compare
## Summary - Move all OIDC/OAuth2 auth configuration into a documented `flyte.configmap.adminServer.auth` block in base values.yaml with inline Okta/Entra ID examples - Move `adminClient.connection` to base (was duplicated in AWS/GCP overlays) - Wire `OIDC_S2S_SCOPE` for CP S2S auth (Entra requires `/.default`, Okta uses `all`) - Add `OIDC_BROWSER_SCOPE` for Entra browser auth (AADSTS90009 fix) - Deprecate scattered auth globals: `OIDC_BASE_URL`, `OIDC_CLIENT_ID`, `CLI_CLIENT_ID` - Authz templates accept both `"Union"` and `"UserClouds"` type Stacked on #350 → #349 → #348 → main ## Migration Notes **Auth config location changed:** All auth config now lives in `flyte.configmap.adminServer.auth` in base values.yaml. Cloud overlays no longer set auth values — terraform authn modules output complete `app_auth`/`user_auth` blocks. **Deprecated globals:** `OIDC_BASE_URL`, `OIDC_CLIENT_ID`, `CLI_CLIENT_ID` still work but are deprecated. New deployments should use the auth block directly. **New globals:** `OIDC_S2S_SCOPE` (default: `all`), `OIDC_BROWSER_SCOPE`, `INTERNAL_SUBJECT_ID` (default: `INTERNAL_CLIENT_ID`). **UserClouds → Union:** Authz templates accept `type: "Union"` alongside `"UserClouds"`. Both are equivalent; `"Union"` preferred for new deployments. ## Risk Assessment **Medium** — Restructures auth config. Verified by `compare_manifests.py` producing zero structural diffs against baseline. ## Test Plan - [ ] `helm template` with Okta defaults produces identical auth config - [ ] `helm template` with Entra overrides produces correct scopes/audiences - [ ] Verify deprecated globals still work for backward compatibility - [ ] `compare_manifests.py` zero diffs against baseline ref FAB-178, FAB-195 🤖 Generated with [Claude Code](https://claude.com/claude-code) - `main` <!-- branch-stack --> - **Consolidate auth config into base values.yaml** :point\_left: - \#352 - \#353 - \#354
be6eb97 to
20b7ffe
Compare
26251ed to
e92c015
Compare
20b7ffe to
7222586
Compare
## Summary - Move cloud-agnostic config from AWS/GCP overlays into base values.yaml - **AWS overlay**: \~400 → 102 lines - **GCP overlay**: 504 → 105 lines - Base chart is now self-contained for selfhosted deployments - Overlays reduced to only cloud-specific items: IAM, storage, region, scylla provisioner Stacked on #351 → #350 → #349 → #348 → main ### What moved to base - Namespace-derived FQDNs (`admin.endpoint`, `rootTenantURLPattern`) - Ingress configuration (server-alias, protectedIngress annotations) - ingress-nginx (enabled, ClusterIP, fullnameOverride) - Monitoring (kube\* disabled, serviceMonitor enabled) - Image repos (default: `registry.unionai.cloud/controlplane`) - Secrets (union-operator, union-secrets) - ScyllaDB generic config - envoy-gateway defaults ### What stays in overlays (\~100 lines each) - Cloud region, DB, storage bucket, IAM identifiers - `flyte.storage` (type:s3/gcs, region/projectId) - IAM annotations (IRSA / Workload Identity) - scylla storageClass provisioner - dataproxy endpoint, artifacts stow config ## Migration Notes **Base chart is now self-contained:** New selfhosted deployments only need cloud-specific overlay values (IAM, storage, DB). **Image repository default:** Base defaults to `registry.unionai.cloud/controlplane`. Internal deployments using ECR must set `IMAGE_REPOSITORY_PREFIX` via terraform. **Namespace-derived FQDNs:** Services use `{{ .Release.Namespace }}` instead of hardcoded values. Resolves identically in standard deployments. ## Risk Assessment **Higher** — Touches most of values.yaml. Verified by `compare_manifests.py` producing zero structural diffs against baseline from `mike/overlay-consolidated-backup-2026-04-20`. ## Test Plan - [ ] `helm template` with AWS overlay produces identical manifests to baseline - [ ] `helm template` with GCP overlay produces identical manifests to baseline - [ ] `compare_manifests.py` zero diffs against baseline - [ ] Deploy to identity-testing environment and verify all services healthy ref FAB-178, FAB-195, FAB-276 🤖 Generated with [Claude Code](https://claude.com/claude-code) - `main` <!-- branch-stack --> - **Move generic selfhosted config from overlays to base values.yaml** :point\_left: - \#353 - \#354
Overhaul the authorizer row in union-controlplane-overview dashboard
to use standardized BackendMetrics emitted by all backend types
(Noop, External, Union).
Changes:
- Fix metric name mismatches and query bugs in existing panels
- Fix Authorizer Mode panel value mappings for case sensitivity
- Add identity_type breakdown (User/App/External/Unknown) to auth panels
- Add proper units (ops, ms, percentunit) and thresholds
- Add noValue: "0" and "or vector(0)" for zero-state panels
- Fix legendFormat double-escaping ({{{{type}}}} → {{type}})
- Remove "(V1 + V2)" labels from all row titles
## Migration Notes
**Dashboard auto-updates:** The dashboard JSON is deployed via ConfigMap
sidecar. No manual action needed — it updates on next helm upgrade.
**New metrics required:** Panels reference `authz_backend_*` metrics from
the standardized BackendMetrics struct. These are emitted by cloud
v2026.4.x+. Older versions will show "No data" for backend panels.
ref FAB-178
Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
7222586 to
d6c3ccb
Compare
Current Aviator status
This PR was merged using Aviator.
See the real-time status of this PR on the
Aviator webapp.
Use the Aviator Chrome Extension
to see the status of your PR within GitHub.
|
|
/aviator merge |
|
Aviator has accepted the merge request. It will enter the queue when all of the required status checks have passed. Aviator will update the sticky status comment as the pull request moves through the queue. |
Snapshots updated to reflect Mike's PRs #348-#353 (base values consolidation, organizations service, OIDC globals) plus the identity service defaults from this PR. Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Summary
BackendMetricsidentity_typebreakdown (User/App/External/Unknown) to auth panelsStacked on #352 → #351 → #350 → #349 → #348 → main
Migration Notes
Dashboard auto-updates: Deployed via ConfigMap sidecar — updates on next helm upgrade.
New metrics required: Panels reference
authz_backend_*metrics from standardizedBackendMetrics. Emitted by cloud v2026.4.x+. Older versions show "No data" for backend panels.Risk Assessment
Low — Dashboard JSON only + generated test snapshots.
Test Plan
make generate-expectedmatches committed snapshotsref FAB-178
🤖 Generated with Claude Code
main