Skip to content

Update authorizer dashboard: standardized backend metrics#353

Merged
aviator-app[bot] merged 1 commit intomainfrom
mike/selfhosted-dashboard-update
Apr 21, 2026
Merged

Update authorizer dashboard: standardized backend metrics#353
aviator-app[bot] merged 1 commit intomainfrom
mike/selfhosted-dashboard-update

Conversation

@mhotan
Copy link
Copy Markdown
Contributor

@mhotan mhotan commented Apr 20, 2026

Summary

  • Overhaul authorizer row in union-controlplane-overview dashboard for standardized BackendMetrics
  • Fix metric name mismatches, query bugs, and Authorizer Mode value mappings
  • Add identity_type breakdown (User/App/External/Unknown) to auth panels
  • Add proper units (ops, ms, percentunit), thresholds, and zero-state handling
  • Fix legendFormat double-escaping
  • Remove "(V1 + V2)" labels from row titles
  • Regenerate test snapshots

Stacked on #352#351#350#349#348 → main

Migration Notes

Dashboard auto-updates: Deployed via ConfigMap sidecar — updates on next helm upgrade.

New metrics required: Panels reference authz_backend_* metrics from standardized BackendMetrics. Emitted by cloud v2026.4.x+. Older versions show "No data" for backend panels.

Risk Assessment

Low — Dashboard JSON only + generated test snapshots.

Test Plan

  • Dashboard loads in Grafana without errors
  • Panels show data with cloud v2026.4.x+
  • Zero-state panels show "0" instead of "No data"
  • make generate-expected matches committed snapshots

ref FAB-178

🤖 Generated with Claude Code

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Updates the union-controlplane-overview Grafana dashboard to modernize the Authorizer row (standardized backend metrics, identity_type breakdowns, improved units/thresholds/zero-state behavior) and refreshes generated Helm snapshot fixtures to match the new dashboard output.

Changes:

  • Overhaul Authorizer row panels/queries (mode, allow/deny, deny %, latency, backend errors, attribution, fail-open) and remove “(V1 + V2)” row-title suffixes.
  • Add identity_type breakdowns to decision-rate panels and adjust legends/units/thresholds.
  • Regenerate tests/generated/* snapshots to reflect updated rendered outputs.

Reviewed changes

Copilot reviewed 1 out of 6 changed files in this pull request and generated 5 comments.

File Description
charts/controlplane/dashboards/union-controlplane-overview.json Dashboard JSON updates: row titles + reworked Authorizer panels/PromQL.
tests/generated/controlplane.external-authz.yaml Updated expected Helm-render output (includes embedded dashboard JSON changes).
tests/generated/controlplane.custom-oidc.yaml Updated expected Helm-render output for custom OIDC fixture (includes embedded dashboard JSON changes).

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

{
"expr": "rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]) / (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]) + rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))",
"legendFormat": "Deny %",
"expr": "(sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)) / clamp_min((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval])) + sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))), 1e-10)",
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The or vector(0) fallback in this PromQL will always add an extra unlabeled {} series (because vector(0) has no labels and won’t match the sum by (identity_type) output). It also doesn’t reliably fix the zero-state for the ratio because the denominator can still be empty, producing no result. Use a label-matching fallback (e.g., ... or on() vector(0) / ... or on() vector(1e-10)), or restructure the expression so both numerator and denominator have compatible zero-state behavior without introducing an always-present extra series.

Suggested change
"expr": "(sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)) / clamp_min((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval])) + sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))), 1e-10)",
"expr": "((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval]))) or on(identity_type) (0 * ((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))) + (sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])))))) / clamp_min(((sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))) + (sum by (identity_type) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_denied{namespace=\"$namespace\"}[$__rate_interval])))), 1e-10)",

Copilot uses AI. Check for mistakes.
Comment on lines 3097 to 3100
{
"expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval])",
"legendFormat": "Fail-Open",
"expr": "sum by (error_type) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_errors{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)",
"legendFormat": "{{error_type}}",
"refId": "A"
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

... or vector(0) will always include an additional unlabeled {} time series alongside the real error_type series, because the labelsets don’t match. Prefer or on() vector(0) (or remove the fallback and rely on Grafana’s no-value handling) so the zero fallback only appears when the metric is absent.

Copilot uses AI. Check for mistakes.
Comment on lines 3153 to 3156
{
"expr": "sum by (action) (rate(authorizer:authorizer:cloudauthorizer:connect:authz_allowed{namespace=\"$namespace\"}[$__rate_interval]))",
"legendFormat": "allowed: {{ action }}",
"expr": "sum by (error_source) (rate(authorizer:authorizer:cloudauthorizer:connect:authorize_errors_total{namespace=\"$namespace\"}[$__rate_interval])) or vector(0)",
"legendFormat": "{{error_source}}",
"refId": "A"
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same issue here: or vector(0) adds a permanent unlabeled {} series (since sum by (error_source) won’t match vector(0)). Use or on() vector(0) if you want a zero-state fallback only when the metric is absent.

Copilot uses AI. Check for mistakes.
{
"expr": "sum by (error_source) (rate(authorizer:authorizer:cloudauthorizer:connect:authorize_errors_total{namespace=\"$namespace\"}[$__rate_interval]))",
"legendFormat": "{{ error_source }}",
"expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or vector(0)",
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

rate(...) or vector(0) will always add an extra unlabeled {} series even when fail_open_activated is present (labelsets won’t match). If the goal is just to show 0 when the metric is missing, switch to or on() vector(0) so the fallback doesn’t appear when data exists.

Suggested change
"expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or vector(0)",
"expr": "rate(authorizer:authorizer:cloudauthorizer:connect:external:fail_open_activated{namespace=\"$namespace\"}[$__rate_interval]) or on() vector(0)",

Copilot uses AI. Check for mistakes.
Comment on lines 2977 to 2984
{
"expr": "histogram_quantile(0.50, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:external:authorize_duration_bucket{namespace=\"$namespace\"}[$__rate_interval])))",
"expr": "histogram_quantile(0.50, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_duration_ms_bucket{namespace=\"$namespace\"}[$__rate_interval])))",
"legendFormat": "p50",
"refId": "A"
},
{
"expr": "histogram_quantile(0.95, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:external:authorize_duration_bucket{namespace=\"$namespace\"}[$__rate_interval])))",
"expr": "histogram_quantile(0.95, sum by (le) (rate(authorizer:authorizer:cloudauthorizer:connect:backend_authorize_duration_ms_bucket{namespace=\"$namespace\"}[$__rate_interval])))",
"legendFormat": "p95",
Copy link

Copilot AI Apr 20, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR description/migration notes mention panels referencing authz_backend_* metrics, but this dashboard queries ...:backend_authorize_duration_ms_bucket / ...:backend_authorize_errors instead. Please either update the dashboard queries to the intended authz_backend_* names (if that’s the standardized metric prefix) or adjust the PR description/migration notes so operators know exactly which metrics are required.

Copilot uses AI. Check for mistakes.
@mhotan mhotan force-pushed the mike/selfhosted-overlay-to-base branch from b33ebdc to 446a6c6 Compare April 20, 2026 21:33
@mhotan mhotan force-pushed the mike/selfhosted-dashboard-update branch from 2acf6a8 to a404489 Compare April 20, 2026 21:35
aviator-app Bot pushed a commit that referenced this pull request Apr 20, 2026
## Overview

Adds authorizer configuration defaults for selfhosted deployments and fixes dashboard panel issues.

- **defaultIdentityToSubject**: Config for non-Okta IdPs (FAB-189) that don't include `identitytype` claim natively. When enabled, subjects without identitytype are treated as users.
- **UserClouds client defaults**: Pre-configured connection settings in the controlplane values so terraform doesn't need deep-merge overrides.
- **Dashboard fixes**: Authorizer Mode panel value mappings for case sensitivity, metric name mismatches.

## Migration Notes

No migration required. Additive defaults only — existing deployments are not affected.

## Test Plan

- [x] `helm template` renders correctly
- [x] Verified on identity-testing environment

ref FAB-189
ref FAB-178

🤖 Generated with [Claude Code](https://claude.com/claude-code)

- `main` <!-- branch-stack -->
  - **Add selfhosted authorizer defaults and dashboard fixes** :point\_left:
    - \#349
      - \#350
        - \#351
          - \#352
            - \#353
              - \#354
aviator-app Bot pushed a commit that referenced this pull request Apr 20, 2026
## Overview

Adds OIDC/OAuth2 globals to support identity providers beyond Okta (Entra ID, Keycloak, Authentik).

**New globals** (all empty by default — Okta behavior unchanged):

- `OIDC_METADATA_URL` — metadata discovery endpoint (Okta: `.well-known/oauth-authorization-server`, Entra: `.well-known/openid-configuration`)
- `OIDC_APP_SCOPE` — resource scope for CLI/SDK and task pods (Entra: `api://my-app/.default`)
- `OIDC_APP_AUDIENCE` — audience parameter for PKCE flows (Entra: `api://my-app`)
- `OIDC_S2S_SCOPE` — service-to-service scope (Entra: `api://my-app/.default`)
- Dataplane S2S scope support via `OIDC_S2S_SCOPE`

Includes inline documentation with Okta/Entra ID examples, OAuth app flow references, and test fixture for non-Okta configuration.

## Migration Notes

No migration required. All new globals default to empty which preserves existing Okta behavior. Non-Okta IdPs can now be configured without terraform deep-merge overrides.

## Test Plan

- [x] Added `tests/values/controlplane.custom-oidc.yaml` test fixture
- [x] `helm template` renders correctly with both empty and non-empty globals
- [x] Verified on identity-testing environment with Entra ID

ref FAB-178

🤖 Generated with [Claude Code](https://claude.com/claude-code)

- `main` <!-- branch-stack -->
  - **Add OIDC globals for multi-IdP support (Okta, Entra ID, Keycloak)** :point\_left:
    - \#350
      - \#351
        - \#352
          - \#353
            - \#354
@mhotan mhotan force-pushed the mike/selfhosted-overlay-to-base branch from 446a6c6 to 8c91953 Compare April 21, 2026 00:57
@mhotan mhotan force-pushed the mike/selfhosted-dashboard-update branch from a404489 to be6eb97 Compare April 21, 2026 00:59
aviator-app Bot pushed a commit that referenced this pull request Apr 21, 2026
…350)

## Summary

- **server-alias annotation**: Allows DP→CP intra-cluster traffic through nginx ingress by matching the internal service DNS name
- **gRPC configuration-snippet**: Forwards identity headers (X-User-Subject, X-User-Claim-Identitytype) from auth subrequest to gRPC backends. Required for BYOC since Oct 2024, now ported to selfhosted chart.
- **Organizations service**: New controlplane service for settings/org APIs. Required by CreateRun since #15107.
- **Fix organizations connectPort** in configmap

Stacked on #349#348 → main

## Migration Notes

**New behavior:** Intra-cluster DP→CP gRPC traffic now goes through nginx auth subrequests (via server-alias). This ensures consistent identity resolution for all callers. Previously, intra-cluster traffic could bypass auth if the `:authority` header didn't match the ingress host.

**Action required:** None for new deployments. Existing deployments will get the server-alias annotation on next helm upgrade.

## Risk Assessment

**Medium** — Changes ingress behavior. Test: DP→CP traffic, browser login.

## Test Plan

- [ ] Verify DP→CP gRPC traffic still works through ingress
- [ ] Verify browser login flow unchanged
- [ ] Verify organizations service endpoints respond
- [ ] `helm template` produces expected manifests

ref FAB-178, FAB-195

🤖 Generated with [Claude Code](https://claude.com/claude-code)

- `main` <!-- branch-stack -->
  - **Add selfhosted ingress identity forwarding and organizations service** :point\_left:
    - \#351
      - \#352
        - \#353
          - \#354
@mhotan mhotan force-pushed the mike/selfhosted-overlay-to-base branch from 8c91953 to 26251ed Compare April 21, 2026 03:30
aviator-app Bot pushed a commit that referenced this pull request Apr 21, 2026
## Summary

- Move all OIDC/OAuth2 auth configuration into a documented `flyte.configmap.adminServer.auth` block in base values.yaml with inline Okta/Entra ID examples
- Move `adminClient.connection` to base (was duplicated in AWS/GCP overlays)
- Wire `OIDC_S2S_SCOPE` for CP S2S auth (Entra requires `/.default`, Okta uses `all`)
- Add `OIDC_BROWSER_SCOPE` for Entra browser auth (AADSTS90009 fix)
- Deprecate scattered auth globals: `OIDC_BASE_URL`, `OIDC_CLIENT_ID`, `CLI_CLIENT_ID`
- Authz templates accept both `"Union"` and `"UserClouds"` type

Stacked on #350#349#348 → main

## Migration Notes

**Auth config location changed:** All auth config now lives in `flyte.configmap.adminServer.auth` in base values.yaml. Cloud overlays no longer set auth values — terraform authn modules output complete `app_auth`/`user_auth` blocks.

**Deprecated globals:** `OIDC_BASE_URL`, `OIDC_CLIENT_ID`, `CLI_CLIENT_ID` still work but are deprecated. New deployments should use the auth block directly.

**New globals:** `OIDC_S2S_SCOPE` (default: `all`), `OIDC_BROWSER_SCOPE`, `INTERNAL_SUBJECT_ID` (default: `INTERNAL_CLIENT_ID`).

**UserClouds → Union:** Authz templates accept `type: "Union"` alongside `"UserClouds"`. Both are equivalent; `"Union"` preferred for new deployments.

## Risk Assessment

**Medium** — Restructures auth config. Verified by `compare_manifests.py` producing zero structural diffs against baseline.

## Test Plan

- [ ] `helm template` with Okta defaults produces identical auth config
- [ ] `helm template` with Entra overrides produces correct scopes/audiences
- [ ] Verify deprecated globals still work for backward compatibility
- [ ] `compare_manifests.py` zero diffs against baseline

ref FAB-178, FAB-195

🤖 Generated with [Claude Code](https://claude.com/claude-code)

- `main` <!-- branch-stack -->
  - **Consolidate auth config into base values.yaml** :point\_left:
    - \#352
      - \#353
        - \#354
@mhotan mhotan force-pushed the mike/selfhosted-dashboard-update branch from be6eb97 to 20b7ffe Compare April 21, 2026 03:33
@mhotan mhotan force-pushed the mike/selfhosted-overlay-to-base branch from 26251ed to e92c015 Compare April 21, 2026 03:38
@mhotan mhotan force-pushed the mike/selfhosted-dashboard-update branch from 20b7ffe to 7222586 Compare April 21, 2026 03:39
aviator-app Bot pushed a commit that referenced this pull request Apr 21, 2026
## Summary

- Move cloud-agnostic config from AWS/GCP overlays into base values.yaml
- **AWS overlay**: \~400 → 102 lines
- **GCP overlay**: 504 → 105 lines
- Base chart is now self-contained for selfhosted deployments
- Overlays reduced to only cloud-specific items: IAM, storage, region, scylla provisioner

Stacked on #351#350#349#348 → main

### What moved to base

- Namespace-derived FQDNs (`admin.endpoint`, `rootTenantURLPattern`)
- Ingress configuration (server-alias, protectedIngress annotations)
- ingress-nginx (enabled, ClusterIP, fullnameOverride)
- Monitoring (kube\* disabled, serviceMonitor enabled)
- Image repos (default: `registry.unionai.cloud/controlplane`)
- Secrets (union-operator, union-secrets)
- ScyllaDB generic config
- envoy-gateway defaults

### What stays in overlays (\~100 lines each)

- Cloud region, DB, storage bucket, IAM identifiers
- `flyte.storage` (type:s3/gcs, region/projectId)
- IAM annotations (IRSA / Workload Identity)
- scylla storageClass provisioner
- dataproxy endpoint, artifacts stow config

## Migration Notes

**Base chart is now self-contained:** New selfhosted deployments only need cloud-specific overlay values (IAM, storage, DB).

**Image repository default:** Base defaults to `registry.unionai.cloud/controlplane`. Internal deployments using ECR must set `IMAGE_REPOSITORY_PREFIX` via terraform.

**Namespace-derived FQDNs:** Services use `{{ .Release.Namespace }}` instead of hardcoded values. Resolves identically in standard deployments.

## Risk Assessment

**Higher** — Touches most of values.yaml. Verified by `compare_manifests.py` producing zero structural diffs against baseline from `mike/overlay-consolidated-backup-2026-04-20`.

## Test Plan

- [ ] `helm template` with AWS overlay produces identical manifests to baseline
- [ ] `helm template` with GCP overlay produces identical manifests to baseline
- [ ] `compare_manifests.py` zero diffs against baseline
- [ ] Deploy to identity-testing environment and verify all services healthy

ref FAB-178, FAB-195, FAB-276

🤖 Generated with [Claude Code](https://claude.com/claude-code)

- `main` <!-- branch-stack -->
  - **Move generic selfhosted config from overlays to base values.yaml** :point\_left:
    - \#353
      - \#354
Base automatically changed from mike/selfhosted-overlay-to-base to main April 21, 2026 03:43
Overhaul the authorizer row in union-controlplane-overview dashboard
to use standardized BackendMetrics emitted by all backend types
(Noop, External, Union).

Changes:
- Fix metric name mismatches and query bugs in existing panels
- Fix Authorizer Mode panel value mappings for case sensitivity
- Add identity_type breakdown (User/App/External/Unknown) to auth panels
- Add proper units (ops, ms, percentunit) and thresholds
- Add noValue: "0" and "or vector(0)" for zero-state panels
- Fix legendFormat double-escaping ({{{{type}}}} → {{type}})
- Remove "(V1 + V2)" labels from all row titles

## Migration Notes

**Dashboard auto-updates:** The dashboard JSON is deployed via ConfigMap
sidecar. No manual action needed — it updates on next helm upgrade.

**New metrics required:** Panels reference `authz_backend_*` metrics from
the standardized BackendMetrics struct. These are emitted by cloud
v2026.4.x+. Older versions will show "No data" for backend panels.

ref FAB-178

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@mhotan mhotan force-pushed the mike/selfhosted-dashboard-update branch from 7222586 to d6c3ccb Compare April 21, 2026 03:44
@aviator-app
Copy link
Copy Markdown
Contributor

aviator-app Bot commented Apr 21, 2026

Current Aviator status

Aviator will automatically update this comment as the status of the PR changes.
Comment /aviator refresh to force Aviator to re-examine your PR (or learn about other /aviator commands).

This PR was merged using Aviator.


See the real-time status of this PR on the Aviator webapp.
Use the Aviator Chrome Extension to see the status of your PR within GitHub.

@mhotan
Copy link
Copy Markdown
Contributor Author

mhotan commented Apr 21, 2026

/aviator merge

@aviator-app
Copy link
Copy Markdown
Contributor

aviator-app Bot commented Apr 21, 2026

Aviator has accepted the merge request. It will enter the queue when all of the required status checks have passed. Aviator will update the sticky status comment as the pull request moves through the queue.

@aviator-app aviator-app Bot merged commit 2abab95 into main Apr 21, 2026
5 checks passed
@aviator-app aviator-app Bot deleted the mike/selfhosted-dashboard-update branch April 21, 2026 03:45
jmonty42 added a commit that referenced this pull request Apr 21, 2026
Snapshots updated to reflect Mike's PRs #348-#353 (base values consolidation,
organizations service, OIDC globals) plus the identity service defaults from this PR.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants