Overhaul dashboards for Flyte v2: remove v1 panels, add CreateRun latency, fix executor metrics #373

Open
mhotan wants to merge 6 commits into main from mike/v2-dashboards

@mhotan mhotan commented May 2, 2026

Overview

Overhaul the controlplane and dataplane Grafana dashboards to focus on Flyte v2 metrics. Removes v1-only panels that never fire on v2 deployments, adds new v2-specific panels for CreateRun rate/latency, and fixes several metric name issues discovered during live validation on mike-apple-aws.

Release Notes

Metrics removed (v1-only — never fire on v2 deployments)

| Panel removed | Metric | Reason |
| --- | --- | --- |
| Execution Create / Ack Rate | `executions:executions:handle_create_op_count`, `handle_ack_op_count` | v1 handleCreateExecution path only |
| Execution Create / Ack Latency | `executions:executions:handle_create_op_bucket`, `handle_ack_op_bucket` | Same v1 path |
| Assignment Duration | `executions:workqueue:announce_cluster_assignment_bucket` | v1 workqueue; v2 uses lease streaming |
| Workqueue Operations | `executions:workqueue:send_operation_count`, `claim_operations`, `failures` | v1 workqueue |
| Flyte Propeller (V1) row (DP) | All `flyte:propeller:*` metrics | Entire row removed from DP dashboard |
| Active Workflows (DP Health) | `flyte:propeller:all:execstats:active_workflow_executions` | v1 propeller |
| Queue Depth (DP Health) | `flyte:propeller:all:main_depth` | v1 propeller |

Metrics added (new v2 panels)

| Panel added | Metric | Description |
| --- | --- | --- |
| CreateRun Rate | `connect:server_requests_handled_total{method="CreateRun"}` | v2 front-door request counter |
| CreateRun Latency (p50/p95/p99) | `connect:server_request_duration_seconds_bucket{method="CreateRun"}` | v2 latency histogram (requires cloud #15704) |
| V2 Run Methods | `connect:server_requests_handled_total{service=~".*RunService.*"}` | All RunService RPCs by method |
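
As a rough illustration of how the two CreateRun metrics above might be queried, the sketch below builds PromQL strings in Python. The metric names and the `method="CreateRun"` selector come from the table; the 5m rate window, the `sum by (le)` aggregation, and the helper names `createrun_rate_expr` and `createrun_latency_expr` are assumptions for illustration and may not match the expressions used in the dashboard JSON.

```python
def createrun_rate_expr(window: str = "5m") -> str:
    """Per-second CreateRun request rate, summed across all series.

    The rate window is an assumed default, not taken from the dashboards.
    """
    return (
        "sum(rate(connect:server_requests_handled_total"
        f'{{method="CreateRun"}}[{window}]))'
    )


def createrun_latency_expr(quantile: float, window: str = "5m") -> str:
    """Estimate a latency quantile from the CreateRun duration histogram.

    Uses the standard histogram_quantile-over-rate pattern; aggregating
    by `le` only is an assumption about how the panel groups series.
    """
    if not 0.0 < quantile < 1.0:
        raise ValueError("quantile must be strictly between 0 and 1")
    return (
        f"histogram_quantile({quantile}, sum by (le) (rate("
        "connect:server_request_duration_seconds_bucket"
        f'{{method="CreateRun"}}[{window}])))'
    )


if __name__ == "__main__":
    print(createrun_rate_expr())
    for q in (0.5, 0.95, 0.99):
        print(createrun_latency_expr(q))
```

Because the latency metric is a histogram, any quantile can be estimated from its buckets, which is why p95 is available here even though the heartbeat summary does not expose it.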

Metrics renamed / fixed

| Change | Old | New | Reason |
| --- | --- | --- | --- |
| Executor metric prefix | `executor:active_actions_count` | `executor::v2:active_actions_count` | Executor binary emits with `::v2:` prefix (FAB-308) |
| Executor metric prefix | `executor:available_capacity` | `executor::v2:available_capacity` | Same |
| Executor metric prefix | `executor:actions_terminated` | `executor::v2:actions_terminated` | Same |
| Executor metric prefix | `executor:invalid_leases` | `executor::v2:invalid_leases` | Same |
| Executor metric prefix | `executor:system_failures` | `executor::v2:system_failures` | Same |
| Executor evaluate duration | `executor:evaluator:evaluate_duration` | `executor::v2:evaluator:evaluate_duration_ms` | Correct name + prefix |
| Heartbeat quantile | `quantile="0.95"` | `quantile="0.99"` | 0.95 doesn't exist; summary only emits 0.5, 0.9, 0.99 |
| Panel title | "Cluster API Latency (p95)" | "Cluster API Latency (p99)" | Matches corrected quantile |
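
The executor renames above are mechanical, so they could be applied to query expressions with a small script. The following is a minimal sketch, not the tooling used for this PR: the old and new names are copied from the table, while `EXECUTOR_RENAMES` and `rewrite_executor_metrics` are hypothetical names introduced here. It matches whole metric-name tokens only, so `executor:handler_panic` is left alone and running it twice is harmless.

```python
import re

# Old -> new names, copied from the rename table in this PR description.
EXECUTOR_RENAMES = {
    "executor:active_actions_count": "executor::v2:active_actions_count",
    "executor:available_capacity": "executor::v2:available_capacity",
    "executor:actions_terminated": "executor::v2:actions_terminated",
    "executor:invalid_leases": "executor::v2:invalid_leases",
    "executor:system_failures": "executor::v2:system_failures",
    "executor:evaluator:evaluate_duration": (
        "executor::v2:evaluator:evaluate_duration_ms"
    ),
}

# A Prometheus metric-name token: letters, digits, underscores, colons.
_METRIC_TOKEN = re.compile(r"[A-Za-z_:][A-Za-z0-9_:]*")


def rewrite_executor_metrics(expr: str) -> str:
    """Replace old executor metric names in a PromQL expression.

    Only tokens that exactly equal an old name are replaced; everything
    else (functions, label names, other metrics) passes through as-is.
    """
    return _METRIC_TOKEN.sub(
        lambda m: EXECUTOR_RENAMES.get(m.group(0), m.group(0)), expr
    )


if __name__ == "__main__":
    print(rewrite_executor_metrics(
        'sum(rate(executor:actions_terminated{cluster="c1"}[5m]))'
    ))
```

Exact-token matching is a deliberate choice in this sketch: a suffixed series name (for example one ending in `_bucket`) would not be rewritten and would need its own entry in the mapping.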

Panels relabeled (apps/serving path, not task runs)

| Old title | New title | Metric |
| --- | --- | --- |
| Pending Assignments | Apps — Pending Assignments | `executions:app:leaser:pending_assignment_unlabeled` |
| First Ack Latency (V2 SLI) | Apps — First Ack Latency | `executions:app:service:first_ack_latency_unlabeled_bucket` |

DP PrometheusRule recording rule updated

| Rule | Old expr | New expr |
| --- | --- | --- |
| `union:dp:executor:active_actions` | `executor:active_actions_count{...}` | `executor::v2:active_actions_count{...}` |
| `union:dp:slo:executor_success_rate` | `executor:actions_terminated{...}` | `executor::v2:actions_terminated{...}` |

Dashboard UX improvements

  • All 82 timeseries panels now show table legends with Min, Max, Latest values
  • DP dashboard: Propeller Latency p99 panel replaced with Executor Evaluate Duration p99
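
The table-legend change could be applied in bulk to a dashboard JSON model along the following lines. This is a sketch under assumptions: it relies on the Grafana timeseries panel keeping its legend settings under `options.legend` (`displayMode`, `calcs`), uses the calcs named in the commit message (`min`, `max`, `lastNotNull`), and the function name `apply_table_legends` is hypothetical. It is not necessarily how the dashboards in this PR were edited.

```python
from typing import Any

# Calcs named in the commit message: Min, Max, and Latest (lastNotNull).
LEGEND_CALCS = ["min", "max", "lastNotNull"]


def apply_table_legends(dashboard: dict[str, Any]) -> int:
    """Set a table legend on every timeseries panel.

    Returns the number of timeseries panels visited. Other legend keys
    (such as placement) are left untouched.
    """
    visited = 0
    stack = list(dashboard.get("panels", []))
    while stack:
        panel = stack.pop()
        # Collapsed rows keep their child panels in a nested "panels" list.
        stack.extend(panel.get("panels", []))
        if panel.get("type") != "timeseries":
            continue
        legend = panel.setdefault("options", {}).setdefault("legend", {})
        legend["displayMode"] = "table"
        legend["calcs"] = list(LEGEND_CALCS)
        visited += 1
    return visited


if __name__ == "__main__":
    demo = {"panels": [{"type": "timeseries", "title": "CreateRun Rate"}]}
    print(apply_table_legends(demo), demo["panels"][0]["options"]["legend"])
```

Comparing the returned count with the expected number of timeseries panels (82 across both dashboards, per this description) is a cheap sanity check after running it.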

Validation

Tested on the mike-apple-aws self-hosted environment. All 28 v2 metrics were confirmed present in Prometheus with non-empty data. See scratch/metrics-validation-plan.md for full results with Grafana Explore links.

Dependencies

  • cloud #15704: Connect duration histogram (connect:server_request_duration_seconds) — required for CreateRun Latency panel
  • FAB-308: Executor ::v2: scope naming fix (separate issue for Runtime team)
  • FAB-309: Create v1-specific legacy dashboard (low priority follow-up)

Test plan

  • All metrics verified against live Prometheus (22 OK, 6 healthy-zero, 0 empty)
  • Grafana dashboard loads with no query errors
  • CreateRun Rate panel shows data after triggering runs
  • Executor panels show data with executor::v2: prefix
  • Heartbeat latency p99 now shows data (was empty with 0.95)
  • Table legends display Min/Max/Latest correctly
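
The "OK / healthy-zero / empty" tally in the first item can be reproduced with a small classifier. The PR does not define these categories, so the definitions below are assumptions: "empty" means the query returned no samples, "healthy-zero" means samples exist but are all zero (for example an error counter that has never fired), and "ok" means at least one non-zero sample. The function names are hypothetical.

```python
def classify_metric(values: list[float]) -> str:
    """Classify one metric from the sample values its query returned.

    Category definitions are assumed, not taken from the validation plan.
    """
    if not values:
        return "empty"
    if all(v == 0 for v in values):
        return "healthy-zero"
    return "ok"


def summarize(results: dict[str, list[float]]) -> dict[str, int]:
    """Count metrics per category, keyed by category name."""
    counts = {"ok": 0, "healthy-zero": 0, "empty": 0}
    for values in results.values():
        counts[classify_metric(values)] += 1
    return counts


if __name__ == "__main__":
    # Illustrative sample values only, not real measurements.
    sample = {
        "executor::v2:active_actions_count": [3.0, 4.0],
        "executor::v2:system_failures": [0.0, 0.0],
    }
    print(summarize(sample))
```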

ref FAB-305

🤖 Generated with Claude Code

Create separate v2 dashboards that remove v1-only panels and add
v2-specific metrics. Existing v1 dashboards are unchanged.

Controlplane V2 (union-controlplane-v2-overview):
- Executions row: remove v1 handle_create_op, handle_ack_op,
  workqueue panels. Add CreateRun Rate (connect counter), CreateRun
  Latency (connect histogram), V2 Run Methods by service.
- Relabel apps panels: "Apps — Pending Assignments",
  "Apps — First Ack Latency"
- Keep all shared rows: FlyteAdmin, Cluster Service, Queue,
  CacheService, Authorizer, Data Proxy, Usage, Infrastructure

Dataplane V2 (union-dataplane-v2-overview):
- Remove Flyte Propeller (V1) row entirely
- Health: remove Active Workflows and Queue Depth (propeller)
- SLOs: replace Propeller Latency p99 with Executor Evaluate Duration
- Keep: Operator, Executor, gRPC Client, Infrastructure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aviator-app Bot commented May 2, 2026

Current Aviator status

Aviator will automatically update this comment as the status of the PR changes.
Comment /aviator refresh to force Aviator to re-examine your PR (or learn about other /aviator commands).

This pull request is currently open (not queued).

How to merge

To merge this PR, comment /aviator merge or add the mergequeue label.


mhotan and others added 4 commits May 2, 2026 18:50
Remove v1-only panels from the shipped dashboards rather than
maintaining separate v1 and v2 files. Keeps the same filenames,
UIDs, and titles so existing bookmarks and Grafana links continue
to work.

The v1 dashboard triage (whether to create a separate legacy
dashboard) is tracked as a separate Linear issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The executorv2 binary registers metrics with a v2 scope suffix,
producing metric names like executor::v2:active_actions_count instead
of executor:active_actions_count. Update dashboard panels and
PrometheusRule recording rules to match the actual metric names.

Note: executor:handler_panic is unchanged (emitted by v1 scope).

A separate issue will be filed for the Runtime team to fix the
double-colon scope naming in the executor binary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cluster:svc:heartbeat:success_ms is a summary metric with quantiles
0.5, 0.9, and 0.99. The panel queried for quantile="0.95" which
returned empty. Changed to 0.99 and renamed panel to "Cluster API
Latency (p99)".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Set legend displayMode=table with min, max, lastNotNull calcs on
all 82 timeseries panels across both CP and DP dashboards. This
makes it easier to spot anomalies at a glance without hovering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@mhotan mhotan changed the title from "Add v2-focused controlplane and dataplane overview dashboards" to "Overhaul dashboards for Flyte v2: remove v1 panels, add CreateRun latency, fix executor metrics" on May 3, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>