Overhaul dashboards for Flyte v2: remove v1 panels, add CreateRun latency, fix executor metrics #373
Open
Create separate v2 dashboards that remove v1-only panels and add v2-specific metrics. Existing v1 dashboards are unchanged.

Controlplane V2 (union-controlplane-v2-overview):
- Executions row: remove v1 handle_create_op, handle_ack_op, workqueue panels. Add CreateRun Rate (connect counter), CreateRun Latency (connect histogram), V2 Run Methods by service.
- Relabel apps panels: "Apps — Pending Assignments", "Apps — First Ack Latency"
- Keep all shared rows: FlyteAdmin, Cluster Service, Queue, CacheService, Authorizer, Data Proxy, Usage, Infrastructure

Dataplane V2 (union-dataplane-v2-overview):
- Remove Flyte Propeller (V1) row entirely
- Health: remove Active Workflows and Queue Depth (propeller)
- SLOs: replace Propeller Latency p99 with Executor Evaluate Duration
- Keep: Operator, Executor, gRPC Client, Infrastructure

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
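The CreateRun Rate and CreateRun Latency panels described above could be backed by queries along these lines. This is a sketch, not the queries shipped in the dashboards: the metric names are the ones listed under Release Notes, while the helper names, the 5m rate window, and the p99 default are assumptions made for illustration.

```python
# Hypothetical helpers that build PromQL strings for the two CreateRun panels.
# Metric names come from the PR description; window and quantile are assumed.

def create_run_rate(window: str = "5m") -> str:
    """Per-second CreateRun request rate from the connect counter."""
    return (
        "sum(rate("
        f'connect:server_requests_handled_total{{method="CreateRun"}}[{window}]'
        "))"
    )


def create_run_latency(quantile: float = 0.99, window: str = "5m") -> str:
    """Estimated latency quantile from the connect histogram buckets."""
    return (
        f"histogram_quantile({quantile}, sum by (le) (rate("
        f'connect:server_request_duration_seconds_bucket{{method="CreateRun"}}[{window}]'
        ")))"
    )


if __name__ == "__main__":
    print(create_run_rate())
    print(create_run_latency())
```

Because the latency metric is a histogram, the quantile is computed at query time from the `le` buckets, so any quantile value can be requested.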
Remove v1-only panels from the shipped dashboards rather than maintaining separate v1 and v2 files. Keep the same filenames, UIDs, and titles so existing bookmarks and Grafana links continue to work. The v1 dashboard triage (whether to create a separate legacy dashboard) is tracked as a separate Linear issue.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The executorv2 binary registers metrics with a v2 scope suffix, producing metric names like executor::v2:active_actions_count instead of executor:active_actions_count. Update dashboard panels and PrometheusRule recording rules to match the actual metric names.

Note: executor:handler_panic is unchanged (emitted by v1 scope). A separate issue will be filed for the Runtime team to fix the double-colon scope naming in the executor binary.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
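The rename described above can be applied mechanically to PromQL expressions. The following is a minimal sketch, assuming the expressions are available as plain strings; the function name and regular expression are illustrative and not part of this PR. It skips executor:handler_panic, leaves recording-rule names such as union:dp:executor:active_actions alone, and is safe to run twice. It does not handle the extra _ms suffix on the evaluate_duration metric.

```python
import re

# Match "executor:" only when it starts a metric name (not preceded by a word
# character or colon), is not already followed by a second colon, and is not
# the handler_panic metric, which is still emitted under the v1 scope.
_V1_EXECUTOR_PREFIX = re.compile(r"(?<![\w:])executor:(?!:|handler_panic\b)")


def rewrite_executor_prefix(expr: str) -> str:
    """Rewrite executor:<name> to executor::v2:<name> in a PromQL string."""
    return _V1_EXECUTOR_PREFIX.sub("executor::v2:", expr)


if __name__ == "__main__":
    for expr in (
        'sum(executor:active_actions_count{cluster="a"})',
        "executor:handler_panic",
        "union:dp:executor:active_actions",
    ):
        print(expr, "=>", rewrite_executor_prefix(expr))
```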
cluster:svc:heartbeat:success_ms is a summary metric with quantiles 0.5, 0.9, and 0.99. The panel queried quantile="0.95", which matched no series and returned empty. Changed to 0.99 and renamed the panel to "Cluster API Latency (p99)".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
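A summary exports only the quantiles it was configured with, each as a separate series labelled with `quantile`, so a selector for any other value matches nothing. A small guard like the one below would catch that class of mistake before a panel ships. It is a sketch: the helper name is hypothetical, and the exported quantile set is taken from the commit message above.

```python
# Quantiles exported by the summary, per the commit message above.
EXPORTED_QUANTILES = ("0.5", "0.9", "0.99")


def summary_selector(metric: str, quantile: str,
                     exported: tuple = EXPORTED_QUANTILES) -> str:
    """Build a selector for a summary metric, rejecting quantiles that the
    summary does not export (those would silently return no data)."""
    if quantile not in exported:
        raise ValueError(
            f"{metric} has no quantile={quantile!r}; "
            f"exported quantiles are {', '.join(exported)}"
        )
    return f'{metric}{{quantile="{quantile}"}}'


if __name__ == "__main__":
    print(summary_selector("cluster:svc:heartbeat:success_ms", "0.99"))
    try:
        summary_selector("cluster:svc:heartbeat:success_ms", "0.95")
    except ValueError as err:
        print("rejected:", err)
```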
Set legend displayMode=table with min, max, lastNotNull calcs on all 82 timeseries panels across both CP and DP dashboards. This makes it easier to spot anomalies at a glance without hovering.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
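A bulk change like this is easier to script than to make panel by panel. The sketch below assumes the dashboard is loaded as a Python dict from its JSON file and that panels inside collapsed rows are nested under the row's "panels" key; the function names are illustrative and this is not necessarily how the change was made in this PR.

```python
def iter_panels(panels):
    """Yield every panel, descending into rows that nest child panels."""
    for panel in panels:
        yield panel
        yield from iter_panels(panel.get("panels", []))


def set_table_legends(dashboard: dict) -> int:
    """Switch every timeseries panel to a table legend with min, max and
    lastNotNull calcs. Returns the number of panels updated."""
    updated = 0
    for panel in iter_panels(dashboard.get("panels", [])):
        if panel.get("type") != "timeseries":
            continue
        # Preserve other legend settings (placement, etc.) if already present.
        legend = panel.setdefault("options", {}).setdefault("legend", {})
        legend["displayMode"] = "table"
        legend["calcs"] = ["min", "max", "lastNotNull"]
        updated += 1
    return updated


if __name__ == "__main__":
    demo = {
        "panels": [
            {"type": "timeseries", "title": "Rate"},
            {"type": "stat", "title": "Up"},
            {"type": "row", "panels": [{"type": "timeseries", "title": "Latency"}]},
        ]
    }
    print(set_table_legends(demo), "panels updated")
```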
xjerod approved these changes on May 4, 2026
Overview
Overhauls the controlplane and dataplane Grafana dashboards to focus on Flyte v2 metrics. It removes v1-only panels whose metrics never fire on v2 deployments, adds new v2-specific panels for CreateRun rate and latency, and fixes several metric name mismatches discovered during live validation on mike-apple-aws.
Release Notes
Metrics removed (v1-only — never fire on v2 deployments)
- executions:executions:handle_create_op_count, handle_ack_op_count (handleCreateExecution path only)
- executions:executions:handle_create_op_bucket, handle_ack_op_bucket
- executions:workqueue:announce_cluster_assignment_bucket
- executions:workqueue:send_operation_count, claim_operations, failures
- flyte:propeller:* metrics
- flyte:propeller:all:execstats:active_workflow_executions
- flyte:propeller:all:main_depth

Metrics added (new v2 panels)
- connect:server_requests_handled_total{method="CreateRun"}
- connect:server_request_duration_seconds_bucket{method="CreateRun"}
- connect:server_requests_handled_total{service=~".*RunService.*"}

Metrics renamed / fixed
- executor:active_actions_count -> executor::v2:active_actions_count (::v2: prefix, FAB-308)
- executor:available_capacity -> executor::v2:available_capacity
- executor:actions_terminated -> executor::v2:actions_terminated
- executor:invalid_leases -> executor::v2:invalid_leases
- executor:system_failures -> executor::v2:system_failures
- executor:evaluator:evaluate_duration -> executor::v2:evaluator:evaluate_duration_ms
- quantile="0.95" -> quantile="0.99"

Panels relabeled (apps/serving path, not task runs)
- executions:app:leaser:pending_assignment_unlabeled
- executions:app:service:first_ack_latency_unlabeled_bucket

DP PrometheusRule recording rule updated
- union:dp:executor:active_actions: expression changed from executor:active_actions_count{...} to executor::v2:active_actions_count{...}
- union:dp:slo:executor_success_rate: expression changed from executor:actions_terminated{...} to executor::v2:actions_terminated{...}

Dashboard UX improvements
Validation
Tested on the mike-apple-aws self-hosted environment. All 28 v2 metrics were confirmed present in Prometheus with non-empty data. See scratch/metrics-validation-plan.md for full results with Grafana Explore links.

Dependencies
- connect:server_request_duration_seconds: required for CreateRun Latency panel
- ::v2: scope naming fix (separate issue for Runtime team)

Test plan
- executor::v2: prefix

ref FAB-305
🤖 Generated with Claude Code