vllm - Add initial set of metrics #7285
base: main
Conversation
@rzabarazesh is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
Force-pushed from 74cf17a to e7ddb4b
PyTorch and test-infra use a tool called lintrunner to run all linters; it's our version of pre-commit. You'll want to install it from https://pypi.org/project/lintrunner. Also, the failure in https://github.com/pytorch/test-infra/actions/runs/18228918765/job/51907208089?pr=7285 can be solved easily by running lintrunner.
Force-pushed from aed2831 to 1d728cf
const options: EChartsOption = {
Do you want to select only some legends? Right now the entries
{ name: "Success" },
{ name: "Failed" },
{ name: "Canceled" },
are not clickable.
I'm not sure I understood this one. Do you mean the data points? Or that the legend itself isn't clickable?
bucket,
countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success')) AS passed_count,
countIf(lowerUTF8(build_state) = 'failed') AS failed_count,
Do you want to split this up into actual failures and soft-failed ones? Maybe we only care about the former category.
Correct. We mostly care about hard failures. I added a commit to be more explicit about soft-failures
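For illustration, a minimal sketch of that split, assuming the job data exposes a boolean soft_failed flag alongside build_state (the table and flag names below are assumptions, not the PR's actual schema):

SELECT
    toStartOfDay(started_at) AS bucket,
    -- hard failures: failed and not marked as a soft failure
    countIf(lowerUTF8(build_state) = 'failed' AND NOT soft_failed) AS hard_failed_count,
    -- soft failures: failed, but the step was allowed to fail
    countIf(lowerUTF8(build_state) = 'failed' AND soft_failed) AS soft_failed_count
FROM vllm.jobs  -- hypothetical table name
GROUP BY bucket
ORDER BY bucket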
success_rate
FROM job_stats
ORDER BY
success_rate ASC,
A curious question: my understanding is that this query would return the worst job first. Why then does the preview show all the jobs with a 100% success rate first? I guess we want to focus on the jobs that are not in a good state, so we should show unreliable jobs first, right?
Sure. Changed it to show the worst jobs first.
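For reference, a minimal sketch of that worst-first ordering (the total_runs tie-breaker is an illustrative addition, not necessarily what the PR uses):

SELECT
    job_name,
    total_runs,
    success_rate
FROM job_stats
ORDER BY
    success_rate ASC,  -- least reliable jobs first
    total_runs DESC    -- among equally unreliable jobs, show the highest-volume ones first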
GROUP BY
bucket
),
manual_merged_prs_pending AS (
Let's chat more on this one because I don't think it would work like this. My understanding is that job_state is a field that is updated as the job progresses, changing from scheduled to pending to running, and then to successed, failured, cancelled, etc. A manual merge due to impatience means that the job is scheduled, pending, or running at the time the merge occurs. So it's a snapshot in time. However, the job information we have here is only the latest state, which means that this query returns different results depending on when you query it.
If you agree with this, we could exclude this KPI and implement it later in a different PR, as I need to double check whether the above snapshot is even kept in the database instead of being overwritten. If it is indeed being overwritten, we need to think about a way to persist the snapshot of all jobs at the time of a merge. Just FYI, PyTorch keeps that in a table called merges, although I don't think we could reuse that one.
You are right. Removed it for now
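To illustrate the point about latest-state-only data, a query along these lines (hypothetical table and column names) can only see each job's current state, so a count of PRs merged while CI was still pending would change every time it is run:

-- vllm.jobs is assumed to keep one row per job that is overwritten as the job
-- progresses, so there is no record of what the state was at merge time.
SELECT count(DISTINCT pr_number) AS prs_with_unfinished_ci
FROM vllm.jobs
WHERE lowerUTF8(build_state) IN ('scheduled', 'pending', 'running')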
@@ -0,0 +1,23 @@
-- vLLM trunk health history
This is more of a high-level comment on the approach for many of these CI metrics, including:
torchci/clickhouse_queries/vllm/ci_reliability/query.sql
torchci/clickhouse_queries/vllm/ci_run_duration/query.sql
torchci/clickhouse_queries/vllm/job_reliability/query.sql
Should we adopt the same approach as torchci/clickhouse_queries/vllm/trunk_health/query.sql and limit these queries to only jobs from the main branch? The reason I bring this up is that contributors are free to experiment in their PRs, and that could really skew these metrics if PRs are included: for example, building new components that take longer, or adding tests that are flaky while the PR is still a work in progress, which is fine. Only when the changes are approved and landed do they become the new norm. For this reason, in PyTorch we generally only look at CI metrics from the main branch.
Basically, my thought is that we should only capture issues that affect multiple contributors, and exclude work-in-progress noise.
Good idea! Done
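A minimal sketch of such a branch filter, assuming a branch column in a hypothetical vllm.jobs table (the exact field name depends on the Buildkite schema):

SELECT
    toStartOfDay(started_at) AS bucket,
    countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success')) AS passed_count,
    countIf(lowerUTF8(build_state) = 'failed') AS failed_count
FROM vllm.jobs
WHERE branch = 'main'  -- only trunk builds; in-progress PR experiments are excluded
GROUP BY bucket
ORDER BY bucket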
),

-- Track state changes
build_with_prev AS (
Um, how does this work when there are multiple build failures before trunk is recovered? My understanding is that we want to capture this pattern: last success, F, F, ..., F, F, success (recovered), and the time in between. I can see the second transition from F to success here, but what about the first transition from success to F?
Good catch! Let me fix that
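One way to capture both directions of the transition is a ClickHouse window function; a minimal sketch follows (table and column names are assumptions, not necessarily what the trunk_health query uses):

WITH build_with_prev AS (
    SELECT
        build_number,
        lowerUTF8(build_state) AS state,
        -- state of the previous build on main, '' for the very first build
        lagInFrame(lowerUTF8(build_state), 1, '') OVER (
            ORDER BY build_number
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS prev_state
    FROM vllm.builds
    WHERE branch = 'main'
)
SELECT
    countIf(prev_state = 'passed' AND state = 'failed') AS breakages,   -- success -> F (trunk breaks)
    countIf(prev_state = 'failed' AND state = 'passed') AS recoveries   -- F -> success (trunk recovers)
FROM build_with_prev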
Force-pushed from 76a8295 to 1630749
Force-pushed from 1630749 to f3ae232
Adds metrics for both CI runtime and code review cycle
Updated to now add reliability metrics as well.