vllm - Add initial set of metrics #7285
base: main
Conversation
@rzabarazesh is attempting to deploy a commit to the Meta Open Source Team on Vercel. A member of the Team first needs to authorize it.
Force-pushed from 74cf17a to e7ddb4b
PyTorch and test-infra use a tool called lintrunner to run all linters; it's our version of pre-commit. You'll want to install it from https://pypi.org/project/lintrunner. Also, the failure in https://github.com/pytorch/test-infra/actions/runs/18228918765/job/51907208089?pr=7285 can be solved easily by running lintrunner.
Force-pushed from aed2831 to 1d728cf
const options: EChartsOption = {
Do you want to select only some legends? Right now the entries
{ name: "Success" },
{ name: "Failed" },
{ name: "Canceled" },
are not clickable.
I'm not sure I understood this one. Do you mean the data points? Or that the legend itself isn't clickable?
bucket,
countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success')) AS passed_count,
countIf(lowerUTF8(build_state) = 'failed') AS failed_count,
Do you want to split this up into actual failures and soft-failed ones? Maybe we only care about the former category.
Correct. We mostly care about hard failures. I added a commit to be more explicit about soft-failures
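For illustration, a minimal sketch of that split, assuming the job data exposes a boolean soft_failed flag alongside build_state (the table and flag names below are assumptions, not the PR's actual schema):

SELECT
    toStartOfDay(started_at) AS bucket,
    -- hard failures: failed and not marked as a soft failure
    countIf(lowerUTF8(build_state) = 'failed' AND NOT soft_failed) AS hard_failed_count,
    -- soft failures: failed, but the step was allowed to fail
    countIf(lowerUTF8(build_state) = 'failed' AND soft_failed) AS soft_failed_count
FROM vllm.jobs  -- hypothetical table name
GROUP BY bucket
ORDER BY bucket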
success_rate
FROM job_stats
ORDER BY
success_rate ASC,
A curious question: my understanding is that this query would return the worst job first. Why then does the preview show all the jobs with a 100% success rate first? I guess we want to focus on the jobs that are not in a good state, so we should show unreliable jobs first, right?
Sure. Changed it to show the worst jobs first.
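For reference, a minimal sketch of that worst-first ordering (the total_runs tie-breaker is an illustrative addition, not necessarily what the PR uses):

SELECT
    job_name,
    total_runs,
    success_rate
FROM job_stats
ORDER BY
    success_rate ASC,  -- least reliable jobs first
    total_runs DESC    -- among equally unreliable jobs, show the highest-volume ones first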
GROUP BY
bucket
),
manual_merged_prs_pending AS (
Let's chat more on this one because I don't think it would work like this. My understanding is that job_state is a field that is updated as the job progresses, changing from scheduled to pending to running, and then to successed, failured, cancelled, etc. A manual merge due to impatience means that the job is scheduled, pending, or running at the time the merge occurs. So it's a snapshot in time. However, the job information we have here is only the latest state, which means that this query returns different results depending on when you query it.
If you agree with this, we could exclude this KPI and implement it later in a different PR, as I need to double check whether the above snapshot is even kept in the database instead of being overwritten. If it is indeed being overwritten, we need to think about a way to persist the snapshot of all jobs at the time of a merge. Just FYI, PyTorch keeps that in a table called merges, although I don't think we could reuse that one.
You are right. Removed it for now
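To illustrate the point about latest-state-only data, a query along these lines (hypothetical table and column names) can only see each job's current state, so a count of PRs merged while CI was still pending would change every time it is run:

-- vllm.jobs is assumed to keep one row per job that is overwritten as the job
-- progresses, so there is no record of what the state was at merge time.
SELECT count(DISTINCT pr_number) AS prs_with_unfinished_ci
FROM vllm.jobs
WHERE lowerUTF8(build_state) IN ('scheduled', 'pending', 'running')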
@@ -0,0 +1,23 @@
-- vLLM trunk health history
This is more of a high-level comment on the approach for many of these CI metrics, including:
torchci/clickhouse_queries/vllm/ci_reliability/query.sql
torchci/clickhouse_queries/vllm/ci_run_duration/query.sql
torchci/clickhouse_queries/vllm/job_reliability/query.sql
Should we adopt the same approach as torchci/clickhouse_queries/vllm/trunk_health/query.sql and limit these queries to only jobs from the main branch? The reason I bring this up is that contributors are free to experiment in their PRs, and that could really skew these metrics if PRs are included: for example, building new components that take longer, or adding tests that are flaky while the PR is still a work in progress, which is fine. Only when the changes are approved and landed do they become the new norm. For this reason, in PyTorch we generally only look at CI metrics from the main branch.
Basically, my thought is that we should only capture issues that affect multiple contributors, and exclude work-in-progress noise.
Good idea! Done
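A minimal sketch of such a branch filter, assuming a branch column in a hypothetical vllm.jobs table (the exact field name depends on the Buildkite schema):

SELECT
    toStartOfDay(started_at) AS bucket,
    countIf(lowerUTF8(build_state) IN ('passed', 'finished', 'success')) AS passed_count,
    countIf(lowerUTF8(build_state) = 'failed') AS failed_count
FROM vllm.jobs
WHERE branch = 'main'  -- only trunk builds; in-progress PR experiments are excluded
GROUP BY bucket
ORDER BY bucket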
),

-- Track state changes
build_with_prev AS (
Um, how does this work when there are multiple build failures before trunk is recovered? My understanding is that we want to capture this pattern: last success, F, F, ..., F, F, success (recovered), and the time in between. I can see the second transition from F to success here, but what about the first transition from success to F?
Good catch! Let me fix that
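One way to capture both directions of the transition is a ClickHouse window function; a minimal sketch follows (table and column names are assumptions, not necessarily what the trunk_health query uses):

WITH build_with_prev AS (
    SELECT
        build_number,
        lowerUTF8(build_state) AS state,
        -- state of the previous build on main, '' for the very first build
        lagInFrame(lowerUTF8(build_state), 1, '') OVER (
            ORDER BY build_number
            ROWS BETWEEN 1 PRECEDING AND CURRENT ROW
        ) AS prev_state
    FROM vllm.builds
    WHERE branch = 'main'
)
SELECT
    countIf(prev_state = 'passed' AND state = 'failed') AS breakages,   -- success -> F (trunk breaks)
    countIf(prev_state = 'failed' AND state = 'passed') AS recoveries   -- F -> success (trunk recovers)
FROM build_with_prev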
Force-pushed from 76a8295 to 1630749
Force-pushed from 1630749 to f3ae232
Adds metrics for both CI runtime and code review cycle
Updated to now add reliability metrics as well.