
Color code viable/strict blocking jobs differently on HUD #6239

Merged

merged 5 commits into main from zainr/hud-vs-blocking on Feb 5, 2025

Conversation

ZainRizvi
Contributor

@ZainRizvi ZainRizvi commented Jan 31, 2025

A small experiment. The idea is to make it easier to see when failing jobs are blocking viable/strict upgrades.

This would have helped notice:

  • That linux-binary failures were blocking viable/strict upgrades two days ago
  • That the flaky rocm failures were not blocking viable/strict upgrades last week

Adds a thin border around grouped jobs on the HUD home page when there's a viable/strict blocking job failing inside them.

[Screenshot: grouped jobs on the HUD home page with the new border]

I didn't add a border to the individual jobs (the non-grouped view), since it seemed like it might feel more noisy than helpful. But I can add it if folks feel otherwise.
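The decision described above can be sketched roughly as follows. This is a hypothetical illustration, not the actual torchci implementation; the type, function names, and pattern values are all illustrative stand-ins.

```typescript
// Minimal shape of a job as rendered on the HUD (illustrative, not the real type).
type Job = { name: string; conclusion: string };

// Illustrative stand-ins for the real viable/strict blocking pattern list.
const blockingPatterns: RegExp[] = [/^pull/, /^trunk/, /^lint/];

// A job blocks viable/strict if its name matches any blocking pattern.
function isViableStrictBlocking(jobName: string): boolean {
  return blockingPatterns.some((p) => p.test(jobName));
}

// A group gets the highlight border when any job inside it is both
// failing and matches a viable/strict blocking pattern.
function groupNeedsBorder(jobs: Job[]): boolean {
  return jobs.some(
    (j) => j.conclusion === "failure" && isViableStrictBlocking(j.name)
  );
}
```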


vercel bot commented Jan 31, 2025

The latest updates on your projects. Learn more about Vercel for Git ↗︎

| Name | Status | Preview | Updated (UTC) |
| --- | --- | --- | --- |
| torchci | ✅ Ready | Visit Preview | Feb 5, 2025 4:54pm |

@facebook-github-bot facebook-github-bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jan 31, 2025
@ZainRizvi ZainRizvi marked this pull request as draft January 31, 2025 00:33
@ZainRizvi ZainRizvi marked this pull request as ready for review January 31, 2025 23:05
@ZainRizvi ZainRizvi requested review from huydhn and a team and removed request for huydhn January 31, 2025 23:05
}

// Source of truth for these jobs is in https://github.com/pytorch/pytorch/blob/main/.github/workflows/update-viablestrict.yml#L26
const viablestrict_blocking_jobs_patterns = [
Contributor

This logic has a small issue: it wrongly marks memory leak check jobs as blocking viable/strict. Also, this would only work for PyTorch; other repos have different logic there, e.g. https://github.com/pytorch/executorch/blob/main/.github/workflows/update-viablestrict.yml#L23. It would be nice to have this for all repos, but that might be tricky to implement while this list is hardcoded here, so maybe just limit this feature to PyTorch for a start?

Contributor Author

There are some pull jobs in that memory leak check, oddly enough. Any idea how those are triggered?

The closest match for those are these distributed jobs, but if those are indeed the jobs running, then the memory leak check condition is buried somewhere deep. @clee2000 might know...

Contributor

@huydhn commented Feb 1, 2025

Oh, memory leak check jobs are just regular pull and trunk jobs running once per day with memory leak check turned on. They are triggered by this cron schedule https://github.com/pytorch/pytorch/blob/main/.github/workflows/pull.yml#L14, which is picked up and set by https://github.com/pytorch/pytorch/blob/main/.github/scripts/filter_test_configs.py#L603-L606

  1. mem_leak_check and its cousin rerun_disable_test jobs are supplementary and should not block viable/strict updates. I vaguely remember that we had logic to exclude them back in the Rockset days.
  2. I don't see people paying attention to mem_leak_check jobs at all; maybe we should kill them @clee2000
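Excluding those supplementary runs could be sketched as a simple name filter. This is an assumption about how it might be done, not the actual torchci or viable/strict update logic; the name patterns assume the config suffix appears in the job name.

```typescript
// Hypothetical filter: treat mem_leak_check / rerun_disable_test runs as
// supplementary so they never count as viable/strict blocking.
// The patterns are assumptions, not the real exclusion logic.
const SUPPLEMENTARY_CONFIGS: RegExp[] = [/mem_leak_check/, /rerun_disable/];

function isSupplementaryJob(jobName: string): boolean {
  return SUPPLEMENTARY_CONFIGS.some((p) => p.test(jobName));
}
```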

Contributor Author

I tested the query on ClickHouse, giving it as input a commit that failed the mem_leak_check on pull.

Those failed commits do indeed show up in the results.

[Screenshot: ClickHouse query results including the failed commit]

While I agree that this is not the behavior we want, I think today memleak checks that affect pull requests do indeed become viable/strict blocking. It's generally not an issue for us, though, because only one commit a day has that check run.

But I'll change this feature to only work on pytorch/pytorch for now.
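Limiting the feature per repo could look roughly like this, given the repo-keyed pattern map type that appears in the diff. The map contents and helper name are illustrative assumptions, not the merged code.

```typescript
// Hypothetical per-repo pattern map; keys exist only for repos where the
// feature is enabled, and the pattern values are illustrative.
const blockingJobsByRepo: { [key: string]: RegExp[] } = {
  "pytorch/pytorch": [/^pull/, /^trunk/, /^lint/],
  // Other repos are intentionally absent for now, so nothing is highlighted there.
};

// Returns the blocking patterns for a repo, or an empty list when the
// feature is disabled for that repo.
function blockingPatternsFor(repoFullName: string): RegExp[] {
  return blockingJobsByRepo[repoFullName] ?? [];
}
```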

Contributor

> While I agree that this is not behavior we want, I think today memleak checks that affect pull requests do indeed become viable/strict blocking. It's generally not an issue for us though because only one commit a day will have that check run.
>
> But I'll change this feature to only work on pytorch/pytorch for now

Sounds good! I'm pretty sure we ignored mem leak check in the past, so it's a regression (likely due to the CH migration). I could take a look.

[key: string]: RegExp[];
};

// TODO: Move this to a config file
Contributor

https://github.com/pytorch/pytorch/blob/main/.github/pytorch-probot.yml seems to be a natural place for this, but I agree that we could do it later

Contributor

@huydhn left a comment

LGTM! Let's also do a quick post about this to let folks know about the change, any thoughts?

@ZainRizvi ZainRizvi merged commit c75ff3e into main Feb 5, 2025
6 checks passed
@ZainRizvi ZainRizvi deleted the zainr/hud-vs-blocking branch February 5, 2025 17:43
@izaitsevfb
Contributor

@ZainRizvi, one minor note: to improve UX, it would be helpful to add information about what this new border coding means to the tooltip, for people who are not familiar with this change.

[Screenshot: job status tooltip on HUD]
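The tooltip suggestion could be sketched as below. Both the helper and the tooltip wording are hypothetical, offered only to make the suggestion concrete.

```typescript
// Hypothetical helper appending an explanation of the new border to a
// job group's tooltip text when the group is highlighted.
function tooltipText(base: string, isBlockingGroup: boolean): string {
  return isBlockingGroup
    ? base + "\nThis group contains a failing viable/strict blocking job."
    : base;
}
```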
