Calling sacct on many jobs Slurm hangs causing Globus Compute tasks to timeout #3814

Open
tylern4 opened this issue Mar 17, 2025 · 0 comments · May be fixed by #3815

tylern4 commented Mar 17, 2025

Describe the bug
A user reported that the Slurm calls made from _status hang when there are more than 200 jobs on Perlmutter. This may also happen on other systems, but I haven't tested that.

To Reproduce
Start many long-running jobs on Perlmutter or another Slurm cluster through Parsl.

Expected behavior
Calls to _status should return quickly and not hang when interacting with Slurm. They should behave the same for a small job list as for a large one.
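
One way to keep a status poll bounded regardless of job count is to query sacct in fixed-size batches, each with its own timeout. The following is only a rough sketch of that idea, not the change proposed in #3815 and not Parsl's actual code: the helper names (`chunked`, `build_sacct_command`, `parse_sacct_output`, `poll_job_states`), the batch size of 50, and the 30-second timeout are all hypothetical, and whether batching actually avoids the hang on Perlmutter is untested.

```python
import subprocess
from typing import Callable, Dict, Iterator, List, Sequence


def chunked(job_ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield successive slices of at most `size` job IDs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(job_ids), size):
        yield list(job_ids[start:start + size])


def build_sacct_command(job_ids: Sequence[str]) -> List[str]:
    """Build one sacct call for a batch of jobs.

    -X limits output to allocations (no job steps); --parsable2 gives
    '|'-delimited fields with no trailing delimiter.
    """
    return [
        "sacct", "-X", "--noheader", "--parsable2",
        "--format=JobID,State",
        "--jobs=" + ",".join(job_ids),
    ]


def parse_sacct_output(text: str) -> Dict[str, str]:
    """Map job ID to the first word of its state."""
    states: Dict[str, str] = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        job_id, _, state = line.partition("|")
        words = state.split()
        # A state such as "CANCELLED by 1234" is reduced to "CANCELLED".
        states[job_id.strip()] = words[0] if words else ""
    return states


def poll_job_states(
    job_ids: Sequence[str],
    chunk_size: int = 50,        # assumed batch size, not a tuned value
    timeout_s: float = 30.0,     # assumed per-call timeout
    runner: Callable = subprocess.run,
) -> Dict[str, str]:
    """Query sacct batch by batch so one slow call cannot stall the whole poll."""
    states: Dict[str, str] = {}
    for chunk in chunked(job_ids, chunk_size):
        try:
            result = runner(
                build_sacct_command(chunk),
                capture_output=True, text=True, timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Skip this batch; its jobs stay unreported until the next poll.
            continue
        if result.returncode == 0:
            states.update(parse_sacct_output(result.stdout))
    return states
```

Jobs in a batch that times out are simply absent from the returned dictionary, so a caller would need to treat a missing entry as "status unknown" rather than as a failed job.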

Environment

  • Not sure of the user's exact environment
  • Globus Compute on Perlmutter

Distributed Environment

  • This is coming from many long-running Globus Compute jobs managed by Parsl

tylern4 added the bug label Mar 17, 2025
tylern4 linked a pull request Mar 17, 2025 that will close this issue