Use itertools batch to get long jobs lists #3815

Draft
wants to merge 6 commits into
base: master
from


tylern4
Contributor

@tylern4 tylern4 commented Mar 17, 2025

Description

When Parsl is tracking many jobs, sacct can become unresponsive and time out. A user's workflow was hitting this while running through Globus Compute on Perlmutter.

Changed Behaviour

This should batch status calls to Slurm into groups, so that sacct is never asked about every tracked job in a single call.
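
A minimal sketch of the batching idea, not the code in this PR. The helper names `batched` and `build_status_commands` are hypothetical, the sacct command line is only illustrative, and the batch size of 50 mirrors the `status_batch_size` default in the docstring. `itertools.batched` is only available from Python 3.12, so the sketch uses `islice` to behave the same way on older interpreters.

```python
from itertools import islice
from typing import Iterable, Iterator, List, Tuple


def batched(items: Iterable[str], size: int) -> Iterator[Tuple[str, ...]]:
    """Yield successive tuples of at most `size` items.

    Similar in spirit to itertools.batched (Python 3.12+), written with
    islice so the sketch also runs on older interpreters.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    # islice returns an empty tuple once the iterator is exhausted,
    # which ends the loop.
    while batch := tuple(islice(it, size)):
        yield batch


def build_status_commands(job_ids: Iterable[str], batch_size: int = 50) -> List[str]:
    """Build one sacct command line per batch of job IDs.

    Hypothetical helper: the exact sacct arguments used by the provider
    are an assumption here, not taken from the PR diff.
    """
    return [
        "sacct -X --noheader --format=jobid,state --jobs " + ",".join(batch)
        for batch in batched(job_ids, batch_size)
    ]


# 120 made-up job IDs are split into batches of 50, 50 and 20.
commands = build_status_commands([str(1000 + i) for i in range(120)], batch_size=50)
```

Each command would then be run separately and the results merged, so a single slow query only affects one batch of jobs.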

Fixes

Fixes #3814

Type of change

Choose which options apply, and delete the ones which do not apply.

  • Bug fix

@tylern4 tylern4 marked this pull request as draft March 17, 2025 23:42
cmd_timeout : int (Default = 10)
Number of seconds to wait for slurm commands to finish. For schedulers with many this
may need to be increased to wait longer for scheduler information.
status_batch_size: ine (Default = 50)
Collaborator


int

@benclifford
Collaborator

don't worry about the globus compute tests not passing here - they're intended to be broken (!)

@benclifford
Collaborator

it would be better if you do code reformatting in a separate PR that claims to not change behaviour -- the diff for this PR is mostly that and nothing to do with the PR topic, and it hurts me to go look at this stuff in git bisect in the future.

Development

Successfully merging this pull request may close these issues.

Calling sacct on many jobs Slurm hangs causing Globus Compute tasks to timeout
2 participants