Skip to content

Kill VASP on reaching MaxCorrectionsPerJobError#410

Closed
janosh wants to merge 2 commits into
materialsproject:masterfrom
janosh:fix/terminate-vasp-on-max-errors
Closed

Kill VASP on reaching MaxCorrectionsPerJobError#410
janosh wants to merge 2 commits into
materialsproject:masterfrom
janosh:fix/terminate-vasp-on-max-errors

Conversation

@janosh
Copy link
Copy Markdown
Member

@janosh janosh commented Oct 30, 2025

we noticed that a lot of our nodes had zombie VASP processes that we're hogging compute and memory without doing anything useful after hitting MaxCorrectionsPerJobError, preventing new jobs from starting properly (they we're showing as running but not doing anything). it appears that's because Custodian._run_job had no logic to actually kill VASP when the default 5 errors per jobs is reached. is this by design? i'm surprised this didn't come up before so maybe we're missing something? fwiw, we're no longer seeing the zombie VASP processes after redeploying with the changes in this PR.

@Andrew-S-Rosen @esoteric-ephemera @shyuep

- Add process termination logic before raising MaxCorrectionsPerJobError and MaxCorrectionsError
- Ensures running processes are properly killed using the multi-node compatible approach from PR #396
- Prevents orphaned VASP processes in both single-node and multi-node setups
- Add comprehensive test case to verify process termination behavior
- Refactor _do_check() to use _terminate_process() instead of passing terminate_func
- DRY: centralizes process termination logic in one place
- uses multi-node compatible approach from PR #396
@Andrew-S-Rosen
Copy link
Copy Markdown
Member

Andrew-S-Rosen commented Oct 30, 2025

@janosh: Before proceeding with this PR, can you please tell me report what version of Custodian you are using? I modified the termination behavior in #396. If you are using v2025.10.11, can you please check if 2025.8.13 prevents the issue from occurring anymore? I would like to confirm that it is unrelated before proceeding with this PR.

@janosh
Copy link
Copy Markdown
Member Author

janosh commented Oct 30, 2025

yes, using v2025.10.11. i looked at #396 before starting work on this, thinking it could be related. if it is, at least it's not obvious

@Andrew-S-Rosen
Copy link
Copy Markdown
Member

Looking at your PR, indeed, it does not seem like it would be related (although you never know). I remain just as surprised as you that this behavior would not have been caught before now. I will try to find time to test this in production before it is merged, but it will take me a little bit.

@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented Oct 30, 2025

Codecov Report

❌ Patch coverage is 41.66667% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.89%. Comparing base (a305609) to head (fee3e62).

Files with missing lines Patch % Lines
src/custodian/custodian.py 41.66% 21 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   53.02%   52.89%   -0.14%     
==========================================
  Files          39       39              
  Lines        3508     3528      +20     
==========================================
+ Hits         1860     1866       +6     
- Misses       1648     1662      +14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Andrew-S-Rosen
Copy link
Copy Markdown
Member

@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?

@janosh
Copy link
Copy Markdown
Member Author

janosh commented Oct 30, 2025

@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?

i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?

@Andrew-S-Rosen
Copy link
Copy Markdown
Member

i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?

It would need to be one that has is_monitor=True like NonConverigngErrorHandler, but in those cases, it should be killing the active VASP job regardless of the maximum number of corrections.

Can you confirm if the lingering VASP process was doing any I/O at all, or if it was complete stale? Maybe you can also share the errors in your custodian.json?

@esoteric-ephemera
Copy link
Copy Markdown
Contributor

Personally, I haven't had this issue - jobs which hit that limit seem to get killed. Is it maybe some interplay between the job orchestration software and the sigkill custodian sends?

@shyuep
Copy link
Copy Markdown
Member

shyuep commented Oct 30, 2025

I think there are instances where the VASP process is still running on a HPC node even though it is killed on other nodes. We have seen this before.

@Andrew-S-Rosen
Copy link
Copy Markdown
Member

Andrew-S-Rosen commented Nov 1, 2025

@janosh: Just a brief update from me. I just used Custodian 2025.10.11 and had a toy job hit the maximum number of errors. Everything aborted as expected, with no lingering VASP process. The relevant logging info is below:

custodian.custodian.MaxCorrectionsPerJobError: Max errors per job reached: 2.

So, this at least means I did not break anything in my PR as I was inquiring about. But it also means that there are mechanisms in place to properly abort without your proposed changes.

It would be very helpful to get the logging module stdout from the failed run you are referring to, if it happens again. There is some useful information in here for debugging purposes. Of course, if you are confident that this PR fixes your issue, please let me know. I ran with your PR and confirmed that it at least didn't drastically ruin anything on that same toy run.

@Andrew-S-Rosen
Copy link
Copy Markdown
Member

@janosh can this be closed now?

@janosh
Copy link
Copy Markdown
Member Author

janosh commented Dec 13, 2025

yes, thanks for reminding

@janosh janosh closed this Dec 13, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants