Kill VASP on reaching MaxCorrectionsPerJobError#410
Conversation
- Add process termination logic before raising MaxCorrectionsPerJobError and MaxCorrectionsError - Ensures running processes are properly killed using the multi-node compatible approach from PR #396 - Prevents orphaned VASP processes in both single-node and multi-node setups - Add comprehensive test case to verify process termination behavior
- Refactor _do_check() to use _terminate_process() instead of passing terminate_func - DRY: centralizes process termination logic in one place - uses multi-node compatible approach from PR #396
|
@janosh: Before proceeding with this PR, can you please tell me report what version of Custodian you are using? I modified the termination behavior in #396. If you are using v2025.10.11, can you please check if 2025.8.13 prevents the issue from occurring anymore? I would like to confirm that it is unrelated before proceeding with this PR. |
|
yes, using v2025.10.11. i looked at #396 before starting work on this, thinking it could be related. if it is, at least it's not obvious |
|
Looking at your PR, indeed, it does not seem like it would be related (although you never know). I remain just as surprised as you that this behavior would not have been caught before now. I will try to find time to test this in production before it is merged, but it will take me a little bit. |
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #410 +/- ##
==========================================
- Coverage 53.02% 52.89% -0.14%
==========================================
Files 39 39
Lines 3508 3528 +20
==========================================
+ Hits 1860 1866 +6
- Misses 1648 1662 +14 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore? |
i was wondering under what circumstances |
It would need to be one that has Can you confirm if the lingering VASP process was doing any I/O at all, or if it was complete stale? Maybe you can also share the errors in your |
|
Personally, I haven't had this issue - jobs which hit that limit seem to get killed. Is it maybe some interplay between the job orchestration software and the sigkill custodian sends? |
|
I think there are instances where the VASP process is still running on a HPC node even though it is killed on other nodes. We have seen this before. |
|
@janosh: Just a brief update from me. I just used Custodian 2025.10.11 and had a toy job hit the maximum number of errors. Everything aborted as expected, with no lingering VASP process. The relevant logging info is below: So, this at least means I did not break anything in my PR as I was inquiring about. But it also means that there are mechanisms in place to properly abort without your proposed changes. It would be very helpful to get the logging module stdout from the failed run you are referring to, if it happens again. There is some useful information in here for debugging purposes. Of course, if you are confident that this PR fixes your issue, please let me know. I ran with your PR and confirmed that it at least didn't drastically ruin anything on that same toy run. |
|
@janosh can this be closed now? |
|
yes, thanks for reminding |
we noticed that a lot of our nodes had zombie VASP processes that we're hogging compute and memory without doing anything useful after hitting
MaxCorrectionsPerJobError, preventing new jobs from starting properly (they we're showing as running but not doing anything). it appears that's becauseCustodian._run_jobhad no logic to actually kill VASP when the default 5 errors per jobs is reached. is this by design? i'm surprised this didn't come up before so maybe we're missing something? fwiw, we're no longer seeing the zombie VASP processes after redeploying with the changes in this PR.@Andrew-S-Rosen @esoteric-ephemera @shyuep