Kill VASP on reaching `MaxCorrectionsPerJobError` by janosh · Pull Request #410 · materialsproject/custodian

janosh · 2025-10-30T21:30:15Z

we noticed that a lot of our nodes had zombie VASP processes that we're hogging compute and memory without doing anything useful after hitting MaxCorrectionsPerJobError, preventing new jobs from starting properly (they we're showing as running but not doing anything). it appears that's because Custodian._run_job had no logic to actually kill VASP when the default 5 errors per jobs is reached. is this by design? i'm surprised this didn't come up before so maybe we're missing something? fwiw, we're no longer seeing the zombie VASP processes after redeploying with the changes in this PR.

@Andrew-S-Rosen @esoteric-ephemera @shyuep

- Add process termination logic before raising MaxCorrectionsPerJobError and MaxCorrectionsError - Ensures running processes are properly killed using the multi-node compatible approach from PR #396 - Prevents orphaned VASP processes in both single-node and multi-node setups - Add comprehensive test case to verify process termination behavior

- Refactor _do_check() to use _terminate_process() instead of passing terminate_func - DRY: centralizes process termination logic in one place - uses multi-node compatible approach from PR #396

Andrew-S-Rosen · 2025-10-30T21:35:53Z

@janosh: Before proceeding with this PR, can you please tell me report what version of Custodian you are using? I modified the termination behavior in #396. If you are using v2025.10.11, can you please check if 2025.8.13 prevents the issue from occurring anymore? I would like to confirm that it is unrelated before proceeding with this PR.

janosh · 2025-10-30T21:38:55Z

yes, using v2025.10.11. i looked at #396 before starting work on this, thinking it could be related. if it is, at least it's not obvious

Andrew-S-Rosen · 2025-10-30T21:42:29Z

Looking at your PR, indeed, it does not seem like it would be related (although you never know). I remain just as surprised as you that this behavior would not have been caught before now. I will try to find time to test this in production before it is merged, but it will take me a little bit.

codecov-commenter · 2025-10-30T21:43:22Z

Codecov Report

❌ Patch coverage is 41.66667% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 52.89%. Comparing base (a305609) to head (fee3e62).

Files with missing lines	Patch %	Lines
src/custodian/custodian.py	41.66%	21 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #410      +/-   ##
==========================================
- Coverage   53.02%   52.89%   -0.14%     
==========================================
  Files          39       39              
  Lines        3508     3528      +20     
==========================================
+ Hits         1860     1866       +6     
- Misses       1648     1662      +14

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Andrew-S-Rosen · 2025-10-30T21:44:21Z

@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?

janosh · 2025-10-30T21:51:12Z

@janosh --- As a follow-up question, if the VASP job errors out, wouldn't it not be running anymore?

i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?

Andrew-S-Rosen · 2025-10-30T21:53:40Z

i was wondering under what circumstances Custodian might think VASP errored out but where the VASP process is actually still running. is that possible?

It would need to be one that has is_monitor=True like NonConverigngErrorHandler, but in those cases, it should be killing the active VASP job regardless of the maximum number of corrections.

Can you confirm if the lingering VASP process was doing any I/O at all, or if it was complete stale? Maybe you can also share the errors in your custodian.json?

esoteric-ephemera · 2025-10-30T21:59:10Z

Personally, I haven't had this issue - jobs which hit that limit seem to get killed. Is it maybe some interplay between the job orchestration software and the sigkill custodian sends?

shyuep · 2025-10-30T22:13:56Z

I think there are instances where the VASP process is still running on a HPC node even though it is killed on other nodes. We have seen this before.

Andrew-S-Rosen · 2025-11-01T00:28:37Z

@janosh: Just a brief update from me. I just used Custodian 2025.10.11 and had a toy job hit the maximum number of errors. Everything aborted as expected, with no lingering VASP process. The relevant logging info is below:

custodian.custodian.MaxCorrectionsPerJobError: Max errors per job reached: 2.

So, this at least means I did not break anything in my PR as I was inquiring about. But it also means that there are mechanisms in place to properly abort without your proposed changes.

It would be very helpful to get the logging module stdout from the failed run you are referring to, if it happens again. There is some useful information in here for debugging purposes. Of course, if you are confident that this PR fixes your issue, please let me know. I ran with your PR and confirmed that it at least didn't drastically ruin anything on that same toy run.

Andrew-S-Rosen · 2025-12-13T22:19:14Z

@janosh can this be closed now?

janosh · 2025-12-13T22:28:40Z

yes, thanks for reminding

janosh added 2 commits October 28, 2025 13:31

add _terminate_process() on class Custodian

fee3e62

- Refactor _do_check() to use _terminate_process() instead of passing terminate_func - DRY: centralizes process termination logic in one place - uses multi-node compatible approach from PR #396

janosh mentioned this pull request Dec 11, 2025

Fix Custodian only killing MPI parent process but leaving VASP ghost processes blocking nodes #414

Merged

janosh closed this Dec 13, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Kill VASP on reaching `MaxCorrectionsPerJobError`#410

Kill VASP on reaching `MaxCorrectionsPerJobError`#410
janosh wants to merge 2 commits into
materialsproject:masterfrom
janosh:fix/terminate-vasp-on-max-errors

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025 •

edited

Loading

Uh oh!

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

codecov-commenter commented Oct 30, 2025 •

edited

Loading

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

esoteric-ephemera commented Oct 30, 2025

Uh oh!

shyuep commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Nov 1, 2025 •

edited

Loading

Uh oh!

Andrew-S-Rosen commented Dec 13, 2025

Uh oh!

janosh commented Dec 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

codecov-commenter commented Oct 30, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

janosh commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Oct 30, 2025

Uh oh!

esoteric-ephemera commented Oct 30, 2025

Uh oh!

shyuep commented Oct 30, 2025

Uh oh!

Andrew-S-Rosen commented Nov 1, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Andrew-S-Rosen commented Dec 13, 2025

Uh oh!

janosh commented Dec 13, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Andrew-S-Rosen commented Oct 30, 2025 •

edited

Loading

codecov-commenter commented Oct 30, 2025 •

edited

Loading

Andrew-S-Rosen commented Nov 1, 2025 •

edited

Loading