Retry OOM killed jobs with more memory [AN-539] #7786
base: develop
c0f3c96 to 04cea74
// From Gemini:
// An exit code of 247, particularly in the context of process execution in Linux or containerized environments like
// Docker, often indicates a process termination due to resource limitations, most commonly insufficient memory (RAM).
val SIGCONTAINERKILL = 247
Variants has encountered this one in the wild a few times, but we don't currently have a test case to reproduce it.
Probably what the AI trained on:
https://hpc-discourse.usc.edu/t/exit-codes-and-their-meanings/414/3
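For readers following the thread, here is a minimal sketch of what a return-code-based OOM check with a memory bump could look like. It is not the code in this PR: apart from `SIGCONTAINERKILL = 247`, every name here (`OomRetrySketch`, `looksOomKilled`, `nextMemoryGb`, the multiplier and attempt parameters) is hypothetical, and `137` is included only because 128 + 9 is the conventional shell exit status for a process ended by SIGKILL.

```scala
// Hypothetical sketch only; not the implementation under review.
object OomRetrySketch {
  // Assumption: 137 = 128 + 9, the usual shell exit status after SIGKILL,
  // which is the signal the Linux OOM killer sends.
  val SIGKILL = 137

  // From the diff under review: seen in the wild, but without a reproducing test.
  val SIGCONTAINERKILL = 247

  // Return codes treated as "probably killed for running out of memory".
  val oomReturnCodes: Set[Int] = Set(SIGKILL, SIGCONTAINERKILL)

  def looksOomKilled(returnCode: Int): Boolean =
    oomReturnCodes.contains(returnCode)

  // Memory (in GB) for the next attempt, or None when the job should not be
  // retried, either because the return code does not look like an OOM kill
  // or because the attempt budget is used up.
  def nextMemoryGb(
      returnCode: Int,
      currentGb: Double,
      multiplier: Double,
      attempt: Int,
      maxAttempts: Int
  ): Option[Double] =
    if (looksOomKilled(returnCode) && attempt < maxAttempts)
      Some(currentGb * multiplier)
    else
      None
}
```

A return code alone is a heuristic: a job can exit with 137 or 247 for reasons other than memory, so a sketch like this can retry jobs that were killed for something else.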
04cea74 to 284782b
Description
An rc-based approach to handling OOM-killed jobs. Ready for review but could use more:
Release Notes Confirmation
CHANGELOG.md
CHANGELOG.md in this PR
CHANGELOG.md because it doesn't impact community users
Terra Release Notes