@mcovarr mcovarr commented Jul 25, 2025

Description

A return-code (rc) based approach to handling OOM-killed jobs. Ready for review, but could use more:

  • Testing, particularly around interactions between retry with more memory and abort
  • Updates to error messages and documentation. Unlike the previous implementation, this does not need to find particular words in stderr to decide whether a job has been OOM-killed.
  • Updates to CHANGELOG.md and Terra Release Notes as requested below.
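
For reviewers, a minimal sketch of what an rc-based check could look like, as opposed to scanning stderr for keywords. This is not the code in this PR: the object name, method names, and the exact set of return codes are hypothetical and chosen only for illustration. 137 is the conventional shell exit status for a process killed by SIGKILL (128 + 9), which is the signal the Linux OOM killer sends; 247 is the value this PR names `SIGCONTAINERKILL`. The abort guard is also an assumption, included because an aborted job can be killed with SIGKILL too and would then report the same return code as an OOM kill.

```scala
// Hypothetical sketch: names here are illustrative and are not taken from the PR.
object OomReturnCodes {
  // 128 + 9: conventional exit status for a process terminated by SIGKILL,
  // the signal sent by the Linux OOM killer.
  val SigKillExit: Int = 137

  // The value the PR calls SIGCONTAINERKILL.
  val ContainerKillExit: Int = 247

  // Assumption: the return codes treated as OOM kills. The PR's actual list may differ.
  val oomCodes: Set[Int] = Set(SigKillExit, ContainerKillExit)

  // Decide from the return code alone; no stderr inspection is involved.
  def isOomKill(returnCode: Int): Boolean = oomCodes.contains(returnCode)

  // Assumption: an aborted job is never retried, even if its return code
  // looks like an OOM kill (an abort may also deliver SIGKILL).
  def shouldRetryWithMoreMemory(returnCode: Int, aborted: Boolean): Boolean =
    !aborted && isOomKill(returnCode)
}
```

Under these assumptions, `OomReturnCodes.shouldRetryWithMoreMemory(137, aborted = false)` is true, while the same return code on an aborted job is not retried. That second case is the retry/abort interaction flagged above as needing more testing.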

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

@mcovarr mcovarr force-pushed the an_539_retry_with_more_memory_oom_kills branch 2 times, most recently from c0f3c96 to 04cea74 Compare July 28, 2025 16:23
// From Gemini:
// An exit code of 247, particularly in the context of process execution in Linux or containerized environments like
// Docker, often indicates a process termination due to resource limitations, most commonly insufficient memory (RAM).
val SIGCONTAINERKILL = 247
Contributor Author


Variants has encountered this one in the wild a few times, but we don't currently have a test case to reproduce it.

@mcovarr mcovarr marked this pull request as ready for review July 28, 2025 18:06
@mcovarr mcovarr requested a review from a team as a code owner July 28, 2025 18:06
@mcovarr mcovarr force-pushed the an_539_retry_with_more_memory_oom_kills branch from 04cea74 to 284782b Compare July 28, 2025 21:11