Add job_max_count option to keep Nomad server from running out of memory #26858
Conversation
Hi @bmacdonell-roblox! This looks great overall. I'm mildly concerned about the backwards compatibility impact of having a default limit, but it feels like the default we have here is so absurdly large that it's safe to break backcompat so long as we ship this in the upcoming major version (1.11.0), which we're feature-freezing in the next couple weeks. Perhaps in later versions of Nomad we could reduce this default limit further.
Unrelated to this implementation, but something I want to post somewhere public so that we have a place to point to later: our thoughts on how this interacts with potential governance controls in Nomad Enterprise.
Today Nomad Enterprise users can already get this feature with a Sentinel policy. It's one of the example policies we have in the UI for Enterprise users (ref count-limits.js):
```
main = rule { all_counts_under }

# all_counts_under checks that all task group counts are under a certain value
all_counts_under = rule {
  all job.task_groups as tg {
    tg.count < 100
  }
}
```
Normally we don't like to step on Enterprise features for obvious reasons, but we've had a long-running discussion about implementing this feature and feel it makes sense to have cluster-wide limits like this in the agent configuration.
For organizations that want finer-grained control, we'd likely implement this as a feature of quotas, which would fall under Nomad Enterprise, or as part of a feature around per-namespace scheduling controls that we're currently designing (targeting 1.12.0).
This looks great @bmacdonell-roblox. I'm going to pull this down and do some end-to-end testing just to make sure we didn't miss anything. If you want to make that one minor change and mark it ready for review I think we'll be able to get this merged.
Oh shoot, I'm also noticing that you haven't signed the CLA yet. We'll need that as well.
Yeah, that's unfortunately why I haven't marked this as ready for review yet - I'm still chasing down legal to approve the CLA. I expect that process should be complete in the next week or so.
Co-authored-by: Tim Gross <[email protected]>
Also, combine positive and negative server configuration tests using an expectedErr field as suggested by @tgross.
Co-authored-by: Tim Gross <[email protected]>
Force-pushed from eab568c to 81e7d44
Description
If a Nomad job is started with a large number of instances (e.g. 4 billion), then the Nomad servers that attempt to schedule it will run out of memory and crash. While it's unlikely that anyone would intentionally schedule a job with 4 billion instances, we have occasionally run into issues with bugs in external automation. For example, an automated deployment system running in a test environment had an off-by-one error and deployed a job with `count = uint32(-1)`, causing the Nomad servers for that environment to run out of memory and crash.

To prevent this, this PR introduces a `job_max_count` Nomad server configuration parameter.
`job_max_count` limits the number of allocs that may be created from a job. The default value is 50000; this is low enough that a job with the maximum possible number of allocs will not require much memory on the server, but is still much higher than the number of allocs in the largest Nomad job we have ever run.
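To make this concrete, here is a minimal sketch of what setting the option in a server agent configuration might look like. The placement inside the `server` block is an assumption based on this draft; the exact name and location could change before merge.

```hcl
# Hypothetical agent configuration sketch; placement inside the server
# block is an assumption based on this draft PR.
server {
  enabled = true

  # Reject any job whose total alloc count exceeds this value.
  # The proposed default is 50000; per the draft, the check is skipped
  # when the internal JobMaxCount value is zero.
  job_max_count = 50000
}
```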
For now, this PR is just an initial draft to get the conversation started; there are a few open questions from my side:

- While a per-job maximum is safer, it's more difficult to explain in the documentation. Is it worth using an easier-to-explain limit on the count for each group instead of a limit on the total count for the job as a whole?
- The checks are written so that no validation is performed if `JobMaxCount` is zero. This avoids updating several hundred tests to use a configuration object with appropriate defaults instead of a struct literal. Is this a good practice, or would it be better to update all of the affected tests?
Testing & Reproduction steps
To reproduce running out of memory, submit a job with `count = 1000000000`.
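For illustration, a minimal jobspec along these lines should reproduce it. The job, group, and task names, the datacenter, and the `raw_exec` task body are placeholders; only the oversized `count` matters. Don't submit this to a cluster you care about.

```hcl
# Hypothetical reproduction jobspec; everything except the count is a placeholder.
job "count-oom-repro" {
  datacenters = ["dc1"] # placeholder

  group "huge" {
    # An absurdly large count is enough to exhaust server memory while
    # the scheduler tries to process the job.
    count = 1000000000

    task "sleep" {
      driver = "raw_exec"

      config {
        command = "/bin/sleep"
        args    = ["3600"]
      }
    }
  }
}
```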
Contributor Checklist
- Add a changelog entry using the `make cl` command.
- Add tests to ensure regressions will be caught.
- If the change impacts user-facing functionality such as the CLI, API, UI, or job configuration, please update the Nomad website documentation to reflect this. Refer to the website README for docs guidelines. Please also consider whether the change requires notes within the upgrade guide.
Reviewer Checklist
- Apply the appropriate backport labels per the backporting document.
- Squash and merge in the majority of situations. The main exceptions are long-lived feature branches or merges where history should be preserved.
within the public repository.
Changes to Security Controls
There are no changes to security controls.