Conversation

@bmacdonell-roblox bmacdonell-roblox commented Sep 30, 2025

Description

If a Nomad job is started with a large number of instances (e.g. 4 billion), then the Nomad servers that attempt to schedule it will run out of memory and crash. While it's unlikely that anyone would intentionally schedule a job with 4 billion instances, we have occasionally run into issues with bugs in external automation. For example, an automated deployment system running on a test environment had an off-by-one error, and deployed a job with count = uint32(-1), causing the Nomad servers for that environment to run out of memory and crash.

To prevent this, this PR introduces a job_max_count Nomad server configuration parameter, which limits the number of allocs that may be created from a single job. The default value is 50000: low enough that a job with the maximum possible number of allocs will not require much memory on the server, but still much higher than the number of allocs in the largest Nomad job we have ever run.
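
For reference, here is a minimal sketch of how the parameter might be set in a server agent's configuration file. The placement inside the server block and the key name follow this PR's proposal and may change before merge; the other values are illustrative.

# Sketch of a Nomad server agent configuration using the proposed limit.
# The job_max_count key and its placement are assumptions based on this PR;
# when the key is omitted, the proposed default of 50000 would apply.
server {
  enabled          = true
  bootstrap_expect = 3

  # Reject any job whose total alloc count exceeds this value.
  job_max_count = 50000
}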

For now, this PR is just an initial draft to get the conversation started - there are a few open questions from my side:

  1. While a per-job maximum is safer, it's more difficult to explain in the documentation. Is it worth using an easier-to-explain limit on the count for each group instead of a limit on the total count for the job as a whole? (The sketch after this list illustrates how the two differ.)

  2. The checks are written so that no validation is performed if JobMaxCount is zero. This avoids updating several hundred tests to use a configuration object with appropriate defaults instead of a struct literal. Is this a good practice, or would it be better to update all of the affected tests?
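
To make the trade-off in question 1 concrete, here is a hypothetical job spec (job, group, and task names and the container image are illustrative only). With two groups of count = 30000, each group is comfortably under a 50000 per-group limit, but the job-wide total of 60000 would exceed a 50000 per-job limit.

# Hypothetical job: each group's count is below 50000, but the job total is
# 60000, so a per-job limit and a per-group limit give different answers.
job "fanout" {
  datacenters = ["dc1"]

  group "ingest" {
    count = 30000

    task "worker" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }

  group "transform" {
    count = 30000

    task "worker" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}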

Testing & Reproduction steps

To reproduce running out of memory, submit a job with count = 1000000000.
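
As a rough sketch, a job spec along the following lines should trigger the problem on an unpatched server (names and image are illustrative; run it only against a disposable test cluster).

# Illustrative job with an absurdly large count. On an unpatched server the
# scheduler is expected to exhaust memory while handling it.
job "oom-repro" {
  datacenters = ["dc1"]

  group "huge" {
    count = 1000000000

    task "sleep" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}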

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

There are no changes to security controls.

hashicorp-cla-app bot commented Sep 30, 2025

CLA assistant check
All committers have signed the CLA.

@tgross tgross self-requested a review October 2, 2025 12:31

@tgross tgross left a comment

Hi @bmacdonell-roblox! This looks great overall. I'm mildly concerned about breaking backwards compatibility by having a default limit, but the default we have here is so absurdly large that it feels safe so long as we ship this in the upcoming major version (1.11.0), which we're feature-freezing in the next couple of weeks. Perhaps in later versions of Nomad we could reduce the default limit further.

Unrelated to this implementation, but something I want to post somewhere public so that we have a place to point to later: our thoughts on how this interacts with potential governance controls in Nomad Enterprise.

Today Nomad Enterprise users can already get this feature with a Sentinel policy. It's one of the example policies we have in the UI for Enterprise users (ref count-limits.js):

main = rule { all_counts_under }

# all_counts_under checks that all task group counts are under a certain value

all_counts_under = rule {
  all job.task_groups as tg {
    tg.count < 100
  }
}

Normally we don't like to step on Enterprise features for obvious reasons, but we've had a long-running discussion about implementing this feature and feel it makes sense to have cluster-wide limits like this in the agent configuration.

For organizations that want finer-grained control, we'd likely implement this as a feature of quotas, which would fall under Nomad Enterprise, or as part of a feature around per-namespace scheduling controls that we're currently working on a design for (for 1.12.0).


@tgross tgross left a comment

This looks great @bmacdonell-roblox. I'm going to pull this down and do some end-to-end testing just to make sure we didn't miss anything. If you want to make that one minor change and mark it ready for review I think we'll be able to get this merged.

@tgross tgross left a comment

Oh shoot, I'm also noticing that you haven't signed the CLA yet. We'll need that as well.

@bmacdonell-roblox (Author)

> Oh shoot, I'm also noticing that you haven't signed the CLA yet. We'll need that as well.

Yeah, that's unfortunately why I haven't marked this as ready for review yet - I'm still chasing down legal to approve the CLA. I expect that process should be complete in the next week or so.

@bmacdonell-roblox bmacdonell-roblox marked this pull request as ready for review October 3, 2025 21:29
@bmacdonell-roblox bmacdonell-roblox requested review from a team as code owners October 3, 2025 21:29