Conversation

@bmacdonell-roblox bmacdonell-roblox commented Sep 30, 2025

Description

If a Nomad job is started with a large number of instances (e.g. 4 billion), then the Nomad servers that attempt to schedule it will run out of memory and crash. While it's unlikely that anyone would intentionally schedule a job with 4 billion instances, we have occasionally run into issues with bugs in external automation. For example, an automated deployment system running on a test environment had an off-by-one error, and deployed a job with count = uint32(-1), causing the Nomad servers for that environment to run out of memory and crash.

To prevent this, this PR introduces a job_max_count Nomad server configuration parameter, which limits the number of allocs that may be created from a single job. The default value is 50000: low enough that a job with the maximum possible number of allocs will not require much memory on the server, but still much higher than the number of allocs in the largest Nomad job we have ever run.
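
For reference, here is a minimal sketch of how the parameter might be set in a server agent's configuration file. The placement inside the server block and the key name follow this PR's proposal and may change before merge; the other values are illustrative.

# Sketch of a Nomad server agent configuration using the proposed limit.
# The job_max_count key and its placement are assumptions based on this PR;
# when the key is omitted, the proposed default of 50000 would apply.
server {
  enabled          = true
  bootstrap_expect = 3

  # Reject any job whose total alloc count exceeds this value.
  job_max_count = 50000
}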

For now, this PR is just an initial draft to get the conversation started - there are a few open questions from my side:

  1. While a per-job maximum is safer, it's more difficult to explain in the documentation. Is it worth using an easier-to-explain limit on the count for each group instead of a limit on the total count for the job as a whole? (The sketch after this list illustrates how the two differ.)

  2. The checks are written so that no validation is performed if JobMaxCount is zero. This avoids updating several hundred tests to use a configuration object with appropriate defaults instead of a struct literal. Is this a good practice, or would it be better to update all of the affected tests?
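
To make the trade-off in question 1 concrete, here is a hypothetical job spec (job, group, and task names and the container image are illustrative only). With two groups of count = 30000, each group is comfortably under a 50000 per-group limit, but the job-wide total of 60000 would exceed a 50000 per-job limit.

# Hypothetical job: each group's count is below 50000, but the job total is
# 60000, so a per-job limit and a per-group limit give different answers.
job "fanout" {
  datacenters = ["dc1"]

  group "ingest" {
    count = 30000

    task "worker" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }

  group "transform" {
    count = 30000

    task "worker" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}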

Testing & Reproduction steps

To reproduce running out of memory, submit a job with count = 1000000000.
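
As a rough sketch, a job spec along the following lines should trigger the problem on an unpatched server (names and image are illustrative; run it only against a disposable test cluster).

# Illustrative job with an absurdly large count. On an unpatched server the
# scheduler is expected to exhaust memory while handling it.
job "oom-repro" {
  datacenters = ["dc1"]

  group "huge" {
    count = 1000000000

    task "sleep" {
      driver = "docker"

      config {
        image   = "busybox:1.36"
        command = "sleep"
        args    = ["infinity"]
      }
    }
  }
}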

Contributor Checklist

  • Changelog Entry If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type Ensure the correct merge method is selected which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs If this is an enterprise only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

There are no changes to security controls.

hashicorp-cla-app bot commented Sep 30, 2025

CLA assistant check
All committers have signed the CLA.

@tgross tgross self-requested a review October 2, 2025 12:31

@tgross tgross left a comment

Hi @bmacdonell-roblox! This looks great overall. I'm mildly concerned about breaking backwards compatibility by having a default limit, but the default we have here is so absurdly large that it feels safe so long as we ship this in the upcoming major version (1.11.0), which we're feature-freezing in the next couple of weeks. Perhaps in later versions of Nomad we could reduce the default limit further.

Unrelated to this implementation, but something I want to post somewhere public so that we have a place to point to later: our thoughts on how this interacts with potential governance controls in Nomad Enterprise.

Today Nomad Enterprise users can already get this feature with a Sentinel policy. It's one of the example policies we have in the UI for Enterprise users (ref count-limits.js):

main = rule { all_counts_under }

# all_counts_under checks that all task group counts are under a certain value

all_counts_under = rule {
  all job.task_groups as tg {
    tg.count < 100
  }
}

Normally we don't like to step on Enterprise features for obvious reasons, but we've had a long-running discussion about implementing this feature and feel it makes sense to have cluster-wide limits like this in the agent configuration.

For organizations that want finer-grained control, we'd likely implement this as a feature of quotas, which would fall under Nomad Enterprise, or as part of a feature around per-namespace scheduling controls that we're currently working on a design for (for 1.12.0).


@tgross tgross left a comment

This looks great @bmacdonell-roblox. I'm going to pull this down and do some end-to-end testing just to make sure we didn't miss anything. If you want to make that one minor change and mark it ready for review I think we'll be able to get this merged.

@tgross tgross left a comment

Oh shoot, I'm also noticing that you haven't signed the CLA yet. We'll need that as well.

@bmacdonell-roblox (Author)

> Oh shoot, I'm also noticing that you haven't signed the CLA yet. We'll need that as well.

Yeah, that's unfortunately why I haven't marked this as ready for review yet - I'm still chasing down legal to approve the CLA. I expect that process should be complete in the next week or so.

@bmacdonell-roblox bmacdonell-roblox marked this pull request as ready for review October 3, 2025 21:29
@bmacdonell-roblox bmacdonell-roblox requested review from a team as code owners October 3, 2025 21:29