
Conversation

Contributor

@mvegter mvegter commented Sep 15, 2025

Description

Configuring cores as part of the client reserved resources only limits scheduling on those cores; the reservation is not reflected in the available MHz bandwidth, which affects logic in places such as bin spread.

Testing & Reproduction steps

[sandbox@nomad-dev ~]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu'  | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 278380
[sandbox@nomad-dev nomad]$ cat config.hcl
  client {
    reserved {
      # cpu = 1000
      # cores = "0-122"
    }
  }

versus

[sandbox@nomad-dev nomad]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu' | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 2245
[sandbox@nomad-dev nomad]$ cat config.hcl
client {
  reserved {
    # cpu = 1000
    cores = "0-122"
  }
}
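
For reference, the numbers line up as follows (a back-of-the-envelope check using the metrics and config above):

278380 MHz total / 124 cores ≈ 2245 MHz per core
cores 0-122 reserved = 123 cores × 2245 MHz = 276135 MHz
expected remaining   = 1 core ≈ 2245 MHz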

So on a VM with 124 cores, with cores 0-122 reserved and only the last core left available for scheduling, we allegedly have 1 core, or 2245 MHz of cpu, available. While testing:

[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cpu = 3000
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cores = 2
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cores" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Mimicking the situation by reserving the same amount of bandwidth via cpu instead of cores, the placement does fail:

[sandbox@nomad-dev nomad]$ cat config.hcl
  client {
    reserved {
      cpu = 276135
    }
  }
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cpu = 3000
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
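
Sanity check on the mimic numbers, using the figures above: 278380 MHz total − 276135 MHz reserved = 2245 MHz, i.e. the same one-core share that reserving cores 0-122 leaves behind, yet only the cpu-based reservation makes the 3000 MHz placement fail.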

Links

Contributor Checklist

  • Changelog Entry: If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing: Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation: If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels: Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type: Ensure the correct merge method is selected, which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs: If this is an enterprise-only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from 505ac99 to ab9cb3f on September 15, 2025 09:45
Member

@tgross tgross left a comment


@mvegter before you get too much further with this and/or we get into a detailed review, could you help us out by describing the bug in detail in the PR description?

@tgross tgross self-assigned this Sep 15, 2025
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Sep 15, 2025
Contributor Author

mvegter commented Sep 16, 2025

Hey @tgross, I like the speed at which you pick up on draft PRs 😉 I have added a rudimentary description covering the basics of the problem at hand and a reproducer for latest main (plus some initial tests). I'm changing work laptops by the end of the week, hence the premature push, so I probably won't be able to continue working on it anyhow.

Contributor Author

mvegter commented Sep 29, 2025

Hey @tgross, before I pick this up again, would you prefer that I first create a GitHub issue, or do you perhaps already have feedback on the direction of this solution?

Member

tgross commented Sep 29, 2025

Sorry, I got sidetracked and didn't get a chance to re-review your updated description here. No need to open a new GitHub issue for it; we can discuss here.

The overall problem you're describing makes sense. It looks like the node is fingerprinting the usable compute as I'd expect. For example, this instance has 22 cores and a total of 25400 MHz (a mix of pCores and eCores). If I reserve one core:

client {
  reserved {
    cores = "0"
  }
}

And as your AvailableResources method shows, the node fingerprint does reflect the correct values:

$ nomad node status -self -verbose | grep compute
cpu.totalcompute                         = 25400
cpu.usablecompute                        = 24000
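
(The 1400 MHz delta between totalcompute and usablecompute, 25400 − 24000, is presumably the reserved core's share being excluded.)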

It looks like your approach is to tweak the subtraction of comparable resources after the fact. And it looks like this has broken some tests. Wouldn't it be better to make sure that the comparable resources actually include the usable CPU (less the reserved cores) to begin with, rather than patching that up?

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch 4 times, most recently from 2296998 to b8e7428 on October 3, 2025 20:48
@mvegter mvegter changed the title from "scheduler: Fixed a bug where the bandwidth of reserved cores was not taken into account" to "client: fixed a bug where the bandwidth of reserved cores were not taken into account" Oct 3, 2025
@mvegter mvegter marked this pull request as ready for review October 6, 2025 11:10
@mvegter mvegter requested review from a team as code owners October 6, 2025 11:10
@mvegter mvegter requested a review from tgross October 6, 2025 11:10
Member

tgross commented Oct 8, 2025

@mvegter I saw that you pushed an update. I'm going to try to get this reviewed this week.

Member

@tgross tgross left a comment


Hi @mvegter! This is looking pretty good. In addition to my comments inline, I think the test changes aren't actually exercising the behavior you expect. If I drop the nomad/structs/funcs_test.go into main, it still passes all the tests, which suggests there's a gap there.

nodeMem -= float64(reserved.Flattened.Memory.MemoryMB)
}
available := node.NodeResources.Comparable()
available.Subtract(node.ReservedResources.Comparable())
Member


As I've noted in my comment on nomad/structs/structs.go, we've changed the return value of NodeResources.Comparable() so that it now includes the unreserved resources. But then we have to subtract the ReservedResources.Comparable() here and on line 195?

Contributor Author


Thanks for the review @tgross, I fixed up most of your comments. As we are iterating on this one, would you prefer the fix to live in node.NodeResources.Comparable() so that it reflects the usable resources, dropping Comparable on the ReservedResources? That would be a slightly bigger change, as NodeResources has no view of the reserved resources (except the leaking CPU details in Processors.Topology). Or should we create a new method, e.g. node.Comparable(), that would essentially convert the NodeResources and ReservedResources into a single ComparableResources?

If I drop the nomad/structs/funcs_test.go into main, it still passes all the tests, which suggests there's a gap there.

Not sure I see the gap you are referring to, as funcs_test.go covers the cases after fingerprinting; the change in CpuShares behaviour is more accurately captured in client_test.go.

Member


That would be a slightly bigger change, as NodeResources has no view of the reserved resources (except the leaking CPU details in Processors.Topology). Or should we create a new method, e.g. node.Comparable(), that would essentially convert the NodeResources and ReservedResources into a single ComparableResources?

Oof, this data model is ugly, isn't it? 😁 Another alternative would be to have NodeResources.Comparable() take a parameter for the reserved resources:

func (n *NodeResources) Comparable(reserved *ComparableResources) *ComparableResources {

That way we'd force the caller to make sure they're correctly handling the reserved resources, while making it clear this doesn't include resources for allocations. That's still pretty ugly though.

Tell you what: none of this impacts the functionality and is just my aesthetic preference, so we can land this PR without worrying about that.

Not sure I see the gap you are referring to, as funcs_test.go covers the cases after fingerprinting; the change in CpuShares behaviour is more accurately captured in client_test.go.

I guess I'm asking why bother changing the tests in funcs_test.go if all the tests in nomad/structs still pass when those test changes are dropped into the main branch? What are we testing here? Is this just belt-and-suspenders to make sure we don't introduce nil pointer bugs later on? (That's fine if so.)

Contributor Author


Tell you what: none of this impacts the functionality and is just my aesthetic preference, so we can land this PR without worrying about that.

Given that NodeResources.Comparable and ReservedResources.Comparable are always called sequentially, essentially merging the two into a Node.Comparable and dropping their respective methods may be the cleanest approach for now.
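
For concreteness, a rough sketch of that shape, reusing the Comparable and Subtract helpers already shown above (nil-handling and naming illustrative only, not the final implementation):

func (n *Node) Comparable() *ComparableResources {
	if n == nil || n.NodeResources == nil {
		return nil
	}
	// start from the fingerprinted node resources ...
	c := n.NodeResources.Comparable()
	// ... and fold the client-reserved share in here, so callers no longer
	// need to subtract ReservedResources.Comparable() themselves
	if n.ReservedResources != nil {
		c.Subtract(n.ReservedResources.Comparable())
	}
	return c
}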

What are we testing here? Is this just belt-and-suspenders to make sure we don't introduce nil pointer bugs later on? (That's fine if so.)

These are the first usages of "disabled" cores. The Disabled field is currently only cosmetic, as none of the tests check Topology directly, but the actual bandwidth and the reserved resources are now tied together and will affect test results.

Member


essentially merging the two into a Node.Comparable and dropping their respective methods may be the cleanest approach for now.

Go for it.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from b8e7428 to 7f90a6f on October 13, 2025 09:15
@mvegter mvegter changed the title from "client: fixed a bug where the bandwidth of reserved cores were not taken into account" to "scheduler: fixed a bug where the bandwidth of reserved cores were not taken into account" Oct 13, 2025
@tgross tgross added the backport/ent/1.8.x+ent (Changes are backported to 1.8.x+ent), backport/ent/1.9.x+ent (Changes are backported to 1.9.x+ent), and backport/1.10.x (backport to 1.10.x release line) labels Oct 14, 2025
@tgross
Copy link
Member

tgross commented Oct 14, 2025

I've verified this impacts releases back to 1.8.x+ent, so I've added the appropriate backport labels.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch 4 times, most recently from 957d379 to 51ece85 on October 14, 2025 21:11
@mvegter mvegter marked this pull request as draft October 14, 2025 21:13
@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from 51ece85 to f7854bc on October 14, 2025 21:22