
Conversation

Contributor

@mvegter mvegter commented Sep 15, 2025

Description

Configuring cores as part of the client reserved resources only limits scheduling on those cores; the reservation is not reflected in the available MHz bandwidth, which affects logic in places such as bin spread.

Testing & Reproduction steps

[sandbox@nomad-dev ~]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu'  | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="4ed2046a-e208-0751-2d2d-2bf4d966c140",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 278380
[sandbox@nomad-dev nomad]$ cat config.hcl
  client {
    reserved {
      # cpu = 1000
      # cores = "0-122"
    }
  }

versus

[sandbox@nomad-dev nomad]$ curl -s localhost:4646/v1/metrics?format=prometheus | egrep '^nomad_client_(un)?allocated_cpu' | grep ready
nomad_client_allocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 0
nomad_client_unallocated_cpu{datacenter="dc1",host="nomad-dev",node_class="none",node_id="dc93b9fe-e3ab-8058-5006-3ee5696c3e1e",node_pool="default",node_scheduling_eligibility="eligible",node_status="ready"} 2245
[sandbox@nomad-dev nomad]$ cat config.hcl
client {
  reserved {
    # cpu = 1000
    cores = "0-122"
  }
}
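
For reference, the numbers line up as follows (a back-of-the-envelope check using the metrics and config above):

278380 MHz total / 124 cores ≈ 2245 MHz per core
cores 0-122 reserved = 123 cores × 2245 MHz = 276135 MHz
expected remaining   = 1 core ≈ 2245 MHz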

So on a VM with 124 cores, with cores 0-122 reserved and only the last core left available for scheduling, we allegedly have 1 core, or 2245 MHz of cpu, available. While testing:

[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cpu = 3000
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- All tasks successfully allocated.

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cores = 2
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cores" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.

Mimicking the situation by reserving the same amount of bandwidth via cpu instead of cores, the placement does fail:

[sandbox@nomad-dev nomad]$ cat config.hcl
  client {
    reserved {
      cpu = 276135
    }
  }
[sandbox@nomad-dev nomad]$ cat job.hcl
job "redis-job" {
  type = "service"
  group "cache" {
    count = 1
    task "redis" {
      driver = "docker"
      config {
        image = "redis:latest"
      }
      resources {
        cpu = 3000
      }
    }
  }
}

[sandbox@nomad-dev nomad]$ nomad job plan job.hcl
+ Job: "redis-job"
+ Task Group: "cache" (1 create)
  + Task: "redis" (forces create)

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "cache" (failed to place 1 allocation):
    * Resources exhausted on 1 nodes
    * Dimension "cpu" exhausted on 1 nodes

Job Modify Index: 0
To submit the job with version verification run:

nomad job run -check-index 0 job.hcl

When running the job with the check-index flag, the job will only be run if the
job modify index given matches the server-side version. If the index has
changed, another user has modified the job and the plan's results are
potentially invalid.
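
Sanity check on the mimic numbers, using the figures above: 278380 MHz total − 276135 MHz reserved = 2245 MHz, i.e. the same one-core share that reserving cores 0-122 leaves behind, yet only the cpu-based reservation makes the 3000 MHz placement fail.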

Links

Contributor Checklist

  • Changelog Entry: If this PR changes user-facing behavior, please generate and add a
    changelog entry using the make cl command.
  • Testing: Please add tests to cover any new functionality or to demonstrate bug fixes and
    ensure regressions will be caught.
  • Documentation: If the change impacts user-facing functionality such as the CLI, API, UI,
    and job configuration, please update the Nomad website documentation to reflect this. Refer to
    the website README for docs guidelines. Please also consider whether the
    change requires notes within the upgrade guide.

Reviewer Checklist

  • Backport Labels: Please add the correct backport labels as described by the internal
    backporting document.
  • Commit Type: Ensure the correct merge method is selected, which should be "squash and merge"
    in the majority of situations. The main exceptions are long-lived feature branches or merges where
    history should be preserved.
  • Enterprise PRs: If this is an enterprise-only PR, please add any required changelog entry
    within the public repository.
  • If a change needs to be reverted, we will roll out an update to the code within 7 days.

Changes to Security Controls

Are there any changes to security controls (access controls, encryption, logging) in this pull request? If so, explain.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from 505ac99 to ab9cb3f on September 15, 2025 09:45
Member

@tgross tgross left a comment


@mvegter before you get too much further with this and/or we get into a detailed review, could you help us out by describing the bug in detail in the PR description?

@tgross tgross self-assigned this Sep 15, 2025
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Sep 15, 2025
Contributor Author

mvegter commented Sep 16, 2025

Hey @tgross, I like the speed at which you pick up on draft PRs 😉 I have added a rudimentary description covering the basics of the problem at hand and a reproducer for latest main (plus some initial tests). I'm changing work laptops by the end of the week, hence the premature push, so I probably won't be able to continue working on it anyhow.

Contributor Author

mvegter commented Sep 29, 2025

Hey @tgross, before I pick this up again, would you prefer that I first create a GitHub issue, or do you perhaps already have feedback on the direction of this solution?

Member

tgross commented Sep 29, 2025

Sorry, I got sidetracked and didn't get a chance to re-review your updated description here. No need to open a new GitHub issue for it; we can discuss here.

The overall problem you're describing makes sense. It looks like the node is fingerprinting the usable compute as I'd expect. For example, this instance has 22 cores and a total of 25400 MHz (a mix of pCores and eCores). If I reserve one core:

client {
  reserved {
    cores = "0"
  }
}

And as your AvailableResources method shows, the node fingerprint does reflect the correct values:

$ nomad node status -self -verbose | grep compute
cpu.totalcompute                         = 25400
cpu.usablecompute                        = 24000
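
(The 1400 MHz delta between totalcompute and usablecompute, 25400 − 24000, is presumably the reserved core's share being excluded.)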

It looks like your approach is to tweak the subtraction of comparable resources after the fact. And it looks like this has broken some tests. Wouldn't it be better to make sure that the comparable resources actually include the usable CPU (less the reserved cores) to begin with, rather than patching that up?

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch 4 times, most recently from 2296998 to b8e7428 on October 3, 2025 20:48
@mvegter mvegter changed the title from "scheduler: Fixed a bug where the bandwidth of reserved cores was not taken into account" to "client: fixed a bug where the bandwidth of reserved cores were not taken into account" Oct 3, 2025
@mvegter mvegter marked this pull request as ready for review October 6, 2025 11:10
@mvegter mvegter requested review from a team as code owners October 6, 2025 11:10
@mvegter mvegter requested a review from tgross October 6, 2025 11:10
Member

tgross commented Oct 8, 2025

@mvegter I saw that you pushed an update. I'm going to try to get this reviewed this week.

Member

@tgross tgross left a comment


Hi @mvegter! This is looking pretty good. In addition to my comments inline, I think the test changes aren't actually exercising the behavior you expect. If I drop the nomad/structs/funcs_test.go into main, it still passes all the tests, which suggests there's a gap there.

nodeMem -= float64(reserved.Flattened.Memory.MemoryMB)
}
available := node.NodeResources.Comparable()
available.Subtract(node.ReservedResources.Comparable())
Member


As I've noted in my comment on nomad/structs/structs.go, we've changed the return value of NodeResources.Comparable() so that it now includes the unreserved resources. But then we have to subtract the ReservedResources.Comparable() here and on line 195?

Contributor Author


Thanks for the review @tgross, I fixed up most of your comments. As we are iterating on this one, would you prefer the fix to live in node.NodeResources.Comparable() so that it reflects the usable resources, dropping Comparable on the ReservedResources? That would be a slightly bigger change, as NodeResources has no view of the reserved resources (except the leaking CPU details in Processors.Topology). Or should we create a new method, e.g. node.Comparable(), that would essentially convert the NodeResources and ReservedResources into a single ComparableResources?

If I drop the nomad/structs/funcs_test.go into main, it still passes all the tests, which suggests there's a gap there.

Not sure I see the gap you are referring to, as funcs_test.go covers the cases after fingerprinting; the change in CpuShares behaviour is more accurately captured in client_test.go.

Member


That would be a slightly bigger change, as NodeResources has no view of the reserved resources (except the leaking CPU details in Processors.Topology). Or should we create a new method, e.g. node.Comparable(), that would essentially convert the NodeResources and ReservedResources into a single ComparableResources?

Oof, this data model is ugly, isn't it? 😁 Another alternative would be to have NodeResources.Comparable() take a parameter for the reserved resources:

func (n *NodeResources) Comparable(reserved *ComparableResources) *ComparableResources {

That way we'd force the caller to make sure they're correctly handling the reserved resources, while making it clear this doesn't include resources for allocations. That's still pretty ugly though.

Tell you what: none of this impacts the functionality and is just my aesthetic preference, so we can land this PR without worrying about that.

Not sure I see the gap you are referring to, as funcs_test.go covers the cases after fingerprinting; the change in CpuShares behaviour is more accurately captured in client_test.go.

I guess I'm asking why bother changing the tests in funcs_test.go if all the tests in nomad/structs still pass when those test changes are dropped into the main branch? What are we testing here? Is this just belt-and-suspenders to make sure we don't introduce nil pointer bugs later on? (That's fine if so.)

Contributor Author


Tell you what: none of this impacts the functionality and is just my aesthetic preference, so we can land this PR without worrying about that.

Given that NodeResources.Comparable and ReservedResources.Comparable are always called sequentially, essentially merging the two into a Node.Comparable and dropping their respective methods may be the cleanest approach for now.
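
For concreteness, a rough sketch of that shape, reusing the Comparable and Subtract helpers already shown above (nil-handling and naming illustrative only, not the final implementation):

func (n *Node) Comparable() *ComparableResources {
	if n == nil || n.NodeResources == nil {
		return nil
	}
	// start from the fingerprinted node resources ...
	c := n.NodeResources.Comparable()
	// ... and fold the client-reserved share in here, so callers no longer
	// need to subtract ReservedResources.Comparable() themselves
	if n.ReservedResources != nil {
		c.Subtract(n.ReservedResources.Comparable())
	}
	return c
}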

What are we testing here? Is this just belt-and-suspenders to make sure we don't introduce nil pointer bugs later on? (That's fine if so.)

These are the first usages of "disabled" cores. The Disabled field is currently only cosmetic, as none of the tests check Topology directly, but the actual bandwidth and the reserved resources are now tied together and will affect test results.

Member


essentially merging the two into a Node.Comparable and dropping their respective methods may be the cleanest approach for now.

Go for it.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from b8e7428 to 7f90a6f on October 13, 2025 09:15
@mvegter mvegter changed the title from "client: fixed a bug where the bandwidth of reserved cores were not taken into account" to "scheduler: fixed a bug where the bandwidth of reserved cores were not taken into account" Oct 13, 2025
@tgross tgross added the backport/ent/1.8.x+ent (Changes are backported to 1.8.x+ent), backport/ent/1.9.x+ent (Changes are backported to 1.9.x+ent), and backport/1.10.x (backport to 1.10.x release line) labels Oct 14, 2025
@tgross
Copy link
Member

tgross commented Oct 14, 2025

I've verified this impacts releases back to 1.8.x+ent, so I've added the appropriate backport labels.

@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch 4 times, most recently from 957d379 to 51ece85 on October 14, 2025 21:11
@mvegter mvegter marked this pull request as draft October 14, 2025 21:13
@mvegter mvegter force-pushed the mvegter-scheduler-fix-reserved-resource-calculation-for-cores branch from 51ece85 to f7854bc on October 14, 2025 21:22