Implicit tolerations #5282

Open
johnbelamaric opened this issue May 6, 2025 · 12 comments
Labels
sig/scheduling · wg/device-management

Comments

@johnbelamaric
Member

johnbelamaric commented May 6, 2025

Enhancement Description

Administrators often taint nodes that have high-value resources like GPUs, to prevent those nodes from being consumed by workloads that do not need them. To simplify the user experience, some platforms (e.g., GKE) run a webhook that automatically tolerates those taints if the pod has extended resource requests for those resources. This ensures that pods still run even if the user forgets to add the toleration, but only for those pods that actually need it.
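
For concreteness, here is a minimal sketch of what such a webhook does during admission; the extended resource name example.com/gpu and the taint key example.com/has-gpu are illustrative placeholders, not the values any particular platform uses:

import corev1 "k8s.io/api/core/v1"

// If the pod requests the (illustrative) GPU extended resource, inject a
// toleration for the (illustrative) GPU taint at admission time.
func maybeTolerate(pod *corev1.Pod) {
  for _, c := range pod.Spec.Containers {
    if _, ok := c.Resources.Requests["example.com/gpu"]; ok {
      pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
        Key:      "example.com/has-gpu",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
      })
      return
    }
  }
}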

With the advent of DRA, the exact needs of the workload are no longer determinable simply by looking at the PodSpec during API admission. Instead, the resource claims and device classes must also be examined. Additionally, the optionality available in DRA resource claim APIs may mean that several different types of nodes/resources (and therefore several different types of tolerations) are needed. A webhook does not have access to all the information it would need to add the tolerations at API admission time.

We discussed adding a "high value resource" aspect to node capabilities, but after further discussion it's not clear that's the right way to solve this problem. This enhancement request provides an alternative approach.

In this approach, we create a new scheduler plugin (or update the existing taints & tolerations plugin) which can be configured to examine the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time and, based on the needs of the workload, implicitly tolerate taints. Essentially, we move the behavior of the webhook from API admission time to Pod scheduling time, when all the necessary information is available.

The specific way to calculate the tolerations, and the taints they will tolerate, will likely need to be part of the scheduler plugin's configuration, since upstream does not know what those taints are or when/how they should be tolerated.

This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

/cc @pohly @klueska @pravk03 @dom4ha @dchen1107
/sig scheduling
/wg device-management

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@ffromani
Contributor

ffromani commented May 7, 2025

/cc

@ffromani
Contributor

ffromani commented May 7, 2025

I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem. IIUC, core resources (cpu, memory) are out of scope for this specific KEP and are expected to be handled as part of the node capabilities KEP, right?

@johnbelamaric
Member Author

The idea here is just to build on top of taints/tolerations for the "repel" use case. The "attract"/"constrain" use case - the automatic equivalent of label selection, basically - is not really covered here and would be part of the capabilities concept.

This KEP doesn't look at specifics; I think it should provide a framework that cluster architects can use to configure the scheduler to do what they want. I imagine a list of rules, defined in the scheduler configuration, that run CEL expressions against the PodSpec, ResourceClaims, and DeviceClasses. If the result is true, the pod is scheduled as if the user had put a specific toleration on it. For example, the plugin could accept a config with a list of data structures like:

type ImplicitTolerationRule struct {
  // Expression is a CEL expression evaluated against the PodSpec and
  // its associated ResourceClaims and DeviceClasses.
  Expression string
  // Toleration is applied implicitly when Expression evaluates to true.
  Toleration corev1.Toleration
}

It would evaluate each expression against the whole "package" of scheduling constraints: the PodSpec, all associated ResourceClaims, and their DeviceClasses. Any expression that returned true would result in that Toleration being applied during scheduling.
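
A minimal sketch of that evaluation loop, assuming a hypothetical evalCEL helper that compiles the expression and evaluates it against those inputs (the resourcev1 alias stands in for whichever version of the k8s.io/api/resource group applies):

import corev1 "k8s.io/api/core/v1"

// implicitTolerations returns the tolerations whose rule expressions
// evaluate to true for this pod and its DRA objects.
func implicitTolerations(rules []ImplicitTolerationRule, pod *corev1.Pod,
  claims []*resourcev1.ResourceClaim, classes []*resourcev1.DeviceClass) []corev1.Toleration {
  var out []corev1.Toleration
  for _, r := range rules {
    if evalCEL(r.Expression, pod, claims, classes) { // hypothetical helper
      out = append(out, r.Toleration)
    }
  }
  return out
}

The scheduler would then treat the returned tolerations exactly as if they were present in pod.Spec.Tolerations when evaluating node taints.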

@johnbelamaric
Member Author

The nice thing here is that because it relies on the existing "taints" functionality, users can still manually add a toleration if they need to run a Pod that does not use the high-value resources on a node that has them.

@pohly moved this from 🆕 New to 📋 Backlog in SIG Node: Dynamic Resource Allocation May 8, 2025
@pravk03

pravk03 commented May 8, 2025

/cc

> I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem.

@ffromani Considering that we already have information about core resources in the Node object (node.status.capacity, node.status.allocatable), are you envisioning more detailed information being published through NodeCapabilities? I am interested in understanding how you see node capabilities being useful with core resources, and would like to capture such requirements in the Node Capabilities proposal that I am working on.

@SergeyKanzhelev
Member

> This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

Does a third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?

@johnbelamaric
Member Author

johnbelamaric commented May 9, 2025

> This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

> Does a third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?

DRA would not implement taints. Adding the taints is up to the cluster provider, just as it is today. Adding the scheduler config to tell the scheduler when to add implicit tolerations would also be up to the cluster provider.

So - no, the third-party DaemonSet would not need to know anything about this. That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.

We may not do them all through CEL. For example, it would be easier to implement some things directly in Go and have a flag or policy field to control them. We'll have to sort that out in the KEP.

@SergeyKanzhelev
Member

> That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.

Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.

> the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time

So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you would need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were actually fine to ignore. Or will we list all possible devices that can be allocated along with their associated taints, and then see which combination results in all taints being ignored by this CEL rule? So would this be a filter inside the DRA scheduler?

@johnbelamaric
Member Author

johnbelamaric commented May 12, 2025

> Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.

I am not sure what you mean. If a DaemonSet needs a GPU, for example, it won't need to know about the taint. But if there are just random taints stuck on a node, then yes, the cluster admin will need to tolerate that taint if they want the DaemonSet to run on that node.

> So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you would need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were actually fine to ignore. Or will we list all possible devices that can be allocated along with their associated taints, and then see which combination results in all taints being ignored by this CEL rule? So would this be a filter inside the DRA scheduler?

So, here's an example, maybe that will help. Consider a platform where all nodes containing GPUs are tainted with a "has-gpu, NoSched" taint. The platform admin would configure the scheduler plugin with the following extra rules:

  1. If the extended resources contain a GPU request, implicitly tolerate "has-gpu, NoSched".
  2. If the Pod references a ResourceClaim with a DeviceClass that contains a GPU, implicitly tolerate "has-gpu, NoSched".

Now, rule 2 is not something that would be easy (or perhaps even possible) to express in CEL. Instead, I think we need some Go-based rule/policy that the admin could leverage. We need to sort that out in the KEP. One thing I could imagine is a rule that says "if this example device is part of any referenced DeviceClass, implicitly tolerate the taint".

In other words, I imagine the API to be a bit more than what I showed before. Maybe more like:

type ImplicitTolerationRule struct {
  // Selector determines whether this rule matches the Pod's
  // scheduling constraints.
  Selector RuleSelector
  // Toleration is applied implicitly when the selector matches.
  Toleration corev1.Toleration
}

type RuleSelector struct {
  // Type selects which of the fields below is used:
  // "ExtendedResource", "Device", or "CEL".
  Type string

  // For Type == "ExtendedResource": match if the Pod requests any of
  // these extended resources.
  ResourceNames []string

  // For Type == "Device": match if any referenced DeviceClass could
  // select a device like one of these prototypes.
  DevicePrototypes []resourcev1.Device

  // For Type == "CEL": match if this expression evaluates to true.
  Expression *string
}
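
To tie this back to the GPU example, here is a hedged sketch of how the two rules above might be written with this API (it assumes the types sketched above; the resource name, taint key, and device prototype contents are placeholders):

rules := []ImplicitTolerationRule{
  {
    // Rule 1: the pod requests a GPU via an extended resource.
    Selector: RuleSelector{
      Type:          "ExtendedResource",
      ResourceNames: []string{"example.com/gpu"},
    },
    Toleration: corev1.Toleration{
      Key:      "has-gpu",
      Operator: corev1.TolerationOpExists,
      Effect:   corev1.TaintEffectNoSchedule,
    },
  },
  {
    // Rule 2: the pod references a ResourceClaim whose DeviceClass
    // could select a device matching this prototype.
    Selector: RuleSelector{
      Type:             "Device",
      DevicePrototypes: []resourcev1.Device{ /* prototype GPU device */ },
    },
    Toleration: corev1.Toleration{
      Key:      "has-gpu",
      Operator: corev1.TolerationOpExists,
      Effect:   corev1.TaintEffectNoSchedule,
    },
  },
}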

@everpeace
Contributor

/cc

@cici37
Contributor

cici37 commented May 15, 2025

/assign

@Zeel-Patel

/assign
