Implicit tolerations #5282

Open
johnbelamaric opened this issue May 6, 2025 · 12 comments
Labels
sig/scheduling · wg/device-management

Comments

@johnbelamaric
Member

johnbelamaric commented May 6, 2025

Enhancement Description

Administrators often taint nodes that have high-value resources like GPUs, to prevent those nodes from being consumed by workloads that do not need them. To simplify the user experience, some platforms (e.g., GKE) run a webhook that automatically tolerates those taints if the pod has extended resource requests for those resources. This ensures that pods still run even if the user forgets to add the toleration, but only for those pods that actually need it.
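
For concreteness, here is a minimal sketch of what such a webhook does during admission; the extended resource name example.com/gpu and the taint key example.com/has-gpu are illustrative placeholders, not the values any particular platform uses:

import corev1 "k8s.io/api/core/v1"

// If the pod requests the (illustrative) GPU extended resource, inject a
// toleration for the (illustrative) GPU taint at admission time.
func maybeTolerate(pod *corev1.Pod) {
  for _, c := range pod.Spec.Containers {
    if _, ok := c.Resources.Requests["example.com/gpu"]; ok {
      pod.Spec.Tolerations = append(pod.Spec.Tolerations, corev1.Toleration{
        Key:      "example.com/has-gpu",
        Operator: corev1.TolerationOpExists,
        Effect:   corev1.TaintEffectNoSchedule,
      })
      return
    }
  }
}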

With the advent of DRA, the exact needs of the workload are no longer determinable simply by looking at the PodSpec during API admission. Instead, the resource claims and device classes must also be examined. Additionally, the optionality available in DRA resource claim APIs may mean that several different types of nodes/resources (and therefore several different types of tolerations) are needed. A webhook does not have access to all the information it would need to add the tolerations at API admission time.

We discussed adding a "high value resource" aspect to node capabilities, but after further discussion it's not clear that's the right way to solve this problem. This enhancement request provides an alternative approach.

In this approach, we create a new scheduler plugin (or update the existing taints & tolerations plugin) which can be configured to examine the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time and, based on the needs of the workload, implicitly tolerate taints. Essentially, we move the behavior of the webhook from API admission time to Pod scheduling time, when all the necessary information is available.

The specific way to calculate the tolerations, and the taints they will tolerate, will likely need to be part of the scheduler plugin's configuration, since upstream does not know what those taints are or when/how they should be tolerated.

This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

/cc @pohly @klueska @pravk03 @dom4ha @dchen1107
/sig scheduling
/wg device-management

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.

@ffromani
Contributor

ffromani commented May 7, 2025

/cc

@ffromani
Contributor

ffromani commented May 7, 2025

I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem. IIUC, core resources (cpu, memory) are out of scope for this specific KEP and are expected to be handled as part of the node capabilities KEP, right?

@johnbelamaric
Member Author

The idea here is just to build on top of taints/tolerations for the "repel" use case. The "attract"/"constrain" use case - the automatic equivalent of label selection, basically - is not really covered here and would be part of the capabilities concept.

This KEP doesn't look at specifics; I think it should provide a framework that cluster architects can use to configure the scheduler to do what they want. I imagine a list of rules, defined in the scheduler configuration, that run CEL expressions against the PodSpec, ResourceClaims, and DeviceClasses. If the result is true, the pod is scheduled as if the user had put a specific toleration on it. For example, the plugin could accept a config with a list of data structures like:

type ImplicitTolerationRule struct {
  // Expression is a CEL expression evaluated against the PodSpec and
  // its associated ResourceClaims and DeviceClasses.
  Expression string
  // Toleration is applied implicitly when Expression evaluates to true.
  Toleration corev1.Toleration
}

It would evaluate each expression against the whole "package" of scheduling constraints: the PodSpec, all associated ResourceClaims, and their DeviceClasses. Any expression that returned true would result in that Toleration being applied during scheduling.
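
A minimal sketch of that evaluation loop, assuming a hypothetical evalCEL helper that compiles the expression and evaluates it against those inputs (the resourcev1 alias stands in for whichever version of the k8s.io/api/resource group applies):

import corev1 "k8s.io/api/core/v1"

// implicitTolerations returns the tolerations whose rule expressions
// evaluate to true for this pod and its DRA objects.
func implicitTolerations(rules []ImplicitTolerationRule, pod *corev1.Pod,
  claims []*resourcev1.ResourceClaim, classes []*resourcev1.DeviceClass) []corev1.Toleration {
  var out []corev1.Toleration
  for _, r := range rules {
    if evalCEL(r.Expression, pod, claims, classes) { // hypothetical helper
      out = append(out, r.Toleration)
    }
  }
  return out
}

The scheduler would then treat the returned tolerations exactly as if they were present in pod.Spec.Tolerations when evaluating node taints.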

@johnbelamaric
Member Author

The nice thing here is that because it relies on the existing "taints" functionality, users can still manually add a toleration if they need to run a Pod that does not use the high-value resources on a node that has them.

@pohly moved this from 🆕 New to 📋 Backlog in SIG Node: Dynamic Resource Allocation May 8, 2025
@pravk03

pravk03 commented May 8, 2025

/cc

> I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem.

@ffromani Considering that we already have information about core resources in the Node object (node.status.capacity, node.status.allocatable), are you envisioning more detailed information being published through NodeCapabilities? I am interested in understanding how you see node capabilities being useful with core resources, and would like to capture such requirements in the Node Capabilities proposal that I am working on.

@SergeyKanzhelev
Member

> This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

Does a third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?

@johnbelamaric
Member Author

johnbelamaric commented May 9, 2025

> This approach requires no new user-facing APIs, and still allows Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) to be configured with the appropriate tolerations explicitly.

> Does a third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?

DRA would not implement taints. Adding the taints is up to the cluster provider, just as it is today. Adding the scheduler config to tell the scheduler when to add implicit tolerations would also be up to the cluster provider.

So - no, the third-party DaemonSet would not need to know anything about this. That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.

We may not do them all through CEL. For example, it would be easier to implement some things directly in Go and have a flag or policy field to control them. We'll have to sort that out in the KEP.

@SergeyKanzhelev
Member

> That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations.

Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.

> the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time

So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you would need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were actually fine to ignore. Or will we list all possible devices that can be allocated along with their associated taints, and then see which combination results in all taints being ignored by this CEL rule? So would this be a filter inside the DRA scheduler?

@johnbelamaric
Member Author

johnbelamaric commented May 12, 2025

> Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.

I am not sure what you mean. If a DaemonSet needs a GPU, for example, it won't need to know about the taint. But if there are just random taints stuck on a node, then yes, the cluster admin will need to tolerate that taint if they want the DaemonSet to run on that node.

> So Devices will be associated with the Pod independently from taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you would need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were actually fine to ignore. Or will we list all possible devices that can be allocated along with their associated taints, and then see which combination results in all taints being ignored by this CEL rule? So would this be a filter inside the DRA scheduler?

So, here's an example, maybe that will help. Consider a platform where all nodes containing GPUs are tainted with a "has-gpu, NoSched" taint. The platform admin would configure the scheduler plugin with the following extra rules:

  1. If the extended resources contain a GPU request, implicitly tolerate "has-gpu, NoSched".
  2. If the Pod references a ResourceClaim with a DeviceClass that contains a GPU, implicitly tolerate "has-gpu, NoSched".

Now, rule 2 is not something that would be easy (or perhaps even possible) to express in CEL. Instead, I think we need some Go-based rule/policy that the admin could leverage. We need to sort that out in the KEP. One thing I could imagine is a rule that says "if this example device is part of any referenced DeviceClass, implicitly tolerate the taint".

In other words, I imagine the API to be a bit more than what I showed before. Maybe more like:

type ImplicitTolerationRule struct {
  // Selector determines whether this rule matches the Pod's
  // scheduling constraints.
  Selector RuleSelector
  // Toleration is applied implicitly when the selector matches.
  Toleration corev1.Toleration
}

type RuleSelector struct {
  // Type selects which of the fields below is used:
  // "ExtendedResource", "Device", or "CEL".
  Type string

  // For Type == "ExtendedResource": match if the Pod requests any of
  // these extended resources.
  ResourceNames []string

  // For Type == "Device": match if any referenced DeviceClass could
  // select a device like one of these prototypes.
  DevicePrototypes []resourcev1.Device

  // For Type == "CEL": match if this expression evaluates to true.
  Expression *string
}
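
To tie this back to the GPU example, here is a hedged sketch of how the two rules above might be written with this API (it assumes the types sketched above; the resource name, taint key, and device prototype contents are placeholders):

rules := []ImplicitTolerationRule{
  {
    // Rule 1: the pod requests a GPU via an extended resource.
    Selector: RuleSelector{
      Type:          "ExtendedResource",
      ResourceNames: []string{"example.com/gpu"},
    },
    Toleration: corev1.Toleration{
      Key:      "has-gpu",
      Operator: corev1.TolerationOpExists,
      Effect:   corev1.TaintEffectNoSchedule,
    },
  },
  {
    // Rule 2: the pod references a ResourceClaim whose DeviceClass
    // could select a device matching this prototype.
    Selector: RuleSelector{
      Type:             "Device",
      DevicePrototypes: []resourcev1.Device{ /* prototype GPU device */ },
    },
    Toleration: corev1.Toleration{
      Key:      "has-gpu",
      Operator: corev1.TolerationOpExists,
      Effect:   corev1.TaintEffectNoSchedule,
    },
  },
}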

@everpeace
Contributor

/cc

@cici37
Contributor

cici37 commented May 15, 2025

/assign

@Zeel-Patel

/assign
