Implicit tolerations #5282
Comments
/cc
I see the use case for devices or non-core resources. I'm thinking especially about exclusive CPUs, but memory/hugepages can have the same problem. IIUC core resources (cpu, memory) are out of scope of this specific KEP and are expected to be handled as part of the node capabilities KEP, right?
The idea here is just to build on top of taints/tolerations for the "repel" use case. The "attract"/"constrain" use case - the automatic equivalent of label selection, basically - is not really covered here and would be part of the capabilities concept. This KEP doesn't need to get into specifics; I think it should provide a framework that cluster architects can use to configure the scheduler to do what they want. I imagine a list of rules that could be defined for the scheduler, which run CEL expressions against the PodSpec, ResourceClaims, and DeviceClasses. If the result is "true", the pod is scheduled as if the user had put a specific toleration on it. For example, the plugin could accept a config with a list of data structures like:

type ImplicitTolerationRule struct {
	Expression string
	Toleration corev1.Toleration
}

It would evaluate each expression against the whole "package" of scheduling constraints: the PodSpec, all associated ResourceClaims, and the associated DeviceClasses for those. Any expression that returned "true" would result in that Toleration being applied during scheduling.
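For illustration only, a concrete rule under that sketch could look like the following. The CEL variable name (deviceClasses), the DeviceClass name, and the taint key are invented for the example and are not part of any existing API:

```go
package example

import corev1 "k8s.io/api/core/v1"

// ImplicitTolerationRule mirrors the sketch above; it is not an existing API type.
type ImplicitTolerationRule struct {
	Expression string
	Toleration corev1.Toleration
}

// Hypothetical rule: if any referenced DeviceClass is the (made-up) "gpu.example.com"
// class, schedule the Pod as if it carried a toleration for the "example.com/has-gpu" taint.
var gpuRule = ImplicitTolerationRule{
	Expression: `deviceClasses.exists(dc, dc.metadata.name == "gpu.example.com")`,
	Toleration: corev1.Toleration{
		Key:      "example.com/has-gpu",
		Operator: corev1.TolerationOpExists,
		Effect:   corev1.TaintEffectNoSchedule,
	},
}
```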
The nice thing here is that because it relies on the existing "taints" functionality, users can still manually add a toleration if they need to run a Pod that does not use the high-value resources on a node that has them.
/cc
@ffromani Considering that we have information about core resources available in the Node object (node.status.capacity, node.status.allocatable), are you considering more detailed information being published through NodeCapabilities? I am interested in understanding how you envision node capabilities being useful with core resources. I would like to capture such requirements in the Node Capabilities proposal that I am working on.
Does the third-party DaemonSet need to set tolerations for all well-known DRA "plugins"? Or will there be universal tolerations for all DRA-implied taints?
DRA would not implement taints. Adding the taints is up to the cluster provider, just as it is today. Adding the scheduler config to tell the scheduler when to add implicit tolerations would also be up to the cluster provider. So - no, the third-party DaemonSet would not need to know anything about this. That's kind of the point - let the cluster provider, who knows what taints they set and for what reason, manage the way to add the tolerations. We may not do them all through CEL. For example, it would be easier to implement some things directly in Go and have a flag or policy field to control them. We'll have to sort that out in the KEP.
Makes sense. So third-party DaemonSets need to know about all vendor-defined taints, as they do today.
So Devices will be associated with the Pod independently of taints? How does the scheduler work today? Will it check taints first and then allocate devices? In that case, you would need to ignore all taints first, try to allocate devices, then re-check which of the ignored taints were fine to ignore. Or will we list all possible devices that can be allocated, with the associated taints, and then see which combination results in all taints being tolerated by this CEL rule? So this will be a filter inside the DRA scheduler?
I am not sure what you mean? If a DaemonSet needs a GPU, for example, it won't need to know about the taint. But if there are just random taints stuck on a node, then, yes, the cluster admin will need to tolerate that taint if they want the DaemonSet to run on that node.
So, here's an example, maybe that will help. Consider a platform where all nodes containing GPUs are tainted with a "has-gpu, NoSched" taint. The platform admin would configure the scheduler plugin with the following extra rules:
So, that 2) is not something that would be easy (or even possible, perhaps) to express in CEL. Instead, I think we need some Go-based rule/policy that the admin could leverage. We need to sort that out in the KEP. One thing I could imagine is a rule that says "if this example device is part of any referenced DeviceClass, implicitly tolerate the taint". In other words, I imagine the API to be a bit more than what I showed before. Maybe more like:

type ImplicitTolerationRule struct {
	Selector   RuleSelector
	Toleration corev1.Toleration
}

type RuleSelector struct {
	Type string
	// for Type == 'ExtendedResource'
	ResourceNames []string
	// for Type == 'Device'
	DevicePrototypes []resourcev1.Device
	// for Type == 'CEL'
	Expression *string
}
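Purely as an illustration of how such selectors might be configured, here is a hypothetical set of rules; the taint key, extended resource name, and DeviceClass name are placeholders, and the Device variant is omitted since its prototype type depends on the DRA API version in use:

```go
package example

import corev1 "k8s.io/api/core/v1"

// Illustrative only; mirrors the RuleSelector sketch above (Device variant omitted).
type RuleSelector struct {
	Type          string
	ResourceNames []string // for Type == "ExtendedResource"
	Expression    *string  // for Type == "CEL"
}

type ImplicitTolerationRule struct {
	Selector   RuleSelector
	Toleration corev1.Toleration
}

var gpuClassExpr = `deviceClasses.exists(dc, dc.metadata.name == "gpu.example.com")`

// Hypothetical config: tolerate the GPU taint when the Pod requests the extended
// resource directly, or when any referenced DeviceClass is the GPU class.
var exampleRules = []ImplicitTolerationRule{
	{
		Selector: RuleSelector{Type: "ExtendedResource", ResourceNames: []string{"example.com/gpu"}},
		Toleration: corev1.Toleration{
			Key:      "example.com/has-gpu",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		},
	},
	{
		Selector: RuleSelector{Type: "CEL", Expression: &gpuClassExpr},
		Toleration: corev1.Toleration{
			Key:      "example.com/has-gpu",
			Operator: corev1.TolerationOpExists,
			Effect:   corev1.TaintEffectNoSchedule,
		},
	},
}
```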
/cc
/assign |
/assign |
Enhancement Description
Administrators often taint nodes with high-value resources like GPUs, to avoid them being consumed by workloads that do not need them. To simplify the user experience, some platforms (e.g., GKE) run a webhook to automatically tolerate those taints, if the pods have extended resource requests for those resources. This ensures that pods still run even if the user forgets to add the toleration, but only for those pods that actually need it.
With the advent of DRA, the exact needs of the workload are no longer determinable simply by looking at the PodSpec during API admission. Instead, the resource claims and device classes must also be examined. Additionally, the optionality available in DRA resource claim APIs may mean that several different types of nodes/resources (and therefore several different types of tolerations) are needed. A webhook does not have access to all the information it would need to add the tolerations at API admission time.
We discussed adding a "high value resource" aspect to node capabilities, but after further discussion it's not clear that's the right way to solve this problem. This enhancement request provides an alternative approach.
In this approach, we create a new scheduler plugin (or update the existing taints & tolerations plugin), which can be configured to examine the PodSpec and all associated ResourceClaims and DeviceClasses at scheduling time and, based on the needs of the workload, implicitly tolerate taints. Essentially, we move the behavior of the webhook from API server admission time to Pod scheduling time. This allows all necessary information to be available.
The specific way to calculate the tolerations, and the taints they will tolerate, will likely need to be part of the scheduler plugin's configuration, since upstream cannot know what those taints are or when and how they should be tolerated.
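As a very rough sketch of where this logic could sit (illustrative only; the helper names are made up, and the real design is left to the KEP), the plugin would merge the implicit tolerations derived from its configured rules into the usual taint check:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	v1helper "k8s.io/component-helpers/scheduling/corev1"
)

// effectiveTolerations merges tolerations derived from matching implicit rules
// (computed elsewhere from the PodSpec, ResourceClaims, and DeviceClasses) with
// the tolerations the user put on the Pod.
func effectiveTolerations(pod *corev1.Pod, implicit []corev1.Toleration) []corev1.Toleration {
	merged := append([]corev1.Toleration{}, pod.Spec.Tolerations...)
	return append(merged, implicit...)
}

// tolerated is the usual taint check, run against the merged list instead of
// pod.Spec.Tolerations alone.
func tolerated(nodeTaints []corev1.Taint, tolerations []corev1.Toleration) bool {
	_, untolerated := v1helper.FindMatchingUntoleratedTaint(nodeTaints, tolerations,
		func(t *corev1.Taint) bool {
			// Only NoSchedule/NoExecute taints matter for scheduling decisions.
			return t.Effect == corev1.TaintEffectNoSchedule || t.Effect == corev1.TaintEffectNoExecute
		})
	return !untolerated
}
```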
This approach requires no new user-facing APIs, and Pods that must run on tainted nodes but do not actually need the specialized device (such as management pods) can still be given the appropriate tolerations explicitly.
/cc @pohly @klueska @pravk03 @dom4ha @dchen1107
/sig scheduling
/wg device-management
- KEP (k/enhancements) update PR(s):
- Code (k/k) update PR(s):
- Docs (k/website) update PR(s):

Please keep this description up to date. This will help the Enhancement Team to track the evolution of the enhancement efficiently.