WIP: Introduce Node Lifecycle WG #8396
Conversation
/hold
Looks like I'm not a member of the kubernetes org anymore. I was a few years back, but didn't keep up with contributions recently. You can remove me as a lead and I can reapply after some contributions to this WG.
We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.
/cc
### In scope
- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs |
Is it worth including something about DRA device taints/drains?
Seems relevant to me as this affects the pod and device/node lifecycle. @pohly what do you think about including and discussing kubernetes/enhancements#5055 in the WG?
Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.
So I guess it depends?
From what I understand about device taints, they are a way to make device health scheduler-aware. This fits into our scope because we need a way to decide which Node should be prioritized for maintenance and a plan to drain that Node. E.g. a Node with all its devices tainted is a great target for Node maintenance. However, I think we should hold off on the DRA device taints feature until we discuss the Node Maintenance and Node Drain designs. I don't think it needs to be called out in the scope, as Node Maintenance and Node Drain should cover it.
We would also like to handle the pod lifecycle better in any descheduling scenario (not just Node Drain/Maintenance). One option is to use the EvictionRequest API which should give more power to the applications that are being disrupted. So it might be interesting to see if we can make the disruption more graceful in 5055.
Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.
The other end of this to consider, though, is devices that span multiple nodes.
I have added it to the WG. I think it would be good to discuss this feature and its impact. Also, it might be better to hold off on beta for some time.
As per the KubeCon discussion, I would like to be part of the working group as a representative of SCL and Cluster API.
Cool, thank you for joining the WG @fabriziopandini. Added.
- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
accelerators) and either prevent termination altogether or have a reliable migration path.
Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can
take hours if not days.
Can users not already do this with a PDB? Are we suggesting that node maintenance would override blocking PDBs if they block for some extended period of time?
I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think PDBs have a different use-case, so we may need to reword. The PodDisruptionBudget protects the availability of the application. What we're saying is that there's no API that protects both the availability of the infrastructure and the availability of the application. E.g. an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish. It's in the best interest of the application and infrastructure provider that an admin remediates the accelerator, so admin-user mutually agree on when that can occur.
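For reference, this is the kind of object being contrasted here: a PDB only constrains voluntary disruptions of the application's replicas. A minimal client-go sketch, with all names and values hypothetical:

```go
package example

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/intstr"
	"k8s.io/client-go/kubernetes"
)

// createPDB creates a PodDisruptionBudget that keeps at least two "my-app"
// pods running during voluntary disruptions such as node drain. It protects
// application availability only; it expresses nothing about node or
// infrastructure intent.
func createPDB(ctx context.Context, clientset kubernetes.Interface) error {
	minAvailable := intstr.FromInt(2)

	pdb := &policyv1.PodDisruptionBudget{
		ObjectMeta: metav1.ObjectMeta{Name: "my-app-pdb", Namespace: "default"},
		Spec: policyv1.PodDisruptionBudgetSpec{
			MinAvailable: &minAvailable,
			Selector: &metav1.LabelSelector{
				MatchLabels: map[string]string{"app": "my-app"}, // hypothetical app label
			},
		},
	}

	_, err := clientset.PolicyV1().PodDisruptionBudgets("default").
		Create(ctx, pdb, metav1.CreateOptions{})
	return err
}
```

As the comment above notes, such a budget says "keep N replicas running" but cannot express "stop scheduling new work here, yet let the current pods finish".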
Yeah I agree there's a problem that eviction API / drain doesn't guarantee it will finish within a reasonable time, especially if the node is having issues (things get stuck terminating etc.).
But at least we can do this today, right?
an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish
You can just taint the node or the devices with NoSchedule.
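A minimal client-go sketch of that workaround: adding a NoSchedule taint keeps new pods off the node while leaving the running ones untouched. The taint key is hypothetical, and a production implementation would use a patch or retry on conflict:

```go
package example

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// taintNodeNoSchedule adds a NoSchedule taint so no new pods are scheduled
// onto the node while the currently running pods keep running.
func taintNodeNoSchedule(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}

	node.Spec.Taints = append(node.Spec.Taints, corev1.Taint{
		Key:    "example.com/accelerator-degraded", // hypothetical taint key
		Effect: corev1.TaintEffectNoSchedule,
	})

	// Note: a bare Update can race with other writers; real code should
	// retry on conflict or use a strategic merge patch.
	_, err = clientset.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```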
You can just taint the node or the devices with NoSchedule.
Yes, that is a solution assuming that:
- workloads will eventually drain if we wait long enough
- all termination steps will be successful
However, 1) can theoretically always work but it is the slowest possible solution and 2) is not guaranteed to work.
I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?
This is a good example. We want the applications/admins to be aware of upcoming maintenance, but also the pods in most descheduling scenarios, so that they are given an opportunity to migrate or clean up before termination, which is hard to do with PDBs.
The goal is not to override PDBs (it is also hard to do without breaking someone), the goal is to have a smarter layer above the PDBs.
If the eviction API has been used and the terminationGracePeriod has passed, isn't a force kill executed? Is there a scenario where the pod gets stuck for days/weeks in this case?
Just the eviction API alone can cause pods to get stuck. All in all, I would prefer we do not dive deep into the topic and focus mostly on the scope in this PR.
Hmm, out of curiosity, can you please share any reference issues where the eviction API itself gets stuck? While I understand the scope of this WG to an extent, I feel the need for a new WG itself arose due to the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still does not address half of the issues in this area, it does not serve the purpose.
When the eviction gets through, i.e. results in a delete call, then everything works fine. A stuck eviction can be caused by (a minimal sketch follows this comment):
- PDB itself: that is the pods don't satisfy the constraints (and never will)
- catching the eviction request via a validation webhook and discarding it (e.g. some projects from https://github.com/atiratree/kube-enhancements/blob/improve-node-maintenance/keps/sig-apps/4212-declarative-node-maintenance/README.md#motivation use it)
the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still does not address half of the issues in this area, it does not serve the purpose.
Yes, this is one of the main goals, and it should be solved even before we introduce a node maintenance API, IMO. Added it to the goals to make this clear.
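To make the first cause above concrete, here is a minimal client-go sketch of a single eviction attempt. While the PDB cannot be satisfied, the API server keeps answering 429 and nothing is deleted, so a naive drain loop that retries can wait forever (pod and namespace names are hypothetical):

```go
package example

import (
	"context"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictOnce asks the API server to evict a pod. If the eviction would
// violate a PodDisruptionBudget, the call fails with 429 (TooManyRequests)
// and the pod keeps running; drain tooling typically retries, which is how
// a drain can block indefinitely when the PDB can never be satisfied.
func evictOnce(ctx context.Context, clientset kubernetes.Interface, namespace, podName string) (blockedByPDB bool, err error) {
	eviction := &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Name: podName, Namespace: namespace},
	}

	err = clientset.PolicyV1().Evictions(namespace).Evict(ctx, eviction)
	if apierrors.IsTooManyRequests(err) {
		// Blocked by a PodDisruptionBudget; nothing was deleted.
		return true, nil
	}
	return false, err
}
```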
- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
to the underlying storage and completing storage cleanup routines.
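For context on the quoted user story: the closest existing knobs are a preStop hook plus terminationGracePeriodSeconds. A minimal sketch of a pod spec using them; the image, script path, and timings are hypothetical:

```go
package example

import (
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// gracefulPod sketches the existing mechanism: on termination the kubelet
// runs the preStop hook and then sends SIGTERM, and the whole sequence is
// capped by terminationGracePeriodSeconds before a force kill.
func gracefulPod() *corev1.Pod {
	grace := int64(300) // 5 minutes; the kubelet force-kills after this

	return &corev1.Pod{
		ObjectMeta: metav1.ObjectMeta{Name: "cache-writer", Namespace: "default"},
		Spec: corev1.PodSpec{
			TerminationGracePeriodSeconds: &grace,
			Containers: []corev1.Container{{
				Name:  "app",
				Image: "example.com/cache-writer:latest", // hypothetical image
				Lifecycle: &corev1.Lifecycle{
					PreStop: &corev1.LifecycleHandler{
						// Hypothetical script that drains connections and
						// flushes cached writes before the container exits.
						Exec: &corev1.ExecAction{
							Command: []string{"/bin/sh", "-c", "/app/drain-and-flush.sh"},
						},
					},
				},
			}},
		},
	}
}
```

As the story argues, this breaks down when cleanup takes hours or days, because the grace period caps the whole sequence and nothing coordinates it with endpoint removal.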
In some scenarios, evicting/removing certain pods would prevent some of these operations. Consider if we were also evicting DaemonSet pods as an option (is that a goal? I wasn't sure); then we might need some ordering to make sure the CSI driver or CNI driver isn't removed until certain other cleanup has happened.
The way we handle this is that our internal drain API has a label selector for things to ignore and by default just ignores DaemonSets instead of worrying about the ordering here. DaemonSets aren't really supported under drain anyway. That should handle most cases for system-level things like CSI/CNI/etc.
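For comparison, the upstream drain library can be configured the same way; a minimal sketch using k8s.io/kubectl/pkg/drain that skips DaemonSet-managed pods (CSI/CNI drivers, etc.), with illustrative values:

```go
package example

import (
	"context"
	"os"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/kubectl/pkg/drain"
)

// drainNode cordons a node and then evicts/deletes its pods, leaving
// DaemonSet-managed pods in place rather than trying to order them.
func drainNode(ctx context.Context, clientset kubernetes.Interface, nodeName string) error {
	helper := &drain.Helper{
		Ctx:                 ctx,
		Client:              clientset,
		IgnoreAllDaemonSets: true, // skip DaemonSet pods instead of ordering them
		DeleteEmptyDirData:  true,
		GracePeriodSeconds:  -1, // use each pod's own terminationGracePeriodSeconds
		Timeout:             5 * time.Minute,
		Out:                 os.Stdout,
		ErrOut:              os.Stderr,
	}

	node, err := clientset.CoreV1().Nodes().Get(ctx, nodeName, metav1.GetOptions{})
	if err != nil {
		return err
	}
	if err := drain.RunCordonOrUncordon(helper, node, true); err != nil {
		return err
	}
	return drain.RunNodeDrain(helper, nodeName)
}
```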
Yes, DaemonSet pods should be considered as part of the maintenance scenarios, especially in cases when the node is going to shut down. I have added it to the goals to make it clear.
We also had discussions with SIG Node about static pod termination in the past, and they were generally not against it. But we lack use cases for it so far.
IIUC, this proposal aims to define an order for terminating static pods, DaemonSet pods, and other system-node-critical priority class pods in a node drain scenario. Is that a correct assumption?
I am not saying that we will solve static pod termination, just that we will look into that :) And yes, I think there should definitely be an ordering for both DS/Static.
@erictune @sanposhiho @macsko
So SIG Scheduling should definitely be involved in this effort, although the problem space is much wider than the issues above and spans the whole scheduling space. We will take those issues into consideration, especially methods to prevent other components from breaking running workloads. That implies giving sufficient information about the impact of such disruptive admin actions, so there is definitely an overlap between our efforts.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: atiratree. The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
controllers, API validation, integration with existing core components and extension points for the
ecosystem. This should be accompanied by E2E / Conformance tests.

## Relevant Projects
For visibility: please let me know if anyone has a relevant project they would like to see included here.
Co-authored-by: Ryan Hallisey <[email protected]>
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
Suggested change (keep only the first line of the quoted bullet):
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
Let's stick to general topics, w/o mentioning specific KEPs in the charter.
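For background on the Graceful Node Shutdown bullet quoted above: the behavior is configured in the kubelet, which delays node shutdown by shutdownGracePeriod and reserves the tail end of that window for critical pods. A minimal sketch of the relevant KubeletConfiguration fields in Go; the durations are illustrative:

```go
package example

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	kubeletconfigv1beta1 "k8s.io/kubelet/config/v1beta1"
)

// gracefulShutdownConfig sketches the kubelet settings behind Graceful Node
// Shutdown: on shutdown the kubelet gets up to ShutdownGracePeriod to
// terminate pods, with the final ShutdownGracePeriodCriticalPods reserved
// for critical pods.
func gracefulShutdownConfig() kubeletconfigv1beta1.KubeletConfiguration {
	return kubeletconfigv1beta1.KubeletConfiguration{
		TypeMeta: metav1.TypeMeta{
			APIVersion: "kubelet.config.k8s.io/v1beta1",
			Kind:       "KubeletConfiguration",
		},
		ShutdownGracePeriod:             metav1.Duration{Duration: 2 * time.Minute},
		ShutdownGracePeriodCriticalPods: metav1.Duration{Duration: 30 * time.Second},
	}
}
```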
- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
the new disruption model/evictions. This includes balancing of the pods according to scheduling
constraints.
- Consider improving the pod lifecycle of DaemonSets and Static pods during a node maintenance.
Suggested change: "DaemonSets and Static pods" → "DaemonSets and static pods".
...) and other scenarios to use the new unified node draining approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
track and react to each of them. Consider past discussions/historical perspective
(e.g. "thumbstones").
Suggested change: "thumbstones" → "tombstones".
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet,
and other actors want to use a new eviction API to gracefully remove pods. This would enable new
Suggested change: "and other actors want to use a new eviction API" → "and other actors should use a new eviction API". Or even a stronger "must" is appropriate.
- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
any required manual interventions. I also want to be able to observe the node drain via the API
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
Nit: this entire section has 3 separate use-cases:
- initiate
- observe
- discover
Can you just split them accordingly? Shorter user stories are easier to read.
Area we expect to explore:

- An API to express node drain/maintenance.
Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
As mentioned elsewhere, drop these links. I'd focus on goals, not actual KEPs, in the charter. KEPs can and should be tracked as part of the ongoing work of a particular WG.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve
This statement doesn't seem to fit in "Area we expect to explore". I'd drop it entirely.
Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
Are you saying there will be a separate API for descheduling any Pod and a Pod with a DRA device? Why can't both just use /evict?
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Same question here, isn't /evict sufficient?
projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.
I'd like to especially stress this section:
We will strive to make these solutions minimalistic and extensible to support advanced use cases across the ecosystem.
to ensure we first look into existing APIs and how we can expand them, rather than introducing new ones.
We already struggle with low adoption of the Eviction API; adding a new API will not resolve the problem, but will only make it more complicated for users to find the right one. I believe someone else already stressed that, but I'd like to see this be one of the key goals for this WG.
@atiratree, even though I don't work for Red Hat any more, I would like to join this WG; this topic is still of interest to me.
@atiratree, I would like to be part of this WG. Please include me as well.