WIP: Introduce Node Lifecycle WG #8396

Open · wants to merge 1 commit into master from wg-node-lifecycle

Conversation

@atiratree (Member)

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/community-management area/slack-management Issues or PRs related to the Slack Management subproject labels Mar 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 24, 2025 12:17
@k8s-ci-robot k8s-ci-robot added committee/steering Denotes an issue or PR intended to be handled by the steering committee. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 24, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 24, 2025
@atiratree atiratree changed the title Introduce Node Lifecycle WG WIP: Introduce Node Lifecycle WG Mar 24, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
@atiratree (Member Author)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@rthallisey

Looks like I'm not a member of the kubernetes org anymore. I was a few years back, but I haven't kept up with contributions recently. You can remove me as a lead, and I can reapply after making some contributions to this WG.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 24, 2025
@atiratree (Member Author)

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz (Contributor) commented Mar 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from marquiz March 25, 2025 17:09

### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs


Is it worth including something about DRA device taints/drains?

Member Author

Seems relevant to me as this affects the pod and device/node lifecycle. @pohly what do you think about including and discussing kubernetes/enhancements#5055 in the WG?

Contributor

Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.

So I guess it depends?


From what I understand about device taints, they are a way to make device health scheduler-aware. This fits into our scope because we need a way to decide which Node should be prioritized for maintenance and a plan to drain that Node. E.g. a Node with all its devices tainted is a great target for Node maintenance. However, I think we should set the DRA device taints feature aside until we discuss the Node Maintenance and Node Drain designs. I don't think it needs to be called out in the scope, as Node Maintenance and Node Drain should cover it.

Member Author

We would also like to handle the pod lifecycle better in any descheduling scenario (not just Node Drain/Maintenance). One option is to use the EvictionRequest API which should give more power to the applications that are being disrupted. So it might be interesting to see if we can make the disruption more graceful in 5055.


Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.

The other end of this to consider, though, is devices that span multiple nodes.

Member Author

I have added it to the WG. I think it would be good to discuss this feature and its impact. Also, it might be better to hold off on beta for some time.

@fabriziopandini (Member)

As per the KubeCon discussion, I would like to be part of the working group as a representative of SCL and Cluster API.

@atiratree atiratree force-pushed the wg-node-lifecycle branch 2 times, most recently from 79fbef9 to 8a70937 Compare April 9, 2025 11:11
@atiratree (Member Author)

Cool, thank you for joining the WG @fabriziopandini. Added.

Comment on lines +97 to +102
- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
accelerators) and either prevent termination altogether or have a reliable migration path.
Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can
take hours if not days.
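For context, `terminationGracePeriodSeconds` referenced above is a per-pod field that only bounds how long the kubelet waits between SIGTERM and the forced kill; a minimal sketch, with illustrative names:

```yaml
# Illustrative pod: on deletion/eviction the kubelet sends SIGTERM, waits up to
# 3600 seconds for the container to exit, then force-kills it. It cannot pause
# or coordinate a longer migration, which is the gap this user story describes.
apiVersion: v1
kind: Pod
metadata:
  name: expensive-workload        # hypothetical name
spec:
  terminationGracePeriodSeconds: 3600
  containers:
  - name: main
    image: registry.example.com/training-job:latest   # hypothetical image
```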
Contributor

Can users not already do this with a PDB? Are we suggesting that node maintenance would override blocking PDBs if they block for some extended period of time?

I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?


I think PDBs have a different use-case, so we may need to reword. The PodDisruptionBudget protects the availability of the application. What we're saying is that there's no API that protects both the availability of the infrastructure and the availability of the application. E.g. an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish. It's in the best interest of the application and infrastructure provider that an admin remediates the accelerator, so admin-user mutually agree on when that can occur.
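For reference, a PodDisruptionBudget only guards application availability during voluntary disruptions; a minimal sketch, with illustrative labels and counts:

```yaml
# Illustrative PDB: evictions (e.g. from a node drain) are rejected whenever
# fewer than 2 pods matching app=web would remain available. It says nothing
# about the node's own health or maintenance needs.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```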


Yeah I agree there's a problem that eviction API / drain doesn't guarantee it will finish within a reasonable time, especially if the node is having issues (things get stuck terminating etc.).

But at least we can do this today, right?

an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish

You can just taint the node or the devices with NoSchedule.


You can just taint the node or the devices with NoSchedule.

Yes, that is a solution assuming that:

  1. workloads will eventually drain if we wait long enough
  2. all termination steps will be successful

However, 1) can theoretically always work, but it is the slowest possible solution, and 2) is not guaranteed to work.
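For reference, the "taint the node with NoSchedule" approach discussed above looks roughly like the sketch below (the taint key and node name are illustrative); it keeps new pods off the node but does nothing to move the existing ones along:

```yaml
# Illustrative NoSchedule taint: pods without a matching toleration will not be
# scheduled onto this node, while already-running pods keep running (unlike a
# NoExecute taint, nothing is evicted).
apiVersion: v1
kind: Node
metadata:
  name: worker-1                          # hypothetical node name
spec:
  taints:
  - key: example.com/degraded-accelerator # hypothetical taint key
    effect: NoSchedule
```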

Member Author

I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?

This is a good example. We want applications/admins to be aware of upcoming maintenances, and also pods in most descheduling scenarios, so that they are given an opportunity to migrate or clean up before termination, which is hard to do with PDBs.

The goal is not to override PDBs (that is also hard to do without breaking someone); the goal is to have a smarter layer above the PDBs.

Contributor

If the eviction API has been used and the terminationGracePeriod has passed, isn't there a force kill executed? Is there a scenario where the pod gets stuck for days/weeks in this case?

Member Author

Just the eviction API alone can cause pods to get stuck. All in all, I would prefer we do not dive deep into the topic and focus mostly on the scope in this PR.
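For reference, an eviction is requested by creating a policy/v1 Eviction against the pod's eviction subresource; when a blocking PDB applies, the API server rejects the request (HTTP 429) and no deletion ever happens, so terminationGracePeriodSeconds never even starts. A minimal sketch, with illustrative names:

```yaml
# Illustrative Eviction object, POSTed to
# /api/v1/namespaces/default/pods/web-0/eviction (this is what kubectl drain
# does per pod). A denied eviction is typically just retried by the caller;
# nothing force-kills the pod, which is how drains can stall indefinitely.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-0
  namespace: default
```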

@humblec (Contributor) commented Apr 11, 2025

Hmm, out of curiosity, can you please share any reference issues where the eviction API itself gets stuck? While I understand the scope of this WG to an extent, I feel the need for a new WG itself arose from the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still doesn't address half of the issues in this area, it doesn't serve the purpose.

Member Author

When the eviction gets through, i.e. results in a delete call, then everything works fine. A stuck eviction can be caused by:

the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still doesn't address half of the issues in this area, it doesn't serve the purpose.

Yes, IMO this is one of the main goals and should be solved even before we introduce a node maintenance API. I have added it to the goals to make this clear.

Comment on lines +101 to +105
- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
to the underlying storage and completing storage cleanup routines.
Contributor

In some scenarios, evicting/removing certain pods would prevent some of these operations. If we were also evicting daemonset pods as an option (is that a goal? I wasn't sure), then we might need some ordering to make sure the CSI or CNI driver isn't removed until certain other cleanup has happened.


The way we handle this: our internal drain API has a label selector for things to ignore and by default just ignores daemonsets instead of worrying about ordering here. Daemonsets aren't really supported under drain anyway. That should handle most cases for system-level things like CSI/CNI/etc.

@atiratree (Member Author) commented Apr 11, 2025

Yes, daemonset pods should be considered as part of the maintenance scenarios, especially in cases when the node is going to shut down. I have added it to the goals to make this clear.

We also had discussions with SIG node about static pod termination in the past, and they were generally not against it. But we lack use cases for it so far.

Contributor

IIUC, this proposal aims to define an ordering for the termination of static pods, DaemonSet pods, and other system-node-critical priority class pods in a node drain scenario. Is that a correct assumption?

Member Author

I am not saying that we will solve static pod termination, just that we will look into it :) And yes, I think there should definitely be an ordering for both DaemonSet and static pods.

@dom4ha (Member) commented Apr 10, 2025

@erictune @sanposhiho @macsko
Some of the problems brought up here are actually related to scheduling, or rather to things that are missing in scheduling, namely:

  1. no ability to reserve resources upfront (which makes eviction unreliable and cannot be considered atomic)
  2. no ability to link pods to workloads they belong to, so admins have no tools to assess the impact of their actions.

So SIG Scheduling should definitely be involved in this effort, although the problem space is much wider than the issues above and spans the whole scheduling space.

We will take those issues into consideration, especially methods to prevent other components from breaking running workloads. This implies giving sufficient information about the impact of such disruptive admin actions, so there is definitely an overlap between our efforts.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign pohly for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree atiratree force-pushed the wg-node-lifecycle branch 4 times, most recently from d725bb9 to a3da4df Compare April 11, 2025 09:08
controllers, API validation, integration with existing core components and extension points for the
ecosystem. This should be accompanied by E2E / Conformance tests.

## Relevant Projects
Member Author

For visibility: please let me know if anyone has a relevant project they would like to see included here.

Co-authored-by: Ryan Hallisey <[email protected]>
Comment on lines +36 to +38
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
Contributor

Suggested change
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.

Let's stick to general topics, w/o mentioning specific KEPs in the charter.

- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
the new disruption model/evictions. This includes balancing of the pods according to scheduling
constraints.
- Consider improving the pod lifecycle of DaemonSets and Static pods during a node maintenance.
Contributor

Suggested change
- Consider improving the pod lifecycle of DaemonSets and Static pods during a node maintenance.
- Consider improving the pod lifecycle of DaemonSets and static pods during a node maintenance.

...) and other scenarios to use the new unified node draining approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
track and react to each of them. Consider past discussions/historical perspective
(e.g. "thumbstones").
Contributor

Suggested change
(e.g. "thumbstones").
(e.g. "tombstones").

and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet,
and other actors want to use a new eviction API to gracefully remove pods. This would enable new
Contributor

Suggested change
and other actors want to use a new eviction API to gracefully remove pods. This would enable new
and other actors should use a new eviction API to gracefully remove pods. This would enable new

Or even a stronger "must" is appropriate.

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
any required manual interventions. I also want to be able to observe the node drain via the API
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
Contributor

Nit: this entire section has 3 separate use-cases:

  1. initiate
  2. observe
  3. discover

Can you split them accordingly? It's easier to read shorter user stories.

Area we expect to explore:

- An API to express node drain/maintenance.
Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
Contributor

As mentioned elsewhere, drop these links. I'd focus on goals, not actual KEPs, in the charter. KEPs can and should be tracked as part of the ongoing work of a particular WG.

DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve
Contributor

This statement doesn't seem to fit under "Area we expect to explore:". I'd drop it entirely.

Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
Contributor

Are you saying there will be a separate API for descheduling any Pod versus a Pod with a DRA device? Why can't both just use /evict?

Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Contributor

Same question here: isn't /evict sufficient?

projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.
Contributor

I'd like to especially stress this section:

We will strive to make these solutions minimalistic and extensible to support advanced 
use cases across the ecosystem.

to ensure we first look into existing APIs and how we can expand them, rather than introducing new ones.

We already struggle with low adoption of the Eviction API; adding a new API will not resolve that problem, but will only make it more complicated for users to find the right one. I believe someone else already stressed this, but I'd like to see it become one of the key goals for this WG.

@kwilczynski (Member)

@atiratree, even though I don't work for Red Hat anymore, I would like to join this WG; this topic is still of interest to me.

@selansen

@atiratree, I would like to be part of this WG. Please include me as well.
