WIP: Introduce Node Lifecycle WG #8396

Open · wants to merge 1 commit into master from wg-node-lifecycle

Conversation

@atiratree (Member)

No description provided.

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. area/community-management area/slack-management Issues or PRs related to the Slack Management subproject labels Mar 24, 2025
@k8s-ci-robot k8s-ci-robot requested review from ahg-g and ardaguclu March 24, 2025 12:17
@k8s-ci-robot k8s-ci-robot added committee/steering Denotes an issue or PR intended to be handled by the steering committee. sig/apps Categorizes an issue or PR as relevant to SIG Apps. sig/architecture Categorizes an issue or PR as relevant to SIG Architecture. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. sig/autoscaling Categorizes an issue or PR as relevant to SIG Autoscaling. sig/cli Categorizes an issue or PR as relevant to SIG CLI. sig/cloud-provider Categorizes an issue or PR as relevant to SIG Cloud Provider. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. sig/contributor-experience Categorizes an issue or PR as relevant to SIG Contributor Experience. do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. sig/node Categorizes an issue or PR as relevant to SIG Node. sig/scheduling Categorizes an issue or PR as relevant to SIG Scheduling. labels Mar 24, 2025
@github-project-automation github-project-automation bot moved this to Needs Triage in SIG Scheduling Mar 24, 2025
@atiratree atiratree changed the title Introduce Node Lifecycle WG WIP: Introduce Node Lifecycle WG Mar 24, 2025
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Mar 24, 2025
@atiratree (Member Author)

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 24, 2025
@rthallisey

Looks like I'm not a member of the kubernetes org anymore. I was a few years back, but I haven't kept up with contributions recently. You can remove me as a lead, and I can reapply after making some contributions to this WG.

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/invalid-owners-file Indicates that a PR should not merge because it has an invalid OWNERS file in it. label Mar 24, 2025
@atiratree (Member Author)

We have had impactful conversations with Ryan about this group and its goals. He has experience with cluster maintenance and I look forward to his participation in the WG.

@marquiz (Contributor) commented Mar 25, 2025

/cc

@k8s-ci-robot k8s-ci-robot requested a review from marquiz March 25, 2025 17:09

### In scope

- Explore a unified way of draining the nodes and managing node maintenance by introducing new APIs


Is it worth including something about DRA device taints/drains?

Member Author

Seems relevant to me as this affects the pod and device/node lifecycle. @pohly what do you think about including and discussing kubernetes/enhancements#5055 in the WG?

Contributor

Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.

So I guess it depends?


From what I understand about device taints, they are a way to make device health scheduler-aware. This fits into our scope because we need a way to decide which Node should be prioritized for maintenance and a plan to drain that Node. E.g. a Node with all its devices tainted is a great target for Node maintenance. However, I think we should set the DRA device taints feature aside until we discuss the Node Maintenance and Node Drain designs. I don't think it needs to be called out in the scope, as Node Maintenance and Node Drain should cover it.

Member Author

We would also like to handle the pod lifecycle better in any descheduling scenario (not just Node Drain/Maintenance). One option is to use the EvictionRequest API which should give more power to the applications that are being disrupted. So it might be interesting to see if we can make the disruption more graceful in 5055.


Part of the motivation for 5055 is to not have to drain entire nodes when doing some maintenance which only affects one device or hardware managed by one driver - unless of course that maintenance ends up with having to reboot the node.

The other end of this to consider, though, is devices that span multiple nodes.

Member Author

I have added it to the WG. I think it would be good to discuss this feature and its impact. Also, it might be better to hold off on beta for some time.

@fabriziopandini (Member)

As per the KubeCon discussion, I would like to be part of the working group as a representative of SCL and Cluster API.

@atiratree atiratree force-pushed the wg-node-lifecycle branch 2 times, most recently from 79fbef9 to 8a70937 Compare April 9, 2025 11:11
@atiratree (Member Author)

Cool, thank you for joining the WG @fabriziopandini. Added.

Comment on lines +97 to +102
- As a user, I want to prevent any disruption to my pet or expensive workloads (VMs, ML with
accelerators) and either prevent termination altogether or have a reliable migration path.
Features like `terminationGracePeriodSeconds` are not sufficient as the termination/migration can
take hours if not days.
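For context, `terminationGracePeriodSeconds` referenced above is a per-pod field that only bounds how long the kubelet waits between SIGTERM and the forced kill; a minimal sketch, with illustrative names:

```yaml
# Illustrative pod: on deletion/eviction the kubelet sends SIGTERM, waits up to
# 3600 seconds for the container to exit, then force-kills it. It cannot pause
# or coordinate a longer migration, which is the gap this user story describes.
apiVersion: v1
kind: Pod
metadata:
  name: expensive-workload        # hypothetical name
spec:
  terminationGracePeriodSeconds: 3600
  containers:
  - name: main
    image: registry.example.com/training-job:latest   # hypothetical image
```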
Contributor

Can users not already do this with a PDB? Are we suggesting that node maintenance would override blocking PDBs if they block for some extended period of time?

I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?


I think PDBs have a different use-case, so we may need to reword. The PodDisruptionBudget protects the availability of the application. What we're saying is that there's no API that protects both the availability of the infrastructure and the availability of the application. E.g. an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish. It's in the best interest of the application and infrastructure provider that an admin remediates the accelerator, so admin-user mutually agree on when that can occur.
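For reference, a PodDisruptionBudget only guards application availability during voluntary disruptions; a minimal sketch, with illustrative labels and counts:

```yaml
# Illustrative PDB: evictions (e.g. from a node drain) are rejected whenever
# fewer than 2 pods matching app=web would remain available. It says nothing
# about the node's own health or maintenance needs.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
```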


Yeah I agree there's a problem that eviction API / drain doesn't guarantee it will finish within a reasonable time, especially if the node is having issues (things get stuck terminating etc.).

But at least we can do this today, right?

an accelerator is degraded on a Node, so I don't want to run future workloads there but it's ok for the current one to finish

You can just taint the node or the devices with NoSchedule.


You can just taint the node or the devices with NoSchedule.

Yes, that is a solution assuming that:

  1. workloads will eventually drain if we wait long enough
  2. all termination steps will be successful

However, 1) can theoretically always work, but it is the slowest possible solution, and 2) is not guaranteed to work.
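For reference, the "taint the node with NoSchedule" approach discussed above looks roughly like the sketch below (the taint key and node name are illustrative); it keeps new pods off the node but does nothing to move the existing ones along:

```yaml
# Illustrative NoSchedule taint: pods without a matching toleration will not be
# scheduled onto this node, while already-running pods keep running (unlike a
# NoExecute taint, nothing is evicted).
apiVersion: v1
kind: Node
metadata:
  name: worker-1                          # hypothetical node name
spec:
  taints:
  - key: example.com/degraded-accelerator # hypothetical taint key
    effect: NoSchedule
```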

Member Author

I'm aware of k8s-shredder, an Adobe project that puts nodes into maintenance and then gives them a week to be clear before removing them. I'm wondering if this case is to say, even in that scenario, don't kill my workload?

This is a good example. We want applications/admins to be aware of upcoming maintenances, and also pods in most descheduling scenarios, so that they are given an opportunity to migrate or clean up before termination, which is hard to do with PDBs.

The goal is not to override PDBs (that is also hard to do without breaking someone); the goal is to have a smarter layer above the PDBs.

Contributor

If the eviction API has been used and the terminationGracePeriod has passed, isn't there a force kill executed? Is there a scenario where the pod gets stuck for days/weeks in this case?

Member Author

Just the eviction API alone can cause pods to get stuck. All in all, I would prefer we do not dive deep into the topic and focus mostly on the scope in this PR.
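For reference, an eviction is requested by creating a policy/v1 Eviction against the pod's eviction subresource; when a blocking PDB applies, the API server rejects the request (HTTP 429) and no deletion ever happens, so terminationGracePeriodSeconds never even starts. A minimal sketch, with illustrative names:

```yaml
# Illustrative Eviction object, POSTed to
# /api/v1/namespaces/default/pods/web-0/eviction (this is what kubectl drain
# does per pod). A denied eviction is typically just retried by the caller;
# nothing force-kills the pod, which is how drains can stall indefinitely.
apiVersion: policy/v1
kind: Eviction
metadata:
  name: web-0
  namespace: default
```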

@humblec (Contributor) commented Apr 11, 2025

Hmm, out of curiosity, can you please share any reference issues where the eviction API itself gets stuck? While I understand the scope of this WG to an extent, I feel the need for a new WG itself arose from the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still doesn't address half of the issues in this area, it doesn't serve the purpose.

Member Author

When the eviction gets through, i.e. results in a delete call, then everything works fine. A stuck eviction can be caused by:

the lack of coordination and the complexity of the primitives/features we have in k/k around scheduling, preemption, and eviction. IMO, if the new design still doesn't address half of the issues in this area, it doesn't serve the purpose.

Yes, IMO this is one of the main goals and should be solved even before we introduce a node maintenance API. I have added it to the goals to make this clear.

Comment on lines +101 to +105
- As a user, I want my application to finish all network and storage operations before terminating a
pod. This includes closing pod connections, removing pods from endpoints, writing cached writes
to the underlying storage and completing storage cleanup routines.
Contributor

In some scenarios, evicting/removing certain pods would prevent some of these operations. If we were also evicting daemonset pods as an option (is that a goal? I wasn't sure), then we might need some ordering to make sure the CSI or CNI driver isn't removed until certain other cleanup has happened.


The way we handle this: our internal drain API has a label selector for things to ignore and by default just ignores daemonsets instead of worrying about ordering here. Daemonsets aren't really supported under drain anyway. That should handle most cases for system-level things like CSI/CNI/etc.

@atiratree (Member Author) commented Apr 11, 2025

Yes, daemonset pods should be considered as part of the maintenance scenarios, especially in cases when the node is going to shut down. I have added it to the goals to make this clear.

We also had discussions with SIG node about static pod termination in the past, and they were generally not against it. But we lack use cases for it so far.

Contributor

IIUC, this proposal aims to define an ordering for the termination of static pods, DaemonSet pods, and other system-node-critical priority class pods in a node drain scenario. Is that a correct assumption?

Member Author

I am not saying that we will solve static pod termination, just that we will look into it :) And yes, I think there should definitely be an ordering for both DaemonSet and static pods.

@dom4ha (Member) commented Apr 10, 2025

@erictune @sanposhiho @macsko
Some of the problems brought up here are actually related to scheduling, or rather to things that are missing in scheduling, namely:

  1. no ability to reserve resources upfront (which makes eviction unreliable and cannot be considered atomic)
  2. no ability to link pods to workloads they belong to, so admins have no tools to assess the impact of their actions.

So SIG Scheduling should definitely be involved in this effort, although the problem space is much wider than the issues above and spans the whole scheduling space.

We will take those issues into consideration, especially methods to prevent other components from breaking running workloads. This implies giving sufficient information about the impact of such disruptive admin actions, so there is definitely an overlap between our efforts.

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: atiratree
Once this PR has been reviewed and has the lgtm label, please assign pohly for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@atiratree atiratree force-pushed the wg-node-lifecycle branch 4 times, most recently from d725bb9 to a3da4df Compare April 11, 2025 09:08
controllers, API validation, integration with existing core components and extension points for the
ecosystem. This should be accompanied by E2E / Conformance tests.

## Relevant Projects
Member Author

For visibility: please let me know if anyone has a relevant project they would like to see included here.

Co-authored-by: Ryan Hallisey <[email protected]>
Comment on lines +36 to +38
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
Contributor

Suggested change
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.
To graduate the [Graceful Node Shutdown](https://github.com/kubernetes/enhancements/issues/2000)
feature to GA and resolve the associated node shutdown issues.
- Improve the Graceful/Non-Graceful Node Shutdown and consider how this affects the node lifecycle.

Let's stick to general topics, w/o mentioning specific KEPs in the charter.

- Improve the scheduling and pod/node autoscaling to take into account ongoing node maintenance and
the new disruption model/evictions. This includes balancing of the pods according to scheduling
constraints.
- Consider improving the pod lifecycle of DaemonSets and Static pods during a node maintenance.
Contributor

Suggested change
- Consider improving the pod lifecycle of DaemonSets and Static pods during a node maintenance.
- Consider improving the pod lifecycle of DaemonSets and static pods during a node maintenance.

...) and other scenarios to use the new unified node draining approach.
- Explore possible scenarios behind the reason why the node was terminated/drained/killed and how to
track and react to each of them. Consider past discussions/historical perspective
(e.g. "thumbstones").
Contributor

Suggested change
(e.g. "thumbstones").
(e.g. "tombstones").

and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
- To support the new features, node maintenance, scheduler, descheduler, pod autoscaling, kubelet,
and other actors want to use a new eviction API to gracefully remove pods. This would enable new
Contributor

Suggested change
and other actors want to use a new eviction API to gracefully remove pods. This would enable new
and other actors should use a new eviction API to gracefully remove pods. This would enable new

Or even a stronger "must" is appropriate.

- As a cluster admin I want to have a simple interface to initiate a node drain/maintenance without
any required manual interventions. I also want to be able to observe the node drain via the API
and check on its progress. I also want to be able to discover workloads that are blocking the node
drain.
Contributor

Nit: this entire section has 3 separate use-cases:

  1. initiate
  2. observe
  3. discover

Can you split them accordingly? It's easier to read shorter user stories.

Area we expect to explore:

- An API to express node drain/maintenance.
Currently tracked in https://github.com/kubernetes/enhancements/issues/4212.
Contributor

As mentioned elsewhere, drop these links. I'd focus on goals, not actual KEPs, in the charter. KEPs can and should be tracked as part of the ongoing work of a particular WG.

DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Currently tracked in https://docs.google.com/document/d/1t25jgO_-LRHhjRXf4KJ5xY_t8BZYdapv7MDAxVGY6R8/edit?tab=t.0#heading=h.i4lwa7rdng7y.
- Introduce enhancements across multiple Kubernetes SIGs to add support for the new APIs to solve
Contributor

This statement doesn't seem to fit under "Area we expect to explore:". I'd drop it entirely.

Currently tracked in https://github.com/kubernetes/enhancements/issues/4563.
- An API/mechanism to gracefully terminate pods during a node shutdown.
Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
Contributor

Are you saying there will be a separate API for descheduling any Pod versus a Pod with a DRA device? Why can't both just use /evict?

Graceful node shutdown feature tracked in https://github.com/kubernetes/enhancements/issues/2000.
- An API to deschedule pods that use DRA devices.
DRA: device taints and tolerations feature tracked in https://github.com/kubernetes/enhancements/issues/5055.
- An API to remove pods from endpoints before they terminate.
Contributor

Same question here: isn't /evict sufficient?

projects and addressing scenarios that impede node drain or cause improper pod termination. Our
objective is to create easily configurable, out-of-the-box solutions that seamlessly integrate with
existing APIs and behaviors. We will strive to make these solutions minimalistic and extensible to
support advanced use cases across the ecosystem.
Contributor

I'd like to especially stress this section:

We will strive to make these solutions minimalistic and extensible to support advanced 
use cases across the ecosystem.

to ensure we first look into existing APIs and how we can expand them, rather than introducing new ones.

We already struggle with low adoption of the Eviction API; adding a new API will not resolve that problem, but will only make it more complicated for users to find the right one. I believe someone else already stressed this, but I'd like to see it become one of the key goals for this WG.

@kwilczynski (Member)

@atiratree, even though I don't work for Red Hat anymore, I would like to join this WG; this topic is still of interest to me.

@selansen

@atiratree, I would like to be part of this WG. Please include me as well.
