Trigger graceful Machine disruption via Node deletion #13591
Labels: area/machine, kind/feature, needs-priority, needs-triage
What would you like to be added (User Story)?
As an operator, I would like to be able to gracefully remove a Node and its underlying Machine/Infrastructure, respecting any disruption configuration, by deleting the Node object in the child cluster.
Detailed Description
CAPI supports a configurable and sophisticated deletion process for Machine objects that respects workload availability (PDBs, etc.). At present, this is triggered by `metadata.deletionTimestamp` being set on the Machine object.
In some environments, the operator of the CAPI infrastructure differs from the operator of the child cluster and its resources, or jumping across clusters is cumbersome. In this model, it is challenging for a child-cluster operator to trigger this same safe deletion process from the child cluster directly.
A potential solution is to (via a feature flag) have CAPI configure `metadata.finalizers` on managed Nodes and have it react to the Node's `metadata.deletionTimestamp` to trigger a deletion of the parent Machine resource, removing the finalizer when the disruption process has completed. This follows a similar pattern to what Karpenter has implemented and allows for safe removal of 'bad' nodes in a pinch.
Anything else you would like to add?
If there is appetite to explore this, I'm happy to put together a formal proposal/PR.
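To make the idea in the Detailed Description a bit more concrete, here is a rough sketch of the decision logic a Node-side reconciler might follow. It is only an illustration under stated assumptions: the `Node` and `Machine` structs are simplified stand-ins rather than the real API types, the finalizer name is hypothetical, and `nextAction` is not an existing CAPI function.

```go
package main

import "fmt"

// nodeFinalizer is a hypothetical name, not an existing CAPI constant.
const nodeFinalizer = "example.cluster.x-k8s.io/graceful-machine-deletion"

// Node is a simplified stand-in for the workload-cluster Node object.
type Node struct {
	Name       string
	Finalizers []string
	Deleting   bool // true once metadata.deletionTimestamp is set
}

// Machine is a simplified stand-in for the owning Machine object.
type Machine struct {
	Name     string
	Deleting bool // true once metadata.deletionTimestamp is set
	Gone     bool // true once the Machine deletion process has completed
}

// Action names the single step the reconciler would take next.
type Action string

const (
	AddFinalizer    Action = "add-finalizer"
	DeleteMachine   Action = "delete-machine"
	Wait            Action = "wait"
	RemoveFinalizer Action = "remove-finalizer"
	Noop            Action = "noop"
)

// hasFinalizer reports whether the Node carries the (hypothetical) finalizer.
func hasFinalizer(n Node) bool {
	for _, f := range n.Finalizers {
		if f == nodeFinalizer {
			return true
		}
	}
	return false
}

// nextAction decides what to do for one Node/Machine pair.
func nextAction(n Node, m Machine) Action {
	if !n.Deleting {
		// Healthy Node: make sure the finalizer is in place so that a
		// later Node deletion blocks until the Machine has been handled.
		if !hasFinalizer(n) {
			return AddFinalizer
		}
		return Noop
	}
	if !hasFinalizer(n) {
		// Node is being deleted but was never protected; nothing to do.
		return Noop
	}
	if m.Gone {
		// Machine deletion (drain, infrastructure teardown) has finished.
		return RemoveFinalizer
	}
	if !m.Deleting {
		// Translate the Node deletion into a Machine deletion.
		return DeleteMachine
	}
	// Machine deletion is in progress; keep the finalizer until it completes.
	return Wait
}

func main() {
	n := Node{Name: "node-a", Finalizers: []string{nodeFinalizer}, Deleting: true}
	m := Machine{Name: "machine-a"}
	fmt.Println(nextAction(n, m))
}
```

Keeping the decision as a pure function separates the policy from the API calls, which would also make it straightforward to gate behind a feature flag.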
Label(s) to be applied
/kind feature
/area machine