
Add optional pause prior to completing lifecycle action to allow PVC cleanup #651


Closed
mjseid opened this issue Jun 14, 2022 · 4 comments
Labels
Type: Question All types of questions to/from customers

Comments


mjseid commented Jun 14, 2022

Describe the feature
Add an optional wait after successful pod eviction before completing the ASG lifecycle hook. This could default to 0, but in my case the nodes appear to be terminated before all needed drain tasks are complete.
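The requested behavior amounts to a configurable sleep between pod eviction and the `CompleteLifecycleAction` call. A minimal sketch using a boto3-style Auto Scaling client (the function name and parameters here are illustrative assumptions, not NTH's actual implementation):

```python
import time

def complete_after_delay(asg_client, hook_name, asg_name, instance_id,
                         delay_seconds=0):
    """Optionally pause after pod eviction, then complete the ASG
    lifecycle action so Auto Scaling proceeds with termination."""
    if delay_seconds > 0:
        # Give the EBS CSI controller time to detach volumes from the
        # drained node before the instance is actually terminated.
        time.sleep(delay_seconds)
    return asg_client.complete_lifecycle_action(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        InstanceId=instance_id,
        LifecycleActionResult="CONTINUE",
    )
```

With `delay_seconds=0` this is the current behavior; a non-zero value would hold the hook open a little longer after draining finishes.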

Is the feature request related to a problem?
I came across this project as a potential solution to a common issue with the EBS CSI storage driver. If a node is ungracefully terminated (e.g. during an ASG instance refresh) without being drained, pods with PVCs cannot come up on new nodes until the controller's 6-minute force-detach happens.

I've installed this project, and it does successfully evict the pods with PVCs and wait for a completed lifecycle hook before the nodes are terminated, but my stateful pods still hit the PVC multi-attach error and have to wait the 6 minutes. If I manually drain the nodes and then manually delete them, I don't see this issue, so I believe ASG is simply terminating the nodes too quickly, before the controller can fully detach the PVCs.

Describe alternatives you've considered
I can get pods with PVCs to move appropriately if I set the lifecycle hook heartbeat timeout to 60 seconds with a default action of CONTINUE, and then remove the "autoscaling:CompleteLifecycleAction" privilege from the IAM role for this project. The handler evicts the pods but can't complete the lifecycle action, and Auto Scaling continues with deleting the node after the 60-second timeout.

This works for my use case, but only because I have a small number of pods per node and they evict quickly. It would be better to simply inject this wait time before completing the lifecycle action.
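The workaround above boils down to configuring the termination hook with a short heartbeat and CONTINUE as the default result, so Auto Scaling proceeds on its own when the handler never completes the action. A sketch with a boto3-style client (the hook name and timeout are placeholders matching the description above):

```python
def configure_timeout_hook(asg_client, asg_name,
                           hook_name="nth-termination-hook",
                           timeout_seconds=60):
    """Create or update a termination lifecycle hook that times out
    after `timeout_seconds` and then CONTINUEs with instance
    termination, whether or not CompleteLifecycleAction was called."""
    return asg_client.put_lifecycle_hook(
        LifecycleHookName=hook_name,
        AutoScalingGroupName=asg_name,
        LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
        HeartbeatTimeout=timeout_seconds,
        DefaultResult="CONTINUE",
    )
```

The key detail is `DefaultResult="CONTINUE"`: with the `autoscaling:CompleteLifecycleAction` permission removed from the handler's role, the hook can only ever expire, and termination resumes after the timeout regardless of pod count.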

@jillmon jillmon added the Type: Question All types of questions to/from customers label Jun 15, 2022
@github-actions

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.

@github-actions github-actions bot added the stale Issues / PRs with no activity label Jul 16, 2022
@snay2 snay2 removed the stale Issues / PRs with no activity label Jul 20, 2022

cjerad commented Jul 28, 2022

Hi mjseid, can you clarify whether your Pods are managed by a StatefulSet? This blurb from the Kubernetes docs suggests it may help in this situation:

If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.

@tim-sendible

For me at least, this happens with both StatefulSets and normal Deployments. The proposed pause would indeed solve the issue.

@cjerad cjerad added the Pending-Release Pending an NTH or eks-charts release label Aug 31, 2022

snay2 commented Aug 31, 2022

Released in v1.17.2 (chart version v0.19.2).

@snay2 snay2 closed this as completed Aug 31, 2022
@snay2 snay2 removed the Pending-Release Pending an NTH or eks-charts release label Aug 31, 2022