Describe the feature
Add an optional wait after successful pod eviction before completing the ASG lifecycle hook. It could default to 0, but in my case it appears the nodes are getting terminated before all needed drain tasks are complete.
Is the feature request related to a problem?
I came across this project as a potential solution to a common issue with the EBS CSI storage driver. If a node is terminated ungracefully (e.g. by an ASG instance refresh) without being drained, pods with PVCs cannot come up on new nodes until the controller's 6-minute force-detach happens.
I've installed this project, and it does successfully evict the pods with PVCs and wait for a completed lifecycle hook before terminating the nodes, but my stateful pods still get the PVC multi-attach error and have to wait the 6 minutes. If I manually drain the nodes and then manually delete them, I don't see this issue, so I believe the ASG is terminating the nodes before the controller can fully detach the volumes.
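One way to check that hypothesis is to look at which VolumeAttachment objects still reference the node after its pods have been evicted. The following is a minimal sketch, not part of this project: the helper name `attachments_still_on_node` is hypothetical, and it assumes its input is the parsed JSON output of `kubectl get volumeattachments -o json`.

```python
from typing import Any, Dict, List


def attachments_still_on_node(
    volume_attachments: Dict[str, Any], node_name: str
) -> List[str]:
    """Return names of VolumeAttachments still marked attached to node_name.

    `volume_attachments` is assumed to be the parsed List object from
    `kubectl get volumeattachments -o json` (a dict with an `items` array).
    """
    remaining = []
    for item in volume_attachments.get("items", []):
        spec = item.get("spec", {})
        status = item.get("status", {})
        # spec.nodeName is the node the volume should be attached to;
        # status.attached reports whether the attach is currently in effect.
        if spec.get("nodeName") == node_name and status.get("attached", False):
            remaining.append(item["metadata"]["name"])
    return remaining
```

If this list is non-empty at the moment the lifecycle action completes, the instance would be terminated while volumes are still attached, which would be consistent with the multi-attach errors described above.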
Describe alternatives you've considered
I can get pods with PVCs to move appropriately if I set the lifecycle hook heartbeat timeout to 60 seconds with an action of CONTINUE, and then remove the "autoscaling:CompleteLifecycleAction" privilege from the IAM role for this project. The handler then evicts the pods but can't complete the lifecycle action, and Auto Scaling continues with deleting the node after the 60-second timeout.
It works for my use case, but only because I have a small number of pods per node and they evict quickly. It would be better to just inject this wait time before completing the lifecycle action.
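To make the request concrete, here is a rough sketch of the behavior I have in mind. It is not this project's code: `complete_after_delay`, `delay_seconds`, and the injected callables are hypothetical names used only for illustration.

```python
import time
from typing import Callable


def complete_after_delay(
    complete_lifecycle_action: Callable[[], None],
    delay_seconds: float = 0.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Wait delay_seconds after a successful drain, then complete the hook.

    `complete_lifecycle_action` is a caller-supplied callable (hypothetical)
    that performs the actual CompleteLifecycleAction request.
    """
    if delay_seconds < 0:
        raise ValueError("delay_seconds must be non-negative")
    # A delay of 0 (the proposed default) keeps today's behavior.
    if delay_seconds > 0:
        sleep(delay_seconds)
    complete_lifecycle_action()
```

With boto3, the callable could wrap the Auto Scaling client's `complete_lifecycle_action` call. The delay would need to fit within the hook's heartbeat timeout, since Auto Scaling otherwise proceeds with the hook's default result once the timeout expires.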
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. If you want this issue to never become stale, please ask a maintainer to apply the "stalebot-ignore" label.
Hi mjseid, can you clarify whether your Pods are managed by a StatefulSet? Based on this blurb from the docs, a StatefulSet may be able to help in this situation:
If you want to use storage volumes to provide persistence for your workload, you can use a StatefulSet as part of the solution. Although individual Pods in a StatefulSet are susceptible to failure, the persistent Pod identifiers make it easier to match existing volumes to the new Pods that replace any that have failed.