PV can not attach to new node if the previous node is deleted #359
@luwang-vmware This means CSI is unable to discover the node during startup. This usually happens if the vSphere conf secret has incorrect entries or the providerID in the Node API object is incorrect (it is set by the vSphere cloud provider). Can you check the secret and the Node object? Also, can you paste the CSI controller logs from startup? |
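For reference, one way to check both of the items mentioned above (the secret name, key, and namespace below are assumptions based on a typical vSphere CSI 2.x install; adjust them to your deployment):

```sh
# providerID is set by the vSphere cloud provider and should look like
# vsphere://<vm-uuid>; an empty or wrong value breaks node discovery.
kubectl get node <node-name> -o jsonpath='{.spec.providerID}'

# Inspect the vSphere config secret consumed by the CSI controller.
kubectl -n kube-system get secret vsphere-config-secret \
  -o jsonpath='{.data.csi-vsphere\.conf}' | base64 -d
```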
@SandeepPissay 5844930c-f7d5-4b8d-a618-0337da1b14c7 was the worker node, which has been deleted by BOSH; a new worker node was created by BOSH as well. The scenario is similar: a PV is attached to a worker node, and when that worker node is deleted by accident or on purpose, the PV cannot attach to another node. |
/assign @SandeepPissay |
@luwang-vmware since you work at VMware, can you file a bug and upload the CSI logs (all containers), a VC support bundle, and an ESX support bundle? Thanks! |
@SandeepPissay We will repro in-house again and file a bug then. |
@SandeepPissay I believe we've reproduced this issue during a routine cluster upgrade. Would new logs be useful? |
Support Request #21191453601 was filed on my behalf by a team member from our virtualization team. I failed to add the repro steps; I'll ask him to add those details to that ticket, but the repro method was as follows:
We started seeing PV/PVC issues after the worker-tier upgrade completed. |
We ran into this issue when performing resiliency testing on our Kubernetes cluster nodes.
And the csi-controller had these errors
We were able to recover from this state by deleting the corresponding VolumeAttachment resource. |
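For anyone hitting the same state, a minimal sketch of that recovery step (`<old-node-name>` and `<attachment-name>` are placeholders):

```sh
# Find VolumeAttachments that still reference the node that no longer exists.
kubectl get volumeattachments \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName,PV:.spec.source.persistentVolumeName' \
  | grep '<old-node-name>'

# Delete the stale attachment so the volume can be attached elsewhere.
kubectl delete volumeattachment <attachment-name>
```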
I found a work-around for this issue, in case it helps anyone.
In our node eviction code, I've added the following logic directly after the code which evicts all of the pods:
|
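The snippet itself was not captured in this thread; purely as an illustration, logic of that shape might look like the shell sketch below (the node-name variable, the timeout, and the choice to delete leftover attachments are assumptions, not the commenter's actual code):

```sh
NODE="worker-0"   # hypothetical name of the node being evicted

# After evicting the pods, give CSI some time to detach volumes, then remove
# any VolumeAttachments that still reference the node being torn down.
for i in $(seq 1 24); do
  remaining=$(kubectl get volumeattachments \
    -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' --no-headers \
    | awk -v n="$NODE" '$2==n {print $1}')
  [ -z "$remaining" ] && break
  sleep 5
done

for va in $remaining; do
  kubectl delete volumeattachment "$va" --wait=false
done
```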
@brathina-spectro Did you drain the node before you shut down the node VM? If you do not drain the node before shutdown, you may end up with a known upstream issue where the pod on the shut-down node goes into a Terminating state and the replacement pod never comes up, since the volume is still attached to the shut-down node. Also see kubernetes/enhancements#1116. There are two ways to work around this problem in Kubernetes:
|
@SandeepPissay Our objective was to test the resiliency of the system overall and so we did not shut down the nodes gracefully.
Pods were not stuck in the Terminating state. When CPI detects the node is gone, it removes the node from the cluster, and so all the pods scheduled on that node get terminated. When the new pod comes up, it stays in the "ContainerCreating" state forever with the error below.
|
I agree with @brathina-spectro's findings above. While the work-around I shared above helps to prevent this issue from happening, node issues (freeze, delete without drain, etc.) can still trigger the problem, which is difficult for our users to recover from, since they don't know to nullify the finalizers (or in some cases lack the RBAC permissions to do so). |
Can you provide more details on how exactly you are testing the resiliency? And have you considered enabling vSphere HA on the vSphere cluster so that vSphere HA can restart the node VM if it crashes? I see node VM shutdown as a planned activity, and the component/person doing that should drain the node completely before performing the shutdown. If this is not done, we hit an upstream issue where the volume detach does not happen (this is not unique to vSphere CSI). |
@SandeepPissay Thanks for replying. We haven't enabled vSphere HA, and the node did not go through a node drain. Could you share pointers to the upstream issue where the volume detach does not happen? |
|
@SandeepPissay Thanks. We're seeing the same behavior, with pods staying in the "ContainerCreating" state even when the node was drained before shutting down. VolumeAttachment deletion started a few minutes after the new node got launched, but it never finished, and the VolumeAttachment has been stuck there ever since. We tried restarting the CSI controller; the controller detects that the node is gone, but it never deletes the VolumeAttachment. What logs will help you troubleshoot this, and how can I upload them? I saw the support section, but I'm not sure which product category to choose to file a ticket. |
The CSI driver is not responsible for deleting the VolumeAttachment; the kube-controller-manager does that. In any case, we need to take a look at the kube-controller-manager, external-attacher, and CSI controller logs. Please talk to VMware GSS about how to file a ticket. |
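If it helps others gather the same data, a rough sketch of collecting those logs (the namespace, workload, and container names are assumptions for a typical vSphere CSI v2.x deployment; adjust to your cluster):

```sh
# vSphere CSI controller containers (external-attacher and the driver itself).
kubectl -n kube-system logs statefulset/vsphere-csi-controller -c csi-attacher > csi-attacher.log
kubectl -n kube-system logs statefulset/vsphere-csi-controller -c vsphere-csi-controller > vsphere-csi-controller.log

# kube-controller-manager (runs as a static pod on a control-plane node).
kubectl -n kube-system logs kube-controller-manager-<control-plane-node> > kube-controller-manager.log
```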
After you drain the node, all the pods will be deleted but the VolumeAttachments will still remain, because pods are removed from the API server once the unmount succeeds, while the detach happens afterwards. So wait until the VolumeAttachments are deleted and only then delete the node. The sequence of steps would therefore be: drain the node, wait for its VolumeAttachments to be deleted, and then delete the node.
This way, the pods on the new node would come up properly without any issue. |
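A sketch of that sequence in kubectl terms (the node name is a placeholder; on older kubectl versions the emptyDir flag is spelled --delete-local-data):

```sh
# 1. Drain the node so its pods are evicted and their volumes get unmounted.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data

# 2. Wait until no VolumeAttachment references the node any more.
while kubectl get volumeattachments -o custom-columns='NODE:.spec.nodeName' --no-headers \
        | grep -q '<node-name>'; do
  sleep 5
done

# 3. Only then delete the Node object (and the VM).
kubectl delete node <node-name>
```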
I tested the node shutdown scenario with a statefulset on a v1.20.4 Kubernetes cluster, and here's what I did:
Basically, for a planned shutdown we should drain the node first and wait for the VolumeAttachments for that node to be deleted, for quick recovery of the app (see @BaluDontu's previous comment). For unplanned node-down scenarios (OS crash, hang, etc.), we should force-drain the node and wait for some time for Kubernetes/CSI to detach the volume from the shut-down/hung node so it can be attached to the new node. It is also advised to enable vSphere HA on the vSphere cluster to make sure that vSphere HA can restart crashed/hung node VMs. Hope this info helps. |
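For the unplanned case described above, the force-drain might look like this (a sketch; the flags shown are the usual ones for evicting pods from an unreachable node):

```sh
# Force-drain a node that is already down or unreachable, ignoring DaemonSet
# pods and pods without a controller, without waiting for graceful termination.
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force --grace-period=0
```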
@SandeepPissay Thanks for recording your observations. In the ideal scenario we observed the same behavior as well, but on multiple occasions we ran into the issue where the VolumeAttachments were not getting deleted, which prevented the volumes from being detached from the deleted node. A very similar issue was addressed in Kubernetes upstream, and we're trying to confirm whether upgrading to newer versions helps. |
When BOSH recreates a problematic worker VM, which means deleting the old one and creating a new one, the vSphere CSI driver regards them as different VMs. The vSphere CSI controller can then no longer find the old worker VM to which the PV was attached, and accordingly it cannot detach the PV from the old VM. I think this could be the key point of why @SandeepPissay did not reproduce this issue. FYI, BOSH has auto-healing capabilities, which means it is responsible for automatically recreating VMs that become inaccessible; refer to the BOSH documentation on the Resurrector. |
I just ran into this same issue. The node was deleted ungracefully (the VM was deleted and the node removed from Kubernetes). All pods running on that node were automatically rescheduled onto new nodes once it was deleted from Kubernetes. The CSI driver was showing the error "Error: node wasn't found", which makes sense because the node was deleted completely. We are on 1.19.9, and the "fix" referenced above (#96617) was merged in 1.19.7, which shows it does not resolve this issue. We would like the volume to be detachable (i.e. the VolumeAttachment object removed) if the VM no longer exists, so that the disk can be moved to another VM. We cannot expect all of our nodes to be gracefully removed (drained, with a wait for the VolumeAttachments to be removed). Here are the kube-controller-manager logs when it tries to do this operation:
|
In our experience, Kubernetes does attempt to delete the VolumeAttachment. However, the attachments have a finalizer that never completes because it cannot find the node. If a node is actually missing from vSphere, the finalizer should assume it is gone, treat the disk as detached, and allow Kubernetes to delete the VolumeAttachment. The moment we remove the finalizer from the VolumeAttachment, it is deleted, and the Pod that needed that PVC successfully proceeds to start. |
Still ran into this running Kubernetes v1.21.3. The node was deleted and the VolumeAttachment was still around, specifying the old node name. This required us to clear the finalizers on each affected resource so it could be deleted and the pods could start up. |
This one-liner removes the finalizers from every VolumeAttachment that is stuck with a detach error:
kubectl get volumeattachments \
-o=custom-columns='NAME:.metadata.name,UUID:.metadata.uid,NODE:.spec.nodeName,ERROR:.status.detachError' \
--no-headers | grep -vE '<none>$' | awk '{print $1}' | \
xargs -n1 kubectl patch -p '{"metadata":{"finalizers":[]}}' --type=merge volumeattachments |
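A slightly narrower variant that only touches attachments belonging to one deleted node may be safer in shared clusters (the node name is a placeholder; the same caveat applies that clearing finalizers bypasses the normal detach path):

```sh
NODE="<deleted-node-name>"
kubectl get volumeattachments \
  -o custom-columns='NAME:.metadata.name,NODE:.spec.nodeName' --no-headers \
  | awk -v n="$NODE" '$2==n {print $1}' \
  | xargs -r -n1 kubectl patch volumeattachment --type=merge -p '{"metadata":{"finalizers":null}}'
```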
The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs. This bot triages issues and PRs according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
/remove-lifecycle stale |
This issue is fixed by PR #1879. |
/kind bug
What happened:
We deployed a k8s environment in TKGI (PKS) and deployed CSI 2.0 in the cluster, then deployed a statefulset. When the worker node on which the pod/PV resided went into an error status, the BOSH resurrector mechanism created a new worker node to replace the failed one. After a while, the statefulset pods/PV were rescheduled to other nodes, but they got stuck. The error said:
Checking the logs in csi-attacher.log, it printed the errors below. Node 5844930c-f7d5-4b8d-a618-0337da1b14c7 was the worker node that was replaced by the BOSH resurrector mechanism.
What you expected to happen:
The pods/PV should attach successfully to the new node.
How to reproduce it (as minimally and precisely as possible):
I also executed the same steps with VCP, and the pods were able to run on another worker node.
Anything else we need to know?:
Before the testing, the pod-to-node mapping was as below. The csi-controller pods are not on the same worker node as the statefulset workload.
Environment:
gcr.io/cloud-provider-vsphere/csi/release/driver:v2.0.0
Kernel (e.g. uname -a):