PV stuck in unattached state #1590
Comments
Further investigation suggests that the problem is some kind of race condition between the volume detachment process and the deletion of the node's virtual machine. This leaves the CNS volumes in a broken state from which the CSI driver cannot re-attach them. Since the volumes show up in the vSphere UI as unattached and ready, yet vCenter still refuses to attach them to a VM, I think this might actually be a bug in vSphere CNS rather than in the CSI driver. I think the problem is related to kubermatic/machine-controller#1189.
This issue is likely a duplicate of #359.
We've run into this issue before and have had luck with this process:
I agree that the linked issue seems quite similar. The event that triggers the issue is probably the same (i.e. node removal before the volume detachment process is 100% complete). There are some slight differences, though.
So with respect to your proposed rescue process, this means
I think the difference in the underlying issue might be the following: the problem causing #359 seems to be that the VM is deleted before the CSI driver is done processing the VolumeAttachment, and a stale VolumeAttachment is the result. The problem in our case is probably more akin to: the CSI driver deletes the
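For the stale-VolumeAttachment variant described in #359, a cleanup along these lines is often used. This is only a hedged sketch, not necessarily the exact process mentioned above: the node name is a placeholder, and removing finalizers bypasses the CSI attacher's normal detach, so it should only be done once the node's VM is really gone.

```shell
# clear_stale_va <node-name>
# Lists VolumeAttachments that still reference a deleted node and strips
# their finalizers so Kubernetes can garbage-collect them.
clear_stale_va() {
  node="$1"
  kubectl get volumeattachments -o jsonpath="{range .items[?(@.spec.nodeName==\"$node\")]}{.metadata.name}{\"\n\"}{end}" \
    | while read -r va; do
        # Bypasses the attacher's detach path; use only for dead nodes.
        kubectl patch volumeattachment "$va" --type=merge \
          -p '{"metadata":{"finalizers":null}}'
      done
}
# Usage: clear_stale_va worker-1   # "worker-1" is a placeholder node name
```

After the stale VolumeAttachments are gone, the attach controller can create fresh ones when the pods are rescheduled.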
Probably related to #1416, but in our case we also did not delete any node from the k8s cluster.
We also started experiencing this issue without the deletion of any nodes. It now also (sometimes) happens simply when a volume is re-mounted (i.e. when a pod is re-created). We are in contact with VMware support to investigate further. We are also seeing more evidence that this issue is not a bug in vsphere-csi but in ESXi/vSphere/CNS. Specifically, we noticed that the disks belonging to the container volumes can no longer be mounted manually in vSphere either. So for now it seems that vsphere-csi is simply accurately reporting the errors happening in the underlying infrastructure, although the question of what the root cause of the "locking" of the virtual disks might be remains unanswered.
See #1416 (comment)
Thanks for the update. In the meantime we had also figured out that this error was caused by the Changed Block Tracking (CBT) feature. However, in our case it was the other way around: CBT is disabled by default on our machines and was enabled by accident on one of the worker machines. That worker started "tainting" every volume attached to it with the CBT feature flag, and these "tainted" volumes were then lost to all other nodes in the cluster. So the solution was to remove the node with CBT activated and replace the virtual disks.
I had a similar issue @heilerich, but in my case the CBT feature was being activated by a backup solution for VMs used by another team. It took me a good while to figure that out. Since I didn't find a way to disable it for a single volume, I enabled CBT for all worker VMs with a PowerShell script, made sure all CNS volumes were attached to those VMs, and then disabled CBT for all worker VMs again, so that the volumes were set back to CBT-disabled.
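The PowerShell approach above can also be expressed with govc for anyone not on PowerCLI. This is only a sketch under my assumptions: `ctkEnabled` is the advanced VM setting behind Changed Block Tracking, `GOVC_URL` and credentials are already configured, the CBT change only takes effect after the next power cycle or snapshot create/delete, and the VM names in the usage note are placeholders.

```shell
# set_cbt <true|false> <vm...>
# Flips Changed Block Tracking on the given VMs by setting the
# ctkEnabled extra-config key via govc.
set_cbt() {
  state="$1"; shift
  for vm in "$@"; do
    govc vm.change -vm "$vm" -e "ctkEnabled=$state"
  done
}
# Usage (placeholder VM names):
#   set_cbt true  worker-1 worker-2   # match the volumes' CBT state
#   ...attach all affected CNS volumes to these workers...
#   set_cbt false worker-1 worker-2   # then flip everything back
```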
@fabiorauber thanks for the tip! I will try this should I ever encounter the problem again. Backup software seems to be the most likely culprit in this case, too. It was exactly the same situation for us: a backup solution backing up VMs it was not supposed to touch.
/kind bug
What happened:
A PersistentVolume is permanently stuck in an unattached state. The vsphere-csi driver regularly tries to attach the volume to a node, but the attach fails with
The operation is not allowed in the current state
from vCenter. The same error shows up in the vSphere UI as a failing task "Attach a virtual disk" in the node's event view.
The error first occurred without human intervention, so I can only make assumptions about the cause, but I think the trigger was a scaling operation by the cluster autoscaler which removed some nodes from the cluster. The pods successfully moved to other nodes; just this one PV has been stuck since then.
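For anyone hitting the same symptom, the cluster-side attach state can be checked roughly like this (a sketch; the PV name in the usage note is a placeholder):

```shell
# va_status <pv-name>
# Prints the VolumeAttachment name and its attached flag for a given PV,
# so you can see whether the attacher thinks the volume is attached.
va_status() {
  pv="$1"
  kubectl get volumeattachments -o jsonpath="{range .items[?(@.spec.source.persistentVolumeName==\"$pv\")]}{.metadata.name}{\"\t\"}{.status.attached}{\"\n\"}{end}"
}
# Usage: va_status pvc-0a1b...
# Then `kubectl describe volumeattachment` on the printed name shows the
# vCenter error in the attachment's status.
```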
Controller Logs
What you expected to happen:
Since the volume shows up in vSphere UI as ready and unattached, the volume attachment should be successful.
How to reproduce it (as minimally and precisely as possible):
Unfortunately, no idea.
Environment:
- v2.4.1
- v1.21.0
- v1.21.9
- v7.0u2d
- Kernel (`uname -a`): Linux 5.10.96

Other than investigating the cause of this error, I would also greatly appreciate any ideas on how to rescue the stuck volume. Thanks :)