
Mark volume detach as success when Node VM is deleted from vCenter #1879

Conversation

@divyenpatel (Member) commented Jul 20, 2022

What this PR does / why we need it:
Handles the case of detaching a volume when the Node VM has been deleted from the VC inventory.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
Fixes a known issue in the vSphere CSI Driver.

Refer to the issue "Persistent volume fails to be detached from a node" in the release notes: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.6/rn/vmware-vsphere-container-storage-plugin-26-release-notes/index.html
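
At a high level, the fix makes ControllerUnpublishVolume treat a missing Node VM as a successful detach instead of an internal error. Below is a minimal, self-contained sketch of that decision logic, not the driver code itself; decideUnpublish is a hypothetical helper, and ErrVMNotFound stands in for cnsvsphere.ErrVMNotFound from the code change further down.

package main

import (
	"errors"
	"fmt"
)

// ErrVMNotFound stands in for cnsvsphere.ErrVMNotFound in the driver.
var ErrVMNotFound = errors.New("virtual machine wasn't found")

// decideUnpublish sketches the new ControllerUnpublishVolume behavior: if node
// discovery fails because the VM is gone from the vCenter inventory, report the
// detach as successful so Kubernetes can clean up the VolumeAttachment; any
// other discovery error is still surfaced as a failure.
func decideUnpublish(discoveryErr error, volumeID, nodeID string) error {
	if discoveryErr == nil {
		return nil // VM found: continue with the normal detach path
	}
	if errors.Is(discoveryErr, ErrVMNotFound) {
		fmt.Printf("VM for node %q not found in VC inventory; marking detach of volume %q as successful\n",
			nodeID, volumeID)
		return nil
	}
	return fmt.Errorf("failed to find VirtualMachine for node %q: %w", nodeID, discoveryErr)
}

func main() {
	// Example: the Node VM was removed from vCenter, so the detach is reported as successful.
	fmt.Println(decideUnpublish(ErrVMNotFound, "487b981f-4ce4-496d-9700-a153491ceed5", "42227b0a-e117-14d4-f83e-8712b94a068e"))
}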

Testing done:

  • Created a StatefulSet with 2 replicas using vSphere CSI Driver volumes.

# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          23m   10.244.7.2    k8s-node-2-1658516767     <none>           <none>
web-1   1/1     Running   0          30m   10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                        AGE
www-web-0   Bound    pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   1Gi        RWO            example-vanilla-rwo-filesystem-sc   94s
www-web-1   Bound    pvc-875daf5d-2e75-4924-ab65-c3460516d58b   1Gi        RWO            example-vanilla-rwo-filesystem-sc   79s

# kubectl get volumeattachment
NAME                                                                   ATTACHER                 PV                                         NODE                      ATTACHED   AGE
csi-614022547a4ef8aed367d85ac85157dd7b7239532f7a095f5535743d070e54a8   csi.vsphere.vmware.com   pvc-875daf5d-2e75-4924-ab65-c3460516d58b   k8s-node-989-1658516731   true       32m
csi-76efd889ad91ab975990ee11d454649c511111d4c4f68129198bfc5a31f216d2   csi.vsphere.vmware.com   pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   k8s-node-2-1658516767     true       12m

  • From vCenter, powered off the VM for the Kubernetes node (k8s-node-2-1658516767) and removed it from the vCenter inventory.
  • The node went to NotReady state in Kubernetes:
# kubectl get node -o wide
NAME                         STATUS     ROLES           AGE    VERSION   INTERNAL-IP      EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-control-599-1658516675   Ready      control-plane   4h5m   v1.24.1   10.185.255.64    10.185.255.64    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-control-677-1658516694   Ready      control-plane   4h4m   v1.24.1   10.185.241.108   10.185.241.108   Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-control-865-1658516712   Ready      control-plane   4h3m   v1.24.1   10.185.245.31    10.185.245.31    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-node-2-1658516767        NotReady   <none>          38m    v1.24.1   10.185.242.253   10.185.242.253   Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-node-989-1658516731      Ready      <none>          4h1m   v1.24.1   10.185.252.70    10.185.252.70    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
  • Deleted the node from Kubernetes:
# kubectl delete node k8s-node-2-1658516767
node "k8s-node-2-1658516767" deleted
  • The pod was rescheduled and the volume detached successfully even though the Node VM was no longer present in vCenter:
# kubectl get pods -o wide
NAME    READY   STATUS        RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Terminating   0          37m   10.244.7.2    k8s-node-2-1658516767     <none>           <none>
web-1   1/1     Running       0          44m   10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get pods -o wide
NAME    READY   STATUS              RESTARTS   AGE     IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   0/1     ContainerCreating   0          4m11s   <none>        k8s-node-989-1658516731   <none>           <none>
web-1   1/1     Running             0          51m     10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get volumeattachment
NAME                                                                   ATTACHER                 PV                                         NODE                      ATTACHED   AGE
csi-046f0ccd3ac2b5e9e85043effec82ae378bcb6f81883c556f4919f3c869f1cc7   csi.vsphere.vmware.com   pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   k8s-node-989-1658516731   true       15s
csi-614022547a4ef8aed367d85ac85157dd7b7239532f7a095f5535743d070e54a8   csi.vsphere.vmware.com   pvc-875daf5d-2e75-4924-ab65-c3460516d58b   k8s-node-989-1658516731   true       53m


Log

2022-07-22T23:30:23.608Z	INFO	vanilla/controller.go:1169	ControllerUnpublishVolume: called with args {VolumeId:487b981f-4ce4-496d-9700-a153491ceed5 NodeId:42227b0a-e117-14d4-f83e-8712b94a068e Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.642Z	INFO	node/manager.go:195	Node hasn't been discovered yet with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.642Z	INFO	vsphere/virtualmachine.go:147	Initiating asynchronous datacenter listing with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	INFO	vsphere/datacenter.go:151	Publishing datacenter Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145]	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.658Z	INFO	vsphere/virtualmachine.go:184	AsyncGetAllDatacenters with uuid 42227b0a-e117-14d4-f83e-8712b94a068e sent a dc Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145]	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.662Z	ERROR	vsphere/datacenter.go:104	Couldn't find VM given uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	WARN	vsphere/virtualmachine.go:188	Couldn't find VM given uuid 42227b0a-e117-14d4-f83e-8712b94a068e on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145] with err: virtual machine wasn't found, continuing search	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	DEBUG	vsphere/virtualmachine.go:179	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	ERROR	vsphere/virtualmachine.go:215	Returning VM not found err for UUID 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	ERROR	node/manager.go:138	Couldn't find VM instance with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e, failed to discover with err: virtual machine wasn't found	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	ERROR	node/manager.go:207	failed to discover node with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e with err: virtual machine wasn't found	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	INFO	vanilla/controller.go:1251	Virtual Machine for Node ID: 42227b0a-e117-14d4-f83e-8712b94a068e is not present in the VC Inventory. Marking ControllerUnpublishVolume for Volume: "487b981f-4ce4-496d-9700-a153491ceed5" as successful.	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	DEBUG	vanilla/controller.go:1268	controllerUnpublishVolumeInternal: returns fault "" for volume "487b981f-4ce4-496d-9700-a153491ceed5"	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.665Z	INFO	vanilla/controller.go:1276	Volume "487b981f-4ce4-496d-9700-a153491ceed5" detached successfully from node "42227b0a-e117-14d4-f83e-8712b94a068e".	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}

Special notes for your reviewer:

Release note:

Mark volume detach as success when Node VM is deleted from vCenter

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label on Jul 20, 2022
@k8s-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels on Jul 20, 2022
@divyenpatel force-pushed the fix-detach-when-node-vm-is-deleted branch from 5d48733 to 5571cf5 on July 20, 2022 06:26
@xing-yang (Contributor):

This seems to be a reasonable fix.

@xing-yang (Contributor):

/approve

@SandeepPissay (Contributor):

The code change looks good to me. Can you run the CI jobs?

@chethanv28 (Collaborator) left a comment:

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chethanv28, divyenpatel, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chethanv28,divyenpatel,xing-yang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -1247,8 +1247,14 @@ func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.Con
 			node, err = c.nodeMgr.GetNodeByName(ctx, req.NodeId)
 		}
 		if err != nil {
-			return nil, csifault.CSIInternalFault, logger.LogNewErrorCodef(log, codes.Internal,
-				"failed to find VirtualMachine for node:%q. Error: %v", req.NodeId, err)
+			if err == cnsvsphere.ErrVMNotFound {
Contributor:

I'm not sure if we can infer that the disk is detached when we get ErrVmNotFound. A better way to be sure is to check whether the volume is attached to any VM. FCD has an API for that, but that API is not currently exposed by CNS. Can CNS expose this API? Once available, CSI can invoke it to be sure.

Member Author:

I'm not sure if we can infer that the disk is detached when we get ErrVmNotFound.

@SandeepPissay are you saying the searchIndex.FindByUuid API call cannot determine whether the VM has been deleted from the VC inventory?
If the VM is not found in the VC inventory, what is stopping us from letting Kubernetes know that this volume can be marked as detached from the requested node? Why do we need to check whether the FCD is attached to any other VM? Here we are making a change in the ControllerUnpublishVolume call.

svm, err := searchIndex.FindByUuid(ctx, dc.Datacenter, uuid, true, &instanceUUID)
if err != nil {
	log.Errorf("failed to find VM given uuid %s with err: %v", uuid, err)
	return nil, err
} else if svm == nil {
	log.Errorf("Couldn't find VM given uuid %s", uuid)
	return nil, ErrVMNotFound
}

Member Author:

Also, for the Supervisor Pod VM we have relied on this API.
Refer to https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/1702/files
Why can we not use the same API for a Vanilla Node VM?

Contributor:

searchIndex.FindByUuid returns results from the VC inventory, and it could return an incorrect answer when the inventory is stale. IIRC the VC inventory can be stale in a few cases (VC restored from a backup, host sync issues, etc.); we should confirm this with the VC team. Instead of relying on VM existence in the VC inventory, it's better to have a reliable way to know whether the disk is really attached to any VM, and that check should actually happen at the ESX host layer. If such an API does not exist today, we should file a request. As far as this PR goes, we can commit it with the above caveats.
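
For illustration only, the kind of reliability check described above might look like the sketch below. VolumeAttachmentChecker, IsVolumeAttachedToAnyVM, and safeToMarkDetached are hypothetical names; per the comment above, no such CNS/FCD API is exposed to the driver today.

package main

import (
	"context"
	"fmt"
)

// VolumeAttachmentChecker is a hypothetical interface for the reviewer's
// suggestion: before marking a detach successful, ask the storage layer
// (ideally verified at the ESX host level) whether the FCD is still attached
// anywhere. No such API is exposed by CNS today.
type VolumeAttachmentChecker interface {
	IsVolumeAttachedToAnyVM(ctx context.Context, volumeID string) (bool, error)
}

// safeToMarkDetached returns true only when the checker confirms the volume is
// not attached to any VM.
func safeToMarkDetached(ctx context.Context, c VolumeAttachmentChecker, volumeID string) (bool, error) {
	attached, err := c.IsVolumeAttachedToAnyVM(ctx, volumeID)
	if err != nil {
		return false, err // cannot tell; keep treating the detach as failed
	}
	return !attached, nil
}

// fakeChecker is a stand-in implementation used only for this example.
type fakeChecker struct{ attached bool }

func (f fakeChecker) IsVolumeAttachedToAnyVM(ctx context.Context, volumeID string) (bool, error) {
	return f.attached, nil
}

func main() {
	ok, _ := safeToMarkDetached(context.Background(), fakeChecker{attached: false}, "487b981f-4ce4-496d-9700-a153491ceed5")
	fmt.Println("safe to mark detached:", ok)
}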

@svcbot-qecnsdp

Started vanilla Block pipeline... Build Number: 1337

@svcbot-qecnsdp

Block vanilla build status: FAILURE 
Stage before exit: testbed-deployment 

@svcbot-qecnsdp

Started vanilla Block pipeline... Build Number: 1350

@svcbot-qecnsdp

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
JUnit report was created: /home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/junit.xml

Ran 1 of 616 Specs in 428.544 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 615 Skipped
PASS

Ginkgo ran 1 suite in 8m56.918829132s
Test Suite Passed
--
JUnit report was created: /home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/junit.xml

Ran 13 of 616 Specs in 5431.352 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 603 Skipped
PASS

Ginkgo ran 1 suite in 1h30m57.505120296s
Test Suite Passed
--
/home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/operationstorm.go:216

Ran 41 of 616 Specs in 691.124 seconds
FAIL! -- 39 Passed | 2 Failed | 0 Pending | 575 Skipped


Ginkgo ran 1 suite in 11m56.939342615s
Test Suite Failed

@divyenpatel (Member Author):

@SandeepPissay I have executed the e2e pipeline - #1879 (comment)

and have also updated the manual test results on the PR, covering deletion of the Node VM and automatic removal of the VolumeAttachment without manual intervention.

@SandeepPissay (Contributor):

/lgtm
