
Mark volume detach as success when Node VM is deleted from vCenter #1879

Conversation

@divyenpatel (Member) commented Jul 20, 2022

What this PR does / why we need it:
Handles the case of detaching a volume when the Node VM has been deleted from the VC inventory.

Which issue this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close that issue when PR gets merged): fixes #
Fixes a known issue in the vSphere CSI Driver.

Refer to the issue "Persistent volume fails to be detached from a node" in the release notes: https://docs.vmware.com/en/VMware-vSphere-Container-Storage-Plug-in/2.6/rn/vmware-vsphere-container-storage-plugin-26-release-notes/index.html
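
At a high level, the fix makes ControllerUnpublishVolume treat a missing Node VM as a successful detach instead of an internal error. Below is a minimal, self-contained sketch of that decision logic, not the driver code itself; decideUnpublish is a hypothetical helper, and ErrVMNotFound stands in for cnsvsphere.ErrVMNotFound from the code change further down.

package main

import (
	"errors"
	"fmt"
)

// ErrVMNotFound stands in for cnsvsphere.ErrVMNotFound in the driver.
var ErrVMNotFound = errors.New("virtual machine wasn't found")

// decideUnpublish sketches the new ControllerUnpublishVolume behavior: if node
// discovery fails because the VM is gone from the vCenter inventory, report the
// detach as successful so Kubernetes can clean up the VolumeAttachment; any
// other discovery error is still surfaced as a failure.
func decideUnpublish(discoveryErr error, volumeID, nodeID string) error {
	if discoveryErr == nil {
		return nil // VM found: continue with the normal detach path
	}
	if errors.Is(discoveryErr, ErrVMNotFound) {
		fmt.Printf("VM for node %q not found in VC inventory; marking detach of volume %q as successful\n",
			nodeID, volumeID)
		return nil
	}
	return fmt.Errorf("failed to find VirtualMachine for node %q: %w", nodeID, discoveryErr)
}

func main() {
	// Example: the Node VM was removed from vCenter, so the detach is reported as successful.
	fmt.Println(decideUnpublish(ErrVMNotFound, "487b981f-4ce4-496d-9700-a153491ceed5", "42227b0a-e117-14d4-f83e-8712b94a068e"))
}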

Testing done:

  • Created a StatefulSet with 2 replicas using vSphere CSI Driver volumes.

# kubectl get pods -o wide
NAME    READY   STATUS    RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Running   0          23m   10.244.7.2    k8s-node-2-1658516767     <none>           <none>
web-1   1/1     Running   0          30m   10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get pvc
NAME        STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS                        AGE
www-web-0   Bound    pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   1Gi        RWO            example-vanilla-rwo-filesystem-sc   94s
www-web-1   Bound    pvc-875daf5d-2e75-4924-ab65-c3460516d58b   1Gi        RWO            example-vanilla-rwo-filesystem-sc   79s

# kubectl get volumeattachment
NAME                                                                   ATTACHER                 PV                                         NODE                      ATTACHED   AGE
csi-614022547a4ef8aed367d85ac85157dd7b7239532f7a095f5535743d070e54a8   csi.vsphere.vmware.com   pvc-875daf5d-2e75-4924-ab65-c3460516d58b   k8s-node-989-1658516731   true       32m
csi-76efd889ad91ab975990ee11d454649c511111d4c4f68129198bfc5a31f216d2   csi.vsphere.vmware.com   pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   k8s-node-2-1658516767     true       12m

  • From vCenter, powered off the VM for the Kubernetes node (k8s-node-2-1658516767) and removed it from the vCenter inventory.
  • The node went to NotReady state in Kubernetes:
# kubectl get node -o wide
NAME                         STATUS     ROLES           AGE    VERSION   INTERNAL-IP      EXTERNAL-IP      OS-IMAGE             KERNEL-VERSION     CONTAINER-RUNTIME
k8s-control-599-1658516675   Ready      control-plane   4h5m   v1.24.1   10.185.255.64    10.185.255.64    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-control-677-1658516694   Ready      control-plane   4h4m   v1.24.1   10.185.241.108   10.185.241.108   Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-control-865-1658516712   Ready      control-plane   4h3m   v1.24.1   10.185.245.31    10.185.245.31    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-node-2-1658516767        NotReady   <none>          38m    v1.24.1   10.185.242.253   10.185.242.253   Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
k8s-node-989-1658516731      Ready      <none>          4h1m   v1.24.1   10.185.252.70    10.185.252.70    Ubuntu 20.04.2 LTS   5.4.0-66-generic   containerd://1.6.6
  • Deleted the node from Kubernetes:
# kubectl delete node k8s-node-2-1658516767
node "k8s-node-2-1658516767" deleted
  • The pod was rescheduled and the volume detached successfully even though the Node VM was no longer present in vCenter:
# kubectl get pods -o wide
NAME    READY   STATUS        RESTARTS   AGE   IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   1/1     Terminating   0          37m   10.244.7.2    k8s-node-2-1658516767     <none>           <none>
web-1   1/1     Running       0          44m   10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get pods -o wide
NAME    READY   STATUS              RESTARTS   AGE     IP            NODE                      NOMINATED NODE   READINESS GATES
web-0   0/1     ContainerCreating   0          4m11s   <none>        k8s-node-989-1658516731   <none>           <none>
web-1   1/1     Running             0          51m     10.244.3.28   k8s-node-989-1658516731   <none>           <none>

# kubectl get volumeattachment
NAME                                                                   ATTACHER                 PV                                         NODE                      ATTACHED   AGE
csi-046f0ccd3ac2b5e9e85043effec82ae378bcb6f81883c556f4919f3c869f1cc7   csi.vsphere.vmware.com   pvc-886f971d-111f-4d55-ac55-029f0c31e6a2   k8s-node-989-1658516731   true       15s
csi-614022547a4ef8aed367d85ac85157dd7b7239532f7a095f5535743d070e54a8   csi.vsphere.vmware.com   pvc-875daf5d-2e75-4924-ab65-c3460516d58b   k8s-node-989-1658516731   true       53m


Log

2022-07-22T23:30:23.608Z	INFO	vanilla/controller.go:1169	ControllerUnpublishVolume: called with args {VolumeId:487b981f-4ce4-496d-9700-a153491ceed5 NodeId:42227b0a-e117-14d4-f83e-8712b94a068e Secrets:map[] XXX_NoUnkeyedLiteral:{} XXX_unrecognized:[] XXX_sizecache:0}	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.642Z	INFO	node/manager.go:195	Node hasn't been discovered yet with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.642Z	INFO	vsphere/virtualmachine.go:147	Initiating asynchronous datacenter listing with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	INFO	vsphere/datacenter.go:151	Publishing datacenter Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145]	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.657Z	DEBUG	vsphere/virtualmachine.go:163	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.658Z	INFO	vsphere/virtualmachine.go:184	AsyncGetAllDatacenters with uuid 42227b0a-e117-14d4-f83e-8712b94a068e sent a dc Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145]	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.662Z	ERROR	vsphere/datacenter.go:104	Couldn't find VM given uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	WARN	vsphere/virtualmachine.go:188	Couldn't find VM given uuid 42227b0a-e117-14d4-f83e-8712b94a068e on DC Datacenter [Datacenter: Datacenter:datacenter-3 @ /VSAN-DC, VirtualCenterHost: 10.185.254.145] with err: virtual machine wasn't found, continuing search	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	DEBUG	vsphere/virtualmachine.go:179	AsyncGetAllDatacenters finished with uuid 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.663Z	ERROR	vsphere/virtualmachine.go:215	Returning VM not found err for UUID 42227b0a-e117-14d4-f83e-8712b94a068e	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	ERROR	node/manager.go:138	Couldn't find VM instance with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e, failed to discover with err: virtual machine wasn't found	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	ERROR	node/manager.go:207	failed to discover node with nodeUUID 42227b0a-e117-14d4-f83e-8712b94a068e with err: virtual machine wasn't found	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	INFO	vanilla/controller.go:1251	Virtual Machine for Node ID: 42227b0a-e117-14d4-f83e-8712b94a068e is not present in the VC Inventory. Marking ControllerUnpublishVolume for Volume: "487b981f-4ce4-496d-9700-a153491ceed5" as successful.	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.664Z	DEBUG	vanilla/controller.go:1268	controllerUnpublishVolumeInternal: returns fault "" for volume "487b981f-4ce4-496d-9700-a153491ceed5"	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}
2022-07-22T23:30:23.665Z	INFO	vanilla/controller.go:1276	Volume "487b981f-4ce4-496d-9700-a153491ceed5" detached successfully from node "42227b0a-e117-14d4-f83e-8712b94a068e".	{"TraceId": "61a01c49-65ef-408a-88b2-8e19fef09a85"}

Special notes for your reviewer:

Release note:

Mark volume detach as success when Node VM is deleted from vCenter

@k8s-ci-robot added the cncf-cla: yes (Indicates the PR's author has signed the CNCF CLA.) label on Jul 20, 2022
@k8s-ci-robot added the size/S (Denotes a PR that changes 10-29 lines, ignoring generated files.) and approved (Indicates a PR has been approved by an approver from all required OWNERS files.) labels on Jul 20, 2022
@divyenpatel force-pushed the fix-detach-when-node-vm-is-deleted branch from 5d48733 to 5571cf5 on July 20, 2022 06:26
@xing-yang (Contributor):

This seems to be a reasonable fix.

@xing-yang (Contributor):

/approve

@SandeepPissay (Contributor):

The code change looks good to me. Can you run the CI jobs?

@chethanv28 (Collaborator) left a comment:

/approve

@k8s-ci-robot (Contributor):

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chethanv28, divyenpatel, xing-yang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [chethanv28,divyenpatel,xing-yang]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -1247,8 +1247,14 @@ func (c *controller) ControllerUnpublishVolume(ctx context.Context, req *csi.Con
 			node, err = c.nodeMgr.GetNodeByName(ctx, req.NodeId)
 		}
 		if err != nil {
-			return nil, csifault.CSIInternalFault, logger.LogNewErrorCodef(log, codes.Internal,
-				"failed to find VirtualMachine for node:%q. Error: %v", req.NodeId, err)
+			if err == cnsvsphere.ErrVMNotFound {
Contributor:

I'm not sure if we can infer that the disk is detached when we get ErrVmNotFound. A better way to be sure is to check whether the volume is attached to any VM. FCD has an API for that, but that API is not currently exposed by CNS. Can CNS expose this API? Once available, CSI can invoke it to be sure.

Member Author:

I'm not sure if we can infer that the disk is detached when we get ErrVmNotFound.

@SandeepPissay are you saying the searchIndex.FindByUuid API call cannot determine whether the VM has been deleted from the VC inventory?
If the VM is not found in the VC inventory, what is stopping us from letting Kubernetes know that this volume can be marked as detached from the requested node? Why do we need to check whether the FCD is attached to any other VM? Here we are making a change in the ControllerUnpublishVolume call.

svm, err := searchIndex.FindByUuid(ctx, dc.Datacenter, uuid, true, &instanceUUID)
if err != nil {
	log.Errorf("failed to find VM given uuid %s with err: %v", uuid, err)
	return nil, err
} else if svm == nil {
	log.Errorf("Couldn't find VM given uuid %s", uuid)
	return nil, ErrVMNotFound
}

Member Author:

Also, for the Supervisor Pod VM we have relied on this API.
Refer to https://github.com/kubernetes-sigs/vsphere-csi-driver/pull/1702/files
Why can we not use the same API for a Vanilla Node VM?

Contributor:

searchIndex.FindByUuid returns results from the VC inventory, and it could return an incorrect answer when the inventory is stale. IIRC the VC inventory can be stale in a few cases (VC restored from a backup, host sync issues, etc.); we should confirm this with the VC team. Instead of relying on VM existence in the VC inventory, it's better to have a reliable way to know whether the disk is really attached to any VM, and that check should actually happen at the ESX host layer. If such an API does not exist today, we should file a request. As far as this PR goes, we can commit it with the above caveats.
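
For illustration only, the kind of reliability check described above might look like the sketch below. VolumeAttachmentChecker, IsVolumeAttachedToAnyVM, and safeToMarkDetached are hypothetical names; per the comment above, no such CNS/FCD API is exposed to the driver today.

package main

import (
	"context"
	"fmt"
)

// VolumeAttachmentChecker is a hypothetical interface for the reviewer's
// suggestion: before marking a detach successful, ask the storage layer
// (ideally verified at the ESX host level) whether the FCD is still attached
// anywhere. No such API is exposed by CNS today.
type VolumeAttachmentChecker interface {
	IsVolumeAttachedToAnyVM(ctx context.Context, volumeID string) (bool, error)
}

// safeToMarkDetached returns true only when the checker confirms the volume is
// not attached to any VM.
func safeToMarkDetached(ctx context.Context, c VolumeAttachmentChecker, volumeID string) (bool, error) {
	attached, err := c.IsVolumeAttachedToAnyVM(ctx, volumeID)
	if err != nil {
		return false, err // cannot tell; keep treating the detach as failed
	}
	return !attached, nil
}

// fakeChecker is a stand-in implementation used only for this example.
type fakeChecker struct{ attached bool }

func (f fakeChecker) IsVolumeAttachedToAnyVM(ctx context.Context, volumeID string) (bool, error) {
	return f.attached, nil
}

func main() {
	ok, _ := safeToMarkDetached(context.Background(), fakeChecker{attached: false}, "487b981f-4ce4-496d-9700-a153491ceed5")
	fmt.Println("safe to mark detached:", ok)
}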

@svcbot-qecnsdp

Started vanilla Block pipeline... Build Number: 1337

@svcbot-qecnsdp

Block vanilla build status: FAILURE 
Stage before exit: testbed-deployment 

@svcbot-qecnsdp

Started vanilla Block pipeline... Build Number: 1350

@svcbot-qecnsdp

Block vanilla build status: FAILURE 
Stage before exit: e2e-tests 
Jenkins E2E Test Results: 
JUnit report was created: /home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/junit.xml

Ran 1 of 616 Specs in 428.544 seconds
SUCCESS! -- 1 Passed | 0 Failed | 0 Pending | 615 Skipped
PASS

Ginkgo ran 1 suite in 8m56.918829132s
Test Suite Passed
--
JUnit report was created: /home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/junit.xml

Ran 13 of 616 Specs in 5431.352 seconds
SUCCESS! -- 13 Passed | 0 Failed | 0 Pending | 603 Skipped
PASS

Ginkgo ran 1 suite in 1h30m57.505120296s
Test Suite Passed
--
/home/worker/workspace/Block-Vanilla@4/Results/1350/vsphere-csi-driver/tests/e2e/operationstorm.go:216

Ran 41 of 616 Specs in 691.124 seconds
FAIL! -- 39 Passed | 2 Failed | 0 Pending | 575 Skipped


Ginkgo ran 1 suite in 11m56.939342615s
Test Suite Failed

@divyenpatel (Member Author):

@SandeepPissay I have executed the e2e pipeline - #1879 (comment)

and have also updated the manual test results on the PR, covering deletion of the Node VM and automatic removal of the VolumeAttachment without manual intervention.

@SandeepPissay (Contributor):

/lgtm
