Description
Environment:
- HPE CSI Driver version: Upgraded from 2.5.2 to 3.0.2 (deployed via Helm)
- Storage Arrays: HPE Primera and HPE Alletra B10000 (Alletra Storage MP)
- Storage Protocol: iSCSI
- Orchestrator: Kubernetes 1.32.9
- Context: Real production services
Bug Summary & Root Cause Analysis:
We are actively using the HPE CSI driver in our production services with HPE Primera and Alletra B10000 arrays via iSCSI. After upgrading to version 3.0.2, deleting PVCs frequently leaves the PVs stuck in a Terminating/Released state indefinitely. We have reproduced the issue reliably and isolated the root cause to a regression/race condition in how the 3.0.2 driver handles VLUN removal over iSCSI.
The root cause: ControllerUnpublishVolume is returning SUCCESS to Kubernetes before the physical storage array has actually finished removing the VLUN mapping.
Because K8s receives this premature success, it immediately calls DeleteVolume. That call correctly fails (the array returns 409 Conflict, which the CSP maps to HTTP 500) because the array protects a VV that still has a VLUN mapped. The result is an infinite retry loop in K8s.
This is an intermittent race condition that becomes highly reproducible when many Pods/PVCs are deleted in bulk.
Steps to Reproduce (Real Execution Trace):
Executing multiple deletions simultaneously triggers the race condition.
# 1. Bulk delete multiple deployments and PVCs simultaneously
$ kubectl delete -f deployment-pvc2000-standard01.yaml & kubectl delete -f deployment-pvc2000-standard02.yaml & kubectl delete -f deployment-pvc2000-standard03.yaml & kubectl delete -f first-test-pvc2000-velero-standard01.yaml & kubectl delete -f first-test-pvc2000-velero-standard02.yaml & kubectl delete -f first-test-pvc2000-velero-standard03.yaml
[1] 46887
[2] 46888
[3] 46889
...
[1] Done kubectl delete -f deployment-pvc2000-standard01.yaml
[2] Done kubectl delete -f deployment-pvc2000-standard02.yaml
[3] Done kubectl delete -f deployment-pvc2000-standard03.yaml
# 2. Most PVs are successfully deleted:
$ kubectl get pv pvc-17d2e393-64a4-4705-9996-46523049244b
Error from server (NotFound): persistentvolumes "pvc-17d2e393-64a4-4705-9996-46523049244b" not found
$ kubectl get pv pvc-910a9d7f-88a0-442e-a516-8db67947e35c
Error from server (NotFound): persistentvolumes "pvc-910a9d7f-88a0-442e-a516-8db67947e35c" not found
# 3. However, some PVs get completely stuck due to the 409 Conflict:
$ kubectl get pv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec 3Gi RWO Delete Released kamada/my-first-pvc2000-velero-standard03 hpe-standard03 3m20s
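To enumerate every volume stuck this way after a bulk delete, a check like the following works (a minimal client-go sketch; the kubeconfig path, and using the Released phase as the stuck indicator per the output above, are assumptions):

// list-stuck-pvs.go - lists PVs left in the Released phase after a bulk delete.
// Sketch only; assumes kubeconfig at the default location.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pvs, err := client.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pv := range pvs.Items {
		// PVs hit by this bug stay in the Released phase indefinitely.
		if pv.Status.Phase == corev1.VolumeReleased {
			claim := ""
			if pv.Spec.ClaimRef != nil {
				claim = pv.Spec.ClaimRef.Namespace + "/" + pv.Spec.ClaimRef.Name
			}
			fmt.Printf("%s\t%s\t%s\n", pv.Name, pv.Status.Phase, claim)
		}
	}
}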
Evidence 1: Timing comparison from CSI Controller Logs (The Smoking Gun)
We compared the logs of a successful deletion (pvc-17d2e...) with those of the failed deletion (pvc-808d2c...). A physical Primera/Alletra array takes roughly 2 seconds to unmap a VLUN.
[Failed Run - Bug occurring for pvc-808d2c...]
The gap between the Unpublish call and the DeleteVolume call is only 0.48 seconds, meaning the driver returned success far too fast for the array to have actually completed the unmap. K8s immediately calls DeleteVolume and is rejected.
# 1. Unpublish is called
I0227 02:46:15.186976 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"node_id\":\"...\",\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"
# 2. Only 0.48 seconds later, DeleteVolume is called (Driver returned success prematurely!)
I0227 02:46:15.666214 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"
# 3. DeleteVolume immediately fails because the VLUN is still there on the array
time="2026-02-27T02:46:15Z" level=error msg="Error deleting the volume pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec, err: rpc error: code = Internal desc = Error while deleting volume ... err: Request failed with status code 500
[Successful Run - Normal behavior for pvc-17d2e...]
In a successful run, the gap between Unpublish and DeleteVolume is ~2.08 seconds, meaning the driver actually waited for the array to finish.
# 1. Unpublish is called
I0227 02:46:17.638852 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"
# 2. ~2.08 seconds later, DeleteVolume is called
I0227 02:46:19.718236 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"
# 3. Volume deleted successfully
I0227 02:46:20.041216 1 controller.go:1574] "Volume deleted" PV="pvc-17d2e393-64a4-4705-9996-46523049244b"
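These gaps can be extracted systematically from the controller log rather than by eyeballing timestamps. A small sketch that computes the Unpublish-to-Delete gap for one volume from a klog-format log (the log file name is an assumption; the timestamp format matches the excerpts above):

// unpublish-delete-gap.go - computes the gap between the ControllerUnpublishVolume
// and DeleteVolume gRPC calls for one volume, from a klog-format controller log.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// parseKlogTime extracts the HH:MM:SS.micros timestamp from a klog header
// such as "I0227 02:46:15.186976 ...". The date is ignored for a same-day diff.
func parseKlogTime(line string) (time.Time, bool) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return time.Time{}, false
	}
	t, err := time.Parse("15:04:05.000000", fields[1])
	return t, err == nil
}

func main() {
	vol := "pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec" // volume to inspect
	f, err := os.Open("hpe-csi-controller.log")       // assumed file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var unpublish, del time.Time
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, vol) {
			continue
		}
		if t, ok := parseKlogTime(line); ok {
			switch {
			case strings.Contains(line, "ControllerUnpublishVolume"):
				unpublish = t
			case strings.Contains(line, "DeleteVolume") && del.IsZero():
				del = t // first DeleteVolume attempt only; K8s retries afterwards
			}
		}
	}
	if !unpublish.IsZero() && !del.IsZero() {
		fmt.Printf("%s: Unpublish -> Delete gap: %v\n", vol, del.Sub(unpublish))
	}
}

Run against the attached controller log, this should print roughly the 0.48 s gap shown above for the failed volume and ~2.08 s for the successful one.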
Evidence 2: The storage array confirms the VLUN was NEVER removed
Checking the array directly via CLI (showvlun) confirms the VLUN remains indefinitely for the stuck volume:
fuga-int-c00-g02-hxc0002 cli% showvlun -vv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type
0 pvc-808d2c4b-f292-4aab-a1c9-... iqn-hoge-w4523-dev ---------------- 0:4:1 matched set
Attached Logs:
The attached 20260227-hpe-csi-driver-3.0.2-errorlog.zip contains the full hpe-csi-controller and CSP logs capturing this behavior.
Expected Behavior:
As in version 2.5.2, ControllerUnpublishVolume MUST synchronously wait for the array to actually complete the VLUN removal, and return success to K8s ONLY once the volume is fully unmapped. The current async/non-blocking behavior (or swallowed error) in 3.0.2 breaks the deletion sequence completely and is critically impacting our production environments. A rough sketch of the wait loop we expect is below.
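For clarity, this is the kind of blocking wait we mean. It is a minimal sketch only: the ArrayClient interface and its RemoveVLUN/VLUNExists methods are hypothetical placeholders, not the actual CSP or WSAPI surface.

// Sketch of a synchronous VLUN-removal wait inside the unpublish path.
package sketch

import (
	"context"
	"fmt"
	"time"
)

// ArrayClient is a hypothetical stand-in for the CSP's array client.
type ArrayClient interface {
	RemoveVLUN(ctx context.Context, volumeID string) error         // issue the unmap
	VLUNExists(ctx context.Context, volumeID string) (bool, error) // poll array state
}

// waitForVLUNRemoval returns nil only once the array no longer reports a
// VLUN for the volume, so a subsequent DeleteVolume cannot hit 409 Conflict.
func waitForVLUNRemoval(ctx context.Context, c ArrayClient, volumeID string) error {
	if err := c.RemoveVLUN(ctx, volumeID); err != nil {
		return err
	}
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		exists, err := c.VLUNExists(ctx, volumeID)
		if err != nil {
			return err
		}
		if !exists {
			return nil // safe to report success to Kubernetes now
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for VLUN removal of %s: %w", volumeID, ctx.Err())
		case <-ticker.C:
		}
	}
}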
Thanks,
Aki