
Regression in hpe-csi-driver helm 3.0.2: DeleteVolume fails with 409 "resource in use" on Primera/Alletra B10000 due to race condition / premature success in ControllerUnpublishVolume #518

@aki-kamada

Description


Environment:

  • HPE CSI Driver version: Upgraded from 2.5.2 to 3.0.2 (Deployed via Helm)
  • Storage Arrays: HPE Primera and HPE Alletra B10000 (Alletra Storage MP)
  • Storage Protocol: iSCSI
  • Orchestrator: Kubernetes 1.32.9
  • Context: Real production services

Bug Summary & Root Cause Analysis:
We are actively using the HPE CSI driver in production with HPE Primera and Alletra B10000 arrays over iSCSI. After upgrading to version 3.0.2, deleting PVCs frequently leaves the corresponding PVs stuck in the Released state indefinitely. We have reproduced the issue and isolated the root cause to a regression/race condition in how the 3.0.2 driver handles VLUN removal over iSCSI.

The root cause: ControllerUnpublishVolume returns SUCCESS to Kubernetes before the storage array has actually finished removing the VLUN mapping.
Because K8s receives this premature success response, it immediately calls DeleteVolume. That call correctly fails (the array returns 409 Conflict, which the CSP maps to HTTP 500) because the array refuses to delete a VV whose VLUN is still mapped. The result is an infinite retry loop in K8s.

This is an intermittent race condition that becomes highly reproducible when bulk deleting Pods/PVCs simultaneously.

Steps to Reproduce (Real Execution Trace):
Executing multiple deletions simultaneously triggers the race condition.

# 1. Bulk delete multiple deployments and PVCs simultaneously
$ kubectl delete -f deployment-pvc2000-standard01.yaml & kubectl delete -f deployment-pvc2000-standard02.yaml & kubectl delete -f deployment-pvc2000-standard03.yaml & kubectl delete -f first-test-pvc2000-velero-standard01.yaml & kubectl delete -f first-test-pvc2000-velero-standard02.yaml & kubectl delete -f first-test-pvc2000-velero-standard03.yaml
[1] 46887
[2] 46888
[3] 46889
...
[1]   Done                    kubectl delete -f deployment-pvc2000-standard01.yaml
[2]   Done                    kubectl delete -f deployment-pvc2000-standard02.yaml
[3]   Done                    kubectl delete -f deployment-pvc2000-standard03.yaml

# 2. Most PVs are successfully deleted:
$ kubectl get pv pvc-17d2e393-64a4-4705-9996-46523049244b
Error from server (NotFound): persistentvolumes "pvc-17d2e393-64a4-4705-9996-46523049244b" not found

$ kubectl get pv pvc-910a9d7f-88a0-442e-a516-8db67947e35c
Error from server (NotFound): persistentvolumes "pvc-910a9d7f-88a0-442e-a516-8db67947e35c" not found

# 3. However, some PVs get completely stuck due to the 409 Conflict:
$ kubectl get pv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
NAME                                       CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS     CLAIM                                       STORAGECLASS     REASON   AGE
pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec   3Gi        RWO            Delete           Released   kamada/my-first-pvc2000-velero-standard03   hpe-standard03            3m20s

Evidence 1: Timing comparison from CSI Controller Logs (The Smoking Gun)
We compared the logs of a successful deletion (pvc-17d2e...) against the failed deletion (pvc-808d2c...). A physical Primera/Alletra array takes roughly 2 seconds to unmap a VLUN.

[Failed Run - Bug occurring for pvc-808d2c...]
The driver returns success for Unpublish in just 0.48 seconds, far too fast for the array to have actually completed the unmap. K8s immediately calls DeleteVolume and gets rejected.

# 1. Unpublish is called
I0227 02:46:15.186976 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"node_id\":\"...\",\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"

# 2. Only 0.48 seconds later, DeleteVolume is called (Driver returned success prematurely!)
I0227 02:46:15.666214 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"

# 3. DeleteVolume immediately fails because the VLUN is still there on the array
time="2026-02-27T02:46:15Z" level=error msg="Error deleting the volume pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec, err: rpc error: code = Internal desc = Error while deleting volume ... err: Request failed with status code 500

[Successful Run - Normal behavior for pvc-17d2e...]
In a successful run, the gap between Unpublish and DeleteVolume is ~2.08 seconds, meaning it properly waited for the array.

# 1. Unpublish is called
I0227 02:46:17.638852 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"

# 2. ~2.08 seconds later, DeleteVolume is called
I0227 02:46:19.718236 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"

# 3. Volume deleted successfully
I0227 02:46:20.041216 1 controller.go:1574] "Volume deleted" PV="pvc-17d2e393-64a4-4705-9996-46523049244b"

Evidence 2: The storage array confirms the VLUN was NEVER removed
Checking the array directly via CLI (showvlun) confirms the VLUN remains indefinitely for the stuck volume:

fuga-int-c00-g02-hxc0002 cli% showvlun -vv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
Lun VVName                          HostName          -Host_WWN/iSCSI_Name-  Port        Type
  0 pvc-808d2c4b-f292-4aab-a1c9-... iqn-hoge-w4523-dev ----------------      0:4:1 matched set

Attached Logs:
Please find the attached 20260227-hpe-csi-driver-3.0.2-errorlog.zip which contains the full hpe-csi-controller and CSP logs capturing this behavior.

Expected Behavior:
Just like in version 2.5.2, ControllerUnpublishVolume MUST wait synchronously for the array to complete the VLUN removal, and return success to K8s only once the volume is fully unmapped. The current async/non-blocking behavior (or swallowed error) in 3.0.2 breaks the deletion sequence completely. This is critically impacting our production environments.


Thanks,
Aki

