Description
Environment:
- HPE CSI Driver version: Upgraded from 2.5.2 to 3.0.2 (deployed via Helm)
- Storage Arrays: HPE Primera and HPE Alletra B10000 (Alletra Storage MP)
- Storage Protocol: iSCSI
- Orchestrator: Kubernetes 1.32.9
- Context: Real production services
Bug Summary & Root Cause Analysis:
We are actively using the HPE CSI driver in our production services with HPE Primera and Alletra B10000 arrays via iSCSI. After upgrading to version 3.0.2, deleting PVCs frequently leaves the PVs stuck in a Terminating/Released state indefinitely. We have reproduced the issue reliably and isolated the root cause to a regression/race condition in how the 3.0.2 driver handles VLUN removal over iSCSI.
The root cause: ControllerUnpublishVolume is returning SUCCESS to Kubernetes before the physical storage array has actually finished removing the VLUN mapping.
Because K8s receives this premature success, it immediately calls DeleteVolume. That call correctly fails (the array returns 409 Conflict, which the CSP maps to HTTP 500) because the array protects a VV that still has a VLUN mapped. The result is an infinite retry loop in K8s.
This is an intermittent race condition that becomes highly reproducible when many Pods/PVCs are deleted in bulk.
Steps to Reproduce (Real Execution Trace):
Executing multiple deletions simultaneously triggers the race condition.
# 1. Bulk delete multiple deployments and PVCs simultaneously
$ kubectl delete -f deployment-pvc2000-standard01.yaml & kubectl delete -f deployment-pvc2000-standard02.yaml & kubectl delete -f deployment-pvc2000-standard03.yaml & kubectl delete -f first-test-pvc2000-velero-standard01.yaml & kubectl delete -f first-test-pvc2000-velero-standard02.yaml & kubectl delete -f first-test-pvc2000-velero-standard03.yaml
[1] 46887
[2] 46888
[3] 46889
...
[1] Done kubectl delete -f deployment-pvc2000-standard01.yaml
[2] Done kubectl delete -f deployment-pvc2000-standard02.yaml
[3] Done kubectl delete -f deployment-pvc2000-standard03.yaml
# 2. Most PVs are successfully deleted:
$ kubectl get pv pvc-17d2e393-64a4-4705-9996-46523049244b
Error from server (NotFound): persistentvolumes "pvc-17d2e393-64a4-4705-9996-46523049244b" not found
$ kubectl get pv pvc-910a9d7f-88a0-442e-a516-8db67947e35c
Error from server (NotFound): persistentvolumes "pvc-910a9d7f-88a0-442e-a516-8db67947e35c" not found
# 3. However, some PVs get completely stuck due to the 409 Conflict:
$ kubectl get pv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec 3Gi RWO Delete Released kamada/my-first-pvc2000-velero-standard03 hpe-standard03 3m20s
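To enumerate every volume stuck this way after a bulk delete, a check like the following works (a minimal client-go sketch; the kubeconfig path, and using the Released phase as the stuck indicator per the output above, are assumptions):

// list-stuck-pvs.go - lists PVs left in the Released phase after a bulk delete.
// Sketch only; assumes kubeconfig at the default location.
package main

import (
	"context"
	"fmt"
	"os"
	"path/filepath"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	kubeconfig := filepath.Join(os.Getenv("HOME"), ".kube", "config")
	config, err := clientcmd.BuildConfigFromFlags("", kubeconfig)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(config)

	pvs, err := client.CoreV1().PersistentVolumes().List(context.TODO(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, pv := range pvs.Items {
		// PVs hit by this bug stay in the Released phase indefinitely.
		if pv.Status.Phase == corev1.VolumeReleased {
			claim := ""
			if pv.Spec.ClaimRef != nil {
				claim = pv.Spec.ClaimRef.Namespace + "/" + pv.Spec.ClaimRef.Name
			}
			fmt.Printf("%s\t%s\t%s\n", pv.Name, pv.Status.Phase, claim)
		}
	}
}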
Evidence 1: Timing comparison from CSI Controller Logs (The Smoking Gun)
We compared the logs of a successful deletion (pvc-17d2e...) with those of the failed deletion (pvc-808d2c...). A physical Primera/Alletra array takes roughly 2 seconds to unmap a VLUN.
[Failed Run - Bug occurring for pvc-808d2c...]
The gap between the Unpublish call and the DeleteVolume call is only 0.48 seconds, meaning the driver returned success far too fast for the array to have actually completed the unmap. K8s immediately calls DeleteVolume and is rejected.
# 1. Unpublish is called
I0227 02:46:15.186976 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"node_id\":\"...\",\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"
# 2. Only 0.48 seconds later, DeleteVolume is called (Driver returned success prematurely!)
I0227 02:46:15.666214 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec\"}"
# 3. DeleteVolume immediately fails because the VLUN is still there on the array
time="2026-02-27T02:46:15Z" level=error msg="Error deleting the volume pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec, err: rpc error: code = Internal desc = Error while deleting volume ... err: Request failed with status code 500
[Successful Run - Normal behavior for pvc-17d2e...]
In a successful run, the gap between Unpublish and DeleteVolume is ~2.08 seconds, meaning the driver actually waited for the array to finish.
# 1. Unpublish is called
I0227 02:46:17.638852 1 connection.go:264] "GRPC call" driver="csi.hpe.com" method="/csi.v1.Controller/ControllerUnpublishVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"
# 2. ~2.08 seconds later, DeleteVolume is called
I0227 02:46:19.718236 1 connection.go:264] "GRPC call" method="/csi.v1.Controller/DeleteVolume" request="{\"volume_id\":\"pvc-17d2e393-64a4-4705-9996-46523049244b\"}"
# 3. Volume deleted successfully
I0227 02:46:20.041216 1 controller.go:1574] "Volume deleted" PV="pvc-17d2e393-64a4-4705-9996-46523049244b"
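These gaps can be extracted systematically from the controller log rather than by eyeballing timestamps. A small sketch that computes the Unpublish-to-Delete gap for one volume from a klog-format log (the log file name is an assumption; the timestamp format matches the excerpts above):

// unpublish-delete-gap.go - computes the gap between the ControllerUnpublishVolume
// and DeleteVolume gRPC calls for one volume, from a klog-format controller log.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// parseKlogTime extracts the HH:MM:SS.micros timestamp from a klog header
// such as "I0227 02:46:15.186976 ...". The date is ignored for a same-day diff.
func parseKlogTime(line string) (time.Time, bool) {
	fields := strings.Fields(line)
	if len(fields) < 2 {
		return time.Time{}, false
	}
	t, err := time.Parse("15:04:05.000000", fields[1])
	return t, err == nil
}

func main() {
	vol := "pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec" // volume to inspect
	f, err := os.Open("hpe-csi-controller.log")       // assumed file name
	if err != nil {
		panic(err)
	}
	defer f.Close()

	var unpublish, del time.Time
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if !strings.Contains(line, vol) {
			continue
		}
		if t, ok := parseKlogTime(line); ok {
			switch {
			case strings.Contains(line, "ControllerUnpublishVolume"):
				unpublish = t
			case strings.Contains(line, "DeleteVolume") && del.IsZero():
				del = t // first DeleteVolume attempt only; K8s retries afterwards
			}
		}
	}
	if !unpublish.IsZero() && !del.IsZero() {
		fmt.Printf("%s: Unpublish -> Delete gap: %v\n", vol, del.Sub(unpublish))
	}
}

Run against the attached controller log, this should print roughly the 0.48 s gap shown above for the failed volume and ~2.08 s for the successful one.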
Evidence 2: The storage array confirms the VLUN was NEVER removed
Checking the array directly via CLI (showvlun) confirms the VLUN remains indefinitely for the stuck volume:
fuga-int-c00-g02-hxc0002 cli% showvlun -vv pvc-808d2c4b-f292-4aab-a1c9-5f55c362b5ec
Lun VVName HostName -Host_WWN/iSCSI_Name- Port Type
0 pvc-808d2c4b-f292-4aab-a1c9-... iqn-hoge-w4523-dev ---------------- 0:4:1 matched set
Attached Logs:
The attached 20260227-hpe-csi-driver-3.0.2-errorlog.zip contains the full hpe-csi-controller and CSP logs capturing this behavior.
Expected Behavior:
As in version 2.5.2, ControllerUnpublishVolume MUST synchronously wait for the array to actually complete the VLUN removal, and return success to K8s ONLY once the volume is fully unmapped. The current async/non-blocking behavior (or swallowed error) in 3.0.2 breaks the deletion sequence completely and is critically impacting our production environments. A rough sketch of the wait loop we expect is below.
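For clarity, this is the kind of blocking wait we mean. It is a minimal sketch only: the ArrayClient interface and its RemoveVLUN/VLUNExists methods are hypothetical placeholders, not the actual CSP or WSAPI surface.

// Sketch of a synchronous VLUN-removal wait inside the unpublish path.
package sketch

import (
	"context"
	"fmt"
	"time"
)

// ArrayClient is a hypothetical stand-in for the CSP's array client.
type ArrayClient interface {
	RemoveVLUN(ctx context.Context, volumeID string) error         // issue the unmap
	VLUNExists(ctx context.Context, volumeID string) (bool, error) // poll array state
}

// waitForVLUNRemoval returns nil only once the array no longer reports a
// VLUN for the volume, so a subsequent DeleteVolume cannot hit 409 Conflict.
func waitForVLUNRemoval(ctx context.Context, c ArrayClient, volumeID string) error {
	if err := c.RemoveVLUN(ctx, volumeID); err != nil {
		return err
	}
	ticker := time.NewTicker(500 * time.Millisecond)
	defer ticker.Stop()
	for {
		exists, err := c.VLUNExists(ctx, volumeID)
		if err != nil {
			return err
		}
		if !exists {
			return nil // safe to report success to Kubernetes now
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("timed out waiting for VLUN removal of %s: %w", volumeID, ctx.Err())
		case <-ticker.C:
		}
	}
}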
Thanks,
Aki