Add opt-in DetachReconciler for orphan hot-plug volumes#185
Add opt-in DetachReconciler for orphan hot-plug volumes#185mattia-eleuteri wants to merge 1 commit into
Conversation
The companion commit makes the synchronous detach path VMI-aware, but that only fixes orphans created after the upgrade. Orphans accumulated before upgrading the driver, or created by edge cases the synchronous path still cannot reach (external-attacher giving up after force-detach timeout, VMI in mid-migration when removevolume is issued, downstream forks short-circuiting on missing infra PVC), need a backstop. This commit introduces a background reconciler that runs in the controller process and periodically scans every VMI in the infra cluster namespace, looking for hot-plugs that are no longer referenced by the parent VM spec (or whose parent VM is gone altogether). On the second observation, separated by at least gracePeriod from the first, the reconciler issues RemoveVolumeFromVMI to release the QEMU device and the hot-plug pod that pins the infra storage attachment. The reconciler is opt-in via --enable-detach-reconciler (default off) so existing deployments are unchanged. Two more flags expose the cadence: --detach-reconciler-sync-period (default 5m) and --detach-reconciler-grace-period (default 5m). The grace period absorbs transient divergence around live migrations and normal removevolume API calls; the seen-map is pruned each pass so it does not grow unboundedly. Logging uses klog, matching the rest of the driver — no new dependencies. Signed-off-by: Mattia Eleuteri <mattia@hidora.io>
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
|
Hi @mattia-eleuteri. Thanks for your PR. PRs from untrusted users cannot be marked as trusted with I understand the commands that are listed here. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
kvaps
left a comment
There was a problem hiding this comment.
Thanks for the detailed write-up, but I don't think the motivation holds up for a permanent reconciler, and I'd suggest reshaping this as a documented one-time cleanup instead.
The external-attacher bullet is not accurate: the attacher puts a finalizer on the VolumeAttachment and keeps calling ControllerUnpublishVolume with exponential backoff (capped at 5 minutes by default) until it succeeds — the VA object cannot disappear without a successful unpublish. The 6-minute timeout is the A/D controller's maxWaitForUnmountDuration force-detach: it lets the new attach proceed early (which is exactly how the DRBD failed to set source device readwrite race happens), but the old VA stays in Terminating and keeps being retried. The mid-migration 409 case converges the same way — the attacher simply retries after the migration finishes. So with #184 merged, every orphan that still has a VolumeAttachment self-heals through the normal retry path.
That leaves orphans without a VA, which only exist because older driver versions falsely returned success on unpublish (the bug #184 fixes). That is a finite legacy set, and cleaning a finite set once is a migration, not a reconciliation loop. A documented procedure covers it: list VMIs whose status.volumeStatus[].hotplugVolume entries are absent from the parent VM spec (or whose VM is gone), then virtctl removevolume <vmi> --volume-name=<vol> for each — the same subresource this reconciler calls, but as a one-time, operator-supervised step in the release notes of the version that ships #184.
If maintainers still want a standing reconciler, there is also a correctness problem to solve first: non-persistent hot-plugs (virtctl addvolume without --persist) live only in the VMI and never appear in VM.spec.template, so the "in VMI status but not in VM spec" criterion classifies every such legitimate hot-plug as an orphan and detaches it after the grace period — repeatedly, on every re-add. And since the loop scans all VMIs in the infra namespace with no ownership filter, in shared-namespace deployments (e.g. Cozystack tenant namespaces hosting both cluster node VMs and standalone user VMs) it would detach volumes from VMs that have nothing to do with this driver. At minimum it would need to filter VMIs by cluster-ownership labels and only consider volumes matching the driver's --volume-prefix.
The downstream-fork case from the description is an argument for fixing the fork (by porting #184) rather than for an upstream daemon.
|
Note the non persistent hotplug has been deprecated, and the default behaviour is to attach to the VM with a hotplug flag, so it can be unplugged. This behavior retains the hotplugged status after reboot. |
What this PR does / why we need it:
Even with a fully VMI-aware synchronous detach path, the CSI driver cannot catch every source of orphan hot-plug volumes:
external-attachergives up onVolumeAttachmentafter itsforce-detach-aftertimeout (6 min default) and removes the object without re-invoking the driver; the VMI keeps the hot-plug forever.removevolumeis issued; the subresource returns HTTP 409 and the 30-second backoff insideControllerUnpublishVolumegives up.ControllerUnpublishVolumeis never reached.In all four cases the VMI ends up with
.status.volumeStatus[].hotplugVolumeentries that noVolumeAttachmentwould ever reference again. The hot-plug pod stays alive and pins the infra storage device exclusively to the source host, blocking subsequent attachments.This PR introduces an opt-in
DetachReconcilerthat runs in the controller process alongside the gRPC server. Each cycle it:NotFoundas "every hot-plug on this VMI is orphan").VMI.Status.VolumeStatus[*].HotplugVolumeagainstVM.spec.template.spec.volumes. Entries present in the VMI but absent from the VM spec are candidate orphans.RemoveVolumeFromVMI. The grace period absorbs transient divergence around live migrations and normal removevolume calls.seenmap of entries that are no longer divergent, so it does not grow unboundedly.The reconciler is fully opt-in:
--enable-detach-reconciler(defaultfalse) — must be set to start the loop.--detach-reconciler-sync-period(default5m) — how often the loop runs.--detach-reconciler-grace-period(default5m) — minimum age of an observed divergence before acting.No new dependencies are introduced. The reconciler uses only methods already on
kubevirt.Client(ListVirtualMachines,GetWorkloadManagingVirtualMachine,RemoveVolumeFromVMI) andk8s.io/apimachinery/pkg/util/wait, which is already vendored. Logging usesklog, matching the rest of the driver.Which issue(s) this PR fixes:
Fixes #183
Special notes for your reviewer:
ControllerUnpublishVolumeVMI-aware so it no longer creates new orphans in the most common race (VM teardown before VMI). This reconciler is the backstop for everything that synchronous path cannot reach (legacy state, external-attacher timeouts, mid-migration detaches, downstream forks). The two PRs can be reviewed and merged in any order — neither depends on the other.VM.spec, it requires two observations spaced bygracePeriod, and it skips VMIs with a deletion timestamp. The combination yields zero false positives against the failure modes documented in the linked issue.Sync(ctx)is exposed for testing (the periodic loop inRun()is what users will rely on). Happy to switch to a package-private API + behavioural-only tests if reviewers prefer.seen-map pruning.Release note: