Add opt-in DetachReconciler for orphan hot-plug volumes#185

Open
mattia-eleuteri wants to merge 1 commit into kubevirt:main from mattia-eleuteri:feat/detach-reconciler

@mattia-eleuteri mattia-eleuteri commented Jun 4, 2026

Contributor

What this PR does / why we need it:

Even with a fully VMI-aware synchronous detach path, the CSI driver cannot catch every source of orphan hot-plug volumes:

  • Driver upgrades do not retroactively clean state created by older versions.
  • external-attacher gives up on VolumeAttachment after its force-detach-after timeout (6 min default) and removes the object without re-invoking the driver; the VMI keeps the hot-plug forever.
  • The VMI is mid-migration when removevolume is issued; the subresource returns HTTP 409 and the 30-second backoff inside ControllerUnpublishVolume gives up.
  • Downstream forks short-circuit earlier in the call graph (e.g. when the infra PVC is already gone), so upstream's ControllerUnpublishVolume is never reached.

In all four cases the VMI ends up with .status.volumeStatus[].hotplugVolume entries that no VolumeAttachment will ever reference again. The hot-plug pod stays alive and pins the infra storage device exclusively to the source host, which blocks subsequent attachments.

This PR introduces an opt-in DetachReconciler that runs in the controller process alongside the gRPC server. Each cycle it:

  1. Lists all VMIs in the infra cluster namespace.
  2. For each VMI, fetches the parent VM (treating NotFound as "every hot-plug on this VMI is orphan").
  3. Compares VMI.Status.VolumeStatus[*].HotplugVolume against VM.spec.template.spec.volumes. Entries present in the VMI but absent from the VM spec are candidate orphans.
  4. Records the first observation timestamp for each candidate. Only on a second observation, separated from the first by at least the configured grace period, does the reconciler issue RemoveVolumeFromVMI. The grace period absorbs transient divergence around live migrations and normal removevolume calls.
  5. Prunes its in-memory seen map of entries that are no longer divergent, so it does not grow unboundedly.

The reconciler is fully opt-in:

  • --enable-detach-reconciler (default false) — must be set to start the loop.
  • --detach-reconciler-sync-period (default 5m) — how often the loop runs.
  • --detach-reconciler-grace-period (default 5m) — minimum age of an observed divergence before acting.

No new dependencies are introduced. The reconciler uses only methods already on kubevirt.Client (ListVirtualMachines, GetWorkloadManagingVirtualMachine, RemoveVolumeFromVMI) and k8s.io/apimachinery/pkg/util/wait, which is already vendored. Logging uses klog, matching the rest of the driver.

Which issue(s) this PR fixes:

Fixes #183

Special notes for your reviewer:

  • A companion PR (Check VMI status when detaching, not just VM spec #184, separate and independent) makes ControllerUnpublishVolume VMI-aware so it no longer creates new orphans in the most common race (VM teardown before VMI). This reconciler is the backstop for everything that synchronous path cannot reach (legacy state, external-attacher timeouts, mid-migration detaches, downstream forks). The two PRs can be reviewed and merged in any order — neither depends on the other.
  • The reconciler is deliberately conservative: it never touches a hot-plug that is still in VM.spec, it requires two observations spaced by gracePeriod, and it skips VMIs with a deletion timestamp. The combination yields zero false positives against the failure modes documented in the linked issue.
  • Public Sync(ctx) is exposed for testing (the periodic loop in Run() is what users will rely on). Happy to switch to a package-private API + behavioural-only tests if reviewers prefer.
  • Tests cover: grace-period gating, no-op on legitimate hot-plug, VM-missing fan-out, terminating-VMI skip, and seen-map pruning.

Release note:

Add opt-in `--enable-detach-reconciler` flag that starts a background reconciler in the controller process. It periodically scans VMIs in the infra namespace and removes hot-plug volumes that are no longer referenced by the parent VM spec, after a configurable grace period. Default off; cadence configurable via `--detach-reconciler-sync-period` and `--detach-reconciler-grace-period`.

The companion commit makes the synchronous detach path VMI-aware, but
that only fixes orphans created after the upgrade. Orphans accumulated
before upgrading the driver, or created by edge cases the synchronous
path still cannot reach (external-attacher giving up after force-detach
timeout, VMI in mid-migration when removevolume is issued, downstream
forks short-circuiting on missing infra PVC), need a backstop.

This commit introduces a background reconciler that runs in the
controller process and periodically scans every VMI in the infra
cluster namespace, looking for hot-plugs that are no longer referenced
by the parent VM spec (or whose parent VM is gone altogether). On the
second observation, separated by at least gracePeriod from the first,
the reconciler issues RemoveVolumeFromVMI to release the QEMU device
and the hot-plug pod that pins the infra storage attachment.

The reconciler is opt-in via --enable-detach-reconciler (default off)
so existing deployments are unchanged. Two more flags expose the
cadence: --detach-reconciler-sync-period (default 5m) and
--detach-reconciler-grace-period (default 5m). The grace period
absorbs transient divergence around live migrations and normal
removevolume API calls; the seen-map is pruned each pass so it does
not grow unboundedly. Logging uses klog, matching the rest of the
driver — no new dependencies.

Signed-off-by: Mattia Eleuteri <mattia@hidora.io>
@kubevirt-bot added the release-note, dco-signoff: yes, and size/L labels on Jun 4, 2026
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign aglitke for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@kubevirt-bot

Hi @mattia-eleuteri. Thanks for your PR.

PRs from untrusted users cannot be marked as trusted with /ok-to-test in this repo, meaning untrusted PR authors can never trigger tests themselves. Collaborators can still trigger tests on the PR using /test all.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@kvaps kvaps left a comment

Member

Thanks for the detailed write-up, but I don't think the motivation holds up for a permanent reconciler, and I'd suggest reshaping this as a documented one-time cleanup instead.

The external-attacher bullet is not accurate: the attacher puts a finalizer on the VolumeAttachment and keeps calling ControllerUnpublishVolume with exponential backoff (capped at 5 minutes by default) until it succeeds, so the VA object cannot disappear without a successful unpublish. The 6-minute timeout is the A/D controller's maxWaitForUnmountDuration force-detach: it lets the new attach proceed early (which is exactly how the DRBD "failed to set source device readwrite" race happens), but the old VA stays in Terminating and keeps being retried. The mid-migration 409 case converges the same way: the attacher simply retries after the migration finishes. So with #184 merged, every orphan that still has a VolumeAttachment self-heals through the normal retry path.

That leaves orphans without a VA, which only exist because older driver versions falsely returned success on unpublish (the bug #184 fixes). That is a finite legacy set, and cleaning a finite set once is a migration, not a reconciliation loop. A documented procedure covers it: list VMIs whose status.volumeStatus[].hotplugVolume entries are absent from the parent VM spec (or whose VM is gone), then virtctl removevolume <vmi> --volume-name=<vol> for each — the same subresource this reconciler calls, but as a one-time, operator-supervised step in the release notes of the version that ships #184.

If maintainers still want a standing reconciler, there is also a correctness problem to solve first: non-persistent hot-plugs (virtctl addvolume without --persist) live only in the VMI and never appear in VM.spec.template, so the "in VMI status but not in VM spec" criterion classifies every such legitimate hot-plug as an orphan and detaches it after the grace period — repeatedly, on every re-add. And since the loop scans all VMIs in the infra namespace with no ownership filter, in shared-namespace deployments (e.g. Cozystack tenant namespaces hosting both cluster node VMs and standalone user VMs) it would detach volumes from VMs that have nothing to do with this driver. At minimum it would need to filter VMIs by cluster-ownership labels and only consider volumes matching the driver's --volume-prefix.
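
The two filters suggested above could look roughly like this. It is a sketch under assumptions: vmiInfo and eligibleVolumes are hypothetical names, the label key used in the usage note is a placeholder rather than a real KubeVirt or Cozystack label, and volumes are reduced to name strings.

```go
package main

import "strings"

// vmiInfo is a reduced, hypothetical view of a VMI: its labels and the names
// of the hot-plug volumes reported in its status.
type vmiInfo struct {
	Name     string
	Labels   map[string]string
	Hotplugs []string
}

// eligibleVolumes returns the hot-plug volumes the reconciler may consider:
// none unless the VMI carries every ownership label, and then only volumes
// whose name starts with the driver's volume prefix.
// Note: an empty ownerLabels map matches every VMI, so a caller would need
// to reject that configuration to get any protection from this filter.
func eligibleVolumes(vmi vmiInfo, ownerLabels map[string]string, volumePrefix string) []string {
	for k, v := range ownerLabels {
		if vmi.Labels[k] != v {
			return nil // VMI does not belong to this tenant cluster
		}
	}
	var out []string
	for _, name := range vmi.Hotplugs {
		if strings.HasPrefix(name, volumePrefix) {
			out = append(out, name)
		}
	}
	return out
}
```

For example, with ownerLabels set to {"example.io/cluster-name": "tenant-a"} and a prefix of "pvc-", a node VMI labelled for tenant-a with hot-plugs pvc-123 and user-disk yields only pvc-123, while an unlabelled standalone user VM in the same namespace yields nothing.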

The downstream-fork case from the description is an argument for fixing the fork (by porting #184) rather than for an upstream daemon.

@awels

awels commented Jun 25, 2026

Member

Note that non-persistent hotplug has been deprecated. The default behaviour is now to attach the volume to the VM with a hotplug flag so it can be unplugged, and this behaviour retains the hotplugged status after reboot.

Development

Successfully merging this pull request may close these issues.

Add opt-in background reconciler to clean up orphan hot-plug volumes

4 participants