VEP #53: Moving virtiofsd to infrastructure for robust live migration #54
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/sig storage
Tests pass in KubeVirt for this new enhancement.
@xpivarc @jean-edouard, how do you feel about moving virtiofsd to virt-handler?
From my side, this proposal feels like a necessary evolution. In general, I like that we are moving all the overhead away from the virt-launcher. What is not clear to me (maybe I missed it) is how the fallback to the previous behavior will work.
Hi @vladikr, thanks for taking a look
We are reusing the current FG
There is no fallback; this completely replaces the current behavior (only for PVCs), and since the current implementation is not GA, we just keep the current FG. Currently, there is no difference between sharing a configMap/secret/etc. and a PVC (besides both being behind different FGs). But, under the current conditions (unprivileged virtiofsd), live migrating while sharing a PVC leaves a lot of room for unexpected errors in the guest.
/cc
Overall looks good!
A couple of questions; the main concern is CVEs, since we're kind of going in the opposite direction (unprivileged -> privileged through our own management).
Given that virtiofsd functionality has been available in Kubevirt for years,
and the scope of these changes only affects how virtiofsd runs, we could squash
the alpha and beta versions into a single phase.
I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.
+1 Put it behind a FG
+1
Sorry, this part is a leftover from the first version; I don't plan to ask for squashing alpha and beta.
+1 Put it behind a FG
It is already behind an FG: EnableVirtioFsPVC
I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.
I assume this "opt-out field" is not an FG, right? Otherwise, how will that work when this becomes GA? But more importantly, if there is something else, it means no live migration for a VMI with a PVC shared using unprivileged virtiofsd. The unprivileged virtiofsd uses a method that is not reliable for PVCs that can be modified, by other VMs, during the migration.
and hinders a robust live migration. The proposed solution is to move `virtiofsd`
as part of the infrastructure. The virtiofs container will remain **rootless**,
running a dummy process, while the `virt-handler` will launch the `virtiofsd`
binary and ensure it joins the virtiofs container's namespaces and cgroup.
I guess the biggest worry about this is going to be CVEs;
We're kind of bypassing the k8s security model by doing this
(though I get this is pretty much required)
Is there anything we could do proactively to ensure we don't hit those?
Could you elaborate on what you mean by bypassing the k8s security model?
Is there anything we could do proactively to ensure we don't hit those?
The process keeps only the required capabilities and chroots itself into the shared dir (dropping CAP_MKNOD), and all the create operations are done with UID/GID 107.
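For illustration only, a minimal Go sketch of the confinement steps described above (virtiofsd itself is a Rust binary; the package and function names here are invented, and the real code also uses mount-namespace sandboxing):

```go
// Hypothetical sketch: restrict the filesystem view to the shared directory
// and drop to UID/GID 107 before serving any requests.
package sandbox

import "syscall"

func confine(sharedDir string) error {
	// Chroot into the shared directory so no other host path is reachable.
	if err := syscall.Chroot(sharedDir); err != nil {
		return err
	}
	if err := syscall.Chdir("/"); err != nil {
		return err
	}
	// All create/chown operations after this point happen as UID/GID 107.
	if err := syscall.Setgid(107); err != nil {
		return err
	}
	return syscall.Setuid(107)
}
```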
@germag I think what @akalenyu is trying to say is the same as my concern - by this approach we bypass the Kubernetes node SW stack: kubelet, CRI, container runtime. We are injecting a privileged process into a container whose parent sandbox container, AKA pod, was instructed to be unprivileged by the pod security configuration in the virt-launcher Pod spec.
Unfortunately, I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s SIGs, so that we could set capabilities with limited usage via the pod security API.
Yeah, we have been burned before messing with the cgroup directly. Since it's not really managed by us, we have been breaking systemd's single-writer rule at a minimum: https://systemd.io/CGROUP_DELEGATION/
Unfortunately, I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s SIGs, so that we could set capabilities with limited usage via the pod security API.
Same here. I posted my latest take, understanding that there could be inherent limitations that make it impossible. Seitan is really interesting!
EDIT:
Here's an old thread about using it in kubevirt - https://groups.google.com/g/kubevirt-dev/c/jEa-s5pFgZA/m/3yAiZwrXAAAJ
infrastructure. The virtiofs container will remain rootless, starting a dummy
process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs
Could you elaborate a little on that? Will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app? Any extra info here would be awesome 🙏
Sure, I'll add more context.
will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app?
No, the containers are the same as in the current impl: the VM pod will still have 1 compute container and 1 rootless virtiofsd container (per PVC). The "dispatcher" (part of the virt-handler), after finding the container, will run a privileged virtiofsd inside that container.
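As a rough illustration of what such a dispatcher could do (a sketch under assumptions, not the actual code in kubevirt/kubevirt#14115; the names below are invented), it would join the target container's namespaces via its dummy PID 1 and then exec virtiofsd:

```go
// Hypothetical sketch: enter the virtiofs container's namespaces (identified
// by its dummy PID 1) and exec the virtiofsd binary there. Joining the mount
// namespace must happen from a single-threaded context, so a real dispatcher
// typically re-execs itself before calling something like this.
package dispatcher

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func joinAndExec(containerPID int, virtiofsdPath string, args []string) error {
	for _, ns := range []string{"mnt", "pid", "net", "ipc", "uts"} {
		f, err := os.Open(fmt.Sprintf("/proc/%d/ns/%s", containerPID, ns))
		if err != nil {
			return fmt.Errorf("open %s namespace: %w", ns, err)
		}
		// Move this process into the container's namespace.
		err = unix.Setns(int(f.Fd()), 0)
		f.Close()
		if err != nil {
			return fmt.Errorf("setns %s: %w", ns, err)
		}
	}
	// Replace the current process image with virtiofsd, now running inside
	// the virtiofs container's namespaces.
	return unix.Exec(virtiofsdPath, append([]string{virtiofsdPath}, args...), os.Environ())
}
```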
@germag Thank you for the VEP, I have a small question below
Outline any alternative designs that have been considered)
-->

* **Run virtiofsd as a privileged container:** This would involve running a
How does this approach differ from the one suggested in the design section? IIUC, what you are suggesting is to have a dispatcher which is a child process of virt-handler. The dispatcher process will inherit the parent's privileged caps, and all syscalls will be allowed because no seccomp filter is applied to that process, as it is not applied to its virt-handler parent. Then we do setns into the Linux namespaces of the virtiofs dummy PID and run from there. That is basically like running a privileged container process, but bypassing all the Kubernetes runtime restrictions that apply to the unprivileged virt-launcher pod. Isn't this the case?
I think any subsequent process created in the virtiofs container will not be privileged, which would not be the case if the container were privileged. So an attacker can't somehow start a root shell, for example.
@mhenriks From what I understood, the dispatcher process will be created in the virt-handler and then moved to the virtiofs container. virtiofsd will then be forked by the dispatcher.
@enp0s3 Yes, that is my understanding too. I was trying to explain how the proposed process here is different and potentially more secure than having a privileged container in the pod.
@mhenriks Thanks for the explanation. I see that the trade-off is having a privileged container vs. having a privileged process inside an unprivileged container. TBH, I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack. I know that we are already doing it in the project, and I wish we could get rid of this approach. Manipulating cgroups directly can be racy.
Actually, I don't know that there will be cgroup manipulation in this case
(this seems to be focusing on the other container building block, namespaces, and not so much on resource allocation).
But yeah, we're all singing the same tune... I think a user should also be able to disable this completely post-GA, since it's inherently a risky impl.
@akalenyu IIUC the aim is to migrate the dispatcher process ID from the virt-handler's cgroup.procs to the virtiofs container's cgroup.procs.
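On cgroup v2, that migration boils down to writing the PID into the target cgroup's cgroup.procs file. A minimal sketch, assuming the container's cgroup directory has already been resolved (names and paths are illustrative, not KubeVirt code):

```go
// Hypothetical sketch: move a process into the virtiofs container's cgroup by
// writing its PID to that cgroup's cgroup.procs file.
package dispatcher

import (
	"fmt"
	"os"
	"path/filepath"
)

func moveToCgroup(containerCgroupDir string, pid int) error {
	procs := filepath.Join(containerCgroupDir, "cgroup.procs")
	// Writing a PID to cgroup.procs migrates the process (and, on cgroup v2,
	// all of its threads) into the target cgroup.
	return os.WriteFile(procs, []byte(fmt.Sprintf("%d\n", pid)), 0o644)
}
```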
TBH I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack
That was the original design, and I was asked to remove the privileged container, because in the (near) future those types of containers will not be allowed in k8s.
You can see what the dispatcher does here: https://github.com/kubevirt/kubevirt/pull/14115/files#diff-9211391cc49c045761d05c2a94b20d24b4382123d1693deedb83a40b30dbe177
container's namespaces and cgroups, thereby operating within the same system
views and resource limitations defined for that container. Furthermore, the
virtiofs container's dummy PID 1 process will be designed to ensure that the
container's lifetime is bound to that of virtiofsd; if virtiofsd terminates,
Will it also have a timeout to wait for the virtiofsd to spawn initially?
No, there is no timeout. I don't really like timeouts; I don't know what we can use as an alternative, if that is required.
* **Run virtiofsd as a privileged container:** This would involve running a
privileged virtiofs container, granting specific Linux capabilities
(e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.
Which caps does it actually need? I assume it needs fewer caps than virt-handler?
These: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, SETGID, SETUID, SETFCAP, and CAP_DAC_READ_SEARCH
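For context, a hedged sketch of how a launcher could restrict itself to just that set by pruning the capability bounding set (illustrative only; the real code may also manage the effective, permitted, and ambient sets differently):

```go
// Hypothetical sketch: drop every capability from the bounding set except the
// ones listed above, so nothing else can ever be re-acquired by this process
// or its children.
package dispatcher

import "golang.org/x/sys/unix"

var keep = map[int]bool{
	unix.CAP_CHOWN:           true,
	unix.CAP_DAC_OVERRIDE:    true,
	unix.CAP_DAC_READ_SEARCH: true,
	unix.CAP_FOWNER:          true,
	unix.CAP_FSETID:          true,
	unix.CAP_SETGID:          true,
	unix.CAP_SETUID:          true,
	unix.CAP_SETFCAP:         true,
}

func dropBoundingSet() error {
	for c := 0; c <= unix.CAP_LAST_CAP; c++ {
		if keep[c] {
			continue
		}
		// Remove every capability we do not explicitly need.
		if err := unix.Prctl(unix.PR_CAPBSET_DROP, uintptr(c), 0, 0, 0); err != nil {
			return err
		}
	}
	return nil
}
```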
I'll add that in the document
The privileged monitor runs these syscalls on behalf of virtiofsd,
returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to
Can you run the monitor as a separate privileged container/Pod? Or would it be basically the same as using a privileged container for virtiofs?
We considered two approaches: a binary inside the virt-handler container, or a privileged deployment in the kubevirt namespace.
## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing
By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?
I think it would be unlikely given that the guest is running in a different container
By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?
Everything is possible :), but unlikely; virtiofsd is in a different container (but a different user namespace), and the guest only "talks" with virtiofsd via 2 virtio queues.
* **File Handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might have a large number of files. Without proper file handle
support, `virtiofsd` must rely on file descriptors which are a limited resource.

* **Live Migration challenges**: The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs utilizing
Virtiofs for PVCs sharing that might be concurrently accessed. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
target destination. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.
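To make the quoted file-handle limitation concrete, here is a small illustrative sketch of the underlying syscalls (not KubeVirt code; names are invented): name_to_handle_at(2) produces a stable handle without keeping a file descriptor open, while open_by_handle_at(2) reopens it later and requires CAP_DAC_READ_SEARCH, which an unprivileged virtiofsd lacks.

```go
// Hypothetical illustration of file-handle tracking: encode a file's identity
// as an opaque handle (no fd stays open), then reopen it on demand.
package dispatcher

import "golang.org/x/sys/unix"

func reopenViaHandle(sharedDirFD int, path string) (int, error) {
	// Turn the path into an opaque, persistent handle.
	handle, mountID, err := unix.NameToHandleAt(sharedDirFD, path, 0)
	if err != nil {
		return -1, err
	}
	_ = mountID // a real server maps the mount ID back to an fd on that mount
	// Privileged step: turn the handle back into an open file descriptor.
	return unix.OpenByHandleAt(sharedDirFD, handle, unix.O_RDONLY)
}
```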
So it's a bit hard to reason about these existing challenges (tbh at first I just accepted them)
since I am not a virtiofs expert, but what would make me feel much better about this is if a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.
I could be missing a fundamental issue with doing that in virtiofs, like I said, I am not an expert, but we have been in similar situations before - O_DIRECT against a backing storage device that's not exposed to the unprivileged container for example.
One of the ideas floated (not sure they took that path, it was a vendor):
qemu->daemon over socket, send all I/O to daemon which is privileged
(I/O done on behalf of qemu)
Somewhat related, the implementation might make more sense in the form of a DRA/CSI driver/device plugin, for example, similar to the one for non root containers using FUSE:
https://github.com/nextflow-io/k8s-fuse-plugin
Maybe it would be beneficial to describe the background on why the current approach leads to problems in more detail.
a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.
We tried that (not with a single privileged daemon, but one per virtiofsd), using seccomp notify. It worked and also had a nice way to be integrated into KubeVirt, but it requires substantial changes in the kernel and k8s to support reconnection, for instance after KubeVirt updates itself. Also, it has some perf penalty since it needs to sign/check the signature of the file handles (because those can be forged), and it requires carefully checking the path of the shared dir since requests come from an untrusted party.
I'll add more context about the limitations of using a path-based live migration.
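A rough sketch of the signing step mentioned above (purely illustrative; the real monitor and its wire format are not shown here): the privileged monitor tags each opaque handle with an HMAC and rejects any handle whose tag does not verify, so the unprivileged side cannot forge handles.

```go
// Hypothetical sketch: HMAC-sign raw file-handle bytes before handing them to
// the unprivileged side, and verify the tag before trusting a handle we get back.
package monitor

import (
	"crypto/hmac"
	"crypto/sha256"
)

// signHandle appends an HMAC-SHA256 tag to the raw file handle bytes.
func signHandle(key, rawHandle []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(rawHandle)
	return append(rawHandle, mac.Sum(nil)...)
}

// verifyHandle checks the tag and returns the original handle bytes.
func verifyHandle(key, signed []byte) ([]byte, bool) {
	if len(signed) < sha256.Size {
		return nil, false
	}
	raw, tag := signed[:len(signed)-sha256.Size], signed[len(signed)-sha256.Size:]
	mac := hmac.New(sha256.New, key)
	mac.Write(raw)
	return raw, hmac.Equal(mac.Sum(nil), tag)
}
```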
Mainly, I would like us to describe the security in more detail and consider some alternatives (see comments).
within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a
dedicated virtiofs container. This setup limits its ability to utilize file
handles for sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs)
and hinders a robust live migration. The proposed solution is to move `virtiofsd`
I am not sure I would call this "move to infrastructure"; the context remains the same (adjacent container), but who creates the virtiofsd changes.
Agreed, I'm not sure what to call it, honestly.
I'm also in the middle of some changes: I'll install the binary into the virt-handler container and run that binary instead of the one inside the launcher container.
By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join
the container's namespaces and cgroups, we aim to overcome these limitations,
enabling `virtiofsd` file handle functionality while keeping a strong
The security part is important enough that I would say it is worth having its own section in this design.
## Non Goals

* Using this method to share `configMaps`, `secrets`, `downwardAPIs` or
+1
`serviceAccounts`; the current implementation of a `virtiofsd` process
within a dedicated, unprivileged container will continue serving these volumes.

* Cross cluster live migration and/or storage migration. It is beyond the scope
So cross-cluster LM might or might not be impacted, depending on whether the underlying FS changes, right?
Yes, I think I'll remove that part, since it can be confusing
Kubevirt infrastructure. However, the current kernel's implementation of the
seccomp notify does not support reconnection, making recovery impossible after
a Kubevirt upgrade or if the monitor dies for any other reason.
What about:
"Run virtiofsd as a sub-privileged Pod"?
"Kubernetes user namespaces" ?
I'm not sure which text you are referring to.
## Scalability

No impact on the scalability. We keep the current design of a single container
Any possible delays to startup time?
Yes that's true, I'll add it
## Update/Rollback Compatibility

No impact on the update/rollback compatibility.
Worth noting that virt-handler will be updated before virt-controller, and that is why the update is not an issue?
It's not an issue since virtiofsd will not be restarted during an update/rollback; its lifetime is linked to the VM, like today. So virtiofsd will be updated/rolled back when the VM migrates.
## Functional Testing Approach

Besides reusing the virtiofsd's functional tests, both unit and functional
Please be more detailed.
Refer to https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md#releases for more details
-->

Given that virtiofsd functionality has been available in Kubevirt for years,
Like others, I vote as well for an FG or configuration, given that we re-implement what the runtime does for us today and potentially side-step security measures.
It's currently behind an FG, and since this is replacing how it currently runs, I'm just reusing it.
I'll add a configuration knob to disable this and run completely unprivileged, but it will also disable live migration (only for PVCs).
VEP Metadata
Tracking issue: #53
SIG label: /sig storage
What this PR does
This VEP proposes a fundamental change to how the virtiofsd process is managed within KubeVirt. Virtiofsd is moved to be part of the infrastructure. The virtiofs container will remain rootless, starting a dummy process. Then, the virtiofsd binary will be launched by virt-handler, and it will join the virtiofs container's namespaces and cgroup.
Special notes for your reviewer
We already posted a PR:
kubevirt/kubevirt#14115
Release note: