
Conversation


@germag germag commented May 26, 2025

VEP Metadata

Tracking issue: #53
SIG label: /sig storage

What this PR does

This VEP proposes a fundamental change to how the virtiofsd process is managed within KubeVirt: virtiofsd becomes part of the infrastructure. The virtiofs container will remain rootless, starting a dummy process; the virtiofsd binary will then be launched by virt-handler and will join the virtiofs container's namespaces and cgroup.

Special notes for your reviewer

We already posted a PR:
kubevirt/kubevirt#14115

Release note:

NONE

@kubevirt-bot kubevirt-bot added dco-signoff: no Indicates the PR's author has not DCO signed all their commits. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels May 26, 2025
@kubevirt-bot

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign xpivarc for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@germag germag force-pushed the virtiofsd-as-infra branch from 0c8ba32 to cfd787a Compare May 26, 2025 17:10
@kubevirt-bot kubevirt-bot added dco-signoff: yes Indicates the PR's author has DCO signed all their commits. and removed dco-signoff: no Indicates the PR's author has not DCO signed all their commits. labels May 26, 2025
@kubevirt-bot kubevirt-bot added release-note-none Denotes a PR that doesn't merit a release note. and removed do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. labels Jun 2, 2025
@germag germag force-pushed the virtiofsd-as-infra branch from cfd787a to ffdf24e Compare June 2, 2025 09:42
@germag
Author

germag commented Jun 3, 2025

/sig storage

@germag germag force-pushed the virtiofsd-as-infra branch from ffdf24e to d059123 Compare July 28, 2025 15:11
@CongLi

CongLi commented Aug 8, 2025

Tests pass in kubevirt for this new enhancement.

@vladikr
Member

vladikr commented Aug 16, 2025

@xpivarc @jean-edouard, how do you feel about moving virtiofsd to virt-handler?

@vladikr vladikr requested a review from jean-edouard August 16, 2025 02:25
@vladikr
Member

vladikr commented Aug 19, 2025

From my side, this proposal feels like a necessary evolution. In general, I like that we are moving all the overhead away from the virt-launcher.
It does raise a question about debugging, which will now require tighter coupling between the virt-launcher and virt-handler.
Looking at the PR, the new functionality is largely isolated, so I don't have a concern about introducing it; we will need to make sure that the entry points are protected.
I don't see a gate in the PR; we should definitely have one.

What is not clear to me (maybe I missed it) is how the fallback to the previous behavior will work.

@germag
Author

germag commented Aug 20, 2025

Hi @vladikr, thanks for taking a look

I don't see a gate in the PR, we should definitely have one.

We are reusing the current FG EnableVirtioFsPVC (see below)

What is not clear to me (maybe I missed it) is how the fallback to the previous behavior will work.

There is no fallback; this completely replaces the current behavior (only for PVCs), and since the current implementation is not GA, we just keep the current FG.

Currently, there is no difference between sharing a configMap/secret/etc. and a PVC (besides them being behind different FGs). But, under the current conditions (unprivileged virtiofsd), live migrating while sharing a PVC leaves a lot of room for unexpected errors in the guest.

@0xFelix
Member

0xFelix commented Aug 21, 2025

/cc

@kubevirt-bot kubevirt-bot requested a review from 0xFelix August 21, 2025 13:01
@akalenyu akalenyu left a comment

Overall looks good!

A couple of questions; the main concern is CVEs, since we're kind of going in the opposite direction (unprivileged -> privileged through our own management).

Comment on lines +161 to +163
Given that virtiofsd functionality has been available in Kubevirt for years,
and the scope of these changes only affects how virtiofsd runs, we could squash
the alpha and beta versions into a single phase.


I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.

Member

+1 Put it behind a FG

Member

+1

Author

Sorry, this part is a leftover from the first version; I don't plan to ask for squashing alpha and beta.

+1 Put it behind a FG

It is already behind a FG: EnableVirtioFsPVC

I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.

I assume this "opt-out field" is not an FG, right? Otherwise, how will that work when this becomes GA? But more importantly, if there is something else, it means no live migration for a VMI with a PVC shared using unprivileged virtiofsd. The unprivileged virtiofsd uses a method that is not reliable for PVCs that can be modified, by other VMs, during the migration.

and hinders a robust live migration. The proposed solution is to move `virtiofsd`
as part of the infrastructure. The virtiofs container will remain **rootless**,
running a dummy process, while the `virt-handler` will launch the `virtiofsd`
binary and ensure it joins the virtiofs container's namespaces and cgroup.


I guess the biggest worry about this is going to be CVEs;
We're kind of bypassing the k8s security model by doing this
(though I get this is pretty much required)

Is there anything we could do proactively to ensure we don't hit those?

Author

Could you elaborate on what you mean by bypassing the k8s security model?

Is there anything we could do proactively to ensure we don't hit those?

The process keeps only the required capabilities, chroots itself into the shared dir (dropping CAP_MKNOD), and all the create operations are done with UID/GID 107.
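
For illustration only, here is a minimal sketch (in Go, using golang.org/x/sys/unix) of the sandboxing steps described above. The ordering and call sites are assumptions; the real virtiofsd handles its own sandboxing internally, and keeping the remaining capabilities across the UID switch needs keepcaps/ambient handling that is omitted here.

```go
package main

import (
	"log"
	"syscall"

	"golang.org/x/sys/unix"
)

// sandbox sketches the confinement steps mentioned above: drop CAP_MKNOD from
// the bounding set, chroot into the shared directory, and switch to UID/GID 107
// (the qemu user in KubeVirt images) for all subsequent file operations.
func sandbox(sharedDir string) error {
	// Drop CAP_MKNOD so device nodes can no longer be created.
	if err := unix.Prctl(unix.PR_CAPBSET_DROP, uintptr(unix.CAP_MKNOD), 0, 0, 0); err != nil {
		return err
	}
	// Confine the filesystem view to the shared directory.
	if err := unix.Chroot(sharedDir); err != nil {
		return err
	}
	if err := unix.Chdir("/"); err != nil {
		return err
	}
	// Create operations run as UID/GID 107. Since Go 1.16, Setgid/Setuid apply
	// to all threads on Linux. Retaining the other required capabilities across
	// this switch needs keepcaps/ambient-cap handling, omitted in this sketch.
	if err := syscall.Setgid(107); err != nil {
		return err
	}
	return syscall.Setuid(107)
}

func main() {
	// "/shared" is a hypothetical shared-dir path.
	if err := sandbox("/shared"); err != nil {
		log.Fatal(err)
	}
}
```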

@enp0s3 enp0s3 Aug 28, 2025

@germag I think what @akalenyu is trying to say is the same as my concern: with this approach, we bypass the Kubernetes node SW stack: kubelet, CRI, container runtime. We are injecting a privileged process into a container whose parent sandbox container, AKA the pod, was instructed to be unprivileged by the pod security configuration in the virt-launcher Pod spec.
Unfortunately, I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s SIGs, so that we could set capabilities with limited usage via the pod security API.

@akalenyu akalenyu Aug 31, 2025

Yeah, we have been burned before by messing with the cgroup directly. Since it's not really managed by us, we have been breaking systemd's single-writer rule at a minimum: https://systemd.io/CGROUP_DELEGATION/

Unfortunately I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s sigs, so that we could set capabilities with a limited usage via the pod security API.

Same here. I posted my latest take, understanding that there could be inherent limitations that make it impossible. Seitan is really interesting!
EDIT:
Here's an old thread about using it in kubevirt - https://groups.google.com/g/kubevirt-dev/c/jEa-s5pFgZA/m/3yAiZwrXAAAJ

infrastructure. The virtiofs container will remain rootless, starting a dummy
process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs


Could you elaborate a little on that? Will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app? Any extra info here would be awesome 🙏

Author

Sure, I'll add more context.

will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app?

No, the containers are the same as in the current impl: the VM pod will still have 1 compute container and 1 rootless virtiofsd container (per PVC). The "dispatcher" (part of the virt-handler), after finding the container, will run a privileged virtiofsd inside that container.
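
To make the "dispatcher" step more concrete, here is a minimal sketch of how virt-handler could start a privileged virtiofsd inside the target container's namespaces. It shells out to nsenter for brevity; the actual PR may call setns directly or use a helper binary, and the binary path and socket/shared-dir arguments are assumptions.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// runVirtiofsdInContainer launches virtiofsd so it joins the mount, UTS, IPC,
// network and PID namespaces of the virtiofs container, identified by the PID
// of its dummy PID 1 process.
func runVirtiofsdInContainer(dummyPID int, socketPath, sharedDir string) error {
	args := []string{
		"--target", fmt.Sprint(dummyPID),
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--",
		"/usr/libexec/virtiofsd", // path inside the joined mount namespace; an assumption
		"--socket-path", socketPath,
		"--shared-dir", sharedDir,
	}
	cmd := exec.Command("nsenter", args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	return cmd.Start()
}

func main() {
	// Hypothetical values; in practice virt-handler would discover the dummy
	// PID from the container runtime or /proc.
	if err := runVirtiofsdInContainer(12345, "/var/run/virtiofsd.sock", "/shared"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}
```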

@enp0s3 enp0s3 left a comment

@germag Thank you for the VEP, I have a small question below

Outline any alternative designs that have been considered)
-->

* **Run virtiofsd as a privileged container:** This would involve running a

How does this approach differ from the one suggested in the design section? IIUC, what you are suggesting is to have a dispatcher which is a child process of virt-handler. The dispatcher process will inherit the parent's privileged caps, and all syscalls will be allowed because no seccomp filter is applied to that process, as none is applied to its virt-handler parent. Then we do setns into the virtiofs dummy PID's Linux namespaces and run from there. That is basically like running a privileged container process, but bypassing all the Kubernetes runtime restrictions that apply to the unprivileged virt-launcher pod. Isn't this the case?

Member

I think any subsequent process created in the virtiofs container will not be privileged, which would not be the case if the container was privileged. So an attacker can't somehow start a root shell, for example.


@mhenriks From what I understood, the dispatcher process will be created in the virt-handler and then moved to the virtiofs container. The virtiofsd will then be forked by the dispatcher.

Member

@enp0s3 yes, that is my understanding too. I was trying to explain how the proposed process here is different and potentially more secure than having a privileged container in the pod.


@mhenriks Thanks for the explanation. I see that the trade-off is having a privileged container vs. having a privileged process inside an unprivileged container. TBH, I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack. I know that we are already doing it in the project, and I wish we could get rid of this approach. Manipulating cgroups directly can be racy.

@akalenyu akalenyu Aug 28, 2025

Actually, I don't know that there will be cgroup manipulation in this case (this seems to be focusing on the other container building block, namespaces, and not so much on resource allocation). But yeah, we're all singing the same tune... I think a user should also be able to disable this completely post-GA, since it's an inherently risky impl.


@akalenyu IIUC, the aim is to migrate the dispatcher process ID from the virt-handler's cgroup.procs to the virtiofs container's cgroup.procs.
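
For reference, on cgroup v2 that migration boils down to writing the dispatcher's PID into the target container's cgroup.procs file. A sketch follows; the cgroup path is a made-up placeholder, and, as noted above, writing there from outside the managing runtime conflicts with systemd's single-writer rule.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// moveToContainerCgroup moves the given PID (e.g. the dispatcher) into the
// virtiofs container's cgroup by appending it to that cgroup's cgroup.procs.
func moveToContainerCgroup(containerCgroup string, pid int) error {
	procs := filepath.Join("/sys/fs/cgroup", containerCgroup, "cgroup.procs")
	f, err := os.OpenFile(procs, os.O_WRONLY, 0)
	if err != nil {
		return err
	}
	defer f.Close()
	_, err = fmt.Fprintf(f, "%d\n", pid)
	return err
}

func main() {
	// Hypothetical cgroup path of the virtiofs container.
	if err := moveToContainerCgroup("kubepods.slice/example-pod/virtiofs", os.Getpid()); err != nil {
		fmt.Println(err)
	}
}
```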

Author

TBH I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack

That was the original design, and I was asked to remove the privileged container because, in the (near) future, those types of containers will not be allowed in k8s.


container's namespaces and cgroups, thereby operating within the same system
views and resource limitations defined for that container. Furthermore, the
virtiofs container's dummy PID 1 process will be designed to ensure that the
container's lifetime is bound to that of virtiofsd; if virtiofsd terminates,
Member

Will it also have a timeout to wait for the virtiofsd to spawn initially?

Author

No, there is no timeout. I don't really like timeouts; I don't know what we can use as an alternative, if one is required.
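
As a thought experiment for binding the container's lifetime to virtiofsd without a timeout, the dummy PID 1 could simply block on a pidfd for the virtiofsd process and exit when it closes. This is only a sketch under the assumption that the dummy process learns the virtiofsd PID somehow (e.g. over a pipe); it is not necessarily what the PR implements.

```go
package main

import (
	"os"

	"golang.org/x/sys/unix"
)

// waitAndExit blocks until the process identified by pid terminates, then
// exits, taking the container down with it (this process is the container's
// PID 1). A real PID 1 would also need to reap orphaned children, omitted here.
func waitAndExit(pid int) {
	fd, err := unix.PidfdOpen(pid, 0)
	if err != nil {
		os.Exit(1)
	}
	pfd := []unix.PollFd{{Fd: int32(fd), Events: unix.POLLIN}}
	for {
		// The pidfd becomes readable once the process has terminated.
		if _, err := unix.Poll(pfd, -1); err == unix.EINTR {
			continue
		}
		break
	}
	os.Exit(0)
}

func main() {
	// Hypothetical: the virtiofsd PID would be passed to the dummy process,
	// e.g. over a pipe or a file descriptor.
	waitAndExit(4242)
}
```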


* **Run virtiofsd as a privileged container:** This would involve running a
privileged virtiofs container, granting specific Linux capabilities
(e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.
Member

Which caps does it actually need? I assume it needs fewer caps than virt-handler?

Author

These: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, SETGID, SETUID, SETFCAP, and CAP_DAC_READ_SEARCH

Author

I'll add that to the document.
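
For comparison, the privileged-container alternative quoted above would roughly map the capability list to a security context like the following. This is a sketch using the k8s core/v1 API; whether all of these would really be granted via the pod spec, and to which container, is an assumption.

```go
package main

import corev1 "k8s.io/api/core/v1"

// virtiofsSecurityContext sketches the capability set listed above as it would
// appear on the virtiofs container if the "privileged container" alternative
// were chosen instead of the virt-handler-launched process.
func virtiofsSecurityContext() *corev1.SecurityContext {
	return &corev1.SecurityContext{
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
			Add: []corev1.Capability{
				"CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID",
				"SETGID", "SETUID", "SETFCAP", "DAC_READ_SEARCH",
			},
		},
	}
}

func main() {
	_ = virtiofsSecurityContext()
}
```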

The privileged monitor, runs these syscalls on behalf of virtiofsd,
returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to
Member

Can you run the monitor as a separate privileged container/Pod? Or would it be basically the same as using a privileged container for virtiofs?

Author

We considered two approaches: a binary inside the virt-handler container, or a privileged deployment in the kubevirt namespace.

Comment on lines +161 to +163
Given that virtiofsd functionality has been available in Kubevirt for years,
and the scope of these changes only affects how virtiofsd runs, we could squash
the alpha and beta versions into a single phase.
Member

+1 Put it behind a FG


## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing
Member

By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?

Member

I think it would be unlikely given that the guest is running in a different container

Author

By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?

Everything is possible :), but unlikely: virtiofsd is in a different container (but different user namespace), and the guest only "talks" with virtiofsd via 2 virtio queues.

Comment on lines +28 to +39
* **File Handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might have a large number of files. Without proper file handle
support, `virtiofsd` must rely on file descriptors which are a limited resource.

* **Live Migration challenges**: The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs utilizing
Virtiofs for PVCs sharing that might be concurrently accessed. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
target destination. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.
@akalenyu akalenyu Aug 31, 2025

So it's a bit hard to reason about these existing challenges (tbh, at first I just accepted them), since I am not a virtiofs expert, but what would make me feel much better about this is if a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.

I could be missing a fundamental issue with doing that in virtiofs; like I said, I am not an expert, but we have been in similar situations before - O_DIRECT against a backing storage device that's not exposed to the unprivileged container, for example.
One of the ideas floated (not sure they took that path, it was a vendor):
qemu->daemon over socket, send all I/O to the daemon, which is privileged
(I/O done on behalf of qemu)

Somewhat related, the implementation might make more sense in the form of a DRA/CSI driver/device plugin, for example, similar to the one for non-root containers using FUSE:
https://github.com/nextflow-io/k8s-fuse-plugin

Member

Maybe it would be beneficial to describe the background on why the current approach leads to problems in more detail.

Author

a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.

We tried that (not with a single privileged daemon, but one per virtiofsd), using seccomp notify. It worked and also has a nice way to be integrated into kubevirt, but it requires substantial changes in the kernel and k8s to support reconnection, for instance after kubevirt updates itself. Also, it has some perf penalty since it needs to sign/check the signature of the file handles (because those can be forged), and it also requires carefully checking the path of the shared dir, since requests are coming from an untrusted party.

I'll add more context about the limitations of using a path-based live migration.
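
For background on the file-handle mechanics being discussed: a handle is obtained with name_to_handle_at(2) and later reopened with open_by_handle_at(2), which needs CAP_DAC_READ_SEARCH. Below is a minimal sketch, using golang.org/x/sys/unix and crypto/hmac, of how a privileged monitor could sign a handle before returning it to an untrusted virtiofsd, as the seccomp-notify alternative describes; the key management, wire format, and file path are assumptions.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"

	"golang.org/x/sys/unix"
)

// signedHandle pairs a file handle with an HMAC so the privileged monitor can
// later verify that the handle was not forged by the unprivileged requester.
type signedHandle struct {
	HandleType int32
	Handle     []byte
	MAC        []byte
}

func sign(key []byte, h unix.FileHandle) signedHandle {
	mac := hmac.New(sha256.New, key)
	mac.Write(h.Bytes())
	return signedHandle{HandleType: h.Type(), Handle: h.Bytes(), MAC: mac.Sum(nil)}
}

func verify(key []byte, s signedHandle) bool {
	mac := hmac.New(sha256.New, key)
	mac.Write(s.Handle)
	return hmac.Equal(s.MAC, mac.Sum(nil))
}

func main() {
	key := []byte("demo-key-loaded-from-somewhere-secret") // assumption: real key management omitted

	// Obtain a handle for a (hypothetical) path inside the shared directory.
	h, _, err := unix.NameToHandleAt(unix.AT_FDCWD, "/shared/some-file", 0)
	if err != nil {
		fmt.Println("name_to_handle_at:", err)
		return
	}
	s := sign(key, h)
	if verify(key, s) {
		// With CAP_DAC_READ_SEARCH, the handle could now be reopened via
		// unix.OpenByHandleAt against an fd for the containing mount.
		fmt.Println("handle verified, size:", len(s.Handle))
	}
}
```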

@xpivarc xpivarc left a comment

Mainly, I would like us to describe the security in more detail and consider some alternatives (see comments).

within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a
dedicated virtiofs container. This setup limits its ability to utilize file
handles for sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs)
and hinders a robust live migration. The proposed solution is to move `virtiofsd`
Member

I am not sure I would call this "move to infrastructure"; the context remains the same (an adjacent container), but who creates the virtiofsd changes.

Author

Agreed, I'm not sure what to call it, honestly.

I'm also in the middle of some changes: I'll install the binary into the virt-handler container and run that binary instead of the one inside the launcher container.

Comment on lines +28 to +39
* **File Handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might have a large number of files. Without proper file handle
support, `virtiofsd` must rely on file descriptors which are a limited resource.

* **Live Migration challenges**: The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs utilizing
Virtiofs for PVCs sharing that might be concurrently accessed. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
target destination. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.
Member

Maybe it would be beneficial to describe the background on why the current approach leads to problems in more detail.


By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join
the container's namespaces and cgroups, we aim to overcome these limitations,
enabling `virtiofsd` file handle functionality while keeping a strong
Member

The security part is important enough that I would say it is worth having its own section in this design.


## Non Goals

* Using this method to share `configMaps`, `secrets`, `donwardAPIs` or
Member

+1

`serviceAccounts`; the current implementation of a `virtiofsd` process
within a dedicated, unprivileged container will continue serving these volumes.

* Cross cluster live migration and/or storage migration. it is beyond the scope
Member

So cross-cluster LM might or might not be impacted, depending on whether the underlying FS changes, right?

Author

Yes, I think I'll remove that part, since it can be confusing

Kubevirt infrastructure. However, the current kernel's implementation of the
seccomp notify does not support reconnection, making recovery impossible after
a Kubevirt upgrade or if the monitor dies for any other reason.

Member

What about:
"Run virtiofsd as a sub-privileged Pod"?
"Kubernetes user namespaces" ?

Author

I'm not sure which text you are referring to.


## Scalability

No impact on the scalability. We keep the current design of a single container
Member

Any possible delays to startup time?

Author

Yes that's true, I'll add it


## Update/Rollback Compatibility

No impact on the update/rollback compatibility.
Member

Worth noting that virt-handler will be updated before virt-controller, and that is why the update is not an issue?

Author

It's not an issue since virtiofsd will not be restarted during an update/rollback; its lifetime is linked to the VM, like today. So virtiofsd will be updated/rolled back when the VM migrates.


## Functional Testing Approach

Besides reusing the virtiofsd's functional tests, both unit and functional
Member

Please be more detailed.

Refer to https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md#releases for more details
-->

Given that virtiofsd functionality has been available in Kubevirt for years,
Member

Like others, I also vote for an FG or configuration, given that we re-implement what the runtime does for us today and potentially side-step security measures.

Author

It's currently behind an FG, and since this replaces how it currently runs, I'm just reusing it.
I'll add a configuration knob to disable this and run completely unprivileged, but it will also disable live migration (only for PVCs).

@xpivarc
Member

xpivarc commented Sep 1, 2025

fyi kubevirt/kubevirt#14015
