# VEP #53: Moving virtiofsd to infrastructure for robust live migration

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue created, which links to VEP dir in [kubevirt/enhancements] (not the initial VEP PR)
- [ ] (R) Target version is explicitly mentioned and approved
- [ ] (R) Graduation criteria filled

## Overview

This VEP proposes a fundamental change to how the `virtiofsd` process is managed within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a dedicated virtiofs container. This setup limits its ability to utilize file handles for sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs) and hinders robust live migration. The proposed solution is to move `virtiofsd` into the KubeVirt infrastructure. The virtiofs container will remain **rootless**, running a dummy process, while `virt-handler` will launch the `virtiofsd` binary and ensure it joins the virtiofs container's namespaces and cgroup.

> Review discussion:
> - I guess the biggest worry about this is going to be CVEs; is there anything we could do proactively to ensure we don't hit those?
> - Could you elaborate on what you mean by bypassing the k8s security model? The process keeps only the required capabilities, chroots itself into the shared dir (dropping CAP_MKNOD), and all create operations are done with UID/GID 107.
> - @germag I think what @akalenyu is trying to say is the same as my concern: with this approach we bypass the Kubernetes node SW stack (kubelet, CRI, container runtime). We are injecting a privileged process into a container whose parent sandbox, i.e. the pod, was instructed to be unprivileged by the pod security configuration in the virt-launcher Pod spec.
> - Yeah, we have been burned before by messing with the cgroup directly. Since it's not really managed by us, we have been breaking systemd's single-writer rule at the minimum: https://systemd.io/CGROUP_DELEGATION/
> - Same here. I posted my latest take, understanding that there could be inherent limitations that make it impossible. Seitan is really interesting!

## Motivation

The current architecture for Virtiofs in KubeVirt involves running the `virtiofsd` daemon within a dedicated, unprivileged container. While this approach offers a strong security boundary by isolating `virtiofsd` from the host, it introduces significant functional limitations:

* **File handle support:** Running `virtiofsd` in an unprivileged container restricts its ability to track guest files using file handles. This capability is crucial for efficient and robust file sharing, especially when dealing with PVCs that might have a large number of files. Without proper file handle support, `virtiofsd` must rely on file descriptors, which are a limited resource (see the sketch below).

* **Live migration challenges:** The inability to properly manage file handles directly impacts the safety and reliability of live migration for VMs that use Virtiofs to share PVCs that might be accessed concurrently. During a live migration, the `virtiofsd` instance needs to hand over its internal state to the target destination. Restrictions on file handles make this hand-off prone to issues, potentially leading to data inconsistencies or migration failures.

> Review discussion (on the bullets above):
> - So it's a bit hard to reason about these existing challenges (tbh at first I just accepted them). I could be missing a fundamental issue with doing that in virtiofs; like I said, I am not an expert, but we have been in similar situations before: O_DIRECT against a backing storage device that's not exposed to the unprivileged container, for example. Somewhat related, the implementation might make more sense in the form of a DRA/CSI driver/device plugin, similar to the one for non-root containers using FUSE: […]
> - Maybe it would be beneficial to describe the background on why the current approach leads to problems in more detail.
> - We tried that (not with a single privileged daemon, but one per virtiofsd), using seccomp notify. It worked and also has a nice way to be integrated into KubeVirt, but it requires substantial changes in the kernel and k8s to support reconnection, for instance after KubeVirt updates itself. Also, it has some perf penalty since it needs to sign/check the signature of the file handles (because those can be forged), and it requires carefully checking the path of the shared dir since requests come from an untrusted party. I'll add more context about the limitations of using a path-based live migration.

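To make the file-handle limitation concrete, here is a minimal, illustrative sketch (not KubeVirt code; the paths are hypothetical). Obtaining a handle with `name_to_handle_at(2)` works unprivileged, but re-opening it with `open_by_handle_at(2)` requires `CAP_DAC_READ_SEARCH`, which the unprivileged virtiofs container does not have:

```go
// Illustrative only (not KubeVirt code; paths are hypothetical): why file
// handles need a privilege the current virtiofs container does not have.
package main

import (
	"fmt"
	"log"

	"golang.org/x/sys/unix"
)

func main() {
	// Encoding an inode as a persistent file handle does not keep a file
	// descriptor open, so the open-files limit is not a concern.
	handle, mountID, err := unix.NameToHandleAt(unix.AT_FDCWD, "/shared-pvc/data/file.img", 0)
	if err != nil {
		log.Fatalf("name_to_handle_at: %v", err)
	}
	fmt.Printf("mount %d, handle of %d bytes\n", mountID, handle.Size())

	// Re-opening the file later by handle requires CAP_DAC_READ_SEARCH;
	// inside the unprivileged container this fails with EPERM, which is the
	// core limitation this proposal addresses.
	mountFD, err := unix.Open("/shared-pvc", unix.O_DIRECTORY|unix.O_RDONLY, 0)
	if err != nil {
		log.Fatalf("open mount dir: %v", err)
	}
	fd, err := unix.OpenByHandleAt(mountFD, handle, unix.O_RDONLY)
	if err != nil {
		log.Fatalf("open_by_handle_at (needs CAP_DAC_READ_SEARCH): %v", err)
	}
	defer unix.Close(fd)
	fmt.Println("reopened file from handle, fd:", fd)
}
```
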
By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join the container's namespaces and cgroups, we aim to overcome these limitations, enabling `virtiofsd` file handle functionality while keeping strong security.

> The security part is important enough that it is worth having its own section in this design.

## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing with VMs.

> By providing […] — I think it would be unlikely, given that the guest is running in a different container. — Everything is possible :), but unlikely: virtiofsd is in a different container (but a different user namespace), and the guest only "talks" with virtiofsd via two virtio queues.

* Facilitate safe and reliable live migration for VMs that leverage Virtiofs for PVC sharing.

> Facilitate, meaning it will work once implemented? — Yes; live migration currently works, but it is not reliable while sharing a PVC. This approach uses a different method that provides that reliability (I hope I explained myself).

* Maintain the rootless execution model for the virtiofs container, preserving its security benefits.

## Non Goals

* Using this method to share `configMaps`, `secrets`, `downwardAPIs` or `serviceAccounts`; the current implementation of a `virtiofsd` process within a dedicated, unprivileged container will continue serving these volumes.

> +1

* Cross-cluster live migration and/or storage migration. It is beyond the scope of virtiofsd to migrate the contents of the files. In addition, when using file handles, both the source and target are required to have access to the same file system, since the files are accessed using their inode number.

> So cross-cluster live migration might or might not be impacted depending on whether the underlying FS changes, right? — Yes; I think I'll remove that part, since it can be confusing.

## Definition of Users

* VM Owners
* Cluster Admins

## User Stories

* As a KubeVirt user, I want to use Virtiofs for sharing PVCs with my VMs without encountering issues related to file descriptor limitations, like reaching the open-files limit.

> Please list the limit explicitly.

* As a KubeVirt user, I want to be able to live migrate VMs that use Virtiofs reliably and safely, ensuring data consistency during migration events.

> +1

* As a KubeVirt administrator, I want to be able to provide robust live migration without allowing privileged containers.

> This is a good point. When thinking about alternatives, some of them would not work with Pod Security Admission, for example.

## Repos

[KubeVirt](https://github.com/kubevirt/kubevirt)

## Design

The management of the `virtiofsd` process will be integrated into the KubeVirt infrastructure. The virtiofs container will remain rootless, starting a dummy process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs container's namespaces and cgroups, so that it operates within the same system view and resource limitations defined for that container. Furthermore, the virtiofs container's dummy PID 1 process will be designed to ensure that the container's lifetime is bound to that of `virtiofsd`: if `virtiofsd` terminates, the dummy process will exit, leading to the container's termination.

> Review discussion:
> - An important note is that this is implemented in C. I see this as a risk for the project; I fear most people will look away, but we have kept the usage of C to a minimum and only with good reasoning. Is there any particular reason you chose C? — The virtiofs placeholder doesn't carry any risk: it runs without any privileges and basically does nothing; its only job is to die if virtiofsd dies, linking virtiofsd's lifetime to the container lifetime. Should I add more context to the doc?
> - Could you elaborate a little on that? Will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app? Any extra info here would be awesome 🙏 — Sure, I'll add more context. No, the containers are the same as in the current implementation: the VM pod will still have one compute container and one rootless virtiofsd container (per PVC). The "dispatcher" (part of virt-handler), after finding the container, will run a privileged virtiofsd inside it.
> - Will it also have a timeout to wait for virtiofsd to spawn initially? — No, there is no timeout. I don't really like timeouts; I don't know what we could use as an alternative, if that is required.

The following figure explains how virtiofsd is launched step-by-step:

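As a rough illustration of this launch flow, here is a minimal sketch assuming a dispatcher helper owned by `virt-handler`. This is not the actual KubeVirt implementation: the cgroup path, the placeholder PID discovery, and the virtiofsd flags are hypothetical, illustration-only values.

```go
// Minimal sketch of the launch flow: join the virtiofs container's cgroup and
// namespaces (identified by its placeholder PID 1), then exec virtiofsd.
package main

import (
	"fmt"
	"os"
	"runtime"

	"golang.org/x/sys/unix"
)

func joinContainer(placeholderPID int, cgroupProcsPath string) error {
	// setns affects the calling thread only, so pin the goroutine to it.
	runtime.LockOSThread()

	// 1. Join the container's cgroup so its resource limits apply to virtiofsd.
	if err := os.WriteFile(cgroupProcsPath, []byte(fmt.Sprint(os.Getpid())), 0); err != nil {
		return fmt.Errorf("joining cgroup: %w", err)
	}

	// 2. Open all namespace fds first (before the mount namespace changes what
	// /proc looks like), then join them. Joining the PID namespace would
	// additionally require a fork after setns and is omitted here.
	namespaces := []string{"mnt", "net", "ipc", "uts"}
	fds := make([]int, 0, len(namespaces))
	for _, ns := range namespaces {
		fd, err := unix.Open(fmt.Sprintf("/proc/%d/ns/%s", placeholderPID, ns), unix.O_RDONLY|unix.O_CLOEXEC, 0)
		if err != nil {
			return fmt.Errorf("opening %s namespace: %w", ns, err)
		}
		fds = append(fds, fd)
	}
	for i, fd := range fds {
		if err := unix.Setns(fd, 0); err != nil {
			return fmt.Errorf("setns %s: %w", namespaces[i], err)
		}
		unix.Close(fd)
	}
	return nil
}

func main() {
	// Hypothetical values; in practice virt-handler would discover these from
	// the virt-launcher pod and the container runtime.
	placeholderPID := 12345
	cgroupProcs := "/sys/fs/cgroup/kubepods.slice/virtiofs-container/cgroup.procs"

	if err := joinContainer(placeholderPID, cgroupProcs); err != nil {
		panic(err)
	}
	// Finally replace ourselves with virtiofsd (flags illustrative only).
	err := unix.Exec("/usr/libexec/virtiofsd",
		[]string{"virtiofsd", "--shared-dir=/shared", "--socket-path=/var/run/virtiofsd.sock"},
		os.Environ())
	panic(err)
}
```

And a minimal sketch of the placeholder idea, binding the container's lifetime to virtiofsd's. How the placeholder learns virtiofsd's PID (a pid file written by the dispatcher) is likewise an assumption for illustration:

```go
// Illustrative placeholder (dummy PID 1): exit as soon as virtiofsd exits.
package main

import (
	"os"
	"strconv"
	"strings"
	"time"

	"golang.org/x/sys/unix"
)

func main() {
	// Wait until the dispatcher has injected virtiofsd and published its PID
	// (hypothetical hand-off mechanism).
	var pid int
	for {
		data, err := os.ReadFile("/run/virtiofsd.pid")
		if err == nil {
			pid, _ = strconv.Atoi(strings.TrimSpace(string(data)))
			break
		}
		time.Sleep(time.Second)
	}

	// pidfd_open + poll lets PID 1 wait for a process that is not its child.
	pidfd, err := unix.PidfdOpen(pid, 0)
	if err != nil {
		os.Exit(1)
	}
	fds := []unix.PollFd{{Fd: int32(pidfd), Events: unix.POLLIN}}
	for {
		// The pidfd becomes readable when virtiofsd terminates.
		if _, err := unix.Poll(fds, -1); err != unix.EINTR {
			break
		}
	}
	// PID 1 exiting terminates the container.
	os.Exit(0)
}
```

A real implementation would also need to join the PID namespace (which requires an additional fork after `setns`) and to keep only the capabilities virtiofsd actually needs, as discussed in the review thread.
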
## API Examples

No changes to the KubeVirt API are required. This is an internal implementation detail that changes how `virtiofsd` is managed, not how it is exposed to the user via the API.

## Alternatives

<!--
Outline any alternative designs that have been considered
-->

* **Run virtiofsd as a privileged container:** This would involve running a privileged virtiofs container, granting specific Linux capabilities (e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.

Disadvantage: While seemingly simpler, this is generally considered a security risk. Also, in the future it is expected that it will not be possible to run privileged containers outside the kubevirt namespace. So, this is a non-starter for KubeVirt's future security model.

> Review discussion:
> - How does this approach differ from the one suggested in the Design section? IIUC, what you are suggesting is to have a dispatcher which is a child process of virt-handler. The dispatcher process will inherit the parent's privileged caps and all the allowed syscalls, because no seccomp filter will be applied to that process, as none is applied to its virt-handler parent. Then we do `setns` into the virtiofs dummy PID's Linux namespaces and run from there. That is basically like running a privileged container process, but bypassing all the Kubernetes runtime restrictions that apply to the unprivileged virt-launcher pod. Isn't this the case?
> - I think any subsequent process created in the virtiofs container will not be privileged, which would not be the case if the container were privileged. So an attacker can't somehow start a root shell, for example.
> - @mhenriks From what I understood, the dispatcher process will be created in virt-handler and then moved to the virtiofs container. virtiofsd will then be forked by the dispatcher.
> - @enp0s3 Yes, that is my understanding too; I was trying to explain how the proposed process here is different and potentially more secure than having a privileged container in the pod.
> - @mhenriks Thanks for the explanation. I see that the trade-off is having a privileged container vs. having a privileged process inside an unprivileged container. TBH I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack. I know that we are already doing this in the project and I wish we could get rid of the approach. Manipulating cgroups directly can be raceful.
> - Actually, I don't know that there will be cgroup manipulation in this case.
> - @akalenyu IIUC the aim is to migrate the dispatcher process ID from virt-handler's `cgroup.procs` to the virtiofs container's `cgroup.procs`.
> - That was the original design, and I was asked to remove the privileged container, because in the (near) future those types of containers will not be allowed in k8s. You can see what the dispatcher does here: https://github.com/kubevirt/kubevirt/pull/14115/files#diff-9211391cc49c045761d05c2a94b20d24b4382123d1693deedb83a40b30dbe177
> - Which caps does it actually need? I assume it needs fewer caps than virt-handler? — These: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, SETGID, SETUID, SETFCAP, and CAP_DAC_READ_SEARCH. I'll add that to the document.
> - While not ideal, Pod Security Admission can be turned off per namespace or for a specific service account. I think we should also think about the advantage: the runtime will handle all the configuration of the process/container. — Making the virtiofsd container run as root is the simplest solution and also allows switching UIDs/GIDs; will this be accepted as GA? But if you are talking about another alternative, I'm not sure it is a good idea to have so many ways to run virtiofsd with PVCs. As part of this I will add the possibility to disable the method I'm proposing (disabling live migration), so the user can run privileged-as-infra (this VEP), unpriv-without-live-migration (if the admin disables it), or in a privileged container.

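For reference, a minimal sketch of what this rejected alternative's security context might look like, using the capability set mentioned in the review thread (the field values are assumptions for illustration, not KubeVirt code):

```go
// Illustrative only: granting the discussed capability set to the virtiofs
// container via its SecurityContext (the alternative rejected above).
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
)

func main() {
	runAsNonRoot := false
	sc := corev1.SecurityContext{
		RunAsNonRoot: &runAsNonRoot,
		Capabilities: &corev1.Capabilities{
			Add: []corev1.Capability{
				"CHOWN", "DAC_OVERRIDE", "FOWNER", "FSETID",
				"SETGID", "SETUID", "SETFCAP", "DAC_READ_SEARCH",
			},
		},
	}
	fmt.Printf("%+v\n", sc)
}
```
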
* **Using a delegated privileged monitor:** A new privileged component as part of the KubeVirt infrastructure. Since it is stateless, no data needs to be migrated. It uses [seccomp notify](https://brauner.io/2020/07/23/seccomp-notify.html) to intercept `name_to_handle_at(2)` and `open_by_handle_at(2)`. The privileged monitor runs these syscalls on behalf of virtiofsd, returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to the KubeVirt infrastructure. However, the current kernel implementation of seccomp notify does not support reconnection, making recovery impossible after a KubeVirt upgrade or if the monitor dies for any other reason.

> Review discussion:
> - Can you run the monitor as a separate privileged container/Pod? Or would it be basically the same as using a privileged container for virtiofs? — We considered two approaches: a binary inside the virt-handler container, or a privileged deployment in the kubevirt namespace.
> - What about: […] — I'm not sure which text you are referring to.

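For illustration, a minimal sketch of the HMAC-signed handle idea (an assumed mechanism, not the monitor's actual code): the monitor signs the raw handle bytes it returns and verifies the signature before acting on a handle, so virtiofsd cannot forge handles.

```go
// Sketch: sign and verify file-handle bytes exchanged with virtiofsd.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"fmt"
)

// signHandle appends an HMAC over the raw file-handle bytes; only the monitor
// holds the key, so only handles it issued will verify later.
func signHandle(key, rawHandle []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(rawHandle)
	return append(append([]byte{}, rawHandle...), mac.Sum(nil)...)
}

// verifyHandle re-computes the HMAC before the monitor would call
// open_by_handle_at on behalf of virtiofsd.
func verifyHandle(key, signed []byte) ([]byte, bool) {
	if len(signed) < sha256.Size {
		return nil, false
	}
	raw := signed[:len(signed)-sha256.Size]
	sig := signed[len(signed)-sha256.Size:]
	mac := hmac.New(sha256.New, key)
	mac.Write(raw)
	return raw, hmac.Equal(sig, mac.Sum(nil))
}

func main() {
	key := []byte("monitor-private-key") // hypothetical key material
	signed := signHandle(key, []byte{0x01, 0x02, 0x03, 0x04})
	if raw, ok := verifyHandle(key, signed); ok {
		fmt.Printf("handle accepted: % x\n", raw)
	}
}
```
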
## Scalability

No impact on scalability. We keep the current design of a single container for each volume.

> Any possible delays to startup time? — Yes, that's true; I'll add it.

## Update/Rollback Compatibility

No impact on update/rollback compatibility.

> Worth noting that virt-handler will be updated before virt-controller, and that is why the update is not an issue? — It's not an issue since virtiofsd will not be restarted during an update/rollback; its lifetime is linked to the VM, like today. So virtiofsd will be updated/rolled back when the VM migrates.

## Functional Testing Approach

Besides reusing virtiofsd's functional tests, both unit and functional tests are added to test the "injection" of virtiofsd into the container.

> Please be more detailed.

## Implementation Phases

This feature can be implemented in a single phase.

## Feature lifecycle Phases

<!--
How and when will the feature progress through the Alpha, Beta and GA lifecycle phases

Refer to https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md#releases for more details
-->

Given that virtiofsd functionality has been available in KubeVirt for years, and the scope of these changes only affects how virtiofsd runs, we could squash the alpha and beta versions into a single phase.

> Review discussion:
> - As others, I also vote for a feature gate or configuration, given that we re-implement what the runtime does for us today and potentially side-step security measures. — It's currently behind a feature gate, and since this replaces how virtiofsd currently runs, I'm just reusing it.
> - I would advocate for an opt-out field, back to the completely unprivileged implementation, for security-concerned admins (accepting its limitations, of course). This way we can also pull the plug on CVEs if those come up, without code changes in a running environment.
> - +1, put it behind a feature gate.
> - +1
> - Sorry, this part is a leftover of the first version; I don't plan to ask for squashing alpha and beta. It is already behind a feature gate: […] I assume this "opt-out field" is not a feature gate, right? Otherwise, how would that work once this becomes GA? But more importantly, if there is something else, it means no live migration for a VMI with a PVC shared using unprivileged virtiofsd. The unprivileged virtiofsd uses a method that is not reliable for PVCs that can be modified by other VMs during the migration.

### Alpha

v1.7

### Beta

v1.8

### GA

v1.9

> I am not sure I would call this "move to infrastructure"; the context remains the same, an adjacent container, but who creates `virtiofsd` changes. — Agreed, I'm not sure what to call it, honestly. I'm also in the middle of some changes: I'll install the binary into the virt-handler container and run that binary instead of the one inside the launcher container.