VEP #53: Moving virtiofsd to infrastructure for robust live migration #54
base: main
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
/sig storage
Tests pass in KubeVirt for this new enhancement.
@xpivarc @jean-edouard, how do you feel about moving virtiofsd to virt-handler?
From my side, this proposal feels like a necessary evolution. In general, I like that we are moving all the overhead away from the virt-launcher. What is not clear to me (maybe I missed it) is how the fallback to the previous behavior will work.
Hi @vladikr, thanks for taking a look
We are reusing the current FG
There is no fallback; this completely replaces the current behavior (only for PVCs), and since the current implementation is not GA, we just keep the current FG. Currently, there is no difference between sharing a configMap/secret/etc. and a PVC (besides both being behind different FGs). But, under the current conditions (unprivileged virtiofsd), live migrating while sharing a PVC leaves a lot of room for unexpected errors in the guest.
/cc
Overall looks good!
A couple of questions; the main concern is CVEs, since we're kind of going in the opposite direction (unprivileged -> privileged through our own management).
Given that virtiofsd functionality has been available in Kubevirt for years,
and the scope of these changes only affects how virtiofsd runs, we could squash
the alpha and beta versions into a single phase.
I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.
+1 Put it behind a FG
+1
Sorry, this part is a leftover from the first version; I don't plan to ask for squashing alpha and beta.
+1 Put it behind a FG
It is already behind an FG: EnableVirtioFsPVC
I would advocate for an opt-out field, back to the completely unprivileged impl., for security-concerned admins (accepting its limitations, ofc). This way we can also pull the plug on CVEs, if those come up, without code changes in a running environment.
I assume this "opt-out field" is not an FG, right? Otherwise, how will that work when this becomes GA? But more importantly, if there is something else, it means no live migration for a VMI with a PVC shared using unprivileged virtiofsd. The unprivileged virtiofsd uses a method that is not reliable for PVCs that can be modified, by other VMs, during the migration.
and hinders a robust live migration. The proposed solution is to move `virtiofsd`
as part of the infrastructure. The virtiofs container will remain **rootless**,
running a dummy process, while the `virt-handler` will launch the `virtiofsd`
binary and ensure it joins the virtiofs container's namespaces and cgroup.
I guess the biggest worry about this is going to be CVEs;
We're kind of bypassing the k8s security model by doing this
(though I get this is pretty much required)
Is there anything we could do proactively to ensure we don't hit those?
Could you elaborate on what you mean by bypassing the k8s security model?
Is there anything we could do proactively to ensure we don't hit those?
The process keeps only the required capabilities and chroots itself into the shared dir (dropping CAP_MKNOD), and all the create operations are done with UID/GID 107.
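For illustration only, a minimal Go sketch of the confinement steps described above (virtiofsd itself is a Rust binary; the package and function names here are invented, and the real code also uses mount-namespace sandboxing):

```go
// Hypothetical sketch: restrict the filesystem view to the shared directory
// and drop to UID/GID 107 before serving any requests.
package sandbox

import "syscall"

func confine(sharedDir string) error {
	// Chroot into the shared directory so no other host path is reachable.
	if err := syscall.Chroot(sharedDir); err != nil {
		return err
	}
	if err := syscall.Chdir("/"); err != nil {
		return err
	}
	// All create/chown operations after this point happen as UID/GID 107.
	if err := syscall.Setgid(107); err != nil {
		return err
	}
	return syscall.Setuid(107)
}
```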
@germag I think what @akalenyu is trying to say is the same as my concern - by this approach we bypass the Kubernetes node SW stack: kubelet, CRI, container runtime. We are injecting a privileged process into a container whose parent sandbox container, AKA pod, was instructed to be unprivileged by the pod security configuration in the virt-launcher Pod spec.
Unfortunately, I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s SIGs, so that we could set capabilities with limited usage via the pod security API.
Yeah, we have been burned before messing with the cgroup directly. Since it's not really managed by us, we have been breaking systemd's single-writer rule at a minimum: https://systemd.io/CGROUP_DELEGATION/
Unfortunately, I don't have an alternative to suggest at the moment. I wish we could push seitan forward to be adopted by the container engines and then by the k8s SIGs, so that we could set capabilities with limited usage via the pod security API.
Same here. I posted my latest take, understanding that there could be inherent limitations that make it impossible. Seitan is really interesting!
EDIT:
Here's an old thread about using it in kubevirt - https://groups.google.com/g/kubevirt-dev/c/jEa-s5pFgZA/m/3yAiZwrXAAAJ
infrastructure. The virtiofs container will remain rootless, starting a dummy
process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs
Could you elaborate a little on that? Will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app? Any extra info here would be awesome 🙏
Sure, I'll add more context.
will there be a separate container under the virt-handler pod, similar to the pr-helper with the dispatcher app?
No, the containers are the same as in the current impl: the VM pod will still have 1 compute container and 1 rootless virtiofsd container (per PVC). The "dispatcher" (part of the virt-handler), after finding the container, will run a privileged virtiofsd inside that container.
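As a rough illustration of what such a dispatcher could do (a sketch under assumptions, not the actual code in kubevirt/kubevirt#14115; the names below are invented), it would join the target container's namespaces via its dummy PID 1 and then exec virtiofsd:

```go
// Hypothetical sketch: enter the virtiofs container's namespaces (identified
// by its dummy PID 1) and exec the virtiofsd binary there. Joining the mount
// namespace must happen from a single-threaded context, so a real dispatcher
// typically re-execs itself before calling something like this.
package dispatcher

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

func joinAndExec(containerPID int, virtiofsdPath string, args []string) error {
	for _, ns := range []string{"mnt", "pid", "net", "ipc", "uts"} {
		f, err := os.Open(fmt.Sprintf("/proc/%d/ns/%s", containerPID, ns))
		if err != nil {
			return fmt.Errorf("open %s namespace: %w", ns, err)
		}
		// Move this process into the container's namespace.
		err = unix.Setns(int(f.Fd()), 0)
		f.Close()
		if err != nil {
			return fmt.Errorf("setns %s: %w", ns, err)
		}
	}
	// Replace the current process image with virtiofsd, now running inside
	// the virtiofs container's namespaces.
	return unix.Exec(virtiofsdPath, append([]string{virtiofsdPath}, args...), os.Environ())
}
```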
@germag Thank you for the VEP, I have a small question below
Outline any alternative designs that have been considered)
-->

* **Run virtiofsd as a privileged container:** This would involve running a
How does this approach differ from the one suggested in the design section? IIUC, what you are suggesting is to have a dispatcher which is a child process of virt-handler. The dispatcher process will inherit the parent's privileged caps, and all syscalls will be allowed because no seccomp filter is applied to that process, as it is not applied to its virt-handler parent. Then we do setns into the Linux namespaces of the virtiofs dummy PID and run from there. That is basically like running a privileged container process, but bypassing all the Kubernetes runtime restrictions that apply to the unprivileged virt-launcher pod. Isn't this the case?
I think any subsequent process created in the virtiofs container will not be privileged, which would not be the case if the container were privileged. So an attacker can't somehow start a root shell, for example.
@mhenriks From what I understood, the dispatcher process will be created in the virt-handler and then moved to the virtiofs container. virtiofsd will then be forked by the dispatcher.
@enp0s3 Yes, that is my understanding too. I was trying to explain how the proposed process here is different and potentially more secure than having a privileged container in the pod.
@mhenriks Thanks for the explanation. I see that the trade-off is having a privileged container vs. having a privileged process inside an unprivileged container. TBH, I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack. I know that we are already doing it in the project, and I wish we could get rid of this approach. Manipulating cgroups directly can be racy.
Actually, I don't know that there will be cgroup manipulation in this case
(this seems to be focusing on the other container building block, namespaces, and not so much on resource allocation).
But yeah, we're all singing the same tune... I think a user should also be able to disable this completely post-GA, since it's inherently a risky impl.
@akalenyu IIUC the aim is to migrate the dispatcher process ID from the virt-handler's cgroup.procs to the virtiofs container's cgroup.procs.
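On cgroup v2, that migration boils down to writing the PID into the target cgroup's cgroup.procs file. A minimal sketch, assuming the container's cgroup directory has already been resolved (names and paths are illustrative, not KubeVirt code):

```go
// Hypothetical sketch: move a process into the virtiofs container's cgroup by
// writing its PID to that cgroup's cgroup.procs file.
package dispatcher

import (
	"fmt"
	"os"
	"path/filepath"
)

func moveToCgroup(containerCgroupDir string, pid int) error {
	procs := filepath.Join(containerCgroupDir, "cgroup.procs")
	// Writing a PID to cgroup.procs migrates the process (and, on cgroup v2,
	// all of its threads) into the target cgroup.
	return os.WriteFile(procs, []byte(fmt.Sprintf("%d\n", pid)), 0o644)
}
```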
TBH I would prefer to increase the privilege of the virt-launcher pod rather than bypassing the container management stack
That was the original design, and I was asked to remove the privileged container, because in the (near) future those types of containers will not be allowed in k8s.
You can see what the dispatcher does here: https://github.com/kubevirt/kubevirt/pull/14115/files#diff-9211391cc49c045761d05c2a94b20d24b4382123d1693deedb83a40b30dbe177
container's namespaces and cgroups, thereby operating within the same system
views and resource limitations defined for that container. Furthermore, the
virtiofs container's dummy PID 1 process will be designed to ensure that the
container's lifetime is bound to that of virtiofsd; if virtiofsd terminates,
Will it also have a timeout to wait for the virtiofsd to spawn initially?
No, there is no timeout. I don't really like timeouts; I don't know what we can use as an alternative, if that is required.
* **Run virtiofsd as a privileged container:** This would involve running a
privileged virtiofs container, granting specific Linux capabilities
(e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.
Which caps does it actually need? I assume it needs fewer caps than virt-handler?
These: CHOWN, DAC_OVERRIDE, FOWNER, FSETID, SETGID, SETUID, SETFCAP, and CAP_DAC_READ_SEARCH
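For context, a hedged sketch of how a launcher could restrict itself to just that set by pruning the capability bounding set (illustrative only; the real code may also manage the effective, permitted, and ambient sets differently):

```go
// Hypothetical sketch: drop every capability from the bounding set except the
// ones listed above, so nothing else can ever be re-acquired by this process
// or its children.
package dispatcher

import "golang.org/x/sys/unix"

var keep = map[int]bool{
	unix.CAP_CHOWN:           true,
	unix.CAP_DAC_OVERRIDE:    true,
	unix.CAP_DAC_READ_SEARCH: true,
	unix.CAP_FOWNER:          true,
	unix.CAP_FSETID:          true,
	unix.CAP_SETGID:          true,
	unix.CAP_SETUID:          true,
	unix.CAP_SETFCAP:         true,
}

func dropBoundingSet() error {
	for c := 0; c <= unix.CAP_LAST_CAP; c++ {
		if keep[c] {
			continue
		}
		// Remove every capability we do not explicitly need.
		if err := unix.Prctl(unix.PR_CAPBSET_DROP, uintptr(c), 0, 0, 0); err != nil {
			return err
		}
	}
	return nil
}
```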
I'll add that in the document
The privileged monitor runs these syscalls on behalf of virtiofsd,
returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to
Can you run the monitor as a separate privileged container/Pod? Or would it be basically the same as using a privileged container for virtiofs?
We considered two approaches: a binary inside the virt-handler container, or a privileged deployment in the kubevirt namespace.
## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing
By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?
I think it would be unlikely given that the guest is running in a different container
By providing virtiofsd more capabilities, would it be possible to exploit those from within the guest?
Everything is possible :), but unlikely; virtiofsd is in a different container (but a different user namespace), and the guest only "talks" with virtiofsd via 2 virtio queues.
* **File Handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might have a large number of files. Without proper file handle
support, `virtiofsd` must rely on file descriptors which are a limited resource.

* **Live Migration challenges**: The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs utilizing
Virtiofs for PVCs sharing that might be concurrently accessed. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
target destination. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.
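To make the quoted file-handle limitation concrete, here is a small illustrative sketch of the underlying syscalls (not KubeVirt code; names are invented): name_to_handle_at(2) produces a stable handle without keeping a file descriptor open, while open_by_handle_at(2) reopens it later and requires CAP_DAC_READ_SEARCH, which an unprivileged virtiofsd lacks.

```go
// Hypothetical illustration of file-handle tracking: encode a file's identity
// as an opaque handle (no fd stays open), then reopen it on demand.
package dispatcher

import "golang.org/x/sys/unix"

func reopenViaHandle(sharedDirFD int, path string) (int, error) {
	// Turn the path into an opaque, persistent handle.
	handle, mountID, err := unix.NameToHandleAt(sharedDirFD, path, 0)
	if err != nil {
		return -1, err
	}
	_ = mountID // a real server maps the mount ID back to an fd on that mount
	// Privileged step: turn the handle back into an open file descriptor.
	return unix.OpenByHandleAt(sharedDirFD, handle, unix.O_RDONLY)
}
```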
So it's a bit hard to reason about these existing challenges (tbh at first I just accepted them)
since I am not a virtiofs expert, but what would make me feel much better about this is if a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.
I could be missing a fundamental issue with doing that in virtiofs, like I said, I am not an expert, but we have been in similar situations before - O_DIRECT against a backing storage device that's not exposed to the unprivileged container for example.
One of the ideas floated (not sure they took that path, it was a vendor):
qemu->daemon over socket, send all I/O to daemon which is privileged
(I/O done on behalf of qemu)
Somewhat related, the implementation might make more sense in the form of a DRA/CSI driver/device plugin, for example, similar to the one for non root containers using FUSE:
https://github.com/nextflow-io/k8s-fuse-plugin
Maybe it would be beneficial to describe the background on why the current approach leads to problems in more detail.
a single privileged daemon on the host could be tackling these on behalf of the unprivileged process.
We tried that (not with a single privileged daemon, but one per virtiofsd), using seccomp notify. It worked and also had a nice way to be integrated into KubeVirt, but it requires substantial changes in the kernel and k8s to support reconnection, for instance after KubeVirt updates itself. Also, it has some perf penalty since it needs to sign/check the signature of the file handles (because those can be forged), and it requires carefully checking the path of the shared dir since requests come from an untrusted party.
I'll add more context about the limitations of using a path-based live migration.
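A rough sketch of the signing step mentioned above (purely illustrative; the real monitor and its wire format are not shown here): the privileged monitor tags each opaque handle with an HMAC and rejects any handle whose tag does not verify, so the unprivileged side cannot forge handles.

```go
// Hypothetical sketch: HMAC-sign raw file-handle bytes before handing them to
// the unprivileged side, and verify the tag before trusting a handle we get back.
package monitor

import (
	"crypto/hmac"
	"crypto/sha256"
)

// signHandle appends an HMAC-SHA256 tag to the raw file handle bytes.
func signHandle(key, rawHandle []byte) []byte {
	mac := hmac.New(sha256.New, key)
	mac.Write(rawHandle)
	return append(rawHandle, mac.Sum(nil)...)
}

// verifyHandle checks the tag and returns the original handle bytes.
func verifyHandle(key, signed []byte) ([]byte, bool) {
	if len(signed) < sha256.Size {
		return nil, false
	}
	raw, tag := signed[:len(signed)-sha256.Size], signed[len(signed)-sha256.Size:]
	mac := hmac.New(sha256.New, key)
	mac.Write(raw)
	return raw, hmac.Equal(mac.Sum(nil), tag)
}
```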
Mainly, I would like us to describe the security in more detail and consider some alternatives (see comments).
within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a
dedicated virtiofs container. This setup limits its ability to utilize file
handles for sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs)
and hinders a robust live migration. The proposed solution is to move `virtiofsd`
I am not sure I would call this "move to infrastructure"; the context remains the same (adjacent container), but who creates the virtiofsd changes.
Agreed, I'm not sure what to call it, honestly.
I'm also in the middle of some changes: I'll install the binary into the virt-handler container and run that binary instead of the one inside the launcher container.
By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join
the container's namespaces and cgroups, we aim to overcome these limitations,
enabling `virtiofsd` file handle functionality while keeping a strong
The security part is important enough that I would say it is worth having its own section in this design.
## Non Goals

* Using this method to share `configMaps`, `secrets`, `downwardAPIs` or
+1
`serviceAccounts`; the current implementation of a `virtiofsd` process
within a dedicated, unprivileged container will continue serving these volumes.

* Cross cluster live migration and/or storage migration. It is beyond the scope
So cross-cluster LM might or might not be impacted, depending on whether the underlying FS changes, right?
Yes, I think I'll remove that part, since it can be confusing
Kubevirt infrastructure. However, the current kernel's implementation of the
seccomp notify does not support reconnection, making recovery impossible after
a Kubevirt upgrade or if the monitor dies for any other reason.
What about:
"Run virtiofsd as a sub-privileged Pod"?
"Kubernetes user namespaces" ?
I'm not sure which text you are referring to.
## Scalability

No impact on the scalability. We keep the current design of a single container
Any possible delays to startup time?
Yes that's true, I'll add it
## Update/Rollback Compatibility

No impact on the update/rollback compatibility.
Worth noting that virt-handler will be updated before virt-controller, and that is why the update is not an issue?
It's not an issue since virtiofsd will not be restarted during an update/rollback; its lifetime is linked to the VM, like today. So virtiofsd will be updated/rolled back when the VM migrates.
## Functional Testing Approach

Besides reusing the virtiofsd's functional tests, both unit and functional
Please be more detailed.
Refer to https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md#releases for more details
-->

Given that virtiofsd functionality has been available in Kubevirt for years,
Like others, I vote as well for an FG or configuration, given that we re-implement what the runtime does for us today and potentially side-step security measures.
It's currently behind an FG, and since this is replacing how it currently runs, I'm just reusing it.
I'll add a configuration knob to disable this and run completely unprivileged, but it will also disable live migration (only for PVCs).
VEP Metadata
Tracking issue: #53
SIG label: /sig storage
What this PR does
This VEP proposes a fundamental change to how the virtiofsd process is managed within KubeVirt. Virtiofsd is moved to be part of the infrastructure. The virtiofs container will remain rootless, starting a dummy process. Then, the virtiofsd binary will be launched by virt-handler, and it will join the virtiofs container's namespaces and cgroup.
Special notes for your reviewer
We already posted a PR:
kubevirt/kubevirt#14115
Release note: