# VEP #53: Moving virtiofsd to infrastructure for robust live migration

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue created, which links to VEP dir in [kubevirt/enhancements] (not the initial VEP PR)
- [ ] (R) Target version is explicitly mentioned and approved
- [ ] (R) Graduation criteria filled

## Overview

This VEP proposes a fundamental change to how the `virtiofsd` process is managed
within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a
dedicated virtiofs container. This setup limits its ability to use file
handles for sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs)
and hinders robust live migration. The proposed solution is to make `virtiofsd`
part of the infrastructure. The virtiofs container will remain **rootless**,
running a dummy process, while `virt-handler` will launch the `virtiofsd`
binary and ensure it joins the virtiofs container's namespaces and cgroup.

## Motivation

The current architecture for Virtiofs in KubeVirt runs the `virtiofsd`
daemon within a dedicated, unprivileged container. While this approach offers a
strong security boundary by isolating `virtiofsd` from the host, it introduces
significant functional limitations:
* **File handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might contain a large number of files. Without file handle
support, `virtiofsd` must rely on file descriptors, which are a limited resource.

* **Live migration challenges:** The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs that use
Virtiofs to share PVCs that might be accessed concurrently. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
migration target. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.

By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join
the container's namespaces and cgroups, we aim to overcome these limitations,
enabling `virtiofsd`'s file handle functionality while keeping strong
security.

## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing
with VMs.
* Facilitate safe and reliable live migration for VMs that leverage Virtiofs
for PVC sharing.
* Maintain the rootless execution model for the virtiofs container,
preserving its security benefits.

## Non Goals

* Using this method to share `configMaps`, `secrets`, `downwardAPIs` or
`serviceAccounts`; the current implementation of a `virtiofsd` process
within a dedicated, unprivileged container will continue serving these volumes.

## Definition of Users

* VM Owners
* Cluster Admins

## User Stories

* As a KubeVirt user, I want to use Virtiofs for sharing PVCs with my VMs
without encountering issues related to file descriptor limitations, such as
reaching the open files limit.
* As a KubeVirt user, I want to reliably and safely live migrate VMs that use
Virtiofs, ensuring data consistency during migration events.
* As a KubeVirt administrator, I want to be able to provide robust live
migration without allowing privileged containers.

## Repos

[KubeVirt](https://github.com/kubevirt/kubevirt)

## Design

The management of the `virtiofsd` process will be integrated into the KubeVirt
infrastructure. The virtiofs container will remain rootless, starting a dummy
process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs
container's namespaces and cgroups, so that it operates within the same system
views and resource limitations defined for that container. Furthermore, the
virtiofs container's dummy PID 1 process will be designed so that the
container's lifetime is bound to that of `virtiofsd`: if `virtiofsd` terminates,
the dummy process will exit, leading to the container's termination.

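A minimal sketch of the launch mechanics, assuming a hypothetical `nsenterArgs` helper and cgroup v2 paths. The real virt-handler implementation uses KubeVirt's internal isolation and cgroup packages, not these names:

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// nsenterArgs builds an argv that runs virtiofsd inside the namespaces of
// the virtiofs container's dummy PID 1. Hypothetical helper: the actual
// virt-handler code uses its own isolation utilities.
func nsenterArgs(dummyPID int, virtiofsdArgv []string) []string {
	argv := []string{
		"nsenter",
		"--target", strconv.Itoa(dummyPID),
		"--mount", "--uts", "--ipc", "--net", "--pid",
		"--",
	}
	return append(argv, virtiofsdArgv...)
}

// joinCgroup would move a PID into the container's cgroup (cgroup v2) by
// writing it to cgroup.procs, so that virtiofsd runs under the container's
// resource limits.
func joinCgroup(cgroupDir string, pid int) error {
	return os.WriteFile(cgroupDir+"/cgroup.procs",
		[]byte(strconv.Itoa(pid)), 0o644)
}

func main() {
	// Example invocation with a made-up container PID and paths.
	argv := nsenterArgs(4242, []string{
		"/usr/libexec/virtiofsd",
		"--socket-path=/var/run/virtiofsd.sock",
		"--shared-dir=/mnt/pvc",
	})
	fmt.Println(argv)
}
```

The dummy PID 1 can then bind the container's lifetime to `virtiofsd`, for example by waiting for the `virtiofsd` process (visible once it joins the PID namespace) to disappear and then exiting, which terminates the container.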
The following figure explains how virtiofsd is launched step-by-step:

## API Examples

No changes to the KubeVirt API are required. This is an internal implementation
detail that changes how `virtiofsd` is managed, not how it is exposed to the
user via the API.

## Alternatives

* **Run virtiofsd as a privileged container:** This would involve running a
privileged virtiofs container, granting specific Linux capabilities
(e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.

Disadvantage: While seemingly simpler, this is generally considered a security
risk. Moreover, running privileged containers outside the kubevirt namespace is
expected to become impossible in the future, making this a non-starter
for KubeVirt's future security model.

* **Using a delegated privileged monitor:** A new privileged, stateless
component would be added to the KubeVirt infrastructure; since it is stateless,
no data needs to be migrated. It would use
[seccomp notify](https://brauner.io/2020/07/23/seccomp-notify.html) to
intercept `name_to_handle_at(2)` and `open_by_handle_at(2)`.
The privileged monitor runs these syscalls on behalf of `virtiofsd`,
returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to the
KubeVirt infrastructure. However, the kernel's current implementation of
seccomp notify does not support reconnection, making recovery impossible after
a KubeVirt upgrade or if the monitor dies for any other reason.

## Scalability

No impact on scalability. We keep the current design of a single container
for each volume.

## Update/Rollback Compatibility

No impact on update/rollback compatibility.

## Functional Testing Approach

Besides reusing virtiofsd's existing functional tests, both unit and functional
tests will be added to cover the injection of `virtiofsd` into the container.

## Implementation Phases

This feature can be implemented in a single phase.

## Feature Lifecycle Phases

Given that virtiofsd functionality has been available in KubeVirt for years,
and the scope of these changes only affects how `virtiofsd` runs, we could
squash the Alpha and Beta versions into a single phase.

### Alpha

v1.6

### Beta

v1.7

### GA

v1.8