Commit cfd787a

VEP #53: Moving virtiofsd to infrastructure for robust live migration
Signed-off-by: German Maglione <[email protected]>
1 parent a06daa1 commit cfd787a

2 files changed: +170 -0 lines changed
@@ -0,0 +1,170 @@
# VEP #53: Moving virtiofsd to infrastructure for robust live migration

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [X] (R) Enhancement issue created, which links to VEP dir in [kubevirt/enhancements] (not the initial VEP PR)
- [ ] (R) Target version is explicitly mentioned and approved
- [ ] (R) Graduation criteria filled

## Overview

This VEP proposes a fundamental change to how the `virtiofsd` process is managed
within KubeVirt. Currently, `virtiofsd` runs as an unprivileged process inside a
dedicated virtiofs container. This setup limits its ability to use file handles
when sharing Persistent Volume Claims (PVCs) with Virtual Machines (VMs) and
hinders robust live migration. The proposed solution is to make `virtiofsd`
part of the infrastructure. The virtiofs container will remain **rootless**,
running a dummy process, while `virt-handler` will launch the `virtiofsd`
binary and ensure it joins the virtiofs container's namespaces and cgroup.

## Motivation

The current architecture for Virtiofs in KubeVirt involves running the `virtiofsd`
daemon within a dedicated, unprivileged container. While this approach offers a
strong security boundary by isolating `virtiofsd` from the host, it introduces
significant functional limitations:

* **File handle support:** Running `virtiofsd` in an unprivileged container
restricts its ability to track guest files using file handles. This capability
is crucial for efficient and robust file sharing, especially when dealing with
PVCs that might contain a large number of files. Without proper file handle
support, `virtiofsd` must rely on file descriptors, which are a limited resource.

* **Live migration challenges:** The inability to properly manage file handles
directly impacts the safety and reliability of live migration for VMs that use
Virtiofs to share PVCs that might be accessed concurrently. During a live
migration, the `virtiofsd` instance needs to hand over its internal state to the
migration target. Restrictions on file handles make this hand-off prone to
issues, potentially leading to data inconsistencies or migration failures.

By moving `virtiofsd` to be managed by `virt-handler` and allowing it to join
the container's namespaces and cgroups, we aim to overcome these limitations,
enabling `virtiofsd` file handle functionality while maintaining a strong
security boundary.

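To make the "limited resource" point concrete, the following is a minimal Python sketch, not part of the VEP, that holds files open until the kernel refuses with `EMFILE`. The helper name `count_openable` and the limit of 64 are illustrative assumptions; the sketch temporarily lowers the process's soft `RLIMIT_NOFILE` so the limit is reached quickly, then restores it.

```python
import errno
import os
import resource
import tempfile


def count_openable(directory: str, soft_limit: int) -> int:
    """Hold files open until the kernel refuses with EMFILE; return how many fit.

    Hypothetical helper for illustration only; it is not virtiofsd code.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    # Temporarily lower the soft limit so the demonstration hits it quickly.
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    held = []
    try:
        for i in range(soft_limit + 1):
            path = os.path.join(directory, f"file-{i}")
            try:
                held.append(os.open(path, os.O_CREAT | os.O_RDONLY, 0o600))
            except OSError as exc:
                if exc.errno == errno.EMFILE:
                    break  # the per-process descriptor table is full
                raise
        return len(held)
    finally:
        # Release every descriptor and restore the original soft limit.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("files held open before EMFILE:", count_openable(tmp, 64))
```

The number returned is always below the soft limit, because descriptors such as standard input and output already occupy slots. A daemon that keeps one descriptor per tracked guest file runs into the same ceiling, only at a larger scale.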
## Goals

* Enable `virtiofsd` to utilize file handles, facilitating robust PVC sharing
with VMs.
* Facilitate safe and reliable live migration for VMs that leverage Virtiofs
for PVC sharing.
* Maintain the rootless execution model for the virtiofs container,
preserving its security benefits.

## Non Goals

* Using this method to share `configMaps`, `secrets`, `downwardAPIs` or
`serviceAccounts`; the current implementation of a `virtiofsd` process
within a dedicated, unprivileged container will continue serving these volumes.

## Definition of Users

* VM Owners
* Cluster Admins

## User Stories

* As a KubeVirt user, I want to use Virtiofs for sharing PVCs with my VMs
without encountering issues related to file descriptor limitations, like
reaching the open files limit.
* As a KubeVirt user, I want to be able to live migrate VMs that use Virtiofs
reliably and safely, ensuring data consistency during migration events.
* As a KubeVirt administrator, I want to be able to provide robust live
migration without allowing privileged containers.

## Repos

[KubeVirt](https://github.com/kubevirt/kubevirt)

## Design

The management of the `virtiofsd` process will be integrated into the KubeVirt
infrastructure. The virtiofs container will remain rootless, starting a dummy
process as PID 1.

The `virt-handler` will launch the `virtiofsd` binary inside the virtiofs
container's namespaces and cgroups, thereby operating within the same system
views and resource limitations defined for that container. Furthermore, the
virtiofs container's dummy PID 1 process will be designed to ensure that the
container's lifetime is bound to that of `virtiofsd`; if `virtiofsd` terminates,
the dummy process will exit, leading to the container's termination.

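The VEP does not specify how the dummy process learns that `virtiofsd` has terminated. As one possible illustration, the Python sketch below waits on a process it did not spawn by using a pidfd (`os.pidfd_open`, Python 3.9+ on Linux 5.3+) and falls back to polling `/proc` where pidfds are unavailable. The function names and the way the PID is obtained are assumptions made for this example, not the KubeVirt implementation.

```python
import os
import select
import subprocess
import sys
import time
from typing import Optional


def _has_exited(pid: int) -> bool:
    """Fallback check via /proc: a missing entry or zombie state means done."""
    try:
        with open(f"/proc/{pid}/stat") as stat:
            # The state letter is the first field after the ")" closing the name.
            state = stat.read().rsplit(")", 1)[1].split()[0]
    except (FileNotFoundError, ProcessLookupError):
        return True
    return state in ("Z", "X")


def wait_for_exit(pid: int, timeout: Optional[float] = None) -> bool:
    """Return True once `pid` has terminated, or False if `timeout` elapses."""
    try:
        fd = os.pidfd_open(pid)
    except ProcessLookupError:
        return True  # the process is already gone
    except (AttributeError, OSError):
        fd = None  # pidfds unavailable here; use the /proc fallback below
    if fd is not None:
        try:
            # A pidfd becomes readable when the process it refers to terminates.
            readable, _, _ = select.select([fd], [], [], timeout)
            return bool(readable)
        finally:
            os.close(fd)
    deadline = None if timeout is None else time.monotonic() + timeout
    while deadline is None or time.monotonic() < deadline:
        if _has_exited(pid):
            return True
        time.sleep(0.05)
    return False


if __name__ == "__main__":
    # Stand-in for virtiofsd: a short-lived child process (assumption).
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(0.2)"])
    wait_for_exit(child.pid)
    child.wait()
    print("watched process exited; a placeholder PID 1 would exit here")
```

In a real placeholder the final step would be to exit, since the termination of PID 1 is what ends the container.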
The following figure explains how virtiofsd is launched step-by-step:
![virtiofsd launch](virtiofsd-as-infra.png)

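Since the figure is not reproduced in text, the launch can be approximated with a rough Python sketch. It assumes the `nsenter` utility from util-linux is used to enter the namespaces of the container's PID 1, and that the cgroup is joined by writing a PID to the cgroup v2 `cgroup.procs` file. Both choices, the helper names, and the selected namespaces are assumptions for illustration; the actual `virt-handler` mechanism may differ. The `virtiofsd` flags shown reflect my understanding of the upstream CLI and should be checked against the installed version.

```python
import os
import tempfile
from typing import List

# Namespaces to join, expressed as nsenter long options (assumed selection).
NAMESPACE_FLAGS = ["--mount", "--uts", "--ipc", "--net", "--pid", "--cgroup"]


def build_launch_command(container_pid: int, socket_path: str, shared_dir: str) -> List[str]:
    """Build an argv that starts virtiofsd inside the namespaces of `container_pid`."""
    return (
        ["nsenter", "--target", str(container_pid)]
        + NAMESPACE_FLAGS
        + [
            "--",
            "virtiofsd",
            "--socket-path", socket_path,
            "--shared-dir", shared_dir,
            # Ask virtiofsd to track inodes with file handles, not descriptors.
            "--inode-file-handles=mandatory",
        ]
    )


def join_cgroup(cgroup_dir: str, pid: int) -> None:
    """Move `pid` into a cgroup v2 group by writing it to cgroup.procs."""
    with open(os.path.join(cgroup_dir, "cgroup.procs"), "w") as procs:
        procs.write(str(pid))


if __name__ == "__main__":
    # Print the command instead of running it; executing it needs privileges.
    print(" ".join(build_launch_command(1234, "/run/vfs.sock", "/pvc")))
    # Exercise join_cgroup against a scratch directory, not a real cgroup.
    with tempfile.TemporaryDirectory() as fake_cgroup:
        join_cgroup(fake_cgroup, 1234)
```

Note that `nsenter --cgroup` only enters the cgroup namespace; membership in the container's cgroup is a separate step, which is why the sketch keeps `join_cgroup` as its own function.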
## API Examples

No changes to the KubeVirt API are required. This is an internal implementation
detail that changes how `virtiofsd` is managed, not how it is exposed to the
user via the API.

## Alternatives

<!--
Outline any alternative designs that have been considered
-->

* **Run virtiofsd as a privileged container:** This would involve running a
privileged virtiofs container, granting specific Linux capabilities
(e.g., `CAP_DAC_READ_SEARCH`) to the virtiofs container's security context.

Disadvantage: While seemingly simpler, this is generally considered a security
risk. In addition, running privileged containers outside the kubevirt namespace
is expected to become impossible in the future, so this is a non-starter
for KubeVirt's future security model.

* **Using a delegated privileged monitor:** A new privileged component would be
added to the KubeVirt infrastructure. Since it is stateless, no data needs to be
migrated. It uses [seccomp notify](https://brauner.io/2020/07/23/seccomp-notify.html) to
intercept `name_to_handle_at(2)` and `open_by_handle_at(2)`.
The privileged monitor runs these syscalls on behalf of `virtiofsd`,
returning an HMAC-signed file handle.

Disadvantage: This is an elegant solution that requires minimal changes to
KubeVirt infrastructure. However, the kernel's current implementation of
seccomp notify does not support reconnection, making recovery impossible after
a KubeVirt upgrade or if the monitor dies for any other reason.

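The VEP does not describe the format of the HMAC-signed file handle used by the delegated-monitor alternative. As a hedged sketch of the idea, the Python code below appends an HMAC-SHA256 tag to an opaque handle and verifies it before the handle is used again, presumably so that the monitor only opens handles it issued itself. The function names, the tag layout, and the key handling are assumptions for illustration.

```python
import hashlib
import hmac

# Length in bytes of an HMAC-SHA256 tag.
TAG_LEN = hashlib.sha256().digest_size


def sign_handle(key: bytes, raw_handle: bytes) -> bytes:
    """Return the raw handle followed by its HMAC-SHA256 tag (assumed layout)."""
    tag = hmac.new(key, raw_handle, hashlib.sha256).digest()
    return raw_handle + tag


def verify_handle(key: bytes, signed_handle: bytes) -> bytes:
    """Return the raw handle if the tag is valid, otherwise raise ValueError."""
    if len(signed_handle) < TAG_LEN:
        raise ValueError("signed handle is too short")
    raw_handle, tag = signed_handle[:-TAG_LEN], signed_handle[-TAG_LEN:]
    expected = hmac.new(key, raw_handle, hashlib.sha256).digest()
    # compare_digest avoids leaking the mismatch position through timing.
    if not hmac.compare_digest(tag, expected):
        raise ValueError("file handle signature mismatch")
    return raw_handle


if __name__ == "__main__":
    key = b"monitor-secret-key"           # hypothetical key held by the monitor
    signed = sign_handle(key, b"\x01\x02\x03\x04")
    print(verify_handle(key, signed) == b"\x01\x02\x03\x04")
```

Under this scheme only the monitor holds the key, so the unprivileged side can store and return handles but cannot fabricate one that passes verification.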
## Scalability

No impact on the scalability. We keep the current design of a single container
for each volume.

## Update/Rollback Compatibility

No impact on the update/rollback compatibility.

## Functional Testing Approach

Besides reusing the existing `virtiofsd` functional tests, both unit and
functional tests are added to cover the 'injection' of `virtiofsd` into the
container.

## Implementation Phases

This feature can be implemented in a single phase.

## Feature lifecycle Phases

<!--
How and when will the feature progress through the Alpha, Beta and GA lifecycle phases

Refer to https://github.com/kubevirt/community/blob/main/design-proposals/feature-lifecycle.md#releases for more details
-->

Given that `virtiofsd` functionality has been available in KubeVirt for years,
and the scope of these changes only affects how `virtiofsd` runs, we could squash
the alpha and beta versions into a single phase.

### Alpha

v1.6

### Beta

v1.7

### GA

v1.8