Commit 51e8bb9

author
Lunfan Zhang (张伦凡)
committed
CP-312361 [Design doc] SXM V3 support
Signed-off-by: Lunfan Zhang (张伦凡) <lunfan.zhang@citrix.com>
1 parent 0db9a87 commit 51e8bb9

1 file changed

Lines changed: 346 additions & 0 deletions

---
title: SMAPIv3 snapshot-aware storage migration
layout: default
design_doc: true
revision: 1
status: draft
---

<!--toc:start-->
- [Overview](#overview)
- [Why current SMAPIv3 migration fails with snapshots](#why-current-smapiv3-migration-fails-with-snapshots)
  - [The snapshot chain problem](#the-snapshot-chain-problem)
  - [Loss of the parent-child chain](#loss-of-the-parent-child-chain)
  - [How SMAPIv1 avoids the problem](#how-smapiv1-avoids-the-problem)
- [New SMAPIv3 migration mechanism](#new-smapiv3-migration-mechanism)
  - [The approach in one paragraph](#the-approach-in-one-paragraph)
  - [A worked example](#a-worked-example)
  - [Walking through the migration](#walking-through-the-migration)
    - [Phase 1 — Discover the snapshot tree](#phase-1--discover-the-snapshot-tree)
    - [Phase 2 — Mirror the tree](#phase-2--mirror-the-tree)
    - [Phase 3 — Mirror the live leaf](#phase-3--mirror-the-live-leaf)
    - [Phase 4 — Restore metadata](#phase-4--restore-metadata)
  - [Notes](#notes)
<!--toc:end-->

## Overview

This document is a high-level design for adding snapshot-chain awareness to
SMAPIv3 storage migration (SXM). It explains why the current SMAPIv3 SXM
mechanism, which mirrors only the writable leaf VDI of a running VM, cannot
preserve VMs that have snapshots, and proposes a snapshot-aware migration
flow that re-uses the existing NBD-backed QEMU mirror once per node of the
snapshot tree.

The reader is assumed to be familiar with the high-level shape of SXM
already documented in [Storage migration](https://github.com/xapi-project/xen-api/blob/master/doc/content/xapi/storage/sxm/index.md).
That document describes the per-VDI mirror as the unit of work; the
design below extends that unit from a single VDI to a complete snapshot
tree.

## Why current SMAPIv3 migration fails with snapshots

### The snapshot chain problem

When a VM has snapshots, its storage is not a single VDI but a chain of
related VDIs. Each snapshot captures the state of the disk at a particular
point in time, and in current SMAPIv1 and SMAPIv3 backends these snapshots
are realised as a hierarchical structure of differencing nodes from oldest
(base) to newest (leaf).

A typical snapshot scenario looks like:

```
Source SR storage structure (nested parent-child):

             V1 (original base)
            /  \
          V2    S1 (snapshot, taken at T1)
         /  \
 (leaf) V3    S2 (snapshot, taken at T2)

Snapshot chain (XAPI's view):
  S1 (snapshot of V3) → S2 (snapshot of V3) → V3 (leaf)
```

In this structure:

* `V3` is the current leaf VDI (writable, receiving new writes from the
  running VM). It is a child of `V2`, which is a child of `V1` (the
  original base). Snapshots `S1` and `S2` branch off from intermediate
  points in this chain. The parent-child structure exists in the storage
  backend but is not visible to XAPI.
* XAPI sees the leaf and its snapshots, but not the hidden parent nodes
  `V2` and `V1`.
* To read a full, bootable disk image the backend must traverse from leaf
  to base: `V3` → `V2` → `V1`. The snapshots branch off at intermediate
  points.

The nested storage structure represents the real on-disk parent-child
relationships in the backend; the snapshot chain is XAPI's user-facing
view of the same data. The key challenge during migration is that the
backend's chain (`V3` → `V2` → `V1`) must be reproduced on the destination
for snapshots to remain functional after migration.
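
The leaf-to-base traversal can be illustrated with a toy model. This is a minimal sketch under stated assumptions: the `node` type, the per-node block map and `read_block` are hypothetical and do not correspond to any SMAPIv3 backend API; real formats such as qcow2 track allocation at cluster granularity.

```ocaml
(* Toy differencing chain: each node holds only the blocks written while
   it was the leaf, plus a link to its parent. Names are illustrative. *)
module IntMap = Map.Make (Int)

type node = {
    blocks: string IntMap.t (* block index -> data written at this level *)
  ; parent: node option (* None for the base of the chain *)
}

(* Resolve a block by walking from the given node towards the base and
   returning the first copy found. *)
let rec read_block node idx =
  match IntMap.find_opt idx node.blocks with
  | Some data ->
      Some data
  | None ->
      Option.bind node.parent (fun p -> read_block p idx)

let of_list l = IntMap.of_seq (List.to_seq l)

let v1 = {blocks= of_list [(0, "boot"); (1, "old")]; parent= None}

let v2 = {blocks= of_list [(1, "new")]; parent= Some v1}

let v3 = {blocks= of_list [(2, "log")]; parent= Some v2}

(* The same leaf delta with its parent link missing, as on a destination
   where the chain was not reproduced. *)
let v3_orphan = {v3 with parent= None}
```

In this model `read_block v3 0` is served from `V1`, while the same read on `v3_orphan` finds nothing, which is the situation the next section describes.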

### Loss of the parent-child chain

The current SMAPIv3 migration mechanism was designed for live VM migration
of a single writable VDI. When a VM with snapshots is migrated, XAPI
transfers the visible VDIs (leaf and snapshots) but the underlying
parent-child relationships between them are not preserved:

```
Current SMAPIv3 migration:

  Source SR                           Destination SR
  ---------                           --------------

       V1                                  V1'
      /  \                                    \
    V2    S1    ------copy----→      V2'      S1' (copied)
   /  \                                 \
  V3    S2      ------copy----→   V3'    S2' (copied)
  │
  └──mirror ----------→

What gets transferred:
  V3 (leaf)   ----mirrored----→   V3'
  S1, S2      ----copied------→   S1', S2'

What is LOST:
  chain V3' → V2'   (not transferred)
  chain V2' → V1'   (not transferred)
```

Although `V3'`, `V2'` and `V1'` physically exist on the destination as
separate VDIs, the chain between them is not re-created, so the
destination ends up with several disconnected trees. The mirrored leaf
`V3'` is only the delta that sat on top of `V2` on the source; without the
chain to `V2'` and `V1'` it has no base data. The snapshot copies `S1'` and
`S2'` likewise have no parent to reference.

The user-visible symptoms are unpleasant: migration appears to succeed
with no error, but when the VM is started on the destination the data
path cannot fetch the full disk (only the leaf delta is reachable) and
the VM fails to boot. Because the chain is lost rather than corrupted,
there is no straightforward rollback.

### How SMAPIv1 avoids the problem

The root cause is simple: `VDI.similar_content` exists in SMAPIv1, but it
is not implemented in [SMAPIv3](https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi-storage-script/main.ml#L1747)
and cannot be, so there is no way to rebuild the chain.

SMAPIv1 rebuilds the chain with a breadth-first search (BFS) over the
snapshot tree. At each node it reads `VDI.other_config.vhd-parent` to find
that VDI's parent on the destination. This field comes from the VHD
backend, and the whole `similar_content` / `content_id` matching depends
on it. SMAPIv3 uses qcow2 instead of VHD, so there is no `vhd-parent` to
read, and the chain cannot be rebuilt the same way. Adding a `vhd-parent`
field for SMAPIv3 would also be a poor fix: it would push a
storage-internal detail up into XAPI, which is something the storage layer
is meant to hide. The `similar_content` + `content_id` + `vhd-parent` path
therefore does not work for SMAPIv3, and retrofitting it would break the
layering. That is why we mirror the tree directly instead.

## New SMAPIv3 migration mechanism

### The approach in one paragraph

The fix is straightforward in spirit: **mirror the entire snapshot tree,
not just the leaf**. We keep the existing NBD-backed QEMU mirror, which
already works well for a single VDI, and apply it once per snapshot,
walking the snapshot tree depth-first. After mirroring each snapshot's
data into a destination VDI we ask the destination SR to take a snapshot
of that VDI, which anchors the data as a real parent on the destination
chain. The mirrored leaf at the end naturally inherits this chain as its
base, so the destination ends up with exactly the same parent-child
topology as the source. A small table of `(source snapshot VDI,
destination snapshot VDI, snapshot time)` tuples is propagated at the end
so XAPI's database and the backend's bookkeeping both reflect the new
layout.

### A worked example

We will use one running example throughout the rest of this section. A
user takes two snapshots, then **reverts** to the first one and takes two
more:

```
Action timeline               Resulting snapshot tree

T1: take snap1                        snap1
T2: take snap2                       /     \
T3: revert to snap1              snap2     snap3
T4: take snap3                               |
T5: take snap4                             snap4
T6: VM keeps running                         |
                                            live
```

Two things to notice:

* The tree has a **branch** at `snap1`. `snap2` belongs to the original
  line that was abandoned by the revert; `snap3 → snap4 → live` is the
  active line the running VM is on today.
* XAPI models this branching via each snapshot VM's `parent` / `children`
  pointers. Walking `VM.parent` upward from the live VM gives the active
  path (`snap4 → snap3 → snap1`); every snapshot VM not on that walk is
  on a reverted side branch.

The source storage backend reproduces the same shape one level down, as a
tree of VDIs with `snap1`'s VDI as the shared base for both branches. The
goal of migration is to reproduce *that* tree on the destination.
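
The `VM.parent` walk described above can be sketched on the example. This is only an illustration: the association list `parent_of` is a hypothetical stand-in for the XAPI database, and the VM names are the labels from the diagram rather than real references.

```ocaml
(* Hypothetical stand-in for XAPI's records: each VM name maps to its
   parent snapshot VM, if any. *)
let parent_of =
  [
    ("live", Some "snap4")
  ; ("snap4", Some "snap3")
  ; ("snap3", Some "snap1")
  ; ("snap2", Some "snap1")
  ; ("snap1", None)
  ]

(* Follow parent pointers upward from [vm], collecting every ancestor. *)
let rec active_path vm =
  match List.assoc_opt vm parent_of with
  | Some (Some p) ->
      p :: active_path p
  | Some None | None ->
      []

(* Snapshot VMs that the upward walk from the live VM never visits. *)
let side_branch_snapshots =
  let active = active_path "live" in
  parent_of
  |> List.map fst
  |> List.filter (fun vm -> vm <> "live" && not (List.mem vm active))
```

Here `active_path "live"` visits `snap4`, `snap3` and `snap1` in that order, and `snap2` is the only snapshot left on a side branch.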

### Walking through the migration

Migration runs in four phases. We describe each in terms of the example
above.

#### Phase 1 — Discover the snapshot tree

We need to know which destination VDIs to create and how they should be
related. Starting from the disk being migrated, we find the live VM that
owns it, read the VBD `userdevice` (the disk slot, e.g. `"0"`), then
enumerate all of the live VM's snapshot VMs. Projecting each snapshot VM
onto the same `userdevice` slot gives us the snapshot VDI for that disk
at that point in history. The `parent` / `children` pointers among the
snapshot VMs give us the tree shape, and a single walk of `VM.parent`
upward from the live VM marks which nodes are on the active path.

Each tree node is a small record:

```ocaml
type snapshot_tree_node = {
    vdi_uuid: string
  ; snapshot_time: string (* ISO8601 *)
  ; on_active_path: bool
  ; children: snapshot_tree_node list
}
```

For our example, the discovery result is:

```
snap1 (active path)
├── snap2 (inactive — orphaned by the revert)
└── snap3 (active path)
    └── snap4 (active path, immediate parent of the live VM)
```
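
As a rough sketch of how such a tree could be assembled, the code below builds `snapshot_tree_node` values from a flat list of per-snapshot records. The `flat` record, the `build_tree` helper, the `vdi-snapN` identifiers and the timestamps are all assumptions made for illustration; the real discovery code reads this information from XAPI.

```ocaml
type snapshot_tree_node = {
    vdi_uuid: string
  ; snapshot_time: string (* ISO8601 *)
  ; on_active_path: bool
  ; children: snapshot_tree_node list
}

(* Hypothetical flat input: one record per snapshot VDI, with the VDI of
   the parent snapshot (None for the root). *)
type flat = {uuid: string; time: string; parent: string option}

(* Recursively attach to [f] every record that names it as its parent. *)
let rec build ~active flats (f : flat) =
  let kids = List.filter (fun c -> c.parent = Some f.uuid) flats in
  {
    vdi_uuid= f.uuid
  ; snapshot_time= f.time
  ; on_active_path= List.mem f.uuid active
  ; children= List.map (build ~active flats) kids
  }

(* Only a list with exactly one root yields a tree. *)
let build_tree ~active flats =
  match List.filter (fun f -> f.parent = None) flats with
  | [root] ->
      Some (build ~active flats root)
  | _ ->
      None

(* The worked example; timestamps are placeholders. *)
let example =
  [
    {uuid= "vdi-snap1"; time= "20240101T00:00:01Z"; parent= None}
  ; {uuid= "vdi-snap2"; time= "20240101T00:00:02Z"; parent= Some "vdi-snap1"}
  ; {uuid= "vdi-snap3"; time= "20240101T00:00:04Z"; parent= Some "vdi-snap1"}
  ; {uuid= "vdi-snap4"; time= "20240101T00:00:05Z"; parent= Some "vdi-snap3"}
  ]

let example_tree =
  build_tree ~active:["vdi-snap1"; "vdi-snap3"; "vdi-snap4"] example
```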

#### Phase 2 — Mirror the tree

We walk the tree depth-first, carrying along a *working VDI* on the
destination. The working VDI starts as `mirror_vdi`, the destination VDI
returned by `receive_start3` for the live leaf. At each node we do the
same two things:

1. Run a one-shot QEMU mirror from the node's source snapshot VDI into
   the working VDI.
2. Ask the destination SR to take a snapshot of the working VDI. The new
   destination snapshot serves two purposes at once: it records the
   mirrored data as a real VDI on the destination chain, and it acts as
   a stable anchor that we can clone if this node turns out to be a
   branch point.

When a node has more than one child (a branch point, caused by a revert),
we **recurse into the inactive subtrees first, on a clone of the anchor**,
and into the active continuation **last, on the same working VDI**. The
clone is torn down as soon as its subtree finishes. This way the working
VDI never leaves the active path, so when the recursion bottoms out it is
still `mirror_vdi`, sitting on top of the deepest active-path snapshot and
ready for the live mirror to take over. No "splice" or rename step is
required.

The DFS body, with most plumbing elided, looks like:

```ocaml
let rec dfs_process_node ~ctx ~working_vdi node =
  (* Mirror this node into the working VDI, then snapshot it on the
     destination; the snapshot is the anchor for any branches. *)
  let dest_snapshot, relation =
    mirror_node_into ~ctx ~working_vdi node
  in
  let inactive, active =
    List.partition (fun c -> not c.on_active_path) node.children
  in
  (* Inactive subtrees first, each on a short-lived clone of the anchor. *)
  let inactive_relations =
    List.concat_map (fun child ->
      let branch_vdi, cleanup = prepare_branch_vdi ~ctx ~dest_snapshot in
      let rels =
        try dfs_process_node ~ctx ~working_vdi:branch_vdi child
        with e -> (try cleanup () with _ -> ()) ; raise e
      in
      cleanup () ; rels
    ) inactive
  in
  (* Active continuation last, on the same working VDI. *)
  let active_relations =
    List.concat_map
      (fun child -> dfs_process_node ~ctx ~working_vdi child)
      active
  in
  relation :: (inactive_relations @ active_relations)
```

Applied to our example, with `W` denoting the working VDI:

| Step | Source  | Operation                       | W after step |
| ---: | ------- | ------------------------------- | ------------ |
|    1 | `snap1` | mirror → `W`; snapshot → `D1`   | `mirror_vdi` |
|    2 | (none)  | clone `D1`, attach              | `branch_vdi` |
|    3 | `snap2` | mirror → `W`; snapshot → `D2`   | `branch_vdi` |
|    4 | (none)  | destroy `branch_vdi`            | `mirror_vdi` |
|    5 | `snap3` | mirror → `W`; snapshot → `D3`   | `mirror_vdi` |
|    6 | `snap4` | mirror → `W`; snapshot → `D4`   | `mirror_vdi` |
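
The step sequence in the table can be reproduced with a toy simulation of the traversal. This is a sketch, not the real code path: it replaces the mirror, snapshot, clone and destroy operations with log entries, numbers destination snapshots `D1`, `D2`, ... in visit order, and uses a simplified hypothetical `node` type.

```ocaml
(* Simplified tree node for the simulation (illustrative only). *)
type node = {name: string; on_active_path: bool; children: node list}

let log = ref []

let emit s = log := s :: !log

let counter = ref 0

(* Same shape as the DFS above: mirror and snapshot the node, visit the
   inactive children on a clone, then the active child on the working VDI.
   Returns (source snapshot, destination snapshot) pairs. *)
let rec dfs ~working_vdi node =
  incr counter ;
  let dest = Printf.sprintf "D%d" !counter in
  emit
    (Printf.sprintf "mirror %s -> %s; snapshot -> %s" node.name working_vdi
       dest
    ) ;
  let inactive, active =
    List.partition (fun c -> not c.on_active_path) node.children
  in
  let inactive_rels =
    List.concat_map
      (fun child ->
        emit (Printf.sprintf "clone %s -> branch_vdi" dest) ;
        let rels = dfs ~working_vdi:"branch_vdi" child in
        emit "destroy branch_vdi" ;
        rels
      )
      inactive
  in
  let active_rels = List.concat_map (dfs ~working_vdi) active in
  (node.name, dest) :: (inactive_rels @ active_rels)

let simulate tree =
  log := [] ;
  counter := 0 ;
  let rels = dfs ~working_vdi:"mirror_vdi" tree in
  (List.rev !log, rels)

(* The worked example: snap1 with an inactive snap2 and active snap3/snap4. *)
let example =
  let leaf name active = {name; on_active_path= active; children= []} in
  {
    name= "snap1"
  ; on_active_path= true
  ; children=
      [
        leaf "snap2" false
      ; {name= "snap3"; on_active_path= true; children= [leaf "snap4" true]}
      ]
  }
```

Running `simulate example` yields six log entries in the same order as the six table rows, and every mirror on the active path targets `mirror_vdi`.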

At the end the destination has the chain `(D1 → D3 → D4) ← mirror_vdi`
plus `D2` hanging off `D1`, which is the exact shape of the source tree.
We also keep a table of `(source snapshot VDI, destination snapshot VDI,
snapshot time)` tuples that Phase 4 will need:

```ocaml
type snapshot_relation = {
    src_vdi: Storage_interface.Vdi.t
  ; dest_vdi: Storage_interface.Vdi.t
  ; snapshot_time: string (* ISO8601 *)
}
```

These relations are stashed under the mirror ID in a small
mutex-protected table in `Storage_migrate_helper.State` so the
VM-migration orchestrator can pick them up once the live mirror is
complete.

#### Phase 3 — Mirror the live leaf

This is the original SMAPIv3 flow, with one small wrapper. The snapshots
in Phase 2 were taken while `mirror_vdi` was attached read-only on the
destination (snapshotting an actively-written VDI is not safe); we now
flip it back to writable and run the continuous QEMU mirror from the
running VM's leaf into `mirror_vdi`, exactly as before. Because Phase 2
left `mirror_vdi` positioned on top of the correct chain, the live mirror
lands on a complete base, and no separate splicing step is required.

#### Phase 4 — Restore metadata

Snapshot VDIs now exist on the destination, but the destination SR does
not yet know they are snapshots. We reuse the pre-existing SMAPIv1 SXM
path for this step: the orchestrator in `xapi_vm_migrate.ml` feeds the
per-snapshot mirror records into the same `update_snapshot_info` flow
that legacy migration already uses, which RPCs into
`SR.update_snapshot_info_dest` on the destination. For each entry that
RPC sets `snapshot_of`, `snapshot_time` and `is_a_snapshot` in the XAPI
database and pushes the same fields into the backend's own custom-key
store (so the storage backend's view stays in sync across an `SR.scan`).
The `content_id` check it performs is satisfied because Phase 2
propagated `content_id` per snapshot during the DFS. Finally the
orchestrator updates each snapshot VM's VBD to reference the destination
snapshot VDI.

The destination now has both the data and the topology, in both XAPI and
the storage backend, and the snapshot tree is fully usable: snapshots can
be inspected, booted, and reverted to.
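
The bookkeeping in Phase 4 can be pictured with a small in-memory model. Everything here is an assumption made for illustration: the record types, the `update_snapshot_info` helper, the equality-based `content_id` check and the choice of the migrated leaf as the `snapshot_of` target are hypothetical and are not taken from `SR.update_snapshot_info_dest`.

```ocaml
(* Toy destination-side VDI record (hypothetical, not the XAPI schema). *)
type vdi_record = {
    content_id: string
  ; mutable is_a_snapshot: bool
  ; mutable snapshot_of: string (* "" until set *)
  ; mutable snapshot_time: string (* ISO8601, "" until set *)
}

(* One entry of the Phase 2 relation table, reduced to what the toy needs. *)
type relation = {src_content_id: string; dest_vdi: string; time: string}

(* For each relation, check that the destination VDI carries the expected
   content_id, then mark it as a snapshot of [leaf]. *)
let update_snapshot_info ~db ~leaf relations =
  List.iter
    (fun r ->
      match Hashtbl.find_opt db r.dest_vdi with
      | None ->
          failwith ("unknown destination VDI " ^ r.dest_vdi)
      | Some v when v.content_id <> r.src_content_id ->
          failwith ("content_id mismatch for " ^ r.dest_vdi)
      | Some v ->
          v.is_a_snapshot <- true ;
          v.snapshot_of <- leaf ;
          v.snapshot_time <- r.time
    )
    relations

(* A fresh record as it might look right after Phase 2. *)
let fresh content_id =
  {content_id; is_a_snapshot= false; snapshot_of= ""; snapshot_time= ""}
```

In this model a relation whose `content_id` does not match the destination record is rejected, which mirrors why Phase 2 has to propagate `content_id` for every snapshot.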

### Notes

Two non-obvious choices are worth highlighting:

* **Disk identity by `userdevice`, not `snapshot_of`.** A snapshot VDI's
  `snapshot_of` points at the active VDI that existed at the time the
  snapshot was taken; a revert destroys that VDI, so pre-revert snapshots
  end up pointing at a stale ref. The VBD `userdevice` (disk slot inside
  the VM) survives reverts, so it is the reliable way to follow "the same
  disk" across the snapshot VM tree.
* **DFS, inactive-first, active-last.** Processing inactive subtrees
  before the active continuation is what lets us reuse `mirror_vdi` as
  the working VDI for the whole active path: each inactive branch comes
  and goes on a short-lived clone, and `mirror_vdi` is never displaced.
  It also bounds the number of simultaneously-attached destination VDIs
  to `O(tree depth)` rather than `O(branch count)`.
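
The `userdevice` projection from the first note can be sketched as follows. The `vbd` and `snapshot_vm` records and the VDI names are hypothetical simplifications of the XAPI objects, kept only to show the lookup by disk slot.

```ocaml
(* Simplified views of a VBD and a snapshot VM (illustrative only). *)
type vbd = {userdevice: string; vdi: string}

type snapshot_vm = {vm: string; vbds: vbd list}

(* Find the VDI plugged into the given disk slot of one snapshot VM. *)
let vdi_for_slot ~userdevice vm =
  List.find_opt (fun vbd -> vbd.userdevice = userdevice) vm.vbds
  |> Option.map (fun vbd -> vbd.vdi)

(* Project every snapshot VM onto the same slot, skipping VMs that have
   no disk there. *)
let project ~userdevice vms =
  List.filter_map
    (fun vm ->
      vdi_for_slot ~userdevice vm |> Option.map (fun vdi -> (vm.vm, vdi))
    )
    vms

let example =
  [
    {
      vm= "snap1"
    ; vbds=
        [
          {userdevice= "0"; vdi= "vdi-a-snap1"}
        ; {userdevice= "1"; vdi= "vdi-b-snap1"}
        ]
    }
  ; {vm= "snap2"; vbds= [{userdevice= "0"; vdi= "vdi-a-snap2"}]}
  ]
```

Because the lookup never consults `snapshot_of`, it keeps working for snapshots taken before a revert.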
