| 1 | +--- |
| 2 | +title: SMAPIv3 snapshot-aware storage migration |
| 3 | +layout: default |
| 4 | +design_doc: true |
| 5 | +revision: 1 |
| 6 | +status: draft |
| 7 | +--- |
| 8 | + |
| 9 | +<!--toc:start--> |
| 10 | +- [Overview](#overview) |
| 11 | +- [Why current SMAPIv3 migration fails with snapshots](#why-current-smapiv3-migration-fails-with-snapshots) |
| 12 | + - [The snapshot chain problem](#the-snapshot-chain-problem) |
| 13 | + - [Loss of the parent-child chain](#loss-of-the-parent-child-chain) |
| 14 | + - [How SMAPIv1 avoids the problem](#how-smapiv1-avoids-the-problem) |
| 15 | +- [New SMAPIv3 migration mechanism](#new-smapiv3-migration-mechanism) |
| 16 | + - [The approach in one paragraph](#the-approach-in-one-paragraph) |
| 17 | + - [A worked example](#a-worked-example) |
| 18 | + - [Walking through the migration](#walking-through-the-migration) |
| 19 | + - [Phase 1 — Discover the snapshot tree](#phase-1--discover-the-snapshot-tree) |
| 20 | + - [Phase 2 — Mirror the tree](#phase-2--mirror-the-tree) |
| 21 | + - [Phase 3 — Mirror the live leaf](#phase-3--mirror-the-live-leaf) |
| 22 | + - [Phase 4 — Restore metadata](#phase-4--restore-metadata) |
| 23 | + - [Notes](#notes) |
| 24 | +<!--toc:end--> |
| 25 | + |
| 26 | +## Overview |
| 27 | + |
| 28 | +This document is a high-level design for adding snapshot-chain awareness to |
| 29 | +SMAPIv3 storage migration (SXM). It explains why the current SMAPIv3 SXM |
| 30 | +mechanism, which mirrors only the writable leaf VDI of a running VM, cannot |
| 31 | +preserve VMs that have snapshots, and proposes a snapshot-aware migration |
| 32 | +flow that re-uses the existing NBD-backed QEMU mirror once per node of the |
| 33 | +snapshot tree. |
| 34 | + |
| 35 | +The reader is assumed to be familiar with the high-level shape of SXM |
| 36 | +already documented in [Storage migration](https://github.com/xapi-project/xen-api/blob/master/doc/content/xapi/storage/sxm/index.md). |
| 37 | +That document describes the per-VDI mirror as the unit of work; the |
| 38 | +design below extends that unit from a single VDI to a complete snapshot |
| 39 | +tree. |
| 40 | + |
| 41 | +## Why current SMAPIv3 migration fails with snapshots |
| 42 | + |
| 43 | +### The snapshot chain problem |
| 44 | + |
| 45 | +When a VM has snapshots, its storage is not a single VDI but a chain of |
| 46 | +related VDIs. Each snapshot captures the state of the disk at a particular |
| 47 | +point in time, and in current SMAPIv1 and SMAPIv3 backends these snapshots |
| 48 | +are realised as a hierarchical structure of differencing nodes from oldest |
| 49 | +(base) to newest (leaf). |
| 50 | + |
| 51 | +A typical snapshot scenario looks like: |
| 52 | + |
| 53 | +``` |
| 54 | +Source SR storage structure (nested parent-child): |
| 55 | + |
| 56 | +              V1 (original base) |
| 57 | +             /  \ |
| 58 | +           V2    S1 (snapshot, taken at T1) |
| 59 | +          /  \ |
| 60 | + (leaf) V3    S2 (snapshot, taken at T2) |
| 61 | + |
| 62 | +Snapshot chain (XAPI's view): |
| 63 | + S1 (snapshot of V3) → S2 (snapshot of V3) → V3 (leaf) |
| 64 | +``` |
| 65 | + |
| 66 | +In this structure: |
| 67 | + |
| 68 | +* `V3` is the current leaf VDI (writable, receiving new writes from the |
| 69 | + running VM). It is a child of `V2`, which is a child of `V1` (the |
| 70 | + original base). Snapshots `S1` and `S2` branch off from intermediate |
| 71 | + points in this chain. The parent-child structure exists in the storage |
| 72 | + backend but is not visible to XAPI. |
| 73 | +* XAPI sees the leaf and its snapshots, but not the hidden parent nodes |
| 74 | + `V2` and `V1`. |
| 75 | +* To read a full, bootable disk image the backend must traverse from leaf |
| 76 | + to base: `V3` → `V2` → `V1`. The snapshots branch off at intermediate |
| 77 | + points. |
| 78 | + |
| 79 | +The nested storage structure represents the real on-disk parent-child |
| 80 | +relationships in the backend; the snapshot chain is XAPI's user-facing |
| 81 | +view of the same data. The key challenge during migration is that the |
| 82 | +backend's chain (`V3` → `V2` → `V1`) must be reproduced on the destination |
| 83 | +for snapshots to remain functional after migration. |
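
To make the dependency on the chain concrete, here is a minimal OCaml sketch of a read resolving through differencing nodes. The `node` type, the block contents and `read_block` are hypothetical illustrations under a simplified model in which each node stores only the blocks written while it was the leaf; this is not SMAPIv3 or qcow2 code.

```ocaml
(* Hypothetical model of a differencing chain: each node holds only the
   blocks written while it was the leaf, plus a link to its parent. *)
type node = {
    blocks: (int * string) list (* block number -> data *)
  ; parent: node option
}

(* Resolve a block by walking leaf -> base until some node has it. *)
let rec read_block node n =
  match List.assoc_opt n node.blocks with
  | Some data ->
      Some data
  | None -> (
    match node.parent with
    | Some p ->
        read_block p n
    | None ->
        None (* not allocated anywhere in the chain *)
  )

let v1 = {blocks= [(0, "boot"); (1, "os")]; parent= None}

let v2 = {blocks= [(1, "os-patched")]; parent= Some v1}

let v3 = {blocks= [(2, "new-data")]; parent= Some v2}

(* What the destination effectively holds if only the leaf delta arrives
   and the link to its parent is not re-created. *)
let v3_orphan = {v3 with parent= None}
```

With the chain intact, `read_block v3 0` finds the boot block in `V1`; on `v3_orphan` the same read finds nothing, which is the situation the next section describes.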
| 84 | + |
| 85 | +### Loss of the parent-child chain |
| 86 | + |
| 87 | +The current SMAPIv3 migration mechanism was designed for live VM migration |
| 88 | +of a single writable VDI. When a VM with snapshots is migrated, XAPI |
| 89 | +transfers the visible VDIs (leaf and snapshots) but the underlying |
| 90 | +parent-child relationships between them are not preserved: |
| 91 | + |
| 92 | +``` |
| 93 | +Current SMAPIv3 migration: |
| 94 | + |
| 95 | +Source SR                           Destination SR |
| 96 | +---------                           -------------- |
| 97 | + |
| 98 | +        V1                                V1' |
| 99 | +       /  \                                   \ |
| 100 | +     V2    S1 ------copy----→          V2'      S1' (copied) |
| 101 | +    /  \                                   \ |
| 102 | +  V3    S2 ------copy----→          V3'      S2' (copied) |
| 103 | +  │ |
| 104 | +  └──mirror ----------→ |
| 105 | + |
| 106 | +What gets transferred: |
| 107 | +  V3 (leaf)  ----mirrored----→  V3' |
| 108 | +  S1, S2     ----copied------→  S1', S2' |
| 109 | + |
| 110 | +What is LOST: |
| 111 | + chain V3' → V2' (not transferred) |
| 112 | + chain V2' → V1' (not transferred) |
| 113 | +``` |
| 114 | + |
| 115 | +Although `V3'`, `V2'` and `V1'` physically exist on the destination as separate |
| 116 | +VDIs, the chain between them is not re-created, so the destination ends up with |
| 117 | +several disconnected trees. The mirrored leaf `V3'` is only the delta that sat on |
| 118 | +top of `V2` on the source; without the chain to `V2'` and `V1'` it has no base |
| 119 | +data. The snapshot copies `S1'` and `S2'` likewise have no parent to reference. |
| 120 | + |
| 121 | +The user-visible symptoms are unpleasant: migration appears to succeed |
| 122 | +with no error, but when the VM is started on the destination the data |
| 123 | +path cannot fetch the full disk (only the leaf delta is reachable) and |
| 124 | +the VM fails to boot. Because the chain is lost rather than corrupted, |
| 125 | +there is no straightforward rollback. |
| 126 | + |
| 127 | +### How SMAPIv1 avoids the problem |
| 128 | + |
| 129 | +SMAPIv1 rebuilds the chain with a breadth-first search (BFS) over the snapshot tree. At each node it reads `VDI.other_config.vhd-parent` to find that VDI's parent on the destination. |
| 130 | +This field comes from the VHD backend, and the whole `similar_content` / `content_id` matching depends on it. SMAPIv3 cannot follow the same route: `VDI.similar_content` is not |
| 131 | +implemented in [SMAPIv3](https://github.com/xapi-project/xen-api/blob/master/ocaml/xapi-storage-script/main.ml#L1747), and SMAPIv3 uses qcow2 instead of VHD, so there is no |
| 132 | +`vhd-parent` to read. Adding a `vhd-parent` field for SMAPIv3 would also be a poor fix: it would push a storage-internal detail up into XAPI, which the storage layer is meant to |
| 133 | +hide. The `similar_content` + `content_id` + `vhd-parent` path therefore neither works for SMAPIv3 nor respects the layering, which is why this design mirrors the tree directly instead. |
| 134 | + |
| 135 | +## New SMAPIv3 migration mechanism |
| 136 | + |
| 137 | +### The approach in one paragraph |
| 138 | + |
| 139 | +The fix is straightforward in spirit: **mirror the entire snapshot tree, |
| 140 | +not just the leaf**. We keep the existing NBD-backed QEMU mirror — it |
| 141 | +already works well for a single VDI — and apply it once per snapshot, |
| 142 | +walking the snapshot tree depth-first. After mirroring each snapshot's |
| 143 | +data into a destination VDI we ask the destination SR to take a snapshot |
| 144 | +of that VDI, which anchors the data as a real parent on the destination |
| 145 | +chain. The mirrored leaf at the end naturally inherits this chain as its |
| 146 | +base, so the destination ends up with exactly the same parent-child |
| 147 | +topology as the source. A small table of `(source snapshot VDI, |
| 148 | +destination snapshot VDI, snapshot time)` tuples is propagated at the end |
| 149 | +so XAPI's database and the backend's bookkeeping both reflect the new |
| 150 | +layout. |
| 151 | + |
| 152 | +### A worked example |
| 153 | + |
| 154 | +We will use one running example throughout the rest of this section. A |
| 155 | +user takes two snapshots, then **reverts** to the first one and takes two |
| 156 | +more: |
| 157 | + |
| 158 | +``` |
| 159 | +Action timeline                   Resulting snapshot tree |
| 160 | + |
| 161 | +  T1: take snap1                      snap1 |
| 162 | +  T2: take snap2                     /     \ |
| 163 | +  T3: revert to snap1             snap2   snap3 |
| 164 | +  T4: take snap3                            | |
| 165 | +  T5: take snap4                          snap4 |
| 166 | +  T6: VM keeps running                      | |
| 167 | +                                          live |
| 168 | +``` |
| 169 | + |
| 170 | +Two things to notice: |
| 171 | + |
| 172 | +* The tree has a **branch** at `snap1`. `snap2` belongs to the original |
| 173 | + line that was abandoned by the revert; `snap3 → snap4 → live` is the |
| 174 | + active line the running VM is on today. |
| 175 | +* XAPI models this branching via each snapshot VM's `parent` / `children` |
| 176 | + pointers. Walking `VM.parent` upward from the live VM gives the active |
| 177 | + path (`snap4 → snap3 → snap1`); every snapshot VM not on that walk is |
| 178 | + on a reverted side branch. |
| 179 | + |
| 180 | +The source storage backend reproduces the same shape one level down, as a |
| 181 | +tree of VDIs with `snap1`'s VDI as the shared base for both branches. The |
| 182 | +goal of migration is to reproduce *that* tree on the destination. |
| 183 | + |
| 184 | +### Walking through the migration |
| 185 | + |
| 186 | +Migration runs in four phases. We describe each in terms of the example |
| 187 | +above. |
| 188 | + |
| 189 | +#### Phase 1 — Discover the snapshot tree |
| 190 | + |
| 191 | +We need to know which destination VDIs to create and how they should be |
| 192 | +related. Starting from the disk being migrated, we find the live VM that |
| 193 | +owns it, read the VBD `userdevice` (the disk slot, e.g. `"0"`), then |
| 194 | +enumerate all of the live VM's snapshot VMs. Projecting each snapshot VM |
| 195 | +onto the same `userdevice` slot gives us the snapshot VDI for that disk |
| 196 | +at that point in history. The `parent` / `children` pointers among the |
| 197 | +snapshot VMs give us the tree shape, and a single walk of `VM.parent` |
| 198 | +upward from the live VM marks which nodes are on the active path. |
| 199 | + |
| 200 | +Each tree node is a small record: |
| 201 | + |
| 202 | +```ocaml |
| 203 | +type snapshot_tree_node = { |
| 204 | +    vdi_uuid: string |
| 205 | +  ; snapshot_time: string (* ISO8601 *) |
| 206 | +  ; on_active_path: bool |
| 207 | +  ; children: snapshot_tree_node list |
| 208 | +} |
| 209 | +``` |
| 210 | + |
| 211 | +For our example, the discovery result is: |
| 212 | + |
| 213 | +``` |
| 214 | +snap1 (active path) |
| 215 | +├── snap2 (inactive — orphaned by the revert) |
| 216 | +└── snap3 (active path) |
| 217 | +    └── snap4 (active path, immediate parent of the live VM) |
| 218 | +``` |
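
The discovery step can be sketched as follows, assuming the snapshot VMs have already been read into a flat list. The `snap` record, the VDI names and the helper functions `active_path` and `build` are hypothetical; in XAPI the same information would come from the database rather than from a literal list.

```ocaml
(* Hypothetical flat view of the snapshot VMs for one disk slot:
   snapshot VM, its parent snapshot VM, the VDI in the chosen
   userdevice slot, and the snapshot time. All values are made up. *)
type snap = {vm: string; parent: string option; vdi: string; time: string}

type snapshot_tree_node = {
    vdi_uuid: string
  ; snapshot_time: string
  ; on_active_path: bool
  ; children: snapshot_tree_node list
}

(* Walk parent pointers upward, collecting the active path root-first. *)
let active_path snaps start =
  let rec up acc = function
    | None ->
        acc
    | Some vm ->
        let s = List.find (fun s -> s.vm = vm) snaps in
        up (vm :: acc) s.parent
  in
  up [] start

(* Build the subtree rooted at [s]; children are ordered by time. *)
let rec build snaps active s =
  let kids =
    List.filter (fun c -> c.parent = Some s.vm) snaps
    |> List.sort (fun a b -> compare a.time b.time)
  in
  {
    vdi_uuid= s.vdi
  ; snapshot_time= s.time
  ; on_active_path= List.mem s.vm active
  ; children= List.map (build snaps active) kids
  }

let snaps =
  [
    {vm= "snap1"; parent= None; vdi= "vdi-1"; time= "T1"}
  ; {vm= "snap2"; parent= Some "snap1"; vdi= "vdi-2"; time= "T2"}
  ; {vm= "snap3"; parent= Some "snap1"; vdi= "vdi-3"; time= "T4"}
  ; {vm= "snap4"; parent= Some "snap3"; vdi= "vdi-4"; time= "T5"}
  ]

(* In the running example the live VM's parent is snap4. *)
let active = active_path snaps (Some "snap4")

let root = List.find (fun s -> s.parent = None) snaps

let tree = build snaps active root
```

The resulting `tree` has the same shape as the discovery result above: `snap2` is the only node with `on_active_path = false`.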
| 219 | + |
| 220 | +#### Phase 2 — Mirror the tree |
| 221 | + |
| 222 | +We walk the tree depth-first, carrying along a *working VDI* on the |
| 223 | +destination. The working VDI starts as `mirror_vdi`, the destination VDI |
| 224 | +returned by `receive_start3` for the live leaf. At each node we do the |
| 225 | +same two things: |
| 226 | + |
| 227 | +1. Run a one-shot QEMU mirror from the node's source snapshot VDI into |
| 228 | + the working VDI. |
| 229 | +2. Ask the destination SR to take a snapshot of the working VDI. The new |
| 230 | + destination snapshot serves two purposes at once: it records the |
| 231 | + mirrored data as a real VDI on the destination chain, and it acts as |
| 232 | + a stable anchor that we can clone if this node turns out to be a |
| 233 | + branch point. |
| 234 | + |
| 235 | +When a node has more than one child (a branch point, caused by a revert), |
| 236 | +we **recurse into the inactive subtrees first, on a clone of the anchor**, |
| 237 | +and into the active continuation **last, on the same working VDI**. The |
| 238 | +clone is torn down as soon as its subtree finishes. This way the working |
| 239 | +VDI never leaves the active path, so when the recursion bottoms out it is |
| 240 | +still `mirror_vdi`, sitting on top of the deepest active-path snapshot — |
| 241 | +ready for the live mirror to take over. No "splice" or rename step is |
| 242 | +required. |
| 243 | + |
| 244 | +The DFS body, with most plumbing elided, looks like: |
| 245 | + |
| 246 | +```ocaml |
| 247 | +let rec dfs_process_node ~ctx ~working_vdi node = |
| 248 | +  (* Mirror this node into the working VDI, then snapshot it on the destination. *) |
| 249 | +  let dest_snapshot, relation = mirror_node_into ~ctx ~working_vdi node in |
| 250 | +  let inactive, active = |
| 251 | +    List.partition (fun c -> not c.on_active_path) node.children |
| 252 | +  in |
| 253 | +  (* Reverted branches first, each on a short-lived clone of the anchor. *) |
| 254 | +  let inactive_relations = |
| 255 | +    List.concat_map (fun child -> |
| 256 | +      let branch_vdi, cleanup = prepare_branch_vdi ~ctx ~dest_snapshot in |
| 257 | +      let rels = |
| 258 | +        try dfs_process_node ~ctx ~working_vdi:branch_vdi child |
| 259 | +        with e -> (try cleanup () with _ -> ()) ; raise e |
| 260 | +      in |
| 261 | +      cleanup () ; rels |
| 262 | +    ) inactive |
| 263 | +  in |
| 264 | +  (* Active continuation last, still on the same working VDI. *) |
| 265 | +  let active_relations = |
| 266 | +    List.concat_map (fun child -> dfs_process_node ~ctx ~working_vdi child) |
| 267 | +      active |
| 268 | +  in |
| 269 | +  relation :: (inactive_relations @ active_relations) |
| 270 | +``` |
| 271 | + |
| 272 | +Applied to our example, with `W` denoting the working VDI: |
| 273 | + |
| 274 | +| Step | Source | Operation | W after step | |
| 275 | +| ---: | ------- | -------------------------------------- | ------------- | |
| 276 | +| 1 | `snap1` | mirror → `W`; snapshot → `D1` | `mirror_vdi` | |
| 277 | +| 2 | — | clone `D1`, attach | `branch_vdi` | |
| 278 | +| 3 | `snap2` | mirror → `W`; snapshot → `D2` | `branch_vdi` | |
| 279 | +| 4 | — | destroy `branch_vdi` | `mirror_vdi` | |
| 280 | +| 5 | `snap3` | mirror → `W`; snapshot → `D3` | `mirror_vdi` | |
| 281 | +| 6 | `snap4` | mirror → `W`; snapshot → `D4` | `mirror_vdi` | |
| 282 | + |
| 283 | +At the end the destination has the chain `(D1 → D3 → D4) ← mirror_vdi` |
| 284 | +plus `D2` hanging off `D1` — the exact shape of the source tree. We also |
| 285 | +keep a table of `(source snapshot VDI, destination snapshot VDI, snapshot |
| 286 | +time)` tuples that Phase 4 will need: |
| 287 | + |
| 288 | +```ocaml |
| 289 | +type snapshot_relation = { |
| 290 | +    src_vdi: Storage_interface.Vdi.t |
| 291 | +  ; dest_vdi: Storage_interface.Vdi.t |
| 292 | +  ; snapshot_time: string (* ISO8601 *) |
| 293 | +} |
| 294 | +``` |
| 295 | + |
| 296 | +These relations are stashed under the mirror ID in a small |
| 297 | +mutex-protected table in `Storage_migrate_helper.State` so the |
| 298 | +VM-migration orchestrator can pick them up once the live mirror is |
| 299 | +complete. |
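
The ordering in the step table can be checked with a small self-contained simulation. This is a sketch under simplifying assumptions: VDIs are plain strings, `mirror_node_into` only records what it would do, and the clone and destroy operations are log entries. None of these stand-ins are the real helpers, and the `ctx` plumbing and error handling are omitted.

```ocaml
type node = {name: string; on_active_path: bool; children: node list}

(* Trace of destination-side operations, most recent first. *)
let log = ref []

let emit fmt = Printf.ksprintf (fun s -> log := s :: !log) fmt

(* Stand-in for the per-node work: mirror into W, then snapshot W. *)
let mirror_node_into ~working_vdi node =
  let dest = Printf.sprintf "D(%s)" node.name in
  emit "mirror %s -> %s; snapshot -> %s" node.name working_vdi dest ;
  dest

let rec dfs ~working_vdi node =
  let dest = mirror_node_into ~working_vdi node in
  let inactive, active =
    List.partition (fun c -> not c.on_active_path) node.children
  in
  (* Inactive subtrees run on a throwaway clone of the anchor snapshot. *)
  let inactive_rels =
    List.concat_map
      (fun child ->
        let branch = "clone-of-" ^ dest in
        emit "clone %s -> %s" dest branch ;
        let rels = dfs ~working_vdi:branch child in
        emit "destroy %s" branch ;
        rels)
      inactive
  in
  (* The active child continues on the unchanged working VDI. *)
  let active_rels =
    List.concat_map (fun child -> dfs ~working_vdi child) active
  in
  (node.name, dest) :: (inactive_rels @ active_rels)

let tree =
  let leaf name on_active_path = {name; on_active_path; children= []} in
  {
    name= "snap1"
  ; on_active_path= true
  ; children=
      [
        leaf "snap2" false
      ; {name= "snap3"; on_active_path= true; children= [leaf "snap4" true]}
      ]
  }

let relations = dfs ~working_vdi:"mirror_vdi" tree

let trace = List.rev !log
```

`trace` contains six entries in the same order as the table: mirror `snap1`, clone, mirror `snap2`, destroy the clone, then mirror `snap3` and `snap4` onto `mirror_vdi`.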
| 300 | + |
| 301 | +#### Phase 3 — Mirror the live leaf |
| 302 | + |
| 303 | +This is the original SMAPIv3 flow, with one small wrapper. The snapshots |
| 304 | +in Phase 2 were taken while `mirror_vdi` was attached read-only on the |
| 305 | +destination (snapshotting an actively-written VDI is not safe); we now |
| 306 | +flip it back to writable and run the continuous QEMU mirror from the |
| 307 | +running VM's leaf into `mirror_vdi`, exactly as before. Because Phase 2 |
| 308 | +left `mirror_vdi` positioned on top of the correct chain, the live mirror |
| 309 | +lands on a complete base — no separate splicing step is required. |
| 310 | + |
| 311 | +#### Phase 4 — Restore metadata |
| 312 | + |
| 313 | +Snapshot VDIs now exist on the destination, but the destination SR does |
| 314 | +not yet know they are snapshots. We reuse the pre-existing SMAPIv1 SXM |
| 315 | +path for this step: the orchestrator in `xapi_vm_migrate.ml` feeds the |
| 316 | +per-snapshot mirror records into the same `update_snapshot_info` flow |
| 317 | +that legacy migration already uses, which RPCs into |
| 318 | +`SR.update_snapshot_info_dest` on the destination. For each entry that |
| 319 | +RPC sets `snapshot_of`, `snapshot_time` and `is_a_snapshot` in the XAPI |
| 320 | +database and pushes the same fields into the backend's own custom-key |
| 321 | +store (so the storage backend's view stays in sync across an `SR.scan`). |
| 322 | +The `content_id` check it performs is satisfied because Phase 2 |
| 323 | +propagated `content_id` per snapshot during the DFS. Finally the |
| 324 | +orchestrator updates each snapshot VM's VBD to reference the destination |
| 325 | +snapshot VDI. |
| 326 | + |
| 327 | +The destination now has both the data and the topology, in both XAPI and |
| 328 | +the storage backend, and the snapshot tree is fully usable: snapshots can |
| 329 | +be inspected, booted, and reverted to. |
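
As a rough illustration of the metadata step, the sketch below applies a list of relations to an in-memory table of VDI records. The `relation` and `vdi_meta` types and `apply_relations` are hypothetical simplifications, and the sketch assumes `snapshot_of` should point at the destination leaf; it does not model the real `SR.update_snapshot_info_dest` call or the `content_id` check.

```ocaml
(* Simplified stand-ins; the real records live in XAPI and the backend. *)
type relation = {src_vdi: string; dest_vdi: string; time: string}

type vdi_meta = {
    uuid: string
  ; is_a_snapshot: bool
  ; snapshot_of: string option
  ; snapshot_time: string option
}

(* Mark every destination VDI named in [relations] as a snapshot of
   [leaf]; VDIs not mentioned are returned unchanged. *)
let apply_relations ~leaf relations vdis =
  List.map
    (fun v ->
      match List.find_opt (fun r -> r.dest_vdi = v.uuid) relations with
      | Some r ->
          {
            v with
            is_a_snapshot= true
          ; snapshot_of= Some leaf
          ; snapshot_time= Some r.time
          }
      | None ->
          v)
    vdis

let plain uuid =
  {uuid; is_a_snapshot= false; snapshot_of= None; snapshot_time= None}

let before = [plain "mirror_vdi"; plain "D1"; plain "D2"]

let after =
  apply_relations ~leaf:"mirror_vdi"
    [
      {src_vdi= "snap1-vdi"; dest_vdi= "D1"; time= "T1"}
    ; {src_vdi= "snap2-vdi"; dest_vdi= "D2"; time= "T2"}
    ]
    before
```

After the call, `D1` and `D2` carry the snapshot fields while `mirror_vdi` is untouched.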
| 330 | + |
| 331 | +### Notes |
| 332 | + |
| 333 | +Two non-obvious choices are worth highlighting: |
| 334 | + |
| 335 | +* **Disk identity by `userdevice`, not `snapshot_of`.** A snapshot VDI's |
| 336 | + `snapshot_of` points at the active VDI that existed at the time the |
| 337 | + snapshot was taken; a revert destroys that VDI, so pre-revert snapshots |
| 338 | + end up pointing at a stale ref. The VBD `userdevice` (disk slot inside |
| 339 | + the VM) survives reverts, so it is the reliable way to follow "the same |
| 340 | + disk" across the snapshot VM tree. |
| 341 | +* **DFS, inactive-first, active-last.** Processing inactive subtrees |
| 342 | + before the active continuation is what lets us reuse `mirror_vdi` as |
| 343 | + the working VDI for the whole active path: each inactive branch comes |
| 344 | + and goes on a short-lived clone, and `mirror_vdi` is never displaced. |
| 345 | + It also bounds the number of simultaneously-attached destination VDIs |
| 346 | + to `O(tree depth)` rather than `O(branch count)`. |
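
The first choice can be sketched as a projection over VBD records. The `vbd` record and `project` function are hypothetical and only illustrate the idea of following a disk by its slot; real code would query the XAPI database.

```ocaml
(* Hypothetical, flattened view of VBDs belonging to snapshot VMs. *)
type vbd = {vm: string; userdevice: string; vdi: string}

(* For each snapshot VM, pick the VDI plugged into the given slot.
   Snapshot VMs with no VBD in that slot are skipped. *)
let project ~userdevice vbds snapshot_vms =
  List.filter_map
    (fun vm ->
      List.find_opt (fun b -> b.vm = vm && b.userdevice = userdevice) vbds
      |> Option.map (fun b -> (vm, b.vdi)))
    snapshot_vms

let vbds =
  [
    {vm= "snap1"; userdevice= "0"; vdi= "vdi-1a"}
  ; {vm= "snap1"; userdevice= "1"; vdi= "vdi-1b"}
  ; {vm= "snap2"; userdevice= "0"; vdi= "vdi-2a"}
  ]

let slot0 = project ~userdevice:"0" vbds ["snap1"; "snap2"; "snap3"]
```

Here `slot0` pairs `snap1` and `snap2` with their slot-0 VDIs without consulting `snapshot_of` at all, so a stale reference left behind by a revert does not affect the result.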