Skip to content

feat(block): block-level COW overlay with dirty bitmap tracking#5

Open
meAmitPatil wants to merge 9 commits into
mainfrom
feat/block-overlay-cow
Open

feat(block): block-level COW overlay with dirty bitmap tracking#5
meAmitPatil wants to merge 9 commits into
mainfrom
feat/block-overlay-cow

Conversation

@meAmitPatil
Copy link
Copy Markdown

@meAmitPatil meAmitPatil commented Mar 20, 2026

Summary

Adds a block-level copy-on-write overlay filesystem to Firecracker's virtio block device, enabling efficient snapshot/clone/reset for sandbox workloads.

  • Dirty bitmap — tracks modified 4KB blocks with a compact bit array
  • Sync overlay engine — routes reads between a shared read-only base image and a per-VM sparse overlay file using the bitmap
  • Snapshot integration — persists the dirty bitmap on snapshot save and restores it on load
  • Delta file format — compact binary format (with CRC64 integrity checks) that captures only dirty blocks for O(dirty) snapshots instead of O(disk)
  • Fast VM cloning — apply deltas to fresh overlays on restore, sharing the base image across clones
  • DISCARD support — clears bitmap bits and punches holes in the overlay when guest trims blocks
  • Hot-update rejection — prevents silent data corruption from disk hot-swap on overlay devices
  • API surfaceio_engine: "Overlay", base_path, block_delta_dir on create/load snapshot

Benchmark Results (bare metal — AMD Ryzen 9 7950X3D, NVMe)

Snapshot / Clone

Metric Sync (vanilla) Overlay (ours) Improvement
Snapshot time 225ms 223ms same (memory-dominated)
Clone disk cost 205ms (full 396MB copy) 0ms (560KB delta) no copy needed
Total clone cost 430ms 223ms ~2x faster
Restore time 7ms 8ms same
Clone disk size 396MB 560KB 700x smaller

Reset (return to clean state)

Method Time Speedup
Sync (full disk copy) 190ms baseline
Overlay (truncate + clear bitmap) 1-2ms ~100x faster

Test plan

  • 40 unit tests passing (dirty_bitmap, overlay_io, delta, persist)
  • Snapshot/clone/restore benchmark on bare metal
  • Reset benchmark: 100x faster (1-2ms vs 190ms)
  • Full CI with ./tools/devtool checkbuild --all

Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
…stence

Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
…rruption

Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
pavitrabhalla added a commit that referenced this pull request Mar 20, 2026
The implementation diverged from our stub API in four ways:

- DirtyBitmap::new takes (disk_size_bytes, block_size) not (num_blocks)
- bm.set(offset, len) not bm.set(block_idx)
- bm.clear() not bm.clear_all(); bm.dirty_count() not bm.dirty_block_count()
- OverlayFileEngine::from_files() not with_block_size(); read/write take
  GuestMemoryMmap + GuestAddress not plain byte slices
- No reset() method — reset bench now uses update_overlay() to swap in a
  fresh file, which is the actual public reset path

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
pavitrabhalla added a commit that referenced this pull request Mar 20, 2026
dirty_bitmap.rs and overlay_io.rs were TDD stubs written before the
real implementation landed in feat/block-overlay-cow (PR #5). Now that
PR #5 has the authoritative implementation, the stubs would only cause
merge conflicts. The bench files are all that belongs here.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown

@pavitrabhalla pavitrabhalla left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review: Block-level COW overlay with dirty bitmap tracking

Overall: Strong implementation. The architecture is clean with well-separated modules (dirty_bitmap, overlay_io, delta). The COW pattern, delta format with CRC64 integrity, DISCARD support, and snapshot persistence are all well thought out. 40 unit tests provide good coverage of the core modules.

CRITICAL

1. write_block_deltas silently swallows per-device errors (src/vmm/src/device_manager/mod.rs:352-371)

The method logs errors with error!() but always returns Ok(()). If a delta write fails for any device, the snapshot completes successfully but the delta file is missing/incomplete. On clone-restore, apply_delta will fail with a confusing I/O error instead of failing at snapshot time where the problem actually occurred.

if let Err(e) = block.write_delta(&delta_path) {
    error!("Failed to write block delta for {}: {:?}", block.id(), e);
    // BUG: should return Err here, or at least collect errors
}

Suggest: collect the first error and return it, or return a Vec<(String, DeltaError)> so the caller can decide.

MAJOR

2. FileEngine::file() panics for overlay (src/vmm/src/devices/virtio/block/virtio/io/mod.rs:98)

FileEngine::Overlay(_) => unimplemented!("overlay engine has two files"),

Any code path that calls file() on an overlay device will panic. This should return a Result or at minimum provide a meaningful alternative (e.g., return the overlay file). Check all callers of file() to ensure none can reach this for overlay devices.

3. from_file for Overlay returns misleading error (src/vmm/src/devices/virtio/block/virtio/io/mod.rs:70-77)

Returns SizeMismatch { base_size: 0, overlay_size: 0 } which is semantically incorrect. Add a dedicated error variant like OverlayIoError::NotConstructibleFromFile to avoid confusion during debugging.

4. DISCARD length truncation (src/vmm/src/devices/virtio/block/virtio/request.rs:395-410)

let len = u64::from(num_sectors) << SECTOR_SHIFT;
if let Err(e) = disk.file_engine.discard(offset, len as u32) {

len is u64 but discard() takes u32. For disks >2GB with a single large TRIM, len as u32 silently truncates. Either:

  • Change discard() to accept u64, or
  • Validate/cap len before the cast and split large discards into multiple calls

5. update_overlay doesn't reset bitmap (src/vmm/src/devices/virtio/block/virtio/io/overlay_io.rs:421-423)

After update_overlay(new_file), the existing bitmap may not match the new overlay's contents. Since update_disk_image already rejects overlay devices (good!), document this method as internal-only or make it pub(crate).

MINOR

6. unsafe impl Send may be unnecessary (overlay_io.rs:394)

The safety comment ("File is send and ultimately a POD") is incorrect — File is already Send because it wraps a file descriptor. Check whether OverlayFileEngine auto-derives Send without the unsafe impl. If it does, remove the manual impl.

7. Unused variable start_block (overlay_io.rs:2584)

let start_block = offset / block_size;  // computed but never read

8. No sync_all() after delta write (delta.rs:1470)

write_delta calls writer.flush() but not sync_all(). If the system crashes between flush and the snapshot metadata write, the delta file may be partially written to the page cache but not persisted. The data CRC would catch corruption on apply, but the failure mode is "delta file exists but is truncated/corrupt" rather than "delta file doesn't exist" — the latter is easier to handle.

9. read_obj silently defaults to 0 on failure (request.rs:3222-3225)

let sector: u64 = mem.read_obj(seg_addr).unwrap_or(0);
let num_sectors: u32 = mem.read_obj(GuestAddress(seg_addr.0 + 8)).unwrap_or(0);

If guest memory read fails, this silently discards offset 0 with length 0 instead of returning an error. This could mask guest memory corruption.

OBSERVATIONS

  • Snapshot version bump 9.0.0 → 9.1.0 is correct for backward-compatible additions. The #[serde(default)] on overlay_state ensures old snapshots deserialize correctly. ✅
  • DISCARD code path has no test coverage (only the overlay_io unit test covers discard indirectly through the bitmap).
  • Benchmark scripts use hardcoded paths (~/firecracker/build/...) — fine for manual use but won't work in CI.
  • The iter_dirty filter uses *bit == true instead of idiomatic **bit — cosmetic only.

Summary

The core COW overlay, dirty bitmap, and delta format are solid and well-tested. The main concern is the silent error swallowing in write_block_deltas (CRITICAL) and the DISCARD u32 truncation (MAJOR). Fix those and this is ready to merge.

meAmitPatil added a commit that referenced this pull request Mar 23, 2026
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
@meAmitPatil meAmitPatil marked this pull request as ready for review March 24, 2026 02:53
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
@meAmitPatil
Copy link
Copy Markdown
Author

Tested end-to-end on a real host (GCE us-central1, ext4 persistent disk, kernel 6.17.0). Built this branch at commit 440a303 for x86_64-unknown-linux-musl, ran the full overlay flow against a 4 GB rootfs.

Cold-boot → snapshot → restore round-trip

  • Cold-boot with io_engine: "Overlay" + base_path + path_on_host succeeded; VM booted to userspace and our guest agent came up cleanly.
  • Snapshot save with block_delta_dir produced:
    • vmstate.snap: 29 KB
    • mem.snap: 256 MB (matches configured RAM)
    • deltas/rootfs.delta: 140 KB for a 4 GB virtual disk
  • Overlay file remained sparse: 8 KB actual on disk for a 4 GB virtual disk.
  • Snapshot load with block_delta_dir on a fresh Firecracker process resumed the VM cleanly. Internal load time: 'load snapshot' API request took 8050 us (~8 ms).

LGTM — good to merge.

meAmitPatil added a commit that referenced this pull request May 5, 2026
Signed-off-by: Amit Patil <iamitpatil2001@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants