
Commit c990d06

Add docs on full local snapshots

Signed-off-by: Amory Hoste <[email protected]>

1 parent 7bde409

5 files changed (+69, −6 lines)

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -3,6 +3,7 @@
## [Unreleased]

### Added

- Add support for [fullLocal snapshots](docs/fulllocal_snapshots.md) mode

### Changed

configs/.wordlist.txt

Lines changed: 7 additions & 0 deletions
Added words: deterministically, devicemapper, microVM, rootfs, rsync, snapshotted, thinpool

docs/developers_guide.md

Lines changed: 3 additions & 1 deletion
@@ -108,14 +108,16 @@ We also offer self-hosted stock-Knative environments powered by KinD. To be able

* vHive supports both the baseline Firecracker snapshots and our advanced
Record-and-Prefetch (REAP) snapshots.

* vHive integrates with Kubernetes and Knative via its built-in CRI support.
Currently, only Knative Serving is supported.

* vHive supports arbitrary distributed setup of a serverless cluster.

* vHive supports arbitrary functions deployed with OCI (Docker images).

* Remote snapshot restore functionality can be integrated through the
[full local snapshots](./fulllocal_snapshots.md) support.

* vHive has robust Continuous-Integration and our team is committed to delivering
high-quality code.

docs/fulllocal_snapshots.md

Lines changed: 56 additions & 5 deletions
@@ -1,7 +1,58 @@
# vHive full local snapshots

When using Firecracker as the sandbox technology in vHive, two snapshotting modes are supported: a default mode and a
full local mode. The default snapshot mode uses an offloading-based technique that leaves the shim and other resources
running when a microVM is shut down, so that they can be reused in the future. This technique has the advantage that
the shim does not have to be recreated and that the block and network devices of the previously stopped microVM can be
reused, but it limits the number of microVMs that can be booted from a snapshot to the number of microVMs that have
been offloaded. The full local snapshot mode instead allows an arbitrary number of microVMs to be loaded from a single
snapshot. This is done by creating a new shim and the required block and network devices when loading a snapshot, and
by writing an extra patch file at snapshot creation time that captures the filesystem changes made by the microVM. To
enable the full local snapshot functionality, vHive must be run with the `-snapshots` and `-fulllocal` flags. In
addition, the full local snapshot mode can be further configured using the following flags (an example invocation
follows the list):

- `-isSparseSnaps`: store the memory file as a sparse file so that its storage size is closer to the memory actually utilized by the microVM, rather than the memory allocated to the microVM
- `-snapsStorageSize [capacityGiB]`: the amount of capacity, in GiB, that can be used to store snapshots
- `-netPoolSize [capacity]`: the number of network devices kept in a pool, which microVMs can use to keep network initialization off the cold start path
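
As an illustration, the following invocation enables full local snapshots with sparse memory files, a 100 GiB
snapshot store, and a pool of 10 pre-created network devices (the binary path and values are placeholders, not
defaults):

```bash
# Illustrative values; adjust the binary path, capacity, and pool size to your setup.
sudo ./vhive -snapshots -fulllocal -isSparseSnaps -snapsStorageSize 100 -netPoolSize 10
```
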
## Remote snapshots

Rather than only using the snapshots available locally on a node, snapshots can also be transferred between nodes to
potentially accelerate cold start times and reduce memory utilization, given that proper mechanisms are in place to
minimize the snapshot network transfer latency. This could be done by storing snapshots in a global storage solution
such as S3, or by distributing snapshots directly between compute nodes. The full local snapshot functionality in
vHive can be used as a building block for such remote snapshots. To restore a remote snapshot, the container image
used by the snapshotted microVM must be available on the local node where the snapshot will be restored. This
container image can then be combined with the filesystem changes stored in the snapshot patch file to create a device
mapper snapshot that contains the root filesystem needed by the restored microVM. After recreating the root
filesystem block device, the microVM can be created from the fetched memory file and microVM state, similarly to how
this is done for full local snapshots. A conceptual sketch of the rootfs recreation step follows.
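
The sketch below uses dm-thin messages with placeholder names (pool `fc-dev-thinpool`, thin device ids 42 and 43, an
rsync batch-format patch file); these are assumptions for illustration, not vHive's exact commands.

```bash
# Assumptions: the base image is thin device 42 in the pool, the patch file was
# produced as an rsync batch named "patch", and SECTORS holds the image device
# size in 512-byte sectors.

# Snapshot the base image thin device as a new thin device (id 43). The origin
# thin device must be inactive or suspended while this message is sent.
sudo dmsetup message /dev/mapper/fc-dev-thinpool 0 "create_snap 43 42"

# Activate the new thin device as the rootfs block device of the restored microVM.
sudo dmsetup create vm43-rootfs --table "0 ${SECTORS} thin /dev/mapper/fc-dev-thinpool 43"

# Replay the filesystem changes captured in the patch file onto the new rootfs.
sudo mount /dev/mapper/vm43-rootfs /mnt/vm43
sudo rsync --read-batch=patch /mnt/vm43
sudo umount /mnt/vm43
```
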
## Incompatibilities and limitations
### Snapshot filesystem changes capture and restoration
Currently, the filesystem changes are captured in a “patch file”, which is created by mounting both the original
container image and the microVM block device and extracting the differences between the two with rsync.
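
One way to produce such a patch file is with rsync's batch mode; the following is a sketch assuming placeholder
device names and mount points, and vHive's actual patch format may differ:

```bash
# Mount the base container image device and the microVM rootfs device read-only
# (device and mount point names are placeholders).
sudo mount -o ro /dev/mapper/image-rootfs /mnt/image
sudo mount -o ro /dev/mapper/vm-rootfs /mnt/vm

# Record the changes needed to turn the image tree into the microVM tree in the
# batch file "patch", without modifying /mnt/image itself.
sudo rsync --archive --delete --only-write-batch=patch /mnt/vm/ /mnt/image/

sudo umount /mnt/image /mnt/vm
```
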
Even though rsync applies optimisations such as comparing timestamps and file sizes to limit the number of reads, this
procedure is quite inefficient. It could be sped up by extracting the changed block offsets directly from the thinpool
metadata device and reading those blocks straight from the microVM rootfs block device. The extracted blocks could then
be written back at the correct offsets on top of the base image block device to create a root filesystem for the
microVM to be restored. Support for this alternative approach is provided through the `ForkContainerSnap` and
`CreateDeviceSnapshot` functions. However, for this approach to work across nodes for remote snapshots, support to [deterministically flatten a container image into a filesystem](https://www.youtube.com/watch?v=A-7j0QlGwFk)
would be required to ensure that the block devices of identical images pulled to different nodes are bit-identical.
In addition, further optimizations would be needed to extract filesystem changes from the thinpool metadata device
more efficiently than the current method, which relies on the devicemapper `reserve_metadata_snap` message to snapshot
the current metadata state, combined with `thin_delta` to extract the changed blocks (see the sketch below).
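
For reference, the extraction described above could look roughly as follows (pool and metadata device names are
placeholders; the exact devices depend on how the thin pool was created):

```bash
# Reserve a metadata snapshot so the pool metadata can be read consistently
# while the pool is in use.
sudo dmsetup message /dev/mapper/fc-dev-thinpool 0 reserve_metadata_snap

# List the block ranges that differ between the base image thin device (id 42)
# and the microVM rootfs thin device (id 43); the output is printed as XML.
sudo thin_delta --metadata-snap --thin1 42 --thin2 43 /dev/mapper/fc-dev-thinpool_tmeta

# Release the reserved metadata snapshot when done.
sudo dmsetup message /dev/mapper/fc-dev-thinpool 0 release_metadata_snap
```
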
### Performance limitations

The full local snapshot mode requires a new block device and a new network device, with the exact state of the
snapshotted microVM, to be created before the snapshot can be restored. The network namespace and devicemapper block
device creation turn out to be a bottleneck when many snapshots are restored concurrently. Approaches that reduce the
impact of these operations could further speed up the microVM snapshot restore latency at high load.

### UPF snapshot compatibility

The full local snapshot functionality is currently not integrated with the [Record-and-Prefetch (REAP)](papers/REAP_ASPLOS21.pdf)
accelerated snapshots and thus cannot be used in combination with the `-upf` flag.

docs/quickstart_guide.md

Lines changed: 2 additions & 0 deletions
@@ -130,6 +130,8 @@ SSD-equipped nodes are highly recommended. Full list of CloudLab nodes can be fo
> By default, the microVMs are booted; `-snapshots` enables snapshots after the 2nd invocation of each function.
>
> If `-snapshots` and `-upf` are specified, the snapshots are accelerated with the Record-and-Prefetch (REAP) technique that we described in our ASPLOS'21 paper ([extended abstract][ext-abstract], [full paper](papers/REAP_ASPLOS21.pdf)).
>
> If `-snapshots` and `-fulllocal` are specified, a single snapshot can be used to restore many microVMs ([full local snapshots](./fulllocal_snapshots.md)). Note that this mode is currently not compatible with the REAP technique.

### 3. Configure Master Node
**On the master node**, execute the following instructions below **as a non-root user with sudo rights** using **bash**:
