Hi, I have several questions related to snapshotting. I am running the TestBenchParallelServe test from vhive/Makefile on a single-node setup:
sudo mkdir -m777 -p $(CTRDLOGDIR) && sudo env "PATH=$(PATH)" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.out 2>$(CTRDLOGDIR)/fccd_orch_noupf_log_bench.err &
sudo env "PATH=$(PATH)" go test $(EXTRAGOARGS) -run TestBenchParallelServe -args $(WITHSNAPSHOTS) $(WITHUPF) -benchDirTest configREAP -metricsTest -funcName helloworld
./scripts/clean_fcctr.sh
If I'm correct, this target spawns parallelNum concurrent instances of the same function (helloworld in my case) with both snapshots and REAP enabled. However, the maximum parallelNum my machine supports is only 10; with anything larger, it fails with the following error:
cc@vhive-inst-01:~/vhive$ make bench
sudo mkdir -m777 -p /tmp/ctrd-logs && sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" /usr/local/bin/firecracker-containerd --config /etc/firecracker-containerd/config.toml 1>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.out 2>/tmp/ctrd-logs/fccd_orch_noupf_log_bench.err &
sudo env "PATH=/home/cc/.vscode-server/bin/3b889b090b5ad5793f524b5d1d39fda662b96a2a/bin/remote-cli:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/usr/local/go/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin:/home/cc/vhive/istio-1.12.5/bin" go test -v -race -cover -run TestBenchParallelServe -args -snapshotsTest -upfTest -benchDirTest configREAP -metricsTest -funcName helloworld
time="2022-07-27T15:56:12.352968106-04:00" level=info msg="Orchestrator snapshots enabled: true"
time="2022-07-27T15:56:12.353343756-04:00" level=info msg="Orchestrator UPF enabled: true"
time="2022-07-27T15:56:12.353431463-04:00" level=info msg="Orchestrator lazy serving mode enabled: false"
time="2022-07-27T15:56:12.353527886-04:00" level=info msg="Orchestrator UPF metrics enabled: true"
time="2022-07-27T15:56:12.353589272-04:00" level=info msg="Drop cache: true"
time="2022-07-27T15:56:12.353640659-04:00" level=info msg="Bench dir: configREAP"
time="2022-07-27T15:56:12.353695564-04:00" level=info msg="Registering bridges for tap manager"
time="2022-07-27T15:56:12.355698518-04:00" level=info msg="Creating containerd client"
time="2022-07-27T15:56:12.358589476-04:00" level=info msg="Created containerd client"
time="2022-07-27T15:56:12.358791972-04:00" level=info msg="Creating firecracker client"
time="2022-07-27T15:56:12.359535647-04:00" level=info msg="Created firecracker client"
=== RUN TestBenchParallelServe
time="2022-07-27T15:56:12.370809757-04:00" level=info msg="New function added" fID=plr-fnc image="ghcr.io/ease-lab/helloworld:var_workload" isPinned=true servedTh=0
... (omitted some logs here)
time="2022-07-27T15:56:50.439361536-04:00" level=info msg="Creating snapshot for 1, vmID is 1-0"
time="2022-07-27T15:56:50.441571578-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=1-0
time="2022-07-27T15:56:50.666040404-04:00" level=info msg="Creating snapshot for 0, vmID is 0-0"
time="2022-07-27T15:56:50.667999825-04:00" level=info msg="Orchestrator received CreateSnapshot" vmID=0-0
time="2022-07-27T15:56:55.440588667-04:00" level=error msg="failed to create snapshot of the VM" error="rpc error: code = DeadlineExceeded desc = context deadline exceeded" vmID=1-0
time="2022-07-27T15:56:55.440898692-04:00" level=panic msg="rpc error: code = DeadlineExceeded desc = context deadline exceeded"
panic: (*logrus.Entry) 0xc0002390a0
goroutine 186 [running]:
github.com/sirupsen/logrus.(*Entry).log(0xc000238bd0, 0x0, {0xc0002ae870, 0x43})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:259 +0x95b
github.com/sirupsen/logrus.(*Entry).Log(0xc000238bd0, 0x0, {0xc000d05630, 0x1, 0x1})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/entry.go:285 +0x8c
github.com/sirupsen/logrus.(*Logger).Log(0xc00021fdc0, 0x0, {0xc000d05630, 0x1, 0x1})
/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:198 +0x85
github.com/sirupsen/logrus.(*Logger).Panic(...)
/root/go/pkg/mod/github.com/sirupsen/[email protected]/logger.go:247
github.com/sirupsen/logrus.Panic(...)
/root/go/pkg/mod/github.com/sirupsen/[email protected]/exported.go:129
github.com/ease-lab/vhive.(*Function).CreateInstanceSnapshot(0xc00041c370)
/home/cc/vhive/functions.go:463 +0x51b
github.com/ease-lab/vhive.(*Function).Serve.func2()
/home/cc/vhive/functions.go:308 +0x46
sync.(*Once).doSlow(0xc0001c26d4, 0xc000d05ae8)
/usr/local/go/src/sync/once.go:68 +0x102
sync.(*Once).Do(0xc0001c26d4, 0xc000536750?)
/usr/local/go/src/sync/once.go:59 +0x47
github.com/ease-lab/vhive.(*Function).Serve(0xc00041c370, {0x1ca7c15?, 0x1?}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
/home/cc/vhive/functions.go:306 +0xeee
github.com/ease-lab/vhive.(*FuncPool).Serve(0xc00021fdc0?, {0x1e9d988, 0xc0001c2008}, {0x1ca7c15, 0x1}, {0x1cad79c, 0x28}, {0x1c81435, 0x6})
/home/cc/vhive/functions.go:121 +0xea
github.com/ease-lab/vhive.createSnapshots.func1({0x1ca7c15, 0x1})
/home/cc/vhive/bench_test.go:297 +0x18a
created by github.com/ease-lab/vhive.createSnapshots
/home/cc/vhive/bench_test.go:292 +0xb6
exit status 2
FAIL github.com/ease-lab/vhive 43.155s
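For reference, here is my mental model of what the benchmark does (a simplified sketch of my understanding, not the actual bench_test.go code; the helper signature is illustrative):

```go
package vhive

import (
	"context"
	"strconv"
	"sync"

	log "github.com/sirupsen/logrus"
)

// createSnapshotsSketch is how I read createSnapshots() in bench_test.go:
// parallelNum goroutines each call Serve() once, and Serve() (with
// -snapshotsTest) eventually reaches CreateInstanceSnapshot(), which issues
// the CreateSnapshot gRPC request that times out above.
func createSnapshotsSketch(funcPool *FuncPool, parallelNum int, imageName string) {
	var wg sync.WaitGroup
	for i := 0; i < parallelNum; i++ {
		wg.Add(1)
		go func(fID string) {
			defer wg.Done()
			if _, err := funcPool.Serve(context.Background(), fID, imageName, "world"); err != nil {
				log.Panic(err)
			}
		}(strconv.Itoa(i))
	}
	wg.Wait() // all parallelNum snapshot creations run concurrently
}
```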
It fails in createSnapshots() in bench_test.go, where the snapshots are saved in parallel; the error is thrown in createSnapshot while it is sending the createSnapshot gRPC request.

I'm using a single node on Chameleon Cloud with 128GB RAM and 48 cores of Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz. I doubt it's a hardware capacity issue, because htop shows RAM usage stays below 4GB for the entire test duration. However, I see in your ASPLOS paper: "We use the helloworld function and consider up to 64 concurrent independent function arrivals". Are you testing on a cluster where those functions run on different machines? Or do you think it's constrained by concurrent SSD I/O bandwidth? df -h shows:
cc@vhive-inst-01:~$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 63G 0 63G 0% /dev
tmpfs 13G 7.3M 13G 1% /run
/dev/sda1 230G 73G 148G 33% /
tmpfs 63G 0 63G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
tmpfs 63G 0 63G 0% /sys/fs/cgroup
/dev/loop0 56M 56M 0 100% /snap/core18/1932
/dev/loop1 68M 68M 0 100% /snap/lxd/18150
/dev/loop3 56M 56M 0 100% /snap/core18/2538
/dev/loop4 47M 47M 0 100% /snap/snapd/16292
/dev/loop5 62M 62M 0 100% /snap/core20/1581
/dev/loop6 68M 68M 0 100% /snap/lxd/22753
shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/56a07329fbcd10e05e218702b95ed3b5c42d75d55321af0bfdc884db4711cbbf/shm
shm 64M 0 64M 0% /run/containerd/io.containerd.grpc.v1.cri/sandboxes/64fd4707bc99b77b6da37ce5cd6f188b5bb001d3e1bbcf9ba35b4f6c734b7d66/shm
... (tons of shms)
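One observation: the failure comes almost exactly 5 seconds after "Orchestrator received CreateSnapshot" (15:56:50.44 vs. 15:56:55.44 in the log above), which makes me suspect a fixed per-call deadline on the snapshot RPC rather than memory pressure. Hypothetically, something like this (names are illustrative, not vHive's actual code):

```go
package main

import (
	"context"
	"time"

	log "github.com/sirupsen/logrus"
)

// snapshotWithDeadline illustrates my hypothesis: if the CreateSnapshot RPC
// is wrapped in a fixed 5 s context deadline, then once enough snapshots
// compete for disk bandwidth the call returns DeadlineExceeded, matching
// the error above. createSnapshotRPC stands in for the real gRPC call.
func snapshotWithDeadline(vmID string, createSnapshotRPC func(context.Context, string) error) error {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	if err := createSnapshotRPC(ctx, vmID); err != nil {
		log.WithError(err).Error("failed to create snapshot of the VM")
		return err
	}
	return nil
}
```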
My second question: it seems you store the snapshot/working-set (ws) files locally, under /fccd/snapshots/. What if the next invocation of the same function lands on another machine, where those snapshots don't exist? In that case, S3 or a distributed storage solution should be considered, right? Or if you have better ideas, I'd love to hear them.
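To make that idea concrete, the node that creates a snapshot could push its files to a bucket so any other node can pull them before restoring. A hypothetical sketch with the AWS SDK for Go (bucket name and key layout are made up):

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3/s3manager"
)

// uploadSnapshot pushes one local snapshot file (e.g. from /fccd/snapshots/)
// to S3 so another node can download it before restoring the VM. This is
// only an illustration of the question above, not vHive code.
func uploadSnapshot(bucket, fID, localPath string) error {
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")})
	if err != nil {
		return err
	}
	f, err := os.Open(localPath)
	if err != nil {
		return err
	}
	defer f.Close()

	uploader := s3manager.NewUploader(sess)
	key := fmt.Sprintf("snapshots/%s/%s", fID, filepath.Base(localPath))
	_, err = uploader.Upload(&s3manager.UploadInput{
		Bucket: aws.String(bucket),
		Key:    aws.String(key),
		Body:   f,
	})
	return err
}
```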
Also, lsblk shows tons of entries like the following, which I can't identify. Are they created by Firecracker or by the vHive CRI?
loop0 7:0 0 55.4M 1 loop /snap/core18/1932
loop1 7:1 0 67.8M 1 loop /snap/lxd/18150
loop2 7:2 0 100G 0 loop
loop3 7:3 0 55.6M 1 loop /snap/core18/2538
loop4 7:4 0 47M 1 loop /snap/snapd/16292
loop5 7:5 0 62M 1 loop /snap/core20/1581
loop6 7:6 0 67.8M 1 loop /snap/lxd/22753
loop7 7:7 0 2G 0 loop
loop8 7:8 0 100G 0 loop
loop9 7:9 0 2G 0 loop
loop10 7:10 0 100G 0 loop
loop11 7:11 0 2G 0 loop
loop12 7:12 0 100G 0 loop
loop13 7:13 0 2G 0 loop
loop14 7:14 0 100G 0 loop
loop15 7:15 0 2G 0 loop
...
I appreciate your time reviewing and answering these questions. Thank you!