Low level storage benchmark results #6575
abacabadabacaba asked this question in Ideas
I did a benchmark of an early prototype of the new low level storage. This post provides the results.
The benchmarks were done on an n2-highcpu-16 GCP instance running Debian 11 with 1500 GiB of storage. For the benchmark of the storage implementation, the storage consisted of four local NVMe SSDs assembled into a RAID-0 array with a stripe size of 1 MiB. For measuring the performance of the block devices themselves, there was also a setup with a single 1500 GiB persistent SSD in addition to the local SSD setup.
First, here are the results of the benchmark measuring raw block device performance:
This benchmark measured the latency of random reads of various sizes. I believe that random read latency is the most important metric for us, because it determines how quickly data can be provided when a smart contract requests it. The latency was measured using the fio tool with three different access methods: psync (regular synchronous system calls), uring (a faster, asynchronous I/O mechanism), and uring with direct I/O and pre-mapped buffers (as before, but optimized to avoid some memory mappings and copies).
Block sizes were powers of 2 from 1 byte to 1 GiB. The tests were done with both local and persistent SSDs, once while the drives were empty and once after they had been filled with random data.
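To illustrate what the simplest of these methods measures, here is a minimal sketch of psync-style random read timing in Rust. This is not the actual benchmark (the real measurements were made with fio); the device path, block size, and iteration count are placeholders, and unlike the direct I/O variant it goes through the page cache.

```rust
// Minimal sketch of psync-style random read latency measurement (not the actual
// benchmark; the real numbers come from fio). The device path, sizes, and
// iteration count are placeholders, and reads go through the page cache here.
use std::fs::File;
use std::os::unix::fs::FileExt;
use std::time::{Duration, Instant};

fn main() -> std::io::Result<()> {
    let dev = File::open("/dev/md0")?; // hypothetical RAID-0 device path
    let dev_size: u64 = 1500 * 1024 * 1024 * 1024; // 1500 GiB
    let block_size: u64 = 4096; // 4 KiB reads
    let iters: u32 = 10_000;

    let mut buf = vec![0u8; block_size as usize];
    let mut state: u64 = 0x9E37_79B9_7F4A_7C15; // xorshift PRNG state
    let mut total = Duration::ZERO;

    for _ in 0..iters {
        // Pick a random block-aligned offset within the device.
        state ^= state << 13;
        state ^= state >> 7;
        state ^= state << 17;
        let offset = (state % (dev_size / block_size)) * block_size;

        let start = Instant::now();
        dev.read_at(&mut buf, offset)?; // one synchronous pread(2), like fio's psync engine
        total += start.elapsed();
    }

    println!("average read latency: {:?}", total / iters);
    Ok(())
}
```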
As can be seen from the graphs, as long as the block size is below 4 KiB, the latency stays the same. This is expected, because an entire 4 KiB block needs to be read in those cases. The latency is as follows: 44 ms for an empty local SSD, 123 ms for a full local SSD, 55 ms for an empty persistent SSD, and 405 ms for a full persistent SSD. The empty SSDs are not very interesting for us, but the full ones are: for accessing single blocks, local SSDs are about 3.3 times faster.
Above 4 KiB, the latency begins to increase, as more time is needed for the data to be transferred. This can be used to compute the bandwidth: 2.73 GiB/s for a local SSD, and 0.70 GiB/s for a persistent SSD. So, in terms of bandwidth, a local SSD turns out to be 3.9 times faster.
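To make the estimate explicit: on the large-block part of the curve the extra latency is essentially transfer time, so the bandwidth can be computed from the slope, bandwidth ≈ Δ(block size) / Δ(latency), between two points in that region. At the stated bandwidths, transferring a 1 GiB block alone takes roughly 1 / 2.73 ≈ 0.37 s on the local SSD versus 1 / 0.70 ≈ 1.43 s on the persistent SSD, which dominates the fixed per-read latency at that size.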
It is also strange that the latency of an empty SSD is lower than that of a full one while the bandwidth is the same. This means that the hardware can take some shortcuts when reading from an empty drive, which lets it produce the result with a smaller delay, but doesn't let it transfer the data (which consists of zero-valued bytes) any more quickly.
Then, I did a benchmark of the storage implementations. The implementations themselves are available in the branches storage-benchmark-old and storage-benchmark-new. The benchmark was only done with local SSDs.
For both implementations, the benchmark ran 1000 iterations, each consisting of four phases: an append phase, which inserted new key-value pairs into the storage; an update phase, which replaced the values of random existing keys; a mixed phase, which randomly performed operations from the other three phases (50% reads, 25% appends, 25% updates); and a read phase, which read the values of random existing keys. In each phase, a total of 1,000,000 operations were performed, with a commit after every 1000 operations (for the old storage only, as the new storage doesn't implement commits).
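To make this structure concrete, here is a rough sketch of the benchmark loop in Rust. The Storage trait, the key and value sizes, and the use of the rand crate are placeholders of mine; the actual code lives in the storage-benchmark-old and storage-benchmark-new branches.

```rust
// Rough sketch of the benchmark driver; trait, sizes and rand usage are assumptions.
use rand::Rng;

trait Storage {
    fn append(&mut self, key: Vec<u8>, value: Vec<u8>);
    fn update(&mut self, key: &[u8], value: Vec<u8>);
    fn read(&self, key: &[u8]) -> Option<Vec<u8>>;
    fn commit(&mut self); // no-op for the new storage, which has no commits
}

const ITERATIONS: usize = 1000;
const OPS_PER_PHASE: usize = 1_000_000;
const COMMIT_EVERY: usize = 1000;

fn run(storage: &mut dyn Storage) {
    let mut rng = rand::thread_rng();
    let mut keys: Vec<[u8; 32]> = Vec::new();

    for _iteration in 0..ITERATIONS {
        for phase in ["append", "update", "mixed", "read"] {
            for op in 0..OPS_PER_PHASE {
                // In the mixed phase, pick an operation at random:
                // 25% append, 25% update, 50% read.
                let kind = match phase {
                    "append" => 0,
                    "update" => 1,
                    "read" => 2,
                    _ => match rng.gen_range(0..4) {
                        0 => 0,
                        1 => 1,
                        _ => 2,
                    },
                };
                match kind {
                    0 => {
                        // Append: insert a new key-value pair (value size is a placeholder).
                        let key: [u8; 32] = rng.gen();
                        storage.append(key.to_vec(), vec![0u8; 100]);
                        keys.push(key);
                    }
                    1 => {
                        // Update: replace the value of a random existing key.
                        let key = keys[rng.gen_range(0..keys.len())];
                        storage.update(&key, vec![0u8; 100]);
                    }
                    _ => {
                        // Read: look up a random existing key.
                        let key = keys[rng.gen_range(0..keys.len())];
                        let _ = storage.read(&key);
                    }
                }
                // Commit every 1000 operations (old storage only; no-op for the new one).
                if (op + 1) % COMMIT_EVERY == 0 {
                    storage.commit();
                }
            }
        }
    }
}
```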
After the first 20 iterations, the result looked as follows:
The dashed line marks double the single-block read latency of the raw device, which is the theoretical latency of the new storage in the absence of caching, since each of its reads requires two disk accesses.
It can be seen that the new storage was significantly slower, while the old storage was very fast, especially for the append and update operations. However, as the amount of data increased, reads from the old storage became slower. With the new storage, in contrast, all operations were uniformly slow. After 100 iterations, the result looked as follows:
Now reads from the old storage were actually taking more time than reads from the new one. After all 1000 iterations, the result looked as follows:
Unfortunately, during the 600th iteration the new storage ran out of disk space, so its benchmark was terminated at that point. But it is still visible that the latency of the new storage only increased very slightly, while the latency of the old storage grew much more, eventually becoming about twice that of the new storage. This is more clearly visible on a graph with a linear scale:
It seems that for large data sizes the old storage requires at least four disk reads in sequence per lookup on average, while the new storage always requires exactly two; this is consistent with the old storage's read latency ending up about twice that of the new one.
A curious artifact: the latency of the append and update operations for the old storage made a jump in the middle of the benchmark (specifically, starting from the 497th iteration). I don't know why this happened.
Raw data used for the graphs above
benchmark-local.txt
benchmark-pd.txt
benchmark-new.txt
benchmark-old.txt
Currently, the new storage performs two disk accesses for each read, and it already outperforms the old storage on large data sets. With some optimizations, this can be reduced to just one disk access, which would yield a further 2x improvement in read latency. The latency of the other operations is not very important, because they can be done in the background without delaying other code. In fact, this is what the old storage already does, which is why its append and update latencies are so low.
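As a rough model (my framing, not something measured directly): read latency ≈ (number of dependent disk reads per lookup) × (single-block random read latency of the drive), so reducing the new storage's dependent reads from two to one would halve its expected read latency, which is where the 2x estimate comes from.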