SLZ / StateLess Zip - v1.2.1 - 2022/10/23 - Willy Tarreau <[email protected]>
------------------------------------------------------------------------
SLZ is a fast and memory-less stream compressor which produces an output that
can be decompressed with zlib or gzip. It does not implement decompression at
all, zlib is perfectly fine for this.
The purpose is to use SLZ in situations where a zlib-compatible stream is
needed and zlib's resource usage would be too high while the compression ratio
is not critical. The typical use case is in HTTP servers and gateways which
have to compress many streams in parallel with limited CPU resources to assign
to this task, and without having to limit the compression ratio due to the
memory usage. In such an environment, the server's memory usage can easily be
divided by 10 and the CPU usage by 3.
While zlib uses 256 kB of memory per stream in addition to a few tens of bytes
for the stream descriptor itself, SLZ only stores a stream descriptor made of
28 bytes. Thus it is particularly suited to environments having to deal with
tens to hundreds of thousands of concurrent streams.
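For illustration, here is a minimal sketch compressing stdin to a gzip stream
on stdout. It assumes the entry points declared in slz.h (slz_init(),
slz_encode(), slz_finish() and the one-time table setup functions) as well as
the SLZ_FMT_GZIP constant; check the header for the exact signatures. The
output buffer sizing is a guess based on the worst-case expansion discussed
later in this file :

  /* Minimal gzip compression sketch; error handling omitted. */
  #include <stdio.h>
  #include "slz.h"

  int main(void)
  {
      struct slz_stream strm;        /* the whole per-stream state */
      static char in[65536];
      static char out[66560];        /* margin for worst-case expansion */
      size_t ilen;
      long olen;

      slz_make_crc_table();          /* one-time table initialization */
      slz_prepare_dist_table();

      slz_init(&strm, 1, SLZ_FMT_GZIP);
      while ((ilen = fread(in, 1, sizeof(in), stdin)) > 0) {
          /* last argument: non-zero means more input may follow */
          olen = slz_encode(&strm, out, in, (long)ilen, 1);
          fwrite(out, 1, (size_t)olen, stdout);
      }
      olen = slz_finish(&strm, out); /* flush and emit CRC + length */
      fwrite(out, 1, (size_t)olen, stdout);
      return 0;
  }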
The key difference between zlib and SLZ is that SLZ is stateless in that it
doesn't consider the data previously compressed as part of its dictionary. It
doesn't hurt the compression ratio when SLZ is fed large enough chunks (at
least a few kB at once), but when less than a few kB can be read at a time,
the caller has to implement input buffering to preserve the compression
ratio. For example, an HTTP gateway may defragment all
the parts from a chunked-encoded response buffer into a linear buffer that is
then fed at once to SLZ. This temporary buffer would then be static and shared
between all streams running on the same thread.
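As a sketch of that approach, the function below copies all the parts of one
response into a thread-local linear buffer before a single call into the
compressor; the fragment structure, buffer size and compress_block() callback
are all hypothetical :

  #include <stddef.h>
  #include <string.h>

  /* Hypothetical descriptor for one part of a chunked response */
  struct fragment {
      const char *ptr;
      size_t      len;
  };

  #define AGG_SIZE 16384  /* hypothetical size; must hold one response buffer */

  /* One static buffer per thread, shared by all streams running on it */
  static __thread char agg_buf[AGG_SIZE];

  /* Defragment all parts into the linear buffer, then compress them at
   * once; compress_block() stands for the call into the compressor.
   */
  static void compress_fragments(const struct fragment *frag, int count,
                                 void (*compress_block)(const char *, size_t))
  {
      size_t used = 0;

      for (int i = 0; i < count; i++) {
          if (used + frag[i].len > AGG_SIZE)
              break;                  /* flush policy left out of the sketch */
          memcpy(agg_buf + used, frag[i].ptr, frag[i].len);
          used += frag[i].len;
      }
      compress_block(agg_buf, used);
  }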
A side effect of its stateless nature is that it can emit a zlib-compatible
stream that is not vulnerable to CRIME-like attacks, provided that the
sensitive and the attacker-controlled data are processed in distinct
batches. Since no dictionary
persists between calls, there is no risk that attacker-controlled data is used
to guess sensitive data using dictionary lookups.
Another induced benefit, one which was not initially sought, is that SLZ is on
average 3 times as fast as zlib in its fastest mode. This is the result of the
simpler matching and encoding mechanisms involved which permit even more
optimisations.
Some performance tests were run on various types of data showing anything
between 227 MB/s and 753 MB/s per core on the test machine depending on the
difficulty of finding matching strings and on the amount of data emitted as
literals (hence not compressed). In the end, comparative tests were performed
using some reference datasets that are already being used for other libraries.
This performance report was produced using the Silesia Corpus to get something
comparable to what is reported by the LZ4 utility here :
https://github.com/Cyan4973/lz4
The tests were run on a single core of a Core i5-3320M at 3.3 GHz (turbo mode).

  Product            Input BW    Compression ratio
  lz4-r124           352 MB/s          2.09
  lzo-2.09           347 MB/s          2.10
  slz-1.2.1 -1 -D    293 MB/s          2.12      # deflate format
  slz-1.2.1 -3 -D    279 MB/s          2.20      # deflate format
  slz-1.2.1 -1       232 MB/s          2.12      # gzip format
  slz-1.2.1 -3       223 MB/s          2.20      # gzip format
  zlib-1.2.8 -1 -h    88 MB/s          1.64      # huffman only
  zlib-1.2.8 -1 -r    81 MB/s          1.65      # RLE only
  zlib-1.2.8 -1 -F    70 MB/s          2.32      # fixed huffman
  zlib-1.2.8 -1       70 MB/s          2.74
Only zlib and SLZ can emit a stream that a regular zlib-compatible browser can
decompress (RFC1950/1951/1952 aka zlib, deflate, gzip). Here SLZ is 3 times
faster than zlib for a less compressed payload (29% larger than zlib). Tests
run on other file types (including HTML, CSS, JS) show that the ratio of
compression speed between SLZ and zlib varies from 2.5 to 3.5. The output
compressed stream is often 24 to 32% larger for SLZ than zlib.
Since SLZ focuses a lot on performance savings gained by keeping most of the
workload in CPU cache, it scales very well when multiple cores are used in
parallel (low memory bandwidth), as shown in these tests run on a quad-core
ARM Cortex A17 (RK3288) running at 1.8 GHz with the same input data :

  #Cores   Input size   Time(s)   BW/core(MB/s)   Total BW(MB/s)
     1     211938580     2.870        73.8              73.8
     2     211938580     2.880        73.5             147.1
     3     211938580     2.885        73.4             220.3
     4     211938580     2.975        71.2             284.9
On an ARMv8 with CRC32 (88F8040) running at 2.0 GHz it's even better:

  #Cores   Input size   Time(s)   BW/core(MB/s)   Total BW(MB/s)
     1     211938580     2.016       105.1             105.1
     2     211938580     2.025       104.6             209.3
     3     211938580     2.085       101.6             304.9
     4     211938580     2.041       103.8             415.3
Since some products automatically throttle the compression level based on the
CPU usage or the available memory, on high speed networks, zlib can end up
showing a much lower average compression ratio in order to fill an uplink. With
SLZ, no extra memory usage is needed for compression, and the CPU usage is
reduced by 2.5-3.5 times, meaning that the throttling will happen at much
higher data rates. The net effect is a significant improvement on the
compression ratio compared to zlib at a given resource usage.
HAProxy running on a single core of a Core i5-3320M, with 500 concurrent users,
the default 16kB buffers, and accessing www.haproxy.org (76 kB) shows the
following measures with traffic leaving the machine via a gigabit Ethernet link
(hence the 976 Mbps of IP traffic) after being compressed with zlib and SLZ,
respectively :
At 100% CPU (no CPU usage limit) :
           BW In       BW Out     BW Saved   Ratio   memory VSZ/RSS   hits/s
    zlib    480 Mbps   188 Mbps   292 Mbps   2.55    130M / 101M         744
    slz    1700 Mbps   810 Mbps   890 Mbps   2.10    23.7M / 9.7M       2382
At 85% CPU (limited) :
           BW In       BW Out     BW Saved   Ratio   memory VSZ/RSS   hits/s
    zlib   1240 Mbps   976 Mbps   264 Mbps   1.27    130M / 100M        1738
    slz    1600 Mbps   976 Mbps   624 Mbps   1.64    23.7M / 9.7M       2210
In the CPU-limited test, once the CPU usage limit was reached, blocks were sent
uncompressed. This explains why the gigabit link was always saturated; it is
then the input bandwidth which varies according to the average compression
ratio. In all situations, SLZ resulted in bandwidth savings 3 times higher than
zlib.
The "zenc" utility, which was orginally aimed at running regression tests,
found its way into backup systems since it permits to reach network links
limits with compressed streams before saturating the CPU. This results in a
shorter total backup time, while relying on a well-known and standard output
compression format. Such use cases could benefit from a multithreaded version
that would work on different blocks and adapt the running CRC once blocks are
finished, but given that a not-so-fast x86 system easily saturates a gigabit
link using a single core, there's not much demand for this at the moment.
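For reference, zlib already provides the primitive such a multithreaded
version would rely on: crc32_combine() merges the CRC-32 of two concatenated
blocks without re-reading the data. Below is a minimal sketch of merging
per-block CRCs computed in parallel; the wrapper function itself is
hypothetical :

  #include <zlib.h>

  /* Merge per-block CRC-32 values computed independently (e.g. one block
   * per thread) into the CRC of the concatenated stream, using zlib's
   * crc32_combine(). block_crc[i] covers block_len[i] input bytes.
   */
  unsigned long combine_block_crcs(const unsigned long *block_crc,
                                   const long *block_len, int nblocks)
  {
      unsigned long crc = block_crc[0];

      for (int i = 1; i < nblocks; i++)
          crc = crc32_combine(crc, block_crc[i], block_len[i]);
      return crc;
  }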
There are 6 key points having a large impact on compression speed in any LZ-
based compressor :
- dictionary lookup : looking up a sequence of bytes in a recent window is not
necessarily fast. SLZ uses a hash-based technique similar to LZO or LZ4,
focusing on keeping track of only the most recent occurrence of a given
sequence. The hash table contains the absolute position in the stream of the
last occurrence of the hashed entry. This table is initialized to impossible
values prior to compressing so that unused entries are not even evaluated, as
it was found to be faster than not initializing them. During tests, it was
found that hashing 4 bytes to one value among 4096-65536 resulted in the best
  compression ratio, and that 8192 entries was the best tradeoff between speed
  and compression ratio. Smaller hash tables cause more collisions, hurting
  the compression ratio, and larger ones experience a significant slowdown on
  certain datasets because the hash table has more difficulty fitting
  entirely in the CPU cache (see the lookup sketch after this list).
- indexing : if sequences of less than 3 bytes are considered for a lookup,
  since we only keep the last occurrence, there's a high risk of not keeping
  the most useful sequence, the one which has large chances of matching a
  similar word. For example, in HTML a lot of occurrences of '><', '</', '/>'
  can be found, so when the occurrence from '</a>' is picked instead of the
  one from '</td>', it becomes hard to find long matches. Conversely, sequences larger
than 4 bytes will only match once in a while and will prevent shorter series
of characters from being matched. Experimentation has shown that hashing 3
bytes only provided the same compression ratio as hashing 4 bytes, with up to
10% faster input processing thanks to the reduced cost of hashing. Attempts
to index the repeated data were made, and caused a significant performance
drop (about 5%) for less than 1% compression ratio improvement, so they were
  abandoned.
- longest match : the longest match measurement is very similar to a strcmp()
  call, except that we count identical characters. In order to avoid this slow
operation most of the time, we store in the hash table a copy of the 4 bytes
that were hashed. This makes it easy to perform a comparison and to skip
those which do not match, then to only start the longest match measurement
  for parts which already match on the first 4 bytes (this is what the lookup
  sketch after this list illustrates). With a 13-bit hash of a 32-bit
  character sequence, there's theoretically 1 chance in 524288 that a
  matching hash corresponds to a matching sequence. But in practice, on
  typical text, this chance is significantly higher since not all 32 bits are
  used; it's more like 24 bits (6 bits per char), so there's one chance in
  2048 that an entry will match. That's still 2047/2048 useless calls to the
  memmatch function that are saved this way.
- encoding the output : deflate is often criticized for its bit-oriented stream
  which is expensive to manipulate as it requires bit shifts. On top of this,
  the deflate stream supports several encoding modes, the most common one
  being dynamic huffman trees, which require significant maintenance to keep
  a valid tree up to date and which SLZ doesn't use. Instead, SLZ uses
static huffman trees, and when too many binary entries are found to keep this
encoding efficient, it switches to the plain literal mode. Proceeding like
  this comes with a significant benefit : all code sequences, all distances
  and all lengths are stored pre-encoded as their respective huffman output
  bits. This avoids an encoding phase : sequences of bits are immediately
  sent to the destination buffer without any further operation, which remains
  reasonably cheap compared to the more common approaches (see the bit-output
  sketch after this list).
- checksumming : 3 checksum methods exist depending on the encoding format.
  RFC1950 (zlib format) uses an Adler-32 checksum which theoretically
  involves a modulus (divide), but which can be optimized because the modulus
  is a number very close to a power of two (see the Adler-32 sketch after
  this list). RFC1951 (raw deflate) doesn't involve any checksum.
RFC1952 (gzip) uses a CRC-32 which is somewhat expensive but can be optimized
  in various ways. Since SLZ was designed mainly around gzip, most of the
  optimization effort went into CRC-32. Some well-known methods involve
  pre-computed tables, which can be optimized for both multi-byte and
  single-byte processing. It was found that updating the CRC-32 on a
  per-character basis before looking up the hash came at almost no cost,
  probably because it makes use of pipeline stalls occurring when the hashed
  entry is not in the CPU's L1 cache. But in order to avoid multiplying the
  functions, the checksum computation was moved out of the compression
  function at a very low extra cost (less than 1%) : it can process all the
  input at once, 4 bytes at a time, and contributes to preloading the input
  data into the caches.
And on CPU architectures featuring a CRC32 instruction (e.g. ARMv8), this
instruction is used and allows the gzip format to be roughly as fast as the
deflate one.
- implementation-specific optimizations : modern CPUs can load unaligned words
from memory with minimal to no overhead. On architectures that support this
(ix86, x86_64, armv7), this is used to limit the number of operations and
memory accesses. Some CPUs have a smaller cache and the direct mapping
  between distances and huffman sequences can make it thrash a lot. Some
  experiments were made using a direct mapping only for the shortest
  distances (the most common ones), but results were not encouraging so far,
  as saving a cache miss does not completely offset the cost of the extra
  operations.
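To make the dictionary lookup and longest match points more concrete, here is
a simplified sketch of such a lookup path. The 13-bit table, the
multiplicative hash and all names are illustrative, not SLZ's actual
internals, and window/distance validity checks are omitted :

  #include <stdint.h>
  #include <string.h>

  #define HASH_BITS 13                 /* 8192 entries, as discussed above */
  #define HASH_SIZE (1 << HASH_BITS)

  /* One entry keeps the absolute position of the last occurrence plus a
   * copy of the 4 bytes that were hashed, so that most false positives are
   * rejected without touching the window at all.
   */
  struct ref {
      uint32_t pos;                    /* absolute position in the stream */
      uint32_t word;                   /* copy of the 4 hashed bytes */
  };

  static struct ref refs[HASH_SIZE];   /* init to impossible values first */

  /* Illustrative multiplicative hash folding 32 bits down to HASH_BITS */
  static inline uint32_t hash32(uint32_t word)
  {
      return (word * 2654435761U) >> (32 - HASH_BITS);
  }

  /* The "longest match": count identical bytes, like strcmp() would */
  static long match_len(const uint8_t *a, const uint8_t *b, long max)
  {
      long len = 0;

      while (len < max && a[len] == b[len])
          len++;
      return len;
  }

  /* Look up position <pos>; the match length is only measured when the
   * stored 4-byte copy already matches. The 258-byte cap and the 32kB
   * window check are omitted from this sketch.
   */
  static long try_match(const uint8_t *buf, uint32_t pos, long avail,
                        uint32_t *ref_pos)
  {
      uint32_t word;
      struct ref *r;
      long len = 0;

      memcpy(&word, buf + pos, 4);     /* unaligned 4-byte load */
      r = &refs[hash32(word)];
      *ref_pos = r->pos;

      if (r->word == word)             /* cheap filter before memmatch */
          len = 4 + match_len(buf + r->pos + 4, buf + pos + 4, avail - 4);

      r->pos  = pos;                   /* always keep the last occurrence */
      r->word = word;
      return len;
  }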
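The direct emission of pre-encoded huffman bits can be pictured with a small
bit queue like the one below; the structure and names are illustrative :

  #include <stdint.h>

  struct bitq {
      uint64_t acc;   /* pending bits, LSB first */
      int      cnt;   /* number of pending bits (always < 8 between calls) */
      uint8_t *out;   /* output write pointer */
  };

  /* Queue the "bits" low-order bits of "code" (assumed already bit-reversed
   * for LSB-first emission, as deflate requires) and flush complete bytes.
   */
  static inline void enqueue_bits(struct bitq *q, uint32_t code, int bits)
  {
      q->acc |= (uint64_t)code << q->cnt;
      q->cnt += bits;
      while (q->cnt >= 8) {
          *q->out++ = (uint8_t)q->acc;
          q->acc >>= 8;
          q->cnt  -= 8;
      }
  }

Since every literal, length and distance has its final code and bit count
stored in a table, emitting a symbol boils down to one such call, with no
tree traversal and no separate encoding pass.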
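On the checksumming side, the classic Adler-32 optimization mentioned above
exploits the fact that the modulus 65521 is close to 2^16 : the two running
sums can be accumulated for thousands of bytes before a single reduction. A
straightforward sketch :

  #include <stdint.h>
  #include <stddef.h>

  #define ADLER_BASE 65521U  /* largest prime below 2^16 */
  #define ADLER_NMAX 5552    /* max bytes before the 32-bit sums overflow */

  /* Adler-32 with deferred modulo: the reduction by 65521 is only applied
   * every ADLER_NMAX bytes instead of per byte, removing nearly all of the
   * theoretical divide cost. Start a new stream with adler = 1.
   */
  uint32_t adler32_update(uint32_t adler, const uint8_t *buf, size_t len)
  {
      uint32_t s1 = adler & 0xffff;
      uint32_t s2 = (adler >> 16) & 0xffff;

      while (len) {
          size_t n = len < ADLER_NMAX ? len : ADLER_NMAX;

          len -= n;
          while (n--) {
              s1 += *buf++;
              s2 += s1;
          }
          s1 %= ADLER_BASE;            /* one reduction per block */
          s2 %= ADLER_BASE;
      }
      return (s2 << 16) | s1;
  }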
These two factors have a significant impact on the compression ratio :
- collisions in the dictionary, caused by suboptimal hash distribution. Here
with 8k entries we have what looks like the best tradeoff between speed and
compression ratio for most workloads tested, and it can easily be changed
using a macro.
- output encoding : dynamic huffman trees would definitely shorten the output
  stream, but at an important performance cost. Even with fixed trees it is
  possible to waste a lot of space. First, input bytes 144 to 255 are
encoded on 9 bits per byte. SLZ constantly monitors the output stream to
decide when it's preferable to switch to plain literals which cost exactly
52 bits to switch back and forth and will not inflate the output stream any
further. Second, referencing a match can be longer than the data it replaces.
This is true for short sequences, but it can be true even for a long sequence
in the middle of a series of plain literals. By monitoring the output stream
  and verifying the encoding cost before switching (as sketched below), it is
  possible to avoid any
overhead beyond the base 5 bytes per 32kB plus headers for non-compressible
data.
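As an illustration of such a cost check, the sketch below compares the
fixed-huffman cost of a run of literals with the cost of a <length, distance>
reference, using RFC1951's fixed code lengths; the helper functions
themselves are hypothetical :

  #include <stdint.h>

  static int floor_log2(unsigned v)
  {
      int r = 0;

      while (v >>= 1)
          r++;
      return r;
  }

  /* Bits needed to send "len" bytes of "buf" as fixed-huffman literals :
   * bytes 0..143 cost 8 bits, bytes 144..255 cost 9 bits.
   */
  static long literal_bits(const uint8_t *buf, long len)
  {
      long bits = 0;

      while (len--)
          bits += (*buf++ < 144) ? 8 : 9;
      return bits;
  }

  /* Bits needed for one <length, distance> reference (len 3..258,
   * dist 1..32768) : a 7- or 8-bit length code plus its extra bits, then a
   * fixed 5-bit distance code plus its extra bits, per RFC1951.
   */
  static long match_bits(int len, int dist)
  {
      long bits = (len <= 114) ? 7 : 8;         /* length code */

      if (len >= 11 && len <= 257)
          bits += floor_log2(len - 3) - 2;      /* length extra bits */

      bits += 5;                                /* distance code */
      if (dist >= 5)
          bits += floor_log2(dist - 1) - 1;     /* distance extra bits */
      return bits;
  }

  /* A match is only worth emitting when it beats the literals it replaces */
  static int match_is_cheaper(const uint8_t *buf, int len, int dist)
  {
      return match_bits(len, dist) < literal_bits(buf, len);
  }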
SLZ is provided as a library with a few extra tools (such as the zenc testing
utility mentioned above). It is distributed under the X11 license, meaning
that you can do a lot of things with it, such as merge it into GPL or BSD-
licensed programs, as well as use it in proprietary software. Contributions are
welcome, should be sent as Git patches (see "git format-patch"), and will be
accepted under the same license exclusively.
Project's webpage : http://www.libslz.org/
Download sources : https://github.com/wtarreau/libslz/
Contact : Willy Tarreau <[email protected]>