Commit 5ee4014

daverodgman authored and torvalds committed
lib/lzo: implement run-length encoding
Patch series "lib/lzo: run-length encoding support", v5.

Following on from the previous lzo-rle patchset:

https://lkml.org/lkml/2018/11/30/972

This patchset contains only the RLE patches, and should be applied on top
of the non-RLE patches ( https://lkml.org/lkml/2019/2/5/366 ).

Previously, some questions were raised around the RLE patches. I've done
some additional benchmarking to answer these questions. In short:

 - RLE offers significant additional performance (data-dependent)

 - I didn't measure any regressions that were clearly outside the noise

One concern with this patchset was around performance - specifically,
measuring RLE impact separately from Matt Sealey's patches (CTZ & fast
copy). I have done some additional benchmarking which I hope clarifies
the benefits of each part of the patchset.

Firstly, I've captured some memory via /dev/fmem from a Chromebook with
many tabs open which is starting to swap, and then split this into 4178 4k
pages. I've excluded the all-zero pages (as zram does), and also the
no-zero pages (which won't tell us anything about RLE performance). This
should give a realistic test dataset for zram. What I found was that the
data is VERY bimodal: 44% of pages in this dataset contain 5% or fewer
zeros, and 44% contain over 90% zeros (30% if you include the no-zero
pages). This supports the idea of special-casing zeros in zram.

Next, I've benchmarked four variants of lzo on these pages (on 64-bit Arm
at max frequency): baseline LZO; baseline + Matt Sealey's patches (aka
MS); baseline + RLE only; baseline + MS + RLE. Numbers are for weighted
roundtrip throughput (the weighting reflects that zram does more
compression than decompression).

https://drive.google.com/file/d/1VLtLjRVxgUNuWFOxaGPwJYhl_hMQXpHe/view?usp=sharing

Matt's patches help in all cases for Arm (and no effect on Intel), as
expected.

RLE also behaves as expected: with few zeros present, it makes no
difference; above ~75%, it gives a good improvement (50 - 300 MB/s on top
of the benefit from Matt's patches).

Best performance is seen with both MS and RLE patches.

Finally, I have benchmarked the same dataset on an x86-64 device. Here,
the MS patches make no difference (as expected); RLE helps, similarly as
on Arm. There were no definite regressions; allowing for observational
error, 0.1% (3/4178) of cases had a regression > 1 standard deviation, of
which the largest was 4.6% (1.2 standard deviations). I think this is
probably within the noise.

https://drive.google.com/file/d/1xCUVwmiGD0heEMx5gcVEmLBI4eLaageV/view?usp=sharing

One point to note is that the graphs show RLE appears to help very
slightly with no zeros present! This is because the extra code causes the
clang optimiser to change code layout in a way that happens to have a
significant benefit. Taking baseline LZO and adding a do-nothing line
like "__builtin_prefetch(out_len);" immediately before the "goto next" has
the same effect. So this is a real, but basically spurious effect - it's
small enough not to upset the overall findings.

This patch (of 3):

When using zram, we frequently encounter long runs of zero bytes. This
adds a special case which identifies runs of zeros and encodes them using
run-length encoding.

This is faster for both compression and decompression. For high-entropy
data which doesn't hit this case, impact is minimal.

Compression ratio is within a few percent in all cases.

This modifies the bitstream in a way which is backwards compatible (i.e.,
we can decompress old bitstreams, but old versions of lzo cannot
decompress new bitstreams).

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Dave Rodgman <[email protected]>
Cc: David S. Miller <[email protected]>
Cc: Greg Kroah-Hartman <[email protected]>
Cc: Herbert Xu <[email protected]>
Cc: Markus F.X.J. Oberhumer <[email protected]>
Cc: Matt Sealey <[email protected]>
Cc: Minchan Kim <[email protected]>
Cc: Nitin Gupta <[email protected]>
Cc: Richard Purdie <[email protected]>
Cc: Sergey Senozhatsky <[email protected]>
Cc: Sonny Rao <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
1 parent 761b323 commit 5ee4014

5 files changed, +181 -43 lines changed
Documentation/lzo.txt

+28 -7
@@ -78,16 +78,30 @@ Description
 is an implementation design choice independent on the algorithm or
 encoding.
 
+Versions
+
+0: Original version
+1: LZO-RLE
+
+Version 1 of LZO implements an extension to encode runs of zeros using run
+length encoding. This improves speed for data with many zeros, which is a
+common case for zram. This modifies the bitstream in a backwards compatible way
+(v1 can correctly decompress v0 compressed data, but v0 cannot read v1 data).
+
 Byte sequences
 ==============
 
 First byte encoding::
 
-      0..17   : follow regular instruction encoding, see below. It is worth
-                noting that codes 16 and 17 will represent a block copy from
-                the dictionary which is empty, and that they will always be
+      0..16   : follow regular instruction encoding, see below. It is worth
+                noting that code 16 will represent a block copy from the
+                dictionary which is empty, and that it will always be
                 invalid at this place.
 
+      17      : bitstream version. If the first byte is 17, the next byte
+                gives the bitstream version. If the first byte is not 17,
+                the bitstream version is 0.
+
       18..21  : copy 0..3 literals
                 state = (byte - 17) = 0..3  [ copy <state> literals ]
                 skip byte
@@ -140,6 +154,11 @@ Byte sequences
            state = S (copy S literals after this block)
          End of stream is reached if distance == 16384
 
+         In version 1, this instruction is also used to encode a run of zeros if
+         distance = 0xbfff, i.e. H = 1 and the D bits are all 1.
+           In this case, it is followed by a fourth byte, X.
+           run length = ((X << 3) | (0 0 0 0 0 L L L)) + 4.
+
       0 0 1 L L L L L  (32..63)
            Copy of small block within 16kB distance (preferably less than 34B)
            length = 2 + (L ?: 31 + (zero_bytes * 255) + non_zero_byte)
@@ -165,7 +184,9 @@ Authors
 =======
 
 This document was written by Willy Tarreau <[email protected]> on 2014/07/19 during an
-analysis of the decompression code available in Linux 3.16-rc5. The code is
-tricky, it is possible that this document contains mistakes or that a few
-corner cases were overlooked. In any case, please report any doubt, fix, or
-proposed updates to the author(s) so that the document can be updated.
+analysis of the decompression code available in Linux 3.16-rc5, and updated
+by Dave Rodgman <[email protected]> on 2018/10/30 to introduce run-length
+encoding. The code is tricky, it is possible that this document contains
+mistakes or that a few corner cases were overlooked. In any case, please
+report any doubt, fix, or proposed updates to the author(s) so that the
+document can be updated.
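
Reviewer note: the zero-run instruction described in the documentation hunk above can be sketched as a pair of standalone helpers. This is only an illustration of the byte layout, not the kernel code: `rle_encode` and `rle_decode` are hypothetical names, and the layout (first byte `0x18 | L`, then `0xfc 0xff` for all-ones D bits with S = 0, then X) is inferred from the documentation text and the `put_unaligned_le32()` call in the compressor.

```c
#include <assert.h>
#include <stddef.h>

#define MIN_ZERO_RUN_LENGTH 4
#define MAX_ZERO_RUN_LENGTH (2047 + MIN_ZERO_RUN_LENGTH)

/* Hypothetical helper: write the 4-byte zero-run instruction to out[0..3]. */
static void rle_encode(size_t run_length, unsigned char out[4])
{
	size_t n;

	assert(run_length >= MIN_ZERO_RUN_LENGTH);
	assert(run_length <= MAX_ZERO_RUN_LENGTH);

	n = run_length - MIN_ZERO_RUN_LENGTH;       /* 0..2047, i.e. 11 bits */
	out[0] = (unsigned char)(0x18 | (n & 0x7)); /* 0 0 0 1 H=1 L L L */
	out[1] = 0xfc;                              /* low D bits all 1, S = 0 */
	out[2] = 0xff;                              /* high D bits all 1 */
	out[3] = (unsigned char)(n >> 3);           /* X: upper 8 bits of n */
}

/* Hypothetical helper: return the run length, or 0 if this is not a
 * zero-run instruction. */
static size_t rle_decode(const unsigned char in[4])
{
	unsigned int next = in[1] | ((unsigned int)in[2] << 8);

	if ((in[0] & 0xf8) != 0x18 || (next & 0xfffc) != 0xfffc)
		return 0;
	return (((size_t)in[3] << 3) | (in[0] & 0x7)) + MIN_ZERO_RUN_LENGTH;
}
```

For a run of 100 zeros, n = 96, so L = 0 and X = 12, and the instruction bytes are `18 fc ff 0c`.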

include/linux/lzo.h

+1 -1
@@ -18,7 +18,7 @@
 #define LZO1X_1_MEM_COMPRESS	(8192 * sizeof(unsigned short))
 #define LZO1X_MEM_COMPRESS	LZO1X_1_MEM_COMPRESS
 
-#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3)
+#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3 + 2)
 
 /* This requires 'wrkmem' of size LZO1X_1_MEM_COMPRESS */
 int lzo1x_1_compress(const unsigned char *src, size_t src_len,
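
Reviewer note: the extra `+ 2` presumably accounts for the two-byte version header (`17`, `LZO_VERSION`) that the compressor now emits; the commit does not say so explicitly, so treat that reading as an assumption. A minimal sketch of sizing an output buffer with the patched macro, where `worst_case_out_len` is a hypothetical helper name:

```c
#include <assert.h>
#include <stddef.h>

/* Same expression as the patched macro in include/linux/lzo.h. */
#define lzo1x_worst_compress(x) ((x) + ((x) / 16) + 64 + 3 + 2)

/* Hypothetical helper: bytes to reserve before compressing in_len bytes. */
static size_t worst_case_out_len(size_t in_len)
{
	size_t n = lzo1x_worst_compress(in_len);

	assert(n > in_len); /* guards against wrap-around on huge inputs */
	return n;
}
```

For a 4096-byte page this gives 4096 + 256 + 64 + 3 + 2 = 4421 bytes.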

lib/lzo/lzo1x_compress.c

+89 -11
@@ -20,7 +20,7 @@
 static noinline size_t
 lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 		    unsigned char *out, size_t *out_len,
-		    size_t ti, void *wrkmem)
+		    size_t ti, void *wrkmem, signed char *state_offset)
 {
 	const unsigned char *ip;
 	unsigned char *op;
@@ -35,27 +35,85 @@ lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 	ip += ti < 4 ? 4 - ti : 0;
 
 	for (;;) {
-		const unsigned char *m_pos;
+		const unsigned char *m_pos = NULL;
 		size_t t, m_len, m_off;
 		u32 dv;
+		u32 run_length = 0;
 literal:
 		ip += 1 + ((ip - ii) >> 5);
 next:
 		if (unlikely(ip >= ip_end))
 			break;
 		dv = get_unaligned_le32(ip);
-		t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
-		m_pos = in + dict[t];
-		dict[t] = (lzo_dict_t) (ip - in);
-		if (unlikely(dv != get_unaligned_le32(m_pos)))
-			goto literal;
+
+		if (dv == 0) {
+			const unsigned char *ir = ip + 4;
+			const unsigned char *limit = ip_end
+				< (ip + MAX_ZERO_RUN_LENGTH + 1)
+				? ip_end : ip + MAX_ZERO_RUN_LENGTH + 1;
+#if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && \
+	defined(LZO_FAST_64BIT_MEMORY_ACCESS)
+			u64 dv64;
+
+			for (; (ir + 32) <= limit; ir += 32) {
+				dv64 = get_unaligned((u64 *)ir);
+				dv64 |= get_unaligned((u64 *)ir + 1);
+				dv64 |= get_unaligned((u64 *)ir + 2);
+				dv64 |= get_unaligned((u64 *)ir + 3);
+				if (dv64)
+					break;
+			}
+			for (; (ir + 8) <= limit; ir += 8) {
+				dv64 = get_unaligned((u64 *)ir);
+				if (dv64) {
+#  if defined(__LITTLE_ENDIAN)
+					ir += __builtin_ctzll(dv64) >> 3;
+#  elif defined(__BIG_ENDIAN)
+					ir += __builtin_clzll(dv64) >> 3;
+#  else
+#    error "missing endian definition"
+#  endif
+					break;
+				}
+			}
+#else
+			while ((ir < (const unsigned char *)
+					ALIGN((uintptr_t)ir, 4)) &&
+					(ir < limit) && (*ir == 0))
+				ir++;
+			for (; (ir + 4) <= limit; ir += 4) {
+				dv = *((u32 *)ir);
+				if (dv) {
+#  if defined(__LITTLE_ENDIAN)
+					ir += __builtin_ctz(dv) >> 3;
+#  elif defined(__BIG_ENDIAN)
+					ir += __builtin_clz(dv) >> 3;
+#  else
+#    error "missing endian definition"
+#  endif
+					break;
+				}
+			}
+#endif
+			while (likely(ir < limit) && unlikely(*ir == 0))
+				ir++;
+			run_length = ir - ip;
+			if (run_length > MAX_ZERO_RUN_LENGTH)
+				run_length = MAX_ZERO_RUN_LENGTH;
+		} else {
+			t = ((dv * 0x1824429d) >> (32 - D_BITS)) & D_MASK;
+			m_pos = in + dict[t];
+			dict[t] = (lzo_dict_t) (ip - in);
+			if (unlikely(dv != get_unaligned_le32(m_pos)))
+				goto literal;
+		}
 
 		ii -= ti;
 		ti = 0;
 		t = ip - ii;
 		if (t != 0) {
 			if (t <= 3) {
-				op[-2] |= t;
+				op[*state_offset] |= t;
 				COPY4(op, ii);
 				op += t;
 			} else if (t <= 16) {
@@ -88,6 +146,17 @@ lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 			}
 		}
 
+		if (unlikely(run_length)) {
+			ip += run_length;
+			run_length -= MIN_ZERO_RUN_LENGTH;
+			put_unaligned_le32((run_length << 21) | 0xfffc18
+					   | (run_length & 0x7), op);
+			op += 4;
+			run_length = 0;
+			*state_offset = -3;
+			goto finished_writing_instruction;
+		}
+
 		m_len = 4;
 		{
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS) && defined(LZO_USE_CTZ64)
@@ -170,7 +239,6 @@ lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 
 		m_off = ip - m_pos;
 		ip += m_len;
-		ii = ip;
 		if (m_len <= M2_MAX_LEN && m_off <= M2_MAX_OFFSET) {
 			m_off -= 1;
 			*op++ = (((m_len - 1) << 5) | ((m_off & 7) << 2));
@@ -207,6 +275,9 @@ lzo1x_1_do_compress(const unsigned char *in, size_t in_len,
 			*op++ = (m_off << 2);
 			*op++ = (m_off >> 6);
 		}
+		*state_offset = -2;
+finished_writing_instruction:
+		ii = ip;
 		goto next;
 	}
 	*out_len = op - out;
@@ -221,6 +292,12 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
 	unsigned char *op = out;
 	size_t l = in_len;
 	size_t t = 0;
+	signed char state_offset = -2;
+
+	// LZO v0 will never write 17 as first byte,
+	// so this is used to version the bitstream
+	*op++ = 17;
+	*op++ = LZO_VERSION;
 
 	while (l > 20) {
 		size_t ll = l <= (M4_MAX_OFFSET + 1) ? l : (M4_MAX_OFFSET + 1);
@@ -229,7 +306,8 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
 			break;
 		BUILD_BUG_ON(D_SIZE * sizeof(lzo_dict_t) > LZO1X_1_MEM_COMPRESS);
 		memset(wrkmem, 0, D_SIZE * sizeof(lzo_dict_t));
-		t = lzo1x_1_do_compress(ip, ll, op, out_len, t, wrkmem);
+		t = lzo1x_1_do_compress(ip, ll, op, out_len,
+					t, wrkmem, &state_offset);
 		ip += ll;
 		op += *out_len;
 		l -= ll;
@@ -242,7 +320,7 @@ int lzo1x_1_compress(const unsigned char *in, size_t in_len,
 	if (op == out && t <= 238) {
 		*op++ = (17 + t);
 	} else if (t <= 3) {
-		op[-2] |= t;
+		op[state_offset] |= t;
 	} else if (t <= 18) {
 		*op++ = (t - 3);
 	} else {
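
Reviewer note: the zero-run scan in the compressor finds the first non-zero byte of a 64-bit word with `__builtin_ctzll(dv64) >> 3` on little-endian hosts. The sketch below isolates that trick in userspace C. It is a simplification under stated assumptions: `load_le64` and `zero_run_length` are hypothetical names, the word is assembled explicitly in little-endian order so the result does not depend on host byte order, the 32-byte unrolled loop and the `MAX_ZERO_RUN_LENGTH` cap are omitted, and `__builtin_ctzll` requires GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: assemble 8 bytes into a little-endian 64-bit word,
 * independent of host byte order. */
static uint64_t load_le64(const unsigned char *p)
{
	uint64_t v = 0;
	int i;

	for (i = 7; i >= 0; i--)
		v = (v << 8) | p[i];
	return v;
}

/* Count the zero bytes at the start of buf[0..len), scanning 8 bytes at a
 * time and finishing bytewise. */
static size_t zero_run_length(const unsigned char *buf, size_t len)
{
	size_t i = 0;

	assert(buf != NULL || len == 0);

	for (; i + 8 <= len; i += 8) {
		uint64_t w = load_le64(buf + i);

		/* In a little-endian word the lowest set bit lies in the
		 * first non-zero byte, so ctz / 8 is that byte's index. */
		if (w)
			return i + ((size_t)__builtin_ctzll(w) >> 3);
	}
	while (i < len && buf[i] == 0)
		i++;
	return i;
}
```

On a big-endian load the same idea uses the count of leading zeros instead, which is what the `__builtin_clzll` branch in the patch does.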

lib/lzo/lzo1x_decompress_safe.c

+52 -23
@@ -46,11 +46,23 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 	const unsigned char * const ip_end = in + in_len;
 	unsigned char * const op_end = out + *out_len;
 
+	unsigned char bitstream_version;
+
 	op = out;
 	ip = in;
 
 	if (unlikely(in_len < 3))
 		goto input_overrun;
+
+	if (likely(*ip == 17)) {
+		bitstream_version = ip[1];
+		ip += 2;
+		if (unlikely(in_len < 5))
+			goto input_overrun;
+	} else {
+		bitstream_version = 0;
+	}
+
 	if (*ip > 17) {
 		t = *ip++ - 17;
 		if (t < 4) {
@@ -154,32 +166,49 @@ int lzo1x_decompress_safe(const unsigned char *in, size_t in_len,
 			m_pos -= next >> 2;
 			next &= 3;
 		} else {
-			m_pos = op;
-			m_pos -= (t & 8) << 11;
-			t = (t & 7) + (3 - 1);
-			if (unlikely(t == 2)) {
-				size_t offset;
-				const unsigned char *ip_last = ip;
+			NEED_IP(2);
+			next = get_unaligned_le16(ip);
+			if (((next & 0xfffc) == 0xfffc) &&
+			    ((t & 0xf8) == 0x18) &&
+			    likely(bitstream_version)) {
+				NEED_IP(3);
+				t &= 7;
+				t |= ip[2] << 3;
+				t += MIN_ZERO_RUN_LENGTH;
+				NEED_OP(t);
+				memset(op, 0, t);
+				op += t;
+				next &= 3;
+				ip += 3;
+				goto match_next;
+			} else {
+				m_pos = op;
+				m_pos -= (t & 8) << 11;
+				t = (t & 7) + (3 - 1);
+				if (unlikely(t == 2)) {
+					size_t offset;
+					const unsigned char *ip_last = ip;
 
-				while (unlikely(*ip == 0)) {
-					ip++;
-					NEED_IP(1);
-				}
-				offset = ip - ip_last;
-				if (unlikely(offset > MAX_255_COUNT))
-					return LZO_E_ERROR;
+					while (unlikely(*ip == 0)) {
+						ip++;
+						NEED_IP(1);
+					}
+					offset = ip - ip_last;
+					if (unlikely(offset > MAX_255_COUNT))
+						return LZO_E_ERROR;
 
-				offset = (offset << 8) - offset;
-				t += offset + 7 + *ip++;
-				NEED_IP(2);
+					offset = (offset << 8) - offset;
+					t += offset + 7 + *ip++;
+					NEED_IP(2);
+					next = get_unaligned_le16(ip);
+				}
+				ip += 2;
+				m_pos -= next >> 2;
+				next &= 3;
+				if (m_pos == op)
+					goto eof_found;
+				m_pos -= 0x4000;
 			}
-			next = get_unaligned_le16(ip);
-			ip += 2;
-			m_pos -= next >> 2;
-			next &= 3;
-			if (m_pos == op)
-				goto eof_found;
-			m_pos -= 0x4000;
 		}
 		TEST_LB(m_pos);
 #if defined(CONFIG_HAVE_EFFICIENT_UNALIGNED_ACCESS)
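
Reviewer note: the version probe at the top of the decompressor reduces to "if the first byte is 17, the second byte is the version; otherwise the stream is version 0". A minimal standalone sketch, assuming a hypothetical helper name `lzo_stream_version` that is not part of the patch (the kernel code additionally requires `in_len >= 3` before the probe and `in_len >= 5` when a header is present):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: return the bitstream version and store the number
 * of header bytes to skip (0 or 2) in *hdr_len. */
static unsigned char lzo_stream_version(const unsigned char *in,
					size_t in_len, size_t *hdr_len)
{
	assert(in != NULL && hdr_len != NULL);

	/* A v0 stream never starts with 17, so 17 marks a versioned stream. */
	if (in_len >= 2 && in[0] == 17) {
		*hdr_len = 2;
		return in[1];
	}
	*hdr_len = 0;
	return 0;
}
```

Because the RLE branch is only taken when `bitstream_version` is non-zero, a v0 stream is decoded by the original M4 path, which is what keeps old streams readable.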

lib/lzo/lzodefs.h

+11 -1
@@ -13,6 +13,12 @@
  */
 
 
+/* Version
+ * 0: original lzo version
+ * 1: lzo with support for RLE
+ */
+#define LZO_VERSION 1
+
 #define COPY4(dst, src)	\
 		put_unaligned(get_unaligned((const u32 *)(src)), (u32 *)(dst))
 #if defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
@@ -28,6 +34,7 @@
 #elif defined(CONFIG_X86_64) || defined(CONFIG_ARM64)
 #define LZO_USE_CTZ64	1
 #define LZO_USE_CTZ32	1
+#define LZO_FAST_64BIT_MEMORY_ACCESS
 #elif defined(CONFIG_X86) || defined(CONFIG_PPC)
 #define LZO_USE_CTZ32	1
 #elif defined(CONFIG_ARM) && (__LINUX_ARM_ARCH__ >= 5)
@@ -37,7 +44,7 @@
 #define M1_MAX_OFFSET	0x0400
 #define M2_MAX_OFFSET	0x0800
 #define M3_MAX_OFFSET	0x4000
-#define M4_MAX_OFFSET	0xbfff
+#define M4_MAX_OFFSET	0xbffe
 
 #define M1_MIN_LEN	2
 #define M1_MAX_LEN	2
@@ -53,6 +60,9 @@
 #define M3_MARKER	32
 #define M4_MARKER	16
 
+#define MIN_ZERO_RUN_LENGTH	4
+#define MAX_ZERO_RUN_LENGTH	(2047 + MIN_ZERO_RUN_LENGTH)
+
 #define lzo_dict_t unsigned short
 #define D_BITS		13
 #define D_SIZE		(1u << D_BITS)
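
Reviewer note: `M4_MAX_OFFSET` drops from `0xbfff` to `0xbffe` because the largest M4 distance is now reserved as the zero-run marker. The sketch below checks that arithmetic. It assumes the M4 distance is `16384 + (H << 14) + D` with a 1-bit H and a 14-bit D, which is how the decompressor hunk computes `m_pos`; `m4_distance` is a hypothetical helper name.

```c
#include <assert.h>

#define M4_MAX_OFFSET       0xbffe
#define MIN_ZERO_RUN_LENGTH 4
#define MAX_ZERO_RUN_LENGTH (2047 + MIN_ZERO_RUN_LENGTH)

/* Hypothetical helper: distance encoded by an M4 instruction, assuming
 * distance = 16384 + (H << 14) + D. */
static unsigned int m4_distance(unsigned int h, unsigned int d)
{
	assert(h <= 1 && d <= 0x3fff);
	return 16384 + (h << 14) + d;
}
```

With H = 1 and all D bits set, the distance is 16384 + 16384 + 16383 = 0xbfff, exactly one past the new `M4_MAX_OFFSET`, so a real match can no longer produce the marker. The run length field holds 11 bits (3 from L, 8 from X), hence the maximum of 2047 + 4 = 2051.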
