Commit 32db4c5

extmod/moddeflate: Change default window size.

The primary purpose of this commit is to make decompress default to
wbits=15 when the format is gzip (or auto format with gzip detected). The
idea is that someone decompressing a gzip stream should be able to use the
default `deflate.DeflateIO(f)` and it will "just work" for any input stream,
even though it uses a lot of memory.

This is done by making uzlib report gzip files as having wbits set to 15 in
their header (where it previously only set the wbits out parameter for zlib
files), and then fixing up the logic in `deflateio_init_read`.

Updates the documentation to match.

This work was funded through GitHub Sponsors.

Signed-off-by: Jim Mussared <[email protected]>
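
As a rough illustration of the behaviour the commit message describes, the sketch below decompresses a gzip payload while leaving *wbits* at its default. On MicroPython it goes through `deflate.DeflateIO` with `deflate.GZIP`. The helper name `gunzip` and the fallback to CPython's `gzip.decompress` are assumptions added here so the snippet can also be tried on a desktop interpreter; neither is part of the commit.

```python
import io

try:
    import deflate  # MicroPython's deflate module
    HAVE_DEFLATEIO = hasattr(deflate, "DeflateIO")
except ImportError:
    HAVE_DEFLATEIO = False


def gunzip(data):
    """Decompress a complete in-memory gzip payload (hypothetical helper)."""
    if HAVE_DEFLATEIO:
        # wbits is left at its default of 0, so with this commit applied a
        # 32 KiB window is allocated for gzip input.
        with deflate.DeflateIO(io.BytesIO(data), deflate.GZIP) as d:
            return d.read()
    # Assumption: desktop fallback so the sketch runs outside MicroPython.
    import gzip
    return gzip.decompress(data)
```

On a board with little free RAM, the 32 KiB allocation may still fail; passing an explicit, smaller *wbits* remains possible when the stream is known not to need the full window.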
1 parent 81c19d9 commit 32db4c5

3 files changed (+74, -54 lines)

docs/library/deflate.rst

+49 -44
@@ -41,9 +41,15 @@ Classes
 to 1024 bytes. Valid values are ``5`` to ``15`` inclusive (corresponding to
 window sizes of 32 to 32k bytes).
 
-If *wbits* is set to ``0`` (the default), then a window size of 256 bytes
-will be used (corresponding to *wbits* set to ``8``), except when
-:ref:`decompressing a zlib stream <deflate_wbits_zlib>`.
+If *wbits* is set to ``0`` (the default), then for compression a window size
+of 256 bytes will be used (as if *wbits* was set to 8). For decompression, it
+depends on the format:
+
+* ``RAW`` will use 256 bytes (corresponding to *wbits* set to 8).
+* ``ZLIB`` (or ``AUTO`` with zlib detected) will use the value from the zlib
+  header.
+* ``GZIP`` (or ``AUTO`` with gzip detected) will use 32 kilobytes
+  (corresponding to *wbits* set to 15).
 
 See the :ref:`window size <deflate_wbits>` notes below for more information
 about the window size, zlib, and gzip streams.
@@ -134,44 +140,43 @@ Deflate window size
 -------------------
 
 The window size limits how far back in the stream the (de)compressor can
-reference. Increasing the window size will improve compression, but will
-require more memory.
-
-However, just because a given window size is used for compression, this does not
-mean that the stream will require the same size window for decompression, as
-the stream may not reference data as far back as the window allows (for example,
-if the length of the input is smaller than the window size).
-
-If the decompressor uses a smaller window size than necessary for the input data
-stream, it will fail mid-way through decompression with :exc:`OSError`.
-
-.. _deflate_wbits_zlib:
-
-The zlib format includes a header which specifies the window size used to
-compress the data (which due to the above, may be larger than the size required
-for the decompressor).
-
-If this header value is lower than the specified *wbits* value, then the header
-value will be used instead in order to reduce the memory allocation size. If
-the *wbits* parameter is zero (the default), then the header value will only be
-used if it is less than the maximum value of ``15`` (which is default value
-used by most compressors [#f1]_).
-
-In other words, if the source zlib stream has been compressed with a custom window
-size (i.e. less than ``15``), then using the default *wbits* parameter of zero
-will decompress any such stream.
-
-The gzip file format does not include the window size in the header.
-Additionally, most compressor libraries (including CPython's implementation
-of :class:`gzip.GzipFile`) will default to the maximum possible window size.
-This makes it difficult to decompress most gzip streams on MicroPython unless
-your board has a lot of free RAM.
-
-If you control the source of the compressed data, then prefer to use the zlib
-format, with a window size that is suitable for your target device.
-
-.. rubric:: Footnotes
-
-.. [#f1] The assumption here is that if the header value is the default used by
-   most compressors, then nothing is known about the likely required window
-   size and we should ignore it.
+reference. Increasing the window size will improve compression, but will require
+more memory and make the compressor slower.
+
+If an input stream was compressed a given window size, then `DeflateIO`
+using a smaller window size will fail mid-way during decompression with
+:exc:`OSError`, but only if a back-reference actually refers back further
+than the decompressor's window size. This means it may be possible to decompress
+with a smaller window size. For example, this would trivially be the case if the
+original uncompressed data is shorter than the window size.
+
+Decompression
+~~~~~~~~~~~~~
+
+The zlib format includes a header which specifies the window size that was used
+to compress the data. This indicates the maximum window size required to
+decompress this stream. If this header value is less than the specified *wbits*
+value (or if *wbits* is unset), then the header value will be used.
+
+The gzip format does not include the window size in the header, and assumes that
+all gzip compressors (e.g. the ``gzip`` utility, or CPython's implementation of
+:class:`gzip.GzipFile`) use the maximum window size of 32kiB. For this reason,
+if the *wbits* parameter is not set, the decompressor will use a 32 kiB window
+size (corresponding to *wbits* set to 15). This means that to be able to
+decompress an arbitrary gzip stream, you must have at least this much RAM
+available. If you control the source data, consider instead using the zlib
+format with a smaller window size.
+
+The raw format has no header and therefore does not include any information
+about the window size. If *wbits* is not set, then it will default to a window
+size of 256 bytes, which may not be large enough for a given stream. Therefore
+it is recommended that you should always explicitly set *wbits* if using the raw
+format.
+
+Compression
+~~~~~~~~~~~
+
+For compression, MicroPython will default to a window size of 256 bytes for all
+formats. This provides a reasonable amount of compression with minimal memory
+usage and fast compression time, and will generate output that will work with
+any decompressor.
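
The back-reference rule in the updated documentation can be made concrete with a toy model. The sketch below is not DEFLATE and not MicroPython's decompressor: it assumes a hypothetical token stream of literal byte values and `(length, distance)` pairs, and a hypothetical function `toy_inflate` that refuses any back-reference reaching further back than `1 << wbits` bytes.

```python
def toy_inflate(tokens, wbits):
    """Decode literal bytes and (length, distance) back-references.

    Toy model only: raises OSError when a back-reference reaches further
    back than the (1 << wbits)-byte window, as the docs describe.
    """
    window_len = 1 << wbits
    out = bytearray()
    for tok in tokens:
        if isinstance(tok, int):
            out.append(tok)  # literal byte
            continue
        length, distance = tok
        if distance < 1 or distance > len(out) or distance > window_len:
            raise OSError("back-reference outside window")
        for _ in range(length):
            # Copy one byte at a time so overlapping copies work.
            out.append(out[-distance])
    return bytes(out)


# 40 literal bytes followed by a copy of the first 4 (distance 40).
tokens = list(range(40)) + [(4, 40)]
```

With these tokens, `toy_inflate(tokens, 6)` succeeds because a 64-byte window covers the distance of 40, while `toy_inflate(tokens, 5)` raises `OSError` because a 32-byte window does not. A stream whose total output is shorter than the window can never trigger the error, which is the trivial case the documentation mentions.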

extmod/moddeflate.c

+21 -10
@@ -54,6 +54,8 @@ typedef enum {
     DEFLATEIO_FORMAT_MAX = DEFLATEIO_FORMAT_GZIP,
 } deflateio_format_t;
 
+// This is used when the wbits is unset in the DeflateIO constructor. Default
+// to the smallest window size (faster compression, less RAM usage, etc).
 const int DEFLATEIO_DEFAULT_WBITS = 8;
 
 typedef struct {
@@ -114,24 +116,32 @@ STATIC bool deflateio_init_read(mp_obj_deflateio_t *self) {
     // Don't modify self->window_bits as it may also be used for write.
     int wbits = self->window_bits;
 
-    // Parse the header if we're in NONE/ZLIB/GZIP modes.
-    if (self->format != DEFLATEIO_FORMAT_RAW) {
-        int header_wbits = wbits;
+    if (self->format == DEFLATEIO_FORMAT_RAW) {
+        if (wbits == 0) {
+            // The docs recommends always setting wbits explicitly when using
+            // RAW, but we still allow a default.
+            wbits = DEFLATEIO_DEFAULT_WBITS;
+        }
+    } else {
+        // Parse the header if we're in NONE/ZLIB/GZIP modes.
+        int header_wbits;
         int header_type = uzlib_parse_zlib_gzip_header(&self->read->decomp, &header_wbits);
+        if (header_type < 0) {
+            // Stream header was invalid.
+            return false;
+        }
         if ((self->format == DEFLATEIO_FORMAT_ZLIB && header_type != UZLIB_HEADER_ZLIB) || (self->format == DEFLATEIO_FORMAT_GZIP && header_type != UZLIB_HEADER_GZIP)) {
+            // Not what we expected.
             return false;
         }
-        if (wbits == 0 && header_wbits < 15) {
-            // If the header specified something lower than the default, then
-            // use that instead.
+        // header_wbits will either be 15 (gzip) or 8-15 (zlib).
+        if (wbits == 0 || header_wbits < wbits) {
+            // If the header specified something lower, then use that instead.
+            // No point doing a bigger allocation than we need to.
             wbits = header_wbits;
         }
     }
 
-    if (wbits == 0) {
-        wbits = DEFLATEIO_DEFAULT_WBITS;
-    }
-
     size_t window_len = 1 << wbits;
     self->read->window = m_new(uint8_t, window_len);
 
@@ -163,6 +173,7 @@ STATIC bool deflateio_init_write(mp_obj_deflateio_t *self) {
 
     int wbits = self->window_bits;
     if (wbits == 0) {
+        // Same default wbits for all formats.
         wbits = DEFLATEIO_DEFAULT_WBITS;
     }
     size_t window_len = 1 << wbits;
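
The new selection logic in `deflateio_init_read` can be modelled in a few lines of Python, which may make the cases easier to check by hand. This is a sketch of the decision only, assuming the header has already been parsed. The function name `read_window_bits` and the string constants are hypothetical stand-ins, not part of the `deflate` module.

```python
# Hypothetical stand-ins for the stream format after AUTO detection.
RAW, ZLIB, GZIP = "raw", "zlib", "gzip"
DEFAULT_WBITS = 8  # mirrors DEFLATEIO_DEFAULT_WBITS in the diff


def read_window_bits(fmt, wbits, header_wbits=None):
    """Return the wbits a read would use under the new logic.

    header_wbits is what the header parser reported (15 for gzip, 8-15 for
    zlib). It is ignored for RAW, which has no header.
    """
    if fmt == RAW:
        return wbits if wbits != 0 else DEFAULT_WBITS
    if wbits == 0 or header_wbits < wbits:
        # Unset, or the header asks for less: use the header value.
        return header_wbits
    return wbits


def window_len(fmt, wbits, header_wbits=None):
    """Size in bytes of the window that would be allocated."""
    return 1 << read_window_bits(fmt, wbits, header_wbits)
```

For example, `window_len(GZIP, 0, 15)` gives 32768 bytes and `window_len(RAW, 0)` gives 256 bytes. An explicit *wbits* larger than the zlib header value is reduced to the header value, while an explicit *wbits* smaller than the header value is kept as given.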

lib/uzlib/header.c

+4
@@ -108,6 +108,10 @@ int uzlib_parse_zlib_gzip_header(uzlib_uncomp_t *d, int *wbits)
         d->checksum_type = UZLIB_CHKSUM_CRC;
         d->checksum = ~0;
 
+        /* gzip does not include the window size in the header, as it is expected that a
+           compressor will use wbits=15 (32kiB).*/
+        *wbits = 15;
+
         return UZLIB_HEADER_GZIP;
     } else {
         /* check checksum */
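
To show where the reported *wbits* value comes from, here is a simplified Python sketch of header sniffing. It is not uzlib's implementation. The function name `sniff_header` is hypothetical, it only inspects the first two bytes, and it follows the zlib header layout (window size in the upper four bits of the first byte, plus 8) and the gzip magic bytes `1f 8b`. CPython's `zlib` and `gzip` modules are used only to produce sample input.

```python
import gzip
import zlib


def sniff_header(data):
    """Return (kind, wbits) for a zlib or gzip stream, else raise ValueError.

    Simplified sketch: gzip carries no window size, so 15 is reported, as
    the commit does. For zlib the value is read from the header.
    """
    if len(data) < 2:
        raise ValueError("stream too short")
    if data[0] == 0x1F and data[1] == 0x8B:
        return "gzip", 15
    cmf, flg = data[0], data[1]
    # Method must be 8 (deflate), window at most 32 KiB, and the two
    # header bytes read as a 16-bit value must be a multiple of 31.
    if (cmf & 0x0F) != 8 or (cmf >> 4) > 7 or ((cmf << 8) | flg) % 31 != 0:
        raise ValueError("not a zlib or gzip header")
    return "zlib", (cmf >> 4) + 8


# Sample streams: zlib with a 1 KiB window (wbits=10), and gzip.
comp = zlib.compressobj(wbits=10)
zlib_sample = comp.compress(b"x" * 100) + comp.flush()
gzip_sample = gzip.compress(b"x" * 100)
```

Here `sniff_header(zlib_sample)` reports a zlib stream with *wbits* of 10, and `sniff_header(gzip_sample)` reports gzip with *wbits* of 15, which is the value the decompressor now falls back to when *wbits* is unset.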
