Is a malicious Zarr store possible? #2625
Replies: 2 comments
-
I think compared to the attack vector posed by pickle, ordinary numerical data type + compression combinations are incredibly tame.
Generally yes, the assumption is that a single chunk fits in memory. But compression ratio aside, the decoded size of each chunk is static and known in advance from the array metadata, so you can't really surprise a victim with that -- clients can simply refuse to handle chunks that are too big.
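As a rough sketch of the "refuse chunks that are too big" idea: the decoded size of a chunk follows from the chunk shape and the item size of the data type, so a client can check it before reading any chunk data. The names `MAX_CHUNK_BYTES`, `decoded_chunk_nbytes`, and `check_chunk_budget` are made up for illustration and are not part of any Zarr library; the 256 MiB budget is an arbitrary assumption.

```python
import math

# Hypothetical per-chunk memory budget (256 MiB); pick whatever suits the client.
MAX_CHUNK_BYTES = 256 * 1024**2


def decoded_chunk_nbytes(chunk_shape, itemsize):
    """Bytes needed for one fully decoded chunk, from metadata alone.

    chunk_shape: tuple of ints taken from the array metadata.
    itemsize: bytes per element of the data type (e.g. 8 for float64).
    """
    return math.prod(chunk_shape) * itemsize


def check_chunk_budget(chunk_shape, itemsize, limit=MAX_CHUNK_BYTES):
    """Raise ValueError if a decoded chunk would exceed the budget."""
    nbytes = decoded_chunk_nbytes(chunk_shape, itemsize)
    if nbytes > limit:
        raise ValueError(
            f"decoded chunk would need {nbytes} bytes, limit is {limit}"
        )
    return nbytes
```

For example, `check_chunk_budget((1000, 1000), 8)` passes, while a chunk shape of `(100_000, 100_000)` with 8-byte items would be rejected before any data is fetched.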
-
Because Zarr is just files and folders (a "hierarchy"), the zipped storage representation is just ... zipped files and folders. As far as I know we don't take any special precautions against zip bombs -- we inherit all of the vulnerabilities present in the Python standard library implementation of zip. I don't think there's anything particularly special about zip as an archive format from Zarr's POV -- the Zarr storage model would work with any archive that can store files + folders. If zip, or the Python implementation, has some severe vulnerabilities, then we should of course work through those issues and recommend alternatives if they exist. But I've never heard of zip bombing in the context of Zarr.
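For anyone who does want a precaution here, a minimal sketch using only the standard library: `zipfile` exposes the declared compressed and uncompressed size of every member, so a cautious client can total them before reading anything. The helper names and the budget are hypothetical, and the declared sizes come from the archive's own directory, so this is a sanity check rather than a guarantee.

```python
import io
import zipfile


def declared_sizes(zip_file):
    """Return (total_compressed, total_uncompressed) as declared by the archive.

    zip_file: a path or a seekable file-like object.
    """
    with zipfile.ZipFile(zip_file) as zf:
        infos = zf.infolist()
        compressed = sum(info.compress_size for info in infos)
        uncompressed = sum(info.file_size for info in infos)
    return compressed, uncompressed


def exceeds_budget(zip_file, max_uncompressed_bytes):
    """True if the archive declares more uncompressed data than allowed."""
    _, uncompressed = declared_sizes(zip_file)
    return uncompressed > max_uncompressed_bytes


def make_example_zip():
    """Build a small in-memory zip holding one highly compressible member."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.writestr("c/0", bytes(1_000_000))  # one million zero bytes
    buf.seek(0)
    return buf
```

Calling `declared_sizes(make_example_zip())` shows a tiny compressed size next to the one-million-byte uncompressed size, which is the kind of mismatch a client could choose to flag.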
-
Reading about Hugging Face's safetensors format, they cite safety (relative to pickle) as a main motivation.
Given that their format is otherwise likely mappable to Zarr via a virtual-zarr approach (see zarr-developers/VirtualiZarr#367), this has me wondering if it's possible to create a malicious Zarr (or Icechunk) store?
I don't know much about security, but it seems like it's mostly safe:
[...] eval() arbitrary data they find,
[...] pickle does),
The more interesting question is whether or not the decompression step could be used for a denial-of-service (DoS) attack, i.e. something like a zip bomb. I don't understand very well how the decompression codecs actually work, but apparently the maximum compression ratio of zlib is about 1032:1. To go any higher you would have to compress multiple times, and that would need to be recorded in the codec information, otherwise the reader wouldn't apply the extra decompression steps. So if you were worried about "Zarr Bombs" you could perhaps just look at the compression codec information to see if there was anything funky-looking (e.g. 10 decompression steps in a row)?
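To make the "look at the codec information" idea concrete, here is a small sketch that counts compression codecs in a Zarr v3 array metadata document (the `codecs` list in `zarr.json`). The set of names treated as compressors and the threshold of one are my assumptions, the function names are made up, and codecs nested inside a sharding codec are not inspected.

```python
import json

# Assumption: codec names treated as compressors for this check.
COMPRESSOR_NAMES = {"gzip", "zstd", "blosc"}


def count_compressors(array_metadata_json):
    """Count top-level compression codecs in a Zarr v3 array metadata string."""
    meta = json.loads(array_metadata_json)
    codecs = meta.get("codecs", [])
    return sum(1 for codec in codecs if codec.get("name") in COMPRESSOR_NAMES)


def looks_suspicious(array_metadata_json, max_compressors=1):
    """Flag metadata that chains more compressors than expected."""
    return count_compressors(array_metadata_json) > max_compressors


# Example metadata fragments (only the "codecs" key matters here).
ordinary = json.dumps({
    "codecs": [
        {"name": "bytes", "configuration": {"endian": "little"}},
        {"name": "gzip", "configuration": {"level": 5}},
    ]
})
stacked = json.dumps({
    "codecs": [{"name": "bytes", "configuration": {"endian": "little"}}]
    + [{"name": "gzip", "configuration": {"level": 9}}] * 3
})
```

With these fragments, `looks_suspicious(ordinary)` is false and `looks_suspicious(stacked)` is true.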
But even if you did create a zarr chunk that had a very high compression ratio, would the reader actually allocate that much memory?
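I don't know what the existing readers do, but as a sketch of how a reader could avoid allocating an unbounded amount: Python's `zlib.decompressobj()` accepts a `max_length` argument, so decompression can stop as soon as the output exceeds a budget. `bounded_decompress` is a hypothetical helper, not something taken from zarr-python or numcodecs.

```python
import zlib


def bounded_decompress(data, limit):
    """Decompress zlib data, refusing to produce more than `limit` bytes."""
    decompressor = zlib.decompressobj()
    # Ask for at most limit + 1 bytes: one extra byte is enough to
    # detect that the stream would exceed the budget.
    out = decompressor.decompress(data, limit + 1)
    if len(out) > limit:
        raise ValueError(f"decompressed size exceeds limit of {limit} bytes")
    if not decompressor.eof:
        raise ValueError("compressed stream is truncated or incomplete")
    return out


# One million zero bytes compress to a very small payload.
payload = zlib.compress(bytes(1_000_000))
```

Here `bounded_decompress(payload, 2_000_000)` returns the original million bytes, while `bounded_decompress(payload, 1_000)` raises after producing only about a kilobyte of output.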
Also, does Zarr support ZIP in the way described here?