[v3] default compressor / codec pipeline #2267
Comments
thoughts on defaulting to Zstandard? |
Does |
The v3 example is creating a zarr v2 array, so the "compressor" concept still applies there. But it's also worth considering what the default codec pipeline should be when creating zarr v3 arrays. If it were easy to specify, I think defaulting to a sharding codec with an inner chunk size equal to the outer chunk size (i.e., a minimal shard) would actually be a good default, but we would have to see how this looks. |
I think we should make the defaults exactly the same as they are in v2. I do not think we should make sharding the default until we have spent more time optimizing and debugging it. |
The default compressor in v2 was I'd be fine with keeping cc @mkitti |
Before defaulting to Blosc in Zarr v3, we should really fix the issue in #2171. That is, there should probably be an ArrayBytesBloscCodec that can actually transmit dtype / typesize information correctly to Blosc. Perhaps one could follow the zarr-java implementation and default the typesize to the dtype. Also, continuing to encode new data with Blosc v1 is unwise given the current stage-of-life of that package. Blosc v1 is in what I will term "community maintenance mode". Unless you are actively thinking about it, I would not assume that anyone is actively maintaining the package. My recommendations from most favored to least are:
|
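To illustrate why the typesize matters for a shuffling compressor like Blosc, here is a minimal sketch. It does not use Blosc at all: `byte_shuffle` and `byte_unshuffle` are hand-written, hypothetical helpers, and stdlib `zlib` stands in for the inner compressor. The only point is that shuffling with the element size of the dtype (4 bytes for int32) groups similar bytes together, while a typesize of 1 leaves the buffer unchanged.

```python
import struct
import zlib


def byte_shuffle(buf: bytes, typesize: int) -> bytes:
    """Group the k-th byte of every element together (a plain byte shuffle).

    Hypothetical helper for illustration; not Blosc's implementation.
    """
    if typesize <= 1 or len(buf) % typesize:
        return buf  # nothing to shuffle, or buffer is not a whole number of elements
    return b"".join(buf[k::typesize] for k in range(typesize))


def byte_unshuffle(buf: bytes, typesize: int) -> bytes:
    """Invert byte_shuffle for the same typesize."""
    if typesize <= 1 or len(buf) % typesize:
        return buf
    n = len(buf) // typesize
    out = bytearray(len(buf))
    for k in range(typesize):
        out[k::typesize] = buf[k * n:(k + 1) * n]
    return bytes(out)


# 10,000 little-endian int32 values: 0, 1, 2, ...
values = list(range(10_000))
raw = struct.pack(f"<{len(values)}i", *values)

# typesize=1 is what a compressor effectively sees when no dtype information is passed.
size_typesize_1 = len(zlib.compress(byte_shuffle(raw, 1)))
# typesize=4 matches the int32 element size.
size_typesize_4 = len(zlib.compress(byte_shuffle(raw, 4)))

print(f"raw bytes:           {len(raw)}")
print(f"zlib, typesize=1:    {size_typesize_1}")
print(f"zlib, typesize=4:    {size_typesize_4}")
```

On this synthetic input the shuffled buffer compresses to a much smaller size, which is the effect that gets lost when the codec cannot see the dtype.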
I would be reluctant to pick Blosc, given it isn't very actively maintained and the Blosc maintainers would rather folks transitioned to Blosc2. In this context I think we have a responsibility to move away from providing Blosc (version 1) as the default compressor, and zarr-python v3 seems like a good point to do that. I'd be 👍 to defaulting to no compression. This would then force users to learn about compression and choose a compressor that works well for them. |
@rabernat can you expand a bit on why you think we should keep the same defaults? |
Instead of learning how zarr works, people might just get frustrated that their data isn't compressed and conclude that the format isn't worth the trouble. I like the idea but I don't think a "pedagogical default" beats a "good default" here. My preference would be that we pick a solid compressor that offers good all-around performance, and I think Zstd (sans blosc) clears that bar. |
Just for the sake of maintaining consistency and not having to make a decision. However, if there are serious drawbacks (as it appears there are), I'm fine with zstd. Do we have a sense of the performance implications of this choice? |
Here are blosc's internal benchmarks: "Not the fastest, but a nicely balanced one" is a good summary. Basically, the default settings balance compression ratio and speed. Lower or negative levels provide more speed. Higher levels provide more compression ratio. |
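For a rough feel of the level trade-off described above, here is a small sketch. It assumes stdlib `zlib` as a stand-in, since Zstd bindings are not in the standard library on most Python versions; zlib's levels run from 1 to 9 and it has no negative "fast" levels, so this only shows the general shape of the trade-off, not Zstd's actual numbers. The helper name `compressed_sizes` is made up for this example.

```python
import time
import zlib

# Synthetic payload: 50,000 little-endian 4-byte integers.
payload = b"".join(i.to_bytes(4, "little") for i in range(50_000))


def compressed_sizes(data: bytes, levels=(1, 6, 9)) -> dict:
    """Return {level: compressed size in bytes} for each requested zlib level."""
    return {level: len(zlib.compress(data, level)) for level in levels}


sizes = compressed_sizes(payload)

for level, size in sizes.items():
    start = time.perf_counter()
    zlib.compress(payload, level)
    elapsed_ms = (time.perf_counter() - start) * 1000
    print(f"level {level}: {size} bytes, {elapsed_ms:.2f} ms")
```

Timings will vary by machine and by data, so any default level is best checked against representative arrays rather than a synthetic buffer like this one.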
One note here that's relevant for #2036 and pydata/xarray#9515: the default codec can depend on the dtype of the array:
# zarr-python 2.18.3
>>> import zarr
>>> g = zarr.group(store={})
>>> g.create(name="b", shape=(3,), dtype=str).filters
[VLenUTF8()]
|
good point @TomAugspurger. In that case we probably should default to a string literal like "auto" for compressor / filters, and then use functions like |
Big 👍 to that idea. |
We now have automatic detection of the ArrayBytes codec based on dtype (see zarr-python/src/zarr/codecs/__init__.py, lines 37 to 46 at commit 395604d).
Next step for this issue is to just add a default BytesBytes compressor. |
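As a rough picture of how an "auto" default could be resolved per dtype, here is a minimal sketch. Everything in it is hypothetical: `resolve_codecs`, `CodecChoice`, the string dtype kinds, and the codec names are placeholders and not zarr-python's API, and the Zstd fallback is only an assumption taken from the preference voiced earlier in this thread. It is not the code referenced above.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

AUTO = "auto"  # sentinel meaning "pick a default based on the dtype"


@dataclass(frozen=True)
class CodecChoice:
    """Hypothetical container for the resolved filters and compressor names."""
    filters: Tuple[str, ...]
    compressor: Optional[str]


def resolve_codecs(dtype_kind: str, filters=AUTO, compressor=AUTO) -> CodecChoice:
    """Replace any "auto" argument with a dtype-dependent default.

    Explicit values, including None for "no compression", are passed through.
    """
    if filters == AUTO:
        if dtype_kind == "str":
            filters = ("VLenUTF8",)   # variable-length strings
        elif dtype_kind == "bytes":
            filters = ("VLenBytes",)  # arbitrary variable-length bytes
        else:
            filters = ()              # fixed-size dtypes need no filter here
    if compressor == AUTO:
        compressor = "Zstd"           # assumed default, per the discussion above
    return CodecChoice(filters=tuple(filters), compressor=compressor)


print(resolve_codecs("str"))
print(resolve_codecs("int32"))
print(resolve_codecs("int32", compressor=None))
```

Using a sentinel string rather than None keeps "let the library decide" distinct from "explicitly no compressor", which seems to be the main reason for the "auto" proposal.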
Based on the discussion today, we landed on:
Related to this is how to set the default. In 2.x, we told folks to set |
+1 for using Zstd or Zlib potentially as default compressors. It would be ideal to use a format that is standardized, has a strongly specified stream definition and is widely used and supported. |
Based on this, I started working on #2470. Feel free to leave a comment. |
@jhamman |
VLenUTF8 is explicitly for strings. VLenBytes is just any random bytes. They are mutually exclusive. |
Zarr version
3.0.0.alpha6
Numcodecs version
N/A
Python Version
N/A
Operating System
N/A
Installation
N/A
Description
In Zarr-Python 2.x, Zarr provided default compressors for most (all?) datatypes. As of now, in 3.0, we don't provide any defaults.
Steps to reproduce
In 2.18:
In 3.0.0.alpha6
Additional output
No response