Performance regression in V3 #2710
Ouch, sorry the performance is so much worse in 3.0. I am pretty sure we can do better; hopefully this is an easy fix. In case you or anyone else wants to dig into this, I would profile this function.

---
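Profiling, as suggested, can be done with the standard library's cProfile; a minimal sketch (the `workload` function here is a hypothetical stand-in for an actual zarr write, not zarr-python code):

```python
import cProfile
import io
import pstats

import numpy as np

def workload() -> bool:
    # hypothetical stand-in for a chunked zarr write: allocate and compare
    a = np.arange(1024 * 256, dtype=np.float64)
    return bool(np.all(a == a))

profiler = cProfile.Profile()
profiler.enable()
result = workload()
profiler.disable()

# print the five most expensive calls by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(result)  # True
```

Sorting by cumulative time surfaces which call dominates the wall-clock cost, which is what matters for a regression like this one.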
FWIW, I tried and failed to reproduce this regression locally. In fact, V3 was faster.

2.18.2:

```python
arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
# -> 8.0s
zarr.save_array(store, arr, zarr_version=3, path="/")
# -> 5.3s
```

3.0.1.dev10+g45146ca0:

```python
arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
# -> 1.2s
za[:] = arr
# -> 9.9s
```

I wonder what could be the difference between environments. Perhaps the regression is hardware-dependent. I'm on a MacBook with a fast SSD.

---
Hmm, I don't know what your first 8s measurement was. It should not take that long to allocate some chunk buffers in memory. I also have a MacBook Pro M3 Max, so I will rerun these and report back. The initial set of measurements I took was on an AWS EC2 m6i instance.

UPDATE: My latencies also look much better on macOS, albeit Zarr V3 still measures slower.

Zarr 2.18.4:

```
❯ ZARR_V3_EXPERIMENTAL_API=1 ipython
Python 3.13.1 (main, Jan 5 2025, 06:22:40) [Clang 19.1.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.31.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import zarr

In [3]: from zarr._storage.v3 import DirectoryStoreV3

In [4]: %time arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
CPU times: user 3.29 s, sys: 525 ms, total: 3.82 s
Wall time: 1.25 s

In [5]: store = DirectoryStoreV3("/tmp/foo.zarr")

In [6]: %time zarr.save_array(store, arr, zarr_version=3, path="/")
CPU times: user 7.83 s, sys: 543 ms, total: 8.37 s
Wall time: 1.25 s
```

Zarr 3.0.0:

```
In [5]: store = LocalStore("/tmp/bar.zarr")

In [6]: compressors = zarr.codecs.BloscCodec(cname="lz4", shuffle=zarr.codecs.BloscShuffle.bitshuffle)

In [8]: za = zarr.create_array(store, shape=(1024 * 1024 * 1024,), chunks=(1024 * 1024 * 16,), dtype=np.float64, compressors=compressors)

In [9]: arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)

In [10]: %time za[:] = arr
CPU times: user 6.75 s, sys: 1.38 s, total: 8.13 s
Wall time: 3.69 s
```

I'll keep digging...

---
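For reproducing these `%time` measurements outside IPython, a plain `perf_counter` harness is enough; a minimal sketch (`timed` is a hypothetical helper, and the array here is scaled far down from the 8 GiB benchmark so it runs anywhere):

```python
import time
from contextlib import contextmanager

import numpy as np

@contextmanager
def timed(label: str):
    # rough stand-in for IPython's %time wall-clock measurement
    start = time.perf_counter()
    yield
    print(f"{label}: {time.perf_counter() - start:.3f} s")

# scaled down from the 1024**3-element benchmark above
with timed("allocate"):
    arr = np.arange(1024 * 1024, dtype=np.float64)

with timed("copy"):
    dst = np.empty_like(arr)
    dst[:] = arr
```

Running the same script on both machines (EC2 and the MacBook) would separate raw memory-bandwidth differences from anything zarr-specific.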
My current hypothesis is that the benchmarking I've been running on the EC2 instance (r7i.2xlarge) is memory-bandwidth constrained (AWS is very hand-wavy about memory bandwidth throttling); even just allocating large arrays is noticeably slower there. However, because https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/codec_pipeline.py#L396 is run in a coroutine on the event loop (even though there is no I/O), it effectively crowds out other tasks from running. For a 128 MiB chunk:

```python
a1 = np.arange(1024 ** 2 * 16)
a2 = np.ones(1024 ** 2 * 16)
# this 128 MiB memcmp takes 150ms on my r7i.2xlarge EC2 instance and 50ms on my M3 Max
np.array_equal(a1, a2)
```

If I short-circuit this check, the timings improve. The former implementation was

```python
return np.all(value == array)
```

which under the hood does some efficient SIMD (it takes only 15ms on the same machine), whereas the version 3 implementation is much more expensive.

---
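The event-loop crowding described above can be illustrated outside of zarr. A minimal sketch of moving the CPU-bound comparison onto a worker thread with `run_in_executor` (the names `chunk_equals_fill` and `check_chunks` are hypothetical, not zarr-python API):

```python
import asyncio

import numpy as np

def chunk_equals_fill(chunk: np.ndarray, fill_value: float) -> bool:
    # np.all over a vectorized comparison; this is the CPU-bound work that
    # blocks the event loop when run directly inside a coroutine
    return bool(np.all(chunk == fill_value))

async def check_chunks(chunks, fill_value):
    loop = asyncio.get_running_loop()
    # hand each comparison to the default thread-pool executor so the
    # event loop stays free to schedule other tasks in the meantime
    futures = [
        loop.run_in_executor(None, chunk_equals_fill, c, fill_value)
        for c in chunks
    ]
    return await asyncio.gather(*futures)

chunks = [np.zeros(1024), np.ones(1024)]
print(asyncio.run(check_chunks(chunks, 0.0)))  # [True, False]
```

Because NumPy releases the GIL during large comparisons, threads are a reasonable offload target here; whether zarr-python should do this is exactly the design question under discussion.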
Hi, do the maintainers have any thoughts on this? I am happy to take a look if there is a change that you are willing to accept, but I'd first need some context on why the decision was made to change the implementation.

---
@y4n9squared - thanks very much for your work on this. We are definitely interested in fixing this performance regression and appreciate your input. @d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the way this is implemented back to the way it used to be.

---
Hi @d-v-b, just so we have something concrete to discuss, I opened a draft PR with the changes I made that showed improvement.

---
@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here. One question about the examples you shared upthread: in zarr-python 2.18, the default value for that setting was different. Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting it is not an apples-to-apples comparison.

For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.

---
Contrary to what git blame suggests, I didn't write that implementation.

---
@d-v-b Thanks for the callout. I'll rerun the comparison so that it's apples-to-apples.

Along these lines, has there been a change to the compressor behavior as well? IIUC, in Zarr 2.x, if one did not override the compressor, the default setting was Blosc/LZ4 with shuffle, where the shuffle behavior (bit vs. byte) depended on the dtype -- for float64, it would do bit. When I compare the compressed chunk sizes with this 3.x code:

```python
store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan
# cname = "blosclz"
cname = "lz4"
compressors = zarr.codecs.BloscCodec(cname=cname, shuffle=zarr.codecs.BloscShuffle.bitshuffle)
za = zarr.create_array(
    store,
    shape=shape,
    chunks=chunks,
    dtype=dtype,
    fill_value=fill_value,
    compressors=compressors,
)
arr = np.arange(1024 * 1024 * 1024, dtype=dtype).reshape(shape)
za[:] = arr
```

I see notable differences in the sizes. Curiously, the sizes also change if I switch the cname. What is the correct compressor config for an accurate 2.x comparison?

---
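The shuffle step mentioned above can be sketched in pure NumPy to see why it affects compressed sizes. This is an illustrative byte shuffle (Blosc's bitshuffle is the same idea applied at the bit level); `byte_shuffle` and `byte_unshuffle` are hypothetical helper names, not Blosc or zarr API:

```python
import numpy as np

def byte_shuffle(arr: np.ndarray) -> bytes:
    # group byte plane i of every element together (what Blosc's byte
    # shuffle does); similar bytes end up adjacent, which typically
    # compresses better than element-major byte order
    return arr.view(np.uint8).reshape(-1, arr.itemsize).T.tobytes()

def byte_unshuffle(buf: bytes, dtype, count: int) -> np.ndarray:
    itemsize = np.dtype(dtype).itemsize
    planes = np.frombuffer(buf, dtype=np.uint8).reshape(itemsize, count)
    # transpose back to element-major byte order and reinterpret
    return np.ascontiguousarray(planes.T).reshape(-1).view(dtype)

data = np.arange(16, dtype=np.float64)
roundtrip = byte_unshuffle(byte_shuffle(data), np.float64, data.size)
print(np.array_equal(roundtrip, data))  # True
```

Because bit vs. byte shuffle rearranges the input differently before LZ4 sees it, a mismatch in shuffle mode between the 2.x default and an explicit 3.x config would plausibly explain different chunk sizes.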
I'm not entirely sure. Aside from the general notion that EC2 instances are VMs and small instance types like the ones I've been using (m6i.2xlarge, r7i.2xlarge) are more susceptible to "noisy neighbors", I haven't been able to pinpoint why I see the slowdowns that I do. Some of the normal tools I'd use to inspect these things (e.g. Linux perf) are handicapped in VMs. I am going to look into this a bit more and report back.

---
Thanks for iterating on this; this is exactly what we need to dial in the performance of the library.

For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass in exactly the same numcodecs instance.

---
If I pass in

```python
compressors = [numcodecs.Blosc(cname="lz4")]
```

I get this traceback:

```
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 3926, in create_array
    array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
                                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        compressors=compressors,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dtype=dtype_parsed,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in _parse_chunk_encoding_v3
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in <genexpr>
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
                            ~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
    raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.
```

---
Ah, sorry for the confusion -- I thought you were making zarr v2 data (i.e., that you were calling the v2 creation path).

---
Do you know what causes the discrepancy in the compression results, then? I expected that, even though the Python syntax has changed, the actual compression behavior and results would be the same.

---
Zarr version: 3.0.0
Numcodecs version: 0.14.1
Python Version: 3.13
Operating System: Linux
Installation: Using uv

Description:
This simple workload, which writes out the numbers 1 through 1e9 in 64 separate chunks, runs in about 5s on my machine on version 2.18. The equivalent workload on version 3 takes over a minute.

Steps to reproduce:
See above

Additional output:
No response