
Performance regression in V3 #2710

Open
y4n9squared opened this issue Jan 14, 2025 · 14 comments
Labels
bug Potential issues with the zarr-python library performance Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@y4n9squared

Zarr version

3.0.0

Numcodecs version

0.14.1

Python Version

3.13

Operating System

Linux

Installation

Using uv

Description

This simple workload, which writes out 2**30 sequential float64 values (about 8 GiB) in 64 separate chunks,

# Using zarr==2.18.2
import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3

store = DirectoryStoreV3("/tmp/foo.zarr")
arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
zarr.save_array(store, arr, zarr_version=3, path="/")

runs in about 5 s on my machine on version 2.18.

The equivalent workload on version 3 takes over a minute:

# Using zarr==3.0
import numpy as np
import zarr
import zarr.codecs
from zarr.storage import LocalStore

store = LocalStore("/tmp/bar.zarr")

compressors = zarr.codecs.BloscCodec(cname='lz4', shuffle=zarr.codecs.BloscShuffle.bitshuffle)

za = zarr.create_array(
    store,
    shape=(1024 * 1024 * 1024,),
    chunks=(1024 * 1024 * 16,),
    dtype=np.float64,
    compressors=compressors,
)

arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
za[:] = arr

Steps to reproduce

See above

Additional output

No response

@y4n9squared y4n9squared added the bug Potential issues with the zarr-python library label Jan 14, 2025
@d-v-b
Contributor

d-v-b commented Jan 14, 2025

ouch, sorry the performance is so much worse in 3.0. I am pretty sure we can do better, hopefully this is an easy fix. In case you or anyone else wants to dig into this, i would profile this function

@rabernat
Contributor

rabernat commented Jan 14, 2025

FWIW, I tried and failed to reproduce this regression locally. In fact, V3 was faster.

2.18.2

arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
# -> 8.0s
zarr.save_array(store, arr, zarr_version=3, path="/")
# -> 5.3s

3.0.1.dev10+g45146ca0

arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
# -> 1.2s
za[:] = arr
# -> 9.9s

I wonder what could be the difference between environments. Perhaps the regression is hardware-dependent. I'm on a macbook with a fast SSD.

@y4n9squared
Author

y4n9squared commented Jan 14, 2025

Hmm, I don't know what your first 8 s measurement was. It should not take that long to allocate some chunk buffers in memory. I also have a MacBook Pro M3 Max, so I will rerun these and report back. The initial set of measurements I took was on an AWS EC2 m6i instance.

UPDATE:

My latencies also look much better on macOS, though Zarr V3 still measures slower.

Zarr 2.18.4

ZARR_V3_EXPERIMENTAL_API=1 ipython
Python 3.13.1 (main, Jan  5 2025, 06:22:40) [Clang 19.1.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.31.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import zarr

In [3]: from zarr._storage.v3 import DirectoryStoreV3

In [4]: %time arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
CPU times: user 3.29 s, sys: 525 ms, total: 3.82 s
Wall time: 1.25 s

In [5]: store = DirectoryStoreV3("/tmp/foo.zarr")

In [6]: %time zarr.save_array(store, arr, zarr_version=3, path="/")
CPU times: user 7.83 s, sys: 543 ms, total: 8.37 s
Wall time: 1.25 s

Zarr 3.0.0

In [5]: store = LocalStore("/tmp/bar.zarr")

In [6]: compressors = zarr.codecs.BloscCodec(cname="lz4", shuffle=zarr.codecs.BloscShuffle.bitshuffle)

In [8]: za = zarr.create_array(store, shape=(1024 * 1024 * 1024,), chunks=(1024 * 1024 * 16,), dtype=np.float64, compressors=compressors)

In [9]: arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)

In [10]: %time za[:] = arr
CPU times: user 6.75 s, sys: 1.38 s, total: 8.13 s
Wall time: 3.69 s

I'll keep digging...

@y4n9squared
Author

y4n9squared commented Jan 15, 2025

My current hypothesis is that the benchmarking that I've been running on the EC2 instance (r7i.2xlarge) is memory-bandwidth constrained (AWS is very hand-wavy about memory bandwidth throttling). Even just allocating large arrays, e.g. np.ones(1024 ** 3), takes a few seconds (whereas on bare metal it would be in the hundreds of milliseconds).

However, because https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/codec_pipeline.py#L396 is run in a coroutine on the event loop (even though there is no I/O), it effectively crowds out other tasks from running. For the 128 MiB chunk, calling np.array_equal takes about 150ms (again probably due to lack of memory bandwidth). But for the 64 chunks, this constitutes a full 10s of serial work on the event loop.

a1 = np.arange(1024 ** 2 * 16)
a2 = np.ones(1024 ** 2 * 16)
# this 128 MiB comparison takes 150 ms on my r7i.2xlarge EC2 instance and 50 ms on my M3 Max
np.array_equal(a1, a2)
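The crowding-out effect described above can be reproduced with a stdlib-only toy (an illustrative sketch, not zarr's pipeline; time.sleep stands in for the per-chunk comparison work):

```python
import asyncio
import time

def blocking_check(chunk_s: float) -> None:
    # Stand-in for blocking per-chunk work on the event loop
    # (e.g. the ~150 ms comparison per 128 MiB chunk described above).
    time.sleep(chunk_s)

async def serial(n: int, chunk_s: float) -> float:
    # Doing the blocking work inline in a coroutine serializes it:
    # nothing else on the loop can run while each check executes.
    t0 = time.perf_counter()
    for _ in range(n):
        blocking_check(chunk_s)
        await asyncio.sleep(0)  # yields, but the work already ran serially
    return time.perf_counter() - t0

async def offloaded(n: int, chunk_s: float) -> float:
    # Offloading to worker threads lets the checks overlap instead of
    # monopolizing the event loop.
    t0 = time.perf_counter()
    await asyncio.gather(
        *(asyncio.to_thread(blocking_check, chunk_s) for _ in range(n))
    )
    return time.perf_counter() - t0

if __name__ == "__main__":
    s = asyncio.run(serial(8, 0.05))
    p = asyncio.run(offloaded(8, 0.05))
    print(f"serial: {s:.2f}s, offloaded: {p:.2f}s")
```

With 64 chunks at ~150 ms each, the serial pattern alone accounts for roughly 10 s of wall time, matching the numbers above.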

If I short-circuit chunk_array.all_equal by setting it to False, then the rest of the pipeline finishes very quickly and I can write out the full array in ~3s, which brings me to parity with the performance I can get on Zarr 2.18.

The former implementation zarr.util.all_equal appears to compare the chunk to a scalar

return np.all(value == array)

which under the hood uses efficient SIMD (taking only 15 ms on the same machine), whereas the version 3 implementation is much more expensive.
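As a rough illustration of the cost difference (a sketch, not zarr's actual code): the scalar comparison avoids materializing a second full-size array, while a generic array-vs-array check has to touch twice the memory.

```python
import numpy as np

chunk = np.zeros(1024 * 1024, dtype=np.float64)  # small chunk for a quick demo
fill_value = 0.0

# v2-style check: scalar comparison, which NumPy vectorizes efficiently
v2_style = bool(np.all(fill_value == chunk))

# A more generic check that first materializes a full-size fill array
# (assumption: this mirrors the extra cost described above, not zarr's
# exact v3 implementation).
fill_array = np.full_like(chunk, fill_value)
generic = np.array_equal(chunk, fill_array)

assert v2_style == generic  # same answer, different cost
```

One caveat with the scalar form: a NaN fill value needs separate np.isnan handling, since NaN != NaN.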

@dstansby dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Jan 15, 2025
@y4n9squared
Author

Hi, do the maintainers have any thoughts on this? I am happy to take a look if there is a change that you are willing to accept, but I'd first need some context on why the decision was made to change the implementation of all_equal.

@rabernat
Contributor

@y4n9squared - thanks very much for your work on this. We are definitely interested in fixing this performance regression and appreciate your input.

@d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the implementation back to the way it used to be.

@y4n9squared
Author

Hi @d-v-b , just so we have something concrete to discuss, I opened a draft PR with the changes that I made which showed improvement.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here.

one question about the examples you shared upthread: In zarr-python 2.18, the default value for write_empty_chunks is True, but in 3.x it's False.

Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting write_empty_chunks might give you misleading results.
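The skip logic the default controls can be sketched in a few lines of plain numpy (a hypothetical illustration of the behavior, not zarr's actual code):

```python
import numpy as np

fill_value = 0.0
chunks = [np.zeros(4), np.array([0.0, 1.0, 0.0, 0.0]), np.zeros(4)]

write_empty_chunks = False  # the zarr-python 3.x default
written = []

for i, chunk in enumerate(chunks):
    is_empty = bool(np.all(chunk == fill_value))
    if is_empty and not write_empty_chunks:
        # chunk is elided; readers reconstruct it from fill_value
        continue
    written.append(i)

print(written)  # only the non-empty chunk index is stored
```

With write_empty_chunks=True (the 2.18 default), every chunk is written and the emptiness check is skipped entirely, which is why the two versions do different amounts of work on the same data.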

For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

@d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the way this is implemented back the the way it used to be.

Contrary to what git blame suggests, I didn't write that implementation of all_equal -- it appeared in #1826. In any case, we should definitely go with an implementation that has the best all-around performance.

@y4n9squared
Author

y4n9squared commented Jan 19, 2025

@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here.

one question about the examples you shared upthread: In zarr-python 2.18, the default value for write_empty_chunks is True, but in 3.x it's False.

Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting write_empty_chunks might give you misleading results.

@d-v-b Thanks for the callout. I'll rerun the comparison so that it's apples-to-apples. Along these lines, has there been a change to the compressor behavior as well?

IIUC, in Zarr 2.x, if one did not override the compressor, the default setting was Blosc/LZ4 with shuffle, where the shuffle behavior (bit vs. byte) depended on the dtype -- for float64, it would do bit shuffle.

When I compare the compressed chunk sizes with this 3.x code:

store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan

# cname = "blosclz"
cname = "lz4"
compressors = zarr.codecs.BloscCodec(cname=cname, shuffle=zarr.codecs.BloscShuffle.bitshuffle)

za = zarr.create_array(
    store,
    shape=shape,
    chunks=chunks,
    dtype=dtype,
    fill_value=fill_value,
    compressors=compressors,
)

arr = np.arange(1024 * 1024 * 1024, dtype=dtype).reshape(shape)
za[:] = arr

I see notable differences in the sizes. Curiously, if I switch the cname to blosclz instead, I get chunk sizes that are more similar to Zarr 2.x.

What is the correct compressor config for an accurate 2.x comparison?

For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.

I'm not entirely sure. Aside from the general notion that EC2 instances are VMs and small instance types like the ones that I've been using (m6i.2xlarge, r7i.2xlarge) are more susceptible to "noisy neighbors", I haven't been able to pinpoint why I see the slowdowns that I do. Some of the normal tools that I'd use to inspect these things (e.g. Linux perf) are handicapped in VMs. I am going to look into this a bit more and report back.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

Thanks for iterating on this, this is exactly what we need to dial in the performance of the library.

I see notable differences in the sizes. Curiously, if I switch the cname to blosclz instead, I get chunk sizes that are more similar to Zarr 2.x.

What is the correct compressor config for an accurate 2.x comparison?

For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass exactly the same numcodecs instance as compressors to create_array, which should make for a fair comparison. But maybe to keep things simple for a benchmark, no compression at all might make sense.

@y4n9squared
Author

For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass exactly the same numcodecs instance as compressors to create_array, which should make for a fair comparison. But maybe to keep things simple for a benchmark, no compression at all might make sense.

If I pass a numcodecs type directly to create_array,

compressors = [numcodecs.Blosc(cname="lz4")]

I get this traceback:

  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 3926, in create_array
    array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
                                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        compressors=compressors,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dtype=dtype_parsed,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in _parse_chunk_encoding_v3
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in <genexpr>
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
                            ~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
    raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

ah, sorry for the confusion, I thought you were making zarr v2 data (i.e., you were calling create_array with zarr_format=2). As you learned, create_array with zarr_format=3 won't take the v2 numcodecs codecs (and we should make a better error message for this).

@y4n9squared
Author

Do you know what causes the discrepancy in the compression results then? I expected that, even though the Python syntax has changed, the actual compression behavior/results would have been the same.
