
Performance regression in V3 #2710

Open
y4n9squared opened this issue Jan 14, 2025 · 14 comments
Labels
bug Potential issues with the zarr-python library performance Potential issues with Zarr performance (I/O, memory, etc.)

Comments

@y4n9squared

Zarr version

3.0.0

Numcodecs version

0.14.1

Python Version

3.13

Operating System

Linux

Installation

Using uv

Description

This simple workload, which writes out 2**30 sequential float64 values (about 8 GiB) in 64 separate chunks,

# Using zarr==2.18.2
import numpy as np
import zarr
from zarr._storage.v3 import DirectoryStoreV3

store = DirectoryStoreV3("/tmp/foo.zarr")
arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
zarr.save_array(store, arr, zarr_version=3, path="/")

runs in about 5 s on my machine on version 2.18.

The equivalent workload on version 3 takes over a minute:

# Using zarr==3.0
import numpy as np
import zarr
import zarr.codecs
from zarr.storage import LocalStore

store = LocalStore("/tmp/bar.zarr")

compressors = zarr.codecs.BloscCodec(cname='lz4', shuffle=zarr.codecs.BloscShuffle.bitshuffle)

za = zarr.create_array(
    store,
    shape=(1024 * 1024 * 1024,),
    chunks=(1024 * 1024 * 16,),
    dtype=np.float64,
    compressors=compressors,
)

arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
za[:] = arr

Steps to reproduce

See above

Additional output

No response

@y4n9squared y4n9squared added the bug Potential issues with the zarr-python library label Jan 14, 2025
@d-v-b
Contributor

d-v-b commented Jan 14, 2025

ouch, sorry the performance is so much worse in 3.0. I am pretty sure we can do better, hopefully this is an easy fix. In case you or anyone else wants to dig into this, i would profile this function

@rabernat
Contributor

rabernat commented Jan 14, 2025

FWIW, I tried and failed to reproduce this regression locally. In fact, V3 was faster.

2.18.2

arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
# -> 8.0s
zarr.save_array(store, arr, zarr_version=3, path="/")
# -> 5.3s

3.0.1.dev10+g45146ca0

arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)
# -> 1.2s
za[:] = arr
# -> 9.9s

I wonder what could be the difference between environments. Perhaps the regression is hardware-dependent. I'm on a macbook with a fast SSD.

@y4n9squared
Author

y4n9squared commented Jan 14, 2025

Hmm, I don't know what your first 8 s measurement was. It should not take that long to allocate some chunk buffers in memory. I also have a MacBook Pro M3 Max, so I will rerun these and report back. The initial set of measurements I took was on an AWS EC2 m6i instance.

UPDATE:

My latencies also look much better on macOS, though Zarr V3 still measures slower.

Zarr 2.18.4

ZARR_V3_EXPERIMENTAL_API=1 ipython
Python 3.13.1 (main, Jan  5 2025, 06:22:40) [Clang 19.1.6 ]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.31.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import numpy as np

In [2]: import zarr

In [3]: from zarr._storage.v3 import DirectoryStoreV3

In [4]: %time arr = zarr.array(np.arange(1024 * 1024 * 1024, dtype=np.float64), chunks=(1024 * 1024 * 16,))
CPU times: user 3.29 s, sys: 525 ms, total: 3.82 s
Wall time: 1.25 s

In [5]: store = DirectoryStoreV3("/tmp/foo.zarr")

In [6]: %time zarr.save_array(store, arr, zarr_version=3, path="/")
CPU times: user 7.83 s, sys: 543 ms, total: 8.37 s
Wall time: 1.25 s

Zarr 3.0.0

In [5]: store = LocalStore("/tmp/bar.zarr")

In [6]: compressors = zarr.codecs.BloscCodec(cname="lz4", shuffle=zarr.codecs.BloscShuffle.bitshuffle)

In [8]: za = zarr.create_array(store, shape=(1024 * 1024 * 1024,), chunks=(1024 * 1024 * 16,), dtype=np.float64, compressors=compressors)

In [9]: arr = np.arange(1024 * 1024 * 1024, dtype=np.float64)

In [10]: %time za[:] = arr
CPU times: user 6.75 s, sys: 1.38 s, total: 8.13 s
Wall time: 3.69 s

I'll keep digging...

@y4n9squared
Author

y4n9squared commented Jan 15, 2025

My current hypothesis is that the benchmarking that I've been running on the EC2 instance (r7i.2xlarge) is memory-bandwidth constrained (AWS is very hand-wavy about memory bandwidth throttling). Even just allocating large arrays, e.g. np.ones(1024 ** 3), takes a few seconds (whereas on bare metal it would be in the hundreds of milliseconds).

However, because https://github.com/zarr-developers/zarr-python/blob/main/src/zarr/core/codec_pipeline.py#L396 is run in a coroutine on the event loop (even though there is no I/O), it effectively crowds out other tasks from running. For the 128 MiB chunk, calling np.array_equal takes about 150ms (again probably due to lack of memory bandwidth). But for the 64 chunks, this constitutes a full 10s of serial work on the event loop.

a1 = np.arange(1024 ** 2 * 16)
a2 = np.ones(1024 ** 2 * 16)
# this 128 MiB comparison takes 150 ms on my r7i.2xlarge EC2 instance and 50 ms on my M3 Max
np.array_equal(a1, a2)
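The crowding-out effect described above can be reproduced with a stdlib-only toy (an illustrative sketch, not zarr's pipeline; time.sleep stands in for the per-chunk comparison work):

```python
import asyncio
import time

def blocking_check(chunk_s: float) -> None:
    # Stand-in for blocking per-chunk work on the event loop
    # (e.g. the ~150 ms comparison per 128 MiB chunk described above).
    time.sleep(chunk_s)

async def serial(n: int, chunk_s: float) -> float:
    # Doing the blocking work inline in a coroutine serializes it:
    # nothing else on the loop can run while each check executes.
    t0 = time.perf_counter()
    for _ in range(n):
        blocking_check(chunk_s)
        await asyncio.sleep(0)  # yields, but the work already ran serially
    return time.perf_counter() - t0

async def offloaded(n: int, chunk_s: float) -> float:
    # Offloading to worker threads lets the checks overlap instead of
    # monopolizing the event loop.
    t0 = time.perf_counter()
    await asyncio.gather(
        *(asyncio.to_thread(blocking_check, chunk_s) for _ in range(n))
    )
    return time.perf_counter() - t0

if __name__ == "__main__":
    s = asyncio.run(serial(8, 0.05))
    p = asyncio.run(offloaded(8, 0.05))
    print(f"serial: {s:.2f}s, offloaded: {p:.2f}s")
```

With 64 chunks at ~150 ms each, the serial pattern alone accounts for roughly 10 s of wall time, matching the numbers above.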

If I short-circuit chunk_array.all_equal by setting it to False, then the rest of the pipeline finishes very quickly and I can write out the full array in ~3s, which brings me to parity with the performance I can get on Zarr 2.18.

The former implementation zarr.util.all_equal appears to compare the chunk to a scalar

return np.all(value == array)

which under the hood uses efficient SIMD (taking only 15 ms on the same machine), whereas the version 3 implementation is much more expensive.
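As a rough illustration of the cost difference (a sketch, not zarr's actual code): the scalar comparison avoids materializing a second full-size array, while a generic array-vs-array check has to touch twice the memory.

```python
import numpy as np

chunk = np.zeros(1024 * 1024, dtype=np.float64)  # small chunk for a quick demo
fill_value = 0.0

# v2-style check: scalar comparison, which NumPy vectorizes efficiently
v2_style = bool(np.all(fill_value == chunk))

# A more generic check that first materializes a full-size fill array
# (assumption: this mirrors the extra cost described above, not zarr's
# exact v3 implementation).
fill_array = np.full_like(chunk, fill_value)
generic = np.array_equal(chunk, fill_array)

assert v2_style == generic  # same answer, different cost
```

One caveat with the scalar form: a NaN fill value needs separate np.isnan handling, since NaN != NaN.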

@dstansby dstansby added the performance Potential issues with Zarr performance (I/O, memory, etc.) label Jan 15, 2025
@y4n9squared
Author

Hi, do the maintainers have any thoughts on this? I am happy to take a look if there is a change that you are willing to accept, but I'd first need some context on why the decision was made to change the implementation of all_equal.

@rabernat
Contributor

@y4n9squared - thanks very much for your work on this. We are definitely interested in fixing this performance regression and appreciate your input.

@d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the implementation back to the way it used to be.

@y4n9squared
Author

Hi @d-v-b , just so we have something concrete to discuss, I opened a draft PR with the changes that I made which showed improvement.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here.

one question about the examples you shared upthread: In zarr-python 2.18, the default value for write_empty_chunks is True, but in 3.x it's False.

Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting write_empty_chunks might give you misleading results.
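The skip logic the default controls can be sketched in a few lines of plain numpy (a hypothetical illustration of the behavior, not zarr's actual code):

```python
import numpy as np

fill_value = 0.0
chunks = [np.zeros(4), np.array([0.0, 1.0, 0.0, 0.0]), np.zeros(4)]

write_empty_chunks = False  # the zarr-python 3.x default
written = []

for i, chunk in enumerate(chunks):
    is_empty = bool(np.all(chunk == fill_value))
    if is_empty and not write_empty_chunks:
        # chunk is elided; readers reconstruct it from fill_value
        continue
    written.append(i)

print(written)  # only the non-empty chunk index is stored
```

With write_empty_chunks=True (the 2.18 default), every chunk is written and the emptiness check is skipped entirely, which is why the two versions do different amounts of work on the same data.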

For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

@d-v-b is the one who wrote that code, so let's get his input on it. From my perspective, I am very open to changing the way this is implemented back the the way it used to be.

Contrary to what git blame suggests, I didn't write that implementation of all_equal -- it appeared in #1826. In any case, we should definitely go with an implementation that has the best all-around performance.

@y4n9squared
Author

y4n9squared commented Jan 19, 2025

@y4n9squared thanks for the report, we are definitely happy to fix performance regressions here.

one question about the examples you shared upthread: In zarr-python 2.18, the default value for write_empty_chunks is True, but in 3.x it's False.

Because of this change in default behavior, benchmarking across zarr-python versions without explicitly setting write_empty_chunks might give you misleading results.

@d-v-b Thanks for the callout. I'll rerun the comparison so that it's apples-to-apples. Along these lines, has there been a change to the compressor behavior as well?

IIUC, in Zarr 2.x, if one did not override the compressor, the default setting was Blosc/LZ4 with shuffle, where the shuffle behavior (bit vs. byte) depended on the dtype -- for float64, it would do bit shuffle.

When I compare the compressed chunk sizes with this 3.x code:

store = "/tmp/foo.zarr"
shape = (1024 * 1024 * 1024,)
chunks = (1024 * 1024 * 16,)
dtype = np.float64
fill_value = np.nan

# cname = "blosclz"
cname = "lz4"
compressors = zarr.codecs.BloscCodec(cname=cname, shuffle=zarr.codecs.BloscShuffle.bitshuffle)

za = zarr.create_array(
    store,
    shape=shape,
    chunks=chunks,
    dtype=dtype,
    fill_value=fill_value,
    compressors=compressors,
)

arr = np.arange(1024 * 1024 * 1024, dtype=dtype).reshape(shape)
za[:] = arr

I see notable differences in the sizes. Curiously, if I switch the cname to blosclz instead, I get chunk sizes that are more similar to Zarr 2.x.

What is the correct compressor config for an accurate 2.x comparison?

For the broader question of benchmarking in the context of throttled or low memory bandwidth, can you recommend any tools for artificially reducing memory bandwidth? I'd like to be able to benchmark these things locally.

I'm not entirely sure. Aside from the general notion that EC2 instances are VMs and small instance types like the ones that I've been using (m6i.2xlarge, r7i.2xlarge) are more susceptible to "noisy neighbors", I haven't been able to pinpoint why I see the slowdowns that I do. Some of the normal tools that I'd use to inspect these things (e.g. Linux perf) are handicapped in VMs. I am going to look into this a bit more and report back.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

Thanks for iterating on this, this is exactly what we need to dial in the performance of the library.

I see notable differences in the sizes. Curiously, if I switch the cname to blosclz instead, I get chunk sizes that are more similar to Zarr 2.x.

What is the correct compressor config for an accurate 2.x comparison?

For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass exactly the same numcodecs instance as compressors to create_array, which should make for a fair comparison. But maybe to keep things simple for a benchmark, no compression at all might make sense.

@y4n9squared
Author

For zarr v2 data, all of the zarr-python v2 codecs are available in zarr-python 3.0. You should be able to pass exactly the same numcodecs instance as compressors to create_array, which should make for a fair comparison. But maybe to keep things simple for a benchmark, no compression at all might make sense.

If I pass a numcodecs type directly to create_array,

compressors = [numcodecs.Blosc(cname="lz4")]

I get this traceback:

  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 3926, in create_array
    array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
                                            ~~~~~~~~~~~~~~~~~~~~~~~~^
        compressors=compressors,
        ^^^^^^^^^^^^^^^^^^^^^^^^
    ...<2 lines>...
        dtype=dtype_parsed,
        ^^^^^^^^^^^^^^^^^^^
    )
    ^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in _parse_chunk_encoding_v3
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/core/array.py", line 4127, in <genexpr>
    out_bytes_bytes = tuple(_parse_bytes_bytes_codec(c) for c in maybe_bytes_bytes)
                            ~~~~~~~~~~~~~~~~~~~~~~~~^^^
  File "/home/yang.yang/workspaces/zarr-python/src/zarr/registry.py", line 184, in _parse_bytes_bytes_codec
    raise TypeError(f"Expected a BytesBytesCodec. Got {type(data)} instead.")
TypeError: Expected a BytesBytesCodec. Got <class 'numcodecs.blosc.Blosc'> instead.

@d-v-b
Contributor

d-v-b commented Jan 19, 2025

ah, sorry for the confusion, I thought you were making zarr v2 data (i.e., you were calling create_array with zarr_format=2). As you learned, create_array with zarr_format=3 won't take the v2 numcodecs codecs (and we should make a better error message for this).

@y4n9squared
Author

Do you know what causes the discrepancy in the compression results then? I expected that, even though the Python syntax has changed, the actual compression behavior/results would have been the same.
