Improve write performance of shards #2977


Draft: wants to merge 6 commits into `main`

Conversation

@balbasty

The poor write performance of sharded zarrs in the zarr-python implementation is currently a major limiting factor to its adoption by our group. We found that writing shard-by-shard into an empty sharded array is an order of magnitude slower than writing into unsharded zarrs. This is surprising, as writing full shards should only be marginally slower than writing unsharded chunks.

While this 2023 discussion suggests that the latency is caused by the re-generation of the index table, this does not seem to be the case in the latest implementation, which keeps all chunks in memory and (properly) waits for all chunks to be available before generating the index table (see `_encode_partial_single`).

Instead I found that a major cause of slowdown comes from the implementation of the Buffer class, which calls np.concatenate every time bytes are added to the buffer. As a proof of concept, I have implemented an alternative DelayedBuffer class that keeps individual byte chunks in a list, and only concatenates them when needed. On a simple benchmark that uses 512**3 shards and 128**3 chunks and a local store, it reduces the time to write one shard from ~10 sec to ~1 sec, which is on par with the time taken to write the same 512**3 array in an unsharded zarr (~0.9 sec).
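For illustration, the delayed-concatenation idea can be sketched as follows. This is not the PR's actual implementation (which must integrate with zarr's buffer prototype logic); class and method names here are hypothetical:

```python
import numpy as np


class DelayedBuffer:
    """Accumulate byte chunks in a list; concatenate only on demand.

    Minimal sketch of the idea: appending becomes O(1) per chunk instead
    of an O(total size) np.concatenate on every append.
    """

    def __init__(self) -> None:
        self._chunks: list[np.ndarray] = []

    def append(self, data: bytes) -> None:
        # Store a view of the incoming bytes; existing data is not copied.
        self._chunks.append(np.frombuffer(data, dtype=np.uint8))

    def materialize(self) -> np.ndarray:
        # A single concatenation, performed only when the contiguous
        # buffer is actually needed (e.g. to write the shard to a store).
        if len(self._chunks) > 1:
            self._chunks = [np.concatenate(self._chunks)]
        return self._chunks[0] if self._chunks else np.empty(0, dtype=np.uint8)


buf = DelayedBuffer()
for chunk in (b"abc", b"def", b"gh"):
    buf.append(chunk)
assert bytes(buf.materialize()) == b"abcdefgh"
```

With n appends of roughly equal size, the eager approach copies O(n^2) bytes in total, while the delayed approach copies O(n) bytes once at the end.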

I am keeping this as a draft for now as it is a hacky proof-of-concept implementation, but I am happy to clean it up if this is found to be a good solution (with guidance on how to implement the delayed buffer in a way that is compatible with the buffer prototype logic, which I don't fully understand). All tests pass except one that checks whether a store receives a `TestBuffer` (as it instead receives a `DelayedBuffer`).

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)


@github-actions bot added the "needs release notes" label (automatically applied to PRs which haven't added release notes) Apr 11, 2025
@d-v-b (Contributor) commented Apr 11, 2025

@balbasty thank you so much for this work. I think your detective work here will be very much appreciated.

general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of shards can be known prior to fetching anything.

@balbasty (Author)

> general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of shards can be known prior to fetching anything.

I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table). Otherwise compressed chunks would take more space than needed. The format is either index_table + stack([encoded_chunks]) or stack([encoded_chunks]) + index_table.
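For illustration, here is a sketch of the "index at the end" layout: variable-sized encoded chunks stacked back to back, followed by a fixed-size index of (offset, nbytes) pairs. The chunk contents are made up, and this omits details of the real sharding codec such as the index checksum:

```python
import struct

# Hypothetical encoded chunks (variable sizes after compression).
encoded_chunks = [b"\x01" * 100, b"\x02" * 57, b"\x03" * 230]

payload = b"".join(encoded_chunks)

# Index table: one (offset, nbytes) pair of little-endian uint64 per chunk.
# Its size is fixed (16 bytes per chunk), unlike the chunks themselves.
index = b""
offset = 0
for chunk in encoded_chunks:
    index += struct.pack("<QQ", offset, len(chunk))
    offset += len(chunk)

# stack([encoded_chunks]) + index_table layout; the other option is
# index_table + stack([encoded_chunks]).
shard = payload + index
assert len(shard) == sum(map(len, encoded_chunks)) + 16 * len(encoded_chunks)
```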

@d-v-b (Contributor) commented Apr 19, 2025

> > general question: why are we doing concatenation at all? is there a reason why we can't statically allocate all the memory we need in advance? I thought the sharding format gave explicit byte ranges for each chunk, and thus the size of any combination of shards can be known prior to fetching anything.
>
> I don't believe so. The index table has a fixed size, but the chunks have variable size (hence the index table). Otherwise compressed chunks would take more space than needed. The format is either index_table + stack([encoded_chunks]) or stack([encoded_chunks]) + index_table.

What I mean is that, when we get the index table, we also get the size of each compressed chunk. And when we are fetching chunks from a shard, we always know in advance which chunks we need. So it seems like the combination of the shard index + the set of requested chunks is sufficient to specify the required memory for compressed chunks exactly. Does this check out?
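The suggestion can be sketched as follows: given the (offset, nbytes) index and the set of requested chunk ids, the exact buffer size is known before any fetch, so a single allocation suffices. The index values and the `fetch` step here are stand-ins:

```python
import numpy as np

# Hypothetical shard index: one (offset, nbytes) row per chunk.
index = np.array([[0, 100], [100, 57], [157, 230]], dtype=np.uint64)

requested = [0, 2]  # chunk ids we plan to fetch from the shard

# Total compressed size of the requested chunks is known up front...
total = int(index[requested, 1].sum())
out = bytearray(total)  # ...so one exact allocation suffices; no growing.

pos = 0
for i in requested:
    nbytes = int(index[i, 1])
    # A real store would read index[i] byte ranges straight into this
    # slice; stand in with zeros here.
    out[pos:pos + nbytes] = b"\x00" * nbytes
    pos += nbytes
assert pos == total == 330
```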
