feat/batch creation #2665

d-v-b · 2025-01-07T13:27:35Z

This PR adds a few routines for creating a collection of arrays and groups (i.e., a dict with path-like keys and ArrayMetadata / GroupMetadata values) in storage concurrently.

create_hierarchy takes a dict representation of a hierarchy, parses that dict to ensure that there are no implicit groups (creating group metadata documents as needed), then invokes create_nodes and yields the results
create_nodes concurrently writes metadata documents to storage, and yields the created AsyncArray / AsyncGroup instances.

I still need to wire up concurrency limits, and test them.

TODO:

Add unit tests and/or doctests in docstrings
Add docstrings and API docs for any new/modified user-facing classes and functions
New/modified features documented in docs/tutorial.rst
Changes documented in docs/release.rst
GitHub Actions have all passed
Test coverage is 100% (Codecov passes)

d-v-b · 2025-01-10T14:40:15Z

this is now working, so I would appreciate some feedback on the design.

The basic design is the same as what I outlined earlier in this PR: there are two new functions that take a dict[path, GroupMetadata | ArrayMetadata] like {'a': GroupMetadata(zarr_format=3), 'a/b': ArrayMetadata(...)} and concurrently persist those metadata documents to storage, resulting in a hierarchy on disk that looks like the dict.

approach

basically the same as concurrent group members listing, except we don't need any recursion. I'm scheduling writes and using as_completed to yield Arrays / Groups when they are available.
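A minimal sketch of the "schedule writes, then yield with as_completed" pattern described above. It does not use the zarr API: the store is a plain dict, and `write_metadata` and `create_nodes_sketch` are hypothetical names used only for illustration.

```python
import asyncio
from collections.abc import AsyncIterator


async def write_metadata(store: dict[str, bytes], key: str, doc: bytes) -> str:
    # Hypothetical stand-in for a store write; a real store would do IO here.
    await asyncio.sleep(0)
    store[key] = doc
    return key


async def create_nodes_sketch(
    store: dict[str, bytes], docs: dict[str, bytes]
) -> AsyncIterator[str]:
    # Schedule every write up front so they can all make progress concurrently.
    tasks = [asyncio.ensure_future(write_metadata(store, k, v)) for k, v in docs.items()]
    # Yield each key as soon as its write finishes, in completion order.
    for finished in asyncio.as_completed(tasks):
        yield await finished


async def main() -> list[str]:
    store: dict[str, bytes] = {}
    docs = {"a/zarr.json": b"{}", "a/b/zarr.json": b"{}"}
    return [key async for key in create_nodes_sketch(store, docs)]


created = asyncio.run(main())
```

Because results arrive in completion order, callers should not rely on the order of the input dict.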

new functions

create_nodes is low-level and doesn't do any checking of its input, so it will happily create invalid hierarchies, e.g. nesting groups inside arrays, or mixing v2 and v3 metadata, and it won't create intermediate groups, either.
create_hierarchy is higher level: it parses the input, checks it for invalid hierarchies, and inserts implicit groups as needed.
Group.create_hierarchy is a new method on the Group / AsyncGroup classes that takes a hierarchy dict and creates the nodes specified in that dict at locations relative to the path of the group instance. The return value is dict[str, AsyncGroup | AsyncArray], but I guess it also doesn't have to return anything, or it could be an async iterator, so that you can interact with the nodes as they are formed. This is flexible right now, but I think the iterator idea is nice.
_from_flat (names welcome) is a new function that creates a group entirely from a hierarchy dict + a store. that dict must specify a root group, otherwise an exception is raised. We could revise this to create a root group if one is not specified. Open to suggestions here.
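As a rough illustration of the kind of check that separates create_hierarchy from create_nodes, the sketch below finds nodes nested inside arrays. It is an assumption about the shape of such a check, not the PR's code: `find_nodes_inside_arrays` is a hypothetical helper, and node kinds are plain strings rather than metadata objects.

```python
from pathlib import PurePosixPath


def find_nodes_inside_arrays(kinds: dict[str, str]) -> list[str]:
    """Return every key that has an array among its ancestors.

    `kinds` maps a path-like key to "array" or "group" (a simplification
    of ArrayMetadata / GroupMetadata values).
    """
    arrays = {PurePosixPath(k) for k, kind in kinds.items() if kind == "array"}
    bad = []
    for key in kinds:
        # PurePosixPath("a/b/c").parents yields a/b, a, and "." in that order.
        if any(parent in arrays for parent in PurePosixPath(key).parents):
            bad.append(key)
    return bad


invalid = find_nodes_inside_arrays({"a": "group", "a/b": "array", "a/b/c": "group"})
```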

Implicit groups

Partial hierarchies like {'a': GroupMetadata(), 'a/b/c': ArrayMetadata(...)} implicitly denote a group at a/b. When creating such a hierarchy, if we find an existing group at a/b, then we don't need to create a new one. So in the context of modeling a hierarchy, implicit groups are a little special -- by not specifying the properties of the group, the user / application is tolerant of any group being there. So I introduced a subclass of GroupMetadata called _ImplicitGroupMetadata, which can be inserted into a hierarchy dict to explicitly denote groups that don't need to be written if one already exists. _ImplicitGroupMetadata is just like GroupMetadata except it errors if you try to set any parameter except zarr_format.
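The idea of filling in implicit groups can be sketched as follows. This is a simplified illustration, not the PR's implementation: the `IMPLICIT` marker is a hypothetical stand-in for _ImplicitGroupMetadata, node values are plain strings, and root-group handling is left out.

```python
from pathlib import PurePosixPath

# Hypothetical marker standing in for _ImplicitGroupMetadata.
IMPLICIT = "implicit-group"


def add_implicit_groups(nodes: dict[str, object]) -> dict[str, object]:
    """Return a copy of `nodes` with a marker for every missing intermediate group."""
    out = dict(nodes)
    for key in nodes:
        for parent in PurePosixPath(key).parents:
            name = str(parent)
            # "." is the root of a relative path; this sketch does not model it.
            if name != "." and name not in out:
                out[name] = IMPLICIT
    return out


filled = add_implicit_groups({"a": "group", "a/b/c": "array"})
```

Explicitly specified nodes are never replaced, so only the unspecified `a/b` receives the marker here.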

streaming v2 vs v3 node creation

creating v3 arrays / groups requires writing 1 metadata document, but v2 requires 2. To get the most concurrency I await the write of each metadata document separately, which means that foo/.zattrs might resolve before foo/.zarray. So in the v2 case I only yield up an array / group when both documents were written.
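One way to picture the v2 case is to track which documents are still outstanding for each node and only yield the node once that set is empty. The sketch below assumes a dict-backed store and hypothetical helper names (`write`, `create_v2_nodes`); it is not the code in this PR.

```python
import asyncio
from collections.abc import AsyncIterator


async def write(store: dict[str, bytes], key: str, doc: bytes) -> str:
    # Hypothetical stand-in for a store write.
    await asyncio.sleep(0)
    store[key] = doc
    return key


async def create_v2_nodes(
    store: dict[str, bytes], nodes: dict[str, dict[str, bytes]]
) -> AsyncIterator[str]:
    # `nodes` maps a node path to its documents, e.g. {".zarray": ..., ".zattrs": ...}.
    pending = {path: set(docs) for path, docs in nodes.items()}
    tasks = [
        asyncio.ensure_future(write(store, f"{path}/{name}", doc))
        for path, docs in nodes.items()
        for name, doc in docs.items()
    ]
    for finished in asyncio.as_completed(tasks):
        key = await finished
        path, _, name = key.rpartition("/")
        pending[path].discard(name)
        # Only yield the node once all of its documents have been written.
        if not pending[path]:
            yield path


async def main() -> list[str]:
    store: dict[str, bytes] = {}
    nodes = {
        "foo": {".zarray": b"{}", ".zattrs": b"{}"},
        "bar": {".zgroup": b"{}", ".zattrs": b"{}"},
    }
    return [path async for path in create_v2_nodes(store, nodes)]


ready = asyncio.run(main())
```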

Overlap with metadata consolidation logic

there's a lot of similarity between the stuff in this PR and routines used for consolidated metadata. it would be great to find ways to factor out some of the overlap areas

still to do:

write some more tests (checking that implicit groups don't get written if a group already exists)
handle overwriting. I think the plan here is, if overwrite is False, then we do a check before any writing to ensure that there are no conflicts between the proposed hierarchy and the stuff that actually exists in storage. this check will involve more IO.

d-v-b · 2025-01-25T16:54:10Z

@TomNicholas you should have a look at some of these new functions / methods. I'd be happy to change things if you have some datatree conventions you'd like to suggest

dcherian · 2025-01-27T20:40:03Z

src/zarr/core/group.py

+        This method takes a dictionary where the keys are the names of the arrays or groups
+        to create and the values are the metadata or objects representing the arrays or groups.
+
+        The method returns an asynchronous iterator over the created nodes.


A note about when to use this would be great

good point, would something like "use this method to create an entire tree of sub-groups and / or sub-arrays efficiently." suffice?

dcherian · 2025-01-27T20:40:13Z

src/zarr/core/group.py

@@ -1407,6 +1395,42 @@ async def _members(
        ):
            yield member

+    # TODO: find a better name for this. create_tree could work.


hierarchy is fine to me.

dcherian · 2025-01-27T20:45:46Z

src/zarr/core/group.py

+    """
+    ctx: asyncio.Semaphore | contextlib.nullcontext[None]
+
+    if semaphore is None:


why do we need None as an option here? And why is it "no semaphore" instead of "default concurrency semaphore"?

create_nodes is low-level, kind of dangerous, and the expectation is that the user of this function knows what they are doing. So it doesn't default with any concurrency limit. It also doesn't check if nodes already exist, or if the user wants to nest arrays inside arrays. A higher level function (create_hierarchy) does all that checking, and that's where the concurrency limit defaults to the value in the config.

so why not require the semaphore if it's an advanced user?

the semaphore is only necessary if you want some concurrency limiting mechanism, but I think that's a special case. A default of None means that users of the function don't need to run from asyncio import Semaphore before creating some nodes. E.g., there are a lot of places in the tests where I would want to use create_nodes, and in basically none of those places would a semaphore be necessary.

dcherian · 2025-01-27T20:48:08Z

src/zarr/core/group.py

+    # We will iterate over the dict again, but a full pass here ensures that the error message
+    # is comprehensive, and I think the performance cost will be negligible.
+    for k, v in data.items():
+        observed_zarr_formats[v.zarr_format].append(k)


can we short-circuit inside the loop and raise on the first instance of zarr_format that is not equal to next(iter(data.values())).zarr_format?

we definitely could, but we get a much nicer error message if we can traverse the entire proposed hierarchy and identify all the problematic nodes.
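A small sketch of the full-pass approach, assuming plain integers in place of metadata objects. `group_by_format` and `check_zarr_formats` are hypothetical names; the point is only that one pass over the dict lets the error message name every node for each format.

```python
from collections import defaultdict


def group_by_format(formats: dict[str, int]) -> dict[int, list[str]]:
    """Map each observed zarr_format to the keys that use it."""
    observed: defaultdict[int, list[str]] = defaultdict(list)
    for key, fmt in formats.items():
        observed[fmt].append(key)
    return dict(observed)


def check_zarr_formats(formats: dict[str, int]) -> None:
    observed = group_by_format(formats)
    if len(observed) > 1:
        # The message lists all nodes per format, not just the first mismatch.
        raise ValueError(f"Got data with multiple zarr formats: {observed}")


grouped = group_by_format({"a": 2, "a/b": 3, "c": 3})
```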

dcherian · 2025-01-27T20:48:53Z

src/zarr/core/group.py

+    document stored at store_path.path / (.zgroup | .zarray). If no such document is found,
+    raise a KeyError.
+    """
+    # TODO: consider first fetching array metadata, and only fetching group metadata when we don't


i think it's better to just grab all 3 at once honestly

dcherian · 2025-01-27T20:55:45Z

tests/conftest.py

+"""
+
+
+def create_array_metadata(


seems like this would duplicate logic from elsewhere? Is there nowhere else this function could be used?

this should be duplicated, but sadly these functions don't exist in the codebase yet! I add some functions like this in another PR: #2761

d-v-b · 2025-01-28T16:25:33Z

question: should we export a sync version of create_hierarchy from the top-level zarr namespace?

dcherian · 2025-01-28T21:48:33Z

question: should we export a sync version of create_hierarchy from the top-level zarr namespace?

Yes, this would be used in Xarray.

d-v-b · 2025-01-28T22:15:10Z

this PR adds a few functions that have async implementations and sync wrappers, like create_hierarchy_a (async) and create_hierarchy. Instead of putting (async, sync) pairs in the same module with slightly different names, I wonder if we should split the group module into an async-only namespace and a sync namespace (that imports stuff from the async namespace)? Then we don't need to mangle function names.
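For readers unfamiliar with the (async, sync) pairing, here is a minimal sketch of a sync wrapper that drains an async iterator. The function names are hypothetical, and asyncio.run is used only as a stand-in for whatever mechanism the library uses to run coroutines from synchronous code.

```python
import asyncio
from collections.abc import AsyncIterator


async def create_hierarchy_async(nodes: dict[str, str]) -> AsyncIterator[tuple[str, str]]:
    # Hypothetical async implementation: yields (path, node) pairs as they are "created".
    for key, value in nodes.items():
        await asyncio.sleep(0)
        yield key, value


def create_hierarchy_sync(nodes: dict[str, str]) -> list[tuple[str, str]]:
    # Sync wrapper: collect everything the async iterator yields, then return it.
    async def collect() -> list[tuple[str, str]]:
        return [item async for item in create_hierarchy_async(nodes)]

    return asyncio.run(collect())


result = create_hierarchy_sync({"a": "group", "a/b": "array"})
```

Splitting the two into separate namespaces, as suggested above, would let both functions share the name create_hierarchy.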

d-v-b added 8 commits December 11, 2024 15:38

sketch out batch creation routine

8faf994

scratch state of easy batch creation

8952911

Merge branch 'main' of https://github.com/d-v-b/zarr-python into feat…

de3c594

…/batch-creation

rename tupleize keys

c700e39

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

986d68b

…into feat/batch-creation

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

97b768f

…at/batch-creation

Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…

b6bf2dd

…to feat/batch-creation

tests and proper implementation for create_nodes and create_hierarchy

57ceb64

d-v-b requested review from jhamman and dcherian January 7, 2025 13:27

d-v-b added 7 commits January 7, 2025 14:28

privatize

181d3d0

use Posixpath instead of Path in tests; avoid redundant cast

e8e6107

restore cast

4f2c954

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

dd4174c

…at/batch-creation

pureposixpath instead of posixpath

cf72834

group-level create_hierarchy

e2cff8c

docstring

0912ecb

normanrz added this to the After 3.0.0 milestone Jan 7, 2025

d-v-b added 3 commits January 8, 2025 19:12

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

04f7922

…into feat/batch-creation

sketch out from_flat for groups

089feef

better concurrency for v2

116ab87

dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025

Merge branch 'main' of https://github.com/zarr-developers/zarr-python …

246f862

…into feat/batch-creation

github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025

d-v-b added 2 commits January 9, 2025 21:42

revert change to default concurrency

e38c1ca

create root correctly

2fb9083

d-v-b mentioned this pull request Jan 10, 2025

creating groups from dicts #2685

Open

d-v-b added 2 commits January 10, 2025 15:03

working _from_flat

b099fba

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

64b54bf

…at/batch-creation

d-v-b added 2 commits January 22, 2025 23:18

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

9d2f642

…at/batch-creation

Merge branch 'main' into feat/batch-creation

ed0d52a

TomNicholas mentioned this pull request Jan 25, 2025

DataTree + Zarr-Python 3 pydata/xarray#9984

Open

2 tasks

dcherian reviewed Jan 27, 2025

d-v-b added 5 commits January 28, 2025 16:14

use store + path instead of StorePath for hierarchy api

661678f

docstrings

7a718d5

docstrings

23bfef5

Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…

619eeb5

…at/batch-creation

release notes

5282534

refactor sync / async functions, and make tests more compact accordingly

6507e43

d-v-b removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 28, 2025

keyerror -> filenotfounderror

6b56342

keyerror -> filenotfounderror, fixup

3be878d

github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 28, 2025

d-v-b added 5 commits January 28, 2025 23:15

Merge branch 'main' into feat/batch-creation

774eeda

add top-level exports

f3c506f

Merge branch 'feat/batch-creation' of github.com:d-v-b/zarr-python in…

60379a7

…to feat/batch-creation

mildly refactor node input validation

32e06fa

simplify path normalization

8bd0b57

This was referenced Jan 29, 2025

Initializing a group or array is not thread-safe, even with mode='w' #1435

Open

[do not merge] sketch of a refactor for core modules #2780

Closed
