-
-
Notifications
You must be signed in to change notification settings - Fork 304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zarr-v3 Consolidated Metadata #2113
Zarr-v3 Consolidated Metadata #2113
Conversation
This isn't handling nested groups correctly yet. My understanding (and I'll clarify this in the spec) is that given a structure like
we'd expect that the metadata from child-group, child-arr-1, and child-arr-2 should end up in the consolidated metadata. |
Ensures that nested children are listed properly.
This PR adds a recursive=True flag to Group.members, for recursively listing the members of some hierarhcy. This is useful for Consolidated Metadata, which needs to recursively inspect children. IMO, it's useful (and simple) enough to include in the public API.
Implements the optional Consolidated Metadata feature of zarr-v3.
4943be2
to
65a8bd4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good start here Tom! I have a few questions about parts you haven't implemented yet that should help explore the design a bit more:
- How are you thinking about wiring this into read operations (e.g.
getitem
,listdir
, etc.)? - How should we think about consistency w.r.t. the consolidated metadata field? When should we invalidate the consolidated metadata and when is it safe to keep using it?
- How can this approach be used to support reading/writing v2 consolidated metadata (
.zmetadata
)?
…olidated-metadata
…olidated-metadata
…olidated-metadata
…olidated-metadata
…olidated-metadata
Quick update here:
|
As I start to use this a bit, I'm rethinking the in-memory representation of consolidated metadata for a nested hierarchy. Specifically the
The motivation for this rethink is from the current implementation having to be careful about just using Here's an example: Given a hierarchy like
We'll represent the consolidated metadata as a flat mapping of (store) keys to values.
But in memory, what is the Should it match the flat structure on disk, where the keys imply the structure?
Or should it have a nested / tree-like structure, where just immediate children appear in
Right now I've implemented option 1. I'll out option 2 today. |
There's a behavior change here that's causing some xarray tests to fail. It comes down to whether we interpret
Zarr 2.x implemented the first one. This branch currently implements the second version. The rough flow is
I'll see how hard it is to implement what 2.x was doing, but I want to confirm that's the behavior we want. A limitation of 2.x's way of doing things is that you can only have one consolidated metadata for the entire Store: at the root. |
…olidated-metadata
…olidated-metadata
…onsolidated-metadata
…olidated-metadata
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger -- this came together so nicely! Way to go 😄
I left a few comments of substance but I think this can go in very soon (tomorrow?)
docs/consolidated_metadata.rst
Outdated
Consolidated Metadata | ||
===================== | ||
|
||
zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
zarr-python implements the `Consolidated Metadata_` extension to the Zarr Spec. | |
Zarr-Python implements the `Consolidated Metadata_` extension to the Zarr Spec. |
docs/consolidated_metadata.rst
Outdated
metadata reads get child Group or Array nodes will *not* require reads from the store. | ||
|
||
In Python, the consolidated metadata is available on the ``.consolidated_metadata`` | ||
attribute of the Group. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attribute of the Group. | |
attribute of the Group metadata. |
of the metadata, at the time they read the root node with its consolidated | ||
metadata. | ||
|
||
.. _Consolidated Metadata: https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html#consolidated-metadata |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These docs are great @TomAugspurger! 👏
src/zarr/api/asynchronous.py
Outdated
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup: | ||
""" | ||
Alias for :func:`open_group` with ``use_consolidated=True``. | ||
""" | ||
return await open_group(*args, use_consolidated=use_consolidated, **kwargs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup: | |
""" | |
Alias for :func:`open_group` with ``use_consolidated=True``. | |
""" | |
return await open_group(*args, use_consolidated=use_consolidated, **kwargs) | |
async def open_consolidated(*args: Any, use_consolidated: bool = True, **kwargs: Any) -> AsyncGroup: | |
""" | |
Alias for :func:`open_group` with ``use_consolidated=True``. | |
""" | |
return await open_group(*args, use_consolidated=True, **kwargs) |
Seems to me that open_consolidated
should always use use_consolidated=True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My goal here is to avoid a user accidentally passing use_consolidated
in **kwargs
and us silently overwriting it. I'll update this to raise if use_consolidated
isn't True.
src/zarr/core/array.py
Outdated
if metadata_dict.get("node_type") != "array": | ||
# This KeyError is load bearing for `open`. That currently tries | ||
# to open the node as an `array` and then falls back to opening | ||
# as a group. | ||
raise KeyError |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
perhaps raise NodeTypeValidationError
from #2310
# We already read zattrs and zgroup. Should we ignore these? | ||
v2_consolidated_metadata.pop(".zattrs") | ||
v2_consolidated_metadata.pop(".zgroup") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the future, we may choose to inform the caller that the consolidated metadata does not match the root group metadata but for now, this seems fine.
…olidated-metadata
…olidated-metadata
src/zarr/core/group.py
Outdated
@@ -82,42 +89,310 @@ def _parse_async_node(node: AsyncArray | AsyncGroup) -> Array | Group: | |||
raise TypeError(f"Unknown node type, got {type(node)}") | |||
|
|||
|
|||
def _json_convert(o: object) -> Any: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Forgot to actually delete this block.
Huge @TomAugspurger 🎉 🎉 🎉 !!! |
awesome work! |
Implements the optional Consolidated Metadata feature of zarr-v3: zarr-developers/zarr-specs#309, along with reading zarr v2 metadata.
The implementation defines a new dataclass:
ConsoliatedMetadata
. It's an optional field on the existingGroupMetadata
object. Opening as a draft until the PR to the zarr-specs repo finishes up.closes #1161
TODO: