-
Notifications
You must be signed in to change notification settings - Fork 30
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consolidated metadata storage for Zarr v3 #136
Comments
Is the V3 considering metadata compression? I'm working with a dataset where the consolidated metadata (v2) has 1.2GB in size, and opening it from S3 takes a very long time to open. |
Hi @ianliu, thanks this is very useful to know. Adding a capability for compression into a v3 extension for consolidated metadata sounds like a good idea to me. |
@ianliu I'm curious what makes the metadata so big in your case? Do you have lots of arrays? Or very large custom attrs? V3 allows for defining of custom metadata encoding, so a compressed format could be used. Also, a major question for V3 is whether user metadata is a separate document or not (see #149 (comment)). If you have 1GB of custom attrs in a single array, that would be a strong motivation not to put user metadata together with the core array metadata. |
@rabernat in my particular case, I have many small arrays. It could be argued that there is a better data representation that would consolidate multiple arrays into a single array, thus reducing metadata, but the way our program is setup makes this a little bit trickier (mainly because our program produces results in parallel). |
💯 for potential consolidated metadata features but
would this suggest stripping the current consolidated metadata support in V3 until there's an extension? |
I've drafted a version of this at https://github.com/TomAugspurger/zeps/blob/consolidated-metadata/draft/ZEP0006.md. A few questions before turning that into a PR: Is this appropriate for a ZEP? Or would it go in the zarr-specs repo? One major design decision is whether this consolidated metadata should be embedded in the I'm also a bit curious about how much async zarr-python reduces the pain of hierarchies with a large number of nodes. With dozens of HTTP requests in flight at the same time, maybe it's not so bad? If #284 (listing the child nodes in the root |
I like the draft ZEP @TomAugspurger. Thanks for putting it together. As far as I understand, the current spec does not provide a way to add arbitrary extensions to a metadata document (this could be changed though). Provided we can address this challenge, we likely will want to support both an inline and external consolidated metadata object. // inline
"consolidated_metadata": {
"kind": "inline":
"metadata": {
"air": {
"shape": [2920, 25, 53],
"fill_value": 0,
...
"node_type": "array",
},
"lat": {
"shape": [25],
"fill_value": NaN,
...
"node_type": "array",
}
}
}
// external
"consolidated_metadata": {
"kind": "external":
"metadata": "consolidated_metadata.json"
} An alternative direction we could take is to define a Consolidated Metadata Store which could enable this functionality without needing to touch the root metadata document. |
In the sense that that's just not spelled out anywhere as a thing that can be done? Or that adding additional keys might break things?
I like the idea of future-proofing things by including a "kind" enum to indicate where to find the consolidated metadata, maybe we pick only one for the spec? This is minor, but I worry about putting too much complexity on the implementors. |
I don't think we have defined the expected behavior for what will happen if an implementation encounters and unexpected key. So yes, I think thinks could break. edit: there is this section in the spec:
Now, I don't know how/where you would put "must_understand" in this case. See also #270
+1 one, pick one for now. We can always expand this later. |
I'm not sure I see the motivation for disallowing extra keys in the metadata objects. That seems to make it hard to evolve the spec in forwards-compatible ways. Anyway, maybe in this case it's fine to use because we're making a "future version of the spec"? And I guess we'll set |
In my mind, this is a great ZEP.
That should definitely be some form of metadata in zarr.json. I'd very much be for using this as a way to kick the tires of the extension process. i.e., if you/we can't do it, then that needs to be fixed at the spec level.
I think it's implicitly possible but not (well-)documented. So 👍 for doing what we need to do to make it possible.
I'd urge caution here. This led to many issues in v2. Minimally it should be a "wrapping store" so that they can be composed, because otherwise they lead to an explosion of classes that are needed. (e.g. "NestedRemoteDirectoryStore") Additionally, I'd suggestthe metadata MUST be in the zarr.json regardless of the implementation.
👍 for having |
Whoops, I just pushed this as a PR at #309, rather than a ZEP, sorry.
I'll open a separate issue to discuss this. In my experience, STAC's method of each object containing a |
In the current draft of the v3 spec in zarr-developers/zarr-python#898, consolidated metadata has been implemented similarly to v2. Currently this metadata gets stored in the group
meta/root/consolidated
, but this is not part of the spec. Likely we need to specify an extension for how consolidated metadata should be handled.The text was updated successfully, but these errors were encountered: