Proposal: group to list it's children #15

tam203 · 2019-04-11T15:17:25Z

If I read the 2.2 spec correctly then when opening a group there is no way of knowing the children of that group without doing a list. This seems sub-optimal to me. I'm usually working on S3 with very large Zarr files (1000s to 100,000s of objects), and a list operation in this setting is not very efficient.

I feel that if .zgroup listed its children then this would relieve the problem. Something like:

.zgroup -> contains: ["foo", "foo2"]
foo/.zgroup -> contains: ["bar"]
foo/bar/.zarray
foo2/.zarray
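
The layout above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a plain dict stands in for the object store, and the child list is stored under a hypothetical "children" key inside .zgroup, which is not part of the 2.2 spec. The helper name children_of is made up for the example.

```python
import json

# A dict stands in for an object store such as S3: keys map to bytes.
store = {
    ".zgroup": json.dumps({"zarr_format": 2, "children": ["foo", "foo2"]}).encode(),
    "foo/.zgroup": json.dumps({"zarr_format": 2, "children": ["bar"]}).encode(),
    "foo/bar/.zarray": b"{}",  # array metadata elided
    "foo2/.zarray": b"{}",
}


def children_of(store, group_path=""):
    """Return the child names of a group using a single GET and no listing.

    The "children" key is hypothetical; it is the addition being proposed.
    """
    key = f"{group_path}/.zgroup" if group_path else ".zgroup"
    meta = json.loads(store[key])
    return meta.get("children", [])


root_children = children_of(store)
foo_children = children_of(store, "foo")
```

Each call touches exactly one key, so the cost of opening a group no longer depends on how many objects sit under it.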

martindurant · 2019-04-11T15:22:06Z

Coming out of a similar need in xarray (which scans the whole structure when opened), you can "consolidate" all the metadata into a single key, if you wish, so that you only need one HTTP call and no listing: https://zarr.readthedocs.io/en/stable/api/convenience.html?highlight=consolidate#zarr.convenience.consolidate_metadata
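
As a rough illustration of what consolidation amounts to, here is a sketch over a plain dict standing in for the store. The ".zmetadata" key and the "zarr_consolidated_format" field are meant to mirror the layout zarr-python writes; the two helper functions are simplified stand-ins written for this example, not the library's implementation. With zarr-python itself the corresponding calls are zarr.consolidate_metadata(store) and zarr.open_consolidated(store).

```python
import json

META_NAMES = (".zgroup", ".zarray", ".zattrs")


def consolidate(store):
    """Copy every metadata document into a single '.zmetadata' key."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()  # the one expensive "list" happens here
        if key.rsplit("/", 1)[-1] in META_NAMES
    }
    store[".zmetadata"] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": metadata}
    ).encode()


def open_consolidated(store):
    """Read the metadata of the whole hierarchy with one GET."""
    return json.loads(store[".zmetadata"])["metadata"]


store = {
    ".zgroup": b'{"zarr_format": 2}',
    "foo/.zgroup": b'{"zarr_format": 2}',
    "foo/bar/.zarray": b'{"zarr_format": 2, "shape": [10]}',
    "foo/bar/0": b"chunk-bytes",
}
consolidate(store)
meta = open_consolidated(store)
```

The listing cost is paid once, by the writer, and every later reader needs only the single consolidated key. The trade-off is that any metadata change requires that key to be rewritten.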

tam203 · 2019-04-11T15:41:34Z

@martindurant interesting, thank you. This feels useful, but like a workaround rather than a solution. I'd be curious whether you agree. I'm particularly interested in "high momentum datasets": large but also rapidly changing datasets.

Background: I work at the UK Met Office, where we have a lot of this sort of data. I've written a couple of posts about this that explain where I'm coming from (your thoughts on them would be appreciated).

The system we are building will be dealing with many concurrent writes to different parts of the datasets. I feel that if there was a central point that every writer had to update, it would be a bottleneck that would cause us issues.

martindurant · 2019-04-11T15:53:43Z

Quite right, the consolidated case is meant for data that is not changing (often), at least in the metadata. I know there's been talk of "growing" zarrs, but I don't know what the best solution would be for your case. Your suggestion feels a little like a "partial consolidation", in that you store in one place information that is available, but expensive to obtain. There are probably a few shades between the standard files-only and the fully consolidated.

joshmoore · 2019-04-30T14:58:31Z

I can certainly see having "hints" which point to subgroups, but from the spec's point of view, what does one do when the filesystem and the list are out of sync?

From the user's point-of-view, I can also imagine:

zarr fsck which checks for discrepancies
zarr seal which simply performs consolidate, listing all the groups, but somehow marks that changes should be avoided.
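
A check of the first kind could look roughly like the sketch below. Both "zarr fsck" and "zarr seal" are ideas floated in this thread, not existing commands, and the "children" key in .zgroup is the hypothetical field from the proposal. A dict again stands in for the store, so iterating over its keys plays the role of the listing.

```python
import json


def fsck(store, group_path=""):
    """Compare a group's declared children with what a listing finds.

    Returns (missing, undeclared): names declared in .zgroup but absent
    from the store, and names present in the store but not declared.
    """
    prefix = f"{group_path}/" if group_path else ""
    declared = set(json.loads(store[prefix + ".zgroup"]).get("children", []))

    found = set()
    for key in store:  # the expensive listing step
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        # A direct child shows up as "<name>/.zgroup" or "<name>/.zarray".
        if len(parts) == 2 and parts[1] in (".zgroup", ".zarray"):
            found.add(parts[0])

    return sorted(declared - found), sorted(found - declared)


store = {
    ".zgroup": b'{"zarr_format": 2, "children": ["foo", "gone"]}',
    "foo/.zarray": b"{}",
    "extra/.zgroup": b'{"zarr_format": 2}',
}
missing, undeclared = fsck(store)
```

The check itself still needs a listing, so it suits an occasional maintenance pass rather than the read path.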

tam203 · 2019-05-13T13:46:27Z

Thanks for your input @joshmoore (sorry for the slow reply, I've been on holiday).

but from the spec's point of view, what does one do when the filesystem and the list are out of sync?

I think this is the key point. To me, and in the paradigm I'm working in at the moment, file systems cannot be trusted. I'm nearly always working with eventually consistent object storage, so if you access the same thing twice you might get a different answer.

Because of this I want the spec to have a single entry point (.zgroup?) from which all truth cascades. I don't want to rely on listing keys or any other interaction with the file system other than "give me this exact object" or "put this exact object".

In the moving (as in being updated, growing and/or rolling) datasets I'm working with I might have:

/.zgroup
/v1/.zarray
/v2/.zarray
/v3/.zarray

I'd want /.zgroup to point to only one of the v1, v2, v3 child arrays, accepting that because of eventual consistency I might open it and be pointed to /v2/... while you might get /v3/... We both get valid datasets, just different versions.
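
The pointer idea can be sketched as follows. Everything here is an assumption made for illustration: a dict stands in for the object store, the "current" key in .zgroup is a hypothetical pointer field, and a copied dict simulates a replica that has not yet seen the latest write.

```python
import json


def publish(store, version, array_meta):
    """Write the new version first, then flip the pointer with one PUT."""
    store[f"{version}/.zarray"] = json.dumps(array_meta).encode()
    store[".zgroup"] = json.dumps(
        {"zarr_format": 2, "current": version}  # "current" is hypothetical
    ).encode()


def open_current(store):
    """Resolve the current array using exact-key GETs only, no listing."""
    version = json.loads(store[".zgroup"])["current"]
    return version, json.loads(store[f"{version}/.zarray"])


store = {}
publish(store, "v2", {"shape": [100]})
stale_view = dict(store)  # a replica that has not seen the next write yet
publish(store, "v3", {"shape": [200]})

fresh = open_current(store)
stale = open_current(stale_view)
```

Because the array metadata is written before the pointer is updated, a reader that sees an older .zgroup is still led to a complete, self-consistent version.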

An example where I might use this is a rolling dataset where the data and metadata are stored in sibling arrays. As the data gets updated/advanced, the metadata needs to do the same. If the filesystem serves you a slightly older version of the .zgroup then you get pointed to the old metadata and old data. (This relies on some other changes/ideas too, like .zattrs being signposted from, or not being a separate object to, .zarray/.zgroup.)

I feel I'm sharing this shamelessly everywhere but if you're interested I've written a blog about this: How to (and not to) handle metadata in high momentum datasets

Thanks

joshmoore · 2019-05-17T11:24:58Z

I want the spec to have a single entry point (.zgroup?) from which all truth cascades.

Interesting. Thanks, I can definitely see the reasoning.

tam203 · 2019-05-17T16:20:27Z

@joshmoore there is some relevance between this issue and my comment on this 3.0 spec pull request. I think this 'cascading truth' approach might resolve both.

I'm mulling the idea that in the hierarchy a node knows its children but not its grandchildren or parents. This would allow you to enter a hierarchy at any point and work down but not up.

There would be some inefficiencies, especially in deep trees, but it would avoid a list operation (which is what I'm most scared of, working in S3 land a lot).
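
A downward walk of this kind might look like the sketch below. It assumes the hypothetical "children" key in .zgroup and uses a dict behind a small get function as a stand-in for the store; the requests list is only there to make the number of GETs visible.

```python
import json


def walk(get, path=""):
    """Yield every node path below `path`, using exact-key GETs only."""
    prefix = f"{path}/" if path else ""
    try:
        meta = json.loads(get(prefix + ".zgroup"))
    except KeyError:
        return  # no .zgroup here, so treat the node as a leaf (an array)
    for name in meta.get("children", []):
        child = prefix + name
        yield child
        yield from walk(get, child)


store = {
    ".zgroup": b'{"zarr_format": 2, "children": ["foo", "foo2"]}',
    "foo/.zgroup": b'{"zarr_format": 2, "children": ["bar"]}',
    "foo/bar/.zarray": b"{}",
    "foo2/.zarray": b"{}",
}
requests = []


def get(key):
    requests.append(key)  # record each GET so the cost is visible
    return store[key]


all_nodes = list(walk(get))         # enter at the root
below_foo = list(walk(get, "foo"))  # or enter part-way down
```

In this sketch every node costs one GET, including a failed probe for each array, which is the kind of inefficiency mentioned above. Recording each child's type next to its name would remove the probe.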

Thanks for the feedback.

jstriebel · 2022-11-23T14:35:11Z

Also referencing the consolidated metadata issue #136 here.

I assume that this could be a v3 extension, right? Probably as an extension in the entrypoint metadata, which would not be strictly required for reads. Tagging this as a possible extension for now, please let me know if this is not the case.

tam203 changed the title ~~.zgroup to list it's children~~ Proposal: .zgroup to list it's children Apr 11, 2019

manzt mentioned this issue Feb 17, 2020

Better support for hierachrical stores and nested urls gzuidhof/zarr.js#36

Closed

jstriebel added this to ZEP1 Nov 16, 2022

jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022

jstriebel removed this from ZEP1 Nov 23, 2022

jstriebel added protocol-extension Protocol extension related issue and removed core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec labels Nov 23, 2022

martindurant mentioned this issue Dec 8, 2022

Zarr V3 support fsspec/kerchunk#262

Closed

jstriebel changed the title ~~Proposal: .zgroup to list it's children~~ Proposal: group to list it's children Feb 22, 2023
