Proposal: group to list it's children #15

Open
tam203 opened this issue Apr 11, 2019 · 8 comments
Labels
protocol-extension Protocol extension related issue

Comments

@tam203

tam203 commented Apr 11, 2019

If I read the 2.2 spec correctly, then when opening a group there is no way of knowing the children of that group without doing a list. This seems sub-optimal to me. I'm usually working on S3 with very large Zarr files (1000s to 100,000s of objects), and a list operation in this setting is not very efficient.

I feel that if .zgroup listed its children then this would relieve the problem. Something like:

.zgroup -> contains: ["foo", "foo2"]
foo/.zgroup -> contains:["bar"]
foo/bar/.zarray
foo2/.zarray
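
A minimal sketch of how a reader could use such a listing. It assumes a hypothetical `children` key inside `.zgroup` (the 2.2 spec defines no such field) and uses a plain dict as a stand-in for the object store. Only exact-key reads are needed, never a list operation.

```python
import json

# In-memory stand-in for an object store: key -> bytes.
# The "children" field is hypothetical; the v2 spec only defines "zarr_format".
store = {
    ".zgroup": json.dumps({"zarr_format": 2, "children": ["foo", "foo2"]}).encode(),
    "foo/.zgroup": json.dumps({"zarr_format": 2, "children": ["bar"]}).encode(),
    "foo/bar/.zarray": b"{}",  # array metadata elided for brevity
    "foo2/.zarray": b"{}",
}

def child_paths(store, group_path=""):
    """Return the paths of a group's direct children using a single GET."""
    prefix = f"{group_path}/" if group_path else ""
    meta = json.loads(store[prefix + ".zgroup"])
    return [prefix + name for name in meta.get("children", [])]

print(child_paths(store))         # ['foo', 'foo2']
print(child_paths(store, "foo"))  # ['foo/bar']
```
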
@martindurant
Member

Coming out of a similar need in xarray (which scans the whole structure when opened), you can "consolidate" all the metadata into a single key, if you wish, so that you only need one HTTP call and no listing: https://zarr.readthedocs.io/en/stable/api/convenience.html?highlight=consolidate#zarr.convenience.consolidate_metadata
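
The idea behind consolidation can be sketched with the standard library alone. This is a simplified illustration, not zarr's implementation: the `.zmetadata` key and the `zarr_consolidated_format` field mirror the layout written by zarr-python v2's `consolidate_metadata`, while the dict store and the `consolidate` helper are assumptions made for the example.

```python
import json

META_NAMES = (".zgroup", ".zarray", ".zattrs")

def consolidate(store, metadata_key=".zmetadata"):
    """Copy every metadata document into one key (needs one full listing, done once)."""
    metadata = {
        key: json.loads(value)
        for key, value in store.items()
        if key.rsplit("/", 1)[-1] in META_NAMES
    }
    store[metadata_key] = json.dumps(
        {"zarr_consolidated_format": 1, "metadata": metadata}
    ).encode()

store = {
    ".zgroup": b'{"zarr_format": 2}',
    "foo/.zarray": b'{"zarr_format": 2, "shape": [10]}',
    "foo/0": b"\x00" * 10,  # chunk data, not metadata
}
consolidate(store)

# A reader now needs a single GET to learn the whole hierarchy.
consolidated = json.loads(store[".zmetadata"])
print(sorted(consolidated["metadata"]))  # ['.zgroup', 'foo/.zarray']
```
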

@tam203 tam203 changed the title .zgroup to list it's children Proposal: .zgroup to list it's children Apr 11, 2019
@tam203
Author

tam203 commented Apr 11, 2019

@martindurant interesting, thank you. This feels useful, but more a workaround than a solution; I'd be curious whether you agree. I'm particularly interested in "high momentum datasets": large but also rapidly changing datasets.

Background: I work at the UK Met Office, where we have a lot of this sort of data. I've written a couple of posts about this that explain where I'm coming from (your thoughts on them would be appreciated).

The system we are building will be dealing with many concurrent writes to different parts of the datasets. I feel that if there were a central point that every writer had to update, it would be a bottleneck that would cause us issues.

@martindurant
Member

Quite right, the consolidated case is meant for data that is not changing (often), at least in the metadata. I know there's been talk of "growing" zarrs, but I don't know what the best solution would be for your case. Your suggestion feels a little like a "partial consolidation", in that you store in one place information that is available, but expensive to obtain. There are probably a few shades between the standard files-only and the fully consolidated.

@joshmoore
Member

I can certainly see having "hints" which point to subgroups, but from the spec's point of view, what does one do when the filesystem and the list are out of sync?

From the user's point-of-view, I can also imagine:

  • zarr fsck which checks for discrepancies
  • zarr seal which simply performs consolidate, listing all the groups, but somehow marks that changes should be avoided.
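
A rough sketch of what the check behind a `zarr fsck` could do. The command is only an idea floated in this comment; `fsck_group`, the `children` key, and the dict store are all hypothetical.

```python
import json

def fsck_group(store, group_path=""):
    """Compare a group's declared children with what a key listing finds.

    Returns (missing, undeclared): declared children with no metadata object,
    and children present in the store but absent from the declaration.
    """
    prefix = f"{group_path}/" if group_path else ""
    declared = set(json.loads(store[prefix + ".zgroup"]).get("children", []))

    # The expensive listing happens only here, in the offline check.
    found = set()
    for key in store:
        if not key.startswith(prefix):
            continue
        parts = key[len(prefix):].split("/")
        if len(parts) == 2 and parts[1] in (".zgroup", ".zarray"):
            found.add(parts[0])
    return sorted(declared - found), sorted(found - declared)

store = {
    ".zgroup": json.dumps({"zarr_format": 2, "children": ["foo", "gone"]}).encode(),
    "foo/.zarray": b"{}",
    "extra/.zarray": b"{}",
}
print(fsck_group(store))  # (['gone'], ['extra'])
```
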

@tam203
Author

tam203 commented May 13, 2019

Thanks for your input @joshmoore (sorry for the slow reply, I've been on holiday).

but from the spec's point of view, what does one do when the filesystem and the list are out of sync?

I think this is the key point. To me, and in the paradigm I'm working in at the moment, file systems cannot be trusted. I'm nearly always working with eventually consistent object storage, so if you access the same thing twice you might get a different answer.

Because of this I want the spec to have a single entry point (.zgroup?) from which all truth cascades. I don't want to rely on listing keys or any other interaction with the file system other than "give me this exact object" or "put this exact object".

In the moving (as in being updated, growing and/or rolling) datasets I'm working with I might have:

/.zgroup
/v1/.zarray
/v2/.zarray
/v3/.zarray

I'd want /.zgroup to point to only one of the v1, v2, v3 child arrays, accepting that because of eventual consistency I might open it and be pointed to /v2/... while you might get /v3/... We both get valid datasets, just different versions.
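
One way to picture this: a writer puts a complete new version under a fresh prefix and only then overwrites the root `.zgroup`, so a reader holding any copy of `.zgroup`, stale or fresh, is led to a complete dataset. A minimal sketch, assuming a hypothetical `current` field in `.zgroup` and a dict in place of the object store:

```python
import json

def publish(store, version, array_meta):
    """Write the new version first, then flip the root pointer with one PUT."""
    store[f"{version}/.zarray"] = json.dumps(array_meta).encode()
    # "current" is a hypothetical field, not part of any Zarr spec.
    store[".zgroup"] = json.dumps({"zarr_format": 2, "current": version}).encode()

def open_current(group_doc, store):
    """Resolve the array a given copy of .zgroup points to, using exact-key GETs."""
    version = json.loads(group_doc)["current"]
    return version, json.loads(store[f"{version}/.zarray"])

store = {}
publish(store, "v2", {"shape": [100]})
stale_root = store[".zgroup"]  # what a lagging replica might still serve
publish(store, "v3", {"shape": [110]})

print(open_current(stale_root, store))        # ('v2', {'shape': [100]})
print(open_current(store[".zgroup"], store))  # ('v3', {'shape': [110]})
```

The sketch does not cover when old versions can be deleted; they would have to stay readable for as long as a stale `.zgroup` might still be served.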

An example where I might use this is a rolling dataset where the data and metadata are stored in sibling arrays. As the data gets updated/advanced the metadata needs to do the same. If the filesystem serves you a slightly older version of the .zgroup then you get pointed to the old metadata and old data. (This relies on some other changes/ideas too, like .zattrs being signposted from, or not being a separate object to, .zarray/.zgroup.)

I feel I'm sharing this shamelessly everywhere but if you're interested I've written a blog about this: How to (and not to) handle metadata in high momentum datasets

Thanks

@joshmoore
Member

I want the spec to have a single entry point (.zgroup?) from which all truth cascades.

Interesting. Thanks, I can definitely see the reasoning.

@tam203
Author

tam203 commented May 17, 2019

@joshmoore there is some relevance between this issue and my comment on this 3.0 spec pull request. I think this 'cascading truth' approach might resolve both.

I'm mulling the idea that in the hierarchy a node knows its children but not its grandchildren or parents. This would allow you to enter a hierarchy at any point and work down but not up.

There would be some inefficiencies, especially in deep trees, but it would avoid a list operation (which is what I'm most scared of, working in S3 land a lot).
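
To make the trade-off concrete, here is a sketch of a downward walk that enters the hierarchy at an arbitrary node and counts exact-key reads. It assumes the same hypothetical `children` key in `.zgroup`; the `CountingStore` class and the rule that any child without a `.zgroup` is an array are simplifications made for the example.

```python
import json

class CountingStore(dict):
    """Dict-backed store that counts exact-key reads (stand-in for S3 GETs)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.gets = 0

    def fetch(self, key):
        self.gets += 1
        return self.get(key)  # None models a 404

def walk(store, path=""):
    """Yield every node at or below `path`; each node only names its direct children."""
    prefix = f"{path}/" if path else ""
    doc = store.fetch(prefix + ".zgroup")
    if doc is None:
        yield path, "array"  # assumption: anything that is not a group is an array
        return
    yield path, "group"
    for name in json.loads(doc).get("children", []):
        yield from walk(store, prefix + name)

store = CountingStore({
    ".zgroup": json.dumps({"children": ["foo", "foo2"]}).encode(),
    "foo/.zgroup": json.dumps({"children": ["bar"]}).encode(),
    "foo/bar/.zarray": b"{}",
    "foo2/.zarray": b"{}",
})

print(list(walk(store, "foo")))  # [('foo', 'group'), ('foo/bar', 'array')]
print(store.gets)                # 2
```

The walk costs one read per node visited, which is where the inefficiency in deep or wide trees comes from, but it never issues a list request.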

Thanks for the feedback.

@jstriebel jstriebel added this to ZEP1 Nov 16, 2022
@jstriebel jstriebel added the core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec label Nov 16, 2022
@jstriebel
Member

Also referencing the consolidated metadata issue #136 here.

I assume that this could be a v3 extension, right? Probably as an extension in the entrypoint metadata, which would not be strictly required for reads. Tagging this as a possible extension for now, please let me know if this is not the case.

@jstriebel jstriebel removed this from ZEP1 Nov 23, 2022
@jstriebel jstriebel added protocol-extension Protocol extension related issue and removed core-protocol-v3.0 Issue relates to the core protocol version 3.0 spec labels Nov 23, 2022
@jstriebel jstriebel changed the title Proposal: .zgroup to list it's children Proposal: group to list it's children Feb 22, 2023

4 participants