Proposal: group to list its children #15
Coming out of a similar need in xarray (which scans the whole structure when opened): you can "consolidate" all the metadata into a single key, if you wish, so that you only need one HTTP call and no listing: https://zarr.readthedocs.io/en/stable/api/convenience.html?highlight=consolidate#zarr.convenience.consolidate_metadata
@martindurant interesting, thank you. This feels useful, but a workaround rather than a solution; I'd be curious if you agree. I'm particularly interested in "high momentum datasets": large but also rapidly changing datasets. Background: I work at the UK Met Office, where we have a lot of this sort of data. I've written a couple of posts about this that explain where I'm coming from (and your thoughts on them would be appreciated). The system we are building will be dealing with many concurrent writes to different parts of the datasets. I feel that if there was a central point that all writers had to update, this would be a bottleneck that would cause us issues.
Quite right, the consolidated case is meant for data that is not changing (often), at least in the metadata. I know there's been talk of "growing" zarrs, but I don't know what the best solution would be for your case. Your suggestion feels a little like a "partial consolidation", in that you store in one place information that is available elsewhere, but expensive to obtain. There are probably a few shades between the standard files-only and the fully consolidated.
I can certainly see having "hints" which point to subgroups, but from the spec's point of view, what does one do when the filesystem and the list are out of sync? From the user's point of view, I can also imagine:
Thanks for your input @joshmoore (sorry for the slow reply, I've been on holiday).
I think this is the key point. To me, and in the paradigm I'm working in at the moment, file systems cannot be trusted. I'm nearly always working with eventually consistent object storage, so if you access the same thing twice you might get a different answer. Because of this I want the spec to have a single entry point.

In the moving (as in being updated, growing and/or rolling) datasets I'm working with I might have: … I'd want …

An example where I might use this is a rolling dataset where the data and metadata are stored in sibling arrays. As the data gets updated/advanced, the metadata needs to do the same. If the filesystem serves you a slightly older version of the metadata, …

I feel I'm sharing this shamelessly everywhere, but if you're interested I've written a blog about this: How to (and not to) handle metadata in high momentum datasets.

Thanks
Interesting. Thanks, I can definitely see the reasoning.
@joshmoore there is some relevance between this issue and my comment on the 3.0 spec pull request; I think this "cascading truth" approach might resolve both. I'm mulling the idea that in the hierarchy a node knows its children but not its grandchildren or parents. This would allow you to enter a hierarchy at any point and work down, but not up. There would be some inefficiencies, especially in deep trees, but it would avoid a list operation (which is what I'm most scared of, working in S3 land a lot). Thanks for the feedback.
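A toy sketch of that top-down idea, under the hypothetical assumption that each group's metadata carries a `children` list (this key is not in any Zarr spec). Each lookup below stands in for a single GET against an object store, with no list call anywhere:

```python
# Toy object store: key -> metadata document (each access stands in for
# one GET; there is no list operation on keys).
store = {
    "root/.zgroup": {"children": ["a", "b"]},
    "root/a/.zgroup": {"children": ["c"]},
    "root/a/c/.zarray": {"shape": [100]},
    "root/b/.zgroup": {"children": []},
}

def walk(prefix):
    """Yield every node path reachable below `prefix`, following only
    per-node child lists ("cascading truth"): down but never up."""
    yield prefix
    meta = store.get(f"{prefix}/.zgroup")
    if meta is None:  # leaf array: nothing to descend into
        return
    for child in meta["children"]:
        yield from walk(f"{prefix}/{child}")

print(list(walk("root")))  # -> ['root', 'root/a', 'root/a/c', 'root/b']
```

The cost is one fetch per node visited, so deep trees pay a latency penalty, but every operation is a keyed read, which eventually consistent stores like S3 handle far more predictably than listings.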
Also referencing the consolidated metadata issue #136 here. I assume that this could be a v3 extension, right? Probably as an extension in the entry-point metadata, which would not be strictly required for reads. Tagging this as a possible extension for now; please let me know if this is not the case.
If I read the 2.2 spec correctly, then when opening a group there is no way of knowing the children of that group without doing a list. This seems sub-optimal to me. I'm usually working on S3 with very large Zarr files (1,000s to 100,000s of objects), and a list operation in this setting is not very efficient.

I feel that if the .zgroup listed its children then this would relieve the problem. Something like:
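One sketch of what a child-listing `.zgroup` could look like; the `children` key and its layout here are hypothetical illustrations, not part of the v2 spec:

```json
{
    "zarr_format": 2,
    "children": {
        "groups": ["model_a", "model_b"],
        "arrays": ["timestamps"]
    }
}
```

With something like this, opening a group yields its immediate members in the same single GET that fetches the group metadata.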