Confusion about the dimension_separator keyword #769
Comments
To prevent a breaking change.
Agreed. This surprises me as well. |
So, I think
An alternative would be to disallow "/" as a separator with the error message, "use NestedDirectoryStore". |
Thanks for looking into this, @joshmoore. In my opinion, the issue with
|
To add to this issue, the same problem exists when reading the data.
The data returned will be all zeros (or more generally all fill_value).
|
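A minimal sketch of the read-side failure described above, assuming the zarr 2.8.x API (paths and array parameters are illustrative; exact behavior may vary by version):

```python
import numpy as np
import zarr
from zarr.storage import DirectoryStore, NestedDirectoryStore

# Write an array with nested ("/"-separated) chunk files on disk.
z = zarr.open_array(NestedDirectoryStore("data.zarr"), mode="w",
                    shape=(4, 4), chunks=(2, 2), dtype="i4", fill_value=0)
z[:] = 1

# Re-open the same directory with a flat DirectoryStore: the "0.0"-style chunk
# keys don't match the "0/0" files on disk, so chunk reads can silently miss
# and the data comes back as fill_value.
z2 = zarr.open_array(DirectoryStore("data.zarr"), mode="r")
print(np.all(z2[:] == 0))  # True under the reported bug -- zeros instead of the written ones
```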
Yeah, that definitely counts as a bug. I almost wonder if it wouldn't be easiest to use FSStore to re-implement (Nested)DirectoryStore. cc: @martindurant |
Maybe, but also it can be tricky to have an implementation that does too many things, with a load of options. Is that better or worse than a set of specific implementations? When I initially trialled FSStore, it did have more options, which I removed for want of testing. fsspec is a light dependency, but I imagine people would still be surprised to need it for local files (unless zarr were to have a hard requirement). |
(Working on a failing test & a fix now) |
So the failing test is pretty straightforward, but I'm beginning to fear that this use case shows that the
Edit: pondering out loud, an alternative would be to have a |
Agreed, we need to sort out to what extent zarr metadata should describe not just how to turn bytes into arrays, but details of how it is laid out in the storage backend. I am ambivalent - maybe the user just "needs to know", so that the metadata of a copy on a different backend is the same. Actually, an over-layer describing how to open a dataset works and is arguably the right place to separate out this kind of option. I would propose that Intake does this well. |
Just thinking about what that would mean for the likes of dask/dask#7673, @martindurant: would you pass an Intake URL then to |
Oh no! I mean, I'd love it, but Dask doesn't want that... I mean that the intake prescription could include
The most you'd want to do directly on the dask side is to pass on arguments to zarr as faithfully as you can, and provide good documentation around it. If the user needs to instantiate their own store instances, they'll have to do some digging. |
Ok. It's seeming then like "encoded at a higher-level" ends up being a reversal of all the work that has gone in since #715. If there are others that are similarly worried about the addition of
Otherwise, I'm going to push forward with trying to make it behave sanely as currently implemented, with the goal of having it behave the same across language implementations and without the user needing to know whether an array was saved with nested vs. flat. |
I am not particularly advocating the "higher level", sorry if it sounded like it; we can surely fix the code in dask and anywhere else to "detect" the situation too. The Intake thing is like a workaround for cases where we don't want to change the implementation. As extra arguments for storage go, this one is pretty low-level, almost at the C/fortran layout layer. The argument was that, since it affects the names of the contained files, you couldn't simply copy the whole dataset anyway. Other storage-oriented arguments are more divorced from that layer or specific to the storage, such as "consolidated" (which we support as an optional argument, but is not in the spec; it does survive copies) and "requester-pays" for an S3 dataset I happen to be looking at today. |
Interesting! Please ping me on related issues when the time comes.
Understood. Thanks for the clarification. |
@joshmoore I specifically objected to the
Insofar as
In terms of ergonomics, one pain point here is that if the storage layer is looking for chunk files with
Thanks, @d-v-b. I was worried that I had remembered your concerns correctly. The issue we've run into here is that when passing the "a/b/0.0" style key from the array to the store, there's no way to pass along whether one is in a dimension_separator="/" or a "." context. If you have any ideas, I'm all ears. The options I've considered are:
The downsides of 3 are:
The downside of 2 is that all Stores would need to learn this new protocol. If I understand correctly, your concern with 1 is that in the N5 case it won't be possible to split the key portion from the chunk portion? |
Does this actually need to be supported? In my usage, the storage backend is what makes "." or "/" the more desirable key separator. If this is generally the case, then maybe it is safe to require that the key separator must be consistent across all arrays in a given store?
Yes, n5 support requires that the store intercept the abstract chunk key (formatted as
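A toy illustration of that kind of key interception, where the store rewrites the abstract "."-separated chunk key into its own on-disk convention (hypothetical class; not zarr's or zarr-n5's actual implementation):

```python
import re

class SeparatorTranslatingStore(dict):
    """Toy in-memory store that maps '.'-separated chunk keys to '/'-separated ones."""

    # Matches "<optional array path>/<chunk coordinates like 1.2.3>".
    _chunk_key = re.compile(r"^(?:(?P<path>.+)/)?(?P<chunk>\d+(?:\.\d+)*)$")

    def _normalize(self, key):
        m = self._chunk_key.match(key)
        if m is None:
            return key  # metadata keys (.zarray, .zgroup, ...) pass through unchanged
        prefix = m.group("path") + "/" if m.group("path") else ""
        return prefix + m.group("chunk").replace(".", "/")

    def __setitem__(self, key, value):
        super().__setitem__(self._normalize(key), value)

    def __getitem__(self, key):
        return super().__getitem__(self._normalize(key))

store = SeparatorTranslatingStore()
store["foo/1.2"] = b"chunk-bytes"
print(list(store))        # ['foo/1/2'] -- stored under the translated key
print(store["foo/1.2"])   # b'chunk-bytes' -- lookups are translated the same way
```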
Yes, I think this is the right approach. If an array has been already saved to disk with |
Agreed. For all the cases where a user passes a dimension_separator, I agree that it's the most straightforward. But for the vast majority of cases, that's not what's happening, and I fear the usability of this is fairly lousy and will lead to:
as being the standard idiom, in which case nothing has been won. |
The |
Agreed, in principle at least. In practice, it's per array since there's no top-level metadata in v2.
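For reference, "per array" means the separator is recorded, when it is recorded at all, as an optional field in each array's own .zarray document rather than in any group- or store-level metadata. A representative v2 .zarray looks roughly like this (concrete values are illustrative):

```python
# Representative contents of a v2 array's .zarray metadata document with the
# optional dimension_separator field; the compressor and dtype shown are examples.
zarray_metadata = {
    "zarr_format": 2,
    "shape": [10, 10],
    "chunks": [5, 5],
    "dtype": "<i4",
    "compressor": {"id": "zlib", "level": 1},
    "fill_value": 0,
    "order": "C",
    "filters": None,
    "dimension_separator": "/",
}
```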
That's certainly what I'm trying to achieve, but it's not currently possible. An option perhaps (4?) is to allow Arrays to modify stores (!). I don't particularly like the idea (and it introduces a race condition), but it's the only way I've thought of so far to leave Stores in charge of this feature and still have the behavior you describe. I'm already passing the _dimension_separator into the Array creation via a property on the Store. If the Store was not given a dimension_separator, the first time an Array is opened the Array would set it on the given Store. |
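A rough sketch of what such an "option 4" could look like, purely to make the race-condition concern concrete (class and attribute names are simplified stand-ins, not zarr's actual classes):

```python
class Store:
    def __init__(self, dimension_separator=None):
        # None means "not yet decided"; the first Array opened gets to fill it in.
        self._dimension_separator = dimension_separator


class Array:
    def __init__(self, store, metadata_separator):
        if store._dimension_separator is None:
            # Option 4: the Array writes back onto the Store. If two Arrays with
            # different separators are opened concurrently, whichever runs first
            # wins -- the race condition mentioned above.
            store._dimension_separator = metadata_separator
        self._separator = store._dimension_separator


store = Store()                               # user did not specify a separator
a = Array(store, metadata_separator="/")      # first array opened was saved nested
print(store._dimension_separator)             # "/" -- the Array has mutated the Store
```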
There is the |
@constantinpape - probably not, since you should always be able to access the path of the array directly without knowing it is within a group. |
But that could be solved by having it both in the array and group metadata, no? |
True, but then we must worry about keeping the two in sync!
This duplicates the information, but seems preferable to the other downsides listed above.
|
Yes, that's true. But that still sounds preferable to the other downsides discussed above. And I think that would be a good solution if we decide not to allow mixed separators per group, which I am in favor of. Let's see what @joshmoore and @d-v-b think about this. |
@joshmoore this appears to link back to the current issue. Was there another issue this was intending to refer to? (just trying to get some missing context here) |
Thanks for catching that, @jakirkham. #709 was what I meant. Fixed above. |
What about a length-2 tuple containing the array key and the chunk key, e.g. |
In terms of new APIs that would certainly work for me, but it would be a pretty significant change. |
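For concreteness, one hypothetical shape such an API could take, with the array key and chunk key arriving as separate parts so the store alone decides how to join them (illustrative only, not zarr's actual interface):

```python
from typing import Dict, Tuple

class TupleKeyStore:
    """Toy store that receives ("array/path", "0.0")-style two-part keys."""

    def __init__(self, separator: str = "/"):
        self._sep = separator
        self._data: Dict[str, bytes] = {}

    def _join(self, key: Tuple[str, str]) -> str:
        array_key, chunk_key = key                   # e.g. ("a/b", "0.0")
        chunk = chunk_key.replace(".", self._sep)    # the store picks the layout
        return f"{array_key}/{chunk}" if array_key else chunk

    def __setitem__(self, key: Tuple[str, str], value: bytes) -> None:
        self._data[self._join(key)] = value

    def __getitem__(self, key: Tuple[str, str]) -> bytes:
        return self._data[self._join(key)]

store = TupleKeyStore(separator="/")
store[("a/b", "0.0")] = b"chunk"
print(list(store._data))  # ['a/b/0/0']
```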
WRT the netcdf-c implementation of dimension separator.
|
@d-v-b: sorry, I've been failing to find the time to really engage here, however the
@DennisHeimbigner: thanks for the info. If we're willing to break the API, then yes, I definitely see the value in clearly splitting the key into its individual parts. In another issue (which I can dig up) there was a pretty strong vote against
@jbms (#715 (comment)) that we not try the heuristics (as I had tried in a PR with @SabineEmbacher)
cc: @haileyajohnson |
Actually splitting the key occurs completely internally, so the user never sees it. |
@joshmoore I will put together a branch that you can test against. And I think I figured out a stupid hack that lets a store infer the key separator even if it changes on the array side. To summarize this issue as I understand it:
I would be interested in understanding the pros and cons of fixing this via storage metadata (i.e. some root metadata that contains information about storage for all elements of the container). If store-specific details (like dimension separator) were stored in |
The obvious one would be an attempt to move (or copy) the dataset/file hierarchy from one storage system to another. Such a move is not necessarily made by a thing that has any knowledge of zarr and its conventions. |
💯
At least at a specification level, this is neither the case in V2 nor V3 currently. It's an array concern.
|
Just chiming in to say I also hit the issue addressed by #773 !
This is certainly the current behaviour, but it's not the case in V3, is it?
Going into V3, we lose the ability to use such tools anyway, right (which is a shame)? Because all the metadata is stored in a directory hierarchy parallel to the groups/arrays. |
Another +1 and thanks to @joshmoore for #773, which addressed the behavior I observed when trying to use
Are there plans to integrate #773? |
After a lengthy back and forth, I think this is a last call for opinions on the fix. I'd propose to get it out in |
I don't really understand how to use the new dimension_separator keyword, in particular: DirectoryStore(dimension_separator="/") does not have the effect I would expect (see code and problem description below). What about NestedDirectoryStore? Shouldn't it be the same as DirectoryStore(dimension_separator="/")? Hence I would assume that NestedDirectoryStore could either be removed or (if to be kept for legacy purposes) should just map to DirectoryStore(dimension_separator="/").
Minimal, reproducible code sample, a copy-pastable example if possible
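A minimal sketch of the kind of usage being described, assuming the zarr 2.8.x API (the store path and array parameters are illustrative, not the reporter's original snippet):

```python
import zarr
from zarr.storage import DirectoryStore

# Request a "/"-separated chunk layout via the store.
store = DirectoryStore("example.zarr", dimension_separator="/")
z = zarr.zeros((10, 10), chunks=(5, 5), store=store, overwrite=True)
z[:] = 42

# Expectation: chunk files nested as example.zarr/0/0, example.zarr/0/1, ...
# Reported behavior (zarr 2.8.3): the chunks do not end up in the expected nested layout.
```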
Problem description
Now, I would assume that the chunks are nested, but I get:
but, to my confusion, also this:
If I use NestedDirectoryStore instead, the chunks are nested as expected.
Version and installation information
Please provide the following:
zarr.__version__: 2.8.3
numcodecs.__version__: 0.7.3