Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HDF reader creating spurious dimension coordinates #401

Closed
TomNicholas opened this issue Jan 29, 2025 · 3 comments · Fixed by #410
Closed

HDF reader creating spurious dimension coordinates #401

TomNicholas opened this issue Jan 29, 2025 · 3 comments · Fixed by #410
Labels
bug Something isn't working HDF reader Non-kerchunk-based HDF reader backend

Comments

@TomNicholas
Copy link
Member

          This is what the output looks like for the kerchunk-based reader:
<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

vs the custom reader:

<xarray.Dataset> Size: 32B
Dimensions:  (dim_0: 2)
Coordinates:
    dim_0    (dim_0) float64 16B 0.0 0.0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

for reference here is what it looks like if I just naively open it as a regular dataset:

(Pdb) xr.open_dataset(netcdf4_file_with_data_in_multiple_groups, group="subgroup")
<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ...

I think this is probably the source of the nbytes difference too

Originally posted by @jsignell in #395 (comment)

@sharkinsspatial
Copy link
Collaborator

I'll need to verify but I think this is the issue captured by this xfailed test https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/tests/test_readers/test_hdf/test_hdf_integration.py#L30-L42

@TomNicholas
Copy link
Member Author

@sharkinsspatial to answer your question in the meeting just now the expected behaviour here should be to create whatever coordinate variable xarray create (though the creation of the index backing it is handled by a later layer, so it just needs to be in / not be in your returned list of Variable objects).

jsignell added a commit to jsignell/VirtualiZarr that referenced this issue Jan 29, 2025
TomNicholas added a commit that referenced this issue Jan 30, 2025
* Switch to custom netcdf4/hdf5 backend

* Switches autodetected backend selection
* updates tests to require kerchunk less often
* only test kerchunk hdf reader if kerchunk is available

* Allow for kerchunk-based backend

* Rename to parametrize_over_hdf_backends

* Run group tests

* Respect dimensions without coordinates

* Fix #402 so that nested groups are ignored

* Encode #401 behavior in tests

* Fix min deps tests

* Make mypy happy

* Add to release notes

* combine two tests into one

---------

Co-authored-by: TomNicholas <[email protected]>
@sharkinsspatial
Copy link
Collaborator

@TomNicholas I was struggling a bit to understand how I would be able to differentiate between the empty HDF5 datasets that are created for dimensions when serializing to netCDF vs "real" empty HDF5 datasets users would want represented as variables. As @keewis identified here #260, kerchunk simply ignores HDF5 datasets that have no chunks (and thus these empty dimension HDF5 datasets but if we mimic this behavior, then we would also lose "real" empty HDF5 datasets).

I'm sure there are likely other cases I am not covering, but I believe we can differentiate these empty HDF5 datasets for dimensions using logic similar to what I've outlined here https://nbviewer.org/gist/sharkinsspatial/a00afe480186bbd953d5265e560fa24e. I'm sure xarray has more comprehensive logic for this determination but I wasn't able to locate where in the stack this handled.

I'll implement this now and get our tests passing, but it would be a good idea if we can get someone who understands the xarray logic better to review our approach as well.

@TomNicholas TomNicholas added bug Something isn't working HDF reader Non-kerchunk-based HDF reader backend labels Jan 30, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working HDF reader Non-kerchunk-based HDF reader backend
Projects
None yet
2 participants