HDF reader creating spurious dimension coordinates #401

TomNicholas · 2025-01-29T18:15:36Z

          This is what the output looks like for the kerchunk-based reader:

<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

vs the custom reader:

<xarray.Dataset> Size: 32B
Dimensions:  (dim_0: 2)
Coordinates:
    dim_0    (dim_0) float64 16B 0.0 0.0
Data variables:
    bar      (dim_0) int64 16B ManifestArray<shape=(2,), dtype=int64, chunks=...

for reference here is what it looks like if I just naively open it as a regular dataset:

(Pdb) xr.open_dataset(netcdf4_file_with_data_in_multiple_groups, group="subgroup")
<xarray.Dataset> Size: 16B
Dimensions:  (dim_0: 2)
Dimensions without coordinates: dim_0
Data variables:
    bar      (dim_0) int64 16B ...

I think this is probably the source of the nbytes difference too

Originally posted by @jsignell in #395 (comment)

The text was updated successfully, but these errors were encountered:

sharkinsspatial · 2025-01-29T18:38:08Z

I'll need to verify but I think this is the issue captured by this xfailed test https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/tests/test_readers/test_hdf/test_hdf_integration.py#L30-L42

TomNicholas · 2025-01-29T19:05:12Z

@sharkinsspatial to answer your question in the meeting just now the expected behaviour here should be to create whatever coordinate variable xarray create (though the creation of the index backing it is handled by a later layer, so it just needs to be in / not be in your returned list of Variable objects).

* Switch to custom netcdf4/hdf5 backend * Switches autodetected backend selection * updates tests to require kerchunk less often * only test kerchunk hdf reader if kerchunk is available * Allow for kerchunk-based backend * Rename to parametrize_over_hdf_backends * Run group tests * Respect dimensions without coordinates * Fix #402 so that nested groups are ignored * Encode #401 behavior in tests * Fix min deps tests * Make mypy happy * Add to release notes * combine two tests into one --------- Co-authored-by: TomNicholas <[email protected]>

sharkinsspatial · 2025-01-30T17:09:27Z

@TomNicholas I was struggling a bit to understand how I would be able to differentiate between the empty HDF5 datasets that are created for dimensions when serializing to netCDF vs "real" empty HDF5 datasets users would want represented as variables. As @keewis identified here #260, kerchunk simply ignores HDF5 datasets that have no chunks (and thus these empty dimension HDF5 datasets but if we mimic this behavior, then we would also lose "real" empty HDF5 datasets).

I'm sure there are likely other cases I am not covering, but I believe we can differentiate these empty HDF5 datasets for dimensions using logic similar to what I've outlined here https://nbviewer.org/gist/sharkinsspatial/a00afe480186bbd953d5265e560fa24e. I'm sure xarray has more comprehensive logic for this determination but I wasn't able to locate where in the stack this handled.

I'll implement this now and get our tests passing, but it would be a good idea if we can get someone who understands the xarray logic better to review our approach as well.

TomNicholas mentioned this issue Jan 29, 2025

Switch to custom netcdf4/hdf5 backend #395

Merged

4 tasks

jsignell added a commit to jsignell/VirtualiZarr that referenced this issue Jan 29, 2025

Encode zarr-developers#401 behavior in tests

e18e647

TomNicholas added bug Something isn't working HDF reader Non-kerchunk-based HDF reader backend labels Jan 30, 2025

sharkinsspatial mentioned this issue Jan 31, 2025

Fix HDFVirtualBackend handling of non coordinate dimension HDF datasets. #410

Merged

4 tasks

sharkinsspatial closed this as completed in #410 Jan 31, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HDF reader creating spurious dimension coordinates #401

HDF reader creating spurious dimension coordinates #401

TomNicholas commented Jan 29, 2025

sharkinsspatial commented Jan 29, 2025

TomNicholas commented Jan 29, 2025

sharkinsspatial commented Jan 30, 2025

HDF reader creating spurious dimension coordinates #401

HDF reader creating spurious dimension coordinates #401

Comments

TomNicholas commented Jan 29, 2025

sharkinsspatial commented Jan 29, 2025

TomNicholas commented Jan 29, 2025

sharkinsspatial commented Jan 30, 2025