-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ensure attributes on coordinate variables are preserved during round-tripping #154
Conversation
…coordinate variables
@@ -322,7 +321,7 @@ def separate_coords( | |||
# use workaround to avoid creating IndexVariables described here https://github.com/pydata/xarray/pull/8107#discussion_r1311214263 | |||
if len(var.dims) == 1: | |||
dim1d, *_ = var.dims | |||
coord_vars[name] = (dim1d, var.data) | |||
coord_vars[name] = (dim1d, var.data, var.attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding this fixes the initial issue described in #155 (i.e. the lat
and lon
coordinate variables having their attrs
get dropped) and makes the new test in this PR pass.
However it seems to introduce a new failure involving decoding the time
variable:
# assert equal to original dataset
> xrt.assert_identical(roundtrip, ds)
E AssertionError: Left and right Dataset objects are not identical
E
E Differing coordinates:
E L * time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
E R * time (time) float32 12kB 1.867e+06 1.867e+06 ... 1.885e+06 1.885e+06
E Differing variable attributes:
E units: hours since 1800-01-01
E calendar: standard
@jsignell wondering if you have any idea what's going on here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I set decode_times=True
then I get
> xrt.assert_identical(roundtrip, ds)
E AssertionError: Left and right Dataset objects are not identical
E
E Differing coordinates:
E L * time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
E R * time (time) datetime64[ns] 23kB 2013-01-01T00:02:06.757437440 ... 201...
which seem extremely close, similar to in #122 (comment).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to make sure I am understanding this properly:
- if you don't set
decode_times
then the roundtripped version ends up as a datetime[ns] and the original version ends up with floats - if you do set
decode_times=True
then the both the roundtripped and the original end up as datetime[ns] but with a floating point error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok I just tested a little theory. I think this has to do with the default attrs that xarray adds to the ds on load. When you read the saved version back in using xarray.open_dataset
you end up with datetime[ns] dtype as well:
diff --git a/virtualizarr/tests/test_integration.py b/virtualizarr/tests/test_integration.py
index b758be6..0b370d5 100644
--- a/virtualizarr/tests/test_integration.py
+++ b/virtualizarr/tests/test_integration.py
@@ -52,6 +52,9 @@ class TestKerchunkRoundtrip:
# save it to disk as netCDF (in temporary directory)
ds.to_netcdf(f"{tmpdir}/air.nc")
+ # read it back in
+ original = xr.open_dataset(f"{tmpdir}/air.nc")
+
# use open_dataset_via_kerchunk to read it as references
vds = open_virtual_dataset(f"{tmpdir}/air.nc", indexes={})
@@ -62,12 +65,18 @@ class TestKerchunkRoundtrip:
roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
# assert equal to original dataset
- xrt.assert_identical(roundtrip, ds)
+ xrt.assert_identical(roundtrip, original)
def test_kerchunk_roundtrip_concat(self, tmpdir, format):
# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False)
+ # save it to disk as netCDF (in temporary directory)
+ ds.to_netcdf(f"{tmpdir}/air.nc")
+
+ # read it back in
+ original = xr.open_dataset(f"{tmpdir}/air.nc")
+
# split into two datasets
ds1, ds2 = ds.isel(time=slice(None, 1460)), ds.isel(time=slice(1460, None))
@@ -89,7 +98,7 @@ class TestKerchunkRoundtrip:
roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
# assert equal to original dataset
- xrt.assert_identical(roundtrip, ds)
+ xrt.assert_identical(roundtrip, original)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I just realised that if I make sure to set decode_times=False
in both the original
# set up example xarray dataset
ds = xr.tutorial.open_dataset("air_temperature", decode_times=False)
and the roundtrip
# use fsspec to read the dataset from disk via the zarr store
roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk", decode_times=False)
calls then all the tests pass on this PR, even with assert_identical
. So perhaps this is a red herring and what we should do is merge this PR with decode_times=False
then rebase #112 and work on decode_times=True
there?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep that makes sense to me. It seems like xarray might determine whether or not to decode times based on the attrs that are defined and since some attrs get added on load it might bump the version that you store in netcdf over the boundary.
@@ -322,7 +321,7 @@ def separate_coords( | |||
# use workaround to avoid creating IndexVariables described here https://github.com/pydata/xarray/pull/8107#discussion_r1311214263 | |||
if len(var.dims) == 1: | |||
dim1d, *_ = var.dims | |||
coord_vars[name] = (dim1d, var.data) | |||
coord_vars[name] = (dim1d, var.data, var.attrs) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This gets passed into the constructor of xr.Coordinates
, which calls xr.as_variable
.
docs/releases.rst