-
Notifications
You must be signed in to change notification settings - Fork 40
Decoding cftime_variables
#122
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
virtualizarr/kerchunk.py
Outdated
calendar = var.attrs.get( | ||
"calendar", var.encoding.get("calendar", "standard") | ||
) | ||
units = var.attrs.get("units", var.encoding["units"]) | ||
|
||
data = cftime.date2num(var, calendar=calendar, units=units).ravel() | ||
var = var.copy(data=data) | ||
np_arr = var.to_numpy() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking out loud: Should this be a Zarr Codec instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
naive question: do you mean a new CFTime codec or just a number codec rather than numpy?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would that just be a refactoring in this PR? Like just separate out the logic here into a Codec class that could later be moved upstream?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomNicholas I've looked at a tiny portion of upstreaming CF logic to a codec for CF scale_factor
and add_offset
but this case has the benefit of a pre-existing numcodec
FixedScaleAndOffset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is also a CFScaleOffsetCoder
in xarray too...
Ok! I think the issue was that I was encoding the var to a number but then I wasn't actually storing the encoding information in zarr 😬 . So now that that is resolved things seem to be working better but @jbusecke if you have a test you want to try go for it! |
If folks here have a minute, I just tested this PR over at jbusecke/esgf-virtual-zarr-data-access#10 with real world CMIP data. I think it generally works, but the attributes of the resulting time coordinate are different. Would love to get some input on that. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would love to get this merged!
This is an important feature, so deserves a note in the usage documentation too.
Thanks for the review! I think I have addressed all the comments. |
roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk") | ||
roundtrip = xr.open_dataset( | ||
f"{tmpdir}/refs.{format}", engine="kerchunk", decode_times=decode_times | ||
) | ||
|
||
# assert equal to original dataset | ||
xrt.assert_equal(roundtrip, ds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test doesn't pass yet, but it's getting a lot closer! It's basically a floating point difference at this point. I tried switching to xrt.assert_allclose
but I'm not sure it understands times because it still raises. If I do (roundtrip.time - ds.time).sum()
I get zero so I do think the times are pretty darn close.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried using numpy directly to get to the bottom of things:
>>> from numpy.testing import assert_allclose
>>> assert_allclose(roundtrip.time.values, ds.time.values)
*** numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.DateTime64DType'> could not be promoted by <class 'numpy.dtypes._PyFloatDType'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtypes.DateTime64DType'>, <class 'numpy.dtypes._PyFloatDType'>)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See 0ffce4f for a rather silly solution
virtualizarr/utils.py
Outdated
@@ -113,7 +113,7 @@ def recode_cftime(var: xr.Variable) -> NDArray[cftime.datetime]: | |||
for c in var.values: | |||
value = cftime.num2date( | |||
cftime.date2num( | |||
datetime.datetime.fromisoformat(str(c)), | |||
datetime.datetime.fromisoformat(str(c.astype("M8[us]"))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Python 3.10 microsecond seems to be the smallest supported unit.
Changed in version 3.11: Previously, this method only supported formats that could be emitted by date.isoformat() or datetime.isoformat().
ref https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat
Ok @TomNicholas I am pausing. Let me know how this is looking to you |
Thanks for going so deep into this @jsignell! To add more confusion - do you think what I'm seeing in #154 (comment) is related to your close-but-not-equal issue? |
@jsignell you might want to merge in #154 and #158. Also sorry to push the finish line farther away, but what do you think about the idea in #117 (comment)? |
✅
No worries, better to do this the right way. I modeled the approach in this PR on kerchunk, but I will take a look at what is going on in xarray. |
❤️ So I was looking at this in more detail last night, and I think we might just want to replace your Those encoders are called by |
Ok I think I've finally got it working. |
@@ -127,7 +148,9 @@ def open_virtual_dataset( | |||
filepath=filepath, reader_options=reader_options | |||
) | |||
|
|||
ds = xr.open_dataset(fpath, drop_variables=drop_variables) | |||
ds = xr.open_dataset( | |||
fpath, drop_variables=drop_variables, decode_times=False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a change that required changes to tests (see: cebb031) but I do think this is more correct. We don't want to be decoding things if it isn't explicitly requested right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomNicholas really want to draw attention to this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree. We can also make this more auto-magic later if we want. I'm not sure any of the existing tests were covering this.
Let me know if you are looking for any more changes or feedback! Otherwise this is done from my perspective 😸 |
Great contribution! Thanks so much @jsignell ! |
docs/releases.rst
api.rst
This is what I have started working on for the cftime_variables. I am going to be on vacation for a week so I thought I should show what I've gotten up to in case there is some urgency around this, but it is pretty WIP 🥲