Decoding cftime_variables #122


Merged
merged 17 commits into from
Jun 28, 2024

Conversation

jsignell
Contributor

@jsignell jsignell commented May 20, 2024

This is what I have started working on for the cftime_variables. I am going to be on vacation for a week so I thought I should show what I've gotten up to in case there is some urgency around this, but it is pretty WIP 🥲

Comment on lines 267 to 274
calendar = var.attrs.get(
    "calendar", var.encoding.get("calendar", "standard")
)
units = var.attrs.get("units", var.encoding["units"])

data = cftime.date2num(var, calendar=calendar, units=units).ravel()
var = var.copy(data=data)
np_arr = var.to_numpy()

Thinking out loud: Should this be a Zarr Codec instead?

Contributor Author

naive question: do you mean a new CFTime codec or just a number codec rather than numpy?

Member

Would that just be a refactoring in this PR? Like just separate out the logic here into a Codec class that could later be moved upstream?

Collaborator

@TomNicholas I've looked at upstreaming a tiny portion of the CF logic to a codec, for CF scale_factor and add_offset, but that case has the benefit of a pre-existing numcodecs FixedScaleOffset codec.

Member

There is also a CFScaleOffsetCoder in xarray...
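
A side note on the snippet under review in this thread: in Python, the default argument to `dict.get` is evaluated before the call, so `var.attrs.get("units", var.encoding["units"])` raises `KeyError` whenever `encoding` lacks `units`, even if `attrs` has it. Whether that can actually happen in this PR depends on how the variables are opened. The sketch below uses plain dicts as stand-ins for `var.attrs` and `var.encoding`; `lookup_cf_attr` is a hypothetical helper name, not something from the PR.

```python
def lookup_cf_attr(attrs, encoding, key, default=None):
    """Look up `key` in attrs first, then encoding, then fall back to default.

    Hypothetical helper: plain dicts stand in for var.attrs / var.encoding.
    """
    if key in attrs:
        return attrs[key]
    if key in encoding:
        return encoding[key]
    if default is None:
        raise KeyError(f"{key!r} not found in attrs or encoding")
    return default


attrs = {"units": "days since 2000-01-01"}
encoding = {}  # deliberately has no "units" entry

# Eager default: encoding["units"] is evaluated before .get() even runs.
try:
    attrs.get("units", encoding["units"])
    eager_lookup_failed = False
except KeyError:
    eager_lookup_failed = True

# The lazy lookup only touches encoding when attrs does not have the key.
units = lookup_cf_attr(attrs, encoding, "units")
calendar = lookup_cf_attr(attrs, encoding, "calendar", default="standard")
```

The `calendar` lookup in the reviewed snippet does not have this problem, because its inner call is itself a `.get` with a default.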

@jsignell jsignell marked this pull request as ready for review June 5, 2024 20:09
@jsignell jsignell changed the title [WIP] Decoding cftime_variables Decoding cftime_variables Jun 5, 2024
@jsignell
Contributor Author

jsignell commented Jun 5, 2024

Ok! I think the issue was that I was encoding the var to a number but wasn't actually storing the encoding information in zarr 😬. Now that that is resolved, things seem to be working better. @jbusecke, if you have a test you want to try, go for it!
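
To illustrate why the encoding information has to travel with the numbers, here is a minimal standard-library sketch. It is a stand-in for what `cftime.date2num` and `cftime.num2date` do, restricted to the standard calendar and a fixed reference date; the names `encode_times`, `decode_times`, `EPOCH`, and `UNITS` are assumptions made for this example, not code from the PR.

```python
from datetime import datetime, timedelta

# Assumed reference date and matching CF-style units string for this sketch.
EPOCH = datetime(2000, 1, 1)
UNITS = "days since 2000-01-01"


def encode_times(times):
    """Encode datetimes as floats and return the attrs needed to decode them."""
    numbers = [(t - EPOCH) / timedelta(days=1) for t in times]
    attrs = {"units": UNITS, "calendar": "standard"}
    return numbers, attrs


def decode_times(numbers, attrs):
    """Decode floats back to datetimes; refuses to guess when units are missing."""
    if attrs.get("units") != UNITS:
        raise ValueError("cannot decode times without the stored units")
    return [EPOCH + timedelta(days=n) for n in numbers]


times = [datetime(2000, 1, 1), datetime(2000, 1, 2, 12)]
numbers, attrs = encode_times(times)
roundtrip = decode_times(numbers, attrs)

# Dropping the attrs (the situation described above) makes decoding impossible.
try:
    decode_times(numbers, {})
    decoded_without_attrs = True
except ValueError:
    decoded_without_attrs = False
```

Without the `units` and `calendar` attributes stored next to the array, a reader only sees floats and has no way to tell what they count from.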

@jbusecke
Contributor

If folks here have a minute, I just tested this PR over at jbusecke/esgf-virtual-zarr-data-access#10 with real-world CMIP data. I think it generally works, but the attributes of the resulting time coordinate are different. Would love to get some input on that.

Member

@TomNicholas TomNicholas left a comment

I would love to get this merged!

This is an important feature, so it deserves a note in the usage documentation too.

@jsignell
Contributor Author

Thanks for the review! I think I have addressed all the comments.

- roundtrip = xr.open_dataset(f"{tmpdir}/refs.{format}", engine="kerchunk")
+ roundtrip = xr.open_dataset(
+     f"{tmpdir}/refs.{format}", engine="kerchunk", decode_times=decode_times
+ )

  # assert equal to original dataset
  xrt.assert_equal(roundtrip, ds)
Contributor Author

This test doesn't pass yet, but it's getting a lot closer! It's basically a floating point difference at this point. I tried switching to xrt.assert_allclose but I'm not sure it understands times because it still raises. If I do (roundtrip.time - ds.time).sum() I get zero so I do think the times are pretty darn close.

Contributor Author

I tried using numpy directly to get to the bottom of things:

>>> from numpy.testing import assert_allclose
>>> assert_allclose(roundtrip.time.values, ds.time.values)
*** numpy.exceptions.DTypePromotionError: The DType <class 'numpy.dtypes.DateTime64DType'> could not be promoted by <class 'numpy.dtypes._PyFloatDType'>. This means that no common DType exists for the given inputs. For example they cannot be stored in a single array unless the dtype is `object`. The full list of DTypes is: (<class 'numpy.dtypes.DateTime64DType'>, <class 'numpy.dtypes._PyFloatDType'>)
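
One way to sidestep the float-versus-datetime promotion problem is to compare the time differences directly against a tolerance. The sketch below uses only the standard library, with lists of `datetime` objects standing in for `roundtrip.time.values` and `ds.time.values`; `max_abs_difference` is a hypothetical helper and the sample values are made up. It also shows why a zero sum of differences is weaker evidence than a small maximum absolute difference: offsets of opposite sign cancel.

```python
from datetime import datetime, timedelta


def max_abs_difference(left, right):
    """Return the largest absolute gap between paired datetimes."""
    if len(left) != len(right):
        raise ValueError("sequences must have the same length")
    return max((abs(a - b) for a, b in zip(left, right)), default=timedelta(0))


# Made-up data: the roundtrip is off by +1 and -1 microsecond.
original = [datetime(2000, 1, 1), datetime(2000, 1, 2)]
roundtrip = [
    datetime(2000, 1, 1, 0, 0, 0, 1),
    datetime(2000, 1, 1, 23, 59, 59, 999999),
]

# The signed differences cancel, so the sum alone says nothing about closeness.
signed_sum = sum((r - o for r, o in zip(roundtrip, original)), timedelta(0))

# The maximum absolute difference is what a tolerance check should use.
worst = max_abs_difference(roundtrip, original)
within_tolerance = worst <= timedelta(milliseconds=1)
```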

Contributor Author

See 0ffce4f for a rather silly solution

@@ -113,7 +113,7 @@ def recode_cftime(var: xr.Variable) -> NDArray[cftime.datetime]:
      for c in var.values:
          value = cftime.num2date(
              cftime.date2num(
-                 datetime.datetime.fromisoformat(str(c)),
+                 datetime.datetime.fromisoformat(str(c.astype("M8[us]"))),
Contributor Author

In Python 3.10, microseconds seem to be the smallest supported unit.

Changed in version 3.11: Previously, this method only supported formats that could be emitted by date.isoformat() or datetime.isoformat().

ref https://docs.python.org/3/library/datetime.html#datetime.datetime.fromisoformat
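
The diff above handles this by casting to `M8[us]` before calling `str`. An alternative that works on the string itself is sketched below: trim the fractional seconds to six digits before handing the text to `datetime.fromisoformat`. `parse_iso_truncated` is a hypothetical helper for illustration, and it assumes the input has no UTC offset suffix.

```python
from datetime import datetime


def parse_iso_truncated(text: str) -> datetime:
    """Parse an ISO 8601 timestamp, keeping at most microsecond precision.

    Assumes no UTC offset suffix; extra fractional digits are dropped, not rounded.
    """
    head, sep, frac = text.partition(".")
    if sep:
        # Pad or cut the fraction to exactly six digits (microseconds).
        text = f"{head}.{frac[:6].ljust(6, '0')}"
    return datetime.fromisoformat(text)


# A nanosecond-precision string, like the one str() gives for a datetime64[ns].
parsed = parse_iso_truncated("2000-01-01T12:00:00.123456789")
short_fraction = parse_iso_truncated("2000-01-01T12:00:00.5")
no_fraction = parse_iso_truncated("2000-01-01T12:00:00")
```

Truncating rather than rounding matches what the `astype("M8[us]")` cast does for dates after the epoch, but it does mean sub-microsecond information is lost either way.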

@jsignell jsignell requested a review from TomNicholas June 21, 2024 19:03
@jsignell
Contributor Author

Ok @TomNicholas, I am pausing. Let me know how this is looking to you.

@TomNicholas
Member

Thanks for going so deep into this @jsignell! To add more confusion - do you think what I'm seeing in #154 (comment) is related to your close-but-not-equal issue?

@TomNicholas
Member

@jsignell you might want to merge in #154 and #158.

Also sorry to push the finish line farther away, but what do you think about the idea in #117 (comment)?

@jsignell
Contributor Author

> @jsignell you might want to merge in #154 and #158.
>
> Also sorry to push the finish line farther away, but what do you think about the idea in #117 (comment)?

No worries, better to do this the right way. I modeled the approach in this PR on kerchunk, but I will take a look at what is going on in xarray.

@TomNicholas
Member

> No worries, better to do this the right way. I modeled the approach in this PR on kerchunk, but I will take a look at what is going on in xarray.

❤️

So I was looking at this in more detail last night, and I think we might just want to replace your encode_cftime and recode_cftime with xarray's CFDatetimeCoder (and possibly the CFTimedeltaCoder too), imported directly. As far as I can tell (and please tell me if I'm wrong), those are basically more mature versions of the functions you've written here. Those classes are private but stable interfaces, and using them here could be an argument for factoring them out of xarray at some point (or exposing them more publicly).

Those encoders are called by xr.decode_cf here, but replacing everything with xr.decode_cf seems like a longer-term idea that is beyond the scope of this PR (but using xarray's encoders would build towards it).

@jsignell
Contributor Author

Ok I think I've finally got it working.

@@ -127,7 +148,9 @@ def open_virtual_dataset(
          filepath=filepath, reader_options=reader_options
      )

-     ds = xr.open_dataset(fpath, drop_variables=drop_variables)
+     ds = xr.open_dataset(
+         fpath, drop_variables=drop_variables, decode_times=False
Contributor Author

This change required changes to the tests (see: cebb031), but I do think it is more correct. We don't want to be decoding things if it isn't explicitly requested, right?

Contributor Author

@TomNicholas really want to draw attention to this

Member

I agree. We can also make this more auto-magic later if we want. I'm not sure any of the existing tests were covering this.

@jsignell
Contributor Author

Let me know if you are looking for any more changes or feedback! Otherwise this is done from my perspective 😸

@TomNicholas TomNicholas merged commit 10e7863 into zarr-developers:main Jun 28, 2024
7 checks passed
@TomNicholas
Member

Great contribution! Thanks so much @jsignell!

Successfully merging this pull request may close these issues.

Option to interpret variables using cftime
5 participants