Fill value lost with Icechunk #343

Closed
rabernat opened this issue Dec 12, 2024 · 3 comments

@rabernat
Collaborator

This issue is most likely related to the different treatment of missing data in Xarray with Zarr V3 vs. V2 (see pydata/xarray#5475). I don't think it's specific to Icechunk, but rather is related to Zarr V3. However, afaik it can only be reproduced with Icechunk, because only Icechunk uses V3 in VirtualiZarr.

Note: this requires pip install git+https://github.com/mpiannucci/kerchunk@v3

To reproduce, first create test data

import xarray as xr
import numpy as np
import virtualizarr
from virtualizarr import open_virtual_dataset
import icechunk

ds = xr.DataArray([np.nan, 2, 3], name="foo").to_dataset()
ds.foo.encoding.update({"_FillValue": np.float32(-9999.9), "dtype": "f4"})

ds.to_netcdf("test.nc")

ds_nc = xr.open_dataset("test.nc").load()
print(ds_nc)
print(ds_nc.foo.encoding)

<xarray.Dataset> Size: 12B
Dimensions:  (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    foo      (dim_0) float32 12B nan 2.0 3.0
{'dtype': dtype('float32'), 'zlib': False, 'szip': False, 'zstd': False, 'bzip2': False, 'blosc': False, 'shuffle': False, 'complevel': 0, 'fletcher32': False, 'contiguous': True, 'chunksizes': None, 'source': '/home/jovyan/GPM_IMERG/test.nc', 'original_shape': (3,), '_FillValue': np.float32(-9999.9)}

Next, create an Icechunk virtual dataset

dsv = open_virtual_dataset("test.nc", indexes={})
storage_config = icechunk.StorageConfig.filesystem("./icechunk-local")
store = icechunk.IcechunkStore.create(storage_config)
dsv.virtualize.to_icechunk(store)
store.commit("wrote store")

Now read it back

ds_ic = xr.open_dataset(store, engine="zarr", zarr_format=3, consolidated=False).load()
print(ds_ic)
print(ds_ic.foo.encoding)

<xarray.Dataset> Size: 12B
Dimensions:  (dim_0: 3)
Dimensions without coordinates: dim_0
Data variables:
    foo      (dim_0) float32 12B -1e+04 2.0 3.0
{'chunks': (3,), 'preferred_chunks': {'dim_0': 3}, 'codecs': [{'name': 'bytes', 'configuration': {'endian': 'little'}}], 'dtype': dtype('float32')}

As you can see, the first value comes back as the raw fill value (-1e+04) instead of NaN, and _FillValue is missing from the encoding.

I'm not really sure how encoding works in VirtualiZarr, but I'd be happy to explain the relevant changes to Xarray's handling of fill values in the Zarr V3 format (see e.g. https://github.com/pydata/xarray/blob/49502fcde4db6ea3da1f60ead589580cfdad5c98/xarray/backends/zarr.py#L787-L796).
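For readers unfamiliar with the masking step, here is a minimal sketch of how the presence or absence of a _FillValue attribute changes what Xarray's CF decoding returns. It uses only in-memory data and xr.decode_cf, so it illustrates the general mechanism rather than the exact Zarr V3 code path linked above; the variable names are illustrative.

```python
import numpy as np
import xarray as xr

fill = np.float32(-9999.9)
# Raw on-disk values: the first element holds the fill value, not NaN.
raw = np.array([fill, 2, 3], dtype="f4")

# Same raw data, with and without the _FillValue attribute.
with_attr = xr.Dataset({"foo": ("dim_0", raw, {"_FillValue": fill})})
without_attr = xr.Dataset({"foo": ("dim_0", raw)})

# decode_cf applies mask_and_scale by default.
decoded_with = xr.decode_cf(with_attr)
decoded_without = xr.decode_cf(without_attr)

print(decoded_with["foo"].values)     # first value is masked to NaN
print(decoded_without["foo"].values)  # first value stays at the raw fill value
print(decoded_with["foo"].encoding.get("_FillValue"))
```

When the attribute is missing, nothing tells Xarray which value to mask, which matches the symptom in the Icechunk read above.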

cc @mpiannucci

@rabernat
Collaborator Author

rabernat commented Dec 14, 2024

I was able to work around this in the following way:

from xarray.backends.zarr import FillValueCoder
import numpy as np
import zarr

coder = FillValueCoder()
group = zarr.open_group(store)
# Store the encoded fill value as an array attribute so Xarray can mask it on read
group['foo'].attrs['_FillValue'] = coder.encode(ds.foo.encoding['_FillValue'], np.dtype('f4'))

store.commit("fixed fill value")
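To make the workaround less opaque, here is a rough stdlib-only sketch of what the encoded attribute looks like for a float. It assumes that FillValueCoder serializes float fill values as base64 of the little-endian float64 bytes, which is my reading of the Xarray source at the time and should be checked against the version you have installed. The helper names encode_float_fill and decode_float_fill are hypothetical and are not part of Xarray.

```python
import base64
import struct


def encode_float_fill(value: float) -> str:
    """Serialize a float as base64 of its little-endian float64 bytes (assumed scheme)."""
    return base64.standard_b64encode(struct.pack("<d", float(value))).decode()


def decode_float_fill(encoded: str) -> float:
    """Invert encode_float_fill."""
    return struct.unpack("<d", base64.standard_b64decode(encoded))[0]


# -9999.900390625 is the float32 value nearest to -9999.9, written as a Python float.
fill = -9999.900390625
encoded = encode_float_fill(fill)
print(encoded)                     # a short base64 string, not a number
print(decode_float_fill(encoded))  # round-trips to the original value
```

If that assumption holds, a _FillValue attribute that looks like a short opaque string in the Zarr V3 metadata is expected rather than a sign of corruption.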

@abarciauskas-bgse
Collaborator

@rabernat I am trying to understand the expected behavior for what the VirtualiZarr readers like dmrpp should return for _FillValue in xr.Variables' encoding and attrs, see #369.

Looking at https://github.com/pydata/xarray/blob/49502fcde4db6ea3da1f60ead589580cfdad5c98/xarray/backends/zarr.py#L787-L796 it seems like _FillValue should be in attrs for the xarray zarr reader (unless I'm misreading), but in my testing _FillValue was not in attrs. I also found that _FillValue looks like it could be improperly encoded in Zarr V3. I'm not sure at this point whether these issues are user error or a bug in xarray or zarr.

Testing in this notebook: https://gist.github.com/abarciauskas-bgse/7b0cad2dc2808a5ee7fc0676b59cf1a5

cc @ayushnag @sharkinsspatial @TomNicholas in case you can see what I'm doing wrong.

@sharkinsspatial
Collaborator

Closed by #420
