Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Basic rechunking example #539

Closed
norlandrhagen opened this issue Aug 5, 2024 · 9 comments · Fixed by #681
Closed

Basic rechunking example #539

norlandrhagen opened this issue Aug 5, 2024 · 9 comments · Fixed by #681
Labels
bug Something isn't working xarray-integration Uses or required for cubed-xarray integration

Comments

@norlandrhagen
Copy link
Contributor

Working my way through understanding cubed / cubed-xarray.

I'm trying to get an example working of modifying the chunking of an Xarray dataset and writing it to Zarr. When I roundtrip the Zarr to and from Xarray, it seems like the chunking structure hasn't changed. Is using the .chunk method on an Xarray dataset with cubed viable or should I be using rechunk primitive?

Roundtrip example using Xarray + dask chunks

import xarray as xr 
from zarr.storage import TempStore

ts = TempStore('air_temp_dask.zarr')

ds = xr.tutorial.open_dataset('air_temperature', chunks={})
rds = ds.chunk({'time':1})
rds.to_zarr(ts, consolidated=True)

rtds = xr.open_zarr(ts, chunks={})
rtds

assert rtds.chunks == rds.chunks

Roundtrip example using Xarray + cubed

from cubed import Spec
import xarray as xr 
from zarr.storage import TempStore

ts = TempStore('air_temp_cubed.zarr')

spec = Spec(work_dir='tmp', allowed_mem='2GB')
ds = xr.tutorial.open_dataset('air_temperature', chunked_array_type='cubed',
     from_array_kwargs={'spec': spec},chunks={})

rds = ds.chunk({'time':1}, chunked_array_type="cubed")

# does compute need to be called?
# rds.compute()

rds.to_zarr(ts, consolidated=True, chunkmanager_store_kwargs={'from_array_kwargs': {'spec': spec} })

rtds = xr.open_zarr(ts, chunked_array_type='cubed',
     from_array_kwargs={'spec': spec},chunks={})
     
# This fails
assert rtds.chunks == rds.chunks

chunked dataset (rds):
image

roundtripped dataset (rtds):
image

🤞 this is an end-of-day brain implementation issue on my end.

@TomNicholas
Copy link
Member

TomNicholas commented Aug 5, 2024

What you're doing is correct, you should be able to use all the Xarray syntax you normally would without called anything from Cubed directly.

Using the .chunk method on Xarray is supposed to be viable but there is at least one bug in cubed-xarray that was found earlier today (see PR on cubed-xarray). There may be another bug here!

Cubed-xarray is currently under-tested relative to Cubed alone.

@TomNicholas
Copy link
Member

Okay so

> rds = ds.chunk({'time':1}, chunked_array_type="cubed")

You should not need to add chunked_array_type="cubed" here, it's supposed to automatically see that you're using cubed and assume you want to keep using cubed. I don't know why that's broken on xarray main but I was in the process of refactoring xarray's .chunk method anyway in pydata/xarray#9286 and that seems to fix it 🤷‍♂️

I'm also able to reproduce your the bug with writing out the wrong chunks. However when I instead try writing out just one array using cubed.to_zarr I see the expected chunks, i.e.

cubed.to_zarr(rds['air'].data, ts)

Which at least means the bug is in xarray / cubed-xarray rather than in cubed.

@norlandrhagen
Copy link
Contributor Author

Ah good to know it's probably a cubed-xarray bug. Would it be helpful to repost/move the issue there or cross ref it?

@TomNicholas
Copy link
Member

Thanks for your patience! Cross-referencing it in a new issue on cubed-xarray could be helpful.

@tomwhite
Copy link
Member

tomwhite commented Aug 8, 2024

I tracked down the problem - this seems to fix it: pydata/xarray#9326

@tomwhite
Copy link
Member

The Xarray issue has been merged now so it might be worth seeing if it fixes your original issue @norlandrhagen.

@norlandrhagen
Copy link
Contributor Author

Just tried with the main Xarray branch and it worked! 🎈 Thanks for the fix. Is it worth pinning xarray-cubed to and above the next release of Xarray?

@tomwhite
Copy link
Member

Great!

Is it worth pinning xarray-cubed to and above the next release of Xarray?

That might be a good idea.

@TomNicholas
Copy link
Member

Is it worth pinning xarray-cubed to and above the next release of Xarray?

Yes definitely. I'm also about to suggest we rename the ChunkManager to ComputeManger to better reflect it's updated responsibilities in light of pydata/xarray#9286, which would be a breaking change for cubed-xarray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working xarray-integration Uses or required for cubed-xarray integration
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants