Experimental rasterio-based zarr storage #1230
-
Hi @sgillies and welcome to Zarr! Thanks for starting this interesting discussion. It's very exciting to see that this integration is possible. I left some comments on your PR. As you can see, Zarr is designed to be very hackable, particularly at the store level. By creating this key-value shim, you can wrap nearly any array-based storage format or library in Zarr. The downside of this approach is that you can end up with some pretty opaque and complex layers of software. IMO, Kerchunk is especially nice because it puts this logic into a specification rather than into code.

I'd like to learn more about the use case you have in mind here. What's the advantage for users of being able to access this data via Zarr? What users love about rasterio is that it not only reads many file formats--it also "understands" geospatial rasters and supports GIS-style workflows. It seems like this approach of wrapping rasterio in Zarr throws away lots of useful metadata and functionality.

The main use case I can see is to expose rasterio data to processing tools that only understand Zarr. However, most Zarr users in the geospatial world are using it together with Xarray, and Xarray already supports rasterio quite well. Many users I talk to are interested in the opposite integration: how can we store data in Zarr and still have first-class support from the rasterio / GDAL stack? The lack of an official specification for geospatial metadata encoding in Zarr currently limits interoperability, as reflected in several open issues.
Last year, @christophenoel started developing something called "geozarr" to try to resolve some of these issues. He has a repo at christophenoel/geozarr-spec#3, but I'm not sure where that effort stands today.
-
@rabernat Thanks for the response! Sometimes, as the maintainer of Rasterio, I wonder why anyone would want to use a legacy API for accessing data stored in legacy formats when they could use a protocol like Zarr instead 😄, especially for very large scale datasets. That's more or less the motivation for my experiment: trying to figure out whether it's useful to expose small to medium sized legacy data as though it were a subset of a huge dataset.

One thing that I've been slow to understand is that the Zarr community would like to standardize on xarray for data access. It's a good choice! Choices that clarify are a gift to everybody. A possible ramification is that Rasterio can say no to requests to implement GDAL's multidimensional API (HDF-like access to HDF5, netCDF4, and Zarr) and let that be picked up by xarray and supporting software (like rioxarray).
-
I got some time to do a little hacking at work this week. It seems like access to good ole flat rasters via a hierarchical data API could be handy if you want to zarr-ify some GeoTIFFs, JP2s, or GDAL VRTs. To this end I wrote a draft of a new module and class for rasterio: rasterio/rasterio#2623.
RasterioStore represents a flat (non-hierarchical) raster as a Zarr group with one array and translates flat raster blocks or strips into Zarr chunks.
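To make the idea concrete, here is a minimal, hypothetical sketch of such a store (not the actual RasterioStore code from rasterio/rasterio#2623): a read-only mapping that publishes Zarr v2 group and array metadata and answers chunk keys with windowed reads through rasterio. The class name `BlockShimStore`, the array name `"data"`, the single-band assumption, and the zarr-python 2.x store semantics are all assumptions made for illustration.

```python
import json
from collections.abc import MutableMapping

import numpy as np
import rasterio
from rasterio.windows import Window


class BlockShimStore(MutableMapping):
    """Hypothetical read-only Zarr v2 store exposing band 1 of a flat raster."""

    def __init__(self, path, array_name="data"):
        self.ds = rasterio.open(path)
        self.name = array_name
        # Use the raster's native block (tile/strip) shape as the chunk shape.
        self.chunk_h, self.chunk_w = self.ds.block_shapes[0]

    def _zarray(self):
        # Zarr v2 array metadata; no compressor, chunks are served as raw bytes.
        return {
            "zarr_format": 2,
            "shape": [self.ds.height, self.ds.width],
            "chunks": [self.chunk_h, self.chunk_w],
            "dtype": np.dtype(self.ds.dtypes[0]).str,
            "compressor": None,
            "fill_value": 0,
            "order": "C",
            "filters": None,
        }

    def __getitem__(self, key):
        if key == ".zgroup":
            return json.dumps({"zarr_format": 2}).encode()
        if key == f"{self.name}/.zarray":
            return json.dumps(self._zarray()).encode()
        prefix = f"{self.name}/"
        if key.startswith(prefix) and not key[len(prefix):].startswith("."):
            # Chunk keys look like "data/<row>.<col>"; read the matching window.
            j, i = (int(n) for n in key[len(prefix):].split("."))
            window = Window(i * self.chunk_w, j * self.chunk_h,
                            self.chunk_w, self.chunk_h)
            # Boundless read pads edge blocks out to a full Zarr chunk.
            block = self.ds.read(1, window=window, boundless=True, fill_value=0)
            return np.ascontiguousarray(block).tobytes()
        raise KeyError(key)

    def __iter__(self):
        yield from (".zgroup", f"{self.name}/.zarray")

    def __len__(self):
        return 2

    def __setitem__(self, key, value):
        raise NotImplementedError("read-only store")

    def __delitem__(self, key):
        raise NotImplementedError("read-only store")
```

With zarr-python 2.x, such a mapping can be opened like any other store (the path below is a placeholder):

```python
import zarr

store = BlockShimStore("example.tif")  # any GDAL-readable flat raster
group = zarr.open_group(store, mode="r")
print(group["data"][:256, :256])
```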
It works! I admire how approachable Zarr is, and wish I could say the same about rasterio.
Clearly, this work has some things in common with kerchunk. But instead of using byte offsets into files and resources on the web, rasterio's store uses offsets into GDAL's raster data I/O API.
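For contrast, a kerchunk-style reference set (the version 1 format consumed by fsspec's ReferenceFileSystem) resolves the same chunk keys to byte ranges in the original file rather than to GDAL reads. The URL, offsets, and lengths below are invented purely for illustration:

```python
# Illustrative kerchunk-style references: each Zarr chunk key maps to
# [url, byte_offset, byte_length] in the source file, instead of being
# answered by a windowed read through GDAL. All values here are made up.
references = {
    "version": 1,
    "refs": {
        ".zgroup": '{"zarr_format": 2}',
        # "data/.zarray" would hold the array metadata JSON as a string.
        "data/0.0": ["https://example.com/scene.tif", 8192, 65536],
        "data/0.1": ["https://example.com/scene.tif", 73728, 65536],
    },
}
```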
I'd love some feedback on this. Is this aligned with where Zarr is going?