-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Order field should be 'F' #78
Comments
This is a bit more tricky, because you have to define what exactly 'C' and 'F' mean here. Almost all datasets in the wild are stored in 'C' order. If we would read these datasets in the same order, we would have to transpose everything to mimic the same dimension order in Julia. In order to avoid this problem, we simply revert the order of all dimensions, which is the equivalent to what packages like NetCDF.jl HDF5.jl and others are doing. This means that when you save an array of size (2,3,4) it will write a size of (4,3,2) to the metadata and pretend to have written the data in C order. The same happens when reading a dataset saved by e.g. the python zarr implementation. Dimensions will always be reversed in comparison to numpy arrays, e.g. when I save a 100x5 matrix in python, the Julia Zarr implementation will return a 5x100 matrix. |
I'm very sympathetic to this challenge. However, I'm not sure we should break spec to overcome. As you can see below, python zarr handles reading both order='F' and order='C' into row major, while Zarr.jl is out of spec. I propose that the most consistent approach is for Zarr.jl to transpose order='C' when reading, read order='F' as is, and save order='F'. Python -> Julia>>> import zarr, numpy as np
>>> zarr.array(np.arange(6).reshape(3,2), order='F',store='f.zarr')
<zarr.core.Array (3, 2) int64>
>>> zarr.array(np.arange(6).reshape(3,2), order='C',store='c.zarr')
<zarr.core.Array (3, 2) int64>
>>> zarr.open('f.zarr')[:]
array([[0, 1],
[2, 3],
[4, 5]])
>>> zarr.open('c.zarr')[:]
array([[0, 1],
[2, 3],
[4, 5]]) julia> using Zarr
julia> zopen("c.zarr")[:,:]
2×3 reshape(::Matrix{Union{Missing, Int64}}, 2, 3) with eltype Union{Missing, Int64}:
missing 2 4
1 3 5
julia> zopen("f.zarr")[:,:]
2×3 reshape(::Matrix{Union{Missing, Int64}}, 2, 3) with eltype Union{Missing, Int64}:
missing 4 3
2 1 5 There are three errors: 1) value should be 0, not missing. 2) The shape should be (3,2) 3) Zarr.jl should read "c.zarr" and "f.zarr" as the same matrix. Julia -> Pythonjulia> a = zcreate(Int,3,2,path="julia.zarr")
ZArray{Int64} of size 3 x 2
julia> a .= reshape(collect(1:6),3,2)
ZArray{Int64} of size 3 x 2
julia> a[:,:]
3×2 Matrix{Int64}:
1 4
2 5
3 6 >>> zarr.open('julia.zarr')[:]
array([[1, 2, 3],
[4, 5, 6]]) There is one error: the matrix was saved with the wrong dimensions. Edit: updated for clarity. Also, HDF5.jl ought to operate differently since HDF5 forces storing data row-major, whereas Zarr supports column-major. |
Regarding 1) this is expected since python sets the fill value to zero, so zeros are supposed to be interpreted as missings. With respect to 3) I think that opening a "f.zarr" should currently throw an error instead of returning the wrong data (and I thought it did, this was an oversight and I will prepare a PR). So to me we should really discuss point 2) and I would very much appreciate if others (@Alexander-Barth @visr ) could voice their opinions as well and I think a lot of this is connected to personal taste. Maybe this is domain-dependent, but usually there are conventions about the storage order of certain datasets. For example in climate science when you have arrays of dimensions longitude, latitude and time you would almost always have time as the record dimension with their index changing slowest, because many operations would be done map-by-map. So opening such a dataset, I would expect something like lon-lat-time in Julia while in Python I would expect time-lat-lon arrays. The point is that when we start transposing the arrays by default, this would first add an extra copy of the data and in addition lead to inefficient cache access when accessing the data in map slices, i.e. slower computations. I know that one could avoid the copy by using A second argument would be that datasets saved by xarray have named dimensions, so axes are not defined by their position bu by name. When you open a dataset e.g. with YAXArrays.jl axis names and values are read correctly and you can do slices and subsetting by axis name. In the future this will hopefully be possible for DimensionalData.jl as well. |
Yes, would love to hear input from folks in other domains. I’m using Zarr to pass large matrices between Julia and Python in different pipeline steps. IMHO row major versus column major is an implementation detail, but two matrices with different shapes are not the same matrix. Two different Zarr implementations should not read a different matrix from the same file. Hope I’m not coming off as critical—been having a great experience with Zarr.jl and would highly recommend over HDF5! |
I would (by far) favour that the order of dimensions is lon, lat, time in Julia if the file is written as time, lat, lon in python (when written with the default C ordering). As I understand, this is the current behaviour in Zarr.jl.
https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/cf-conventions.html The longitude dimension should vary the fastest. For my experience, this part of the CF conventions is well followed for NetCDF files. |
The issue is that the current behavior is out of spec for Zarr. Quite simply, we are saving data as column-major, but metadata labels it as row-major. Right now, Zarr.jl does not support column-major data storage in Zarr — we instead use a hack to save it transposed in row-major. Totally appreciate that NetCDF may benefit from different behavior, specifically for latitude and longitude. it seems that switching to using order=F gets best of both worlds? This wasn’t possible in HDF5 backend. Edit: to clarify, we are in agreement on the reading side!
|
I'm not sure we are out of spec. I don't read anything in the spec about having to permute the data instead of the dimensions. I do understand that it doesn't make things easier however, I have raised a similar issue in ArchGDAL in the past. If a Zarr file has (shape = (2, 3), order = C), since Julia is order F, we can choose to either permute the dimensions or permute the data. Given that permuting the data is much more expansive, we permute the dimensions instead, so we get shape (3, 2) in Julia. When writing this data, it is equally correct to call it either (shape = (2, 3), order = C) or (shape = (3, 2), order = F). I believe Python does not permute the dimensions, but just tells NumPy what the storage order is (it supports both), and keeps the dimension order the same as what is written in the file, regardless of the storage order. Here is some more reading:
The HDF5.jl issue also links to the HDF5 user guide which explicitly says that for Fortran, permuting the dimensions (not the data) is the right thing to do (HDF5 is always stored in C order).
|
thx for the thoughtful reply!
Yes you are correct. >>> zarr.open('c.zarr')[:].flags
C_CONTIGUOUS : True
F_CONTIGUOUS : False
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
>>> zarr.open('f.zarr')[:].flags
C_CONTIGUOUS : False
F_CONTIGUOUS : True
OWNDATA : True
WRITEABLE : True
ALIGNED : True
WRITEBACKIFCOPY : False
UPDATEIFCOPY : False
I think this is reasonable for HDF5, since no support for F order.
I suppose the question comes down to: should Zarr.jl preserve only memory order, only shape data, or both? Right now, we only preserve memory order. But we could do both:
This way, if order=F we preserve both. If order=C, we preserve memory order. Another benefit is I don't think this would be breaking. |
Yeah that sounds reasonable. I'll leave it up to @meggart if he want to write F order by default. Since it is the Julia order it sounds logical. Though as he said, C order files are much more common in the wild. |
Ok, let's think about the default for order then, I still haven't made up my mind. However, I think the first step would be to make 'F' order possible at all, I will try to do this as soon as I have time. Probably this will be as simple as putting some of the "reverse" statements into a conditional that checks for the ordering, but this would need some thorough testing. |
@tbenst I have created this branch https://github.com/meggart/Zarr.jl/tree/fortran_storage_order which makes the permutation of dimensions conditional on the storage order so that 'F' order files are not messed up anymore. I do not really have the time to polish this up and add unit tests, so it would be great if you could volunteer. I think I would also be fine with making 'F' order the default when creating arrays, but it might be worth checking with the zarr community first to get an overview on how many zarr implementations support Fortran order (I think NCZarr will not). It would also be nice to add a deprecation cycle for this, i.e. throwing a warning in the next version when |
One option might also be to use https://github.com/JuliaPackaging/Preferences.jl so that users can overwrite their preferred default storage order. |
What advantage does Preferences.jl have over setting the order with a keyword argument? The only thing I can think of is if you have wrapping code that does not expose that keyword. The downside is that now writing behaviour becomes dependent on local user settings, which may make things harder to reproduce, if people forget to include these LocalPreferences.toml when sharing code. |
The only advantage would be in case we can not agree on what the default value for the keyword argument should be. Personally I will continue to write data in 'C' format and will have to remember adding the keyword argument every time I write a file when we make 'F' the default. But you are right, it is probably a bad idea in terms of reproducibility, so let's drop this. |
@meggart thank you so much for all your work! I am happy to put some time in polishing this and adding unit tests. I will also make a post to check in for advice with the Zarr community |
Since this package saves data as column-order, the order field should be
F
per https://zarr.readthedocs.io/en/stable/spec/v1.html?highlight=row%20major#metadata, but currently this is set toC
https://github.com/meggart/Zarr.jl/blob/686e41856808d71c7cd878f0c6d142a89c60be37/src/metadata.jl#L153
The text was updated successfully, but these errors were encountered: