-
Notifications
You must be signed in to change notification settings - Fork 36
Initial support for Zarr v3 #290
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
| } | ||
|
|
||
| void checkMultiscale(Map<String, Object> multiscale, String name) { | ||
| assertEquals(getNGFFVersion(), multiscale.get("version")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if this multiscale is zarr v3 but in OME-Zarr v0.5, the version is not under multiscale any more but under "attributes": {"ome": {"version": "0.5", "multiscales": {...}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, thanks @will-moore. I'll need to update that in both the conversion and the test then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should have been fixed with 8e12cbe.
|
With last couple of commits, I think this is now in a state where it would be good to have more eyes on code/testing. One concern I have when trying to add sharding options and associated tests is whether it makes sense to allow shard (or even tile) sizes to be specified per-resolution. It's pretty easy to get into a scenario where the specified shard size works for the largest 1 or 2 resolutions, but not for anything smaller. Right now, that should result in a warning and no sharding applied, but that might not be ideal, so other thoughts definitely welcome. |
| * @return UCAR array type | ||
| */ | ||
| public static ucar.ma2.DataType getDataType(int type) { | ||
| switch (type) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The "BIT" type https://github.com/ome/bioformats/blob/develop/components/formats-api/src/loci/formats/FormatTools.java#L840
is not supported. Should it be added?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about LONG and ULONG?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BIT added in 4868446. LONG and ULONG and not valid types in OME schema, so are intentionally omitted here as no input data would be of those types.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Started testing the Zarr v3 conversion with three public representative datasets, previously used for the testing of glencoesoftware/raw2ometiff#80
- Leica-1.scn an RGB brightfield whole slide image
- NIRHTa+001 a 384 well plate from the BBBC017 collection
- LuCa-7color_Scan1.qptiff a multi-channel fluorescent whole slide image
Initially, all three images were converted with bioformats2raw both with the default options and with the --v3 flag using the following script
PATH=./bioformats2raw-0.11.0-SNAPSHOT/bin:$PATH
mkdir -p default
rm -rf default/*
time bioformats2raw sources/Leica-1.scn default/Leica-1.ome.zarr
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB default/NIRHTa+001.ome.zarr
time bioformats2raw sources/LuCa-7color_Scan1.qptiff default/LuCa-7color_Scan1.ome.zarr
mkdir -p v3
rm -rf v3/*
time bioformats2raw sources/Leica-1.scn v3/Leica-1.ome.zarr --v3
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3/NIRHTa+001.ome.zarr --v3
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3/LuCa-7color_Scan1.ome.zarr --v3The execution times and the sizes of the files are as follows
omero@ngff:/mnt/data/seb/bioformats2raw_290$ ./convert.sh
real 1m3.209s
user 4m29.618s
sys 0m6.989s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
real 0m58.257s
user 3m11.635s
sys 0m11.610s
real 0m47.745s
user 3m16.102s
sys 0m6.499s
real 2m6.667s
user 9m0.141s
sys 0m5.642s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
real 2m38.243s
user 8m38.694s
sys 0m11.849s
real 1m54.085s
user 8m1.865s
sys 0m5.122s
omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh default/*
4.3G default/Leica-1.ome.zarr
3.2G default/LuCa-7color_Scan1.ome.zarr
4.0G default/NIRHTa+001.ome.zarr
12G total
omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh v3/*
3.2G v3/Leica-1.ome.zarr
1.9G v3/LuCa-7color_Scan1.ome.zarr
2.3G v3/NIRHTa+001.ome.zarr
7.3G total
and the generated Zarr datasets have been uploaded to a temporary prefix of the public gs-public-zarr-archive AWS S3 bucket under bioformats2raw_290 and can be inspected via the OME NGFF validator e.g.
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/default/Leica-1.ome.zarr/1
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1
A couple of immediate findings:
- the conversion time was found to be twice slower when generated Zarr v3
- the default dataset size was reduced with Zarr v3 and practically the individual chunks take less space
- does this mean the default compression options are different? Could that partly explain the conversion difference?
- the validator reports a few metadata validation issues:
- the
dimension_namesattribute under the array attributes isnull. - the
plateandwellattributes are not nested underome - the
omeroattribute is not underome
- the
- when loading the chunks using the validator, they appear as corrupted. I have not confirmed this with an independent viewer yet
Pretty sure that's a difference in the default blosc cname; the size difference seems to be consistent with what's noted in glencoesoftware/zarr2zarr#9. Definitely open to changing defaults for 0.5/v3, but we'd need to collectively agree on what they should be.
Should be fixed with last few commits on this PR.
Still looking into what's going wrong here. |
|
As suggested by @sbesson, a simple test with zarr-python: indicates that zarr-python sees the v2 and v3 arrays as having equivalent contents. Looking at https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1, when loading chunk 0 of That suggests to me that decompression is just not being performed. If I then generate a simple sharded array with zarr-python and default compression: then and If my understanding is not correct and the |
|
Thanks both, that is really useful and seems to confirm that
Yes, this had been reported in #290 (review) and the test files haven't been regenerated since. I will do that using @melissalinkert's latest commits unless additional work is planned and update the datasets on our public buckets so that we can reduce the number of moving targets.
I see the same and this is surprising as there are absolutely no CORS errors when accessing for instance https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/CMU-1.ome.zarr/0/ (OME-NGFF 0.4) which is hosted under the same bucket with the same policy.
Thanks for the detailed investigation. My reading of the specification is that both configuration are equivalently valid and actually communicate different storage states. In the first one, the chunks within the shards are individually compressed with |
|
348190c adds a Something like: and then comparing |
|
Thanks @melissalinkert, I regenerated the Zarr files using the latest build from this PR and the following scripts Execution times
File sizes
OME NGFF validator links
A few initial observations
|



Opening as a draft to show some initial progress, but there are still a few things to do here before this is ready for full review:
--shard-width,--shard-height,--shard-depthto match chunk options?)A simple test of
bin/bioformats2raw --v3 test.fake test-v3.zarror similar should work though, and v2 output should be unaffected as indicated by passing tests.