Conversation

@melissalinkert
Member

Opening as a draft to show some initial progress, but there are still a few things to do here before this is ready for full review:

  • make compression options work with Zarr v3 (right now only default codec supported)
  • add sharding options (--shard-width, --shard-height, --shard-depth to match chunk options?)
  • add tests that compare v2 against v3 output with various combinations of options
  • write correct HCS metadata for v3

A simple test such as bin/bioformats2raw --v3 test.fake test-v3.zarr should work, though, and v2 output should be unaffected, as indicated by the passing tests.

}

void checkMultiscale(Map<String, Object> multiscale, String name) {
  assertEquals(getNGFFVersion(), multiscale.get("version"));
Contributor

I'm not sure if this multiscale is zarr v3, but in OME-Zarr v0.5 the version is no longer under multiscale; it is under "attributes": {"ome": {"version": "0.5", "multiscales": {...}}}.
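
For reference, the group-level zarr.json in OME-Zarr 0.5 is expected to look roughly like this (a sketch following the 0.5 spec, with the multiscales content elided):

{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {
    "ome": {
      "version": "0.5",
      "multiscales": [ ... ]
    }
  }
}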

Member Author

Good point, thanks @will-moore. I'll need to update that in both the conversion and the test then.

Member Author

This should have been fixed with 8e12cbe.

@melissalinkert melissalinkert marked this pull request as ready for review October 1, 2025 17:10
@melissalinkert melissalinkert requested a review from sbesson October 1, 2025 17:10
@melissalinkert
Member Author

With the last couple of commits, I think this is now in a state where it would be good to have more eyes on the code and testing.

One concern I have when trying to add sharding options and associated tests is whether it makes sense to allow shard (or even tile) sizes to be specified per resolution. It's pretty easy to get into a scenario where the specified shard size works for the largest one or two resolutions, but not for anything smaller (see the sketch below). Right now, that should result in a warning and no sharding being applied, but that might not be ideal, so other thoughts are definitely welcome.
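
To illustrate the concern (a minimal sketch with hypothetical sizes, assuming each resolution halves the previous one):

# hypothetical full-resolution image size (Y, X) and requested shard size
full = (32768, 32768)
shard = (4096, 4096)

for res in range(6):
    h, w = full[0] >> res, full[1] >> res
    fits = h >= shard[0] and w >= shard[1]
    print(f"resolution {res}: {w}x{h}, {shard[1]}x{shard[0]} shard fits: {fits}")

# resolutions 0-3 can hold a full shard; 4 and 5 cannot, which is where the
# converter currently warns and skips sharding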

 * @return UCAR array type
 */
public static ucar.ma2.DataType getDataType(int type) {
  switch (type) {
Contributor

What about LONG and ULONG?

Member Author

BIT added in 4868446. LONG and ULONG are not valid types in the OME schema, so they are intentionally omitted here, as no input data would be of those types.

Member

@sbesson sbesson left a comment

Started testing the Zarr v3 conversion with three representative public datasets, previously used for testing glencoesoftware/raw2ometiff#80.

Initially, all three images were converted with bioformats2raw both with the default options and with the --v3 flag, using the following script:

PATH=./bioformats2raw-0.11.0-SNAPSHOT/bin:$PATH
mkdir -p default
rm -rf default/*
time bioformats2raw sources/Leica-1.scn default/Leica-1.ome.zarr
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB default/NIRHTa+001.ome.zarr
time bioformats2raw sources/LuCa-7color_Scan1.qptiff default/LuCa-7color_Scan1.ome.zarr

mkdir -p v3
rm -rf v3/*
time bioformats2raw sources/Leica-1.scn v3/Leica-1.ome.zarr --v3
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3/NIRHTa+001.ome.zarr --v3
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3/LuCa-7color_Scan1.ome.zarr --v3

The execution times and the sizes of the output files are as follows:

omero@ngff:/mnt/data/seb/bioformats2raw_290$ ./convert.sh 

real    1m3.209s
user    4m29.618s
sys     0m6.989s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

real    0m58.257s
user    3m11.635s
sys     0m11.610s

real    0m47.745s
user    3m16.102s
sys     0m6.499s

real    2m6.667s
user    9m0.141s
sys     0m5.642s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

real    2m38.243s
user    8m38.694s
sys     0m11.849s
                
real    1m54.085s
user    8m1.865s
sys     0m5.122s

omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh default/*
4.3G    default/Leica-1.ome.zarr
3.2G    default/LuCa-7color_Scan1.ome.zarr
4.0G    default/NIRHTa+001.ome.zarr
12G     total
omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh v3/*
3.2G    v3/Leica-1.ome.zarr
1.9G    v3/LuCa-7color_Scan1.ome.zarr
2.3G    v3/NIRHTa+001.ome.zarr
7.3G    total

and the generated Zarr datasets have been uploaded to a temporary prefix of the public gs-public-zarr-archive AWS S3 bucket under bioformats2raw_290, and can be inspected via the OME NGFF validator, e.g.
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/default/Leica-1.ome.zarr/1
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1

A couple of immediate findings:

  • the conversion was roughly twice as slow when generating Zarr v3
  • the dataset size was reduced with Zarr v3, and in practice the individual chunks take less space
    • does this mean the default compression options are different? Could that partly explain the conversion-time difference?
  • the validator reports a few metadata validation issues (these can be checked directly with the sketch below):
    • the dimension_names attribute under the array attributes is null
    • the plate and well attributes are not nested under ome
    • the omero attribute is not under ome
  • when loading the chunks using the validator, they appear corrupted. I have not confirmed this with an independent viewer yet
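
The metadata points above can be reproduced directly from the bucket (a sketch; the zarr.json paths are assumed from the layout described earlier):

import json
import urllib.request

base = ("https://gs-public-zarr-archive.s3.amazonaws.com/"
        "bioformats2raw_290/v3/Leica-1.ome.zarr/1")

# array-level metadata: dimension_names is reported as null by the validator
array_meta = json.load(urllib.request.urlopen(base + "/0/zarr.json"))
print(array_meta.get("dimension_names"))

# group-level metadata: inspect which attributes are (and are not) nested under "ome"
group_meta = json.load(urllib.request.urlopen(base + "/zarr.json"))
print(json.dumps(group_meta.get("attributes", {}), indent=2))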

@will-moore
Contributor

Looking at the chunks loading in browser devtools, it's reporting CORS errors and 403 on the chunks:

[Screenshot 2025-10-08 at 17:11:18]

Vol-E and vizarr are showing corrupted-looking images:

[Screenshot 2025-10-08 at 17:15:22]

I see in the dev-tools that some chunks (e.g. dataset 8) are giving CORS errors, but others are OK, e.g. dataset 5.
Checking in the validator, I can confirm that loading a chunk for dataset 5 works "OK" (no CORS error) but shows a corrupted appearance.

Neuroglancer doesn't show anything (no errors either)?

@melissalinkert
Member Author

the dataset size was reduced with Zarr v3, and in practice the individual chunks take less space

  • does this mean the default compression options are different? Could that partly explain the conversion-time difference?

Pretty sure that's a difference in the default blosc cname; the size difference seems consistent with what's noted in glencoesoftware/zarr2zarr#9. Definitely open to changing the defaults for 0.5/v3, but we'd need to collectively agree on what they should be.
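
One way to compare the stored defaults directly (a sketch; the array paths are assumed from the bucket layout above):

import json
import urllib.request

base = "https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290"

# v2: the compressor (including the blosc cname) is stored in .zarray
v2_meta = json.load(urllib.request.urlopen(base + "/default/Leica-1.ome.zarr/1/0/.zarray"))
print(v2_meta["compressor"])

# v3: the codec chain (including any blosc configuration) is stored in zarr.json
v3_meta = json.load(urllib.request.urlopen(base + "/v3/Leica-1.ome.zarr/1/0/zarr.json"))
print(v3_meta["codecs"])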

the validator reports a few metadata validation issues:

Should be fixed with the last few commits on this PR.

when loading the chunks using the validator, they appear corrupted. I have not confirmed this with an independent viewer yet

Still looking into what's going wrong here.

@melissalinkert
Member Author

As suggested by @sbesson, a simple test with zarr-python:

import zarr
import numpy as np

for res in range(0, 4):
    z_default = zarr.open(store="https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/default/Leica-1.ome.zarr/1/" + str(res))
    z_v3 = zarr.open(store="https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1/" + str(res))
    print(np.array_equal(z_default[0, 0, 0, :, :], z_v3[0, 0, 0, :, :]))

indicates that zarr-python sees the v2 and v3 arrays as having equivalent contents.

Looking at https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1, when loading chunk 0 of 0/zarr.json, I see that approximately the top third of the chunk image is corrupt data and the remainder is black. Looking at the size of the matching chunk file, it is approximately one third of the uncompressed chunk size (assuming the default 1024×1024 uint8 chunks, the uncompressed chunk would be 1048576 bytes, and 368707 bytes is roughly a third of that):

$ aws s3 ls s3://gs-public-zarr-archive/bioformats2raw_290/v3/Leica-1.ome.zarr/1/0/c/0/0/0/0/0 --no-sign-request
2025-10-08 05:31:42     368707 0

That suggests to me that decompression is just not being performed.

If I then generate a simple sharded array with zarr-python and default compression:

import zarr
import numpy as np

z = zarr.create_array(
    store="v3-shard-test.zarr",
    shape=(256, 256),
    chunks=(1024, 1024),
    shards=(1024, 1024),
    dtype="B"
)
z[:, :] = np.mgrid[0:256, 0:256][1]

then v3-shard-test.zarr/zarr.json shows the zstd codec nested within the configuration for the sharding_indexed codec. A similar simple test with the current state of this PR:

bin/bioformats2raw "test&sizeX=256&sizeY=256.fake" bf-v3-test.zarr --v3 --compact

and bf-v3-test.zarr/zarr.json shows that the blosc codec is a separate entry in the codecs array after the sharding_indexed codec. My understanding based on https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id22 is that either representation should be readable, which suggests that a next step might be to confirm that zarrita is working as intended on both of these minimal examples.
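
For illustration, the two layouts in zarr.json look roughly like this (hand-written sketches; the chunk shapes and codec parameters are placeholders, not the converter's exact output). Chunk-level compression, with the compressor nested inside the sharding configuration:

"codecs": [
  {
    "name": "sharding_indexed",
    "configuration": {
      "chunk_shape": [1024, 1024],
      "codecs": [{"name": "bytes"}, {"name": "zstd"}],
      "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}]
    }
  }
]

Shard-level compression, with the compressor as a separate entry after the sharding codec:

"codecs": [
  {
    "name": "sharding_indexed",
    "configuration": {
      "chunk_shape": [1024, 1024],
      "codecs": [{"name": "bytes"}],
      "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}]
    }
  },
  {"name": "blosc", "configuration": {"cname": "lz4", "clevel": 5}}
]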

If my understanding is not correct and the blosc/zstd codec must be nested within the sharding_indexed codec, then one more thing to try would be using https://github.com/zarr-developers/zarr-java/blob/main/src/main/java/dev/zarr/zarrjava/v3/codec/CodecBuilder.java#L126 here.

@will-moore
Contributor

I tried that validator link, but loading chunks failed with a CORS error.

Viewing in napari with:

$ napari --plugin napari-ome-zarr https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1/

loaded the image, but I see the omero metadata (channel colours etc.) isn't being read because it's not nested within the ome block; it's actually a sibling under attributes in zarr.json.

[Screenshot 2025-10-23 at 08:59:44]

@sbesson
Member

sbesson commented Oct 23, 2025

Thanks both, that is really useful and seems to confirm that zarr-python can read the chunks generated by zarr-java.

loaded the image but I see the omero metadata (channel colours etc) isn't being read because it's not nested within the ome block but is actually a sibling under attributes in zarr.json.

Yes, this had been reported in #290 (review) and the test files haven't been regenerated since. I will do that using @melissalinkert's latest commits, unless additional work is planned, and update the datasets on our public buckets so that we can reduce the number of moving targets.

I tried that validator link but loading chunks failed with CORS error.

I see the same, and this is surprising, as there are absolutely no CORS errors when accessing, for instance, https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/CMU-1.ome.zarr/0/ (OME-NGFF 0.4), which is hosted in the same bucket with the same policy.
I propose to retest once the datasets have been regenerated and we have addressed the other validation issues.

then v3-shard-test.zarr/zarr.json shows the zstd codec nested within the configuration for the sharding_indexed codec. A similar simple test with the current state of this PR and bf-v3-test.zarr/zarr.json shows that the blosc codec is a separate entry in the codecs array after the sharding_indexed codec

Thanks for the detailed investigation. My reading of the specification is that both configurations are equally valid but actually describe different storage states: in the first, the chunks within the shards are individually compressed with zstd, while in the second, the entire shard is compressed with zstd. If this is the case, that's really useful to know, as we will want to assess the pros and cons of each configuration and decide whether the converter should offer the granularity of defining shard-level vs chunk-level compression (and how) or hide this complexity from the end user.

@melissalinkert
Member Author

348190c adds a --compress-inner-chunk option so that we can compare the two ways of compressing shards/chunks. I don't particularly love that option name, and as with the other sharding options there is definitely room for discussion, but this should give us something to proceed with for testing.

Something like:

$ bin/bioformats2raw --v3 test.fake v3-default.zarr
$ bin/bioformats2raw --v3 --compress-inner-chunk test.fake v3-compress-inner-chunk.zarr

and then comparing v3-*.zarr/0/0/zarr.json should show the difference.
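
For example, a quick way to diff just the codec chains of the two outputs (a sketch, assuming the v3-*.zarr/0/0/zarr.json paths above):

import json

for name in ("v3-default", "v3-compress-inner-chunk"):
    with open(f"{name}.zarr/0/0/zarr.json") as f:
        meta = json.load(f)
    print(name, json.dumps(meta["codecs"], indent=2))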

@sbesson
Member

sbesson commented Oct 24, 2025

Thanks @melissalinkert, I regenerated the Zarr files using the latest build from this PR and the following script:

PATH=./bioformats2raw-0.12.0-SNAPSHOT/bin:$PATH
mkdir -p default
rm -rf default/*
time bioformats2raw sources/Leica-1.scn default/Leica-1.ome.zarr
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB default/NIRHTa+001.ome.zarr
time bioformats2raw sources/LuCa-7color_Scan1.qptiff default/LuCa-7color_Scan1.ome.zarr

mkdir -p v3
rm -rf v3/*
time bioformats2raw sources/Leica-1.scn v3/Leica-1.ome.zarr --v3
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3/NIRHTa+001.ome.zarr --v3
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3/LuCa-7color_Scan1.ome.zarr --v3

mkdir -p v3_chunk_compressed
rm -rf v3_chunk_compressed/*
time bioformats2raw sources/Leica-1.scn v3_chunk_compressed/Leica-1.ome.zarr --v3 --compress-inner-chunk
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3_chunk_compressed/NIRHTa+001.ome.zarr --v3 --compress-inner-chunk
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3_chunk_compressed/LuCa-7color_Scan1.ome.zarr --v3 --compress-inner-chunk

Execution times:

                              Leica-1.ome.zarr  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     0m56.890s         0m52.013s            0m46.134s
--v3                          1m57.536s         2m32.787s            1m47.665s
--v3 --compress-inner-chunk   1m56.552s         2m26.934s            1m33.445s

File sizes:

                              Leica-1.ome.zarr  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     4.3G              3.2G                 4.0G
--v3                          3.2G              1.9G                 2.3G
--v3 --compress-inner-chunk   3.2G              1.8G                 2.3G

OME NGFF validator links:

                              Leica-1.ome.zarr    NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0
--v3                          Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0
--v3 --compress-inner-chunk   Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0

A few initial observations:

  • for HCS, the plate-level metadata is still missing a version under the ome attribute

  • for all modalities, the bioformats2raw.layout key should be under the ome attribute - https://ngff.openmicroscopy.org/0.5/#bf2raw-attributes

  • the Leica-1 dataset with compression at the chunk level (--v3 --compress-inner-chunk) now loads in the validator, while the one with compression at the shard level (--v3) still appears corrupted

  • the conversion failed for LuCa-7color_Scan1.ome.zarr with --v3 --compress-inner-chunk, with errors of the type

    2025-10-23 21:42:25,666 [pool-1-thread-4] ERROR c.g.bioformats2raw.Converter - Failure processing chunk; resolution=1 plane=1 xx=12288 yy=0 zz=0 width=192 height=1024 depth=1
    java.lang.RuntimeException: dev.zarr.zarrjava.ZarrException: Could not read shard index.
     at dev.zarr.zarrjava.v3.Array.lambda$read$0(Array.java:185)
     at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
     at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
     at dev.zarr.zarrjava.v3.Array.read(Array.java:154)
     at com.glencoesoftware.bioformats2raw.Converter.readAsBytesV3(Converter.java:1993)
     at com.glencoesoftware.bioformats2raw.Converter.getTileDownsampled(Converter.java:2219)
     at com.glencoesoftware.bioformats2raw.Converter.getTile(Converter.java:2270)
     at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:2514)
     at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$8(Converter.java:2747)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: dev.zarr.zarrjava.ZarrException: Could not read shard index.
     at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodeInternal(ShardingIndexedCodec.java:198)
     at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodePartial(ShardingIndexedCodec.java:250)
     at dev.zarr.zarrjava.v3.codec.CodecPipeline.decodePartial(CodecPipeline.java:93)
     at dev.zarr.zarrjava.v3.Array.lambda$read$0(Array.java:173)
     ... 11 common frames omitted
    
