Conversation

@melissalinkert
Member

Opening as a draft to show some initial progress, but there are still a few things to do here before this is ready for full review:

  • make compression options work with Zarr v3 (right now only default codec supported)
  • add sharding options (--shard-width, --shard-height, --shard-depth to match chunk options?)
  • add tests that compare v2 against v3 output with various combinations of options
  • write correct HCS metadata for v3

A simple test such as bin/bioformats2raw --v3 test.fake test-v3.zarr should work, though, and v2 output should be unaffected, as indicated by the passing tests.

}

void checkMultiscale(Map<String, Object> multiscale, String name) {
  assertEquals(getNGFFVersion(), multiscale.get("version"));
Contributor

I'm not sure if this multiscale is zarr v3, but in OME-Zarr v0.5 the version is no longer under multiscale; it is under "attributes": {"ome": {"version": "0.5", "multiscales": {...}}}.
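
For reference, the group-level zarr.json in OME-Zarr 0.5 is expected to look roughly like this (a sketch following the 0.5 spec, with the multiscales content elided):

{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {
    "ome": {
      "version": "0.5",
      "multiscales": [ ... ]
    }
  }
}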

Member Author

Good point, thanks @will-moore. I'll need to update that in both the conversion and the test then.

Member Author

This should have been fixed with 8e12cbe.

@melissalinkert melissalinkert marked this pull request as ready for review October 1, 2025 17:10
@melissalinkert melissalinkert requested a review from sbesson October 1, 2025 17:10
@melissalinkert
Member Author

With the last couple of commits, I think this is now in a state where it would be good to have more eyes on the code and testing.

One concern I have when trying to add sharding options and associated tests is whether it makes sense to allow shard (or even tile) sizes to be specified per resolution. It's pretty easy to get into a scenario where the specified shard size works for the largest one or two resolutions, but not for anything smaller (see the sketch below). Right now, that should result in a warning and no sharding being applied, but that might not be ideal, so other thoughts are definitely welcome.
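
To illustrate the concern (a minimal sketch with hypothetical sizes, assuming each resolution halves the previous one):

# hypothetical full-resolution image size (Y, X) and requested shard size
full = (32768, 32768)
shard = (4096, 4096)

for res in range(6):
    h, w = full[0] >> res, full[1] >> res
    fits = h >= shard[0] and w >= shard[1]
    print(f"resolution {res}: {w}x{h}, {shard[1]}x{shard[0]} shard fits: {fits}")

# resolutions 0-3 can hold a full shard; 4 and 5 cannot, which is where the
# converter currently warns and skips sharding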

 * @return UCAR array type
 */
public static ucar.ma2.DataType getDataType(int type) {
  switch (type) {
Contributor

What about LONG and ULONG?

Member Author

BIT added in 4868446. LONG and ULONG are not valid types in the OME schema, so they are intentionally omitted here, as no input data would be of those types.

Member

@sbesson sbesson left a comment

Started testing the Zarr v3 conversion with three representative public datasets, previously used for testing glencoesoftware/raw2ometiff#80.

Initially, all three images were converted with bioformats2raw both with the default options and with the --v3 flag, using the following script:

PATH=./bioformats2raw-0.11.0-SNAPSHOT/bin:$PATH
mkdir -p default
rm -rf default/*
time bioformats2raw sources/Leica-1.scn default/Leica-1.ome.zarr
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB default/NIRHTa+001.ome.zarr
time bioformats2raw sources/LuCa-7color_Scan1.qptiff default/LuCa-7color_Scan1.ome.zarr

mkdir -p v3
rm -rf v3/*
time bioformats2raw sources/Leica-1.scn v3/Leica-1.ome.zarr --v3
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3/NIRHTa+001.ome.zarr --v3
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3/LuCa-7color_Scan1.ome.zarr --v3

The execution times and the sizes of the output files are as follows:

omero@ngff:/mnt/data/seb/bioformats2raw_290$ ./convert.sh 

real    1m3.209s
user    4m29.618s
sys     0m6.989s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

real    0m58.257s
user    3m11.635s
sys     0m11.610s

real    0m47.745s
user    3m16.102s
sys     0m6.499s

real    2m6.667s
user    9m0.141s
sys     0m5.642s
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.reflectasm.AccessClassLoader (file:/mnt/data/seb/bioformats2raw_290/bioformats2raw-0.11.0-SNAPSHOT/lib/reflectasm-1.11.9.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int,java.security.ProtectionDomain)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.reflectasm.AccessClassLoader
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release

real    2m38.243s
user    8m38.694s
sys     0m11.849s
                
real    1m54.085s
user    8m1.865s
sys     0m5.122s

omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh default/*
4.3G    default/Leica-1.ome.zarr
3.2G    default/LuCa-7color_Scan1.ome.zarr
4.0G    default/NIRHTa+001.ome.zarr
12G     total
omero@ngff:/mnt/data/seb/bioformats2raw_290$ du -csh v3/*
3.2G    v3/Leica-1.ome.zarr
1.9G    v3/LuCa-7color_Scan1.ome.zarr
2.3G    v3/NIRHTa+001.ome.zarr
7.3G    total

and the generated Zarr datasets have been uploaded to a temporary prefix of the public gs-public-zarr-archive AWS S3 bucket under bioformats2raw_290, and can be inspected via the OME NGFF validator, e.g.
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/default/Leica-1.ome.zarr/1
https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1

A couple of immediate findings:

  • the conversion was roughly twice as slow when generating Zarr v3
  • the dataset size was reduced with Zarr v3, and in practice the individual chunks take less space
    • does this mean the default compression options are different? Could that partly explain the conversion-time difference?
  • the validator reports a few metadata validation issues (these can be checked directly with the sketch below):
    • the dimension_names attribute under the array attributes is null
    • the plate and well attributes are not nested under ome
    • the omero attribute is not under ome
  • when loading the chunks using the validator, they appear corrupted. I have not confirmed this with an independent viewer yet
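
The metadata points above can be reproduced directly from the bucket (a sketch; the zarr.json paths are assumed from the layout described earlier):

import json
import urllib.request

base = ("https://gs-public-zarr-archive.s3.amazonaws.com/"
        "bioformats2raw_290/v3/Leica-1.ome.zarr/1")

# array-level metadata: dimension_names is reported as null by the validator
array_meta = json.load(urllib.request.urlopen(base + "/0/zarr.json"))
print(array_meta.get("dimension_names"))

# group-level metadata: inspect which attributes are (and are not) nested under "ome"
group_meta = json.load(urllib.request.urlopen(base + "/zarr.json"))
print(json.dumps(group_meta.get("attributes", {}), indent=2))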

@will-moore
Contributor

Looking at the chunks loading in browser devtools, it's reporting CORS errors and 403 on the chunks:

[Screenshot 2025-10-08 at 17:11:18]

Vol-E and vizarr are showing corrupted-looking images:

[Screenshot 2025-10-08 at 17:15:22]

I see in the dev-tools that some chunks (e.g. dataset 8) are giving CORS errors, but others are OK, e.g. dataset 5.
Checking in the validator, I can confirm that loading a chunk for dataset 5 works "OK" (no CORS error) but shows a corrupted appearance.

Neuroglancer doesn't show anything (no errors either)?

@melissalinkert
Member Author

the dataset size was reduced with Zarr v3, and in practice the individual chunks take less space

  • does this mean the default compression options are different? Could that partly explain the conversion-time difference?

Pretty sure that's a difference in the default blosc cname; the size difference seems consistent with what's noted in glencoesoftware/zarr2zarr#9. Definitely open to changing the defaults for 0.5/v3, but we'd need to collectively agree on what they should be.
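
One way to compare the stored defaults directly (a sketch; the array paths are assumed from the bucket layout above):

import json
import urllib.request

base = "https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290"

# v2: the compressor (including the blosc cname) is stored in .zarray
v2_meta = json.load(urllib.request.urlopen(base + "/default/Leica-1.ome.zarr/1/0/.zarray"))
print(v2_meta["compressor"])

# v3: the codec chain (including any blosc configuration) is stored in zarr.json
v3_meta = json.load(urllib.request.urlopen(base + "/v3/Leica-1.ome.zarr/1/0/zarr.json"))
print(v3_meta["codecs"])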

the validator reports a few metadata validation issues:

Should be fixed with the last few commits on this PR.

when loading the chunks using the validator, they appear corrupted. I have not confirmed this with an independent viewer yet

Still looking into what's going wrong here.

@melissalinkert
Member Author

As suggested by @sbesson, a simple test with zarr-python:

import zarr
import numpy as np

for res in range(0, 4):
    z_default = zarr.open(store="https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/default/Leica-1.ome.zarr/1/" + str(res))
    z_v3 = zarr.open(store="https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1/" + str(res))
    print(np.array_equal(z_default[0, 0, 0, :, :], z_v3[0, 0, 0, :, :]))

indicates that zarr-python sees the v2 and v3 arrays as having equivalent contents.

Looking at https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1, when loading chunk 0 of 0/zarr.json, I see that approximately the top third of the chunk image is corrupt data and the remainder is black. Looking at the size of the matching chunk file, it is approximately one third of the uncompressed chunk size (assuming the default 1024×1024 uint8 chunks, the uncompressed chunk would be 1048576 bytes, and 368707 bytes is roughly a third of that):

$ aws s3 ls s3://gs-public-zarr-archive/bioformats2raw_290/v3/Leica-1.ome.zarr/1/0/c/0/0/0/0/0 --no-sign-request
2025-10-08 05:31:42     368707 0

That suggests to me that decompression is just not being performed.

If I then generate a simple sharded array with zarr-python and default compression:

import zarr
import numpy as np

z = zarr.create_array(
    store="v3-shard-test.zarr",
    shape=(256, 256),
    chunks=(1024, 1024),
    shards=(1024, 1024),
    dtype="B"
)
z[:, :] = np.mgrid[0:256, 0:256][1]

then v3-shard-test.zarr/zarr.json shows the zstd codec nested within the configuration for the sharding_indexed codec. A similar simple test with the current state of this PR:

bin/bioformats2raw "test&sizeX=256&sizeY=256.fake" bf-v3-test.zarr --v3 --compact

and bf-v3-test.zarr/zarr.json shows that the blosc codec is a separate entry in the codecs array after the sharding_indexed codec. My understanding based on https://zarr-specs.readthedocs.io/en/latest/v3/core/index.html#id22 is that either representation should be readable, which suggests that a next step might be to confirm that zarrita is working as intended on both of these minimal examples.
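
For illustration, the two layouts in zarr.json look roughly like this (hand-written sketches; the chunk shapes and codec parameters are placeholders, not the converter's exact output). Chunk-level compression, with the compressor nested inside the sharding configuration:

"codecs": [
  {
    "name": "sharding_indexed",
    "configuration": {
      "chunk_shape": [1024, 1024],
      "codecs": [{"name": "bytes"}, {"name": "zstd"}],
      "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}]
    }
  }
]

Shard-level compression, with the compressor as a separate entry after the sharding codec:

"codecs": [
  {
    "name": "sharding_indexed",
    "configuration": {
      "chunk_shape": [1024, 1024],
      "codecs": [{"name": "bytes"}],
      "index_codecs": [{"name": "bytes"}, {"name": "crc32c"}]
    }
  },
  {"name": "blosc", "configuration": {"cname": "lz4", "clevel": 5}}
]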

If my understanding is not correct and the blosc/zstd codec must be nested within the sharding_indexed codec, then one more thing to try would be using https://github.com/zarr-developers/zarr-java/blob/main/src/main/java/dev/zarr/zarrjava/v3/codec/CodecBuilder.java#L126 here.

@will-moore
Contributor

I tried that validator link, but loading chunks failed with a CORS error.

Viewing in napari with:

$ napari --plugin napari-ome-zarr https://gs-public-zarr-archive.s3.amazonaws.com/bioformats2raw_290/v3/Leica-1.ome.zarr/1/

loaded the image, but I see the omero metadata (channel colours etc.) isn't being read because it's not nested within the ome block; it's actually a sibling under attributes in zarr.json.

[Screenshot 2025-10-23 at 08:59:44]

@sbesson
Member

sbesson commented Oct 23, 2025

Thanks both, that is really useful and seems to confirm that zarr-python can read the chunks generated by zarr-java.

loaded the image but I see the omero metadata (channel colours etc) isn't being read because it's not nested within the ome block but is actually a sibling under attributes in zarr.json.

Yes, this had been reported in #290 (review) and the test files haven't been regenerated since. I will do that using @melissalinkert's latest commits, unless additional work is planned, and update the datasets on our public buckets so that we can reduce the number of moving targets.

I tried that validator link but loading chunks failed with CORS error.

I see the same, and this is surprising, as there are absolutely no CORS errors when accessing, for instance, https://ome.github.io/ome-ngff-validator/?source=https://gs-public-zarr-archive.s3.amazonaws.com/CMU-1.ome.zarr/0/ (OME-NGFF 0.4), which is hosted in the same bucket with the same policy.
I propose to retest once the datasets have been regenerated and we have addressed the other validation issues.

then v3-shard-test.zarr/zarr.json shows the zstd codec nested within the configuration for the sharding_indexed codec. A similar simple test with the current state of this PR and bf-v3-test.zarr/zarr.json shows that the blosc codec is a separate entry in the codecs array after the sharding_indexed codec

Thanks for the detailed investigation. My reading of the specification is that both configurations are equally valid but actually describe different storage states: in the first, the chunks within the shards are individually compressed with zstd, while in the second, the entire shard is compressed with zstd. If this is the case, that's really useful to know, as we will want to assess the pros and cons of each configuration and decide whether the converter should offer the granularity of defining shard-level vs chunk-level compression (and how) or hide this complexity from the end user.

@melissalinkert
Member Author

348190c adds a --compress-inner-chunk option so that we can compare the two ways of compressing shards/chunks. I don't particularly love that option name, and as with the other sharding options there is definitely room for discussion, but this should give us something to proceed with for testing.

Something like:

$ bin/bioformats2raw --v3 test.fake v3-default.zarr
$ bin/bioformats2raw --v3 --compress-inner-chunk test.fake v3-compress-inner-chunk.zarr

and then comparing v3-*.zarr/0/0/zarr.json should show the difference.
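
For example, a quick way to diff just the codec chains of the two outputs (a sketch, assuming the v3-*.zarr/0/0/zarr.json paths above):

import json

for name in ("v3-default", "v3-compress-inner-chunk"):
    with open(f"{name}.zarr/0/0/zarr.json") as f:
        meta = json.load(f)
    print(name, json.dumps(meta["codecs"], indent=2))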

@sbesson
Member

sbesson commented Oct 24, 2025

Thanks @melissalinkert, I regenerated the Zarr files using the latest build from this PR and the following script:

PATH=./bioformats2raw-0.12.0-SNAPSHOT/bin:$PATH
mkdir -p default
rm -rf default/*
time bioformats2raw sources/Leica-1.scn default/Leica-1.ome.zarr
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB default/NIRHTa+001.ome.zarr
time bioformats2raw sources/LuCa-7color_Scan1.qptiff default/LuCa-7color_Scan1.ome.zarr

mkdir -p v3
rm -rf v3/*
time bioformats2raw sources/Leica-1.scn v3/Leica-1.ome.zarr --v3
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3/NIRHTa+001.ome.zarr --v3
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3/LuCa-7color_Scan1.ome.zarr --v3

mkdir -p v3_chunk_compressed
rm -rf v3_chunk_compressed/*
time bioformats2raw sources/Leica-1.scn v3_chunk_compressed/Leica-1.ome.zarr --v3 --compress-inner-chunk
time bioformats2raw sources/NIRHTa+001/AS_09125_050116000001_A10f00d0.DIB v3_chunk_compressed/NIRHTa+001.ome.zarr --v3 --compress-inner-chunk
time bioformats2raw sources/LuCa-7color_Scan1.qptiff v3_chunk_compressed/LuCa-7color_Scan1.ome.zarr --v3 --compress-inner-chunk

Execution times:

                              Leica-1.ome.zarr  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     0m56.890s         0m52.013s            0m46.134s
--v3                          1m57.536s         2m32.787s            1m47.665s
--v3 --compress-inner-chunk   1m56.552s         2m26.934s            1m33.445s

File sizes:

                              Leica-1.ome.zarr  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     4.3G              3.2G                 4.0G
--v3                          3.2G              1.9G                 2.3G
--v3 --compress-inner-chunk   3.2G              1.8G                 2.3G

OME NGFF validator links:

                              Leica-1.ome.zarr    NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr
(Default)                     Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0
--v3                          Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0
--v3 --compress-inner-chunk   Leica-1.ome.zarr/1  NIRHTa+001.ome.zarr  LuCa-7color_Scan1.ome.zarr/0

A few initial observations:

  • for HCS, the plate-level metadata is still missing a version under the ome attribute

  • for all modalities, the bioformats2raw.layout key should be under the ome attribute - https://ngff.openmicroscopy.org/0.5/#bf2raw-attributes

  • the Leica-1 dataset with compression at the chunk level (--v3 --compress-inner-chunk) now loads in the validator, while the one with compression at the shard level (--v3) still appears corrupted

  • the conversion failed for LuCa-7color_Scan1.ome.zarr with --v3 --compress-inner-chunk, with errors of the type

    2025-10-23 21:42:25,666 [pool-1-thread-4] ERROR c.g.bioformats2raw.Converter - Failure processing chunk; resolution=1 plane=1 xx=12288 yy=0 zz=0 width=192 height=1024 depth=1
    java.lang.RuntimeException: dev.zarr.zarrjava.ZarrException: Could not read shard index.
     at dev.zarr.zarrjava.v3.Array.lambda$read$0(Array.java:185)
     at java.base/java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948)
     at java.base/java.util.stream.ReferencePipeline$Head.forEach(ReferencePipeline.java:658)
     at dev.zarr.zarrjava.v3.Array.read(Array.java:154)
     at com.glencoesoftware.bioformats2raw.Converter.readAsBytesV3(Converter.java:1993)
     at com.glencoesoftware.bioformats2raw.Converter.getTileDownsampled(Converter.java:2219)
     at com.glencoesoftware.bioformats2raw.Converter.getTile(Converter.java:2270)
     at com.glencoesoftware.bioformats2raw.Converter.processChunk(Converter.java:2514)
     at com.glencoesoftware.bioformats2raw.Converter.lambda$saveResolutions$8(Converter.java:2747)
     at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
     at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
     at java.base/java.lang.Thread.run(Thread.java:829)
    Caused by: dev.zarr.zarrjava.ZarrException: Could not read shard index.
     at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodeInternal(ShardingIndexedCodec.java:198)
     at dev.zarr.zarrjava.v3.codec.core.ShardingIndexedCodec.decodePartial(ShardingIndexedCodec.java:250)
     at dev.zarr.zarrjava.v3.codec.CodecPipeline.decodePartial(CodecPipeline.java:93)
     at dev.zarr.zarrjava.v3.Array.lambda$read$0(Array.java:173)
     ... 11 common frames omitted
    
