Add CLI for converting v2 metadata to v3 #3257

Open: wants to merge 51 commits into base: main

K-Meech
Contributor

@K-Meech K-Meech commented Jul 16, 2025

For #1798

Adds a CLI using typer to convert v2 metadata (.zarray / .zattrs...) to v3 metadata (zarr.json).

To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"

This should make the zarr-converter command available e.g. try:

zarr-converter --help
zarr-converter convert --help
zarr-converter clear --help

convert adds zarr.json files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but will give a UserWarning: "Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used." This can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.

clear can also remove v3 metadata. This is useful if the conversion fails partway through, e.g. if one of the arrays uses a codec with no v3 equivalent.
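As a rough illustration of what a dry run of clear has to decide, the sketch below walks a local directory store and lists the metadata files that belong to one format. It is not the PR's implementation: find_metadata_files is a hypothetical helper, and it only handles local paths. The file names are the usual metadata keys for each format (.zarray, .zattrs, .zgroup and .zmetadata for v2; zarr.json for v3).

```python
from pathlib import Path

# Usual metadata file names per Zarr format.
V2_METADATA_NAMES = {".zarray", ".zattrs", ".zgroup", ".zmetadata"}
V3_METADATA_NAMES = {"zarr.json"}


def find_metadata_files(root: Path, zarr_format: int) -> list[Path]:
    """Hypothetical helper: list metadata files of one format under a local store."""
    if zarr_format == 2:
        names = V2_METADATA_NAMES
    elif zarr_format == 3:
        names = V3_METADATA_NAMES
    else:
        raise ValueError(f"unsupported zarr_format: {zarr_format}")
    # pathlib's "*" pattern also matches dot-files, so .zarray and friends are found.
    return sorted(p for p in root.rglob("*") if p.is_file() and p.name in names)
```

Chunk files never match either set of names, so a removal step built on a listing like this would leave array data untouched.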

All code for the cli is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of /core which is considered private API, so it may be best to move them elsewhere in the package).

Some points to consider:

  • I had to modify set_path in test_dtype_registry.py and test_codec_entrypoints.py, because they caused the CLI tests to fail when the CLI tests ran afterwards. This seems to be due to the lazy_load_list of the numcodecs codec registries being cleared, which made the codecs unavailable to my code that finds the numcodecs.zarr3 equivalent of a numcodecs codec.
  • I tested this on local zarr images, so it would be great if someone with access to s3 / google cloud etc. could try it out on some small example images there.
  • I'm happy to add docs about how to use the CLI, but wanted to get feedback on the general structure first.

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/user-guide/*.rst
  • Changes documented as a new file in changes/
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jul 16, 2025
@K-Meech
Contributor Author

K-Meech commented Jul 28, 2025

I've updated the structure of the CLI - hopefully this addresses both @dstansby and @d-v-b 's comments! You should be able to test with:

zarr --help
zarr migrate --help
zarr remove-metadata --help

I haven't addressed this comment yet, but will do so in the next round of changes (I know @dstansby had some additional comments to make on the converter implementation).

One known issue:

@dstansby
Contributor

Very nice! I had a play with it locally and it worked well. I'll do a fuller review of the code now.

Contributor

@dstansby dstansby left a comment

Looking great! I've reviewed the implementation, and left some comments; I'll move on to reviewing the tests next, but thought I'd post these comments first.

else:
    lvl = logging.WARNING
    fmt = "%(message)s"
logging.basicConfig(level=lvl, format=fmt)
Contributor

Suggested change
- logging.basicConfig(level=lvl, format=fmt)
+ logger.basicConfig(level=lvl, format=fmt)

I think you want to configure the logger instance, not global settings?

Contributor Author

I could do something like:

if verbose:
    logger.setLevel(logging.INFO)
else:
    logger.setLevel(logging.WARNING)
logger.addHandler(logging.StreamHandler())

but the issue is this will only affect logs directly from the cli.py file. When using --dry-run, I also want to see the log of created / deleted files from migrate_to_v3.py, which won't be shown with this setting alone.

I could add a similar setup to the migrate_to_v3.py file, but this could be annoying for downstream code that uses these functions and wants to use a different logging level / setup e.g. to a file rather than the console.

It could work if I created the migrate_to_v3.py logger as an explicit child of the cli.py logger though i.e.

# in cli.py
logger = logging.getLogger("cli")

# in migrate_to_v3.py
logger = logging.getLogger("cli.migrate")

What do you think?
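The parent/child idea can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: configure_cli_logging is a hypothetical helper, the logger names are the ones proposed in the comment, and the StringIO stream is only there so the output can be inspected.

```python
import io
import logging

# Parent logger, as cli.py might create it (name taken from the comment above).
cli_logger = logging.getLogger("cli")
# Child logger, as migrate_to_v3.py might create it.
migrate_logger = logging.getLogger("cli.migrate")


def configure_cli_logging(verbose: bool, stream=None) -> logging.Handler:
    """Hypothetical helper: configure only the "cli" logger, not the root logger."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter("%(message)s"))
    cli_logger.addHandler(handler)
    cli_logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return handler


buffer = io.StringIO()
configure_cli_logging(verbose=True, stream=buffer)
# The child has no level or handler of its own: it inherits INFO from "cli",
# and its records propagate up to the handler attached there.
migrate_logger.info("would create zarr.json")
print(buffer.getvalue())
```

Because nothing is attached to the root logger, downstream code that imports the migration functions without going through the CLI sees no handler set up on its behalf and stays free to route "cli.migrate" records wherever it likes. Note also that basicConfig exists only as the module-level logging.basicConfig, which configures the root logger; Logger instances expose setLevel and addHandler instead.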



def _set_verbose_level() -> None:
    logging.getLogger().setLevel(logging.INFO)
Contributor

Suggested change
- logging.getLogger().setLevel(logging.INFO)
+ logger.setLevel(logging.INFO)

Same reason as above

str | None,
typer.Argument(
help=(
"Output location to write generated metadata (no chunks will be copied). If not provided, "
Contributor

Suggested change
- "Output location to write generated metadata (no chunks will be copied). If not provided, "
+ "Output location to write generated metadata (no array data will be copied). If not provided, "



def migrate_v2_to_v3(
input_store: StoreLike,
Contributor

Suggested change
- input_store: StoreLike,
+ *,
+ input_store: StoreLike,

Going keyword-only here should avoid accidentally handing in the input/output stores the wrong way round.
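A minimal sketch of the effect of the bare `*`, using a hypothetical stand-in rather than the real migrate_v2_to_v3 (the string store arguments and the fall-back to the input location are assumptions made for illustration):

```python
from __future__ import annotations


def migrate(*, input_store: str, output_store: str | None = None) -> tuple[str, str]:
    """Hypothetical stand-in for migrate_v2_to_v3 with keyword-only parameters."""
    # Assumed behaviour: write next to the input when no output is given.
    return input_store, output_store or input_store


# Keyword call: the roles of the two stores are explicit at the call site.
print(migrate(input_store="in.zarr", output_store="out.zarr"))

# Positional call: rejected before any store is touched.
try:
    migrate("out.zarr", "in.zarr")  # type: ignore[misc]
except TypeError as err:
    print(f"rejected: {err}")
```

The trade-off is slightly more verbose call sites in exchange for making a swapped input/output impossible to write silently.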

Comment on lines +38 to +40
input_store: StoreLike,
output_store: StoreLike | None = None,
storage_options: dict[str, Any] | None = None,
Contributor

To save having to deal with storage_options here, I would instead just accept an actual Store as input to this function. Doing this would also avoid having to deal with the case where one wants different storage_options for the input and output stores.

)


def _convert_filters(metadata_v2: ArrayV2Metadata) -> list[ArrayArrayCodec]:
Contributor

I think it would make slightly more sense to pass filters, instead of the whole metadata here.

Suggested change
- def _convert_filters(metadata_v2: ArrayV2Metadata) -> list[ArrayArrayCodec]:
+ def _convert_filters(filters) -> list[ArrayArrayCodec]:

(I probably got the typing wrong)

return cast(list[ArrayArrayCodec], filters_codecs)


def _convert_compressor(metadata_v2: ArrayV2Metadata) -> BytesBytesCodec | None:
Contributor

Suggested change
- def _convert_compressor(metadata_v2: ArrayV2Metadata) -> BytesBytesCodec | None:
+ def _convert_compressor(compressor, dtype) -> BytesBytesCodec | None:

Again, I think it makes for a clearer API if you explicitly pass just the compressor and dtype instead of the whole metadata.

Comment on lines +242 to +244
compressor_name = metadata_v2.compressor.codec_id

match compressor_name:
Contributor

Suggested change
- compressor_name = metadata_v2.compressor.codec_id
- match compressor_name:
+ match metadata_v2.compressor.codec_id:

Since you don't use the variable again

@TomNicholas
Member

If you have the patience for it, please keep track of the stuff we could change in the core library to make this kind of tool easier to write, and feel free to open issues to track these features.

For example, is there currently a way to use zarr-python to convert v2 to v3 metadata without the CLI? If so then that would be independently useful (e.g. in VirtualiZarr). If not then in an ideal world that would exist before a CLI is added to wrap it.

@emmanuelmathot

Just FYI, I made this converter for ESA EOPF datasets (Zarr v2) to EOPF GeoZarr (V3). I adopted a recursive approach to copy and keep the datatree as well as the consolidated metadata.

@K-Meech
Contributor Author

K-Meech commented Jul 29, 2025

@TomNicholas - this PR will add both a CLI + functions to convert v2 to v3 metadata. Everything possible with the CLI is also available via direct function calls.

At the moment, though, these are added to src/zarr/core/metadata, which I don't think is part of the public API. @d-v-b / @dstansby, any suggestions for where they could be moved to make them easy to use from downstream scripts / packages?

@TomNicholas
Member

@K-Meech thanks. That's actually okay for my use case - I'm already importing other private metadata-related internals. But I agree a later PR to make these public would be great.

@dstansby
Contributor

I'd suggest putting the metadata converter python API in a new zarr.metadata submodule (which can contain other bits of metadata migrated from zarr.core.metadata later), and hiding the CLI wrapper in a new private zarr._cli submodule (or zarr._cli.py file).

Labels
needs release notes Automatically applied to PRs which haven't added release notes
5 participants