Add CLI for converting v2 metadata to v3 #3257
…sting a zarr version greater than 3
Merge changes from review
I've updated the structure of the CLI - hopefully this addresses both @dstansby's and @d-v-b's comments! You should be able to test with:
I haven't addressed this comment yet, but will do so in the next round of changes (I know @dstansby had some additional comments to make on the converter implementation). One known issue:
Very nice! I had a play with it locally and it worked well. I'll do a fuller review of the code now.
Looking great! I've reviewed the implementation, and left some comments; I'll move on to reviewing the tests next, but thought I'd post these comments first.
else:
    lvl = logging.WARNING
    fmt = "%(message)s"
logging.basicConfig(level=lvl, format=fmt)
- logging.basicConfig(level=lvl, format=fmt)
+ logger.basicConfig(level=lvl, format=fmt)
I think you want to configure the logger instance, not global settings?
I could do something like:
    if verbose:
        logger.setLevel(logging.INFO)
    else:
        logger.setLevel(logging.WARNING)
    logger.addHandler(logging.StreamHandler())
but the issue is this will only affect logs directly from the cli.py file. When using --dry-run, I also want to see the log of created / deleted files from migrate_to_v3.py, which won't be shown with this setting alone.
I could add a similar setup to the migrate_to_v3.py file, but this could be annoying for downstream code that uses these functions and wants to use a different logging level / setup e.g. to a file rather than the console.
It could work if I created the migrate_to_v3.py logger as an explicit child of the cli.py logger though i.e.
    # in cli.py
    logger = logging.getLogger("cli")
    # in migrate_to_v3.py
    logger = logging.getLogger("cli.migrate")
What do you think?
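For reference, a minimal sketch of the parent/child logger idea, using only the standard library. The logger names "cli" and "cli.migrate" come from the comment above; the helper `configure_cli_logging` and the log message are hypothetical and not part of the PR. Note that `basicConfig` exists only as a module-level function that configures the root logger, so a named `Logger` instance is configured through `setLevel` and `addHandler` instead.

```python
import io
import logging

# Logger names as proposed in the comment; the real module layout may differ.
cli_logger = logging.getLogger("cli")
migrate_logger = logging.getLogger("cli.migrate")


def configure_cli_logging(verbose: bool, stream=None) -> logging.Handler:
    """Hypothetical helper: attach one handler to the parent "cli" logger."""
    handler = logging.StreamHandler(stream)  # stream=None means sys.stderr
    handler.setFormatter(logging.Formatter("%(message)s"))
    cli_logger.addHandler(handler)
    cli_logger.setLevel(logging.INFO if verbose else logging.WARNING)
    return handler


buffer = io.StringIO()
handler = configure_cli_logging(verbose=True, stream=buffer)

# The child has no level or handler of its own: it inherits the effective
# level from "cli", and its records propagate up to the parent's handler.
migrate_logger.info("would create zarr.json")

cli_logger.removeHandler(handler)
captured = buffer.getvalue()
```

Because only the "cli" branch of the logger hierarchy is touched, downstream code that imports the conversion functions can still set up the root logger (or its own handlers) however it likes.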
def _set_verbose_level() -> None:
    logging.getLogger().setLevel(logging.INFO)
- logging.getLogger().setLevel(logging.INFO)
+ logger.setLevel(logging.INFO)
Same reason as above
str | None,
typer.Argument(
    help=(
        "Output location to write generated metadata (no chunks will be copied). If not provided, "
- "Output location to write generated metadata (no chunks will be copied). If not provided, "
+ "Output location to write generated metadata (no array data will be copied). If not provided, "
def migrate_v2_to_v3(
    input_store: StoreLike,
- input_store: StoreLike,
+ *,
+ input_store: StoreLike,
Going keyword-only here should avoid accidentally handing the input/output stores the wrong way round.
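A small sketch of what the bare `*` buys, assuming a hypothetical stand-in function `migrate_metadata` that takes plain string paths (the PR's real function takes store objects and does actual work):

```python
from typing import Optional


def migrate_metadata(
    *, input_store: str, output_store: Optional[str] = None
) -> tuple[str, str]:
    """Hypothetical stand-in: return the (source, destination) pair it would use."""
    # With no explicit output, metadata would be written next to the input.
    destination = output_store if output_store is not None else input_store
    return input_store, destination


# Explicit names make the direction unambiguous at the call site.
pair = migrate_metadata(input_store="data/v2.zarr", output_store="data/v3.zarr")

# Positional calls are rejected, so the two stores cannot be silently swapped.
try:
    migrate_metadata("data/v3.zarr", "data/v2.zarr")
except TypeError as exc:
    positional_error = str(exc)
```

The cost is slightly more verbose call sites, which seems a fair trade for a function where swapping the two arguments would write metadata to the wrong location.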
input_store: StoreLike,
output_store: StoreLike | None = None,
storage_options: dict[str, Any] | None = None,
To save having to deal with storage_options here, I would instead just accept an actual Store as input to this function. Doing this would also avoid having to deal with the case where one wants different storage_options for the input and output stores.
)

def _convert_filters(metadata_v2: ArrayV2Metadata) -> list[ArrayArrayCodec]:
I think it would make slightly more sense to pass filters, instead of the whole metadata here.
- def _convert_filters(metadata_v2: ArrayV2Metadata) -> list[ArrayArrayCodec]:
+ def _convert_filters(filters) -> list[ArrayArrayCodec]:
(I probably got the typing wrong)
    return cast(list[ArrayArrayCodec], filters_codecs)

def _convert_compressor(metadata_v2: ArrayV2Metadata) -> BytesBytesCodec | None:
- def _convert_compressor(metadata_v2: ArrayV2Metadata) -> BytesBytesCodec | None:
+ def _convert_compressor(compressor, dtype) -> BytesBytesCodec | None:
Again, I think it makes for a clearer API if you explicitly pass just the compressor and dtype instead of the whole metadata.
compressor_name = metadata_v2.compressor.codec_id

match compressor_name:
- compressor_name = metadata_v2.compressor.codec_id
- match compressor_name:
+ match metadata_v2.compressor.codec_id:
Since you don't use the variable again
For example, is there currently a way to use zarr-python to convert v2 to v3 metadata without the CLI? If so then that would be independently useful (e.g. in VirtualiZarr). If not then in an ideal world that would exist before a CLI is added to wrap it.
Just FYI, I made this converter for ESA EOPF datasets (Zarr v2) to EOPF GeoZarr (V3). I adopted a recursive approach to copy and keep the datatree as well as the consolidated metadata.
@TomNicholas - this PR will add both a CLI + functions to convert v2 to v3 metadata. Everything possible with the CLI is also available via direct function calls. Although, at the moment these are added to
@K-Meech thanks. That's actually okay for my use case - I'm already importing other private metadata-related internals. But I agree a later PR to make these public would be great.
I'd suggest putting the metadata converter python API in a new
For #1798
Adds a CLI using typer to convert v2 metadata (.zarray / .zattrs ...) to v3 metadata zarr.json.
To test, you will need to install the new optional cli dependency e.g.
pip install -e ".[remote,cli]"
This should make the zarr-converter command available e.g. try:
convert adds zarr.json files to every group / array, leaving the v2 metadata as-is. A zarr with both sets of metadata can still be opened with zarr.open, but will give a UserWarning: Both zarr.json (Zarr format 3) and .zarray (Zarr format 2) metadata objects exist... Zarr v3 will be used. This can be avoided by passing zarr_format=3 to zarr.open, or by using the clear command to remove the v2 metadata.
clear can also remove v3 metadata. This is useful if the conversion fails part way through e.g. if one of the arrays uses a codec with no v3 equivalent.
All code for the cli is in src/zarr/core/metadata/converter/cli.py, with the actual conversion functions in src/zarr/core/metadata/converter/converter_v2_v3.py. These functions can be called directly, for those who don't want to use the CLI (although currently they are part of /core which is considered private API, so it may be best to move them elsewhere in the package).
Some points to consider:
set_path from test_dtype_registry.py and test_codec_entrypoints.py, as they were causing the CLI tests to fail if they were run after. This seems to be due to the lazy_load_list of the numcodecs codecs registries being cleared, meaning they were no longer available in my code which finds the numcodecs.zarr3 equivalent of a numcodecs codec.
TODO:
docs/user-guide/*.rst
changes/