ZEP0004 Review - Zarr Conventions #262
Co-authored-by: Ryan Abernathey <[email protected]>
Thanks so much @MSanKeys963 for getting this started. It's a perfect place to start. Here's what I will try to do over the next few days
In the meantime, we can use this PR to continue the discussion started in https://github.com/zarr-developers/zeps/pull/28/files where ZEP4 was first proposed. @ivirshup I know you have lots of ideas here, and you have been very patient as this ZEP has moved forward very slowly. 🙏 I'd love to hear more about your use cases for conventions in anndata and the other projects you're involved with.
Would it make sense to suggest that a zarr convention provide a json-schema in addition to a document? From what I read in zarr-developers/zeps#28, the goal of ZEP4 is to have a common place to find information about how to store some domain-specific metadata in a standard-ish way, rather than something that should be strictly enforced. So a convention document is more important than a json-schema, and the latter should really be optional. Having a json-schema may be nice to avoid making mistakes or misinterpretations while implementing a convention in a domain-specific library. Using existing tooling would make the process faster too, I guess? I'm not familiar with json-schema, though, so I don't know if it is compatible with the modularity and flexibility of zarr conventions as proposed in ZEP4. Are json schemas easily composable?
JSON schema makes sense to me, and I have implemented some in Pydantic for a different project. However, it gets a bit ugly when you start using hyphens for key names and symbols for namespaces as proposed in ZEP0004. Most programming languages do not allow hyphens in variable names, so such keys need aliases. Luckily Pydantic supports this, but I'm not sure what would happen in other languages. Parsing can be difficult. It also gets very nested and confusing.
Any thoughts? The example above allows a JSON specification like this:
{"units-v1": {"distance": "ft"}}
// or
{"units-v1": {"angle": "rad"}}
At the end of the day you end up with a schema like this, which is nice, but the implementation makes me want to barf :)
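The aliasing problem described above can be sketched without any third-party library. This is a minimal illustration using only the standard library, not the Pydantic models from the thread; the class names, the "alias" metadata key, and parse_attributes are all hypothetical.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class UnitsV1:
    # Only the two keys shown in the thread's examples; anything else is omitted.
    distance: Optional[str] = None
    angle: Optional[str] = None


@dataclass
class Attributes:
    # "units-v1" is not a valid Python identifier, so the JSON key is stored
    # in field metadata (a hypothetical "alias" entry) and mapped by hand.
    units_v1: Optional[UnitsV1] = field(default=None, metadata={"alias": "units-v1"})


def parse_attributes(raw: dict) -> Attributes:
    """Build Attributes from a JSON-decoded dict, honouring the aliases."""
    kwargs = {}
    for f in fields(Attributes):
        key = f.metadata.get("alias", f.name)
        if key in raw:
            # Every field of this toy model is a UnitsV1, so no dispatch is needed.
            kwargs[f.name] = UnitsV1(**raw[key])
    return Attributes(**kwargs)


parsed = parse_attributes({"units-v1": {"distance": "ft"}})
print(parsed.units_v1.distance)
```

Every hyphenated key needs this kind of explicit mapping in each language binding, which is the cost being pointed out.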
Just dropping in having seen the ZEP page https://zarr.dev/zeps/draft/ZEP0004.html - is there any advantage to the flexibility around keeping a convention's configuration inside or not inside its own object within the attributes? I think we could stand to be more opinionated here and require that the config is kept in its own sub-object: this avoids name collisions and keeps everything together. That would also become the obvious place to keep the convention version, rather than having to encode it in the name. It also makes the jsonschema marginally easier, as you only have to describe the convention config object rather than the whole attributes object containing the convention config. Also this way, the Is there a strong argument in favour of allowing free-floating convention configuration exploded through the attributes object?
I'm 100% on board with this
I'm not aware of any, but I am curious if anyone knows differently.
I think this sounds fine. I would welcome explicit suggestions on the PR. I know that I have been very slow to move this forward. The space of possibilities feels vast. Specifically, @clbarnes - would you like to turn your suggestions into text on the ZEP? I would gladly incorporate that. The same thing goes for folks who favor JSON schema. Please suggest language you would like to see in the ZEP.
Co-authored-by: Yaroslav Halchenko <[email protected]>
"zarr_format": 3,
"node_type": "group",
"attributes": {
    "zarr_conventions": ["units-v1", "foo"],
Any chance to make it more "specific" but also descriptive, to potentially "decentralize" such conventions while still allowing for generic validation of zarrs? E.g. it could become a dict of conventions here, with their versions and schema (jsonschema? or maybe linkml?) URLs, e.g.
- "zarr_conventions": ["units-v1", "foo"],
+ "zarr_conventions": {
+     "units": {
+         "version": 1,
+         "homepage": " ... URL which has potential to describe what that is about ...",
+         "schema_url": "... hosted somewhere ..."
+     },
+     "foo": {}
+ },
where, in the above, units is a well-defined convention and foo is not so good (just as an example).
Providing a schema to go along would open the opportunity for a generic zarr validator to validate the attributes embedded in a zarr against that schema. It mirrors the approach the NWB standard took: it stores a copy of its own schema and of each extension's schema within the .nwb (hdf5) file, so it becomes feasible to do generic validation and also to open the file following those embedded schemas even if the extension library is not installed.
Separating the version from the convention name would also make it cleaner, and the diff upon upgrading from one version to another would be "to the point" (instead of changing every attribute name), thus making it easier to review, etc.
I am not that savvy in zarr and thus acknowledge that developing the schema formalization for conventions might be a larger effort than intended for this ZEP, so it might better be postponed. But establishing zarr_conventions as a collection of records instead of just a list would at least open up that possibility without requiring breaking type changes in the future. Or maybe it is already "easy" to add basic "schema" support here?
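To make the difference between the two shapes concrete, here is a small sketch of a reader that accepts both the list form and the proposed record form. The function name and the assumption that a trailing "-v<N>" in a list entry encodes the version are mine, not part of the ZEP.

```python
import re


def normalize_conventions(attributes: dict) -> dict:
    """Return {name: record} for either form of "zarr_conventions"."""
    declared = attributes.get("zarr_conventions", [])
    if isinstance(declared, dict):
        # Proposed record form: already keyed by convention name.
        return {name: dict(record) for name, record in declared.items()}
    records = {}
    for entry in declared:
        # List form: assume a trailing "-v<N>" encodes the version.
        match = re.fullmatch(r"(.+)-v(\d+)", entry)
        if match:
            records[match.group(1)] = {"version": int(match.group(2))}
        else:
            records[entry] = {}
    return records


print(normalize_conventions({"zarr_conventions": ["units-v1", "foo"]}))
```

With the record form the version is a plain field, so no name parsing is needed and extra keys such as a schema URL can be added later without changing the type of zarr_conventions.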
I think this is a great suggestion @yarikoptic!
@yarikoptic should we rename schema-url to schemaUrl to adhere to common JSON practices? Hyphens, when parsed in some languages, cause issues / require special handling.
What exactly is the use case for storing a schema (or a URL to a schema) for the attributes alongside the attributes? I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation from that schema, you wouldn't write them to disk in the first place. If I'm a client reading a Zarr group that implements some schema that I am aware of, then by definition I already have the schema, so including the schema in the Zarr attributes is useless here; whereas, if it's a schema I'm not aware of, then why should I care if validation of that schema succeeds or fails?
I can see why data stores that support partial reads would expose schemas, because you don't want clients to read everything just to know what's in it, but Zarr attributes are just JSON documents, so partial reading isn't really part of the picture there.
It's very likely that I don't understand the use case, so a motivating example would really help here.
> I don't see how validation is attractive in this situation, because presumably if the attributes don't pass validation from that schema, you wouldn't write them to disk in the first place.
That is quite a big assumption, which would be impossible to verify unless the schema is stored or pointed to explicitly. "Explicit is better than implicit" (Zen of Python #2). There can be any number of buggy client implementations, etc. The absence of a formalized schema at the Zarr level would encourage "schema-free" conventions downstream, and thus the breeding of unformalized conventions/extensions.
Besides validation, having a schema over the fields might open up opportunities for automatically constructing metadata visualization/editing UIs (e.g. using something like https://github.com/koumoul-dev/vuetify-jsonschema-form/ for vue), etc.
FWIW, having machine readable schema is a great feature for a standard to have: e.g. a foundational design principle within https://www.nwb.org/ (https://github.com/NeurodataWithoutBorders/nwb-schema), and recently (well -- years back but still being formalized) established within https://bids-specification.readthedocs.io/ (src/schema), but already acknowledged to be of great importance.
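As a rough illustration of what "generic validation" against a referenced schema could look like, here is a toy checker. It understands only three JSON Schema keywords ("type", "required", "properties") and is in no way a conforming validator; real tooling would use a full JSON Schema implementation. The schema contents and all names are assumptions made for the example.

```python
_JSON_TYPES = {"object": dict, "string": str, "integer": int, "array": list}


def validate(instance, schema: dict, path: str = "$") -> list:
    """Return a list of error strings; an empty list means the instance passed."""
    errors = []
    expected = schema.get("type")
    if expected is not None and not isinstance(instance, _JSON_TYPES[expected]):
        return [f"{path}: expected {expected}"]
    if isinstance(instance, dict):
        for key in schema.get("required", []):
            if key not in instance:
                errors.append(f"{path}: missing required key {key!r}")
        for key, subschema in schema.get("properties", {}).items():
            if key in instance:
                errors.extend(validate(instance[key], subschema, f"{path}.{key}"))
    return errors


# Hypothetical schema for a "units" convention, as it might be hosted at schema_url.
UNITS_SCHEMA = {
    "type": "object",
    "required": ["units"],
    "properties": {
        "units": {"type": "object", "properties": {"length": {"type": "string"}}},
    },
}

print(validate({"units": {"length": 3}}, UNITS_SCHEMA))
```

A validator like this needs no knowledge of the convention beyond the schema itself, which is the property being argued for.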
I'm not exactly sure what you are proposing, so it is hard to reply constructively, but it sounds like "dump everything into a dict", which would be counterproductive to the original intention of this ZEP to (citing from https://zarr.dev/zeps/draft/ZEP0004.html; emphasis is mine)
.. standardize conventions around metadata and layout of Zarr data using user-defined attributes ...
It is correct that I don't agree with the proposal of the ZEP, insofar as it proposes to embed schema / type information inside the thing being schematized / typed, but I'm not really advocating for "dumping everything in a dict" either.
What I advocate is very simple: Schemas and associated tooling should be used to generate and validate zarr hierarchies. E.g., defining Zarr hierarchies as typed data, and checking that instances of Zarr hierarchies pass type-checking. See pydantic-zarr for an example of this approach. This is a very simple idea: take some unstructured data, apply a type system to it, get structured data, move on. What's missing from this picture is the need to staple the type information to the data after you have type checked it, but that's essentially what this ZEP proposes, and what I disagree with.
I think there seems to be a tension between making the format more robustly machine verifiable/machine parseable (embed schema URL) and making the format more readily human readable and human writable (use only short identifier).
If attributes have to be (in practice) identified by a URL, then it becomes challenging for humans to write the format except by copy and paste.
I can see the merits of both sides, though personally I am inclined towards human readable/writable, similar to HTML and CSS, where short identifiers are used.
I would be inclined to just say that the convention is identified by the name of the attribute itself, and there is no separate "zarr_conventions" attribute. That generally becomes easier to work with and edit, compared to having to keep both a dictionary of attributes and a list of conventions in sync. It also avoids the possibility that two conventions would assign different meanings to the same attribute, which would prevent using both conventions at the same time. If URLs are used to identify conventions, then the attribute name would itself be the URL, which would be awkward, though.
@jbms, I like the idea of not having a separate zarr_conventions. Can you elaborate on your short identifier idea? Do you mean like the original idea in the ZEP, i.e. units_v1, or something different? In this scenario, how would we avoid conflicts? If we make the key more elaborate, then the identifier is not going to be short.
Or are you thinking more of a hierarchical way to define the attributes?
i.e.
{
    "units": {
        "length": "meter",
        "_convention": "< some-identifier >"
    }
}
{
    "stats": {
        "std": 42,
        "mean": 100,
        "_convention": "< some-identifier >"
    }
}
Yes, I mean something like units_v1; a nested _convention seems worse than just using the convention as the property identifier itself.
However, I think we certainly do need a way to distinguish arbitrary non-standard metadata (which will presumably be used very widely) from standardized metadata properties that should be listed in some registry to ensure the identifier is unique.
I'm not sure exactly what sort of syntax makes sense --- possibly something like "std:units" or "zarr:units" or "$unit" or "@Units".
However, I think the idea behind this zarr_conventions proposal is that you may already have a collection of datasets with various metadata, and software that consumes that metadata, and therefore do not want to change the representation of the metadata at all. Instead you just want to tack on an additional property to indicate what metadata conventions are in use without having to modify the existing data or software. Potentially these "legacy" conventions could still be handled as a separate property per convention, though: e.g. you could set "std:cf-conventions": true, but then there would be additional non-prefixed properties as defined by the convention.
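A reader following the prefix idea could separate registry-backed properties from arbitrary user metadata along these lines. The "std:" prefix is only one of the syntaxes floated above, and the function and constant names are hypothetical.

```python
REGISTRY_PREFIX = "std:"  # hypothetical; "zarr:", "$" and "@" were also floated


def split_attributes(attributes: dict) -> tuple:
    """Split attributes into (standardized, free_form) dictionaries."""
    standardized = {}
    free_form = {}
    for key, value in attributes.items():
        if key.startswith(REGISTRY_PREFIX):
            # Strip the prefix so callers see the registered convention name.
            standardized[key[len(REGISTRY_PREFIX):]] = value
        else:
            free_form[key] = value
    return standardized, free_form


standardized, free_form = split_attributes(
    {"std:units": {"length": "meter"}, "std:cf-conventions": True, "title": "demo"}
)
print(sorted(standardized), sorted(free_form))
```

In the "legacy" case, a flag such as "std:cf-conventions" ends up in the standardized part while the convention's own non-prefixed properties stay in the free-form part, so the reader still has to know the convention to interpret them.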
I have a units convention defined in another open source project. With the current state of things, what's the best way to share this? It has a JSON schema with namespaces for different unit types (edited to be similar to the explicit convention suggestion by @yarikoptic). I really like the JSON schema idea because we can run validation against it. (Expand the units dropdown if it doesn't show up via the hyperlink.) If you press "show json schema" it'll show there too. It's all pydantic and pint based. The way we can currently specify it is like this in the variable attributes.
Within array:
"units": {"density": "g/cm**3"},
Within group (?):
"zarr_conventions": {
    "units": {
        "version": 1,
        "homepage": "< reorged new link to rtd for convention >",
        "schema-url": "< maybe new repo with metadata conventions in json >"
    }
}
The ZEP is unclear on some aspects. Can we meet sometime to formalize the ZEP, freeze it, and start a concrete implementation? I have many use cases for this :) Some Qs;
... and more
Why should conventional zarr hierarchies be responsible for expressing which conventions they adhere to? (This amounts to the question of why nominal, rather than structural, typing is the right solution here.) Also, how can this effort express conventions w.r.t. the layout of arrays and groups in a hierarchy?
An alternative strategy is for Zarr hierarchy consumers to define the conventions they support, and to use the structure of Zarr hierarchies as the "signature" of those conventions. In this scenario, we would benefit from a common language for expressing a Zarr convention as a piece of data. Because the layout of a Zarr hierarchy is invariably part of the structure in scope for a convention, we need a piece of data that can express the structure + attributes of a Zarr hierarchy. This is addressed by the zarr object models ZEP, zarr-developers/zeps#46.
So, tl;dr, I don't see why we need to define a nominal type system for Zarr attributes, when we can do structural typing on the entire hierarchy (or parts of a hierarchy).
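A minimal sketch of the structural alternative: the consumer decides whether a node matches a convention by inspecting its shape, with no zarr_conventions declaration involved. The node is represented here as a plain dict loosely modelled on the metadata fragments quoted in this thread; the function name and the exact rule it checks are assumptions for illustration.

```python
def follows_units_convention(node: dict) -> bool:
    """Structural check: an array whose "units" attribute maps names to strings."""
    if node.get("node_type") != "array":
        return False
    units = node.get("attributes", {}).get("units")
    return (
        isinstance(units, dict)
        and bool(units)
        and all(isinstance(k, str) and isinstance(v, str) for k, v in units.items())
    )


array_node = {"node_type": "array", "attributes": {"units": {"density": "g/cm**3"}}}
print(follows_units_convention(array_node))
```

The trade-off is that two conventions with the same shape cannot be told apart this way, which is where a nominal declaration would help.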
"node_type": "group",
"attributes": {
    "zarr_conventions": ["units-v1", "foo"],
}
Choice of schema-language
I wanted to create a separate thread since the prior one is overloaded. If the idea of "reference/contain a schema for a convention" is generally accepted, it might be worth looking into defining it in https://linkml.io/ instead of jsonschema, since it is 1. more human readable/friendly; 2. convertible to jsonschema (or pydantic or ... see https://linkml.io/linkml/intro/overview.html#feature-rich-modeling-language )
It might be easier to establish such schemas that way. I'm not yet sure whether it would be easier to use in some cases, so it might be worthwhile to accompany conventions with both linkml and jsonschema URLs... sorry if I am adding another level of complexity right away, but I wanted to establish the "target horizon" right away ;-)
Hi everyone! 👋🏻
I did some preliminary work for ZEP0004 review, as mentioned here.
@rabernat, please have a look and let us know your thoughts. Thanks!