[draft] zarr object models #46

Open: wants to merge 17 commits into base: main. Changes from 1 commit.
248 changes: 248 additions & 0 deletions draft/ZEP0006.md
@@ -0,0 +1,248 @@
---
layout: default
title: ZEP0006
description: Defining a Zarr Object Model (ZOM)
parent: draft ZEPs
nav_order: 1
---

# ZEP 6 - A Zarr Object Model

Authors:

* Davis Bennett ([@d-v-b](https://github.com/d-v-b)), HHMI / Janelia Research Campus

Status: Draft

Type: Specification

Created: 2023-07-20


## Abstract

This ZEP defines Zarr Object Models, or ZOMs. ZOMs are abstract representations of Zarr hierarchies. The core of a ZOM is a language-independent interface that describes an abstract hierarchy as a tree of nodes.

The base ZOM defines two types of nodes: arrays and groups. Both types of nodes have an `attrs` property, which is an object with string keys and arbitrary values. The base ZOM does not define the exact properties of arrays, as these properties vary with Zarr versions. Groups have a property called `members`, which is an object with string keys and values that are either arrays or groups. A ZOM can be used by applications as the basis for a declarative, type-safe approach to managing Zarr hierarchies.

## Definition of hierarchy structure

This document distinguishes the *structure* of a Zarr hierarchy from the data stored in the hierarchy. The structure of a Zarr hierarchy is the layout of the tree of arrays and groups, together with the metadata of those arrays and groups. This definition omits the data stored in the arrays and the particular storage backend used to store data and metadata. By these definitions, two distinct Zarr hierarchies can have the same structure even if their arrays contain different values and/or the hierarchies are stored using different storage backends.

Because the structure of a Zarr hierarchy is decoupled, by definition, from the data stored in the hierarchy, it should be possible to represent the structure of a Zarr hierarchy with a compact data structure or interface. Such a data structure or interface would facilitate operations like evaluating whether two Zarr hierarchies are identically structured, evaluating whether a given Zarr hierarchy has a specific structure, or creating a Zarr hierarchy with a desired structure. This document formalizes the Zarr Object Model, an abstract model of the structure of a Zarr hierarchy. The ZOM serves as a foundation for tools that create and manipulate Zarr hierarchies at a structural level.
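
As a rough illustration of the first of these operations, the sketch below compares two structure descriptions held as nested Python dictionaries. The function names `structure_diff` and `same_structure`, and the sample dictionaries, are hypothetical and not part of any Zarr library; the only assumption is that the structure has already been read into plain data, with no chunk data or storage details attached.

```python
from typing import Any, Iterator


def structure_diff(a: Any, b: Any, path: str = "") -> Iterator[str]:
    """Yield the paths at which two structure descriptions differ."""
    if isinstance(a, dict) and isinstance(b, dict):
        # Compare key by key so a mismatch can be reported at its location.
        for key in sorted(a.keys() | b.keys()):
            child = f"{path}/{key}"
            if key not in a or key not in b:
                yield child
            else:
                yield from structure_diff(a[key], b[key], child)
    elif a != b:
        # Leaves (numbers, strings, lists, None) are compared whole.
        yield path


def same_structure(a: dict, b: dict) -> bool:
    """True when no path differs, i.e. the hierarchies are identically structured."""
    return next(structure_diff(a, b), None) is None


# Hypothetical structure of a hierarchy held in one storage backend ...
local = {
    "attrs": {"bar": "hello"},
    "members": {"foo": {"attrs": {}, "shape": [10, 10], "chunks": [1, 1]}},
}
# ... and of a second hierarchy elsewhere whose array is chunked differently.
remote = {
    "attrs": {"bar": "hello"},
    "members": {"foo": {"attrs": {}, "shape": [10, 10], "chunks": [5, 5]}},
}

differences = list(structure_diff(local, remote))
```

Here `differences` names the single location where the two structures disagree, which is the kind of check that is awkward to express without a compact structural representation.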

## Specification of the base Zarr Object Model

A node is an object with a property called `attrs` (short for "attributes"), which is a key-value data structure that contains content described as "arbitrary user metadata" in Zarr specifications. As of Zarr versions 2 and 3, `attrs` must be a JSON-serializable object.

The base ZOM defines exactly two types of node: groups and arrays. This definition will use the unqualified terms "array" and "group" to refer to the two nodes defined in the ZOM. Where necessary to avoid ambiguity, the objects *represented* by ZOM arrays and ZOM groups, i.e. Zarr arrays and Zarr groups, will be referred to as "Zarr arrays" and "Zarr groups".

ZOM arrays and ZOM groups represent Zarr arrays and Zarr groups in the simplest way possible that still conforms to the definition of "node" given above. Thus, a ZOM array is a node with properties identical to those defined in a particular specification of Zarr array metadata, unless one of those Zarr array properties contains user metadata, in which case a ZOM array does not include that property (since user metadata is already represented by the `attrs` property of the array). This definition is parametric with respect to a particular Zarr specification in order to accommodate future versions of Zarr that may add new properties to Zarr arrays.

Similarly, a ZOM group is a node with properties identical to those defined in a specification of Zarr group metadata, unless one of those properties contains user metadata, in which case a ZOM group does not contain that property, for the same reason given above for arrays. Beyond the properties of Zarr groups defined in a particular Zarr specification, a ZOM group has an additional property:

- `members`: a key-value data structure where the keys are strings and the values are arrays or groups. This property allows a ZOM group to represent the hierarchical relationship between Zarr groups and the Zarr arrays or Zarr groups contained within them.
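
The two node types described above can be sketched in Python as follows. This is one possible rendering, not a normative one: the class names `ArrayNode` and `GroupNode`, the `metadata` field used to hold version-specific properties, and the `walk` helper are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, Tuple, Union


@dataclass
class ArrayNode:
    """A ZOM array: `attrs` plus version-specific array metadata."""
    attrs: Dict[str, Any] = field(default_factory=dict)
    # Assumption: properties such as shape or dtype live in a plain dict,
    # because the base ZOM leaves them to each Zarr version.
    metadata: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GroupNode:
    """A ZOM group: `attrs`, version-specific metadata, and `members`."""
    attrs: Dict[str, Any] = field(default_factory=dict)
    metadata: Dict[str, Any] = field(default_factory=dict)
    # Keys are member names; values are arrays or (nested) groups.
    members: Dict[str, Union["ArrayNode", "GroupNode"]] = field(default_factory=dict)


Node = Union[ArrayNode, GroupNode]


def walk(node: Node, path: str = "") -> Iterator[Tuple[str, Node]]:
    """Yield (path, node) pairs, parents before their members."""
    yield path or "/", node
    if isinstance(node, GroupNode):
        for name, child in node.members.items():
            yield from walk(child, f"{path}/{name}")


tree = GroupNode(
    attrs={"foo": 10},
    members={
        "raw": ArrayNode(metadata={"shape": [10, 10]}),
        "labels": GroupNode(members={"mask": ArrayNode()}),
    },
)
paths = [path for path, _ in walk(tree)]
```

Only `members` carries the parent-child relationship; everything else on a node is either user metadata (`attrs`) or metadata defined by a Zarr specification.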

If future versions of Zarr use a property called `members` for some element of Zarr group metadata, then there would be a naming collision between the `members` property of a Zarr group and the `members` property of a ZOM group. In this case, the ZOM group would rename the Zarr group's `members` property to `_members`, and any additional name collisions would be resolved by prepending additional underscore ("_") characters. E.g., in the unlikely case that `members` and `_members` are *both* listed in Zarr group metadata, the ZOM group would map the `members` property of the Zarr group to a property called `__members`.
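
The renaming rule can be sketched as a small function. This is one reading of the paragraph above, under the assumption that only the Zarr property literally named `members` is ever renamed and that every other Zarr property keeps its name; `zom_property_names` is a hypothetical helper, not an existing API.

```python
from typing import Dict, Iterable

RESERVED = "members"  # the property name claimed by the ZOM group itself


def zom_property_names(zarr_group_keys: Iterable[str]) -> Dict[str, str]:
    """Map each Zarr group metadata key to the ZOM property that holds it."""
    keys = set(zarr_group_keys)
    mapping = {key: key for key in keys}
    if RESERVED in keys:
        # Prepend underscores until the name no longer collides with
        # another key already present in the Zarr group metadata.
        candidate = "_" + RESERVED
        while candidate in keys:
            candidate = "_" + candidate
        mapping[RESERVED] = candidate
    return mapping


no_collision = zom_property_names(["zarr_format"])
single = zom_property_names(["zarr_format", "members"])
double = zom_property_names(["members", "_members"])
```

In the `double` case the Zarr group's own `_members` stays where it is and its `members` moves to `__members`, matching the example in the text.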

Member

I want to reiterate the idea of broadening this ZEP to include persisted consolidated metadata. Basically, why not allow storing the members property in a zarr.json?
We would need to define the semantics of consolidated metadata (e.g. do member nodes still need JSON files, does the members hierarchy need to be exhaustive). I would be happy to contribute that if there is interest in moving this ZEP in that direction.

Author

There's definitely interest in that, my apologies for not making this clearer earlier. I'm not a user of consolidated metadata so I don't have a lot of experience with it, but I think for this ZEP to encompass consolidated metadata functionality as it exists today (i.e., a flat list of string keys pointing to JSON objects) we would need to define a tree flattening operation, and possibly make members nullable (because in a flattened representation a ZOM group shouldn't hold a reference to its contents). Alternatively, if the flattened representation of the hierarchy used in consolidated metadata isn't essential to its function, we could simply put a ZOM in JSON and leave it to clients to do the flattening. I don't have strong feelings either way! You should absolutely feel free to contribute something here.

Contributor

@rabernat, Sep 26, 2023

I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.

Author

> Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

In this case, maybe it would be good to get a statement like this in this ZEP to clarify the relationship between the abstract ZOM and consolidated metadata.

Contributor

I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.

In python terms, I'd like to allow members to be either a set of child objects or a generator that yields such objects lazily. Is this making it too complicated?

Member

> (i.e., a flat list of string keys pointing to JSON objects)

It wouldn't have to have this structure. It could be a nested json structure like this:

{
  "zarr_format": 3,
  "node_type": "group",
  "attributes": {},
  "members": {
    "some_group": {
      "zarr_format": 3,
      "node_type": "group",
      "attributes": {},
      "members": {
        "some_array": {
          "zarr_format": 3,
          "node_type": "array",
          ...
        }
      }
    }
  }
}

> I think we should keep consolidated metadata separate from this ZOM concept. Consolidated metadata is a serialization choice, while the object model describes the relationships between entities in a more abstract, serialization independent way.

I think there is strong overlap between the ZOM and consolidated metadata. This ZEP introduces a JSON schema that describes the existing metadata of groups and arrays with a new addition of the members property. I think it would be very confusing, if consolidated metadata would end up with different terminology than the ZOM.

> We intend to propose a ZEP which uses a STAC style link-relation element to allow a client to traverse an entire hierarchy without being able to list a store. This is similar to consolidated metadata but more scalable because it does not require all the metadata to be in a single json file. For context, we have hierarchies with 100_000 nodes. Would be happy to collaborate and iterate with you on that.

That sounds great. As I said, we can discuss the semantics and features of the consolidated metadata. That could include linking. I don't think we should limit ourselves by what the implementation in zarr-python currently has.

Author

@d-v-b, Sep 26, 2023

> I am now wondering if this members property needs to be tweaked to accommodate the case of very large hierarchies. As is, it seems like the entire hierarchy may have to be explicitly populated all at once.

This concern is real! See also zarr-developers/pydantic-zarr#2 . The proposal there was to make members nullable, where None would encode "The members have not been parsed", and to give a tree parser the option to limit the depth of traversal, which would result in "truncated" GroupSpec instances being valid. But maybe the python generator approach obviates the need to express this with nullability? I'm open to suggestions here

Contributor

It would be nice to be able to distinguish between "there are definitely no members" vs. "there may be members, but they have to be discovered explicitly"

Author

in pydantic-zarr members is now nullable, and it's been extremely useful. That being said, this can be viewed as a well-defined transformation of the base type, so it's not clear if the ZEP actually needs to address it.
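
A minimal sketch of the "transformation of the base type" reading from this thread: a depth-limited copy of a ZOM-style dictionary in which `members` is `None` wherever traversal stopped. The function name `truncate` and its `depth` parameter are hypothetical and do not describe pydantic-zarr's actual API; the sketch also assumes that a node is a group exactly when it has a `members` key.

```python
import copy
from typing import Any, Dict


def truncate(node: Dict[str, Any], depth: int) -> Dict[str, Any]:
    """Copy a ZOM-style dict, setting `members` to None below `depth` levels."""
    out = {k: copy.deepcopy(v) for k, v in node.items() if k != "members"}
    if "members" in node:  # assumption: only groups carry a `members` key
        if depth <= 0 or node["members"] is None:
            # None encodes "members not discovered", as opposed to {} ("no members").
            out["members"] = None
        else:
            out["members"] = {
                name: truncate(child, depth - 1)
                for name, child in node["members"].items()
            }
    return out


tree = {
    "attrs": {},
    "members": {
        "g": {"attrs": {}, "members": {"a": {"attrs": {}, "shape": [1]}}},
        "empty": {"attrs": {}, "members": {}},
    },
}
shallow = truncate(tree, 1)
full = truncate(tree, 2)
```

With `depth=1` the sub-groups are present but their contents are unknown (`None`); with `depth=2` the empty group is reported as definitely empty (`{}`), which is the distinction asked for above.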


Thus, ZOM groups and ZOM arrays can represent the structure of a Zarr hierarchy, per the description given in [Definition of hierarchy structure](#definition-of-hierarchy-structure).

### ZOM in JSON

The ZOM representation of a Zarr hierarchy can be easily represented as a JSON object. Here is an example of a ZOM group representing a Zarr group that contains a single two-dimensional Zarr array using Zarr version 2. Both the Zarr group and the Zarr array contain user metadata.

```json
{
  "zarr_format": 2,
  "attrs": {
    "foo": 10,
    "bar": "hello"
  },
  "members": {
    "foo": {
      "zarr_format": 2,
      "shape": [10, 10],
      "chunks": [1, 1],
      "dtype": "|u1",
      "compressor": null,
      "fill_value": 0,
      "order": "C",
      "filters": null,
      "attrs": {
        "name": "my cool array"
      }
    }
  }
}
```
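
To connect this JSON form back to stored metadata, the sketch below splits a Zarr v2 ZOM document into the metadata documents a v2 store would hold, using the v2 key names `.zgroup`, `.zarray` and `.zattrs`. The function `to_v2_documents` is hypothetical, and it assumes that a node is a group exactly when it has a `members` key. It produces documents only; it writes nothing to a store.

```python
from typing import Any, Dict


def to_v2_documents(node: Dict[str, Any], path: str = "") -> Dict[str, Any]:
    """Map a Zarr v2 ZOM dict to {store key: metadata document}."""
    prefix = f"{path}/" if path else ""
    # Everything except `attrs` and `members` is Zarr array or group metadata.
    metadata = {k: v for k, v in node.items() if k not in ("attrs", "members")}
    # Assumption: a node with a `members` key is a group, otherwise an array.
    is_group = "members" in node
    docs = {
        prefix + (".zgroup" if is_group else ".zarray"): metadata,
        # User attributes are stored separately from array/group metadata in v2.
        prefix + ".zattrs": node.get("attrs", {}),
    }
    for name, child in node.get("members", {}).items():
        docs.update(to_v2_documents(child, prefix + name))
    return docs


zom = {
    "zarr_format": 2,
    "attrs": {"foo": 10, "bar": "hello"},
    "members": {
        "foo": {
            "zarr_format": 2,
            "shape": [10, 10],
            "chunks": [1, 1],
            "dtype": "|u1",
            "compressor": None,
            "fill_value": 0,
            "order": "C",
            "filters": None,
            "attrs": {"name": "my cool array"},
        }
    },
}
documents = to_v2_documents(zom)
```

The `members` property never appears in any output document: in a v2 store the parent-child relationship is carried by the key prefixes, whereas in the ZOM it is carried by `members`.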

The ZOM itself can also be represented as a JSON schema. Here is the ZOM for Zarr V2 expressed as a JSON schema:
```json
{
  "$ref": "#/definitions/Group",
  "definitions": {
    "Array": {
      "title": "Array",
      "description": "Model of a Zarr Version 2 Array",
      "type": "object",
      "properties": {
        "attrs": {
          "title": "Attrs",
          "type": "object"
        },
        "shape": {
          "title": "Shape",
          "type": "array",
          "items": {
            "type": "integer"
          }
        },
        "chunks": {
          "title": "Chunks",
          "type": "array",
          "items": {
            "type": "integer"
          }
        },
        "dtype": {
          "title": "Dtype",
          "anyOf": [
            {
              "type": "string"
            },
            {
              "type": "array",
              "items": {
                "type": "string"
              }
            }
          ]
        },
        "compressor": {
          "title": "Compressor",
          "type": ["object", "null"]
        },
        "fill_value": {
          "title": "Fill Value"
        },
        "order": {
          "title": "Order",
          "enum": [
            "C",
            "F"
          ],
          "type": "string"
        },
        "filters": {
          "title": "Filters",
          "type": ["array", "null"],
          "items": {
            "type": "object"
          }
        },
        "dimension_separator": {
          "title": "Dimension Separator",
          "enum": [
            ".",
            "/"
          ],
          "type": "string"
        },
        "zarr_format": {
          "title": "Zarr Format",
          "default": 2,
          "type": "integer"
        }
      },
      "required": [
        "attrs",
        "shape",
        "chunks",
        "dtype",
        "compressor",
        "order",
        "filters"
      ],
      "additionalProperties": false
    },
    "Group": {
      "title": "Group",
      "description": "Model of a Zarr Version 2 Group",
      "type": "object",
      "properties": {
        "attrs": {
          "title": "Attrs",
          "type": "object"
        },
        "members": {
          "title": "Members",
          "type": "object",
          "additionalProperties": {
            "anyOf": [
              {
                "$ref": "#/definitions/Array"
              },
              {
                "$ref": "#/definitions/Group"
              }
            ]
          }
        },
        "zarr_format": {
          "title": "Zarr Format",
          "default": 2,
          "type": "integer"
        }
      },
      "required": [
        "attrs",
        "members"
      ],
      "additionalProperties": false
    }
  }
}
```

And Zarr V3:

```json
# insert schema for v3 here
```

Member

Let me know if you could use some help generating this.

Author

Some help here would be great, thank you!


## Related Work



## Implementation

- pydantic-zarr
- ?

Member

  • dataclass zarr

(I have an unpublished version that I can share soon)

Member

zarrita also has attrs classes that define the metadata (minus the new members properties) https://github.com/scalableminds/zarrita/blob/async/zarrita/metadata.py#L259


## Discussion

- todo: show that consolidated metadata can be achieved by applying a flattening transformation to a ZOM representation of a hierarchy.
- The origins of consolidated metadata:
  * <https://github.com/pangeo-data/pangeo/issues/309>
  * <https://github.com/zarr-developers/zarr-python/pull/268>
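
A flattening transformation of the kind mentioned in the todo item could look like the following sketch. It is not the consolidated metadata format used by zarr-python; the flat mapping of paths to nodes, the use of `members: None` to mark groups, and the names `flatten` and `unflatten` are assumptions made for illustration. The sketch also assumes member names never contain `/`.

```python
from typing import Any, Dict


def flatten(node: Dict[str, Any], path: str = "") -> Dict[str, Dict[str, Any]]:
    """Return {path: node without its members} for every node in the tree."""
    entry = {k: v for k, v in node.items() if k != "members"}
    if "members" in node:
        # Keep a null marker so groups stay distinguishable from arrays.
        entry["members"] = None
    flat = {path: entry}
    for name, child in (node.get("members") or {}).items():
        flat.update(flatten(child, f"{path}/{name}"))
    return flat


def unflatten(flat: Dict[str, Dict[str, Any]]) -> Dict[str, Any]:
    """Rebuild the nested tree from the output of `flatten`."""
    nodes: Dict[str, Dict[str, Any]] = {}
    # Visit shallow paths first so each parent exists before its members.
    for path in sorted(flat, key=lambda p: p.count("/")):
        node = dict(flat[path])
        if "members" in node:
            node["members"] = {}
        nodes[path] = node
        if path:  # the root has the empty path and no parent
            parent, _, name = path.rpartition("/")
            nodes[parent]["members"][name] = node
    return nodes[""]


tree = {
    "attrs": {"foo": 10},
    "members": {
        "g": {"attrs": {}, "members": {"a": {"attrs": {}, "shape": [10, 10]}}},
        "b": {"attrs": {"name": "my cool array"}, "shape": [4]},
    },
}
flat = flatten(tree)
round_trip = unflatten(flat)
```

Because `unflatten(flatten(tree))` reproduces the original tree, the flat form loses no structural information, which is the property the todo item asks to demonstrate.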

Member

it may also be worth summarizing some of the intended benefits to existing/internal applications. For example, the utilization of a standard data object internally within zarr-python may help improve workflow for creating large hierarchies by allowing users to create the ZOM metadata before passing it to a zarr.creation method.


## References and Footnotes


## License

<p xmlns:dct="http://purl.org/dc/terms/">
<a rel="license"
href="http://creativecommons.org/publicdomain/zero/1.0/">
<img src="https://licensebuttons.net/p/zero/1.0/80x15.png" style="border-style: none;" alt="CC0" />
</a>
<br />
To the extent possible under law,
<a rel="dct:publisher"
href="https://github.com/zarr-developers/zeps">
<span property="dct:title">the authors</span></a>
have waived all copyright and related or neighboring rights to
<span property="dct:title">ZEP 6</span>.
</p>