
WIP: Refactor TIFF backend to use async_tiff and obstore #488

Closed
wants to merge 42 commits
Changes from all commits
Commits
42 commits
0ec7f19
Start work on VirtualObjectStore
maxrjones Mar 9, 2025
08aba9c
Start refactoring the TIFF backend to use async-tiff
maxrjones Mar 13, 2025
1820025
Update VirtualObjectStore
maxrjones Mar 13, 2025
d2e670b
Handle group zarr.json
maxrjones Mar 13, 2025
8308983
Fix dim names and chunkmanifest shape
maxrjones Mar 14, 2025
577b882
Load some chunks
maxrjones Mar 14, 2025
33a0399
Start generalizing listing
maxrjones Mar 14, 2025
644bc58
Handle running event loops
maxrjones Mar 15, 2025
9b1b63a
Handle other request types
maxrjones Mar 15, 2025
d53e426
Draft VirtualObjectStore implementation
maxrjones Mar 16, 2025
5ad7d37
Add test
maxrjones Mar 16, 2025
ac64c16
Handle dataset metadata
maxrjones Mar 16, 2025
e9f213f
Fix chunk length
maxrjones Mar 16, 2025
1079607
Pass kwargs to open_dataset
maxrjones Mar 16, 2025
b27dec9
Add variable attrs to Zarr metadata
maxrjones Mar 16, 2025
266fbed
Remove extra env
maxrjones Mar 16, 2025
480c175
Raise NotImplementErrors on get_partial_values and exists
maxrjones Mar 16, 2025
04a2e60
Add docstring for
maxrjones Mar 16, 2025
81b13c8
Make all VirtualObjectStore instances read_only
maxrjones Mar 16, 2025
540b0a8
Remove unused get_partial_values functionality
maxrjones Mar 16, 2025
29111c2
Add more docstrings
maxrjones Mar 16, 2025
147682b
Fix some typos
maxrjones Mar 16, 2025
2475e6d
Fix typing
maxrjones Mar 16, 2025
cc1b2f8
Add release notes
maxrjones Mar 16, 2025
5dee76d
Merge branch 'develop' into virtual-obstore-store
maxrjones Mar 16, 2025
5996c3f
Merge branch 'develop' into tiff_with_virtualobjectstore
maxrjones Mar 16, 2025
e3195fd
Add obstore to test and typing envs
maxrjones Mar 16, 2025
e4bfdbd
Separate out byte range transformation
maxrjones Mar 16, 2025
ac37f15
Revise based on code review
maxrjones Mar 17, 2025
58fc240
Simplify typing
maxrjones Mar 17, 2025
f15a52a
Move store selection outside try/except block
maxrjones Mar 17, 2025
52272d0
Remove accessor method
maxrjones Mar 17, 2025
756c3c4
Rename to ManifestStore
maxrjones Mar 17, 2025
c14e67a
Don't include ManifestStore in test_integration
maxrjones Mar 17, 2025
e8b995c
Merge branch 'develop' into virtual-obstore-store
maxrjones Mar 18, 2025
52287e9
Add basic test for ManifestStore
maxrjones Mar 18, 2025
59b45a0
Merge branch 'virtual-obstore-store' into tiff_with_virtualobjectstore
maxrjones Mar 19, 2025
bc52aa0
Refactor around Zarr model
maxrjones Mar 19, 2025
f33b614
Fix import of optional deps
maxrjones Mar 19, 2025
09dacc1
Fix typo
maxrjones Mar 19, 2025
83a5622
Fix import
maxrjones Mar 19, 2025
71543cd
Rename static method
maxrjones Mar 19, 2025
2 changes: 2 additions & 0 deletions docs/releases.rst
@@ -9,6 +9,8 @@ v1.3.3 (unreleased)
New Features
~~~~~~~~~~~~

- Added experimental VirtualObjectStore for loading data directly from virtual datasets.

Breaking changes
~~~~~~~~~~~~~~~~

20 changes: 15 additions & 5 deletions pyproject.toml
@@ -38,7 +38,9 @@ remote = [
"aiohttp",
"s3fs",
]

obstore = [
"obstore @ git+https://github.com/developmentseed/obstore@main#subdirectory=obstore",
]
# non-kerchunk-based readers
hdf = [
"virtualizarr[remote]",
@@ -64,11 +66,16 @@ fits = [
"kerchunk>=0.2.8",
"astropy",
]
tif = [
"obstore @ git+https://github.com/developmentseed/obstore@main#subdirectory=obstore",
"async-tiff @ git+https://github.com/developmentseed/async-tiff#subdirectory=python",
]
all_readers = [
"virtualizarr[hdf]",
"virtualizarr[hdf5]",
"virtualizarr[netcdf3]",
"virtualizarr[fits]",
"virtualizarr[tif]",
]

# writers
@@ -157,6 +164,9 @@ h5netcdf = ">=1.5.0,<2"
[tool.pixi.feature.icechunk-dev.dependencies]
rust = "*"

[tool.pixi.feature.rio.dependencies]
rioxarray = "*"

# Define commands to run within the test environments
[tool.pixi.feature.dev.tasks]
run-mypy = { cmd = "mypy virtualizarr" }
@@ -170,11 +180,11 @@ run-tests-html-cov = { cmd = "pytest --run-network-tests --verbose --cov=virtual
[tool.pixi.environments]
min-deps = ["dev", "hdf", "hdf5", "hdf5-lib"] # VirtualiZarr/conftest.py uses h5py, so the minimum set of dependencies for testing still includes hdf libs
# Inherit from min-deps to get all the test commands, along with optional dependencies
test = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib"]
test-py311 = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "py311"] # test against python 3.11
test-py312 = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "py312"] # test against python 3.12
test = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "tif", "rio"]
test-py311 = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "tif", "rio", "py311"] # test against python 3.11
test-py312 = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "tif", "rio", "py312"] # test against python 3.12
upstream = ["dev", "hdf", "hdf5", "hdf5-lib", "netcdf3", "upstream", "icechunk-dev"]
all = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "all_readers", "all_writers"]
all = ["dev", "remote", "hdf", "hdf5", "netcdf3", "fits", "icechunk", "kerchunk", "hdf5-lib", "obstore", "tif", "all_readers", "all_writers"]
docs = ["docs"]


1 change: 1 addition & 0 deletions virtualizarr/__init__.py
@@ -1,5 +1,6 @@
from virtualizarr.manifests import ChunkManifest, ManifestArray # type: ignore # noqa
from virtualizarr.accessor import VirtualiZarrDatasetAccessor # type: ignore # noqa

from virtualizarr.backend import open_virtual_dataset # noqa: F401

from importlib.metadata import version as _version
7 changes: 6 additions & 1 deletion virtualizarr/accessor.py
@@ -1,8 +1,13 @@
from __future__ import annotations

from datetime import datetime
from pathlib import Path
from typing import TYPE_CHECKING, Callable, Literal, overload

from xarray import Dataset, register_dataset_accessor
from xarray import (
Dataset,
register_dataset_accessor,
)

from virtualizarr.manifests import ManifestArray
from virtualizarr.types.kerchunk import KerchunkStoreRefs
1 change: 1 addition & 0 deletions virtualizarr/manifests/__init__.py
@@ -2,4 +2,5 @@
# This is just to avoid conflicting with some type of file called manifest that .gitignore recommends ignoring.

from .array import ManifestArray # type: ignore # noqa
from .group import ManifestGroup # type: ignore # noqa
from .manifest import ChunkEntry, ChunkManifest # type: ignore # noqa
36 changes: 36 additions & 0 deletions virtualizarr/manifests/group.py
@@ -0,0 +1,36 @@
from typing import TypeAlias

from zarr.core.group import GroupMetadata

from virtualizarr.manifests import ManifestArray

ManifestDict: TypeAlias = dict[str, ManifestArray]


class ManifestGroup:
"""
Virtualized representation of multiple ManifestArrays as a Zarr Group.
"""

_manifest_dict: ManifestDict
_metadata: GroupMetadata

def __init__(
self,
manifest_dict: ManifestDict,
attributes: dict,
) -> None:
"""
Create a ManifestGroup from the dictionary of ManifestArrays and the group / dataset level metadata

Parameters
----------
manifest_dict : ManifestDict
    Dictionary of ManifestArrays keyed by array name
attributes : dict
    Attributes to include in the Group metadata
"""

self._metadata = GroupMetadata(attributes=attributes)
self._manifest_dict = manifest_dict

def __str__(self) -> str:
return f"ManifestGroup({self._manifest_dict}, {self._metadata})"

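The new `ManifestGroup` pairs a dict of `ManifestArray`s with zarr's `GroupMetadata`. A stdlib-only sketch of the group-level `zarr.json` document that metadata corresponds to (the helper name here is illustrative, not part of the PR; field names follow the Zarr v3 spec):

```python
import json

# Hedged sketch: Zarr v3 group metadata is a JSON document carrying
# zarr_format, node_type, and user attributes.
def group_zarr_json(attributes: dict) -> str:
    return json.dumps(
        {"zarr_format": 3, "node_type": "group", "attributes": attributes}
    )

doc = json.loads(group_zarr_json({"title": "virtual TIFF dataset"}))
```

This is the shape of the `zarr.json` the "Handle group zarr.json" commit above refers to; the real implementation delegates serialization to `zarr.core.group.GroupMetadata`.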
5 changes: 4 additions & 1 deletion virtualizarr/manifests/utils.py
@@ -17,6 +17,7 @@ def create_v3_array_metadata(
fill_value: Any = None,
codecs: Optional[list[Dict[str, Any]]] = None,
attributes: Optional[Dict[str, Any]] = None,
dimension_names: Optional[tuple[str, ...]] = None,
) -> ArrayV3Metadata:
"""
Create an ArrayV3Metadata instance with standard configuration.
@@ -36,6 +37,8 @@
List of codec configurations
attributes : Dict[str, Any], optional
Additional attributes for the array
dimension_names : tuple[str], optional
Names of the dimensions

Returns
-------
@@ -56,7 +59,7 @@
dtype=data_type,
),
attributes=attributes or {},
dimension_names=None,
dimension_names=dimension_names,
storage_transformers=None,
)

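The change above threads `dimension_names` through to `ArrayV3Metadata` instead of hard-coding `None`. A stdlib-only sketch of the resulting Zarr v3 array metadata fields (field names follow the v3 spec; the helper itself is illustrative, not the PR's `create_v3_array_metadata`):

```python
# Hedged sketch of the Zarr v3 array metadata that now includes
# dimension_names, so downstream readers can recover xarray dims.
def v3_array_metadata(shape, data_type, chunk_shape, dimension_names=None):
    return {
        "zarr_format": 3,
        "node_type": "array",
        "shape": list(shape),
        "data_type": data_type,
        "chunk_grid": {
            "name": "regular",
            "configuration": {"chunk_shape": list(chunk_shape)},
        },
        "dimension_names": list(dimension_names) if dimension_names else None,
    }

meta = v3_array_metadata(
    (720, 1440), "float32", (180, 360), dimension_names=("lat", "lon")
)
```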
43 changes: 43 additions & 0 deletions virtualizarr/readers/common.py
@@ -1,14 +1,19 @@
import dataclasses
from abc import ABC
from collections.abc import Iterable, Mapping, MutableMapping
from typing import (
Any,
Hashable,
Optional,
TypedDict,
)

import numpy as np
import xarray # noqa
from numcodecs.abc import Codec
from xarray import (
Coordinates,
DataArray,
Dataset,
DataTree,
Index,
@@ -21,6 +26,26 @@
from virtualizarr.utils import _FsspecFSFromFilepath


@dataclasses.dataclass
class ZstdProperties:
level: int


@dataclasses.dataclass
class ShuffleProperties:
elementsize: int


@dataclasses.dataclass
class ZlibProperties:
level: int


class CFCodec(TypedDict):
target_dtype: np.dtype
codec: Codec


def maybe_open_loadable_vars_and_indexes(
filepath: str,
loadable_variables,
@@ -86,6 +111,24 @@
return loadable_vars, indexes


def construct_virtual_dataarray(
virtual_var,
coord_vars: Optional[Variable] = None,
name: Optional[str] = None,
dims: Optional[Hashable] = None,
attrs: Optional[dict] = None,
) -> DataArray:
"""Construct a virtual DataArray from consistuent parts."""
vda = DataArray(

data=virtual_var,
coords=coord_vars,
attrs=attrs,
dims=dims,
name=name,
)
return vda



def construct_virtual_dataset(
virtual_vars,
loadable_vars,
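`construct_virtual_dataarray` simply forwards its constituent parts to `xarray.DataArray`. A hedged usage sketch, assuming xarray and numpy are installed; a plain numpy array stands in here for the ManifestArray-backed variable the PR would actually pass:

```python
import numpy as np
from xarray import DataArray

# Mirrors the helper added in this PR: bundle the parts into a DataArray.
def construct_virtual_dataarray(
    virtual_var, coord_vars=None, name=None, dims=None, attrs=None
) -> DataArray:
    return DataArray(
        data=virtual_var, coords=coord_vars, attrs=attrs, dims=dims, name=name
    )

vda = construct_virtual_dataarray(
    np.zeros((2, 3)), dims=("y", "x"), name="tas", attrs={"units": "K"}
)
```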
28 changes: 7 additions & 21 deletions virtualizarr/readers/hdf/filters.py
@@ -1,12 +1,18 @@
import dataclasses
from typing import TYPE_CHECKING, List, Tuple, TypedDict, Union
from typing import TYPE_CHECKING, List, Tuple, Union

import numcodecs.registry as registry
import numpy as np
from numcodecs.abc import Codec
from numcodecs.fixedscaleoffset import FixedScaleOffset
from xarray.coding.variables import _choose_float_dtype

from virtualizarr.readers.common import (
CFCodec,
ShuffleProperties,
ZlibProperties,
ZstdProperties,
)
from virtualizarr.utils import soft_import

if TYPE_CHECKING:
@@ -52,26 +58,6 @@ def __post_init__(self):
self.cname = blosc_compressor_codes[self.cname]


@dataclasses.dataclass
class ZstdProperties:
level: int


@dataclasses.dataclass
class ShuffleProperties:
elementsize: int


@dataclasses.dataclass
class ZlibProperties:
level: int


class CFCodec(TypedDict):
target_dtype: np.dtype
codec: Codec


def _filter_to_codec(
filter_id: str, filter_properties: Union[int, None, Tuple] = None
) -> Codec:
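This hunk moves `ZstdProperties`, `ShuffleProperties`, `ZlibProperties`, and `CFCodec` out of the HDF-specific filters module into `virtualizarr.readers.common`, so the new TIFF path can share them. The dataclasses are reproduced below verbatim from the diff, followed by an illustrative lookup in the spirit of `_filter_to_codec` (the `parse_filter_properties` helper is hypothetical; the real function returns numcodecs `Codec` instances):

```python
import dataclasses

@dataclasses.dataclass
class ZstdProperties:
    level: int

@dataclasses.dataclass
class ShuffleProperties:
    elementsize: int

@dataclasses.dataclass
class ZlibProperties:
    level: int

_FILTERS = {"zstd": ZstdProperties, "shuffle": ShuffleProperties, "zlib": ZlibProperties}

def parse_filter_properties(filter_id: str, value: int):
    # Hypothetical helper: map a filter id and its raw property value
    # onto the matching typed record.
    return _FILTERS[filter_id](value)

props = parse_filter_properties("zlib", 4)
```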