
Commit 18f35da

jendrikjoe authored and shoyer committed
Appending to zarr store (#2706)
* Initial version of appending to zarr store
* Added docs
* Resolve PEP8 incompliances
* Added write and append test for mode 'a'
* Merged repaired master
* Resolved pep8 issue
* Put target store encoding in appended variable
* Rewrite test with appending along time dimension
* Add chunk_size parameter for rechunking appended coordinate
* Add chunk_dim test
* Add type check and tests for it. In append mode storing any datatype apart from number subtypes and fixed size strings raises an error.
* Add documentation
* Add test for compute=False and commented it out
* Remove python 3.7 string formatting
* Fix PEP8 incompliance
* Add missing whitespaces
* allowed for compute=False when appending to a zarr store
* Fixed empty array data error
  When using create_append_test_data we used np.arange(...), which was incidentally also the default value of the zarr array when fill_value is set to None. So appending to the data with compute=False and then expecting an error when asserting the source and target to be the same failed the tests. Using random data passes the tests
* flake8 fixes
* removed chunk_dim argument to to_zarr function
* implemented requested changes
* Update xarray/backends/api.py
  Co-Authored-By: Stephan Hoyer <[email protected]>
* added contributors and example of using append to zarr
* fixed docs fail
* fixed docs
* removed unnecessary condition
* attempt at clean string encoding and variable length strings
* implemented suggestions
* append_dim does not need to be specified if creating a new array with Dataset.to_zarr(store, mode='a')
* cleaned up to_zarr append mode tests
* raise ValueError when append_dim is not a valid dimension
* flake8 fix
* removed unused comment
* raise error when appending with encoding provided for existing variable
* add test for encoding consistency when appending
* implemented: #2706 (comment)
* refactored tests
1 parent d30635c commit 18f35da

8 files changed: +394 -56 lines changed
Diff for: doc/io.rst

+23
@@ -604,6 +604,29 @@ store is already present at that path, an error will be raised, preventing it
 from being overwritten. To override this behavior and overwrite an existing
 store, add ``mode='w'`` when invoking ``to_zarr``.

+It is also possible to append to an existing store. For that, add ``mode='a'``
+and set ``append_dim`` to the name of the dimension along which to append.
+
+.. ipython:: python
+   :suppress:
+
+    ! rm -rf path/to/directory.zarr
+
+.. ipython:: python
+
+    ds1 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
+                     coords={'x': [10, 20, 30, 40],
+                             'y': [1,2,3,4,5],
+                             't': pd.date_range('2001-01-01', periods=2)})
+    ds1.to_zarr('path/to/directory.zarr')
+    ds2 = xr.Dataset({'foo': (('x', 'y', 't'), np.random.rand(4, 5, 2))},
+                     coords={'x': [10, 20, 30, 40],
+                             'y': [1,2,3,4,5],
+                             't': pd.date_range('2001-01-03', periods=2)})
+    ds2.to_zarr('path/to/directory.zarr', mode='a', append_dim='t')
+
+To store variable length strings use ``dtype=object``.
+
 To read back a zarr dataset that has been created this way, we use the
 :py:func:`~xarray.open_zarr` method:

Diff for: doc/whats-new.rst

+5
@@ -226,6 +226,11 @@ Other enhancements
   report showing what exactly differs between the two objects (dimensions /
   coordinates / variables / attributes) (:issue:`1507`).
   By `Benoit Bovy <https://github.com/benbovy>`_.
+- Added append capability to the zarr store.
+  By `Jendrik Jördening <https://github.com/jendrikjoe>`_,
+  `David Brochart <https://github.com/davidbrochart>`_,
+  `Ryan Abernathey <https://github.com/rabernat>`_ and
+  `Shikhar Goenka<https://github.com/shikharsg>`_.
 - Resampling of standard and non-standard calendars indexed by
   :py:class:`~xarray.CFTimeIndex` is now possible. (:issue:`2191`).
   By `Jwen Fai Low <https://github.com/jwenfai>`_ and

Diff for: xarray/backends/api.py

+54-3
@@ -4,17 +4,21 @@
 from io import BytesIO
 from numbers import Number
 from pathlib import Path
+import re

 import numpy as np
+import pandas as pd

-from .. import Dataset, DataArray, backends, conventions
+from .. import Dataset, DataArray, backends, conventions, coding
 from ..core import indexing
 from .. import auto_combine
 from ..core.combine import (combine_by_coords, _nested_combine,
                             _infer_concat_order_from_positions)
 from ..core.utils import close_on_error, is_grib_path, is_remote_uri
+from ..core.variable import Variable
 from .common import ArrayWriter
 from .locks import _get_scheduler
+from ..coding.variables import safe_setitem, unpack_for_encoding

 DATAARRAY_NAME = '__xarray_dataarray_name__'
 DATAARRAY_VARIABLE = '__xarray_dataarray_variable__'
@@ -1024,8 +1028,48 @@ def save_mfdataset(datasets, paths, mode='w', format=None, groups=None,
                              for w, s in zip(writes, stores)])


+def _validate_datatypes_for_zarr_append(dataset):
+    """DataArray.name and Dataset keys must be a string or None"""
+    def check_dtype(var):
+        if (not np.issubdtype(var.dtype, np.number)
+                and not coding.strings.is_unicode_dtype(var.dtype)
+                and not var.dtype == object):
+            # and not re.match('^bytes[1-9]+$', var.dtype.name)):
+            raise ValueError('Invalid dtype for data variable: {} '
+                             'dtype must be a subtype of number, '
+                             'a fixed sized string, a fixed size '
+                             'unicode string or an object'.format(var))
+    for k in dataset.data_vars.values():
+        check_dtype(k)
+
+
+def _validate_append_dim_and_encoding(ds_to_append, store, append_dim,
+                                      encoding, **open_kwargs):
+    try:
+        ds = backends.zarr.open_zarr(store, **open_kwargs)
+    except ValueError:  # store empty
+        return
+    if append_dim:
+        if append_dim not in ds.dims:
+            raise ValueError(
+                "{} not a valid dimension in the Dataset".format(append_dim)
+            )
+    for data_var in ds_to_append:
+        if data_var in ds:
+            if append_dim is None:
+                raise ValueError(
+                    "variable '{}' already exists, but append_dim "
+                    "was not set".format(data_var)
+                )
+            if data_var in encoding.keys():
+                raise ValueError(
+                    "variable '{}' already exists, but encoding was"
+                    "provided".format(data_var)
+                )
+
+
 def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
-            encoding=None, compute=True, consolidated=False):
+            encoding=None, compute=True, consolidated=False, append_dim=None):
     """This function creates an appropriate datastore for writing a dataset to
     a zarr ztore

@@ -1040,11 +1084,18 @@ def to_zarr(dataset, store=None, mode='w-', synchronizer=None, group=None,
     _validate_dataset_names(dataset)
     _validate_attrs(dataset)

+    if mode == "a":
+        _validate_datatypes_for_zarr_append(dataset)
+        _validate_append_dim_and_encoding(dataset, store, append_dim,
+                                          group=group,
+                                          consolidated=consolidated,
+                                          encoding=encoding)
+
     zstore = backends.ZarrStore.open_group(store=store, mode=mode,
                                            synchronizer=synchronizer,
                                            group=group,
                                            consolidate_on_close=consolidated)
-
+    zstore.append_dim = append_dim
     writer = ArrayWriter()
     # TODO: figure out how to properly handle unlimited_dims
     dump_to_store(dataset, zstore, writer, encoding=encoding)
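
The pre-write checks that `to_zarr` gains in append mode can be sketched without xarray or zarr. This is a minimal illustration, not the committed implementation: `check_append` is a hypothetical name, and plain sets and dicts stand in for the opened store and for the dataset being appended.

```python
def check_append(store_dims, store_vars, vars_to_append,
                 append_dim=None, encoding=None):
    """Raise ValueError for the append-mode misuse cases described above.

    Hypothetical stand-in: store_dims and store_vars are sets of names
    describing a non-empty store; vars_to_append is an iterable of names.
    """
    encoding = encoding or {}
    # An append_dim, if given, must already be a dimension of the store.
    if append_dim and append_dim not in store_dims:
        raise ValueError(
            "{} not a valid dimension in the Dataset".format(append_dim))
    for name in vars_to_append:
        if name not in store_vars:
            continue  # brand-new variables are written as usual
        # Existing variables can only grow along an explicit dimension ...
        if append_dim is None:
            raise ValueError(
                "variable '{}' already exists, but append_dim "
                "was not set".format(name))
        # ... and must reuse the encoding already in the store.
        if name in encoding:
            raise ValueError(
                "variable '{}' already exists, but encoding was "
                "provided".format(name))


# Appending 'foo' along 't' passes; each of the other calls raises.
check_append({'x', 'y', 't'}, {'foo'}, ['foo'], append_dim='t')
for bad_kwargs in ({}, {'append_dim': 'z'},
                   {'append_dim': 't', 'encoding': {'foo': {}}}):
    try:
        check_append({'x', 'y', 't'}, {'foo'}, ['foo'], **bad_kwargs)
    except ValueError as err:
        print(err)
```

In the commit itself the store is opened with `open_zarr`, and an empty store (signalled by `ValueError`) skips these checks entirely; that step is left out of the sketch.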

Diff for: xarray/backends/common.py

+10-3
@@ -158,26 +158,33 @@ class ArrayWriter:
     def __init__(self, lock=None):
         self.sources = []
         self.targets = []
+        self.regions = []
         self.lock = lock

-    def add(self, source, target):
+    def add(self, source, target, region=None):
         if isinstance(source, dask_array_type):
             self.sources.append(source)
             self.targets.append(target)
+            self.regions.append(region)
         else:
-            target[...] = source
+            if region:
+                target[region] = source
+            else:
+                target[...] = source

     def sync(self, compute=True):
         if self.sources:
             import dask.array as da
             # TODO: consider wrapping targets with dask.delayed, if this makes
             # for any discernable difference in perforance, e.g.,
             # targets = [dask.delayed(t) for t in self.targets]
+
             delayed_store = da.store(self.sources, self.targets,
                                      lock=self.lock, compute=compute,
-                                     flush=True)
+                                     flush=True, regions=self.regions)
             self.sources = []
             self.targets = []
+            self.regions = []
             return delayed_store

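
The eager (non-dask) branch of `ArrayWriter.add` amounts to choosing between a full overwrite and a region write. A minimal sketch, assuming a hypothetical one-dimensional `ListTarget` in place of a zarr array and a hypothetical `write` function in place of the method. The dask branch, which defers to `da.store(..., regions=...)`, is not modelled here.

```python
class ListTarget:
    """Hypothetical 1-D stand-in for a zarr array, backed by a list."""

    def __init__(self, values):
        self.data = list(values)

    def __setitem__(self, key, values):
        if isinstance(key, tuple):   # region arrives as a tuple of slices
            (key,) = key
        if key is Ellipsis:          # target[...] = source
            key = slice(None)
        self.data[key] = list(values)


def write(source, target, region=None):
    """Write into `region` of `target`, or overwrite it when region is None."""
    if region is not None:
        target[region] = source
    else:
        target[...] = source


target = ListTarget([0, 0, None, None])   # already resized from 2 to 4 slots
write([1, 2], target, region=(slice(2, None),))
print(target.data)   # [0, 0, 1, 2]
```

The sketch tests `region is not None`, whereas the committed code tests truthiness; the two agree for the non-empty tuples of slices built in `set_variables`.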

Diff for: xarray/backends/zarr.py

+116-32
@@ -8,7 +8,8 @@
 from ..core import indexing
 from ..core.pycompat import integer_types
 from ..core.utils import FrozenOrderedDict, HiddenKeyDict
-from .common import AbstractWritableDataStore, BackendArray
+from .common import AbstractWritableDataStore, BackendArray, \
+    _encode_variable_name

 # need some special secret attributes to tell us the dimensions
 _DIMENSION_KEY = '_ARRAY_DIMENSIONS'
@@ -212,7 +213,7 @@ def encode_zarr_variable(var, needs_copy=True, name=None):
     # zarr allows unicode, but not variable-length strings, so it's both
     # simpler and more compact to always encode as UTF-8 explicitly.
     # TODO: allow toggling this explicitly via dtype in encoding.
-    coder = coding.strings.EncodedStringCoder(allows_unicode=False)
+    coder = coding.strings.EncodedStringCoder(allows_unicode=True)
     var = coder.encode(var, name=name)
     var = coding.strings.ensure_fixed_length_bytes(var)

@@ -257,6 +258,7 @@ def __init__(self, zarr_group, consolidate_on_close=False):
         self._synchronizer = self.ds.synchronizer
         self._group = self.ds.path
         self._consolidate_on_close = consolidate_on_close
+        self.append_dim = None

     def open_store_variable(self, name, zarr_array):
         data = indexing.LazilyOuterIndexedArray(ZarrArrayWrapper(name, self))
@@ -313,40 +315,122 @@ def encode_variable(self, variable):
     def encode_attribute(self, a):
         return _encode_zarr_attr_value(a)

-    def prepare_variable(self, name, variable, check_encoding=False,
-                         unlimited_dims=None):
-
-        attrs = variable.attrs.copy()
-        dims = variable.dims
-        dtype = variable.dtype
-        shape = variable.shape
-
-        fill_value = attrs.pop('_FillValue', None)
-        if variable.encoding == {'_FillValue': None} and fill_value is None:
-            variable.encoding = {}
-
-        encoding = _extract_zarr_variable_encoding(
-            variable, raise_on_invalid=check_encoding)
-
-        encoded_attrs = OrderedDict()
-        # the magic for storing the hidden dimension data
-        encoded_attrs[_DIMENSION_KEY] = dims
-        for k, v in attrs.items():
-            encoded_attrs[k] = self.encode_attribute(v)
-
-        zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
-                                    fill_value=fill_value, **encoding)
-        zarr_array.attrs.put(encoded_attrs)
-
-        return zarr_array, variable.data
-
-    def store(self, variables, attributes, *args, **kwargs):
-        AbstractWritableDataStore.store(self, variables, attributes,
-                                        *args, **kwargs)
+    def store(self, variables, attributes, check_encoding_set=frozenset(),
+              writer=None, unlimited_dims=None):
+        """
+        Top level method for putting data on this store, this method:
+          - encodes variables/attributes
+          - sets dimensions
+          - sets variables
+
+        Parameters
+        ----------
+        variables : dict-like
+            Dictionary of key/value (variable name / xr.Variable) pairs
+        attributes : dict-like
+            Dictionary of key/value (attribute name / attribute) pairs
+        check_encoding_set : list-like
+            List of variables that should be checked for invalid encoding
+            values
+        writer : ArrayWriter
+        unlimited_dims : list-like
+            List of dimension names that should be treated as unlimited
+            dimensions.
+            dimension on which the zarray will be appended
+            only needed in append mode
+        """
+
+        existing_variables = set([vn for vn in variables
+                                  if _encode_variable_name(vn) in self.ds])
+        new_variables = set(variables) - existing_variables
+        variables_without_encoding = OrderedDict([(vn, variables[vn])
+                                                  for vn in new_variables])
+        variables_encoded, attributes = self.encode(
+            variables_without_encoding, attributes)
+
+        if len(existing_variables) > 0:
+            # there are variables to append
+            # their encoding must be the same as in the store
+            ds = open_zarr(self.ds.store, chunks=None)
+            variables_with_encoding = OrderedDict()
+            for vn in existing_variables:
+                variables_with_encoding[vn] = variables[vn].copy(deep=False)
+                variables_with_encoding[vn].encoding = ds[vn].encoding
+            variables_with_encoding, _ = self.encode(variables_with_encoding,
+                                                     {})
+            variables_encoded.update(variables_with_encoding)
+
+        self.set_attributes(attributes)
+        self.set_dimensions(variables_encoded, unlimited_dims=unlimited_dims)
+        self.set_variables(variables_encoded, check_encoding_set, writer,
+                           unlimited_dims=unlimited_dims)

     def sync(self):
         pass

+    def set_variables(self, variables, check_encoding_set, writer,
+                      unlimited_dims=None):
+        """
+        This provides a centralized method to set the variables on the data
+        store.
+
+        Parameters
+        ----------
+        variables : dict-like
+            Dictionary of key/value (variable name / xr.Variable) pairs
+        check_encoding_set : list-like
+            List of variables that should be checked for invalid encoding
+            values
+        writer :
+        unlimited_dims : list-like
+            List of dimension names that should be treated as unlimited
+            dimensions.
+        """
+
+        for vn, v in variables.items():
+            name = _encode_variable_name(vn)
+            check = vn in check_encoding_set
+            attrs = v.attrs.copy()
+            dims = v.dims
+            dtype = v.dtype
+            shape = v.shape
+
+            fill_value = attrs.pop('_FillValue', None)
+            if v.encoding == {'_FillValue': None} and fill_value is None:
+                v.encoding = {}
+            if name in self.ds:
+                zarr_array = self.ds[name]
+                if self.append_dim in dims:
+                    # this is the DataArray that has append_dim as a
+                    # dimension
+                    append_axis = dims.index(self.append_dim)
+                    new_shape = list(zarr_array.shape)
+                    new_shape[append_axis] += v.shape[append_axis]
+                    new_region = [slice(None)] * len(new_shape)
+                    new_region[append_axis] = slice(
+                        zarr_array.shape[append_axis],
+                        None
+                    )
+                    zarr_array.resize(new_shape)
+                    writer.add(v.data, zarr_array,
+                               region=tuple(new_region))
+            else:
+                # new variable
+                encoding = _extract_zarr_variable_encoding(
+                    v, raise_on_invalid=check)
+                encoded_attrs = OrderedDict()
+                # the magic for storing the hidden dimension data
+                encoded_attrs[_DIMENSION_KEY] = dims
+                for k2, v2 in attrs.items():
+                    encoded_attrs[k2] = self.encode_attribute(v2)
+
+                if coding.strings.check_vlen_dtype(dtype) == str:
+                    dtype = str
+                zarr_array = self.ds.create(name, shape=shape, dtype=dtype,
+                                            fill_value=fill_value, **encoding)
+                zarr_array.attrs.put(encoded_attrs)
+                writer.add(v.data, zarr_array)
+
     def close(self):
         if self._consolidate_on_close:
             import zarr
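
The resize-then-write arithmetic in `set_variables` can be checked in isolation. The sketch below is an illustration only: `plan_append` is a hypothetical helper, and shapes are passed as plain tuples rather than read from a zarr array.

```python
def plan_append(stored_shape, incoming_shape, dims, append_dim):
    """Return (new_shape, region) for appending along append_dim.

    new_shape is what the stored array is resized to; region selects the
    freshly added slots, i.e. everything past the old length on that axis.
    """
    axis = dims.index(append_dim)
    new_shape = list(stored_shape)
    new_shape[axis] += incoming_shape[axis]
    region = [slice(None)] * len(new_shape)
    region[axis] = slice(stored_shape[axis], None)
    return tuple(new_shape), tuple(region)


# Same shapes as the doc/io.rst example: two (4, 5, 2) blocks joined on 't'.
new_shape, region = plan_append((4, 5, 2), (4, 5, 2), ('x', 'y', 't'), 't')
print(new_shape)   # (4, 5, 4)
print(region[2])   # slice(2, None, None)
```

Because the region's stop is `None`, it only lines up with the incoming data after the array has been resized, which is why the diff calls `zarr_array.resize(new_shape)` before `writer.add`.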
