
Commit ab6b355

rabernat and joshmoore authored
Create fsstore from filesystem (#911)
* refactor FSStore class to allow fs argument
* add tests
* fixes #993
* add fsspec mapper kwargs to FSMap constructor
* avoid passing missing_exceptions if possible
* fix line length
* add tests for array creation with existing fs
* add test for consolidated reading of unlistable store
* flake8
* rename functions and skip coverage for workaround we expect to remove
* update release notes and tutorial
* fix sphinx ref typo
* Fix use of store.update()
* Flake8 corrections

Co-authored-by: Josh Moore <[email protected]>
Co-authored-by: jmoore <[email protected]>
1 parent d1f590d commit ab6b355

File tree: 6 files changed (+230, -38 lines changed)


docs/release.rst

Lines changed: 24 additions & 3 deletions
@@ -9,6 +9,27 @@ Unreleased
 Bug fixes
 ~~~~~~~~~
 
+* Fix bug that made it impossible to create an ``FSStore`` on unlistable filesystems
+  (e.g. some HTTP servers).
+  By :user:`Ryan Abernathey <rabernat>`; :issue:`993`.
+
+Enhancements
+~~~~~~~~~~~~
+
+* **Create FSStore from an existing fsspec filesystem**. If you have created
+  an fsspec filesystem outside of Zarr, you can now pass it as a keyword
+  argument to ``FSStore``.
+  By :user:`Ryan Abernathey <rabernat>`.
+
+
+.. _release_2.11.3:
+
+2.11.3
+------
+
+Bug fixes
+~~~~~~~~~
+
 * Changes the default value of ``write_empty_chunks`` to ``True`` to prevent
   unanticipated data losses when the data types do not have a proper default
   value when empty chunks are read back in.

@@ -322,7 +343,7 @@ Bug fixes
 
 * FSStore: default to normalize_keys=False
   By :user:`Josh Moore <joshmoore>`; :issue:`755`.
-* ABSStore: compatibility with ``azure.storage.python>=12``
+* ABSStore: compatibility with ``azure.storage.python>=12``
   By :user:`Tom Augspurger <tomaugspurger>`; :issue:`618`
 
 
@@ -487,7 +508,7 @@ This release will be the last to support Python 3.5, next version of Zarr will b
 
 * `DirectoryStore` now uses `os.scandir`, which should make listing large store
   faster, :issue:`563`
-
+
 * Remove a few remaining Python 2-isms.
   By :user:`Poruri Sai Rahul <rahulporuri>`; :issue:`393`.
 
@@ -507,7 +528,7 @@ This release will be the last to support Python 3.5, next version of Zarr will b
   ``zarr.errors`` have been replaced by ``ValueError`` subclasses. The corresponding
   ``err_*`` function have been removed. :issue:`590`, :issue:`614`)
 
-* Improve consistency of terminology regarding arrays and datasets in the
+* Improve consistency of terminology regarding arrays and datasets in the
   documentation.
   By :user:`Josh Moore <joshmoore>`; :issue:`571`.
 
docs/tutorial.rst

Lines changed: 24 additions & 12 deletions
@@ -758,7 +758,7 @@ databases. The :class:`zarr.storage.RedisStore` class interfaces `Redis <https:/
 (an in memory data structure store), and the :class:`zarr.storage.MongoDB` class interfaces
 with `MongoDB <https://www.mongodb.com/>`_ (an object oriented NoSQL database). These stores
 respectively require the `redis-py <https://redis-py.readthedocs.io>`_ and
-`pymongo <https://api.mongodb.com/python/current/>`_ packages to be installed.
+`pymongo <https://api.mongodb.com/python/current/>`_ packages to be installed.
 
 For compatibility with the `N5 <https://github.com/saalfeldlab/n5>`_ data format, Zarr also provides
 an N5 backend (this is currently an experimental feature). Similar to the zip storage class, an

@@ -897,6 +897,18 @@ The second invocation here will be much faster. Note that the ``storage_options``
 have become more complex here, to account for the two parts of the supplied
 URL.
 
+It is also possible to initialize the filesystem outside of Zarr and then pass
+it through. This requires creating an :class:`zarr.storage.FSStore` object
+explicitly. For example::
+
+    >>> import s3fs  # doctest: +SKIP
+    >>> fs = s3fs.S3FileSystem(anon=True)  # doctest: +SKIP
+    >>> store = zarr.storage.FSStore('/zarr-demo/store', fs=fs)  # doctest: +SKIP
+    >>> g = zarr.open_group(store)  # doctest: +SKIP
+
+This is useful in cases where you want to also use the same fsspec filesystem object
+separately from Zarr.
+
 .. _fsspec: https://filesystem-spec.readthedocs.io/en/latest/
 
 .. _supported by fsspec: https://filesystem-spec.readthedocs.io/en/latest/api.html#built-in-implementations

@@ -1306,18 +1318,18 @@ filters (e.g., byte-shuffle) have been applied.
 
 Empty chunks
 ~~~~~~~~~~~~
-
+
 As of version 2.11, it is possible to configure how Zarr handles the storage of
 chunks that are "empty" (i.e., every element in the chunk is equal to the array's fill value).
-When creating an array with ``write_empty_chunks=False``,
+When creating an array with ``write_empty_chunks=False``,
 Zarr will check whether a chunk is empty before compression and storage. If a chunk is empty,
-then Zarr does not store it, and instead deletes the chunk from storage
-if the chunk had been previously stored.
+then Zarr does not store it, and instead deletes the chunk from storage
+if the chunk had been previously stored.
 
-This optimization prevents storing redundant objects and can speed up reads, but the cost is
-added computation during array writes, since the contents of
-each chunk must be compared to the fill value, and these advantages are contingent on the content of the array.
-If you know that your data will form chunks that are almost always non-empty, then there is no advantage to the optimization described above.
+This optimization prevents storing redundant objects and can speed up reads, but the cost is
+added computation during array writes, since the contents of
+each chunk must be compared to the fill value, and these advantages are contingent on the content of the array.
+If you know that your data will form chunks that are almost always non-empty, then there is no advantage to the optimization described above.
 In this case, creating an array with ``write_empty_chunks=True`` (the default) will instruct Zarr to write every chunk without checking for emptiness.
 
 The following example illustrates the effect of the ``write_empty_chunks`` flag on

@@ -1329,7 +1341,7 @@ the time required to write an array with different values.::
     >>> from tempfile import TemporaryDirectory
     >>> def timed_write(write_empty_chunks):
     ...     """
-    ...     Measure the time required and number of objects created when writing
+    ...     Measure the time required and number of objects created when writing
     ...     to a Zarr array with random ints or fill value.
     ...     """
     ...     chunks = (8192,)

@@ -1368,8 +1380,8 @@ the time required to write an array with different values.::
     Random Data: 0.1359s, 1024 objects stored
     Empty Data: 0.0301s, 0 objects stored
 
-In this example, writing random data is slightly slower with ``write_empty_chunks=True``,
-but writing empty data is substantially faster and generates far fewer objects in storage.
+In this example, writing random data is slightly slower with ``write_empty_chunks=True``,
+but writing empty data is substantially faster and generates far fewer objects in storage.
 
 .. _tutorial_rechunking:
 
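The tutorial text added above notes that the externally created filesystem remains usable outside Zarr. The following hedged sketch spells that out; it reuses the anonymous S3 example from the diff, so the bucket path is illustrative and s3fs is assumed to be installed:

    # Sketch: the same fsspec filesystem serves both Zarr and direct file operations.
    import s3fs
    import zarr

    fs = s3fs.S3FileSystem(anon=True)
    store = zarr.storage.FSStore('/zarr-demo/store', fs=fs)
    g = zarr.open_group(store)           # Zarr reads through the shared filesystem
    listing = fs.ls('/zarr-demo/store')  # ...and the filesystem is still usable on its own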

zarr/storage.py

Lines changed: 43 additions & 13 deletions
@@ -1262,8 +1262,9 @@ class FSStore(Store):
     Parameters
     ----------
     url : str
-        The destination to map. Should include protocol and path,
-        like "s3://bucket/root"
+        The destination to map. If no fs is provided, should include protocol
+        and path, like "s3://bucket/root". If an fs is provided, can be a path
+        within that filesystem, like "bucket/root"
     normalize_keys : bool
     key_separator : str
         public API for accessing dimension_separator. Never `None`

@@ -1275,7 +1276,19 @@ class FSStore(Store):
         as a missing key
     dimension_separator : {'.', '/'}, optional
         Separator placed between the dimensions of a chunk.
-    storage_options : passed to the fsspec implementation
+    fs : fsspec.spec.AbstractFileSystem, optional
+        An existing filesystem to use for the store.
+    check : bool, optional
+        If True, performs a touch at the root location, to check for write access.
+        Passed to `fsspec.mapping.FSMap` constructor.
+    create : bool, optional
+        If True, performs a mkdir at the root location.
+        Passed to `fsspec.mapping.FSMap` constructor.
+    missing_exceptions : sequence of Exceptions, optional
+        Exception classes to associate with missing files.
+        Passed to `fsspec.mapping.FSMap` constructor.
+    storage_options : passed to the fsspec implementation. Cannot be used
+        together with fs.
     """
     _array_meta_key = array_meta_key
     _group_meta_key = group_meta_key

@@ -1285,18 +1298,37 @@ def __init__(self, url, normalize_keys=False, key_separator=None,
                  mode='w',
                  exceptions=(KeyError, PermissionError, IOError),
                  dimension_separator=None,
+                 fs=None,
+                 check=False,
+                 create=False,
+                 missing_exceptions=None,
                  **storage_options):
         import fsspec
-        self.normalize_keys = normalize_keys
 
-        protocol, _ = fsspec.core.split_protocol(url)
-        # set auto_mkdir to True for local file system
-        if protocol in (None, "file") and not storage_options.get("auto_mkdir"):
-            storage_options["auto_mkdir"] = True
+        mapper_options = {"check": check, "create": create}
+        # https://github.com/zarr-developers/zarr-python/pull/911#discussion_r841926292
+        # Some fsspec implementations don't accept missing_exceptions.
+        # This is a workaround to avoid passing it in the most common scenarios.
+        # Remove this and add missing_exceptions to mapper_options when fsspec is released.
+        if missing_exceptions is not None:
+            mapper_options["missing_exceptions"] = missing_exceptions  # pragma: no cover
+
+        if fs is None:
+            protocol, _ = fsspec.core.split_protocol(url)
+            # set auto_mkdir to True for local file system
+            if protocol in (None, "file") and not storage_options.get("auto_mkdir"):
+                storage_options["auto_mkdir"] = True
+            self.map = fsspec.get_mapper(url, **{**mapper_options, **storage_options})
+            self.fs = self.map.fs  # for direct operations
+            self.path = self.fs._strip_protocol(url)
+        else:
+            if storage_options:
+                raise ValueError("Cannot specify both fs and storage_options")
+            self.fs = fs
+            self.path = self.fs._strip_protocol(url)
+            self.map = self.fs.get_mapper(self.path, **mapper_options)
 
-        self.map = fsspec.get_mapper(url, **storage_options)
-        self.fs = self.map.fs  # for direct operations
-        self.path = self.fs._strip_protocol(url)
+        self.normalize_keys = normalize_keys
         self.mode = mode
         self.exceptions = exceptions
         # For backwards compatibility. Guaranteed to be non-None

@@ -1308,8 +1340,6 @@ def __init__(self, url, normalize_keys=False, key_separator=None,
 
         # Pass attributes to array creation
         self._dimension_separator = dimension_separator
-        if self.fs.exists(self.path) and not self.fs.isdir(self.path):
-            raise FSPathExistNotDir(url)
 
     def _default_key_separator(self):
         if self.key_separator is None:
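To summarise the constructor change above: when no ``fs`` is given, the URL must carry the protocol and ``fsspec.get_mapper`` builds the filesystem; an explicit ``fs`` turns the URL into a plain path within that filesystem and makes extra ``storage_options`` an error. A hedged sketch of both paths (the local paths are illustrative, not from the commit):

    # Sketch: the two FSStore construction paths after this commit.
    import fsspec
    import zarr

    # 1) No fs: protocol comes from the URL and fsspec.get_mapper() creates the filesystem.
    store_a = zarr.storage.FSStore("file:///tmp/zarr-demo-a")

    # 2) Explicit fs: the URL is interpreted as a path within that filesystem.
    fs = fsspec.filesystem("file", auto_mkdir=True)
    store_b = zarr.storage.FSStore("/tmp/zarr-demo-b", fs=fs)

    # Mixing fs with storage_options is rejected by the new code.
    try:
        zarr.storage.FSStore("/tmp/zarr-demo-c", fs=fs, auto_mkdir=True)
    except ValueError:
        pass  # "Cannot specify both fs and storage_options"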

zarr/tests/test_convenience.py

Lines changed: 40 additions & 4 deletions
@@ -26,6 +26,7 @@
 from zarr.hierarchy import Group, group
 from zarr.storage import (
     ConsolidatedMetadataStore,
+    FSStore,
     KVStore,
     MemoryStore,
     atexit_rmtree,

@@ -205,9 +206,18 @@ def test_tree(zarr_version):
 
 
 @pytest.mark.parametrize('zarr_version', [2, 3])
-@pytest.mark.parametrize('with_chunk_store', [False, True], ids=['default', 'with_chunk_store'])
 @pytest.mark.parametrize('stores_from_path', [False, True])
-def test_consolidate_metadata(with_chunk_store, zarr_version, stores_from_path):
+@pytest.mark.parametrize(
+    'with_chunk_store,listable',
+    [(False, True), (True, True), (False, False)],
+    ids=['default-listable', 'with_chunk_store-listable', 'default-unlistable']
+)
+def test_consolidate_metadata(with_chunk_store,
+                              zarr_version,
+                              listable,
+                              monkeypatch,
+                              stores_from_path):
+
     # setup initial data
     if stores_from_path:
         store = tempfile.mkdtemp()

@@ -228,6 +238,10 @@ def test_consolidate_metadata(with_chunk_store, zarr_version, stores_from_path):
         version_kwarg = {}
     path = 'dataset' if zarr_version == 3 else None
     z = group(store, chunk_store=chunk_store, path=path, **version_kwarg)
+
+    # Keep a reference to the actual store implementation, in case a path string was passed
+    store_to_copy = z.store
+
     z.create_group('g1')
     g2 = z.create_group('g2')
     g2.attrs['hello'] = 'world'

@@ -278,14 +292,36 @@ def test_consolidate_metadata(with_chunk_store, zarr_version, stores_from_path):
     for key in meta_keys:
         del store[key]
 
+    # https://github.com/zarr-developers/zarr-python/issues/993
+    # Make sure we can still open consolidated on an unlistable store:
+    if not listable:
+        fs_memory = pytest.importorskip("fsspec.implementations.memory")
+        monkeypatch.setattr(fs_memory.MemoryFileSystem, "isdir", lambda x, y: False)
+        monkeypatch.delattr(fs_memory.MemoryFileSystem, "ls")
+        fs = fs_memory.MemoryFileSystem()
+        if zarr_version == 2:
+            store_to_open = FSStore("", fs=fs)
+        else:
+            store_to_open = FSStoreV3("", fs=fs)
+
+        # copy original store to new unlistable store
+        store_to_open.update(store_to_copy)
+
+    else:
+        store_to_open = store
+
     # open consolidated
-    z2 = open_consolidated(store, chunk_store=chunk_store, path=path, **version_kwarg)
+    z2 = open_consolidated(store_to_open, chunk_store=chunk_store, path=path, **version_kwarg)
     assert ['g1', 'g2'] == list(z2)
     assert 'world' == z2.g2.attrs['hello']
     assert 1 == z2.g2.arr.attrs['data']
     assert (z2.g2.arr[:] == 1.0).all()
     assert 16 == z2.g2.arr.nchunks
-    assert 16 == z2.g2.arr.nchunks_initialized
+    if listable:
+        assert 16 == z2.g2.arr.nchunks_initialized
+    else:
+        with pytest.raises(NotImplementedError):
+            _ = z2.g2.arr.nchunks_initialized
 
     if stores_from_path:
         # path string is not a BaseStore subclass so cannot be used to
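The unlistable-store case that this test simulates by monkeypatching the in-memory filesystem corresponds roughly to reading consolidated metadata over plain HTTP, where a server usually cannot enumerate keys. A hedged sketch of that user-facing pattern (the URL is hypothetical and the fsspec HTTP filesystem requires aiohttp):

    # Sketch: open a consolidated group on a filesystem that cannot list directories.
    import fsspec
    import zarr

    fs = fsspec.filesystem("http")                                  # listing is generally unavailable
    store = zarr.storage.FSStore("https://example.com/data.zarr", fs=fs)
    root = zarr.open_consolidated(store)                            # reads .zmetadata instead of listing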

zarr/tests/test_core.py

Lines changed: 70 additions & 2 deletions
@@ -2495,14 +2495,49 @@ def test_store_has_bytes_values(self):
         pass
 
 
+fsspec_mapper_kwargs = {
+    "check": True,
+    "create": True,
+    "missing_exceptions": None
+}
+
+
 @pytest.mark.skipif(have_fsspec is False, reason="needs fsspec")
 class TestArrayWithFSStore(TestArray):
     @staticmethod
     def create_array(read_only=False, **kwargs):
         path = mkdtemp()
         atexit.register(shutil.rmtree, path)
         key_separator = kwargs.pop('key_separator', ".")
-        store = FSStore(path, key_separator=key_separator, auto_mkdir=True)
+        store = FSStore(path, key_separator=key_separator, auto_mkdir=True, **fsspec_mapper_kwargs)
+        cache_metadata = kwargs.pop('cache_metadata', True)
+        cache_attrs = kwargs.pop('cache_attrs', True)
+        write_empty_chunks = kwargs.pop('write_empty_chunks', True)
+        kwargs.setdefault('compressor', Blosc())
+        init_array(store, **kwargs)
+        return Array(store, read_only=read_only, cache_metadata=cache_metadata,
+                     cache_attrs=cache_attrs, write_empty_chunks=write_empty_chunks)
+
+    def expected(self):
+        return [
+            "ab753fc81df0878589535ca9bad2816ba88d91bc",
+            "c16261446f9436b1e9f962e57ce3e8f6074abe8a",
+            "c2ef3b2fb2bc9dcace99cd6dad1a7b66cc1ea058",
+            "6e52f95ac15b164a8e96843a230fcee0e610729b",
+            "091fa99bc60706095c9ce30b56ce2503e0223f56",
+        ]
+
+
+@pytest.mark.skipif(have_fsspec is False, reason="needs fsspec")
+class TestArrayWithFSStoreFromFilesystem(TestArray):
+    @staticmethod
+    def create_array(read_only=False, **kwargs):
+        from fsspec.implementations.local import LocalFileSystem
+        fs = LocalFileSystem(auto_mkdir=True)
+        path = mkdtemp()
+        atexit.register(shutil.rmtree, path)
+        key_separator = kwargs.pop('key_separator', ".")
+        store = FSStore(path, fs=fs, key_separator=key_separator, **fsspec_mapper_kwargs)
         cache_metadata = kwargs.pop('cache_metadata', True)
         cache_attrs = kwargs.pop('cache_attrs', True)
         write_empty_chunks = kwargs.pop('write_empty_chunks', True)

@@ -3148,7 +3183,40 @@ def create_array(array_path='arr1', read_only=False, **kwargs):
         path = mkdtemp()
         atexit.register(shutil.rmtree, path)
         key_separator = kwargs.pop('key_separator', ".")
-        store = FSStoreV3(path, key_separator=key_separator, auto_mkdir=True)
+        store = FSStoreV3(
+            path,
+            key_separator=key_separator,
+            auto_mkdir=True,
+            **fsspec_mapper_kwargs
+        )
+        cache_metadata = kwargs.pop('cache_metadata', True)
+        cache_attrs = kwargs.pop('cache_attrs', True)
+        write_empty_chunks = kwargs.pop('write_empty_chunks', True)
+        kwargs.setdefault('compressor', Blosc())
+        init_array(store, path=array_path, **kwargs)
+        return Array(store, path=array_path, read_only=read_only, cache_metadata=cache_metadata,
+                     cache_attrs=cache_attrs, write_empty_chunks=write_empty_chunks)
+
+    def expected(self):
+        return [
+            "1509abec4285494b61cd3e8d21f44adc3cf8ddf6",
+            "7cfb82ec88f7ecb7ab20ae3cb169736bc76332b8",
+            "b663857bb89a8ab648390454954a9cdd453aa24b",
+            "21e90fa927d09cbaf0e3b773130e2dc05d18ff9b",
+            "e8c1fdd18b5c2ee050b59d0c8c95d07db642459c",
+        ]
+
+
+@pytest.mark.skipif(have_fsspec is False, reason="needs fsspec")
+class TestArrayWithFSStoreV3FromFilesystem(TestArrayWithPathV3, TestArrayWithFSStore):
+    @staticmethod
+    def create_array(array_path='arr1', read_only=False, **kwargs):
+        from fsspec.implementations.local import LocalFileSystem
+        fs = LocalFileSystem(auto_mkdir=True)
+        path = mkdtemp()
+        atexit.register(shutil.rmtree, path)
+        key_separator = kwargs.pop('key_separator', ".")
+        store = FSStoreV3(path, fs=fs, key_separator=key_separator, **fsspec_mapper_kwargs)
         cache_metadata = kwargs.pop('cache_metadata', True)
         cache_attrs = kwargs.pop('cache_attrs', True)
         write_empty_chunks = kwargs.pop('write_empty_chunks', True)
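The ``fsspec_mapper_kwargs`` dictionary introduced above exercises the arguments that ``FSStore`` now forwards to the fsspec mapper. As a hedged, standalone sketch of what those flags do (the local path is illustrative and not part of the test suite):

    # Sketch: check/create are passed through FSStore to fs.get_mapper()/FSMap.
    from fsspec.implementations.local import LocalFileSystem
    import zarr

    fs = LocalFileSystem(auto_mkdir=True)
    store = zarr.storage.FSStore(
        "/tmp/zarr-demo-mapper",   # illustrative path
        fs=fs,
        create=True,               # mkdir at the root location
        check=True,                # touch a file at the root to verify write access
    )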
