Feat: improves delete_dir for s3fs-backed FsspecStore #2661

Open: 16 commits to be merged into base branch main

@carshadi carshadi commented Jan 6, 2025

Improves the performance of FsspecStore.delete_dir when the underlying filesystem is s3fs by passing the list of file paths to be removed in one bulk call instead of deleting them one by one.

Resolves #2659

TODO:

  • Add unit tests and/or doctests in docstrings
  • Add docstrings and API docs for any new/modified user-facing classes and functions
  • New/modified features documented in docs/tutorial.rst
  • Changes documented in docs/release.rst
  • GitHub Actions have all passed
  • Test coverage is 100% (Codecov passes)

carshadi and others added 2 commits January 6, 2025 14:38
- Override the default Store.delete_dir method, which deletes keys one by one, to enable bulk deletion for fsspec implementations whose fs._rm method accepts a list of paths.
- This can greatly reduce the number of requests to S3, which reduces the likelihood of running into throttling errors and improves delete performance.
- Currently, only s3fs is supported.
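To make the request-count argument in the commit message concrete, here is a minimal sketch that uses only the standard library. `CountingFS`, `delete_one_by_one`, `delete_in_bulk`, and `count_requests` are hypothetical names invented for illustration; they are not part of zarr, fsspec, or s3fs. The sketch assumes a backend that charges one request per `_rm_file` call and one request per batch of up to 1000 keys, which is the per-call limit of S3's DeleteObjects operation.

```python
import asyncio


class CountingFS:
    """Hypothetical stand-in for a remote filesystem; counts simulated requests."""

    BATCH_LIMIT = 1000  # assumption: mirrors the S3 DeleteObjects per-call key limit

    def __init__(self, keys):
        self.keys = set(keys)
        self.requests = 0

    async def _rm_file(self, path):
        # One simulated request per object.
        self.requests += 1
        self.keys.discard(path)

    async def _rm(self, paths):
        # One simulated request per batch of up to BATCH_LIMIT objects.
        paths = list(paths)
        for start in range(0, len(paths), self.BATCH_LIMIT):
            batch = paths[start:start + self.BATCH_LIMIT]
            self.requests += 1
            self.keys.difference_update(batch)


async def delete_one_by_one(fs, prefix):
    # Default-style deletion: issue a separate call for every key under the prefix.
    for key in sorted(k for k in fs.keys if k.startswith(prefix)):
        await fs._rm_file(key)


async def delete_in_bulk(fs, prefix):
    # Bulk deletion: hand the whole list of keys to the filesystem at once.
    await fs._rm([k for k in fs.keys if k.startswith(prefix)])


def count_requests(delete, n_chunks=2500):
    """Delete everything under "array/" and report (requests, remaining keys)."""
    keys = [f"array/c/{i}" for i in range(n_chunks)] + ["other/zarr.json"]
    fs = CountingFS(keys)
    asyncio.run(delete(fs, "array/"))
    return fs.requests, sorted(fs.keys)


if __name__ == "__main__":
    print(count_requests(delete_one_by_one))
    print(count_requests(delete_in_bulk))
```

With 2500 keys under the prefix, the one-by-one strategy issues 2500 simulated requests and the bulk strategy issues 3. Real request counts depend on how the concrete filesystem batches and retries.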
@jhamman jhamman requested a review from martindurant January 6, 2025 21:55
Member

@martindurant martindurant left a comment

I have reservations about this code. It seems to me that calling ._rm() should be all that's required, letting fsspec handle everything else.
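One possible shape of that suggestion, sketched with the standard library only: the store builds the target path and makes a single recursive `_rm` call, and the filesystem decides how to expand and batch the deletion. `ToyAsyncFS` and `ToyStore` are hypothetical classes written for this illustration. The `_rm(path, recursive=...)` call mimics the general form of fsspec's async remove method, but nothing here is taken from the actual zarr or s3fs code, and suppressing `FileNotFoundError` is an assumption about the desired behavior.

```python
import asyncio
from contextlib import suppress


class ToyAsyncFS:
    """Hypothetical in-memory filesystem with an fsspec-like ``_rm``."""

    def __init__(self, keys):
        self.keys = set(keys)

    async def _rm(self, path, recursive=False):
        # The filesystem, not the caller, expands the path into concrete keys.
        prefix = path.rstrip("/") + "/"
        targets = {
            k for k in self.keys
            if k == path or (recursive and k.startswith(prefix))
        }
        if not targets:
            raise FileNotFoundError(path)
        self.keys -= targets


class ToyStore:
    """Hypothetical store that keeps all of its keys under ``root``."""

    def __init__(self, fs, root):
        self.fs = fs
        self.root = root.rstrip("/")

    async def delete_dir(self, prefix):
        target = f"{self.root}/{prefix}".rstrip("/")
        # Assumption: deleting a directory that is already gone is a no-op.
        with suppress(FileNotFoundError):
            await self.fs._rm(target, recursive=True)


async def demo():
    fs = ToyAsyncFS(
        {"bucket/root/a/zarr.json", "bucket/root/a/c/0", "bucket/root/b/zarr.json"}
    )
    store = ToyStore(fs, "bucket/root")
    await store.delete_dir("a")
    await store.delete_dir("missing")  # FileNotFoundError is suppressed
    return sorted(fs.keys)


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

The appeal of this design is that the store needs no backend-specific branches: a filesystem that supports bulk deletion can use it inside `_rm`, and one that does not can fall back to per-file removal.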

@carshadi carshadi requested review from martindurant and d-v-b January 8, 2025 18:00
@dstansby dstansby added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 9, 2025
@github-actions github-actions bot removed the needs release notes Automatically applied to PRs which haven't added release notes label Jan 10, 2025
Member

@jhamman jhamman left a comment

Thanks @carshadi for the continued efforts here. I'd like to see some fsspec-specific tests if possible.

@github-actions github-actions bot added the needs release notes Automatically applied to PRs which haven't added release notes label Jan 24, 2025
@carshadi
Author

> Thanks @carshadi for the continued efforts here. I'd like to see some fsspec-specific tests if possible.

@jhamman, I added a few test cases. Let me know if there are others that would be useful.

Member

@jhamman jhamman left a comment

Looks good here!

@carshadi - the only thing this needs is a release note.

@martindurant - care to do the final review and/or merge?

"""The local fs is not async so we should expect it to be wrapped automatically"""
from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper

store = FsspecStore.from_url(tmpdir / "test/path", storage_options={"auto_mkdir": True})
Member

> tmpdir / "test/path"

This is not a URL. I don't know what from_url demands, but f"{tmpdir}/test/path" seems better to me.

Author

I used that method since it wraps the filesystem automatically, and figured that would be the entry point most people try first. FWIW, it seems to work with a local path. I could construct the filesystem manually if preferable, e.g.,

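from fsspec.implementations.asyn_wrapper import AsyncFileSystemWrapper
from fsspec.implementations.local import LocalFileSystem
from zarr.storage import FsspecStore

sync_fs = LocalFileSystem(auto_mkdir=True)
async_fs = AsyncFileSystemWrapper(sync_fs)
store = FsspecStore(async_fs, read_only=False, path=f"{tmpdir}/test/path")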

Member

> it seems to work with a local path

On Windows?? :)

Author

I'm running Windows 11, so apparently :)
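For reference on the path-versus-URL question in this thread, the following standard-library sketch shows what each spelling evaluates to. It does not call `FsspecStore.from_url`, so which forms that method accepts remains unverified here. The Windows directory is a made-up example, handled lexically with `PureWindowsPath` so the snippet behaves the same on any OS.

```python
import tempfile
from pathlib import Path, PureWindowsPath

# On the current platform: a pathlib object, a plain string, and an explicit URL.
tmpdir = Path(tempfile.gettempdir())
joined = tmpdir / "test/path"        # a Path object, not a str
as_string = f"{tmpdir}/test/path"    # a str using the platform's native separators
as_url = joined.resolve().as_uri()   # an explicit file:// URL

# A hypothetical Windows temp directory, treated purely as text.
win = PureWindowsPath(r"C:\Users\me\AppData\Local\Temp")
win_string = f"{win}/test/path"              # mixes "\" and "/" separators
win_posix = (win / "test/path").as_posix()   # forward slashes only

if __name__ == "__main__":
    for value in (joined, as_string, as_url, win_string, win_posix):
        print(type(value).__name__, value)
```

The f-string form on Windows keeps the drive letter and backslashes, so anything that parses the result as a URL has to recognize a drive prefix such as `C:` rather than treat it as a scheme.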

Labels
needs release notes Automatically applied to PRs which haven't added release notes
Development

Successfully merging this pull request may close these issues.

FsspecStore directory deletion performance improvements
5 participants