When using s3fs with zarr, xarray, and dask, I found that if the data subset being fetched falls entirely within a single chunk, the download is single-threaded (as far as I can tell) and therefore quite slow.
For larger subsets spanning multiple chunks this isn't an issue, as the transfer quickly becomes network-bandwidth limited. The exception is the tail end of a download, where you're stuck waiting for the last few threads to finish while there is bandwidth to spare.
The s3fs cat_file function does take range input parameters, but as far as I could tell they weren't being used.
For my use case, I overrode the s3fs function so that it always checks the size of the object being downloaded and splits the download into range requests when it exceeds a certain size. Even with this extra logic slowing down all the other requests (metadata etc.), it cut the ~5 minute download down to about 20s.
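Roughly, the override looks like this (a simplified sketch rather than my exact code; the 8 MB threshold/part size are placeholders and error handling is omitted):

```python
import asyncio
import s3fs

# Placeholder tuning values: only split objects larger than THRESHOLD,
# fetching them in PART_SIZE ranged requests.
THRESHOLD = 8 * 2**20
PART_SIZE = 8 * 2**20


class RangedS3FileSystem(s3fs.S3FileSystem):
    """S3FileSystem that splits large whole-object reads into parallel ranged GETs."""

    async def _cat_file(self, path, version_id=None, start=None, end=None):
        # Explicit ranges (e.g. partial/metadata reads) go through the normal path.
        if start is not None or end is not None:
            return await super()._cat_file(path, version_id=version_id, start=start, end=end)

        size = (await self._info(path, version_id=version_id))["size"]
        if size <= THRESHOLD:
            return await super()._cat_file(path, version_id=version_id)

        # Issue the ranged requests concurrently and stitch the pieces back together.
        coros = [
            super(RangedS3FileSystem, self)._cat_file(
                path, version_id=version_id, start=offset, end=min(offset + PART_SIZE, size)
            )
            for offset in range(0, size, PART_SIZE)
        ]
        parts = await asyncio.gather(*coros)
        return b"".join(parts)
```

Zarr/xarray pick this up as long as the store mapper is built from an instance of this subclass (e.g. via `fs.get_mapper(...)`) rather than from a plain `s3://` URL.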
This may just be a workaround for the real solution of rechunking the data into smaller pieces, but I'm not really in a position to do that. At the very least I hope it saves someone else some time if they run into the same issue.
Happy to expand on this with more info, tests, and examples if that would be useful.
@JoshCu that's very interesting. In the main branch (and in the upcoming zarr-python 3.0 release) we made the entire IO layer async, with the intention of facilitating exactly these kinds of performance optimizations. I wonder what your improvements would look like with the zarr-python 3.0 version of FSStore, which we are currently calling FSSpecStore. Would you be interested in opening a PR?
Gladly! It might take me a few days to update everything to the new versions and get test cases working for zarr standalone as well as the dask+xarray+zarr stack, but I'll definitely look into it.
For reference, I initialized the mfdataset like this (link) and saved it like this.
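Something along these lines — a simplified sketch with placeholder bucket names, variable, and subset rather than the real dataset:

```python
import xarray as xr

# Uses the RangedS3FileSystem subclass from above; everything below is illustrative.
fs = RangedS3FileSystem(anon=True)
urls = [
    "s3://example-bucket/run_2024010100.zarr",  # placeholder stores
    "s3://example-bucket/run_2024010106.zarr",
]
ds = xr.open_mfdataset(
    [fs.get_mapper(u) for u in urls],
    engine="zarr",
    combine="by_coords",
    parallel=True,
)

# A subset that lands inside a single zarr chunk is where the
# single-threaded download showed up before the override.
subset = ds["precip"].sel(time="2024-01-01")
subset.to_netcdf("subset.nc")
```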