Description
When using s3fs with zarr, xarray, and dask, I found that if the subset of data being fetched fell entirely within a single chunk, the download was single-threaded (as far as I can tell) and, as a result, quite slow.
For larger subsets spanning multiple chunks this isn't an issue, since the transfer quickly becomes limited by network bandwidth. The exception is the tail of a download: as threads finish, you're left waiting for the last few to complete while network bandwidth sits idle.
The s3fs `cat_file` function does accept range parameters, but as far as I could tell they weren't being used here.
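For reference, `cat_file`'s `start`/`end` parameters come from fsspec's `AbstractFileSystem`, so the same call shape works on `S3FileSystem`. A quick demonstration against fsspec's in-memory filesystem (the path and contents are made up):

```python
import fsspec

# MemoryFileSystem stands in for S3 here; cat_file's start/end are part of
# fsspec's AbstractFileSystem API, so the same call works on S3FileSystem.
fs = fsspec.filesystem("memory")
with fs.open("/demo.bin", "wb") as f:
    f.write(b"0123456789" * 10)  # 100 bytes

# Ranged read: fetch only bytes [10, 20)
chunk = fs.cat_file("/demo.bin", start=10, end=20)
print(chunk)  # b'0123456789'
```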
For my use case, I overrode the s3fs function so that it always checks the size of the file being downloaded and, when the file is over a certain size, splits the download into range requests. Even with this additional logic slowing down all the other requests (metadata lookups, etc.), it cut the ~5 minute download down to ~20 seconds.
I initialized the mfdataset like this:
link
```python
import s3fs
import xarray as xr
from dask.distributed import Client, LocalCluster

# S3ParallelFileSystem is my s3fs.S3FileSystem subclass with the
# chunked range-request download override described above

def load_zarr_datasets(forcing_vars: list[str] = None) -> xr.Dataset:
    if not forcing_vars:
        forcing_vars = ["lwdown", "precip", "psfc", "q2d", "swdown", "t2d", "u2d", "v2d"]
    # if a LocalCluster is not already running, start one
    try:
        client = Client.current()
    except ValueError:
        cluster = LocalCluster()
        client = Client(cluster)
    s3_urls = [
        f"s3://noaa-nwm-retrospective-3-0-pds/CONUS/zarr/forcing/{var}.zarr"
        for var in forcing_vars
    ]
    # the default cache is readahead, which is detrimental to performance in this case
    fs = S3ParallelFileSystem(anon=True, default_cache_type="none")  # default_block_size
    s3_stores = [s3fs.S3Map(url, s3=fs) for url in s3_urls]
    dataset = xr.open_mfdataset(s3_stores, parallel=True, engine="zarr", cache=True)
    return dataset
```
and saved it like this:
```python
from dask.distributed import Client, progress

client = Client.current()
future = client.compute(dataset.to_netcdf(temp_path, compute=False))
# display a progress bar while the write runs
progress(future)
future.result()
```
This may just be a workaround, and the real solution is probably to rechunk the data into smaller pieces, but that isn't an option for me. At the very least, I hope this saves someone else some time if they run into the same issue.
Happy to expand on this with more info, tests, and examples if that would be useful.