Concurrent data downloading and appending to zarr store #2532
-
I'm trying to speed up the download and processing of many netcdf files by running concurrent processes.
So I start with

```python
dates = pd.date_range(DATE_START, DATE_END, freq="1D")

# Write the first file to make sure we start with something
download_file(dates[0])
# then proceed with the others
process_map(download_file, dates, max_workers=MAX_WORKERS, chunksize=CHUNKSIZE)
```

The `download_file` function essentially does the following for each date's url:

```python
with fsspec.open(url, mode="rb") as f_remote:
    with gzip.GzipFile(fileobj=f_remote) as f_decompressed:
        with io.BytesIO(f_decompressed.read()) as f_buffer:
            ds = xr.open_dataset(f_buffer).squeeze()
            subset = ds.sel(x=slice(x_min, x_max), y=slice(y_min, y_max))
            subset = subset.expand_dims("time").drop_vars('projection')
            if not os.path.exists(OUTPUT_ZARR):
                subset.to_zarr(OUTPUT_ZARR, mode="w", consolidated=True)
            else:
                # Append to the existing Zarr dataset
                subset.to_zarr(OUTPUT_ZARR, mode="a", append_dim="time", consolidated=True)
```

This works when I run the script, but when I try to open the resulting dataset afterwards it is not what I expect, even though if I look into the folder I can indeed see that the variable's chunks have been written. Surely there must be a smarter way of downloading and appending to a zarr in parallel.
-
Thanks for sharing! It seems like there is a logical problem with what you're trying to do here. Appending over the time dimension has to be done in the correct order: you want the first file to be first, then append the second file, then the third, etc., so that the Zarr dataset has the correct temporal order. Because it is sequential, this operation can't be parallelized! Your concurrent workers end up appending in whatever order they happen to finish.

However, there is another approach you can use here that is parallelizable. Since you know the shape and coordinates of the final dataset in advance, you can pre-allocate the entire domain and then write each time slice into its own region in parallel, instead of appending.

Finally, when doing complex distributed writes to Zarr, it can be really useful to have a store that supports transactional updates and rollbacks in case something goes wrong. Have a look at our new Icechunk project if you're curious about that.
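A minimal sketch of this pre-allocate-then-fill pattern (not from the original thread), assuming a hypothetical `load_one_day(date)` helper that returns one day of data on a fixed `y`/`x` grid; all names, sizes, and paths here are illustrative:

```python
import dask.array as da
import numpy as np
import pandas as pd
import xarray as xr

dates = pd.date_range("2020-01-01", "2020-01-10")  # illustrative time axis
ny, nx = 100, 200                                  # illustrative grid size

# Pre-allocate the full domain: lazy zeros chunked one time step per chunk,
# written with compute=False so only metadata and coordinates are stored.
template = xr.Dataset(
    {"var": (("time", "y", "x"), da.zeros((len(dates), ny, nx), chunks=(1, ny, nx)))},
    coords={"time": dates, "y": np.arange(ny), "x": np.arange(nx)},
)
template.to_zarr("out.zarr", mode="w", compute=False)

def write_one(date):
    # Each worker fills in its own time slice; every write touches a different
    # chunk along "time", so the writes are independent of each other and of order.
    ds = load_one_day(date).expand_dims("time").assign_coords(time=[date])  # hypothetical loader
    ds.to_zarr("out.zarr", region="auto")  # recent xarray infers the region from the coordinates
```

With older xarray versions you would pass an explicit `region={"time": slice(i, i + 1)}` instead, and may need to drop coordinate variables that do not span the region dimension.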
-
Cannot believe it was THAT easy.
Thanks to the suggestions of @rabernat I was able to create the following example, which does exactly what it's supposed to do by only writing the region specific to a certain time (which is automatically detected by xarray).
The result is that I'm able to download and process about 20 years of daily data in about 7 minutes, and this could be improved by using even more workers (but then the network would probably become the bottleneck).
I'm leaving this here in case someone has the same problem.

```python
import xarray as xr
import fsspec
import gzip
import io
import pyproj
import pandas as pd
from tqdm.contrib.concurrent import process_map
import dask.array as da
import logging

# Define lat/lon box boundaries
lat_min, lat_max = 65, 25  # Replace with your values
lon_min, lon_max = -40, 40  # Replace with your values

# Create a transformer for polar stereographic projection
transformer = pyproj.Transformer.from_crs(
    "EPSG:4326",
    "+proj=stere +lat_0=90 +lat_ts=60 +lon_0=-80 +k=1 +x_0=0 +y_0=0 +a=6378137 +b=6356257 +units=m +no_defs",
    always_xy=True,
)

# Transform the lat/lon boundaries to x/y boundaries
x_min, y_min = transformer.transform(lon_min, lat_min)
x_max, y_max = transformer.transform(lon_max, lat_max)

RESOLUTION = "4km"
OUTPUT_ZARR = f"/Users/guidocioni/Downloads/data_{RESOLUTION}.zarr"
DATE_START = "2005-01-01"
DATE_END = "2024-12-03"
MAX_WORKERS = 10
CHUNKSIZE = 10


# Helper function to generate the URL for a given date
def get_url_for_date(date):
    if RESOLUTION == "1km":
        version = "v1.3"
    else:
        if date < pd.to_datetime("2014-12-03"):
            version = "v1.2"
        else:
            version = "v1.3"
    return f"https://noaadata.apps.nsidc.org/NOAA/G02156/netcdf/{RESOLUTION}/{date.year}/ims{date.year}{date.day_of_year:03d}_{RESOLUTION}_{version}.nc.gz"


# Parallel version
def main():
    # Build the full range of daily dates to process
    dates = pd.date_range(DATE_START, DATE_END, freq="1D")
    # Initialize the zarr store from the first file
    url = get_url_for_date(dates[0])
    with fsspec.open(url, mode="rb") as f_remote:
        with gzip.GzipFile(fileobj=f_remote) as f_decompressed:
            with io.BytesIO(f_decompressed.read()) as f_buffer:
                ds = xr.open_dataset(f_buffer).squeeze()
                subset = ds.sel(x=slice(x_min, x_max), y=slice(y_min, y_max)).drop_vars('projection').expand_dims("time")
                # Initialize the Zarr store with the correct dimensions
                shape = (len(dates), subset.sizes['y'], subset.sizes['x'])
                chunks = (1, subset.sizes['y'], subset.sizes['x'])  # Write 1 time step at a time
                template = xr.Dataset(
                    data_vars={
                        "IMS_Surface_Values": (
                            ('time', 'y', 'x'),
                            da.zeros(shape, chunks=chunks, dtype=subset.IMS_Surface_Values.dtype)
                        )
                    },
                    coords={
                        'time': dates,
                        'y': subset['y'],
                        'x': subset['x']
                    }
                )
                # Write only the metadata and coordinates (the lazy zeros are not computed)
                template.to_zarr(OUTPUT_ZARR, mode="w", compute=False)
                # Now write the first time step that we already downloaded
                subset.to_zarr(OUTPUT_ZARR, region="auto")
    # Download and write the remaining time steps in parallel
    process_map(download_time_step, dates[1:], max_workers=MAX_WORKERS, chunksize=CHUNKSIZE)


def download_time_step(date):
    url = get_url_for_date(date)
    try:
        with fsspec.open(url, mode="rb") as f_remote:
            with gzip.GzipFile(fileobj=f_remote) as f_decompressed:
                with io.BytesIO(f_decompressed.read()) as f_buffer:
                    ds = xr.open_dataset(f_buffer).squeeze()
                    subset = ds.sel(x=slice(x_min, x_max), y=slice(y_min, y_max)).drop_vars('projection').expand_dims("time")
                    # Write this time step into its pre-allocated region
                    subset.to_zarr(OUTPUT_ZARR, region="auto")
    except Exception as e:
        logging.error(
            f"Could not process {url}: {type(e).__name__} at line {e.__traceback__.tb_lineno} of {__file__}: {e}")


if __name__ == "__main__":
    main()
```
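Not part of the original script, but as a quick sanity check the finished store can be opened lazily to confirm the time axis is complete and ordered (this assumes the `OUTPUT_ZARR`, `DATE_START` and `DATE_END` constants defined above):

```python
import pandas as pd
import xarray as xr

ds = xr.open_zarr(OUTPUT_ZARR)
print(ds)
# The time axis should be ordered and cover the whole requested range
assert ds.time.to_index().is_monotonic_increasing
assert ds.sizes["time"] == len(pd.date_range(DATE_START, DATE_END, freq="1D"))
```

Note that dates whose download failed are only logged by `download_time_step`, so their time slices are simply never written into the pre-allocated store.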
-
Hey @rabernat, just out of curiosity. Of course the serial approach (appending one instant after another sequentially) would still be available, but this could become slow in case there are many time steps to append. Thanks again
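For reference, the serial fallback mentioned above is just the ordered append loop from the original question; a minimal sketch, again assuming a hypothetical `load_one_day(date)` helper:

```python
import pandas as pd

dates = pd.date_range("2020-01-01", "2020-01-10")  # illustrative

for i, date in enumerate(dates):
    ds = load_one_day(date).expand_dims("time").assign_coords(time=[date])  # hypothetical loader
    if i == 0:
        # First write creates the store
        ds.to_zarr("out.zarr", mode="w", consolidated=True)
    else:
        # Subsequent writes append along "time", strictly in order
        ds.to_zarr("out.zarr", mode="a", append_dim="time", consolidated=True)
```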