Hi, reaching out because I am running into performance issues trying to run my workflow in an HPC environment with thousands of tasks. I might be doing this wrong, so I'm also seeking help.
To explain my workflow:
I am trying to use a zarr.DirectoryStore to store thousands of xarray.Datasets (>10,000) that are about 10-18 MB each. I start by scraping all the directories and generating a list of futures, each holding a Dataset built from one directory. I then want to use a zarr.DirectoryStore to sink all my data, with the caveat that I want to store it under a specific group name.
So the workflow looks something like this:
```python
import os

import fasteners
import zarr

# GenerateDataset, `client` (a dask.distributed Client), and `directories`
# are defined elsewhere in my setup.


def scrape_dir(dir_name):
    foo = GenerateDataset(dir_name)
    dataset = foo.run()
    return dataset


def write_to_store(dataset, file_name, mode_="w", append_dim=None):
    lock = fasteners.InterProcessLock(os.getcwd() + "/.lock")
    with lock:
        try:
            synchronizer = zarr.ProcessSynchronizer(os.getcwd() + "/zarr.sync")
            path = os.path.join(os.getcwd(), file_name)
            store = zarr.DirectoryStore(path=path)
            dataset.to_zarr(
                store=store,
                mode=mode_,
                synchronizer=synchronizer,
                group="MY_GROUP",
                encoding=None,
                compute=True,
                consolidated=True,
                append_dim=append_dim,
            )
            msg = "success"
        except Exception as e:
            msg = f"failed! reason: {e}"
    return msg


# generate datasets as futures
future_datasets = [client.submit(scrape_dir, dir_name) for dir_name in directories]

# get the first dataset to start the Zarr store
dataset_1 = future_datasets[0].result()

# initial write
write_to_store(dataset=dataset_1, file_name="myzarr_store.zarr", mode_="w", append_dim=None)

# write the rest of the datasets as futures
msgs = []
for ds in future_datasets[1:]:
    msg = client.submit(write_to_store, dataset=ds, file_name="myzarr_store.zarr", mode_="a", append_dim="case")
    msgs.append(msg)
```
As you can see, I had to add my own file locking because without it Zarr would just hang.
I've been experimenting with other approaches, such as using xr.concat on the futures while the Datasets are sitting on my Dask workers, but that concat takes forever and even kills my cluster.
I've also dumped my Datasets as netCDF files and later used xr.open_mfdataset to open them back up and store them as a Zarr store, but that defeats the purpose of writing to Zarr while I am still holding the Datasets on my workers.
I wish there were something like xr.open_mfdataset(path=futures) that let me pass futures and then save to Zarr without having to write to disk and load everything back up again, but I did not find anything like that.
So yeah, I'm open to suggestions on how I should go about this workflow. Many thanks in advance!
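For concreteness, here is a minimal sketch of what I mean by that concat attempt, assuming the Datasets are gathered back from the workers before concatenating and that "case" is the dimension I want to stack along:

```python
import xarray as xr

# pull every Dataset back from the workers (this gather is the expensive part),
# then stack them along the "case" dimension in one go
datasets = client.gather(future_datasets)
combined = xr.concat(datasets, dim="case")

# single write of the combined Dataset into the target group
combined.to_zarr("myzarr_store.zarr", group="MY_GROUP", mode="w", consolidated=True)
```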
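That netCDF fallback looks roughly like the sketch below; the file-naming pattern is just illustrative, and I'm assuming all files share the "case" dimension:

```python
import os
import xarray as xr

# dump each Dataset to its own netCDF file first (illustrative naming)
paths = []
for i, ds in enumerate(client.gather(future_datasets)):
    path = os.path.join(os.getcwd(), f"case_{i}.nc")
    ds.to_netcdf(path)
    paths.append(path)

# reopen them all lazily, concatenated along "case", and write one Zarr store
combined = xr.open_mfdataset(paths, combine="nested", concat_dim="case")
combined.to_zarr("myzarr_store.zarr", group="MY_GROUP", mode="w", consolidated=True)
```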