-
Hi, zarr is just amazing. I'm about to rewrite some of my ML processing pipelines to use zarr for its distributed nature, and I'd like to ask for advice on my specific use case. I've read the multiprocessing documentation, which shows how to use a process synchronizer:

```python
synchronizer = zarr.ProcessSynchronizer('data/example.sync')
z = zarr.open_array('data/example', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4',
                    synchronizer=synchronizer)
```

But this is a superset of what I'm trying to achieve; the output layout is fixed for me.
What I want is to open the array from multiple processes and have each process work on its own partition independently. Is there a more efficient way to open/create the array? Pseudocode API:

```python
synchronizer = zarr.DistributedSynchronizer(world_size=N, rank=k)
z = zarr.open_array('data/example', mode='w', shape=(10000, 10000),
                    chunks=(1000, 1000), dtype='i4',
                    synchronizer=synchronizer)
```

A more low-level workaround would also work for me.
Replies: 1 comment
-
I found it in the docs.