
Using zarr v3 in a multiprocessing context fails with JSONDecodeError #2729

Open
MariusMeyerDraeger opened this issue Jan 18, 2025 · 1 comment
Labels
bug Potential issues with the zarr-python library

Comments

Zarr version

3.0.1

Numcodecs version

0.15.0

Python Version

3.12.2

Operating System

Windows 11 22H2

Installation

using pip into virtual environment

Description

Hi,

I discovered zarr a few days ago, just after v3 was published, and I'm trying to use it in a multiprocessing context where one process writes numeric as well as variable-length string data into a persistent store, while a reader process reads the newly arrived data.
The aim is to exchange data between processes and store it persistently at the same time.
I tried to build a minimal working example (see steps to reproduce), but more often than not, reading from the zarr files fails with the following exception:

Traceback (most recent call last):
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\multiprocessing\process.py", line 314, in _bootstrap
    self.run()
  File "...\scratch_3.py", line 26, in run
    text_dset = root['text_data']
                ~~~~^^^^^^^^^^^^^
  File "...\site-packages\zarr\core\group.py", line 1783, in __getitem__
    obj = self._sync(self._async_group.getitem(path))
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "...\site-packages\zarr\core\sync.py", line 187, in _sync
    return sync(
           ^^^^^
  File "...\site-packages\zarr\core\sync.py", line 142, in sync
    raise return_result
  File "...\site-packages\zarr\core\sync.py", line 98, in _runner
    return await coro
           ^^^^^^^^^^
  File "...\site-packages\zarr\core\group.py", line 681, in getitem
    zarr_json = json.loads(zarr_json_bytes.to_bytes())
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\__init__.py", line 346, in loads
    return _default_decoder.decode(s)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\tools\Python\3.12\3.12.2-win64\Lib\json\decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Is this a bug in v3, is v3 not yet ready for multiprocessing, or am I making a mistake here?
Sadly, the v3 docs don't really describe how to use zarr in a multiprocessing context, so it's possible I'm missing something.

Steps to reproduce

import sys
import time

import zarr
import numpy as np
import logging
from multiprocessing import Process, Event

class ZarrReader(Process):
    def __init__(self, event, fname, dsetname, timeout=2.0):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname
        self._timeout = timeout

    def run(self):
        self.log = logging.getLogger('reader')
        print("Reader: Waiting for initial event")
        assert self._event.wait(self._timeout)
        self._event.clear()

        print(f"Reader: Opening file {self._fname}")
        root = zarr.open_group(self._fname, mode='r')
        dset = root[self._dsetname]
        text_dset = root['text_data']
        # monitor and read loop
        while self._event.wait(self._timeout):
            self._event.clear()
            print("Reader: Event received")
            dset = root[self._dsetname]
            text_dset = root['text_data']
            shape = dset.shape
            print(f"Reader: Read dset shape: {shape}")
            print(f"Reader: Text dataset shape: {text_dset.shape}")
            for i in range(text_dset.shape[0]):
                print(text_dset[i])

class ZarrWriter(Process):
    def __init__(self, event, fname, dsetname):
        super().__init__()
        self._event = event
        self._fname = fname
        self._dsetname = dsetname

    def run(self):
        self.log = logging.getLogger('writer')
        self.log.info("Creating file %s", self._fname)
        root = zarr.group(self._fname, overwrite=True)
        arr = np.array([1,2,3,4])
        dset = root.create_array(self._dsetname, shape=(4,), chunks=(2,), dtype=np.float64, fill_value=np.nan)
        dset[:] = arr
        text_dset = root.create_array('text_data', shape=(1,), chunks=(3,), dtype=str)
        text_arr = np.array(["Sample text 0"])
        text_dset[:] = text_arr

        print("Writer: Sending initial event")
        self._event.set()
        print("Writer: Waiting for the reader-opened-file event")
        # time.sleep(1.0)
        # Write loop
        for i in range(1, 6):
            new_shape = (i * len(arr), )
            print(f"Writer: Resizing dset shape: {new_shape}")
            dset.resize(new_shape)
            print("Writer: Writing data")
            dset[i*len(arr):] = arr
            text_dset.resize((text_dset.shape[0] + 1,))
            new_text_arr = np.array([f"Sample text {i}" * i])
            text_dset[-1:] = new_text_arr
            #dset.write_direct( arr, np.s_[:], np.s_[i*len(arr):] )
            print("Writer: Sending event")
            self._event.set()


if __name__ == "__main__":
    logging.basicConfig(format='%(levelname)10s  %(asctime)s  %(name)10s  %(message)s',level=logging.INFO)
    fname = 'measurements.zarr'
    dsetname = 'data'
    if len(sys.argv) > 1:
        fname = sys.argv[1]
    if len(sys.argv) > 2:
        dsetname = sys.argv[2]

    event = Event()
    reader = ZarrReader(event, fname, dsetname)
    writer = ZarrWriter(event, fname, dsetname)

    logging.info("Starting reader")
    reader.start()
    logging.info("Starting writer")
    writer.start()

    logging.info("Waiting for writer to finish")
    writer.join()
    logging.info("Waiting for reader to finish")
    reader.join()

Additional output

No response

@MariusMeyerDraeger MariusMeyerDraeger added the bug Potential issues with the zarr-python library label Jan 18, 2025

d-v-b commented Jan 19, 2025

We haven't explicitly tested zarr-python 3 with multiprocessing, but I don't see any reason why there should be particular problems: at least with the LocalStore, zarr-python doesn't rely on holding any file handles open.

That being said, I don't really understand the architecture of your program. From the error, it looks like the reader is trying to parse a zarr.json document that is empty. Since the run method of your ZarrWriter class opens the root group with overwrite=True, while the run method of your ZarrReader class opens the same group with mode='r', it's possible that you have a race condition here: the reader can open the group while the writer is still (re)creating its metadata. You may need to poll the state of the zarr.json document before trying to open the group.
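A minimal polling sketch of that idea, assuming the store is a plain directory on disk (the `wait_for_zarr_json` helper, its timeout, and the poll interval are my own inventions, not part of zarr's API):

```python
import json
import time
from pathlib import Path


def wait_for_zarr_json(path, timeout=5.0, interval=0.05):
    """Poll until <path>/zarr.json exists and parses as JSON; return the metadata."""
    deadline = time.monotonic() + timeout
    zarr_json = Path(path) / "zarr.json"
    while time.monotonic() < deadline:
        try:
            return json.loads(zarr_json.read_bytes())
        except (FileNotFoundError, json.JSONDecodeError):
            time.sleep(interval)  # metadata not written, or only partially written, yet
    raise TimeoutError(f"no readable zarr.json under {path} after {timeout}s")
```

The reader would call `wait_for_zarr_json(self._fname)` before `zarr.open_group(self._fname, mode='r')`, and could do the same for child nodes such as `text_data`, since each array has its own zarr.json that can race the same way.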

To avoid these kinds of issues, I would create your zarr hierarchy in synchronous code as much as possible, before starting the worker processes (writing a few JSON metadata documents doesn't benefit from multiprocessing anyway).
