The problem
uproot.recreate() with a simplecache:: path writes data locally but never uploads it to the remote server. From some digging, the cause seems to be that uproot opens the sink in r+b mode, while simplecache only triggers its upload-on-close mechanism for wb mode (see https://filesystem-spec.readthedocs.io/en/latest/features.html#remote-write-caching).
Reproducer
import uproot
import fsspec

path = "simplecache::root://server//store/user/test/output.root"
raw = path.replace("simplecache::", "")

# Write with simplecache: appears to succeed
with uproot.recreate(path) as f:
    f["Events"] = {"x": [1, 2, 3]}
print("WRITE: no exception")

# But data never reaches the remote storage, although a file is created
try:
    with uproot.open(f"{raw}:Events") as t:
        print(f"READ: {t.num_entries} entries")
except Exception as e:
    # Throws a zero-bytes read error
    print(f"READ: failed - {e}")
How I tried to disentangle the problem
I wanted to understand why a file gets created on the remote storage but ends up corrupted/empty. So I used unittest.mock.patch and some monkeypatching to trace calls to methods that could create the file on the remote server without writing anything to it (namely mkdir, mkdirs, makedirs and touch). It is hacky, but it was insightful:
import uproot
import fsspec
from unittest.mock import patch
from fsspec_xrootd.xrootd import XRootDFileSystem

path = "simplecache::root://server//store/user/test/trace_output.root"

# Trace XRootDFileSystem operations and fsspec.open
traced_methods = ["mkdirs", "touch", "mkdir", "makedirs"]
originals = {}


def make_traced(name, original):
    def traced(self, *args, **kwargs):
        path_arg = args[0] if args else ""
        print(f" FS.{name}({path_arg!r})")
        return original(self, *args, **kwargs)
    return traced


for method_name in traced_methods:
    original = getattr(XRootDFileSystem, method_name, None)
    if original is not None:
        originals[method_name] = original
        setattr(XRootDFileSystem, method_name, make_traced(method_name, original))

original_fsspec_open = fsspec.open


def traced_fsspec_open(p, *args, **kwargs):
    mode = args[0] if args else kwargs.get("mode", "?")
    print(f" fsspec.open({p!r}, mode={mode!r})")
    return original_fsspec_open(p, *args, **kwargs)


try:
    with patch("fsspec.open", traced_fsspec_open):
        with uproot.recreate(path) as f:
            f["Events"] = {"x": [1, 2, 3]}
finally:
    # Restore the original, untraced methods
    for method_name, original in originals.items():
        setattr(XRootDFileSystem, method_name, original)

# Output:
# FS.mkdirs('/store/user/test') --> creates parent dir on remote
# FS.touch('/store/user/test/trace_output.root') --> creates 0-byte file on remote
# fsspec.open('simplecache::root://...', mode='r+b') --> opens in r+b, not wb
uproot opens the sink in r+b mode. With simplecache, this opens a local cached copy for read+write. Writes go to the local file, but I think simplecache does not upload on close because r+b is not a write mode from its perspective.
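To make the suspected mechanism concrete, here is a toy model of it using only the standard library. It is not fsspec's actual code: ToyRemote and ToyCachedFile are hypothetical stand-ins, and the rule "upload on close only when the mode contains no r" is the assumption being illustrated, not something confirmed from the simplecache source.

```python
import os
import tempfile


class ToyRemote:
    """Hypothetical in-memory stand-in for the remote store."""

    def __init__(self):
        self.files = {}

    def touch(self, path):
        # Create a 0-byte entry if none exists, like the traced touch() call
        self.files.setdefault(path, b"")


class ToyCachedFile:
    """Local cached file, uploaded on close only for write-only modes (assumed rule)."""

    def __init__(self, remote, path, mode, local_dir):
        self.remote, self.path, self.mode = remote, path, mode
        self.local = os.path.join(local_dir, "cached.bin")
        if "r" in mode:
            # Read-style modes (including r+b): fetch a local copy first
            with open(self.local, "wb") as f:
                f.write(remote.files[path])
        self.handle = open(self.local, mode)

    def write(self, data):
        return self.handle.write(data)

    def close(self):
        self.handle.close()
        if "r" not in self.mode:
            # Only write-only modes push the local copy back to the remote
            with open(self.local, "rb") as f:
                self.remote.files[self.path] = f.read()


def write_via_cache(mode):
    """Touch a remote file, write through the toy cache, return the remote bytes."""
    remote = ToyRemote()
    remote.touch("out.root")
    with tempfile.TemporaryDirectory() as tmp:
        f = ToyCachedFile(remote, "out.root", mode, tmp)
        f.write(b"payload")
        f.close()
    return remote.files["out.root"]
```

In this model, write_via_cache("wb") leaves the payload on the remote, while write_via_cache("r+b") leaves the 0-byte file from touch() untouched, which matches what I observe.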
Expected behaviour
For simplecache:: paths, maybe recreate() should open in wb mode so that simplecache uploads on close?
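Until that is settled, a possible workaround is to write the file locally and upload it explicitly. The sketch below only shows the control flow: write_locally_then_upload, write_local and upload are hypothetical names, not part of uproot or fsspec, and the size check is just a guard against silently shipping an empty file.

```python
import os
import tempfile


def write_locally_then_upload(write_local, upload, filename="output.root"):
    """Write to a temporary local file, then hand it to `upload`.

    `write_local(path)` and `upload(path)` are caller-supplied callables
    (hypothetical; nothing here is an uproot or fsspec API).
    Returns the size in bytes of the file that was uploaded.
    """
    with tempfile.TemporaryDirectory() as tmp:
        local = os.path.join(tmp, filename)
        write_local(local)
        size = os.path.getsize(local)
        if size == 0:
            # Fail loudly instead of uploading an empty file
            raise RuntimeError(f"{local} is empty; refusing to upload")
        upload(local)
        return size
```

With uproot, write_local could be a function that calls uproot.recreate() on the local path, and upload could copy the file using an fsspec filesystem's put() method. I have not verified that this works for root://, since that depends on fsspec-xrootd.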
Some other concerns
Writing files to remote storage via simplecache:: silently produces empty files. No exception is raised, which made it a pain to debug. The empty file is visible on the remote server because uproot calls mkdirs() + touch() on the underlying remote filesystem before opening. These create a 0-byte file that is never overwritten with actual data since the upload never happens.
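The touch() call is consistent with how r+b behaves for ordinary files: unlike wb, it does not create the file, so the file has to exist before it can be opened. Presumably that is why uproot creates it first, though I have not confirmed this in the uproot source. A small local illustration, using a temporary directory and a made-up file name:

```python
import os
import tempfile


def try_open(path, mode):
    """Return 'ok' if the file opens in `mode`, else the name of the error."""
    try:
        with open(path, mode):
            return "ok"
    except FileNotFoundError:
        return "FileNotFoundError"


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "output.root")

    # r+b does not create a missing file
    missing_rplus = try_open(target, "r+b")

    # Equivalent of touch(): create an empty file
    open(target, "wb").close()

    # Now r+b succeeds, but the file is still 0 bytes until something is written
    existing_rplus = try_open(target, "r+b")
    size_after = os.path.getsize(target)
```

Here missing_rplus is "FileNotFoundError", existing_rplus is "ok", and size_after is 0, which is the same state the remote file is left in when the upload never happens.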
Note: for simplecache::root:// specifically, there might be a missing piece in fsspec-xrootd that needs to be implemented. Writes through S3 seem to work with simplecache, but writes through XRootD do not. I opened a separate issue in the fsspec-xrootd repository.