Closed
Description
Hello,
Internally, we wrote our own library that wraps arrow-rs to make it usable from Python.
A similar wrapper is publicly available as arro3, which I used here for a minimal reproducible example:
import pyarrow
import pyarrow.parquet
import arro3.io

# Payload size in bytes for each row
data = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
schema = pyarrow.schema([pyarrow.field("html", pyarrow.binary())])
data = [{"html": b"0" * d} for d in data]
t = pyarrow.Table.from_pylist(data, schema=schema)
path = "/tmp/foo.parquet"

# Write through arro3 (arrow-rs) with 3 rows per row group
with open(path, "wb") as file:
    for b in t.to_batches():
        arro3.io.write_parquet(b, file, max_row_group_size=len(data) - 3)

# Read each row group back with pyarrow
reader = pyarrow.parquet.ParquetFile(path)
for i in range(2):
    print(len(reader.read_row_group(i)))
This code writes some dummy binary data through arrow-rs. Reading the file back with pyarrow results in:
File "pyarrow/_parquet.pyx", line 1655, in pyarrow._parquet.ParquetReader.read_row_group
File "pyarrow/_parquet.pyx", line 1691, in pyarrow._parquet.ParquetReader.read_row_groups
File "pyarrow/error.pxi", line 92, in pyarrow.lib.check_status
OSError: Couldn't deserialize thrift: No more data to read.
Deserializing page header failed.
Observation
- Reading the same file through arro3 or our own internal library wrapping arrow-rs works just fine
- Reading the same file through duckdb also works just fine
- Reducing the amount of binary data per row slightly makes the error disappear (8_388_855 bytes per row or 25_166_565 bytes per row group or more seems to be the problematic amount)
- The issue is reproducible with pyarrow versions 18.1.0, 19.0.1 and 20.0.0
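The sizes in the third observation can be put in context with a little arithmetic. The sketch below uses only the row sizes and row group setting from the reproduction script. Comparing them against 8 MiB (2^23 bytes) is an assumption made here for illustration, as is the idea that rows are split into row groups in order; neither is confirmed as the cause of the error.

```python
# Row payload sizes (bytes) taken from the reproduction script.
sizes = [8388855, 8388924, 8388853, 8388880, 8388876, 8388879]
max_row_group_size = len(sizes) - 3  # 3 rows per row group, as in the script

# Assumption: 8 MiB is a boundary worth comparing against.
EIGHT_MIB = 8 * 1024 * 1024  # 8_388_608

# Bytes by which each row exceeds 8 MiB.
excess = [s - EIGHT_MIB for s in sizes]

# Total payload per row group, assuming rows are split in order.
row_group_bytes = [
    sum(sizes[i:i + max_row_group_size])
    for i in range(0, len(sizes), max_row_group_size)
]

print(excess)            # [247, 316, 245, 272, 268, 271]
print(row_group_bytes)   # [25166632, 25166635]
print(3 * 8_388_855)     # 25166565, the per-row-group figure quoted above
```

Every row is only a few hundred bytes above 8 MiB, so it may be worth checking whether an 8 MiB limit or buffer size exists in either the writer or the reader.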
I'm fairly confident that the underlying issue lies in either the arrow-rs or the arrow codebase, but I opened an issue here nevertheless, since arro3 is currently the only (easy) way to reproduce it with publicly available packages.
If you know of a way to narrow down where this issue is coming from, please let me know! I opened other issues here and here.