Maybe fix column selection #1979
|
I'm looking into it. All examples from #1973 work now! Let me try running our CI in different repos vs this branch |
|
Sorry, I think I messed up something. The test from the issue still fails:

uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow

import random
import pyarrow as pa
import pyarrow.parquet as pq
from fsspec.parquet import open_parquet_file

def test(n, path):
    flat = pa.array([random.random() for _ in range(n)])
    nested = pa.array([{"a": random.random(), "b": random.random()} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path)
    with open_parquet_file(path, columns=["nested.a"], engine="pyarrow") as fh:
        _ = pq.read_table(fh)

# works for 10 rows
test(10, "/tmp/ten.parquet")
# fails for 100k rows
test(100_000, "100k.parquet")
|
I added the test https://github.com/fsspec/filesystem_spec/pull/1979/files#diff-ff7fd767388891014a980915bb4fb6a84233cd96324d59e80ce9d2db18577791R208-R220 that is supposed to be identical. I wonder what the difference is. |
|
(sorry, this link: filesystem_spec/fsspec/tests/test_parquet.py, line 208 in 8ba3aae)
|
|
Sorry, it was my mistake with the code: |
|
Could you please try with the files I shared in #1973? It still fails for me with double-nested columns (e.g. "spectrum.flux" is a list-array itself). |
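To make the nested case concrete, here is a minimal sketch of how a dotted column name could be matched against the leaf-column paths of a Parquet schema. `select_leaves` is a hypothetical helper written for illustration only, not the code in this PR or in fsspec, and the leaf paths are made up (the exact child names of a list column depend on the writer).

```python
def select_leaves(leaf_paths, columns):
    """Return leaf paths that equal a requested name or are nested under it."""
    selected = []
    for leaf in leaf_paths:
        for col in columns:
            # Match the leaf itself, or any leaf below the requested prefix.
            # The "." guard keeps "flat" from matching a sibling like "flat2".
            if leaf == col or leaf.startswith(col + "."):
                selected.append(leaf)
                break
    return selected


# Illustrative leaf paths: a flat column, a struct, and a list inside a struct.
leaves = ["flat", "nested.a", "nested.b", "spectrum.flux.list.element"]

print(select_leaves(leaves, ["nested.a"]))
print(select_leaves(leaves, ["spectrum.flux"]))
print(select_leaves(leaves, ["nested"]))
```

Under this assumption, selecting "spectrum.flux" has to pull in the list's child leaf as well, which is the double-nested situation described above.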
|
The test you introduced is a little bit different from the code in my original issue: you reuse the same If I change your test with either 1)

Reproducible code

uv venv venv
source venv/bin/activate
uv pip install git+https://github.com/martindurant/filesystem_spec@parquet-nested pyarrow pandas

import os
import random
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from fsspec.parquet import open_parquet_file

def test_nested(n, tmpdir, engine):
    path = os.path.join(str(tmpdir), "test.parquet")
    import pyarrow as pa
    flat = pa.array([random.random() for _ in range(n)])
    a = random.random()
    b = random.random()
    nested = pa.array([{"a": a, "b": b} for _ in range(n)])
    table = pa.table({"flat": flat, "nested": nested})
    pq.write_table(table, path, use_dictionary=False, compression=None)
    with open_parquet_file(path, columns=["nested.a"], engine=engine) as fh:
        col = pd.read_parquet(fh, engine=engine, columns=["nested.a"])
    name = "a" if engine == "pyarrow" else "nested.a"
    assert (col[name] == a).all()

test_nested(1_000_000, '/tmp', 'pyarrow')
That's interesting... |
|
@hombit, are you using the new pandas 3.0? It has broken all kinds of stuff related to parquet. |
|
I will merge this when it passes to unblock other PRs that are waiting and blocked on pandas 3, and then try to make a new fix later. |
|
@martindurant sure! I did use pandas==3, but it fails the same way with |
@hombit, can you test, please? I fear this may be over-reading bytes, but at least all the tests pass, including your specific case.
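One rough way to quantify over-reading is to compare the bytes actually fetched with the bytes the selected column chunks need. The sketch below is only an illustration: `total_bytes` is a hypothetical helper, and the byte ranges are invented numbers, not measurements from this PR.

```python
def total_bytes(ranges):
    """Count bytes covered by half-open (start, end) ranges, merging overlaps."""
    covered = 0
    current_start = current_end = None
    for start, end in sorted(ranges):
        if current_end is None or start > current_end:
            # Gap before this range: close the previous merged run.
            if current_end is not None:
                covered += current_end - current_start
            current_start, current_end = start, end
        else:
            # Overlapping or touching range: extend the current run.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start
    return covered


# Invented example: one column chunk is needed, but three ranges were fetched.
needed = [(4, 1004)]
fetched = [(4, 1004), (1004, 2004), (1500, 2504)]

overread = total_bytes(fetched) - total_bytes(needed)
print(f"needed={total_bytes(needed)} fetched={total_bytes(fetched)} overread={overread}")
```

Merging overlaps before summing avoids double-counting bytes that two fetched ranges share.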