kylebarron commented Mar 25, 2025

Change list

  • Add a ParquetFile class that can wrap either a sync or an async source.
  • Add ParquetFile.read, which reads from a local or remote source, always synchronously.
    • For local sources, it's easy to share the record batch stream via FFI. For remote (async) sources, I previously didn't know whether this was possible, but wrapping the async source in a BlockingAsyncParquetReader compiles correctly, so we should be able to expose an async source through a synchronous Python API. Each record batch is then fetched synchronously (though multiple concurrent reads may be in flight while fetching that batch).
  • Add ParquetFile.read_async, which reads from a local (TBD, but should be easy) or remote source and exposes an async iterator. This async iterator can't be exchanged via FFI, but Python can await these async functions, and the result of each future can then be exchanged to Python via FFI.
  • Expose the async ParquetRecordBatchStream as a blocking Python iterator (and actually via C FFI!!) so that we can provide a clean synchronous API for remote files; see the sketch after this list.
  • Add Python type hints for ParquetFile.
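As a rough sketch of what that synchronous path looks like over a remote source: `ParquetFile.open` here is a hypothetical constructor mirroring `open_async` (constructors are still a TODO below), the path is a placeholder, and the assumption that `read` returns a blocking record batch iterator follows from the change list above.

```python
# Sketch of the synchronous API over a remote source. `ParquetFile.open`
# is a hypothetical constructor mirroring `open_async` (constructors are
# still a TODO); the path is a placeholder.
from arro3.io import ParquetFile
from obstore.store import S3Store

store = S3Store("overturemaps-us-west-2", region="us-west-2", skip_signature=True)
file = ParquetFile.open("path/to/file.parquet", store=store)

# Each record batch is fetched synchronously: internally the
# BlockingAsyncParquetReader blocks on the underlying async reads.
for record_batch in file.read(batch_size=128_000):
    print(record_batch)
```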

TODO:

  • Implement constructors for ParquetFile (e7004d7 (#313))
  • Implement AsyncFileReader for SyncSource to allow read_async from local file (e03ab04 (#313))
  • Support RowFilter: prototype letting the user pass an Arrow UDF as an ArrowPredicate. Now that I've learned how to support these Python user callbacks in obstore, we can apply the same approach here. The user can then pass a list of these predicates as a RowFilter. (ba671ab (#313))
    • Do we even need a geo-specific GeoParquet reader library if we can efficiently implement RowFilter here...?
  • Settle on naming for projection vs columns ✅
  • Future read_table API that reads multiple row groups concurrently (see next_row_group and this example in the tests and this PR)
  • Support RowSelection (this can be left for a future PR)
  • Future: Implement ParquetDataset
  • Future: deprecate read_parquet_async?
  • Examples with polars and pyarrow creating predicates for predicate pushdown (see the sketch after this list)
  • Write documentation comparing our filters to pyarrow's filters.
  • Document that you must supply both row_groups and filters to get the best predicate pushdown.
  • API documentation
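For the predicate-pushdown examples, something like the following is the idea, reusing the file handle from the sketch above. This is a sketch only: the row_groups and filters parameter names and the callback contract (record batch in, boolean mask out) are assumptions, not settled API.

```python
# Sketch of user-defined predicate pushdown. The `row_groups` and
# `filters` parameters and the callback signature are assumptions.
import pyarrow.compute as pc

def taller_than_10m(batch):
    # An ArrowPredicate as a plain Python callback: takes a record batch
    # of the projected columns and returns a boolean mask.
    return pc.greater(batch["height"], 10)

stream = file.read(
    row_groups=[0, 1],          # first prune whole row groups via statistics...
    filters=[taller_than_10m],  # ...then filter rows within each group
)
```

This mirrors the documentation point above: row_groups prunes at the row-group level, filters prunes at the row level, and you want both for the best pushdown.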

Example

```python
from arro3.io import ParquetFile
from obstore.store import S3Store

store = S3Store("overturemaps-us-west-2", region="us-west-2", skip_signature=True)

path = "release/2024-03-12-alpha.0/theme=buildings/type=building/part-00217-4dfc75cd-2680-4d52-b5e0-f4cc9f36b267-c000.zstd.parquet"
file = await ParquetFile.open_async(path, store=store)

# Stream record batches from S3, 128k rows at a time
stream = file.read_async(batch_size=128000)
i = 0
async for record_batch in stream:
    i += 1
    print(f"iteration: {i}")
    print(record_batch)
```
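Per the read_async note in the change list, the stream itself can't cross the FFI boundary, but each awaited batch can. As an (assumed) illustration: recent pyarrow versions can construct a RecordBatch from any object implementing __arrow_c_array__, so each yielded batch should be exchangeable roughly like this:

```python
# The async iterator stays in Python; each awaited batch crosses to
# pyarrow via the Arrow PyCapsule interface (assuming the yielded
# batches implement __arrow_c_array__).
import pyarrow as pa

async for record_batch in file.read_async(batch_size=128000):
    pa_batch = pa.record_batch(record_batch)  # FFI exchange, no copy
    print(pa_batch.num_rows)
```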

Ref #258 (comment) for an earlier example.

Closes #258, closes #195
