Add Arrow C streaming, DataFrame iteration, and OOM-safe streaming execution #1222
Conversation
…ing in DataFrame
- Add `range` method to SessionContext and iterator support to DataFrame
- Introduce `spawn_stream` utility and refactor async execution for better signal handling
- Add tests for `KeyboardInterrupt` in `__arrow_c_stream__` and incremental DataFrame streaming
- Improve memory usage tracking in tests with psutil
- Update DataFrame docs with PyArrow streaming section and enhance `__arrow_c_stream__` documentation
- Replace Tokio runtime creation with `spawn_stream` in PySessionContext
- Bump datafusion packages to 49.0.1 and update dependencies
- Remove unused imports and restore main Cargo.toml
…andling
- Refactor record batch streaming to use `poll_next_batch` for clearer error handling
- Improve `spawn_future`/`spawn_stream` functions for better Python exception integration and code reuse
- Update `datafusion` and `datafusion-ffi` dependencies to 49.0.2
- Fix PyArrow `RecordBatchReader` import to use `_import_from_c_capsule` for safer memory handling
- Refactor `ArrowArrayStream` handling to use `PyCapsule` with destructor for improved memory management
- Refactor projection initialization in `PyDataFrame` for clarity
- Move `range` functionality into `_testing.py` helper
- Rename test column in `test_table_from_batches_stream` for accuracy
- Add tests for `RecordBatchReader` and enhance DataFrame stream handling
…docs
- Preserve partition order in DataFrame streaming and update related tests
- Add tests for record batch ordering and DataFrame batch iteration
- Improve `drop_stream` to correctly handle PyArrow ownership transfer and null pointers
- Replace `assert` with `debug_assert` for safer ArrowArrayStream validation
- Add documentation for `poll_next_batch` in PyRecordBatchStream
- Refactor tests to use `fail_collect` fixture for DataFrame collect
- Refactor `range_table` return type to `DataFrame` for clearer type hints
- Minor cleanup in SessionContext (remove extra blank line)
…dBatchReader validation
I'm invested in this and plan to review it this afternoon!
I would strongly advocate for less direct integration with pyarrow, not more. Pyarrow is a massive dependency, while the Arrow PyCapsule Interface should allow for better decentralized sharing of Arrow data.
DataFrames are also iterable, yielding :class:`pyarrow.RecordBatch` objects
lazily so you can loop over results directly:

.. code-block:: python

    for batch in df:
        ...  # process each batch as it is produced
Because the user can iterate over the stream accessed by the target library, I don't think we should define our own custom integration here, and if we do, then the yielded object should not be a pyarrow `RecordBatch`, but rather an opaque, minimal Python class that just exposes `__arrow_c_array__` so that the user can choose what Arrow library they want to use to work with the batch.
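For illustration, a hypothetical sketch of the kind of thin wrapper being described; the class name and attribute are made up, and the only assumption is that the wrapped object itself implements `__arrow_c_array__`:

```python
class ExportedBatch:
    """Opaque wrapper that only exposes the Arrow C Data Interface."""

    def __init__(self, inner):
        # `inner` is any object implementing __arrow_c_array__
        # (e.g. an internal record batch handle).
        self._inner = inner

    def __arrow_c_array__(self, requested_schema=None):
        # Delegate to the wrapped batch; the consumer (pyarrow, polars,
        # arro3, ...) decides how to import the returned capsules.
        return self._inner.__arrow_c_array__(requested_schema)
```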
We already have our own `RecordBatch` class: https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatch

Also, we should ensure that the dunder methods are rendered in the docs. It doesn't look like they are currently. (Or maybe the dunder methods on that `RecordBatch` aren't documented?)
python/datafusion/dataframe.py
Outdated
        return self.df.__arrow_c_stream__(requested_schema)

    def __iter__(self) -> Iterator[pa.RecordBatch]:
I don't really think there's a good rationale for having this method, especially as it reuses the exact same mechanism as the PyCapsule Interface. If anything, we might want to have an `__aiter__` method that has a custom async connection to the DataFusion context.
`RecordBatchStream` already has `__iter__` and `__aiter__` methods: https://datafusion.apache.org/python/autoapi/datafusion/record_batch/index.html#datafusion.record_batch.RecordBatchStream

Can we just have a method that converts a `DataFrame` into a `RecordBatchStream`? Then an `__iter__` on `DataFrame` would just convert to a `RecordBatchStream` under the hood.
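A rough sketch of what that delegation could look like, assuming `execute_stream()` returns datafusion's `RecordBatchStream` (the method bodies are illustrative, not the actual implementation):

```python
from typing import Iterator

from datafusion.record_batch import RecordBatch


class DataFrame:
    ...

    def __iter__(self) -> Iterator[RecordBatch]:
        # RecordBatchStream already implements __iter__, so DataFrame
        # iteration can simply defer to the stream it produces.
        return iter(self.execute_stream())

    def __aiter__(self):
        # Same idea for async iteration; calling the dunder directly
        # avoids the aiter() builtin, which requires Python >= 3.10.
        return self.execute_stream().__aiter__()
```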
src/dataframe.rs
Outdated
#[allow(clippy::manual_c_str_literals)]
static ARROW_STREAM_NAME: &CStr =
    unsafe { CStr::from_bytes_with_nul_unchecked(b"arrow_array_stream\0") };
As suggested by the linter, can we just use `c"arrow_array_stream"`?
src/dataframe.rs
Outdated
unsafe extern "C" fn drop_stream(capsule: *mut ffi::PyObject) {
    if capsule.is_null() {
        return;
    }

    // When PyArrow imports this capsule it steals the raw stream pointer and
    // sets the capsule's internal pointer to NULL. In that case
    // `PyCapsule_IsValid` returns 0 and this destructor must not drop the
    // stream as ownership has been transferred to PyArrow. If the capsule was
    // never imported, the pointer remains valid and we are responsible for
    // freeing the stream here.
    if ffi::PyCapsule_IsValid(capsule, ARROW_STREAM_NAME.as_ptr()) == 1 {
        let stream_ptr = ffi::PyCapsule_GetPointer(capsule, ARROW_STREAM_NAME.as_ptr())
            as *mut FFI_ArrowArrayStream;
        if !stream_ptr.is_null() {
            drop(Box::from_raw(stream_ptr));
        }
    }

    // `PyCapsule_GetPointer` sets a Python error on failure. Clear it only
    // after the stream has been released (or determined to be owned
    // elsewhere).
    ffi::PyErr_Clear();
}
We shouldn't need to do any of this, according to upstream discussion apache/arrow-rs#5070 (comment)
        self.schema.clone()
    }
}

#[pymethods]
impl PyDataFrame {
Essentially this changes the `DataFrame` construct to always be "lazy"? Previously a `DataFrame` was always materialized in memory, whereas now it's just a representation of future batches?
src/dataframe.rs
Outdated
let ffi_stream = FFI_ArrowArrayStream::new(reader);
let stream_capsule_name = CString::new("arrow_array_stream").unwrap();
PyCapsule::new(py, ffi_stream, Some(stream_capsule_name)).map_err(PyDataFusionError::from)
let stream = Box::new(FFI_ArrowArrayStream::new(reader));
If you have an `FFI_ArrowArrayStream` you should be able to just pass that to `PyCapsule::new` without touching any unsafe: https://github.com/kylebarron/arro3/blob/cb2453bf022d0d8704e56e81a324ab5a772e0247/pyo3-arrow/src/ffi/to_python/utils.rs#L93-L94
In #1227 I explicitly suggested removing the pyarrow dependency altogether. I thought I had created an issue before, but apparently not.
This would also close #1011
Co-authored-by: Kyle Barron <[email protected]>
…port in DataFrame
I'm sorry I've not been following this one closely, but I hope to look into this tomorrow.
Cargo.toml
Outdated
Are most of the changes here just formatting? Did you just add `cstr`?
Some(Ok(batch)) => Ok(batch.into()),
Some(Err(e)) => Err(PyDataFusionError::from(e))?,
None => {
match poll_next_batch(&mut stream).await {
I'm not a tokio expert; how does this materially change?

Did you have to make a new function here? Could you have just used `match stream.next().await.transpose()`?
`poll_next_batch` was created to encapsulate the recurring asynchronous polling pattern used when consuming the stream. Centralizing this behavior improves readability, prevents duplicated error-mapping logic, facilitates unit testing, and enables future enhancements without modifying multiple call sites.
python/tests/test_io.py
Outdated
def test_table_from_batches_stream(ctx, fail_collect):
    df = range_table(ctx, 0, 10)

    table = pa.Table.from_batches(batch.to_pyarrow() for batch in df)
Are you intentionally testing a Python iterator here instead of using `__arrow_c_stream__`?

If you call `pa.table(df)`, that will use the C Stream and materialize the data on the pyarrow side, and it'll be more efficient because it keeps everything as a C pointer instead of having to go through Python.
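For comparison, a small sketch of the two paths, reusing the `df` built in the test above and assuming a pyarrow version that understands the Arrow PyCapsule Interface:

```python
import pyarrow as pa

# C-stream path: pyarrow drives __arrow_c_stream__ directly, so batches
# never round-trip through Python objects.
table_via_capsule = pa.table(df)

# Python-iteration path: each batch is converted to a pyarrow
# RecordBatch in Python before the Table is assembled.
table_via_iter = pa.Table.from_batches(batch.to_pyarrow() for batch in df)
```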
Good point.

I also renamed the test to `test_table_from_arrow_c_stream`.
@@ -46,6 +46,26 @@ def to_pyarrow(self) -> pa.RecordBatch:
        """Convert to :py:class:`pa.RecordBatch`."""
        return self.record_batch.to_pyarrow()

    def __arrow_c_array__(
👍 👍
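As a usage note, a hedged sketch of consuming the new protocol from pyarrow; it assumes a pyarrow release whose `pa.record_batch()` accepts objects exposing `__arrow_c_array__`, and the query is just an illustrative one-row select:

```python
import pyarrow as pa

from datafusion import SessionContext

ctx = SessionContext()
# Take a single datafusion RecordBatch from a streamed query.
batch = next(iter(ctx.sql("SELECT 1 AS a").execute_stream()))

# pa.record_batch() can import an object exposing __arrow_c_array__,
# so no explicit to_pyarrow() call is needed here.
arrow_batch = pa.record_batch(batch)
```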
python/datafusion/dataframe.py
Outdated
    @deprecated("Use execute_stream() instead")
    def to_record_batch_stream(self) -> RecordBatchStream:
Is this a new method? Why are we creating a new method that's immediately deprecated?
🤦
Removed.
.. code-block:: python

    for stream in df.execute_stream_partitioned():
        for batch in stream:
Interesting. Can these streams be polled concurrently? Can you do `streams = list(df.execute_stream_partitioned())` and then concurrently iterate over all the streams, yielding whatever batch comes in first? I suppose that would just do in Python what `execute_stream` is doing in Rust?
Good question!
I added a concurrent iteration example in the same document to clarify this.
To process partitions concurrently, first collect the streams into a list
and then poll each one in a separate ``asyncio`` task:
.. code-block:: python

    import asyncio

    async def consume(stream):
        async for batch in stream:
            ...

    streams = list(df.execute_stream_partitioned())
    await asyncio.gather(*(consume(s) for s in streams))
:py:meth:`~datafusion.DataFrame.execute_stream` to obtain a
:py:class:`pyarrow.RecordBatchReader`:
`execute_stream` returns a pyarrow `RecordBatchReader`? I thought it returned our own DataFusion-specific stream wrapper, i.e. `datafusion.RecordBatchStream`?
You're right!
Corrected.
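To illustrate the distinction, a sketch reusing `df` from the surrounding example; the `from_stream` import is an assumption about recent pyarrow versions that support the Arrow PyCapsule Interface:

```python
import pyarrow as pa

# execute_stream() yields DataFusion's own wrapper, not a pyarrow type.
stream = df.execute_stream()  # datafusion.RecordBatchStream

# If a pyarrow RecordBatchReader is wanted, import the DataFrame's
# __arrow_c_stream__ capsule on the pyarrow side instead.
reader = pa.RecordBatchReader.from_stream(df)
```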
Each batch exposes ``to_pyarrow()``, allowing conversion to a PyArrow
table without collecting everything eagerly:

.. code-block:: python

    import pyarrow as pa

    table = pa.Table.from_batches(b.to_pyarrow() for b in df)
I think if we provide an example of collecting as a table, we should suggest `pa.table(df)` instead, which will be more efficient because it doesn't go through Python.

And regarding "without collecting everything eagerly": this isn't a good example for that, because `pa.Table` necessarily materializes everything.
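As a sketch of the lazy alternative, processing one batch at a time rather than building a Table; the column name "a" is just an example and `df` is assumed to come from the surrounding docs:

```python
import pyarrow.compute as pc

total = 0
for batch in df.execute_stream():
    pa_batch = batch.to_pyarrow()  # convert a single batch
    value = pc.sum(pa_batch.column("a")).as_py()
    total += value or 0            # only one batch is held in memory at a time
```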
corrected.
    # Count rows
    count = df.count()

PyArrow Streaming
The heading here is specific to PyArrow, but I think it would be good to provide a distinction here. Maybe "Zero-copy Streaming to other Arrow implementations"? Or something like that?
Then we can have a sub-section dedicated to pyarrow, but also explain that it works with any arrow-based python impl.
Good point!
Implemented your suggestion.
Amazing. Happy to review as well. I think it's fine to have some methods that specifically convert to a pyarrow record batch/table. But we should call …
Co-authored-by: Kyle Barron <[email protected]>
Co-authored-by: Kyle Barron <[email protected]>
…handling and limitations
…ntal batch processing and memory efficiency
Co-authored-by: Kyle Barron <[email protected]>
…eam` to `test_table_from_arrow_c_stream`
Co-authored-by: Kyle Barron <[email protected]>
…braries, clarifying the protocol and adding implementation-agnostic notes.
…tion about eager conversion, emphasizing on-demand batch processing to prevent memory exhaustion.
…ated _import_from_c_capsule method
… RecordBatchReader in test_arrow_c_stream_schema_selection
…treaming method and usage of public API
…lity with Python < 3.10
This seems like a massive improvement! I took a skim through and at a high level looks like the correct way to approach it. I will try to take more time this week to review more carefully and to test a few things out myself.
directly to remain compatible with Python < 3.10 (this project
supports Python >= 3.6).
I think our minimum supported version is 3.9 right now
Which issue does this PR close?
Rationale for this change
Exporting DataFrame results via Arrow could previously trigger eager collection of the entire result set, which risks exhausting process memory for large datasets. The project needs a zero-copy, lazy streaming path into PyArrow that:
This PR implements streaming-friendly paths in both the Rust extension and Python bindings, fixes some async/spawn patterns (improving signal handling and runtime usage), and adds tests and documentation to exercise the new behavior.
What changes are included in this PR?
High level

- `__arrow_c_stream__` using a partitioned streaming reader that drains partition streams sequentially and exposes an Arrow `ArrowArrayStream` PyCapsule.
- `DataFrame` iteration and async iteration support (Python): `__iter__` and `__aiter__` returning `RecordBatch` instances.
- `RecordBatch.__arrow_c_array__` for zero-copy export of individual record batches as Arrow C Data Interface capsules.
- `spawn_future` to run DataFusion futures on the Tokio runtime while preserving Python signal handling instead of directly creating `JoinHandle`/blocking joins.
- Tests for `KeyboardInterrupt` and memory behavior when streaming large datasets.
- `tests/utils.py::range_table` used to construct large range tables without expanding the public API.
- "PyArrow Streaming" docs section.
- `cstr` dependency and small Cargo.toml tidy / formatting changes.

Files changed (summary)

Rust

src/dataframe.rs
- `PartitionedDataFrameStreamReader` implementing `RecordBatchReader` that pulls batches from partitioned `SendableRecordBatchStream`s and applies per-batch projection if requested.
- `__arrow_c_stream__` to use `execute_stream_partitioned()` and create an `FFI_ArrowArrayStream` from the `RecordBatchReader` without materializing all batches.
- `cstr` crate.
- `spawn_future` to run async tasks on the Tokio runtime.

src/record_batch.rs
- `poll_next_batch` helper and uses it to unify stream polling logic.
- `next_stream` …

src/utils.rs
- `spawn_future` utility that spawns a future on the shared Tokio runtime and waits for it while preserving Python signal behavior and converting errors appropriately.

src/context.rs
- `spawn_future` for `execute_stream_partitioned` / execution paths.

Python

python/datafusion/dataframe.py
- `__iter__`, `__aiter__` to iterate over `RecordBatch` objects produced by `execute_stream()`.
- `to_record_batch_stream` (alias to `execute_stream`).
- `RecordBatch` …

python/datafusion/record_batch.py
- `__arrow_c_array__` to export a `RecordBatch` via Arrow C Data Interface (two capsules).
- `RecordBatchStream` …

python/tests/*
- `fail_collect` fixture, tests for iteration (`test_iter_batches`, `test_iter_returns_datafusion_recordbatch`), streaming (`test_execute_stream_b…