feat(data): support list[str] URI columns in download() expression by Aydin-ab · Pull Request #64128 · ray-project/ray

Aydin-ab · 2026-06-16T02:57:59Z

Today download() fetches one file per row. This adds support for a column where each cell is a list of file URLs — for example, one row holding all the frame paths of a video — and downloads every file in that row.

What it does

A download() column can now hold a list of URLs per row, not just a single URL. The single-URL case behaves exactly as before.
For a list cell, all the URLs across all rows are fetched together through the same shared connection pool that single-URL downloads already use, then grouped back per row. The output column holds one list of file contents per row, in the same order.
Edge cases: an empty list stays empty, a missing or failed download becomes None in its slot, and a null cell stays null.
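
The flatten, fetch, and re-nest behaviour described above can be sketched in plain Python. This is an illustrative model only, not the PR's code: `flatten_rows`, `renest`, `download_rows`, and the `fetch_all` callback are hypothetical names, and the real implementation works on PyArrow columns rather than Python lists.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Row = Optional[Sequence[str]]  # None models a null cell


def flatten_rows(rows: Sequence[Row]) -> Tuple[List[str], List[Optional[int]]]:
    """Flatten per-row URL lists and record each row's length (None for null)."""
    flat: List[str] = []
    lengths: List[Optional[int]] = []
    for row in rows:
        if row is None:
            lengths.append(None)
        else:
            flat.extend(row)
            lengths.append(len(row))
    return flat, lengths


def renest(
    results: Sequence[Optional[bytes]], lengths: Sequence[Optional[int]]
) -> List[Optional[List[Optional[bytes]]]]:
    """Group flat results back per row, preserving order and null cells."""
    out: List[Optional[List[Optional[bytes]]]] = []
    pos = 0
    for n in lengths:
        if n is None:
            out.append(None)  # null cell stays null
        else:
            out.append(list(results[pos:pos + n]))  # empty list stays empty
            pos += n
    return out


def download_rows(
    rows: Sequence[Row],
    fetch_all: Callable[[List[str]], List[Optional[bytes]]],
) -> List[Optional[List[Optional[bytes]]]]:
    """Fetch every URL of every row in one batch, then re-nest per row."""
    flat, lengths = flatten_rows(rows)
    return renest(fetch_all(flat), lengths)
```

Here `fetch_all` stands in for the shared downloader and is assumed to return one entry per URL, with `None` for a missing or failed download.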

Why

Until now, anyone with several files per row had to write their own per-row thread pool to fetch them. Sending all the URLs through the one shared pool was about 26x faster than per-row pools in a small S3 benchmark (80 rows of 8 files each).
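
For illustration, the two approaches being compared might look roughly like this. Both functions are hypothetical sketches built on the standard library's `ThreadPoolExecutor`; the PR itself routes URLs through Ray Data's existing downloader, and the worker counts here are arbitrary assumptions. Null cells and error handling are left out for brevity.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def fetch_per_row(
    rows: Sequence[Sequence[str]],
    fetch: Callable[[str], bytes],
    workers_per_row: int = 8,
) -> List[List[bytes]]:
    """Hand-rolled pattern: create and tear down a thread pool for every row."""
    out: List[List[bytes]] = []
    for urls in rows:
        with ThreadPoolExecutor(max_workers=workers_per_row) as pool:
            out.append(list(pool.map(fetch, urls)))
    return out


def fetch_shared(
    rows: Sequence[Sequence[str]],
    fetch: Callable[[str], bytes],
    max_workers: int = 32,
) -> List[List[bytes]]:
    """Shared pattern: one pool sees every URL, results are regrouped after."""
    flat = [url for urls in rows for url in urls]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, flat))  # map preserves input order
    out: List[List[bytes]] = []
    pos = 0
    for urls in rows:
        out.append(results[pos:pos + len(urls)])
        pos += len(urls)
    return out
```

Both return the same nested result. The shared version avoids per-row pool setup and keeps all workers busy across row boundaries, which is one plausible reason for the speedup reported above.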

Compatibility

Single-URL downloads are unchanged — the list handling only runs for list-typed columns, so nothing else in Ray Data is affected.

Testing

New test covering mixed list lengths, an empty list, a null cell, and a row mixing a valid and a missing URL.
All existing download tests pass.
Ran end-to-end on a multi-node cluster downloading from S3 across several nodes, with correct results.

gemini-code-assist

Code Review

This pull request introduces support for downloading columns where cells contain lists of URIs (e.g., list<string>), flattening them for concurrent downloading, and re-nesting the downloaded bytes back into a list<binary> column. The feedback focuses on correctness and performance optimizations when working with PyArrow: using column.flatten() in first_inner_uri to correctly respect slice offsets, leveraging vectorized PyArrow compute functions in flatten_uri_list and pa.ListArray.from_arrays in renest_downloaded_bytes to avoid inefficient Python list materialization, and using append_column instead of add_column for cleaner, more idiomatic PyArrow table manipulation.


The row-level download() expression only accepted a scalar str URI per row. Rows that carry multiple files (e.g. a video row with N S3 frame paths) had to hand-roll a per-row ThreadPoolExecutor.

Accept a list<string> column (also large_list / fixed_size_list of (large_)string): flatten every row's URIs into one flat list, run them through the existing concurrent downloader in a single pool, then re-nest into a list<binary> column preserving per-row length and order (empty list -> [], null cell -> null, failed download -> None in place).

Additive: the scalar str path is unchanged -- every list branch is gated behind is_uri_list_column, which is false for scalar columns. Both the obstore and PyArrow-threaded download paths and the partition actor are made list-aware, and the range-split hidden-size-column optimization is deferred for list columns.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>

cursor

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 82c0828.

cursor · 2026-06-16T07:50:09Z

+                output_block = output_block.append_column(
+                    output_bytes_column_name,
+                    renest_downloaded_bytes([], row_lengths),
+                )


Empty block missing scalar columns

Medium Severity

When the first URI column is a list<string> type, a zero-row block no longer takes the early exit, and list columns still get an empty list<binary> output column appended. Later scalar URI columns hit the len(uris) == 0 / not uris check and continue without adding their bytes columns, so the resulting table schema no longer matches that of blocks that have rows, or of an empty block whose first URI column is scalar.
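
The invariant at stake can be shown with a toy model in which a block is a dict of column lists. Everything here is hypothetical (`add_bytes_columns`, the `_bytes` suffix, the `fetch` callback) and is not the code under review; it only shows that output columns should be added even when there is nothing to download.

```python
from typing import Callable, Dict, List


def add_bytes_columns(
    block: Dict[str, list],
    uri_columns: List[str],
    fetch: Callable[[str], bytes],
) -> Dict[str, list]:
    """Add one output column per URI column, including for zero-row blocks."""
    out = dict(block)
    for name in uri_columns:
        cells = block[name]
        # No early `continue` when `cells` is empty: skipping here would leave
        # empty blocks with fewer columns than blocks that have rows.
        out[f"{name}_bytes"] = [
            None if cell is None
            else [fetch(u) for u in cell] if isinstance(cell, list)
            else fetch(cell)
            for cell in cells
        ]
    return out
```

With this shape, an empty block and a populated block end up with the same column names regardless of whether the list column or the scalar column comes first.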

Additional Locations (1)

python/ray/data/_internal/planner/_obstore_download.py#L457-L461


gemini-code-assist Bot reviewed Jun 16, 2026


Aydin-ab force-pushed the extend-row-download-list-paths branch from 7161e80 to 9b2e250 Compare June 16, 2026 03:11

Aydin-ab marked this pull request as ready for review June 16, 2026 03:42

Aydin-ab requested a review from a team as a code owner June 16, 2026 03:42

Aydin-ab force-pushed the extend-row-download-list-paths branch from 9b2e250 to ce4465d Compare June 16, 2026 04:51

Aydin-ab force-pushed the extend-row-download-list-paths branch from ce4465d to 82c0828 Compare June 16, 2026 07:44

cursor Bot reviewed Jun 16, 2026


Aydin-ab wants to merge 1 commit into ray-project:master from Aydin-ab:extend-row-download-list-paths
