
feat(data): support list[str] URI columns in download() expression #64128

Open
Aydin-ab wants to merge 1 commit into ray-project:master from Aydin-ab:extend-row-download-list-paths

@Aydin-ab Aydin-ab commented Jun 16, 2026

Contributor

Today download() fetches one file per row. This adds support for a column where each cell is a list of file URLs — for example, one row holding all the frame paths of a video — and downloads every file in that row.

What it does

  • A download() column can now hold a list of URLs per row, not just a single URL. The single-URL case behaves exactly as before.
  • For a list column, the URLs from all rows are flattened and fetched together through the same shared connection pool that single-URL downloads already use, then regrouped per row. The output column holds one list of file contents per row, in the same order as the input URLs.
  • Edge cases: an empty list stays empty, a missing or failed download becomes None in its slot, and a null cell stays null.
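The flatten, fetch, and regroup behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the PR's implementation: `download_list_column`, `fetch`, and the in-memory `store` are hypothetical names, and the real code operates on Arrow columns rather than Python lists.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional


def download_list_column(
    cells: List[Optional[List[str]]],
    fetch: Callable[[str], bytes],
    max_workers: int = 8,
) -> List[Optional[List[Optional[bytes]]]]:
    """Hypothetical helper: fetch every URI in a list-typed column via one pool."""
    # Flatten: collect the URIs of all non-null cells into one list.
    flat_uris = [uri for cell in cells if cell is not None for uri in cell]

    def safe_fetch(uri: str) -> Optional[bytes]:
        # A missing or failed download becomes None in its slot.
        try:
            return fetch(uri)
        except Exception:
            return None

    # One shared pool for the whole column; map() preserves input order.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        flat_bytes = list(pool.map(safe_fetch, flat_uris))

    # Regroup: slice the flat results back into per-row lists.
    out: List[Optional[List[Optional[bytes]]]] = []
    pos = 0
    for cell in cells:
        if cell is None:
            out.append(None)  # a null cell stays null
            continue
        out.append(flat_bytes[pos : pos + len(cell)])  # an empty list stays empty
        pos += len(cell)
    return out


# Hypothetical in-memory "bucket" standing in for remote storage.
store = {"s3://b/f0": b"frame0", "s3://b/f1": b"frame1"}
cells = [["s3://b/f0", "s3://b/f1"], [], None, ["s3://b/f0", "s3://b/missing"]]
result = download_list_column(cells, lambda uri: store[uri])
```

Because the results are sliced back by each row's length, per-row order is preserved even though all URIs share one pool.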

Why

Until now, anyone with several files per row had to write their own per-row thread pool to fetch them. Sending all the URLs through the one shared pool was about 26x faster than per-row pools in a small S3 benchmark (80 rows of 8 files each).
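For context, the two strategies being compared look roughly like this. Both functions are hypothetical sketches written for this comparison (they are not the benchmark code, and `fetch` stands in for a real downloader); the point is only that the first creates one thread pool per row while the second creates one pool in total.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def fetch_rows_with_per_row_pools(
    rows: List[List[str]], fetch: Callable[[str], bytes], workers_per_row: int = 8
) -> List[List[bytes]]:
    """The hand-rolled workaround: a fresh thread pool for every row."""
    results = []
    for uris in rows:
        # Pool setup and teardown cost is paid once per row.
        with ThreadPoolExecutor(max_workers=workers_per_row) as pool:
            results.append(list(pool.map(fetch, uris)))
    return results


def fetch_rows_with_shared_pool(
    rows: List[List[str]], fetch: Callable[[str], bytes], workers: int = 8
) -> List[List[bytes]]:
    """All URIs from all rows go through a single pool, then get regrouped."""
    flat = [uri for uris in rows for uri in uris]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        flat_results = iter(list(pool.map(fetch, flat)))
    # Consume the flat results in order, one row's worth at a time.
    return [[next(flat_results) for _ in uris] for uris in rows]


rows = [["a", "b"], ["c"], []]
fake_fetch = lambda uri: uri.upper().encode()  # stand-in for a network call
per_row = fetch_rows_with_per_row_pools(rows, fake_fetch)
shared = fetch_rows_with_shared_pool(rows, fake_fetch)
```

Both return the same nested result; with a per-row pool, concurrency is also capped at the number of files in each row, whereas a shared pool can overlap downloads across rows.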

Compatibility

Single-URL downloads are unchanged — the list handling only runs for list-typed columns, so nothing else in Ray Data is affected.

Testing

  • New test covering mixed list lengths, an empty list, a null cell, and a row mixing a valid and a missing URL.
  • All existing download tests pass.
  • Ran end-to-end on a multi-node cluster downloading from S3 across several nodes, with correct results.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor


Code Review

This pull request introduces support for downloading columns where cells contain lists of URIs (e.g., list<string>), flattening them for concurrent downloading, and re-nesting the downloaded bytes back into a list<binary> column. The feedback focuses on correctness and performance optimizations when working with PyArrow: using column.flatten() in first_inner_uri to correctly respect slice offsets, leveraging vectorized PyArrow compute functions in flatten_uri_list and pa.ListArray.from_arrays in renest_downloaded_bytes to avoid inefficient Python list materialization, and using append_column instead of add_column for cleaner, more idiomatic PyArrow table manipulation.


Comment thread python/ray/data/_internal/planner/_download_list_utils.py
Comment thread python/ray/data/_internal/planner/_download_list_utils.py
Comment thread python/ray/data/_internal/planner/_download_list_utils.py
Comment thread python/ray/data/_internal/planner/plan_download_op.py Outdated
Comment thread python/ray/data/_internal/planner/plan_download_op.py Outdated
@Aydin-ab Aydin-ab force-pushed the extend-row-download-list-paths branch from 7161e80 to 9b2e250 Compare June 16, 2026 03:11
@Aydin-ab Aydin-ab marked this pull request as ready for review June 16, 2026 03:42
@Aydin-ab Aydin-ab requested a review from a team as a code owner June 16, 2026 03:42
@Aydin-ab Aydin-ab force-pushed the extend-row-download-list-paths branch from 9b2e250 to ce4465d Compare June 16, 2026 04:51
The row-level download() expression only accepted a scalar str URI per row.
Rows that carry multiple files (e.g. a video row with N S3 frame paths) had
to hand-roll a per-row ThreadPoolExecutor. Accept a list<string> column
(also large_list / fixed_size_list of (large_)string): flatten every row's
URIs into one flat list, run them through the existing concurrent downloader
in a single pool, then re-nest into a list<binary> column preserving per-row
length and order (empty list -> [], null cell -> null, failed download ->
None in place).

Additive: the scalar str path is unchanged -- every list branch is gated
behind is_uri_list_column, which is false for scalar columns. Both the
obstore and PyArrow-threaded download paths and the partition actor are made
list-aware, and the range-split hidden-size-column optimization is deferred
for list columns.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
@Aydin-ab Aydin-ab force-pushed the extend-row-download-list-paths branch from ce4465d to 82c0828 Compare June 16, 2026 07:44

@cursor cursor Bot left a comment


Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 82c0828. Configure here.

output_block = output_block.append_column(
    output_bytes_column_name,
    renest_downloaded_bytes([], row_lengths),
)


Empty block missing scalar columns

Medium Severity

When the first URI column is a list<string> type, a zero-row block no longer takes the early exit and list columns still get an empty list<binary> output appended. Later scalar URI columns hit len(uris) == 0 / not uris and continue without adding their bytes columns, so the table schema no longer matches blocks that have rows or a scalar-first empty block.
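The invariant at stake is that a zero-row block and a block with rows end up with the same output columns. A minimal sketch of that invariant, using a plain dict of lists as a hypothetical stand-in for an Arrow table (`add_bytes_columns`, `download_cell`, and the `_bytes` suffix are illustrative names, not the PR's):

```python
from typing import Callable, Dict, List


def download_cell(cell, fetch: Callable[[str], bytes]):
    """Fetch a scalar URI cell or a list-of-URIs cell; null stays null."""
    if cell is None:
        return None
    if isinstance(cell, list):
        return [fetch(uri) for uri in cell]
    return fetch(cell)


def add_bytes_columns(
    block: Dict[str, list], uri_columns: List[str], fetch: Callable[[str], bytes]
) -> Dict[str, list]:
    """Append one output column per URI column, even when the block is empty."""
    out = dict(block)
    for name in uri_columns:
        # No early `continue` on empty input: a zero-row block still gets an
        # (empty) output column, so its column set matches non-empty blocks.
        out[name + "_bytes"] = [download_cell(cell, fetch) for cell in block[name]]
    return out


fake_fetch = lambda uri: uri.encode()  # stand-in for a real download
empty_block = {"frames": [], "thumb": []}
full_block = {"frames": [["s3://b/f0"]], "thumb": ["s3://b/t0"]}
empty_out = add_bytes_columns(empty_block, ["frames", "thumb"], fake_fetch)
full_out = add_bytes_columns(full_block, ["frames", "thumb"], fake_fetch)
```

With this shape, mixing a list-typed first column with later scalar columns cannot produce blocks whose column sets differ.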
