feat(data): support list[str] URI columns in download() expression #64128
Aydin-ab wants to merge 1 commit
Code Review
This pull request introduces support for downloading columns where cells contain lists of URIs (e.g., list<string>), flattening them for concurrent downloading, and re-nesting the downloaded bytes back into a list<binary> column. The feedback focuses on correctness and performance optimizations when working with PyArrow: using column.flatten() in first_inner_uri to correctly respect slice offsets, leveraging vectorized PyArrow compute functions in flatten_uri_list and pa.ListArray.from_arrays in renest_downloaded_bytes to avoid inefficient Python list materialization, and using append_column instead of add_column for cleaner, more idiomatic PyArrow table manipulation.
The row-level download() expression only accepted a scalar str URI per row. Rows that carry multiple files (e.g. a video row with N S3 frame paths) had to hand-roll a per-row ThreadPoolExecutor.

Accept a list<string> column (also large_list / fixed_size_list of (large_)string): flatten every row's URIs into one flat list, run them through the existing concurrent downloader in a single pool, then re-nest into a list<binary> column preserving per-row length and order (empty list -> [], null cell -> null, failed download -> None in place).

Additive: the scalar str path is unchanged -- every list branch is gated behind is_uri_list_column, which is false for scalar columns. Both the obstore and PyArrow-threaded download paths and the partition actor are made list-aware, and the range-split hidden-size-column optimization is deferred for list columns.

Signed-off-by: Aydin Abiar <aydin@anyscale.com>
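The flatten, download, re-nest flow described in this commit message can be sketched in plain Python. This is an illustration under stated assumptions, not the PR's implementation: `flatten_rows`, `renest`, `download_nested`, and `fake_fetch` are hypothetical names, and a stub stands in for the real downloader.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Tuple

Row = Optional[List[str]]


def flatten_rows(rows: List[Row]) -> Tuple[List[str], List[Optional[int]]]:
    """Collect every URI into one flat list and remember each row's length.

    A null cell is recorded as a length of None so it can be restored later.
    """
    flat: List[str] = []
    lengths: List[Optional[int]] = []
    for row in rows:
        if row is None:
            lengths.append(None)
        else:
            lengths.append(len(row))
            flat.extend(row)
    return flat, lengths


def renest(flat_results: list, lengths: List[Optional[int]]) -> list:
    """Cut the flat results back into per-row lists, preserving order."""
    out, pos = [], 0
    for n in lengths:
        if n is None:
            out.append(None)  # null cell stays null
        else:
            out.append(list(flat_results[pos:pos + n]))  # empty list -> []
            pos += n
    return out


def download_nested(rows: List[Row], fetch: Callable[[str], bytes],
                    max_workers: int = 8) -> list:
    """Download all URIs of all rows through a single shared thread pool."""
    flat, lengths = flatten_rows(rows)

    def safe_fetch(uri: str) -> Optional[bytes]:
        try:
            return fetch(uri)
        except Exception:
            return None  # failed download -> None in place

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(safe_fetch, flat))  # map() keeps input order
    return renest(results, lengths)


def fake_fetch(uri: str) -> bytes:
    """Stub downloader (assumption): fails for URIs ending in 'missing'."""
    if uri.endswith("missing"):
        raise IOError("not found")
    return uri.encode()


rows = [["s3://b/a", "s3://b/missing"], [], None, ["s3://b/c"]]
nested = download_nested(rows, fake_fetch)
```

Because the pool is created once for the whole block rather than once per row, every URI competes for the same workers, which is the design choice the commit message describes.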
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 82c0828.
    output_block = output_block.append_column(
        output_bytes_column_name,
        renest_downloaded_bytes([], row_lengths),
    )
Empty block missing scalar columns
Medium Severity
When the first URI column is a list<string> type, a zero-row block no longer takes the early exit and list columns still get an empty list<binary> output appended. Later scalar URI columns hit len(uris) == 0 / not uris and continue without adding their bytes columns, so the table schema no longer matches blocks that have rows or a scalar-first empty block.


Today, download() fetches one file per row. This adds support for a column where each cell is a list of file URLs (for example, one row holding all the frame paths of a video) and downloads every file in that row.
What it does
A download() column can now hold a list of URLs per row, not just a single URL. The single-URL case behaves exactly as before.
A failed download leaves None in its slot, and a null cell stays null.
Why
Until now, anyone with several files per row had to write their own per-row thread pool to fetch them. Sending all the URLs through the one shared pool was about 26x faster than per-row pools in a small S3 benchmark (80 rows of 8 files each).
Compatibility
Single-URL downloads are unchanged — the list handling only runs for list-typed columns, so nothing else in Ray Data is affected.
Testing