CLI binary download without include_binary; parallelize HTTP downloads #5587

n0nick · 2025-12-11T22:19:31Z

This PR addresses several inefficiencies in binary data download logic:

Remove IncludeBinary parameter usage: Always download binary data via HTTP using the URI from metadata, since IncludeBinary is considered harmful.
Eliminate redundant metadata fetching:
- Filter-based downloads previously fetched metadata twice (once in getMatchingBinaryIDs, discarded, then again in downloadBinary).
- Dataset downloads fetched metadata three times (once for IDs, once in downloadBinary, once in binaryDataToJSONLines).
- Now all paths collect metadata once and reuse it.
Parallelize downloads for ID-based path: Previously sequential downloads when multiple IDs were provided; now uses worker pool pattern similar to
filter-based downloads.
Unify download logic: Both filter-based and ID-based paths now use the same parallel download implementation via downloadBinaryFilesInParallel.

Key changes:

Created downloadSingleBinaryFile: Extracts single-file download logic (metadata JSON writing, HTTP download with retry, file writing, gzip handling).
Created downloadBinaryFilesInParallel: Reusable parallel download function using worker pool pattern for both download paths.
Replaced getMatchingBinaryIDs with getMatchingBinaryMetadata: Returns full BinaryData objects with metadata (including URIs) instead of just IDs.
Refactored binaryData (filter-based): Collects all metadata once, then downloads in parallel.
Refactored dataExportBinaryIDsAction (ID-based): Collects all metadata once, then downloads in parallel.
Refactored downloadDataset: Eliminates triple metadata fetching, downloads files and writes JSON lines in parallel.
Removed downloadBinary function: No longer needed after refactor.

Benefits:

Removed IncludeBinary usage to prefer the HTTP file server download.
Overall reduction in metadata API calls across download actions.
Significant performance improvement for ID-based downloads (now parallelized).
Consistent behavior across all download paths.

This PR addresses several inefficiencies in binary data download logic: 1. **Remove IncludeBinary parameter usage**: Always download binary data via HTTP using the URI from metadata, since [IncludeBinary is considered harmful](https://docs.google.com/document/d/1Ex2wnnNp5rKiwb6RJz_ruViFTmeRTi8F9X6CjcW0QKM/edit?tab=t.0). 2. **Eliminate redundant metadata fetching**: - Filter-based downloads previously fetched metadata twice (once in getMatchingBinaryIDs, discarded, then again in downloadBinary). - Dataset downloads fetched metadata three times (once for IDs, once in downloadBinary, once in binaryDataToJSONLines). - Now all paths collect metadata once and reuse it. 3. **Parallelize downloads for ID-based path**: Previously sequential downloads when multiple IDs were provided; now uses worker pool pattern similar to filter-based downloads. 4. **Unify download logic**: Both filter-based and ID-based paths now use the same parallel download implementation via downloadBinaryFilesInParallel. Key changes: - Created downloadSingleBinaryFile: Extracts single-file download logic (metadata JSON writing, HTTP download with retry, file writing, gzip handling). - Created downloadBinaryFilesInParallel: Reusable parallel download function using worker pool pattern for both download paths. - Replaced getMatchingBinaryIDs with getMatchingBinaryMetadata: Returns full BinaryData objects with metadata (including URIs) instead of just IDs. - Refactored binaryData (filter-based): Collects all metadata once, then downloads in parallel. - Refactored dataExportBinaryIDsAction (ID-based): Collects all metadata once, then downloads in parallel. - Refactored downloadDataset: Eliminates triple metadata fetching, downloads files and writes JSON lines in parallel. - Removed deprecated downloadBinary function: No longer needed after refactor. Benefits: - Removed IncludeBinary usage to prefer the HTTP file server download. - Overall reduction in metadata API calls across download actions. - Significant performance improvement for ID-based downloads (now parallelized). - Consistent behavior across all download paths.

viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 11, 2025

n0nick requested review from gloriacai01 and tahiyasalam and removed request for gloriacai01 and tahiyasalam December 11, 2025 22:21

n0nick marked this pull request as draft December 17, 2025 17:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

CLI binary download without include_binary; parallelize HTTP downloads #5587

CLI binary download without include_binary; parallelize HTTP downloads #5587

Uh oh!

n0nick commented Dec 11, 2025 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

CLI binary download without include_binary; parallelize HTTP downloads #5587

Are you sure you want to change the base?

CLI binary download without include_binary; parallelize HTTP downloads #5587

Uh oh!

Conversation

n0nick commented Dec 11, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

n0nick commented Dec 11, 2025 •

edited

Loading