Skip to content

Conversation

@n0nick
Copy link
Member

@n0nick n0nick commented Dec 11, 2025

This PR addresses several inefficiencies in binary data download logic:

  1. Remove IncludeBinary parameter usage: Always download binary data via HTTP using the URI from metadata, since IncludeBinary is considered harmful.
  2. Eliminate redundant metadata fetching:
    • Filter-based downloads previously fetched metadata twice (once in getMatchingBinaryIDs, discarded, then again in downloadBinary).
    • Dataset downloads fetched metadata three times (once for IDs, once in downloadBinary, once in binaryDataToJSONLines).
    • Now all paths collect metadata once and reuse it.
  3. Parallelize downloads for ID-based path: Previously sequential downloads when multiple IDs were provided; now uses worker pool pattern similar to
    filter-based downloads.
  4. Unify download logic: Both filter-based and ID-based paths now use the same parallel download implementation via downloadBinaryFilesInParallel.

Key changes:

  • Created downloadSingleBinaryFile: Extracts single-file download logic (metadata JSON writing, HTTP download with retry, file writing, gzip handling).
  • Created downloadBinaryFilesInParallel: Reusable parallel download function using worker pool pattern for both download paths.
  • Replaced getMatchingBinaryIDs with getMatchingBinaryMetadata: Returns full BinaryData objects with metadata (including URIs) instead of just IDs.
  • Refactored binaryData (filter-based): Collects all metadata once, then downloads in parallel.
  • Refactored dataExportBinaryIDsAction (ID-based): Collects all metadata once, then downloads in parallel.
  • Refactored downloadDataset: Eliminates triple metadata fetching, downloads files and writes JSON lines in parallel.
  • Removed downloadBinary function: No longer needed after refactor.

Benefits:

  • Removed IncludeBinary usage to prefer the HTTP file server download.
  • Overall reduction in metadata API calls across download actions.
  • Significant performance improvement for ID-based downloads (now parallelized).
  • Consistent behavior across all download paths.

This PR addresses several inefficiencies in binary data download logic:

1. **Remove IncludeBinary parameter usage**: Always download binary data via HTTP using the URI from metadata, since [IncludeBinary is considered harmful](https://docs.google.com/document/d/1Ex2wnnNp5rKiwb6RJz_ruViFTmeRTi8F9X6CjcW0QKM/edit?tab=t.0).

2. **Eliminate redundant metadata fetching**:
   - Filter-based downloads previously fetched metadata twice (once in getMatchingBinaryIDs, discarded, then again in downloadBinary).
   - Dataset downloads fetched metadata three times (once for IDs, once in downloadBinary, once in binaryDataToJSONLines).
   - Now all paths collect metadata once and reuse it.

3. **Parallelize downloads for ID-based path**: Previously sequential downloads when multiple IDs were provided; now uses worker pool pattern similar to
   filter-based downloads.

4. **Unify download logic**: Both filter-based and ID-based paths now use the same parallel download implementation via downloadBinaryFilesInParallel.

Key changes:

- Created downloadSingleBinaryFile: Extracts single-file download logic (metadata JSON writing, HTTP download with retry, file writing, gzip handling).

- Created downloadBinaryFilesInParallel: Reusable parallel download function using worker pool pattern for both download paths.

- Replaced getMatchingBinaryIDs with getMatchingBinaryMetadata: Returns full BinaryData objects with metadata (including URIs) instead of just IDs.

- Refactored binaryData (filter-based): Collects all metadata once, then downloads in parallel.

- Refactored dataExportBinaryIDsAction (ID-based): Collects all metadata once, then downloads in parallel.

- Refactored downloadDataset: Eliminates triple metadata fetching, downloads files and writes JSON lines in parallel.

- Removed deprecated downloadBinary function: No longer needed after refactor.

Benefits:

- Removed IncludeBinary usage to prefer the HTTP file server download.
- Overall reduction in metadata API calls across download actions.
- Significant performance improvement for ID-based downloads (now parallelized).
- Consistent behavior across all download paths.
@viambot viambot added the safe to test This pull request is marked safe to test from a trusted zone label Dec 11, 2025
@n0nick n0nick requested review from gloriacai01 and tahiyasalam and removed request for gloriacai01 and tahiyasalam December 11, 2025 22:21
@n0nick n0nick marked this pull request as draft December 17, 2025 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test This pull request is marked safe to test from a trusted zone

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants