
Reduce peak RSS of cross-compendium DuckDB checks via two-pass queries#703

Open
gaurav wants to merge 5 commits into `master` from `fix-large-duckdb-queries`

Conversation

Collaborator

@gaurav gaurav commented Mar 30, 2026

The three cross-compendium report rules (check_for_identically_labeled_cliques, check_for_duplicate_curies, check_for_duplicate_clique_leaders) previously requested 1500G of SLURM RAM. The bottleneck was LIST() aggregations over all Edge/Clique Parquet files — DuckDB cannot spill LIST() state to disk, so every growing array stayed in RAM.

Replace each single-pass LIST() query with a two-pass approach:

  • Pass 1: pure COUNT(*) GROUP BY to find duplicate keys; fully spill-able
  • Pass 2: LIST() collected only for the confirmed duplicates (a small join against the pass-1 result), keeping list materialization trivial
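As a sketch, the two-pass pattern looks roughly like the following (table names, column names, and the Parquet glob here are illustrative, not the actual report schema):

```sql
-- Pass 1: duplicate detection with only COUNT(*) state, which DuckDB can
-- spill to its temp directory under memory pressure.
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT curie
FROM read_parquet('compendia/*.parquet')
GROUP BY curie
HAVING COUNT(*) > 1;

-- Pass 2: LIST() runs only over rows whose key is a confirmed duplicate,
-- so the in-memory list state stays small.
SELECT e.curie,
       LIST(e.filename ORDER BY e.filename) AS filenames
FROM read_parquet('compendia/*.parquet') AS e
JOIN duplicate_keys USING (curie)
GROUP BY e.curie;
```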

Also restores the biolink_type and clique_identifier_count columns in check_for_duplicate_clique_leaders that were previously dropped with a comment saying they used too much memory.

Lower Snakemake mem= and DuckDB memory_limit conservatively; right-size further after observing max_rss from benchmark TSVs on a real run.
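For reference, the DuckDB knobs involved are per-connection settings; the values below are placeholders, not the limits actually chosen in this PR:

```sql
SET memory_limit = '64GB';                     -- hard cap; pass 1 can spill below this
SET threads = 4;                               -- should match the SLURM CPU allocation
SET temp_directory = '/scratch/duckdb_spill';  -- hypothetical path for spilled state
```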


Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@gaurav gaurav requested a review from Copilot March 30, 2026 21:29
@gaurav gaurav moved this from Backlog to In progress in Babel sprints Mar 30, 2026
@gaurav gaurav marked this pull request as ready for review March 30, 2026 21:30
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces peak RAM usage for cross-compendium DuckDB report rules by replacing single-pass LIST() aggregations (which cannot spill) with a two-pass approach: (1) spill-able COUNT(*) GROUP BY to find duplicates, then (2) LIST() only over the confirmed duplicate keys. It also lowers Snakemake/SLURM memory requests and DuckDB memory_limit accordingly.

Changes:

  • Rewrites three cross-compendium DuckDB reports to use a two-pass duplicate-detection strategy.
  • Restores additional output columns for duplicate clique leaders now that LIST() is scoped to duplicates.
  • Reduces SLURM mem and DuckDB memory_limit, and increases DuckDB thread count for these rules.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/snakefiles/duckdb.snakefile | Lowers RAM requests / DuckDB limits for the three cross-compendium report rules and adjusts DuckDB thread settings. |
| src/reports/duckdb_reports.py | Implements two-pass DuckDB queries to avoid large non-spillable LIST() state; restores previously dropped columns for one report. |
| slurm/README.md | Updates documented “Known Resource Hotspots” RAM requirements for the affected rules. |


gaurav and others added 4 commits March 30, 2026 18:27
… report rules

- Replace CREATE TEMP TABLE with CREATE OR REPLACE TEMP TABLE in all three
  two-pass report functions so they are safe to call more than once in a process.
- Add cpus_per_task=4 to the resources block of all three cross-compendium
  DuckDB rules to make CPU allocation explicit and consistent with the DuckDB
  threads setting.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
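The idempotency fix above relies on DuckDB's CREATE OR REPLACE support for temp tables: plain CREATE TEMP TABLE raises an error if the table already exists in the same connection. A minimal illustration (names hypothetical):

```sql
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT clique_leader
FROM cliques
GROUP BY clique_leader
HAVING COUNT(*) > 1;
-- Running the same statement again simply replaces the temp table
-- instead of failing with "table already exists".
```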
In check_for_duplicate_curies, add ORDER BY e.clique_leader to the filenames
and conflations aggregates so they stay aligned with the already-sorted clique_leaders list.

In check_for_duplicate_clique_leaders, add ORDER BY c.filename to all three
LIST() aggregates so biolink_types and clique_identifier_counts remain positionally
aligned with filenames and output is deterministic.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
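DuckDB accepts an ORDER BY clause inside aggregate functions, so giving every LIST() in a query the same sort key keeps the resulting arrays positionally aligned. A sketch of the shape described above (column names taken from the PR; table and join names hypothetical):

```sql
SELECT c.clique_leader,
       LIST(c.filename               ORDER BY c.filename) AS filenames,
       LIST(c.biolink_type           ORDER BY c.filename) AS biolink_types,
       LIST(c.clique_identifier_count ORDER BY c.filename) AS clique_identifier_counts
FROM cliques AS c
JOIN duplicate_leaders USING (clique_leader)
GROUP BY c.clique_leader;
```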
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


