Reduce peak RSS of cross-compendium DuckDB checks via two-pass queries #703
Conversation
The three cross-compendium report rules (`check_for_identically_labeled_cliques`, `check_for_duplicate_curies`, `check_for_duplicate_clique_leaders`) previously requested 1500G of SLURM RAM. The bottleneck was `LIST()` aggregations over all Edge/Clique Parquet files — DuckDB cannot spill `LIST()` state to disk, so every growing array stayed in RAM.

Replace each single-pass `LIST()` query with a two-pass approach:
- Pass 1: a pure `COUNT(*) GROUP BY` to find duplicate keys; fully spill-able.
- Pass 2: `LIST()` collected only for the confirmed duplicates (a small join against the pass-1 result), keeping list materialization trivial.

Also restores the `biolink_type` and `clique_identifier_count` columns in `check_for_duplicate_clique_leaders` that were previously dropped with a comment saying they used too much memory.

Lower Snakemake `mem=` and DuckDB `memory_limit` conservatively; right-size further after observing `max_rss` from benchmark TSVs on a real run.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
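The two passes above can be sketched as a pair of DuckDB queries. This is a minimal illustration, not the repository's exact code: the Parquet glob, table name `duplicate_keys`, and the use of `clique_leader` as the grouping key are assumptions for the sketch.

```sql
-- Pass 1: find duplicate keys with a counts-only aggregation.
-- COUNT(*) GROUP BY state can spill to disk, so peak RSS stays bounded.
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT clique_leader
FROM read_parquet('cliques/*.parquet')      -- illustrative path
GROUP BY clique_leader
HAVING COUNT(*) > 1;

-- Pass 2: materialize LIST() only for the confirmed duplicates.
-- The join shrinks the input, so the in-memory list state stays trivial.
SELECT c.clique_leader,
       LIST(c.filename ORDER BY c.filename) AS filenames
FROM read_parquet('cliques/*.parquet') c
JOIN duplicate_keys USING (clique_leader)
GROUP BY c.clique_leader;
```

Because duplicates are typically a tiny fraction of all keys, pass 2 builds lists for only that fraction instead of for every group.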
Pull request overview
This PR reduces peak RAM usage for cross-compendium DuckDB report rules by replacing single-pass LIST() aggregations (which cannot spill) with a two-pass approach: (1) spill-able COUNT(*) GROUP BY to find duplicates, then (2) LIST() only over the confirmed duplicate keys. It also lowers Snakemake/SLURM memory requests and DuckDB memory_limit accordingly.
Changes:
- Rewrites three cross-compendium DuckDB reports to use a two-pass duplicate-detection strategy.
- Restores additional output columns for duplicate clique leaders now that `LIST()` is scoped to duplicates.
- Reduces SLURM `mem` and DuckDB `memory_limit`, and increases DuckDB thread count for these rules.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `src/snakefiles/duckdb.snakefile` | Lowers RAM requests / DuckDB limits for the three cross-compendium report rules and adjusts DuckDB thread settings. |
| `src/reports/duckdb_reports.py` | Implements two-pass DuckDB queries to avoid large non-spillable `LIST()` state; restores previously dropped columns for one report. |
| `slurm/README.md` | Updates documented “Known Resource Hotspots” RAM requirements for the affected rules. |
… report rules

- Replace `CREATE TEMP TABLE` with `CREATE OR REPLACE TEMP TABLE` in all three two-pass report functions so they are safe to call more than once in a process.
- Add `cpus_per_task=4` to the resources block of all three cross-compendium DuckDB rules to make CPU allocation explicit and consistent with the DuckDB `threads` setting.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
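The idempotency point can be shown in one line; the table name and query here are illustrative, not the repository's actual code:

```sql
-- CREATE TEMP TABLE errors on a second call in the same connection;
-- OR REPLACE drops any prior copy first, so the function is re-runnable.
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT clique_leader
FROM read_parquet('cliques/*.parquet')      -- illustrative path
GROUP BY clique_leader
HAVING COUNT(*) > 1;
```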
- In `check_for_duplicate_curies`, add `ORDER BY e.clique_leader` to the `filenames` and `conflations` aggregates so they stay aligned with the already-sorted `clique_leaders` list.
- In `check_for_duplicate_clique_leaders`, add `ORDER BY c.filename` to all three `LIST()` aggregates so `biolink_types` and `clique_identifier_counts` remain positionally aligned with `filenames` and output is deterministic.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
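The alignment fix relies on DuckDB's `ORDER BY` clause inside aggregate functions. A sketch of the duplicate-clique-leaders shape (the Parquet path and join table `duplicate_leaders` are assumptions; the column names come from the report):

```sql
-- All three LIST() aggregates share ORDER BY c.filename, so the arrays are
-- positionally aligned row-for-row and the report output is deterministic.
SELECT c.clique_leader,
       LIST(c.filename               ORDER BY c.filename) AS filenames,
       LIST(c.biolink_type           ORDER BY c.filename) AS biolink_types,
       LIST(c.clique_identifier_count ORDER BY c.filename) AS clique_identifier_counts
FROM read_parquet('cliques/*.parquet') c      -- illustrative path
JOIN duplicate_leaders USING (clique_leader)  -- pass-1 result
GROUP BY c.clique_leader;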
Pull request overview
Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.