
Reduce peak RSS of cross-compendium DuckDB checks via two-pass queries#703

Open
gaurav wants to merge 5 commits into `master` from `fix-large-duckdb-queries`

Conversation

Collaborator

@gaurav gaurav commented Mar 30, 2026

The three cross-compendium report rules (check_for_identically_labeled_cliques, check_for_duplicate_curies, check_for_duplicate_clique_leaders) previously requested 1500G of SLURM RAM. The bottleneck was LIST() aggregations over all Edge/Clique Parquet files — DuckDB cannot spill LIST() state to disk, so every growing array stayed in RAM.

Replace each single-pass LIST() query with a two-pass approach:

  • Pass 1: pure COUNT(*) GROUP BY to find duplicate keys; fully spill-able
  • Pass 2: LIST() collected only for the confirmed duplicates (a small join against the pass-1 result), keeping list materialization trivial
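As a sketch, the two-pass pattern looks roughly like the following (table names, column names, and the Parquet glob here are illustrative, not the actual report schema):

```sql
-- Pass 1: duplicate detection with only COUNT(*) state, which DuckDB can
-- spill to its temp directory under memory pressure.
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT curie
FROM read_parquet('compendia/*.parquet')
GROUP BY curie
HAVING COUNT(*) > 1;

-- Pass 2: LIST() runs only over rows whose key is a confirmed duplicate,
-- so the in-memory list state stays small.
SELECT e.curie,
       LIST(e.filename ORDER BY e.filename) AS filenames
FROM read_parquet('compendia/*.parquet') AS e
JOIN duplicate_keys USING (curie)
GROUP BY e.curie;
```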

Also restores the biolink_type and clique_identifier_count columns in check_for_duplicate_clique_leaders that were previously dropped with a comment saying they used too much memory.

Lower Snakemake mem= and DuckDB memory_limit conservatively; right-size further after observing max_rss from benchmark TSVs on a real run.
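For reference, the DuckDB knobs involved are per-connection settings; the values below are placeholders, not the limits actually chosen in this PR:

```sql
SET memory_limit = '64GB';                     -- hard cap; pass 1 can spill below this
SET threads = 4;                               -- should match the SLURM CPU allocation
SET temp_directory = '/scratch/duckdb_spill';  -- hypothetical path for spilled state
```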


Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
@gaurav gaurav requested a review from Copilot March 30, 2026 21:29
@gaurav gaurav moved this from Backlog to In progress in Babel sprints Mar 30, 2026
@gaurav gaurav marked this pull request as ready for review March 30, 2026 21:30
Contributor

Copilot AI left a comment


Pull request overview

This PR reduces peak RAM usage for cross-compendium DuckDB report rules by replacing single-pass LIST() aggregations (which cannot spill) with a two-pass approach: (1) spill-able COUNT(*) GROUP BY to find duplicates, then (2) LIST() only over the confirmed duplicate keys. It also lowers Snakemake/SLURM memory requests and DuckDB memory_limit accordingly.

Changes:

  • Rewrites three cross-compendium DuckDB reports to use a two-pass duplicate-detection strategy.
  • Restores additional output columns for duplicate clique leaders now that LIST() is scoped to duplicates.
  • Reduces SLURM mem and DuckDB memory_limit, and increases DuckDB thread count for these rules.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/snakefiles/duckdb.snakefile | Lowers RAM requests / DuckDB limits for the three cross-compendium report rules and adjusts DuckDB thread settings. |
| src/reports/duckdb_reports.py | Implements two-pass DuckDB queries to avoid large non-spillable LIST() state; restores previously dropped columns for one report. |
| slurm/README.md | Updates documented “Known Resource Hotspots” RAM requirements for the affected rules. |


gaurav and others added 4 commits March 30, 2026 18:27
… report rules

- Replace CREATE TEMP TABLE with CREATE OR REPLACE TEMP TABLE in all three
  two-pass report functions so they are safe to call more than once in a process.
- Add cpus_per_task=4 to the resources block of all three cross-compendium
  DuckDB rules to make CPU allocation explicit and consistent with the DuckDB
  threads setting.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
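The idempotency fix above relies on DuckDB's CREATE OR REPLACE support for temp tables: plain CREATE TEMP TABLE raises an error if the table already exists in the same connection. A minimal illustration (names hypothetical):

```sql
CREATE OR REPLACE TEMP TABLE duplicate_keys AS
SELECT clique_leader
FROM cliques
GROUP BY clique_leader
HAVING COUNT(*) > 1;
-- Running the same statement again simply replaces the temp table
-- instead of failing with "table already exists".
```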
In check_for_duplicate_curies, add ORDER BY e.clique_leader to the filenames
and conflations aggregates so they stay aligned with the already-sorted clique_leaders list.

In check_for_duplicate_clique_leaders, add ORDER BY c.filename to all three
LIST() aggregates so biolink_types and clique_identifier_counts remain positionally
aligned with filenames and output is deterministic.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
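DuckDB accepts an ORDER BY clause inside aggregate functions, so giving every LIST() in a query the same sort key keeps the resulting arrays positionally aligned. A sketch of the shape described above (column names taken from the PR; table and join names hypothetical):

```sql
SELECT c.clique_leader,
       LIST(c.filename               ORDER BY c.filename) AS filenames,
       LIST(c.biolink_type           ORDER BY c.filename) AS biolink_types,
       LIST(c.clique_identifier_count ORDER BY c.filename) AS clique_identifier_counts
FROM cliques AS c
JOIN duplicate_leaders USING (clique_leader)
GROUP BY c.clique_leader;
```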
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.


