Optimize partition merging algorithm by robtaylor · Pull Request #2 · ChipFlow/Jacquard

robtaylor · 2026-02-15T16:44:33Z

Summary

- Replace per-candidate topological DFS in merge scoring with bitset union + popcount (O(n/64) vs O(subgraph))
- Add TopoTraverser with dense visited buffer and iterative stack-based DFS, replacing IndexSet-based recursive DFS
- Add Partition::quick_reject() pre-check to skip obviously infeasible merges before expensive hierarchy construction
- Add cancel-on-success AtomicBool so speculative parallel build_one() trials bail early when another succeeds
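
The cancel-on-success flag in the last item above can be illustrated with a minimal sketch. This is not the code from the PR: `trial_cancellable` and `run_trials` are hypothetical stand-ins for `build_one()` and its caller, the "stages" are empty placeholders, and plain `std::thread` is used here only to keep the example free of external crates.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical stand-in for one speculative trial (not the real build_one()).
/// It checks the shared flag before each stage and gives up once another
/// trial has already succeeded.
fn trial_cancellable(stages: usize, succeeds: bool, cancel: &AtomicBool) -> Option<usize> {
    for _ in 0..stages {
        if cancel.load(Ordering::Relaxed) {
            return None; // another trial won; stop doing useless work
        }
        // ... one stage of real work would go here ...
    }
    if succeeds {
        // Tell the other in-flight trials that they can stop.
        cancel.store(true, Ordering::Relaxed);
        Some(stages)
    } else {
        None
    }
}

/// Runs every (stages, succeeds) trial on its own scoped thread and returns
/// the results in trial order.
fn run_trials(trials: &[(usize, bool)]) -> Vec<Option<usize>> {
    let cancel = AtomicBool::new(false);
    let cancel = &cancel;
    thread::scope(|s| {
        let handles: Vec<_> = trials
            .iter()
            .map(|&(stages, ok)| s.spawn(move || trial_cancellable(stages, ok, cancel)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    })
}
```

`Ordering::Relaxed` is enough in this sketch because the flag is only a hint to stop early; the results themselves are handed back through `join`, which provides the needed synchronization.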

Test plan

- cargo check -r --features metal compiles cleanly
- NVDLA benchmark: 316 initial → 55 merged partitions in 11m51s (same partition count as baseline)
- Rocket benchmark
- Gemmini benchmark
- Compare gemparts output against baseline for correctness

Replace the expensive per-candidate topological DFS in merge scoring with bitset union + popcount (O(n/64) vs O(subgraph)). Pre-compute node bitsets during initial partition analysis and update incrementally on merge.

Key changes:
- Add TopoTraverser with dense Vec<u32> visited buffer using generation counter pattern, replacing IndexSet-based DFS (aig.rs)
- Convert recursive DFS to iterative stack-based DFS to avoid stack overflow on deep AIGs and improve cache locality
- Add bitset_union_popcount/bitset_or_inplace helpers for merge scoring
- Add Partition::quick_reject() pre-check to skip obviously infeasible merges before expensive hierarchy construction
- Add cancel-on-success AtomicBool to speculative parallel trials so in-progress build_one() calls bail early when another trial succeeds
- Add build_one_cancellable() that checks cancel flag between boomerang stages
- Extract collect_comb_outputs() helper, hoist out of inner loop
- Update CLAUDE.md to document Metal backend and benchmarks

Tested on NVDLA benchmark (254MB netlist): 316 initial → 55 merged partitions in 11m51s wall clock.

Co-developed-by: Claude Code v2.1.42 (claude-opus-4-6)
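
To make the bitset scoring idea concrete, here is a small sketch under stated assumptions. The helper names `bitset_union_popcount` and `bitset_or_inplace` come from the commit message, but their signatures and bodies here are guesses, and `words_for` and `bitset_insert` are hypothetical helpers added for the example. Each partition is assumed to own a `[u64]` slice with one bit per node id.

```rust
/// Number of 64-bit words needed to hold one bit per node (hypothetical helper).
fn words_for(num_nodes: usize) -> usize {
    (num_nodes + 63) / 64
}

/// Marks `node` as a member of the set (hypothetical helper).
fn bitset_insert(bits: &mut [u64], node: usize) {
    bits[node / 64] |= 1u64 << (node % 64);
}

/// Size of the union of two node sets, computed word by word without
/// building the union. Cost is one OR and one popcount per 64 nodes.
fn bitset_union_popcount(a: &[u64], b: &[u64]) -> u32 {
    a.iter().zip(b).map(|(x, y)| (x | y).count_ones()).sum()
}

/// dst |= src. Applied when a merge is accepted, so the merged partition's
/// bitset is updated incrementally instead of being recomputed.
fn bitset_or_inplace(dst: &mut [u64], src: &[u64]) {
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= *s;
    }
}
```

Scoring a candidate merge then only needs `bitset_union_popcount` on the two precomputed bitsets, which is where the O(n/64) bound quoted above would come from; the traversal of the combined subgraph is avoided entirely.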

…ures, parallel flatten

- Pass prebuilt Partition objects from cut_map_interactive to process_partitions, eliminating ~316 redundant build_one() calls for NVDLA
- Replace IndexSet-based topo_traverse_generic with dense TopoTraverser at all hot call sites (pe.rs, repcut.rs, staging.rs)
- Replace IndexMap id2order with Vec<usize> in build_one_boomerang_stage for direct O(1) lookups instead of hash-based access
- Replace IndexMap hier_visited_nodes_count with Vec<usize> + active_nodes list for O(1) contains/increment instead of hash-based entry()
- Add dense Vec<bool> shadows for realized_inputs and unrealized_comb_outputs in build_one_cancellable for fast contains() checks in inner loops
- Parallelize init_afters_writeouts and build_script in flatten.rs with rayon

Co-developed-by: Claude Code v2.1.42 (claude-opus-4-6)
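
The dense traverser that these commits roll out can be sketched roughly as follows. Only the name `TopoTraverser`, the `Vec<u32>` visited buffer, the generation counter, and the iterative stack-based DFS are taken from the commit messages; the fields, the `traverse` signature, and the `fanins` adjacency-list representation are assumptions made for illustration, and the graph is assumed to be acyclic.

```rust
/// Illustrative dense traverser; not the actual type from aig.rs.
struct TopoTraverser {
    /// visited[n] == generation means node n was seen in the current traversal.
    visited: Vec<u32>,
    generation: u32,
    /// Explicit DFS stack of (node, index of the next fan-in to explore).
    stack: Vec<(usize, usize)>,
}

impl TopoTraverser {
    fn new(num_nodes: usize) -> Self {
        Self { visited: vec![0; num_nodes], generation: 0, stack: Vec::new() }
    }

    /// Returns every node reachable from `roots` via `fanins`, with each node
    /// listed after all of its fan-ins. Assumes the graph has no cycles.
    fn traverse(&mut self, fanins: &[Vec<usize>], roots: &[usize]) -> Vec<usize> {
        // Bumping the generation "clears" the buffer in O(1); a full reset is
        // only needed if the counter ever wraps around.
        self.generation = self.generation.wrapping_add(1);
        if self.generation == 0 {
            self.visited.fill(0);
            self.generation = 1;
        }
        let current = self.generation;
        let mut order = Vec::new();
        for &root in roots {
            if self.visited[root] == current {
                continue;
            }
            self.visited[root] = current;
            self.stack.push((root, 0));
            while let Some(top) = self.stack.last_mut() {
                let (node, next) = *top;
                if let Some(&child) = fanins[node].get(next) {
                    top.1 += 1; // resume at the following fan-in next time
                    if self.visited[child] != current {
                        self.visited[child] = current;
                        self.stack.push((child, 0));
                    }
                } else {
                    order.push(node); // all fan-ins emitted, emit the node
                    self.stack.pop();
                }
            }
        }
        order
    }
}
```

Reusing one traverser across calls keeps the visited buffer and the stack allocated, so repeated traversals avoid both hashing and reallocation, and the explicit stack sidesteps recursion depth limits on deep graphs.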

robtaylor added 2 commits February 15, 2026 22:33

robtaylor force-pushed the GEM-optimize branch from af361f9 to 0385a7a February 15, 2026 22:33

robtaylor merged this pull request into master Feb 15, 2026
5 checks passed

robtaylor deleted the GEM-optimize branch February 16, 2026 01:05

robtaylor merged 2 commits into master from GEM-optimize
