Optimize partition merging algorithm #2

Merged
robtaylor merged 2 commits into master from GEM-optimize on Feb 15, 2026

Conversation

@robtaylor
Contributor

Summary

  • Replace per-candidate topological DFS in merge scoring with bitset union + popcount (O(n/64) vs O(subgraph))
  • Add TopoTraverser with dense visited buffer and iterative stack-based DFS, replacing IndexSet-based recursive DFS
  • Add Partition::quick_reject() pre-check to skip obviously infeasible merges before expensive hierarchy construction
  • Add cancel-on-success AtomicBool so speculative parallel build_one() trials bail early when another succeeds
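
The cancel-on-success idea in the last bullet can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the PR's code: `run_trial` and `first_success` are hypothetical names, the "stages" are placeholders for real work, and scoped threads stand in for whatever parallel runtime the project actually uses.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::thread;

/// Hypothetical stand-in for a staged, cancellable build. It polls the shared
/// flag between stages and gives up with `None` once another trial has won.
fn run_trial(trial: usize, stages: usize, cancel: &AtomicBool) -> Option<usize> {
    for _stage in 0..stages {
        if cancel.load(Ordering::Relaxed) {
            return None; // another trial already succeeded
        }
        // ... one stage of real work would go here ...
    }
    Some(trial)
}

/// Runs one speculative trial per entry of `stage_counts` in parallel and
/// returns the lowest-indexed trial that completed, if any.
pub fn first_success(stage_counts: &[usize]) -> Option<usize> {
    let cancel = AtomicBool::new(false);
    thread::scope(|s| {
        let handles: Vec<_> = stage_counts
            .iter()
            .enumerate()
            .map(|(trial, &stages)| {
                let cancel = &cancel;
                s.spawn(move || {
                    let result = run_trial(trial, stages, cancel);
                    if result.is_some() {
                        // Tell the remaining trials to stop at their next check.
                        cancel.store(true, Ordering::Relaxed);
                    }
                    result
                })
            })
            .collect();
        handles
            .into_iter()
            .filter_map(|h| h.join().expect("trial thread panicked"))
            .next()
    })
}
```

`Ordering::Relaxed` is enough in this sketch because the flag carries no data of its own; each trial's result is handed back through `join()`. More than one trial may finish before it observes the flag, so the caller still has to pick a winner.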

Test plan

  • cargo check -r --features metal compiles cleanly
  • NVDLA benchmark: 316 initial → 55 merged partitions in 11m51s (same partition count as baseline)
  • Rocket benchmark
  • Gemmini benchmark
  • Compare gemparts output against baseline for correctness

Replace the expensive per-candidate topological DFS in merge scoring with
bitset union + popcount: O(n/64) word operations over an n-bit node set,
versus a traversal of the candidate subgraph. Pre-compute node bitsets
during initial partition analysis and update incrementally on merge.
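
As a rough illustration of the scoring idea above, the sketch below stores each partition's node set as a slice of `u64` words. The helper names `bitset_union_popcount` and `bitset_or_inplace` come from this commit message, but the signatures shown are assumptions, and `bitset_from_nodes` is a hypothetical helper added only to make the example self-contained.

```rust
/// Builds a bitset with one bit per node ID (hypothetical helper).
pub fn bitset_from_nodes(nodes: &[usize], num_nodes: usize) -> Vec<u64> {
    let mut bits = vec![0u64; (num_nodes + 63) / 64];
    for &n in nodes {
        bits[n / 64] |= 1u64 << (n % 64);
    }
    bits
}

/// Size of the union of two node sets, without materialising the union.
/// Cost is one OR and one popcount per 64 nodes.
pub fn bitset_union_popcount(a: &[u64], b: &[u64]) -> usize {
    assert_eq!(a.len(), b.len());
    a.iter()
        .zip(b)
        .map(|(x, y)| (x | y).count_ones() as usize)
        .sum()
}

/// Folds `src` into `dst`; this is the incremental update after a merge.
pub fn bitset_or_inplace(dst: &mut [u64], src: &[u64]) {
    assert_eq!(dst.len(), src.len());
    for (d, s) in dst.iter_mut().zip(src) {
        *d |= *s;
    }
}
```

Scoring a candidate pair then only calls `bitset_union_popcount` on the two precomputed bitsets, and an accepted merge calls `bitset_or_inplace` once so the merged partition's bitset never has to be rebuilt from a traversal.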

Key changes:
- Add TopoTraverser with dense Vec<u32> visited buffer using generation
  counter pattern, replacing IndexSet-based DFS (aig.rs)
- Convert recursive DFS to iterative stack-based DFS to avoid stack
  overflow on deep AIGs and improve cache locality
- Add bitset_union_popcount/bitset_or_inplace helpers for merge scoring
- Add Partition::quick_reject() pre-check to skip obviously infeasible
  merges before expensive hierarchy construction
- Add cancel-on-success AtomicBool to speculative parallel trials so
  in-progress build_one() calls bail early when another trial succeeds
- Add build_one_cancellable() that checks cancel flag between boomerang
  stages
- Extract collect_comb_outputs() helper, hoist out of inner loop
- Update CLAUDE.md to document Metal backend and benchmarks
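
The first two items in the list above (a dense visited buffer with a generation counter, and an iterative DFS) might look roughly like the following. This is a sketch under assumptions, not the code in aig.rs: the graph is modelled as a plain `fanins` adjacency list, and the struct fields and the `traverse` signature are illustrative.

```rust
/// Reusable traverser: `visited[n] == generation` means node `n` was seen in
/// the current traversal, so starting a new traversal only bumps a counter.
pub struct TopoTraverser {
    visited: Vec<u32>,
    generation: u32,
    stack: Vec<(usize, usize)>, // (node, index of the next fanin to explore)
}

impl TopoTraverser {
    pub fn new(num_nodes: usize) -> Self {
        Self { visited: vec![0; num_nodes], generation: 0, stack: Vec::new() }
    }

    /// Returns every node reachable from `roots`, with each node listed
    /// after all of its fanins (post-order).
    pub fn traverse(&mut self, fanins: &[Vec<usize>], roots: &[usize]) -> Vec<usize> {
        self.generation = self.generation.wrapping_add(1);
        if self.generation == 0 {
            // Counter wrapped: old stamps could collide, so clear the buffer once.
            self.visited.iter_mut().for_each(|v| *v = 0);
            self.generation = 1;
        }
        let stamp = self.generation;
        let mut order = Vec::new();
        for &root in roots {
            if self.visited[root] == stamp {
                continue;
            }
            self.visited[root] = stamp;
            self.stack.push((root, 0));
            while let Some(&(node, next)) = self.stack.last() {
                if next < fanins[node].len() {
                    let top = self.stack.len() - 1;
                    self.stack[top].1 += 1; // resume at the next fanin later
                    let child = fanins[node][next];
                    if self.visited[child] != stamp {
                        self.visited[child] = stamp;
                        self.stack.push((child, 0));
                    }
                } else {
                    order.push(node); // all fanins emitted
                    self.stack.pop();
                }
            }
        }
        order
    }
}
```

Because the explicit stack lives on the heap, traversal depth is bounded by memory rather than by the thread's call stack, and reusing one `TopoTraverser` across calls avoids reallocating or clearing the visited buffer each time.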

Tested on NVDLA benchmark (254MB netlist): 316 initial → 55 merged
partitions in 11m51s wall clock.

Co-developed-by: Claude Code v2.1.42 (claude-opus-4-6)
…ures, parallel flatten

- Pass prebuilt Partition objects from cut_map_interactive to process_partitions,
  eliminating ~316 redundant build_one() calls for NVDLA
- Replace IndexSet-based topo_traverse_generic with dense TopoTraverser at all
  hot call sites (pe.rs, repcut.rs, staging.rs)
- Replace IndexMap id2order with Vec<usize> in build_one_boomerang_stage for
  direct O(1) lookups instead of hash-based access
- Replace IndexMap hier_visited_nodes_count with Vec<usize> + active_nodes list
  for O(1) contains/increment instead of hash-based entry()
- Add dense Vec<bool> shadows for realized_inputs and unrealized_comb_outputs
  in build_one_cancellable for fast contains() checks in inner loops
- Parallelize init_afters_writeouts and build_script in flatten.rs with rayon
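
The `Vec<usize>` + active-list replacement for a hash map, as described in the list above, can be sketched like this. `DenseCounts` and its methods are hypothetical names chosen for illustration; the actual fields in build_one_boomerang_stage may be laid out differently.

```rust
/// Per-node counters indexed directly by node ID, plus a list of the nodes
/// that currently have a non-zero count so clearing and iteration stay
/// proportional to the number of touched nodes rather than the whole graph.
pub struct DenseCounts {
    counts: Vec<usize>,
    active: Vec<usize>,
}

impl DenseCounts {
    pub fn new(num_nodes: usize) -> Self {
        Self { counts: vec![0; num_nodes], active: Vec::new() }
    }

    /// Increments the count for `node` and returns the new value.
    pub fn increment(&mut self, node: usize) -> usize {
        if self.counts[node] == 0 {
            self.active.push(node); // first touch: remember it for later
        }
        self.counts[node] += 1;
        self.counts[node]
    }

    /// O(1) membership test with no hashing.
    pub fn contains(&self, node: usize) -> bool {
        self.counts[node] != 0
    }

    /// Nodes with a non-zero count, in first-touch order.
    pub fn active_nodes(&self) -> &[usize] {
        &self.active
    }

    /// Resets only the touched entries.
    pub fn clear(&mut self) {
        for node in self.active.drain(..) {
            self.counts[node] = 0;
        }
    }
}
```

The trade-off is memory proportional to the total node count instead of the number of live entries, which is usually acceptable when node IDs are already dense indices.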

Co-developed-by: Claude Code v2.1.42 (claude-opus-4-6)
@robtaylor robtaylor merged this pull request into master Feb 15, 2026
5 checks passed
@robtaylor robtaylor deleted the GEM-optimize branch February 16, 2026 01:05