perf(solver): speed up 48-bit benchmark by oritwoen · Pull Request #103 · oritwoen/kangaroo

oritwoen · 2026-04-02T14:26:03Z

Off-by-one in select_best_variant() kept 65k herds on Wg64 when they should hit Wg128. Dispatch target was too low at 50ms - bumped to 120ms so calibration doesn't overpay host round-trip on small solves. Added non-power-of-two candidates around the 24-32 step cliff where benchmark GPUs actually diverge.

Fixes #102

coderabbitai · 2026-04-02T14:27:20Z

📝 Walkthrough

Walkthrough

This change adjusts internal GPU kernel calibration and dispatch tuning in the solver module. The calibration timing target increased from 50 to 120 milliseconds, the workgroup variant selection threshold transitioned from > to >= at 65,536 kangaroos, and the calibration step candidate sequence expanded from six to eight values with new intermediate steps.

Changes

Cohort / File(s)	Summary
Calibration & Workgroup Tuning `src/solver.rs`	`TARGET_DISPATCH_MS` constant bumped from 50 to 120 (affects calibrate loop acceptance/early-exit logic). WorkgroupVariant::Wg128 selection threshold changed from `num_kangaroos > 65_536` to `>= 65_536`. Calibration step candidates expanded from `[16, 32, 64, 128, 256, 512]` to `[16, 17, 18, 24, 64, 128, 256, 512]`, widening the search space.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

perf(solver): cut startup overhead for small-range solves #99: Overlaps on select_best_variant and calibrate logic—also modifies workgroup-size thresholds and calibration behavior.
fix(solver): reset DP counter between calibration probes #83: Modifies the same calibrate function internals in src/solver.rs.
perf(gpu): tune WGSL shader hot path #61: Changes the same solver-level GPU tuning and dispatch logic, including variant selection and calibration flow.

Poem

⚙️ Tuning knobs turn steady—
Fifty milliseconds? Not ready.
Doubled dispatch time, step candidates spread wide,
Kangaroos hop with more stride. 🦘

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title 'perf(solver): speed up 48-bit benchmark' is specific and directly related to the changeset's solver performance optimizations including calibration timing and kernel variant selection.
Description check	✅ Passed	The description directly addresses the three code changes: off-by-one fix in variant selection, dispatch target timing adjustment, and calibration candidate expansion.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch autoresearch/48bit-solve/01-solver-calibration

✨ Simplify code

Create PR with simplified code
Commit simplified code in branch autoresearch/48bit-solve/01-solver-calibration

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents

Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/solver.rs`:
- Around line 24-29: The constant TARGET_DISPATCH_MS in solver.rs was raised to
120, violating the file-level TDR safety contract requiring calibration under
50ms; change TARGET_DISPATCH_MS back to 50 (or a value <= 50) so the
auto-calibration in the solver honors the 50ms TDR threshold and avoids long GPU
startup stalls.
- Around line 601-605: The early-exit uses a hard-coded doubling check that no
longer matches the non-power-of-two candidates array (candidates), causing
calibration to stop too early; change the stopping condition to predict the next
candidate's time by multiplying elapsed_ms by the ratio of
next_candidate/current_candidate (use candidates[i] and candidates[i+1]) and
compare that to target (instead of elapsed_ms * 2 > target), so the loop only
breaks when the predicted time for the next candidate would exceed target.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9e8a6860-6acc-4b1f-af95-b1fc88d0730a

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa234b and 2ae3f85.

📒 Files selected for processing (1)

src/solver.rs

📜 Review details

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: cubic · AI code reviewer

🧰 Additional context used

📓 Path-based instructions (1)

src/solver.rs

📄 CodeRabbit inference engine (AGENTS.md)

src/solver.rs: Implement solver state machine logic in src/solver.rs using the step() method and calibration
Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold

Files:

src/solver.rs

🧠 Learnings (11)

📓 Common learnings

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/shaders/kangaroo_affine.wgsl : Implement main GPU compute logic in WGSL shaders located at src/shaders/kangaroo_affine.wgsl

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/shaders/*.wgsl : Use workgroup size of 64 threads in compute shaders (hardcoded)

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Implement solver state machine logic in src/solver.rs using the step() method and calibration

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Never trust DP buffer count without clamping to MAX_DISTINGUISHED_POINTS

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Implement collision detection using the Distinguished Points (DP) pattern in src/cpu/dp_table.rs

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Never use Jacobian coordinates in final DP storage; must convert to affine coordinates

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/gpu/buffers.rs : Clamp DP buffer to 90% capacity per dispatch to prevent overflow

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/gpu/**/*.rs : Never add y_parity or symClass to GpuDistinguishedPoint; collision resolution must be done CPU-side via compute_candidate_keys()

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/init.rs : Never split jump table for negation map; use add/sub operations with the same 256-entry table

Applied to files:

src/solver.rs

📚 Learning: 2026-03-31T16:28:59.172Z

Learnt from: oritwoen
Repo: oritwoen/kangaroo PR: 100
File: src/gpu/pipeline.rs:35-37
Timestamp: 2026-03-31T16:28:59.172Z
Learning: The `kangaroo` project is a CLI solver, not a service. `GpuContext` is created once per process run, and the process exits after solving. There is no create/destroy cycle for GPU contexts, so a static `PIPELINE_CACHE` holding strong `Arc` references in `src/gpu/pipeline.rs` is intentional and safe for this use case.

Applied to files:

src/solver.rs

📚 Learning: 2026-02-20T07:57:41.039Z

Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Auto-calculate DP bits from range_bits and clamp to 8-40 range

Applied to files:

src/solver.rs

🔇 Additional comments (1)

src/solver.rs (1)

205-205: Boundary condition fix is correct.

Line 205 switching to num_kangaroos >= 65_536 removes the exact-threshold split and makes variant selection deterministic at the boundary.

cubic-dev-ai

No issues found across 1 file

Confidence score: 5/5

Automated review surfaced no issues in the provided summaries.
No files require special attention.

_{Auto-approved: Small, low-risk performance tuning and an off-by-one fix in GPU solver calibration. Isolated logic changes with low risk of breakage.}

Architecture diagram

sequenceDiagram
    participant S as KangarooSolver
    participant C as GPU Context
    participant G as GPU Backend

    Note over S,G: Initialization & Variant Selection
    S->>C: Query max_workgroup_size()
    S->>S: CHANGED: select_best_variant()
    Note right of S: Now selects Wg128 if kangaroos >= 65,536<br/>(Fixed off-by-one error)

    Note over S,G: Calibration Phase
    loop NEW: Candidates [16, 17, 18, 24, 64, 128, 256, 512]
        S->>G: Dispatch test workload
        G-->>S: Return measured duration
        S->>S: CHANGED: Evaluate vs TARGET_DISPATCH_MS
        Note right of S: Threshold increased to 120ms to<br/>mask host-to-device latency overhead
    end

    S->>S: Select optimal steps_per_call
    S-->>S: Finalize KangarooPipeline

perf(solver): tighten GPU calibration

2ae3f85

oritwoen added the performance Performance improvements label Apr 2, 2026

oritwoen self-assigned this Apr 2, 2026

coderabbitai Bot requested changes Apr 2, 2026

View reviewed changes

Comment thread src/solver.rs

Comment thread src/solver.rs

cubic-dev-ai Bot approved these changes Apr 2, 2026

View reviewed changes

coderabbitai Bot approved these changes Apr 2, 2026

View reviewed changes

oritwoen merged commit 6f6e4a6 into main Apr 2, 2026
4 checks passed

oritwoen deleted the autoresearch/48bit-solve/01-solver-calibration branch April 2, 2026 14:57

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

perf(solver): speed up 48-bit benchmark#103

perf(solver): speed up 48-bit benchmark#103
oritwoen merged 1 commit intomainfrom
autoresearch/48bit-solve/01-solver-calibration

oritwoen commented Apr 2, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented Apr 2, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

oritwoen commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

coderabbitai Bot commented Apr 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Possibly related PRs

Poem

❌ Failed checks (1 warning)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

oritwoen commented Apr 2, 2026 •

edited

Loading

coderabbitai Bot commented Apr 2, 2026 •

edited

Loading