Skip to content

perf(solver): speed up 48-bit benchmark#103

Merged
oritwoen merged 1 commit intomainfrom
autoresearch/48bit-solve/01-solver-calibration
Apr 2, 2026
Merged

perf(solver): speed up 48-bit benchmark#103
oritwoen merged 1 commit intomainfrom
autoresearch/48bit-solve/01-solver-calibration

Conversation

@oritwoen
Copy link
Copy Markdown
Owner

@oritwoen oritwoen commented Apr 2, 2026

Off-by-one in select_best_variant() kept 65k herds on Wg64 when they should hit Wg128. Dispatch target was too low at 50ms - bumped to 120ms so calibration doesn't overpay host round-trip on small solves. Added non-power-of-two candidates around the 24-32 step cliff where benchmark GPUs actually diverge.

Fixes #102

@oritwoen oritwoen added the performance Performance improvements label Apr 2, 2026
@oritwoen oritwoen self-assigned this Apr 2, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented Apr 2, 2026

📝 Walkthrough

Walkthrough

This change adjusts internal GPU kernel calibration and dispatch tuning in the solver module. The calibration timing target increased from 50 to 120 milliseconds, the workgroup variant selection threshold transitioned from > to >= at 65,536 kangaroos, and the calibration step candidate sequence expanded from six to eight values with new intermediate steps.

Changes

Cohort / File(s) Summary
Calibration & Workgroup Tuning
src/solver.rs
TARGET_DISPATCH_MS constant bumped from 50 to 120 (affects calibrate loop acceptance/early-exit logic). WorkgroupVariant::Wg128 selection threshold changed from num_kangaroos > 65_536 to >= 65_536. Calibration step candidates expanded from [16, 32, 64, 128, 256, 512] to [16, 17, 18, 24, 64, 128, 256, 512], widening the search space.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Possibly related PRs

Poem

⚙️ Tuning knobs turn steady—
Fifty milliseconds? Not ready.
Doubled dispatch time, step candidates spread wide,
Kangaroos hop with more stride. 🦘

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title 'perf(solver): speed up 48-bit benchmark' is specific and directly related to the changeset's solver performance optimizations including calibration timing and kernel variant selection.
Description check ✅ Passed The description directly addresses the three code changes: off-by-one fix in variant selection, dispatch target timing adjustment, and calibration candidate expansion.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch autoresearch/48bit-solve/01-solver-calibration
✨ Simplify code
  • Create PR with simplified code
  • Commit simplified code in branch autoresearch/48bit-solve/01-solver-calibration

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/solver.rs`:
- Around line 24-29: The constant TARGET_DISPATCH_MS in solver.rs was raised to
120, violating the file-level TDR safety contract requiring calibration under
50ms; change TARGET_DISPATCH_MS back to 50 (or a value <= 50) so the
auto-calibration in the solver honors the 50ms TDR threshold and avoids long GPU
startup stalls.
- Around line 601-605: The early-exit uses a hard-coded doubling check that no
longer matches the non-power-of-two candidates array (candidates), causing
calibration to stop too early; change the stopping condition to predict the next
candidate's time by multiplying elapsed_ms by the ratio of
next_candidate/current_candidate (use candidates[i] and candidates[i+1]) and
compare that to target (instead of elapsed_ms * 2 > target), so the loop only
breaks when the predicted time for the next candidate would exceed target.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 9e8a6860-6acc-4b1f-af95-b1fc88d0730a

📥 Commits

Reviewing files that changed from the base of the PR and between 1aa234b and 2ae3f85.

📒 Files selected for processing (1)
  • src/solver.rs
📜 Review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: cubic · AI code reviewer
🧰 Additional context used
📓 Path-based instructions (1)
src/solver.rs

📄 CodeRabbit inference engine (AGENTS.md)

src/solver.rs: Implement solver state machine logic in src/solver.rs using the step() method and calibration
Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold

Files:

  • src/solver.rs
🧠 Learnings (11)
📓 Common learnings
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/shaders/kangaroo_affine.wgsl : Implement main GPU compute logic in WGSL shaders located at src/shaders/kangaroo_affine.wgsl
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/shaders/*.wgsl : Use workgroup size of 64 threads in compute shaders (hardcoded)
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Auto-calibrate steps to stay under 50ms TDR (Timeout Detection and Recovery) threshold

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/solver.rs : Implement solver state machine logic in src/solver.rs using the step() method and calibration

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Never trust DP buffer count without clamping to MAX_DISTINGUISHED_POINTS

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Implement collision detection using the Distinguished Points (DP) pattern in src/cpu/dp_table.rs

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/dp_table.rs : Never use Jacobian coordinates in final DP storage; must convert to affine coordinates

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/gpu/buffers.rs : Clamp DP buffer to 90% capacity per dispatch to prevent overflow

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/gpu/**/*.rs : Never add y_parity or symClass to GpuDistinguishedPoint; collision resolution must be done CPU-side via compute_candidate_keys()

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/cpu/init.rs : Never split jump table for negation map; use add/sub operations with the same 256-entry table

Applied to files:

  • src/solver.rs
📚 Learning: 2026-03-31T16:28:59.172Z
Learnt from: oritwoen
Repo: oritwoen/kangaroo PR: 100
File: src/gpu/pipeline.rs:35-37
Timestamp: 2026-03-31T16:28:59.172Z
Learning: The `kangaroo` project is a CLI solver, not a service. `GpuContext` is created once per process run, and the process exits after solving. There is no create/destroy cycle for GPU contexts, so a static `PIPELINE_CACHE` holding strong `Arc` references in `src/gpu/pipeline.rs` is intentional and safe for this use case.

Applied to files:

  • src/solver.rs
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Auto-calculate DP bits from range_bits and clamp to 8-40 range

Applied to files:

  • src/solver.rs
🔇 Additional comments (1)
src/solver.rs (1)

205-205: Boundary condition fix is correct.

Line 205 switching to num_kangaroos >= 65_536 removes the exact-threshold split and makes variant selection deterministic at the boundary.

Comment thread src/solver.rs
Comment thread src/solver.rs
Copy link
Copy Markdown

@cubic-dev-ai cubic-dev-ai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Auto-approved: Small, low-risk performance tuning and an off-by-one fix in GPU solver calibration. Isolated logic changes with low risk of breakage.

Architecture diagram
sequenceDiagram
    participant S as KangarooSolver
    participant C as GPU Context
    participant G as GPU Backend

    Note over S,G: Initialization & Variant Selection
    S->>C: Query max_workgroup_size()
    S->>S: CHANGED: select_best_variant()
    Note right of S: Now selects Wg128 if kangaroos >= 65,536<br/>(Fixed off-by-one error)

    Note over S,G: Calibration Phase
    loop NEW: Candidates [16, 17, 18, 24, 64, 128, 256, 512]
        S->>G: Dispatch test workload
        G-->>S: Return measured duration
        S->>S: CHANGED: Evaluate vs TARGET_DISPATCH_MS
        Note right of S: Threshold increased to 120ms to<br/>mask host-to-device latency overhead
    end

    S->>S: Select optimal steps_per_call
    S-->>S: Finalize KangarooPipeline
Loading

@oritwoen oritwoen merged commit 6f6e4a6 into main Apr 2, 2026
4 checks passed
@oritwoen oritwoen deleted the autoresearch/48bit-solve/01-solver-calibration branch April 2, 2026 14:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance Performance improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Benchmark-sized solves miss better GPU calibration

1 participant