
perf(gpu): narrow pipeline cache lock scope during compilation#101

Merged
oritwoen merged 1 commit into main from perf/narrow-pipeline-cache-lock on Mar 31, 2026

Conversation

@oritwoen
Owner

Summary

  • Drop the mutex guard immediately after cache lookup instead of holding it across the entire shader compilation
  • Re-acquire the lock only briefly for the final insert
  • Other threads can now read cached pipelines while a new variant is being compiled

Follow-up to #100.
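The locking pattern described in the summary can be sketched roughly as follows. The type and function names here are hypothetical stand-ins for illustration, not the actual `src/gpu/pipeline.rs` code:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins for the real pipeline types in src/gpu/pipeline.rs.
type VariantKey = u32;
pub struct Pipeline(pub VariantKey);

fn expensive_compile(key: VariantKey) -> Arc<Pipeline> {
    // Stands in for shader compilation; runs with no lock held.
    Arc::new(Pipeline(key))
}

pub fn get_or_compile(
    cache: &Mutex<HashMap<VariantKey, Arc<Pipeline>>>,
    key: VariantKey,
) -> Arc<Pipeline> {
    // Read path: the guard is a temporary and is dropped at the end of
    // this statement, so the lock is held only for the lookup itself.
    if let Some(p) = cache.lock().unwrap().get(&key).cloned() {
        return p;
    }
    // Compile without the lock, so other threads can keep reading the cache.
    let pipeline = expensive_compile(key);
    // Write path: re-acquire the lock briefly for the final insert.
    cache.lock().unwrap().insert(key, Arc::clone(&pipeline));
    pipeline
}
```

Because the cached value is an `Arc`, the read path can clone cheaply and release the lock before doing anything with the pipeline.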

Drop the mutex guard after cache lookup so shader compilation
does not block other threads from reading cached pipelines.
The lock is only re-acquired briefly for the final insert.
@coderabbitai
Contributor

coderabbitai Bot commented Mar 31, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 644557f4-0287-4bb1-869f-4f37525f929c

📥 Commits

Reviewing files that changed from the base of the PR and between 3347803 and 6b2c711.

📒 Files selected for processing (1)
  • src/gpu/pipeline.rs
📜 Recent review details
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: cubic · AI code reviewer
🧰 Additional context used
📓 Path-based instructions (1)
src/gpu/**/*.rs

📄 CodeRabbit inference engine (AGENTS.md)

Never add y_parity or symClass to GpuDistinguishedPoint; collision resolution must be done CPU-side via compute_candidate_keys()

Files:

  • src/gpu/pipeline.rs
🧠 Learnings (3)
📓 Common learnings
Learnt from: oritwoen
Repo: oritwoen/kangaroo PR: 100
File: src/gpu/pipeline.rs:35-37
Timestamp: 2026-03-31T16:28:59.172Z
Learning: The `kangaroo` project is a CLI solver, not a service. `GpuContext` is created once per process run, and the process exits after solving. There is no create/destroy cycle for GPU contexts, so a static `PIPELINE_CACHE` holding strong `Arc` references in `src/gpu/pipeline.rs` is intentional and safe for this use case.
📚 Learning: 2026-02-20T07:57:41.039Z
Learnt from: CR
Repo: oritwoen/kangaroo PR: 0
File: AGENTS.md:0-0
Timestamp: 2026-02-20T07:57:41.039Z
Learning: Applies to src/gpu_crypto/context.rs : Detect and warn about software renderer fallbacks (llvmpipe/SwiftShader)

Applied to files:

  • src/gpu/pipeline.rs
🔇 Additional comments (2)
src/gpu/pipeline.rs (2)

41-48: TOCTOU race: two threads can both miss cache and compile the same variant.

Between the lookup dropping the lock (line 45) and the insert acquiring it again (line 174), another thread can also miss the cache for the same key. Both threads then compile the same shader, and the second insert wins.

For a CLI solver (per the learnings) this probably only causes redundant work during init, which is fine. If that's intentional, a brief comment would help future readers understand the trade-off. Otherwise, if you want a true single-compilation guarantee, you'd need something like the `entry()` API with `or_insert_with` while holding the lock.

Just confirming this is the intended behavior: redundant compilation is acceptable for the perf win on the read path?

Also applies to: 173-176
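For reference, the single-compilation alternative mentioned above could look something like this minimal sketch (hypothetical types, not the project's actual code). Note that it reintroduces exactly what the PR removes: the lock is held across the compile closure, blocking all other readers:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-ins; see the trade-off discussion above.
type VariantKey = u32;
pub struct Pipeline(pub VariantKey);

// Guarantees each variant is compiled exactly once, but blocks all other
// readers of the cache while the closure (the compile step) runs.
pub fn get_or_compile_once(
    cache: &Mutex<HashMap<VariantKey, Arc<Pipeline>>>,
    key: VariantKey,
) -> Arc<Pipeline> {
    let mut map = cache.lock().unwrap();
    Arc::clone(map.entry(key).or_insert_with(|| Arc::new(Pipeline(key))))
}
```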


41-48: Lock-and-release pattern looks good.

The inline cache.lock()...get()...cloned() drops the guard immediately after cloning, which is the right move for narrowing scope. Clean pattern.


📝 Walkthrough


Refactored cache locking in KangarooPipeline::new to drop lock guards immediately after use rather than holding them across control flow. The read path now acquires and releases the lock inline; the write path does the same for insertion. Logic flow unchanged.

Changes

Cohort / File(s) Summary
Pipeline cache lock refactoring
src/gpu/pipeline.rs
Replaced the persistent guard variable with chained temporary lock acquisitions. The read path (`lock().get().cloned()`) and write path (`lock().insert(...)`) now release the lock immediately after each operation instead of holding it across conditional branches.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

The main question: does dropping the lock early between the read-miss and the write introduce a race where two threads could both miss cache and create duplicate pipelines? Need to verify the creation cost is acceptable or if atomic compare-and-swap on insert would be safer. Also confirm the clone cost on read doesn't outweigh the lock scope reduction.
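The race asked about above can be demonstrated deterministically with a barrier that forces both threads past their cache lookup before either inserts (hypothetical key/value types, not the project's code): both threads "compile", and the second insert simply overwrites the first with an equivalent entry.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Barrier, Mutex};
use std::thread;

// Returns (number of redundant "compilations", final cache size).
pub fn double_miss_demo() -> (usize, usize) {
    let cache = Arc::new(Mutex::new(HashMap::<u32, Arc<u32>>::new()));
    let compiles = Arc::new(AtomicUsize::new(0));
    let barrier = Arc::new(Barrier::new(2));

    let handles: Vec<_> = (0..2)
        .map(|_| {
            let (cache, compiles, barrier) =
                (Arc::clone(&cache), Arc::clone(&compiles), Arc::clone(&barrier));
            thread::spawn(move || {
                // Narrow-scope read: the lock is dropped right after the lookup.
                let miss = cache.lock().unwrap().get(&0).is_none();
                // Both threads finish their lookup before either inserts,
                // so both deterministically observe a miss.
                barrier.wait();
                if miss {
                    compiles.fetch_add(1, Ordering::SeqCst); // redundant "compile"
                    cache.lock().unwrap().insert(0, Arc::new(0));
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    let len = cache.lock().unwrap().len();
    (compiles.load(Ordering::SeqCst), len)
}
```

Both threads do redundant work, but the cache ends up in a consistent state with a single entry, which is why the race is benign for one-shot initialization.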


Suggested labels

performance, refactoring

Poem

🔒 Lock held tight? Nope, released quick,
Chains of ops, no guarding tricks,
Cache still works, contention less—
Just watch the race, mind the mess. 🏎️

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage — ⚠️ Warning: docstring coverage is 0.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
✅ Passed checks (2 passed)
  • Title check — ✅ Passed: the title accurately describes the main change (narrowing the pipeline cache lock scope by releasing the guard immediately after lookup instead of holding it across shader compilation).
  • Description check — ✅ Passed: the description clearly explains the optimization (dropping the mutex guard after cache lookup, re-acquiring briefly for insert, enabling concurrent reads during compilation).



@coderabbitai coderabbitai Bot added the performance Performance improvements label Mar 31, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment


No issues found across 1 file

Confidence score: 5/5

  • Automated review surfaced no issues in the provided summaries.
  • No files require special attention.

Requires human review: Modifies concurrency logic and locking strategy for a global cache. While intended for performance, changes to shared state management require human review.

@oritwoen oritwoen merged commit a62ddb5 into main Mar 31, 2026
4 checks passed
@oritwoen oritwoen deleted the perf/narrow-pipeline-cache-lock branch March 31, 2026 19:45
