fix(blocksync,p2p,service): fix concurrency bugs behind chronic CI flakes#1340
fix(blocksync,p2p,service): fix concurrency bugs behind chronic CI flakes#1340lklimek wants to merge 7 commits into
Conversation
…r busy-spin
Reset() and Start() toggled the `stopped` atomic as separate, unsynchronized
operations, and workers are spawned only inside Start(). A Stop() interleaving
between the Reset->Start pair in Run() could make Start() early-return (the old
`if p.stopped.Swap(false) { return }`), leaving the pool with OPEN channels but
ZERO workers: jobs enqueue but never execute, every Receive() blocks forever, and
blocksync wedges.
- Add a single lifecycleMtx that serializes Reset/Start/Stop/Run so the
Reset->Start pair in Run is atomic with respect to Stop. The mutex (the
go-deadlock-aliased sync.Mutex already used in this file) is held only around
state transitions; Stop's blocking waits (worker shutdown, wg.Wait) cannot
self-deadlock because workers acquire no pool lock. Cold path (once per sync
session), so tracking overhead is negligible.
- Remove Start's early-return; the invariant is now: after Run/Start returns the
pool is either stopped OR has exactly initPoolSize live workers, never zero.
- synchronizer.runHandler now EXITS on ErrWorkerPoolStopped instead of logging
and re-looping, killing the busy-spin / log-flood once the pool is stopping.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…expiry The deadlock guards in PeerManager.Ready and Disconnected panicked when broadcast returned an error matching context.DeadlineExceeded. But broadcast's only DeadlineExceeded source is the caller's ctx.Err() — a benign expiry (e.g. a slow CI runner blowing a test's WithTimeout). The REAL deadlock path (a subscriber failing to drain within broadcastTimeout while the PeerManager mutex is held) returned a plain fmt.Errorf the guard never matched. So the guard both false-positived on benign expiry AND missed the case it was meant to catch. - Introduce the errors.Is-able sentinel ErrBroadcastDeadlock and wrap it into the capacity-exhaustion error. - Both guards now panic ONLY on ErrBroadcastDeadlock; caller ctx cancellation/expiry is logged at debug and skipped, no panic. - Add regression tests: benign canceled-ctx broadcast does not panic; the sentinel is matchable and distinct from context.DeadlineExceeded. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Stop() called bs.cancel() — which closes the quit channel that Wait() blocks on — BEFORE atomic.StoreUint32(&bs.running, 0). A goroutine in Wait() could therefore unblock while IsRunning() still reported true, violating the documented contract that "Wait blocks until the service is stopped". Swap the two statements so running is set to 0 before cancel(), guaranteeing IsRunning() is false by the time Wait() returns. OnStop() ordering is unchanged. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…-race wedge
OnStart spawns the producer/consumer goroutines (go runHandler ...) but
BaseService.Start only sets running=1 AFTER OnStart returns. A handler goroutine
scheduled in that window read IsRunning()==false and exited at birth, freezing
s.height at 1 forever — the dominant cause of TestReactor_SyncTime flakiness
("expected node to be partially synced" never satisfied). This is the start-side
twin of the BaseService.Stop ordering bug.
Loop on ctx.Err() == nil instead of s.IsRunning(). The context is already live
when OnStart runs, so there is no dependency on the running-flag timing. The two
exits remain correct and leak-free: full-reactor cancellation ends the loop via
ctx.Err(), and a manual synchronizer.Stop() ends it via the kept
ErrWorkerPoolStopped early-return (Send/Receive observe the stopped pool).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The assertion getLastSyncRate() > 0.001 raced the sync-rate filter: with the default monitorInterval=100 the rate is only computed once the syncing node reaches height ~101, so a slow run could blow the 10s window before the rate became non-zero — a separate, rarer flake from the start-race wedge. Add a WithSynchronizerOptions reactor option that forwards OptionFuncs to the synchronizer built in OnStart, and use it in the test to set WithMonitorInterval(10) so the rate computes after ~10 blocks. This makes the assertion deterministic rather than widening the Eventually window. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous fix looped runHandler on the EXTERNAL reactor context, but OnStop stops the worker pool without cancelling that context. On the common switch-to-consensus path, poolRoutine calls synchronizer.Stop() while the parent context is still live: consumeJobResult exits (Receive -> ErrWorkerPoolStopped), but a caught-up produceJob keeps sleeping 50ms and returning nil — it never calls Send, never sees ErrWorkerPoolStopped, and ctx.Err() stays nil. The producer goroutine and its timer leaked for the entire consensus phase. - OnStart derives s.ctx, s.cancel = context.WithCancel(ctx) and runs the worker pool + handlers under s.ctx. It is created live before the goroutines spawn, so the original goroutine start-race stays fixed without consulting IsRunning(). - OnStop calls s.cancel() before workerPool.Stop(), so cancellation releases both handlers even when the caller's context is still live. - produceJob's idle wait now selects on ctx.Done() so a caught-up producer observes cancellation promptly instead of blocking on a bare sleep. Add TestStopReleasesHandlers (leaktest) reproducing the exact leak: Stop() with a live parent context and a caught-up job generator must release the handlers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…riant Send and Receive read jobCh/resultCh/doneCh locklessly while Reset/Run/Stop reassign or close them under lifecycleMtx. This is safe today only because callers serialize the lifecycle against in-flight Send/Receive (service OnStart/OnStop). Document the invariant on WorkerPool so future callers do not introduce a data race by overlapping them. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
|
Warning Review limit reached
More reviews will be available in 57 minutes and 26 seconds. Learn how PR review limits work. Your organization has used up its prepaid credits, and credit purchases are no longer available. Enable the review add-on in the billing tab to keep reviews running — you're only billed for reviews past your plan's rate limits ($0.25/file). ⌛ How to resolve this issue?After more reviews become available, a review can be triggered using the We recommend that you space out your commits to avoid hitting the rate limit. 🚦 How do rate limits work?CodeRabbit enforces hourly rate limits for each developer per organization. Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available. Please see our Fair Usage Limits Policy for further information. ℹ️ Review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (8)
✨ Finishing Touches🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Pull request overview
This PR addresses long-standing CI “flake” failures by fixing concurrency/lifecycle races across the workerpool, blocksync synchronizer/reactor, PeerManager broadcast deadlock detection, and BaseService shutdown ordering.
Changes:
- Workerpool: serialize lifecycle transitions (Start/Run/Reset/Stop) under a dedicated mutex and document concurrency invariants.
- Blocksync: drive handler goroutines from a synchronizer-owned cancelable context; make producer/consumer handlers exit cleanly on workerpool stop; add leak test and determinism hooks for monitor interval.
- P2P/service: add an errors.Is()-matchable broadcast-deadlock sentinel and adjust BaseService.Stop ordering so Wait()/IsRunning() become consistent.
Reviewed changes
Copilot reviewed 8 out of 8 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| libs/workerpool/worker_pool.go | Adds lifecycle mutex + helpers to prevent Stop/Run interleavings; documents invariants. |
| libs/service/service.go | Ensures running=0 is stored before cancel so Wait() and IsRunning() align. |
| internal/p2p/peermanager.go | Introduces ErrBroadcastDeadlock sentinel and wraps broadcast timeout errors. |
| internal/p2p/peermanager_test.go | Adds coverage for benign ctx expiry vs deadlock sentinel behavior. |
| internal/blocksync/synchronizer.go | Switches handlers to synchronizer-owned context; exits loops on pool stop; refactors handler signatures. |
| internal/blocksync/synchronizer_test.go | Adds leaktest to lock down Stop releasing handler goroutines. |
| internal/blocksync/reactor.go | Adds reactor option forwarding to synchronizer construction. |
| internal/blocksync/reactor_test.go | Makes sync-rate assertion deterministic via monitor interval option. |
Comments suppressed due to low confidence (1)
libs/workerpool/worker_pool.go:282
- resetLocked assumes p.workers is already allocated and indexed up to initPoolSize. If Stop() is called before the first Start()/Run() (so workers were never initialized), stopped becomes true and a subsequent Reset()/Run() will panic with an index-out-of-range/nil slice when writing p.workers[i]. Reset should be safe regardless of whether workers were previously started.
// resetLocked re-initializes channels and workers. The caller must hold lifecycleMtx.
func (p *WorkerPool) resetLocked() {
if !p.stopped.Swap(false) {
return
}
p.doneCh = make(chan struct{})
p.jobCh = make(chan *Job, p.initPoolSize)
p.resultCh = make(chan Result, p.initPoolSize*2)
for i := 0; i < p.initPoolSize; i++ {
p.workers[i] = newWorker(i, p.jobCh, p.resultCh, p.doneCh, p.logger)
}
}
| s.jobProgressCounter.Add(1) | ||
| job, err := s.jobGen.nextJob(ctx) | ||
| if err != nil { | ||
| s.logger.Error("cannot create a next job", "error", err) | ||
| return | ||
| return nil | ||
| } | ||
| err = s.workerPool.Send(ctx, job) | ||
| if err != nil { | ||
| if errors.Is(err, workerpool.ErrWorkerPoolStopped) { | ||
| return err | ||
| } | ||
| s.logger.Error("cannot add a job to worker-pool", "error", err) | ||
| } | ||
| return nil |
| // A subscriber that never drains its channel. | ||
| sub := peerManager.Subscribe(ctx, "p2p") | ||
| _ = sub | ||
|
|
||
| added, err := peerManager.Add(a) | ||
| require.NoError(t, err) | ||
| require.True(t, added) | ||
| require.NoError(t, peerManager.Accepted(a.NodeID)) | ||
|
|
||
| // Fill the subscription buffer exactly to capacity with non-blocking buffered | ||
| // sends. Capacity is the unexported broadcastSubscriptionChannelCapacity (3); | ||
| // Accepted does not broadcast, so three Ready calls fill the channel. | ||
| for i := 0; i < 3; i++ { | ||
| peerManager.Ready(ctx, a.NodeID, nil) | ||
| } |
| checkLeaks := leaktest.CheckTimeout(suite.T(), 5*time.Second) | ||
|
|
||
| suite.Require().NoError(sync.Start(ctx)) | ||
| suite.Require().Eventually(sync.IsRunning, time.Second, 5*time.Millisecond) | ||
| // Give the idling producer a few iterations before stopping. | ||
| time.Sleep(50 * time.Millisecond) | ||
|
|
||
| sync.Stop() | ||
|
|
||
| // With the parent ctx still live, both handler goroutines must have exited. | ||
| suite.Require().NoError(ctx.Err()) | ||
| checkLeaks() | ||
| } |
Issue being fixed or feature implemented
Why this PR exists
TestReactor_SyncTimewedges ~10% (permanent sync stall at height 1);internal/consensuspanicspossible deadlock when sending disconnected broadcaston slow runners;TestNodeNewSeedNodefails (Wait()returns beforeIsRunning()flips). Net: chronically red, untrustworthy CI.What was done?
7 commits:
Start/Stop/Reset/Rununder one lifecycle mutex — fixes a race where aStopinterleavingRun'sReset→Startleft the pool "running" with zero workers;synchronizer.runHandlerexits onErrWorkerPoolStopped.runHandleroff the synchronizer's OWN cancelable context — fixes the goroutine start-race (handler spawned inOnStartcould die at birth becauseBaseService.Startsetsrunning=1only afterOnStartreturns) that wedgedTestReactor_SyncTime, and the stop-time producer-goroutine leak (caught-upproduceJobnever observedErrWorkerPoolStopped); deterministic monitor-interval test hook for the rate assertion.PeerManagerbroadcast deadlock-guard on a newErrBroadcastDeadlocksentinel — stops panicking on benign caller-ctx expiry while actually catching the real capacity-exhaustion deadlock it previously missed.BaseService.Stopclearsrunning=0beforecancel()soWait()can't return whileIsRunning()is still true.How Has This Been Tested?
-race:TestReactor_SyncTime0/50 (was 5/50),TestNodeNewSeedNode0/200,TestReactor_NoBlockResponse15/15, newTestStopReleasesHandlers(leaktest, red-first/green-after), fullinternal/p2psuite green with 0 deadlock panics;go build ./...,go vet, gofmt, golangci-lint clean. Adversarially reviewed in two rounds — the first caught a goroutine-leak regression, now fixed and locked down by the leaktest.Breaking Changes
None — internal concurrency fixes, no API/behavior change for callers.
Notes / follow-ups (pre-existing, not introduced here)
Send/Receivemust not run concurrently withReset/Run/Stop.BaseServicecannot truly restart (never resetsquit); a secondStarton a stopped instance no-ops. Pre-existing.Checklist:
For repository code-owners and collaborators only
🤖 Co-authored by Claudius the Magnificent AI Agent