-
Notifications
You must be signed in to change notification settings - Fork 755
Description
Bug report
RaftCluster.Start() can leak startup goroutines when it returns an error before running is set to true.
The failure mode is generic and not tied to any one feature: Start() creates components that launch background loops using c.ctx, but if a later startup step fails, Start() returns without canceling that context. Since running is still false, a later RaftCluster.Stop() also returns early and never calls c.cancel().
Current startup path that reproduces it
On current master, this sequence is enough:
Start()createsRegionLabeler, which startsRegionLabeler.doGC().Start()createsaffinity.Manager, which starts its availability check loop.keyspaceGroupManager.Bootstrap(c.ctx)returns an error.Start()returns early whilec.ctxis still live.- The package-level
goleakcheck reports leaked goroutines.
Reproduction
I reproduced this locally with a regression test under server/cluster that makes keyspaceGroupManager.Bootstrap fail after the earlier startup components are initialized.
Run:
go test ./server/cluster -run TestStartCancelsContextOnBootstrapFailure -count=1Observed result: the test body itself passes the expected startup error, but the package fails in goleak with leaked goroutines including:
labeler.(*RegionLabeler).doGCaffinity.(*Manager).startAvailabilityCheckLoop.func2
I also saw related partially-initialized background goroutines such as hot cache workers still alive in the same run.
Expected result
If RaftCluster.Start() fails at any point after InitCluster() succeeds, all background work started off c.ctx should be torn down before returning.
Likely root cause
Start() currently relies on Stop() for normal shutdown, but Stop() exits early when running == false. That means any startup failure after background goroutines are created but before running = true has no unified rollback path.
Suggested fix
Add a startup-failure rollback path inside RaftCluster.Start() itself. For example, after InitCluster() succeeds, defer a cleanup that cancels c.ctx when err != nil.
That would make startup failures self-contained and avoid depending on Stop() for a state where running was never set.
- LLM model: gpt-5.4
- Thinking level: xhigh
- Client: Codex Desktop 26.305.950 (build 863); codex-cli 0.105.0