
server: fix peer add/done race between peerHandler and syncManager#2480

Open
Aharonee wants to merge 4 commits into btcsuite:master from Aharonee:bugfix/peer_race_condition

Conversation


@Aharonee Aharonee commented Feb 12, 2026

Summary

Fix a race condition where the sync manager can permanently get stuck with a dead sync peer after rapid peer connect/disconnect cycles.

The Race Condition

peerDoneHandler ran as a separate goroutine per peer and independently notified two event loops about a disconnect:

  1. It sent to the donePeers channel (consumed by peerHandler).
  2. It called syncManager.DonePeer() directly (sends to sm.msgChan, consumed by blockHandler).

Meanwhile, peerHandler only called syncManager.NewPeer() when it processed the newPeers channel. Because these two paths were unsynchronized, blockHandler could observe DonePeer before NewPeer for the same peer.

A second vector existed even if DonePeer were moved into peerHandler: two separate buffered channels (newPeers/donePeers) let Go's select randomly pick the done case before the add case when both were ready simultaneously.

A third vector existed due to negotiateTimeout: if the 30s timeout in peer.Peer.start() fired between verAckReceived = true and the OnVerAck callback completing, peerDoneHandler could observe VerAckReceived() == true and send peerDone before the OnVerAck callback sent peerAdd.

Consequences: The sync manager receives DonePeer for an unknown peer (logged as a warning, no cleanup). Then NewPeer arrives for the already-dead peer -- the sync manager registers it as a candidate and potentially selects it as syncPeer. Since it is already disconnected, no subsequent DonePeer arrives to clear it. The node is stuck: it believes it has a sync peer, ignores new candidates, and never makes chain progress.

What Triggers It

Any scenario that produces rapid connect/disconnect cycles:

  • Attacker traffic (connections that complete the version/verack handshake then immediately drop)
  • Flaky network conditions with many short-lived peers
  • High peer churn under load (e.g., maxpeers limit causing immediate disconnects)

The Fix

Three structural changes eliminate all race vectors:

  1. Merge newPeers and donePeers into a single peerLifecycle channel. A single FIFO channel eliminates the select-ambiguity vector where Go's select could pick done before add.

  2. Move syncManager.DonePeer() and orphan eviction into handleDonePeerMsg. All sync manager notifications now flow through the peerHandler goroutine.

  3. Make peerLifecycleHandler (renamed from peerDoneHandler) the sole sender of both peerAdd and peerDone for each peer. OnVerAck no longer sends to the channel directly; it closes a signal channel (verAckCh). peerLifecycleHandler selects on verAckCh vs peer.Done() (new method exposing the peer's quit channel), sends peerAdd if verack was received, then waits for disconnect and sends peerDone. Because both sends originate from the same goroutine, peerAdd is never enqueued after peerDone. If both channels are ready simultaneously, Go's select is nondeterministic so peerAdd may be skipped -- this is harmless (the peer is already disconnected) and documented in the peerLifecycleEvent comment.

Additionally, OnVerAck is guarded against double-close: if called more than once on the same peer, it logs an error instead of panicking.

Reproducing on master (without the fix)

The included integration test can demonstrate the corruption on an unpatched master branch:

git checkout master
git checkout bugfix/peer_race_condition -- integration/sync_race_test.go
go test -tags=rpctest -v -run TestSyncManagerRaceCorruption ./integration/ -count=10 -timeout 900s

Test Plan

  • go build ./... compiles cleanly
  • go test -tags=rpctest -v -run TestSyncManagerRaceCorruption ./integration/ -count=10 -timeout 900s passes
  • TestPreVerackDisconnect passes (disconnect before verack)
  • Existing integration tests unaffected
  • TestOnVerAckDoubleCall -- unit test calling OnVerAck twice on the same serverPeer, asserting no panic and verAckCh remains closed
  • TestPeerLifecycleOrdering -- unit test asserting verack-then-disconnect emits peerAdd followed by peerDone
  • TestPeerLifecycleSimultaneousReady -- unit test asserting stable behavior when both verAckCh and Peer.Done() are ready simultaneously (100 iterations, verifying ordering invariant holds regardless of which select case wins)

@Aharonee Aharonee force-pushed the bugfix/peer_race_condition branch from ee422e0 to 50b62a3 Compare February 12, 2026 10:18
coveralls commented Feb 12, 2026

Pull Request Test Coverage Report for Build 22067556256

Details

  • 0 of 41 (0.0%) changed or added relevant lines in 2 files are covered.
  • 74 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.1%) to 54.846%

Changes Missing Coverage       Covered Lines  Changed/Added Lines  %
peer/peer.go                   0              3                    0.0%
server.go                      0              38                   0.0%

Files with Coverage Reduction  New Missed Lines  %
database/ffldb/blockio.go      4                 88.81%
rpcclient/infrastructure.go    70                42.62%

Totals Coverage Status
Change from base Build 20942501138: -0.1%
Covered Lines: 31142
Relevant Lines: 56781


naorye commented Feb 12, 2026

I'm experiencing the same issue. Wow, really need this.

@yyforyongyu yyforyongyu self-requested a review February 13, 2026 02:48
server.go Outdated
)

// peerLifecycleEvent represents a peer connection or disconnection event.
// Using a single channel for both event types guarantees FIFO ordering:
Contributor

Do we have the "first-in" part? Can OnVerAck be delayed and send its part after "done" event is sent? E.g. if OnVerAck runs longer than negotiateTimeout.

Contributor Author

Good catch, there seems to still be a potential race in that scenario.
I've pushed a commit which changes the peerDoneHandler into peerLifecycleHandler, and delegates responsibility for both add peer and done peer events to it.
That way a single goroutine will manage synchronization and correct ordering of the peer lifecycle events.

Does that make sense?

Contributor

This change looks good to me!

peerDoneHandler ran as a separate goroutine per peer and independently
notified both peerHandler (via donePeers channel) and the sync manager
(via syncManager.DonePeer) about a peer disconnect. Because these two
sends were unsynchronized, the sync manager could observe DonePeer
before NewPeer when a peer connected and disconnected quickly. This
caused the sync manager to log "unknown peer", then later register the
already-dead peer as a sync candidate that was never cleaned up,
potentially leaving it stuck with a dead sync peer.

Two structural changes eliminate the race:

1. Merge the newPeers and donePeers channels into a single
   peerLifecycle channel. Since OnVerAck (add) always fires before
   WaitForDisconnect returns (done), a single FIFO channel guarantees
   peerHandler always processes add before done for a given peer,
   removing the select-ambiguity where Go could pick done first.

2. Move the syncManager.DonePeer call and orphan eviction from
   peerDoneHandler into handleDonePeerMsg, which runs inside
   peerHandler. All sync manager peer lifecycle notifications now
   originate from the single peerHandler goroutine and flow into
   sm.msgChan in guaranteed add-before-done order.
@Aharonee Aharonee force-pushed the bugfix/peer_race_condition branch from 50b62a3 to 091b790 Compare February 16, 2026 09:46
Address review feedback on the peer add/done race fix:

- Make peerLifecycleHandler (renamed from peerDoneHandler) the sole
  sender of both peerAdd and peerDone events for each peer. OnVerAck
  now closes a signal channel (verAckCh) instead of sending directly,
  and peerLifecycleHandler selects on verAckCh vs peer.Done() to
  decide whether to send peerAdd before peerDone. This guarantees
  ordering by construction: a single goroutine sends both events
  sequentially, eliminating the negotiateTimeout race window.

- Add Done() method to peer.Peer exposing the quit channel read-only,
  enabling select-based disconnect detection from server code.

- Remove the now-unused AddPeer method.

- Address style feedback: 80-char line limit, empty lines between
  switch cases, break long function calls, use require.GreaterOrEqualf
  instead of if+Fatalf, bump syncRaceConcurrency to 300 for
  backpressure testing, add TestPreVerackDisconnect for disconnect
  prior to verack.
@Aharonee Aharonee requested a review from starius February 16, 2026 15:04
Comment on lines +2315 to +2327
// peerAdd is always enqueued before peerDone.
func (s *server) peerLifecycleHandler(sp *serverPeer) {
	// Wait for the handshake to complete or the peer to
	// disconnect, whichever comes first.
	select {
	case <-sp.verAckCh:
		s.peerLifecycle <- peerLifecycleEvent{
			action: peerAdd, sp: sp,
		}

	case <-sp.Peer.Done():
		// Disconnected before verack; no peerAdd needed.
	}
Contributor

If both sp.verAckCh and sp.Peer.Done() have messages to receive, select chooses pseudorandomly among them. So peerAdd can be skipped even if VerAckReceived is true, and handleDonePeerMsg will call DonePeer for an unknown peer.

Does it make sense to prioritize receiving from sp.verAckCh or check VerAckReceived if sp.Peer.Done() fired?

Contributor Author

I think it should be fine to skip add peer if done peer event has already occurred.
After all, the peer has disconnected so we can avoid notifying the server of a new peer just to notify it right away after to remove it.

My main concern was done peer being processed before add peer, but done peer processing for an unknown peer that has already disconnected seems harmless.

Contributor

You are right. My proposal would only improve log message clarity (avoiding "unknown peer" being logged), not the correctness of the code itself. It is optional.

Contributor

Note that the code no longer matches the commit message. The commit says: "Prioritize verAckCh in peerLifecycleHandler select to avoid nondeterministic peerAdd skipping when both channels are ready." However, there are no changes in peerLifecycleHandler's select logic.

I suggest implementing prioritization exactly as described in the commit message: first attempt a non-blocking receive from sp.verAckCh, and only in the default case fall back to the two-channel select. That would bring the code in line with the commit message.


const (
	peerAdd peerLifecycleAction = iota
Contributor

this formatting change should belong to the first commit

Contributor Author

I can squash both commits and force push if you prefer, but wouldn't it be more convenient for you to review the diff each time and only squash merge at the end?

Contributor

Sure, let's keep them separate for now.

server.go Outdated
Comment on lines 560 to 561
	close(sp.verAckCh)
}
Contributor

The current code allows calling OnVerAck only once. Should we safeguard for the future using sync.Once?

Contributor Author

I think safeguarding this could potentially hide a bug, and if it is called twice we would prefer a loud panic.
This is also a consistent pattern in the codebase; for example, the Peer.quit channel is not safeguarded and is closed by Peer.Disconnect().

Contributor

Hmm, maybe we can produce an error instead, if it is closed already?

select {
case <-sp.verAckCh:
	peerLog.Errorf("OnVerAck called more than once")

default:
	close(sp.verAckCh)
}

The error won't let it pass unnoticed, but at least it won't panic and crash. What do you think?

Contributor Author

Sure, makes sense

server.go Outdated
Comment on lines +164 to +165
// goroutine (peerLifecycleHandler), guaranteeing that peerAdd is
// always enqueued before peerDone.
Contributor

I propose to adjust the comment to reflect that peerAdd may be skipped.

server.go Outdated
knownAddresses lru.Cache
banScore connmgr.DynamicBanScore
quit chan struct{}
verAckCh chan struct{} // closed when OnVerAck fires
Contributor

Formatting:

// Closed when OnVerAck fires.
verAckCh chan struct{}

server.go Outdated
)

// peerLifecycleEvent represents a peer connection or disconnection event.
// Using a single channel for both event types guarantees FIFO ordering:
Contributor

This change looks good to me!

Prioritize verAckCh in peerLifecycleHandler select to avoid
nondeterministic peerAdd skipping when both channels are ready.

Guard OnVerAck against double-close by checking the channel before
closing, logging an error instead of panicking.

Adjust peerLifecycleEvent comment to reflect that peerAdd may be
skipped when the peer disconnects before or concurrently with verack.

Fix verAckCh field comment formatting.
@Aharonee Aharonee requested a review from starius February 18, 2026 12:46
@Roasbeef Roasbeef added this to the v0.25.1 milestone Mar 10, 2026
Contributor

@starius starius left a comment

LGTM! 💾

The fix is currently validated only by integration tests. They cover the main failure mode well, but there are still unit-level coverage gaps that should be addressed with direct regression tests:

  • Add an OnVerAck idempotency test for the double-call guard (case <-sp.verAckCh): call OnVerAck twice on the same serverPeer, assert no panic on the second call, and assert verAckCh remains closed.
  • Add a deterministic peerLifecycleHandler test for the simultaneous-ready edge case: make both verAckCh and Peer.Done() ready at the same time, then assert the intended behavior is stable and documented.
  • Add a direct per-peer lifecycle ordering test: in the verack-then-disconnect path, assert the emitted events for the same peer are exactly peerAdd followed by peerDone.

I suggest adding these as regression tests that pass with the fix and fail if any part of the fix is rolled back. That ensures the core invariants are verified directly.

Comment on lines +563 to +569
select {
case <-sp.verAckCh:
	peerLog.Errorf("OnVerAck called more than once "+
		"for peer %v", sp)
default:
	close(sp.verAckCh)
}
Contributor

I propose to cover this with a direct test (not rpctest) to make sure that it works as expected (logs, not panics by closing the channel again). To be clear, the code is correct, but it is better to cover such fragile things explicitly, IMHO. Specifically: call OnVerAck twice on the same serverPeer and assert no panic + verAckCh remains closed.

Comment on lines +2315 to +2327
// peerAdd is always enqueued before peerDone.
func (s *server) peerLifecycleHandler(sp *serverPeer) {
	// Wait for the handshake to complete or the peer to
	// disconnect, whichever comes first.
	select {
	case <-sp.verAckCh:
		s.peerLifecycle <- peerLifecycleEvent{
			action: peerAdd, sp: sp,
		}

	case <-sp.Peer.Done():
		// Disconnected before verack; no peerAdd needed.
	}
Contributor

Note that the code no longer matches the commit message. The commit says: "Prioritize verAckCh in peerLifecycleHandler select to avoid nondeterministic peerAdd skipping when both channels are ready." However, there are no changes in peerLifecycleHandler's select logic.

I suggest implementing prioritization exactly as described in the commit message: first attempt a non-blocking receive from sp.verAckCh, and only in the default case fall back to the two-channel select. That would bring the code in line with the commit message.

// was received, then waits for disconnect and sends peerDone.
// Because both sends originate from this single goroutine,
// peerAdd is always enqueued before peerDone.
func (s *server) peerLifecycleHandler(sp *serverPeer) {
Contributor

Please add a deterministic unit test for peerLifecycleHandler where verack arrives before disconnect, and assert peerLifecycle emits peerAdd first and peerDone second.

We don't have internal observability in integration/sync_race_test.go since it runs against a separate btcd process, so we can add a direct regression test server to cover the fix.

The test has to build before the fix is applied, but fail because of that race that we're fixing.

for time.Now().Before(deadline) && iter < syncRaceIterations {
	for i := 0; i < syncRaceConcurrency; i++ {
		go func() {
			_ = fakePeerConn(nodeAddr)
Contributor

We should check this error and fail the test if it fails.

fakePeerConn errors are ignored and the test increments done regardless. On constrained/slow environments, many connection attempts can fail and the test may not actually execute enough handshake+disconnect cycles to stress the race as intended.

for i := 0; i < 50; i++ {
	conn, err := net.DialTimeout("tcp", nodeAddr, 5*time.Second)
	if err != nil {
		continue
Contributor

This error also should not be ignored. This may result in the test passing without meaningful stress.

Comment on lines +187 to +190
nodeTCP, err := net.ResolveTCPAddr("tcp", nodeAddr)
if err != nil {
	conn.Close()
	continue
Contributor

This error also should not be ignored. This may result in the test passing without meaningful stress.

msgVersion := wire.NewMsgVersion(me, you, nonce, 0)
msgVersion.Services = wire.SFNodeNetwork | wire.SFNodeWitness

_ = wire.WriteMessage(
Contributor

This error also should not be ignored. This may result in the test passing without meaningful stress.

Comment on lines +177 to +178
// sending verack. This produces a peerDone without a preceding
// peerAdd in the lifecycle channel.
Contributor

s/produces/is expected to produce/

We do not validate the internal state of the running btcd process, so this is only our assumption, not an actual verification.

Address review feedback on the peer add/done race fix:

Add three direct unit tests in server_test.go that exercise the fix
without the full server or rpctest harness:

- TestOnVerAckDoubleCall: call OnVerAck twice on the same serverPeer,
  assert no panic and verAckCh remains closed.
- TestPeerLifecycleOrdering: verack before disconnect emits peerAdd
  then peerDone in order.
- TestPeerLifecycleSimultaneousReady: both verAckCh and Peer.Done()
  ready before the handler runs; assert peerDone always arrives and
  peerAdd, if emitted, precedes it (100 iterations).

Harden integration tests in sync_race_test.go:

- Check fakePeerConn errors via require.NoError instead of discarding.
- Extract dialAndSendVersion helper for TestPreVerackDisconnect;
  check all errors instead of silently continuing.
- Fix comment wording ("produces" -> "is expected to produce").

allocz commented Apr 9, 2026

Diff looks nice, ran the entire test suite, all green. Nice work.
