[relay-server] Fix race condition in relay peer reconnection handling #4791
base: main
Conversation
When a peer reconnects with the same ID, other peers were not reliably notified that the old connection went offline. This caused "connection already exists" errors when attempting to establish new connections to the reconnected peer.

The issue occurred because the old peer's cleanup notification raced with the new peer's online notification. If reconnection happened before cleanup, the offline notification was silently dropped.

The fix sends an offline notification synchronously during reconnection (when AddPeer returns true), ensuring all subscribed peers receive events in the correct order (offline → online).

Added TestBindReconnectRace to validate the fix with 1000 reconnection iterations.
Walkthrough
Relay.Accept now invokes the notifier's PeerWentOffline when AddPeer reports a reconnection. A new race-detection test TestBindReconnectRace was added. The Relay client manager API now accepts per-instance ManagerOpts (MTU, cleanup interval, unused-server timeout). CI tests use
Changes
Sequence Diagram(s)
```mermaid
sequenceDiagram
    participant Relay as Relay.Accept
    participant Store as PeerStore
    participant Notifier as Notifier
    rect rgb(245,250,255)
        Note over Relay,Store: AddPeer flow with reconnection handling
        Relay->>Store: AddPeer(peer)
        alt New peer
            Store-->>Relay: false
            Relay->>Relay: proceed (record metrics)
        else Reconnection
            Store-->>Relay: true
            %% NEW call on reconnection
            Relay->>Notifier: PeerWentOffline(peer.ID())
            Relay->>Relay: record reconnection metric
        end
    end
```
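In code, the reconnection branch above boils down to sending the offline notification before announcing the new connection. The following is a minimal sketch, assuming simplified peerStore/notifier interfaces and a PeerCameOnline call; the actual types and fields in relay/server/relay.go differ:

```go
package relay

// Minimal sketch of the reconnection handling described above. The peerStore
// and notifier interfaces, the Peer struct, and PeerCameOnline are simplified
// stand-ins, not the actual relay server types.
type peerStore interface {
	// AddPeer registers the peer and reports whether a peer with the same ID
	// was already present, i.e. whether this is a reconnection.
	AddPeer(p *Peer) (reconnected bool)
}

type notifier interface {
	PeerWentOffline(id string)
	PeerCameOnline(id string)
}

type Peer struct{ id string }

func (p *Peer) ID() string { return p.id }

type Relay struct {
	store    peerStore
	notifier notifier
}

// Accept registers the peer; on reconnection it sends the offline
// notification synchronously, so subscribers always observe
// offline -> online in that order for the reconnected ID.
func (r *Relay) Accept(p *Peer) {
	if reconnected := r.store.AddPeer(p); reconnected {
		r.notifier.PeerWentOffline(p.ID()) // old connection is gone
	}
	r.notifier.PeerCameOnline(p.ID()) // new connection is live
}
```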
```mermaid
sequenceDiagram
    participant Caller as NewManager caller
    participant Manager as Manager
    rect rgb(250,245,255)
        Note over Caller,Manager: New per-instance option handling
        Caller->>Manager: NewManager(ctx, urls, peerID, opts *ManagerOpts)
        alt opts == nil
            Manager->>Manager: set defaults (MTU, cleanupInterval, unusedServerTime)
        else opts provided
            Manager->>Manager: validate MTU (clamp with warnings)
            Manager->>Manager: set cleanupInterval, unusedServerTime from opts
        end
        Manager-->>Caller: *Manager instance
    end
```
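A rough sketch of that option handling follows; the bounds and defaults used here (minMTU, maxMTU, defaultCleanupInterval, defaultUnusedTimeout) are illustrative assumptions, not the values from shared/relay/client/manager.go:

```go
package client

import (
	"log"
	"time"
)

// Illustrative bounds and defaults; the real values in manager.go may differ.
const (
	defaultMTU             uint16 = 1280
	minMTU                 uint16 = 576
	maxMTU                 uint16 = 9000
	defaultCleanupInterval        = time.Minute
	defaultUnusedTimeout          = 5 * time.Minute
)

// ManagerOpts mirrors the per-instance options described above.
type ManagerOpts struct {
	MTU              uint16
	CleanupInterval  time.Duration
	UnusedServerTime time.Duration
}

// applyOpts sketches the nil-defaulting and MTU clamping flow: nil opts keep
// defaults, zero values are ignored, and out-of-range MTUs are clamped with a
// warning.
func applyOpts(opts *ManagerOpts) (mtu uint16, cleanup, unused time.Duration) {
	mtu, cleanup, unused = defaultMTU, defaultCleanupInterval, defaultUnusedTimeout
	if opts == nil {
		return mtu, cleanup, unused
	}
	switch {
	case opts.MTU == 0:
		// keep the default
	case opts.MTU < minMTU:
		log.Printf("MTU %d too small, clamping to %d", opts.MTU, minMTU)
		mtu = minMTU
	case opts.MTU > maxMTU:
		log.Printf("MTU %d too large, clamping to %d", opts.MTU, maxMTU)
		mtu = maxMTU
	default:
		mtu = opts.MTU
	}
	if opts.CleanupInterval > 0 {
		cleanup = opts.CleanupInterval
	}
	if opts.UnusedServerTime > 0 {
		unused = opts.UnusedServerTime
	}
	return mtu, cleanup, unused
}
```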
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
📜 Recent review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
🧰 Additional context used
🧬 Code graph analysis (4)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (1)
client/internal/engine_test.go (2)
client/internal/connect.go (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
🔇 Additional comments (9)
Actionable comments posted: 1
🧹 Nitpick comments (1)
shared/relay/client/client_test.go (1)
509-509: Minor inconsistency: Check error from Close.
For consistency with lines 487-490, consider checking the error from clientAlice.Close(). Apply this diff:
```diff
-		clientAlice.Close()
+		err = clientAlice.Close()
+		if err != nil {
+			t.Errorf("iteration %d: failed to close alice after test: %s", i, err)
+		}
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
relay/server/relay.go (1 hunks)
shared/relay/client/client_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
relay/server/relay.go (1)
route/route.go (1)
ID(48-48)
shared/relay/client/client_test.go (3)
shared/relay/client/client.go (1)
NewClient(149-170)
client/iface/iface.go (1)
DefaultMTU(28-28)
shared/relay/messages/id.go (1)
HashID(25-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Relay / Unit (amd64, -race)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit (386)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client / Unit
- GitHub Check: Darwin
- GitHub Check: Windows
- GitHub Check: Linux
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: release_ui
- GitHub Check: JS / Lint
- GitHub Check: Client / Unit
- GitHub Check: iOS / Build
- GitHub Check: Android / Build
- GitHub Check: Client / Unit
🔇 Additional comments (1)
relay/server/relay.go (1)
133-137: LGTM! Synchronous offline notification fixes the race condition.
The fix correctly sends the offline notification before the online notification during reconnection, ensuring subscribers receive events in the proper order and preventing "connection already exists" errors.
```go
		// Close Bob's connection to Alice to prepare for next iteration
		clientBob.mu.Lock()
		aliceID := messages.HashID("alice")
		if container, ok := clientBob.conns[aliceID]; ok {
			container.close()
			delete(clientBob.conns, aliceID)
		}
		clientBob.mu.Unlock()
	}
```
🛠️ Refactor suggestion | 🟠 Major
Encapsulation violation: Test accesses private client internals.
The test directly manipulates clientBob.mu and clientBob.conns, breaking encapsulation. This makes the test fragile and tightly coupled to implementation details.
Consider one of these approaches:
- Preferred: Add a public method to the Client type to close a specific connection by peer ID:
```go
// In client.go
func (c *Client) CloseConnToPeer(peerID string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	hashedID := messages.HashID(peerID)
	if container, ok := c.conns[hashedID]; ok {
		container.close()
		delete(c.conns, hashedID)
	}
	return nil
}
```
Then use it in the test:
```diff
-		// Close Bob's connection to Alice to prepare for next iteration
-		clientBob.mu.Lock()
-		aliceID := messages.HashID("alice")
-		if container, ok := clientBob.conns[aliceID]; ok {
-			container.close()
-			delete(clientBob.conns, aliceID)
-		}
-		clientBob.mu.Unlock()
+		// Close Bob's connection to Alice to prepare for next iteration
+		err = clientBob.CloseConnToPeer("alice")
+		if err != nil {
+			t.Errorf("iteration %d: failed to close Bob's conn to Alice: %s", i, err)
+		}
```
- Alternative: Create a fresh Bob client for each iteration instead of manually cleaning state.
🤖 Prompt for AI Agents
shared/relay/client/client_test.go around lines 511-519: the test directly
accesses clientBob.mu and clientBob.conns which breaks encapsulation; add a
public method on Client (e.g., CloseConnToPeer(peerID string) error) in
client.go that hashes the peer ID, locks the client, finds the connection
container, calls its close method and deletes the map entry, then call that new
method from the test to close Bob’s connection to "alice" (alternatively, create
a fresh Bob client per iteration).
Actionable comments posted: 0
🧹 Nitpick comments (2)
shared/relay/client/manager.go (1)
25-27: Unify "relay client not connected" error with ErrRelayClientNotConnected
You now expose a package-level ErrRelayClientNotConnected, but isForeignServer still returns a fresh fmt.Errorf("relay client not connected"). That means callers (like AddCloseListener) can't reliably use errors.Is / equality to detect this condition.
Consider reusing or wrapping the sentinel error instead of creating a new one:
```diff
 func (m *Manager) isForeignServer(address string) (bool, error) {
-	rAddr, err := m.relayClient.ServerInstanceURL()
-	if err != nil {
-		return false, fmt.Errorf("relay client not connected")
-	}
+	rAddr, err := m.relayClient.ServerInstanceURL()
+	if err != nil {
+		return false, ErrRelayClientNotConnected
+	}
 	return rAddr != address, nil
 }
```
This keeps error semantics consistent across the package and makes it easier for callers/tests to branch on this condition.
Also applies to: 376-382
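For illustration, here is a self-contained sketch of why reusing the sentinel matters; isForeignServer below is a simplified stand-in, not the real method, but it shows callers branching with errors.Is instead of matching the message string:

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for the package-level sentinel discussed above.
var ErrRelayClientNotConnected = errors.New("relay client not connected")

// isForeignServer is a simplified stand-in for the real method: it fails
// with the sentinel while the relay client has no established connection.
func isForeignServer(connected bool, current, address string) (bool, error) {
	if !connected {
		return false, ErrRelayClientNotConnected
	}
	return current != address, nil
}

func main() {
	_, err := isForeignServer(false, "rels://a:443", "rels://b:443")
	// Callers such as AddCloseListener can now detect this case reliably
	// and decide to defer or retry instead of treating it as fatal.
	if errors.Is(err, ErrRelayClientNotConnected) {
		fmt.Println("relay client not connected yet, deferring")
	}
}
```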
shared/relay/client/manager_test.go (1)
19-19: Tests correctly exercise new ManagerOpts API; consider a small DRY helper
The test changes all look consistent with the new NewManager signature:
- Using &ManagerOpts{MTU: iface.DefaultMTU} keeps the behavior identical to prior defaults for most tests.
- TestForeignAutoClose's custom CleanupInterval/UnusedServerTime and timeout := testCleanupInterval + testUnusedServerTime + 2*time.Second give enough headroom for the cleanup loop to run at least once after the unused window, which should keep the test stable.
If you want to reduce repetition, you could add a small helper like:
```go
func newTestManager(ctx context.Context, urls []string, peerID string) *Manager {
	return NewManager(ctx, urls, peerID, &ManagerOpts{MTU: iface.DefaultMTU})
}
```
and use it in the tests that only care about MTU, leaving the explicit ManagerOpts construction for tests that tweak timing.
Also applies to: 94-95, 99-101, 201-203, 206-209, 224-226, 286-290, 317-317, 361-365, 367-371, 436-439, 441-444
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
shared/relay/client/manager.go (5 hunks)
shared/relay/client/manager_test.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (2)
NewManager(94-140)
ManagerOpts(78-89)
client/iface/iface.go (1)
DefaultMTU(28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
- GitHub Check: Signal / Unit (386)
- GitHub Check: Management / Unit (amd64, sqlite)
- GitHub Check: Relay / Unit (386)
- GitHub Check: Management / Unit (amd64, mysql)
- GitHub Check: Relay / Unit (amd64, -race -v)
- GitHub Check: Client / Unit (amd64)
- GitHub Check: Client / Unit (386)
- GitHub Check: Management / Unit (amd64, postgres)
- GitHub Check: Client (Docker) / Unit
- GitHub Check: Client / Unit
- GitHub Check: Client / Unit
- GitHub Check: Android / Build
- GitHub Check: Client / Unit
- GitHub Check: release_ui_darwin
- GitHub Check: release
- GitHub Check: JS / Lint
- GitHub Check: Windows
- GitHub Check: Linux
- GitHub Check: Darwin
🔇 Additional comments (1)
shared/relay/client/manager.go (1)
17-23: Per-instance ManagerOpts wiring and MTU bounds look solid
The introduction of ManagerOpts plus cleanupInterval/unusedServerTime/mtu on Manager is cleanly wired:
- Defaults are initialized once, then selectively overridden when opts != nil and values are > 0, so you can safely pass nil for defaults and avoid accidentally setting zero/negative intervals.
- MTU clamping into [minMTU, maxMTU] with warning logs ensures callers can't misconfigure the relay path, and Manager/ServerPicker always see a sane uint16.
- startCleanupLoop and cleanUpUnusedRelays now respect per-instance timing, which is exactly what TestForeignAutoClose is exercising.
No issues from a correctness standpoint; this is a nice, backwards-compatible API evolution.
Also applies to: 72-75, 77-89, 93-94, 97-119, 121-136, 385-385, 410-410
e7977da to ca9985d
Describe your changes
Issue ticket number and link
Stack
Checklist
Documentation
Select exactly one:
Docs PR URL (required if "docs added" is checked)
Paste the PR link from https://github.com/netbirdio/docs here:
https://github.com/netbirdio/docs/pull/__
Summary by CodeRabbit
Bug Fixes
Tests
Chores
Refactor