
Conversation

Contributor

@pappz pappz commented Nov 14, 2025

When a peer reconnects with the same ID, other peers were not reliably notified that the old connection went offline. This caused "connection already exists" errors when attempting to establish new connections to the reconnected peer.

The issue occurred because the old peer's cleanup notification raced with the new peer's online notification. If reconnection happened before cleanup, the offline notification was silently dropped.

The fix sends an offline notification synchronously during reconnection (when AddPeer returns true), ensuring all subscribed peers receive events in the correct order (offline → online).

Added TestBindReconnectRace to validate the fix with 1000 reconnection iterations.


Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes

    • Reconnection flow now emits offline/online notifications to keep peer state consistent.
  • Tests

    • Added a repeatable race-detection test for reconnect scenarios and made critical test failures fail-fast; improved test cleanup and logging.
  • Chores

    • CI test workflow updated to enable additional test tags and more verbose race-mode output.
  • Refactor

    • Manager construction changed to use a per-instance options struct, exposing configurable MTU, cleanup interval, and idle timeout.

Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

Relay.Accept now invokes the notifier's PeerWentOffline when AddPeer reports a reconnection. A new race-detection test TestBindReconnectRace was added. The Relay client manager API now accepts per-instance ManagerOpts (MTU, cleanup interval, unused-server timeout). CI tests use -tags devcert and -race -v.

Changes

  • Relay server reconnection notification (relay/server/relay.go): When r.store.AddPeer(peer) reports a reconnection, Accept now calls r.notifier.PeerWentOffline(peer.ID()) before recording reconnection metrics.
  • Client reconnection race test (shared/relay/client/client_test.go): Add TestBindReconnectRace to repeatedly exercise reconnect/bind races; change some tests to t.Fatalf, update imports, and add per-iteration cleanup and logging.
  • Manager per-instance configuration (shared/relay/client/manager.go): Introduce exported ManagerOpts (CleanupInterval, UnusedServerTime, MTU); change NewManager to accept opts *ManagerOpts; add per-instance cleanupInterval and unusedServerTime fields, defaults, MTU validation/clamping, and use the instance cleanup interval.
  • Manager tests updated to options (shared/relay/client/manager_test.go): Update all NewManager calls to pass &ManagerOpts{...} and replace previous global-duration usage with test-local durations.
  • Client callsites updated (client/internal/connect.go, client/internal/engine_test.go): Replace the raw MTU argument with &relayClient.ManagerOpts{MTU: ...} in NewManager calls.
  • CI workflow build/tag changes (.github/workflows/golang-test-linux.yml): Use -race -v for amd64 in the Relay test matrix and add -tags devcert to go test invocations in the Linux workflow for Relay tests.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Relay as Relay.Accept
    participant Store as PeerStore
    participant Notifier as Notifier

    rect rgb(245,250,255)
    Note over Relay,Store: AddPeer flow with reconnection handling
    Relay->>Store: AddPeer(peer)
    alt New peer
        Store-->>Relay: false
        Relay->>Relay: proceed (record metrics)
    else Reconnection
        Store-->>Relay: true
        %% NEW: offline notification emitted on reconnection
        Relay->>Notifier: PeerWentOffline(peer.ID())
        Relay->>Relay: record reconnection metric
    end
    end
```
```mermaid
sequenceDiagram
    participant Caller as NewManager caller
    participant Manager as Manager

    rect rgb(250,245,255)
    Note over Caller,Manager: New per-instance option handling
    Caller->>Manager: NewManager(ctx, urls, peerID, opts *ManagerOpts)
    alt opts == nil
        Manager->>Manager: set defaults (MTU, cleanupInterval, unusedServerTime)
    else opts provided
        Manager->>Manager: validate MTU (clamp with warnings)
        Manager->>Manager: set cleanupInterval, unusedServerTime from opts
    end
    Manager-->>Caller: *Manager instance
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to inspect closely:
    • shared/relay/client/manager.go — option parsing, MTU bounds checks, defaults, and cleanup loop timing.
    • shared/relay/client/manager_test.go & shared/relay/client/client_test.go — test timing, cleanup, and stability for the new high-iteration race test.
    • relay/server/relay.go — side-effects and thread-safety when calling Notifier.PeerWentOffline on reconnection.
    • .github/workflows/golang-test-linux.yml — ensure -tags devcert and -race -v are intended across environments.

Suggested reviewers

  • crn4

Poem

🐇 I hopped through code with a curious twitch,
Notifier bell rung when peers switch and stitch.
I raced reconnects in a thousand-fold night,
MTUs snug, cleanups set — everything just right.
I nibble logs, find races, then vanish out of sight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 5.56%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and specifically describes the main change: fixing a race condition in relay peer reconnection handling, the core purpose of the changeset.
  • Description check (✅ Passed): The description comprehensively explains the problem, root cause, fix, and test validation. It follows the template structure with problem context, solution details, test coverage, and completed checklist sections.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7977da and ca9985d.

📒 Files selected for processing (4)
  • client/internal/connect.go (1 hunks)
  • client/internal/engine_test.go (7 hunks)
  • shared/relay/client/manager.go (5 hunks)
  • shared/relay/client/manager_test.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
shared/relay/client/manager.go (1)
shared/relay/auth/hmac/store.go (1)
  • TokenStore (13-16)
client/internal/engine_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
client/internal/connect.go (1)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Signal / Unit (amd64)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race -v)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Linux
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
🔇 Additional comments (9)
client/internal/connect.go (1)

256-258: LGTM! Clean API migration.

The relay manager initialization correctly adopts the new ManagerOpts pattern, passing the MTU through the structured configuration.

client/internal/engine_test.go (1)

232-232: LGTM! Consistent test updates.

All test cases correctly migrated to the new ManagerOpts pattern with appropriate default MTU values.

Also applies to: 378-378, 605-605, 770-770, 972-972, 1504-1504

shared/relay/client/manager_test.go (2)

19-19: LGTM! Consistent test migrations.

All test cases correctly updated to use the ManagerOpts pattern with appropriate MTU values.

Also applies to: 94-94, 99-99, 201-201, 206-206, 361-361, 367-367, 436-436, 441-441


224-225: Good test timing control.

The test-local duration variables provide better control over cleanup behavior. The timeout calculation accounts for both the cleanup interval and unused server time, with a 2-second buffer that should handle edge cases where the cleanup ticker fires just before the unused server time expires.

Also applies to: 286-290, 317-317

shared/relay/client/manager.go (5)

17-23: LGTM! Well-defined constants.

The default values and MTU bounds are appropriate. The minimum MTU of 1280 aligns with IPv6 requirements, and the maximum of 65535 matches typical network constraints.


77-89: LGTM! Well-documented configuration struct.

The ManagerOpts struct provides a clean, extensible way to configure the Manager with clear documentation on default behaviors.


93-140: LGTM! Robust initialization with validation.

The function correctly handles nil opts, applies sensible defaults, and validates MTU bounds with appropriate warnings. The implementation safely clamps MTU values to the valid range (1280-65535).


72-75: LGTM! Per-instance configuration.

Adding these fields to the Manager struct enables per-instance configuration and improves testability, replacing the previous reliance on package-level constants.


385-385: LGTM! Correct field usage.

The cleanup loop and unused relay cleanup correctly reference the per-instance configuration fields.

Also applies to: 410-410


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
shared/relay/client/client_test.go (1)

509-509: Minor inconsistency: Check error from Close.

For consistency with lines 487-490, consider checking the error from clientAlice.Close().

Apply this diff:

```diff
-		clientAlice.Close()
+		err = clientAlice.Close()
+		if err != nil {
+			t.Errorf("iteration %d: failed to close alice after test: %s", i, err)
+		}
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4b41d0 and 605d44c.

📒 Files selected for processing (2)
  • relay/server/relay.go (1 hunks)
  • shared/relay/client/client_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
relay/server/relay.go (1)
route/route.go (1)
  • ID (48-48)
shared/relay/client/client_test.go (3)
shared/relay/client/client.go (1)
  • NewClient (149-170)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
shared/relay/messages/id.go (1)
  • HashID (25-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit
  • GitHub Check: Darwin
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: release_ui
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: iOS / Build
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
🔇 Additional comments (1)
relay/server/relay.go (1)

133-137: LGTM! Synchronous offline notification fixes the race condition.

The fix correctly sends the offline notification before the online notification during reconnection, ensuring subscribers receive events in the proper order and preventing "connection already exists" errors.

Comment on lines +511 to +519:

```go
		// Close Bob's connection to Alice to prepare for next iteration
		clientBob.mu.Lock()
		aliceID := messages.HashID("alice")
		if container, ok := clientBob.conns[aliceID]; ok {
			container.close()
			delete(clientBob.conns, aliceID)
		}
		clientBob.mu.Unlock()
	}
```
🛠️ Refactor suggestion | 🟠 Major

Encapsulation violation: Test accesses private client internals.

The test directly manipulates clientBob.mu and clientBob.conns, breaking encapsulation. This makes the test fragile and tightly coupled to implementation details.

Consider one of these approaches:

  1. Preferred: Add a public method to the Client type to close a specific connection by peer ID:

```go
// In client.go
func (c *Client) CloseConnToPeer(peerID string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	hashedID := messages.HashID(peerID)
	if container, ok := c.conns[hashedID]; ok {
		container.close()
		delete(c.conns, hashedID)
	}
	return nil
}
```

Then use it in the test:

```diff
-		// Close Bob's connection to Alice to prepare for next iteration
-		clientBob.mu.Lock()
-		aliceID := messages.HashID("alice")
-		if container, ok := clientBob.conns[aliceID]; ok {
-			container.close()
-			delete(clientBob.conns, aliceID)
-		}
-		clientBob.mu.Unlock()
+		// Close Bob's connection to Alice to prepare for next iteration
+		err = clientBob.CloseConnToPeer("alice")
+		if err != nil {
+			t.Errorf("iteration %d: failed to close Bob's conn to Alice: %s", i, err)
+		}
```

  2. Alternative: Create a fresh Bob client for each iteration instead of manually cleaning state.
🤖 Prompt for AI Agents
shared/relay/client/client_test.go around lines 511-519: the test directly
accesses clientBob.mu and clientBob.conns which breaks encapsulation; add a
public method on Client (e.g., CloseConnToPeer(peerID string) error) in
client.go that hashes the peer ID, locks the client, finds the connection
container, calls its close method and deletes the map entry, then call that new
method from the test to close Bob’s connection to "alice" (alternatively, create
a fresh Bob client per iteration).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
shared/relay/client/manager.go (1)

25-27: Unify “relay client not connected” error with ErrRelayClientNotConnected

You now expose a package‑level ErrRelayClientNotConnected, but isForeignServer still returns a fresh fmt.Errorf("relay client not connected"). That means callers (like AddCloseListener) can’t reliably use errors.Is / equality to detect this condition.

Consider reusing or wrapping the sentinel error instead of creating a new one:

```diff
 func (m *Manager) isForeignServer(address string) (bool, error) {
-    rAddr, err := m.relayClient.ServerInstanceURL()
-    if err != nil {
-        return false, fmt.Errorf("relay client not connected")
-    }
+    rAddr, err := m.relayClient.ServerInstanceURL()
+    if err != nil {
+        return false, ErrRelayClientNotConnected
+    }
     return rAddr != address, nil
 }
```

This keeps error semantics consistent across the package and makes it easier for callers/tests to branch on this condition.

Also applies to: 376-382

shared/relay/client/manager_test.go (1)

19-19: Tests correctly exercise new ManagerOpts API; consider a small DRY helper

The test changes all look consistent with the new NewManager signature:

  • Using &ManagerOpts{MTU: iface.DefaultMTU} keeps the behavior identical to prior defaults for most tests.
  • TestForeignAutoClose’s custom CleanupInterval/UnusedServerTime and timeout := testCleanupInterval + testUnusedServerTime + 2*time.Second give enough headroom for the cleanup loop to run at least once after the unused window, which should keep the test stable.

If you want to reduce repetition, you could add a small helper like:

```go
func newTestManager(ctx context.Context, urls []string, peerID string) *Manager {
	return NewManager(ctx, urls, peerID, &ManagerOpts{MTU: iface.DefaultMTU})
}
```

and use it in the tests that only care about MTU, leaving the explicit ManagerOpts construction for tests that tweak timing.

Also applies to: 94-95, 99-101, 201-203, 206-209, 224-226, 286-290, 317-317, 361-365, 367-371, 436-439, 441-444

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0781908 and e7977da.

📒 Files selected for processing (2)
  • shared/relay/client/manager.go (5 hunks)
  • shared/relay/client/manager_test.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Signal / Unit (386)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Relay / Unit (amd64, -race -v)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: JS / Lint
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Darwin
🔇 Additional comments (1)
shared/relay/client/manager.go (1)

17-23: Per‑instance ManagerOpts wiring and MTU bounds look solid

The introduction of ManagerOpts plus cleanupInterval/unusedServerTime/mtu on Manager is cleanly wired:

  • Defaults are initialized once, then selectively overridden when opts != nil and values are > 0, so you can safely pass nil for defaults and avoid accidentally setting zero/negative intervals.
  • MTU clamping into [minMTU, maxMTU] with warning logs ensures callers can’t misconfigure the relay path, and Manager/ServerPicker always see a sane uint16.
  • startCleanupLoop and cleanUpUnusedRelays now respect per‑instance timing, which is exactly what TestForeignAutoClose is exercising.

No issues from a correctness standpoint; this is a nice, backwards‑compatible API evolution.

Also applies to: 72-75, 77-89, 93-94, 97-119, 121-136, 385-385, 410-410

@pappz pappz force-pushed the fix/relay-reconnection-race branch from e7977da to ca9985d Compare November 15, 2025 01:49