
Conversation

Contributor

@pappz pappz commented Nov 14, 2025

When a peer reconnects with the same ID, other peers were not reliably notified that the old connection went offline. This caused "connection already exists" errors when attempting to establish new connections to the reconnected peer.

The issue occurred because the old peer's cleanup notification raced with the new peer's online notification. If reconnection happened before cleanup, the offline notification was silently dropped.

The fix sends an offline notification synchronously during reconnection (when AddPeer returns true), ensuring all subscribed peers receive events in the correct order (offline → online).

Added TestBindReconnectRace to validate the fix with 1000 reconnection iterations.


Checklist

  • Is it a bug fix
  • Is a typo/documentation fix
  • Is a feature enhancement
  • It is a refactor
  • Created tests that fail without the change (if possible)

By submitting this pull request, you confirm that you have read and agree to the terms of the Contributor License Agreement.

Documentation

Select exactly one:

  • I added/updated documentation for this change
  • Documentation is not needed for this change (explain why)

Docs PR URL (required if "docs added" is checked)

Paste the PR link from https://github.com/netbirdio/docs here:

https://github.com/netbirdio/docs/pull/__

Summary by CodeRabbit

  • Bug Fixes

    • Reconnection flow now emits offline/online notifications to keep peer state consistent.
  • Tests

    • Added a repeatable race-detection test for reconnect scenarios and made critical test failures fail-fast; improved test cleanup and logging.
  • Chores

    • CI test workflow updated to enable additional test tags and more verbose race-mode output.
  • Refactor

    • Manager construction changed to use a per-instance options struct, exposing configurable MTU, cleanup interval, and idle timeout.

Contributor

coderabbitai bot commented Nov 14, 2025

Walkthrough

Relay.Accept now invokes the notifier's PeerWentOffline when AddPeer reports a reconnection. A new race-detection test TestBindReconnectRace was added. The Relay client manager API now accepts per-instance ManagerOpts (MTU, cleanup interval, unused-server timeout). CI tests use -tags devcert and -race -v.

Changes

  • Relay server reconnection notification (relay/server/relay.go): When r.store.AddPeer(peer) reports a reconnection, Accept now calls r.notifier.PeerWentOffline(peer.ID()) before recording reconnection metrics.
  • Client reconnection race test (shared/relay/client/client_test.go): Add TestBindReconnectRace to repeatedly exercise reconnect/bind races; change some tests to t.Fatalf, update imports, and add per-iteration cleanup and logging.
  • Manager per-instance configuration (shared/relay/client/manager.go): Introduce exported ManagerOpts (CleanupInterval, UnusedServerTime, MTU); change NewManager to accept opts *ManagerOpts; add per-instance cleanupInterval and unusedServerTime fields, defaults, MTU validation/clamping, and use the instance cleanup interval.
  • Manager tests updated to options (shared/relay/client/manager_test.go): Update all NewManager calls to pass &ManagerOpts{...} and replace previous global-duration usage with test-local durations.
  • Client callsites updated (client/internal/connect.go, client/internal/engine_test.go): Replace the raw MTU argument with &relayClient.ManagerOpts{MTU: ...} in NewManager calls.
  • CI workflow build/tag changes (.github/workflows/golang-test-linux.yml): Use -race -v for amd64 in the Relay test matrix and add -tags devcert to go test invocations in the Linux workflow for Relay tests.

Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Relay as Relay.Accept
    participant Store as PeerStore
    participant Notifier as Notifier

    rect rgb(245,250,255)
    Note over Relay,Store: AddPeer flow with reconnection handling
    Relay->>Store: AddPeer(peer)
    alt New peer
        Store-->>Relay: false
        Relay->>Relay: proceed (record metrics)
    else Reconnection
        Store-->>Relay: true
        %% NEW: offline notification emitted on reconnection
        Relay->>Notifier: PeerWentOffline(peer.ID())
        Relay->>Relay: record reconnection metric
    end
    end
```
```mermaid
sequenceDiagram
    participant Caller as NewManager caller
    participant Manager as Manager

    rect rgb(250,245,255)
    Note over Caller,Manager: New per-instance option handling
    Caller->>Manager: NewManager(ctx, urls, peerID, opts *ManagerOpts)
    alt opts == nil
        Manager->>Manager: set defaults (MTU, cleanupInterval, unusedServerTime)
    else opts provided
        Manager->>Manager: validate MTU (clamp with warnings)
        Manager->>Manager: set cleanupInterval, unusedServerTime from opts
    end
    Manager-->>Caller: *Manager instance
    end
```

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas to inspect closely:
    • shared/relay/client/manager.go — option parsing, MTU bounds checks, defaults, and cleanup loop timing.
    • shared/relay/client/manager_test.go & shared/relay/client/client_test.go — test timing, cleanup, and stability for the new high-iteration race test.
    • relay/server/relay.go — side-effects and thread-safety when calling Notifier.PeerWentOffline on reconnection.
    • .github/workflows/golang-test-linux.yml — ensure -tags devcert and -race -v are intended across environments.

Suggested reviewers

  • crn4

Poem

🐇 I hopped through code with a curious twitch,
Notifier bell rung when peers switch and stitch.
I raced reconnects in a thousand-fold night,
MTUs snug, cleanups set — everything just right.
I nibble logs, find races, then vanish out of sight.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 5.56%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
✅ Passed checks (2 passed)
  • Title check (✅ Passed): The title clearly and specifically describes the main change: fixing a race condition in relay peer reconnection handling, the core purpose of the changeset.
  • Description check (✅ Passed): The description comprehensively explains the problem, root cause, fix, and test validation. It follows the template structure with problem context, solution details, test coverage, and completed checklist sections.

📜 Recent review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e7977da and ca9985d.

📒 Files selected for processing (4)
  • client/internal/connect.go (1 hunks)
  • client/internal/engine_test.go (7 hunks)
  • shared/relay/client/manager.go (5 hunks)
  • shared/relay/client/manager_test.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (4)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
shared/relay/client/manager.go (1)
shared/relay/auth/hmac/store.go (1)
  • TokenStore (13-16)
client/internal/engine_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
client/internal/connect.go (1)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Signal / Unit (amd64)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race -v)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Android / Build
  • GitHub Check: iOS / Build
  • GitHub Check: Client / Unit
  • GitHub Check: Windows
  • GitHub Check: Darwin
  • GitHub Check: Linux
  • GitHub Check: Client / Unit
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
🔇 Additional comments (9)
client/internal/connect.go (1)

256-258: LGTM! Clean API migration.

The relay manager initialization correctly adopts the new ManagerOpts pattern, passing the MTU through the structured configuration.

client/internal/engine_test.go (1)

232-232: LGTM! Consistent test updates.

All test cases correctly migrated to the new ManagerOpts pattern with appropriate default MTU values.

Also applies to: 378-378, 605-605, 770-770, 972-972, 1504-1504

shared/relay/client/manager_test.go (2)

19-19: LGTM! Consistent test migrations.

All test cases correctly updated to use the ManagerOpts pattern with appropriate MTU values.

Also applies to: 94-94, 99-99, 201-201, 206-206, 361-361, 367-367, 436-436, 441-441


224-225: Good test timing control.

The test-local duration variables provide better control over cleanup behavior. The timeout calculation accounts for both the cleanup interval and unused server time, with a 2-second buffer that should handle edge cases where the cleanup ticker fires just before the unused server time expires.

Also applies to: 286-290, 317-317

shared/relay/client/manager.go (5)

17-23: LGTM! Well-defined constants.

The default values and MTU bounds are appropriate. The minimum MTU of 1280 aligns with IPv6 requirements, and the maximum of 65535 matches typical network constraints.


77-89: LGTM! Well-documented configuration struct.

The ManagerOpts struct provides a clean, extensible way to configure the Manager with clear documentation on default behaviors.


93-140: LGTM! Robust initialization with validation.

The function correctly handles nil opts, applies sensible defaults, and validates MTU bounds with appropriate warnings. The implementation safely clamps MTU values to the valid range (1280-65535).


72-75: LGTM! Per-instance configuration.

Adding these fields to the Manager struct enables per-instance configuration and improves testability, replacing the previous reliance on package-level constants.


385-385: LGTM! Correct field usage.

The cleanup loop and unused relay cleanup correctly reference the per-instance configuration fields.

Also applies to: 410-410


Comment @coderabbitai help to get the list of available commands and usage tips.

Contributor

@coderabbitai coderabbitai bot left a comment
Actionable comments posted: 1

🧹 Nitpick comments (1)
shared/relay/client/client_test.go (1)

509-509: Minor inconsistency: Check error from Close.

For consistency with lines 487-490, consider checking the error from clientAlice.Close().

Apply this diff:

```diff
-		clientAlice.Close()
+		err = clientAlice.Close()
+		if err != nil {
+			t.Errorf("iteration %d: failed to close alice after test: %s", i, err)
+		}
```
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e4b41d0 and 605d44c.

📒 Files selected for processing (2)
  • relay/server/relay.go (1 hunks)
  • shared/relay/client/client_test.go (2 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
relay/server/relay.go (1)
route/route.go (1)
  • ID (48-48)
shared/relay/client/client_test.go (3)
shared/relay/client/client.go (1)
  • NewClient (149-170)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
shared/relay/messages/id.go (1)
  • HashID (25-31)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (20)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Relay / Unit (amd64, -race)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit
  • GitHub Check: Darwin
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: release_ui
  • GitHub Check: JS / Lint
  • GitHub Check: Client / Unit
  • GitHub Check: iOS / Build
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
🔇 Additional comments (1)
relay/server/relay.go (1)

133-137: LGTM! Synchronous offline notification fixes the race condition.

The fix correctly sends the offline notification before the online notification during reconnection, ensuring subscribers receive events in the proper order and preventing "connection already exists" errors.

Comment on lines +511 to +519:

```go
		// Close Bob's connection to Alice to prepare for next iteration
		clientBob.mu.Lock()
		aliceID := messages.HashID("alice")
		if container, ok := clientBob.conns[aliceID]; ok {
			container.close()
			delete(clientBob.conns, aliceID)
		}
		clientBob.mu.Unlock()
	}
```
🛠️ Refactor suggestion | 🟠 Major

Encapsulation violation: Test accesses private client internals.

The test directly manipulates clientBob.mu and clientBob.conns, breaking encapsulation. This makes the test fragile and tightly coupled to implementation details.

Consider one of these approaches:

  1. Preferred: Add a public method to the Client type to close a specific connection by peer ID:

```go
// In client.go
func (c *Client) CloseConnToPeer(peerID string) error {
	c.mu.Lock()
	defer c.mu.Unlock()
	hashedID := messages.HashID(peerID)
	if container, ok := c.conns[hashedID]; ok {
		container.close()
		delete(c.conns, hashedID)
	}
	return nil
}
```

Then use it in the test:

```diff
-		// Close Bob's connection to Alice to prepare for next iteration
-		clientBob.mu.Lock()
-		aliceID := messages.HashID("alice")
-		if container, ok := clientBob.conns[aliceID]; ok {
-			container.close()
-			delete(clientBob.conns, aliceID)
-		}
-		clientBob.mu.Unlock()
+		// Close Bob's connection to Alice to prepare for next iteration
+		err = clientBob.CloseConnToPeer("alice")
+		if err != nil {
+			t.Errorf("iteration %d: failed to close Bob's conn to Alice: %s", i, err)
+		}
```

  2. Alternative: Create a fresh Bob client for each iteration instead of manually cleaning state.
🤖 Prompt for AI Agents
shared/relay/client/client_test.go around lines 511-519: the test directly
accesses clientBob.mu and clientBob.conns which breaks encapsulation; add a
public method on Client (e.g., CloseConnToPeer(peerID string) error) in
client.go that hashes the peer ID, locks the client, finds the connection
container, calls its close method and deletes the map entry, then call that new
method from the test to close Bob’s connection to "alice" (alternatively, create
a fresh Bob client per iteration).

Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (2)
shared/relay/client/manager.go (1)

25-27: Unify “relay client not connected” error with ErrRelayClientNotConnected

You now expose a package‑level ErrRelayClientNotConnected, but isForeignServer still returns a fresh fmt.Errorf("relay client not connected"). That means callers (like AddCloseListener) can’t reliably use errors.Is / equality to detect this condition.

Consider reusing or wrapping the sentinel error instead of creating a new one:

```diff
 func (m *Manager) isForeignServer(address string) (bool, error) {
-    rAddr, err := m.relayClient.ServerInstanceURL()
-    if err != nil {
-        return false, fmt.Errorf("relay client not connected")
-    }
+    rAddr, err := m.relayClient.ServerInstanceURL()
+    if err != nil {
+        return false, ErrRelayClientNotConnected
+    }
     return rAddr != address, nil
 }
```

This keeps error semantics consistent across the package and makes it easier for callers/tests to branch on this condition.

Also applies to: 376-382

shared/relay/client/manager_test.go (1)

19-19: Tests correctly exercise new ManagerOpts API; consider a small DRY helper

The test changes all look consistent with the new NewManager signature:

  • Using &ManagerOpts{MTU: iface.DefaultMTU} keeps the behavior identical to prior defaults for most tests.
  • TestForeignAutoClose’s custom CleanupInterval/UnusedServerTime and timeout := testCleanupInterval + testUnusedServerTime + 2*time.Second give enough headroom for the cleanup loop to run at least once after the unused window, which should keep the test stable.

If you want to reduce repetition, you could add a small helper like:

```go
func newTestManager(ctx context.Context, urls []string, peerID string) *Manager {
	return NewManager(ctx, urls, peerID, &ManagerOpts{MTU: iface.DefaultMTU})
}
```

and use it in the tests that only care about MTU, leaving the explicit ManagerOpts construction for tests that tweak timing.

Also applies to: 94-95, 99-101, 201-203, 206-209, 224-226, 286-290, 317-317, 361-365, 367-371, 436-439, 441-444

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0781908 and e7977da.

📒 Files selected for processing (2)
  • shared/relay/client/manager.go (5 hunks)
  • shared/relay/client/manager_test.go (8 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
shared/relay/client/manager_test.go (2)
shared/relay/client/manager.go (2)
  • NewManager (94-140)
  • ManagerOpts (78-89)
client/iface/iface.go (1)
  • DefaultMTU (28-28)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (19)
  • GitHub Check: Signal / Unit (386)
  • GitHub Check: Management / Unit (amd64, sqlite)
  • GitHub Check: Relay / Unit (386)
  • GitHub Check: Management / Unit (amd64, mysql)
  • GitHub Check: Relay / Unit (amd64, -race -v)
  • GitHub Check: Client / Unit (amd64)
  • GitHub Check: Client / Unit (386)
  • GitHub Check: Management / Unit (amd64, postgres)
  • GitHub Check: Client (Docker) / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Client / Unit
  • GitHub Check: Android / Build
  • GitHub Check: Client / Unit
  • GitHub Check: release_ui_darwin
  • GitHub Check: release
  • GitHub Check: JS / Lint
  • GitHub Check: Windows
  • GitHub Check: Linux
  • GitHub Check: Darwin
🔇 Additional comments (1)
shared/relay/client/manager.go (1)

17-23: Per‑instance ManagerOpts wiring and MTU bounds look solid

The introduction of ManagerOpts plus cleanupInterval/unusedServerTime/mtu on Manager is cleanly wired:

  • Defaults are initialized once, then selectively overridden when opts != nil and values are > 0, so you can safely pass nil for defaults and avoid accidentally setting zero/negative intervals.
  • MTU clamping into [minMTU, maxMTU] with warning logs ensures callers can’t misconfigure the relay path, and Manager/ServerPicker always see a sane uint16.
  • startCleanupLoop and cleanUpUnusedRelays now respect per‑instance timing, which is exactly what TestForeignAutoClose is exercising.

No issues from a correctness standpoint; this is a nice, backwards‑compatible API evolution.

Also applies to: 72-75, 77-89, 93-94, 97-119, 121-136, 385-385, 410-410

@pappz pappz force-pushed the fix/relay-reconnection-race branch from e7977da to ca9985d Compare November 15, 2025 01:49