fix: peer backoff reconnect to prevent restart TLC stuck #1166

Open
jjyr wants to merge 7 commits into nervosnetwork:develop from jjyr:pr/ring-restart-clean

@jjyr (Collaborator) commented Mar 3, 2026

Blocked by #1111

Scope

This PR is based on repro-restart-upstream-develop and only includes commits after that base branch.

Root Cause of TLC Stuck

The primary issue is a reconnect gap after peer disconnect/restart:

  • Channel actors are stopped on disconnect.
  • Reestablishment can only start after peer reconnect + Init exchange.
  • Previously, reconnect was not deterministic in key paths (especially disconnect and startup dial failure windows).
  • During this missing-actor window, TLC replay/remove flows can be delayed or blocked, and waiting_ack can remain gated for a long time, which manifests as stuck TLCs.

In short: the first failure point is reconnect not becoming ready in time, not the core TLC state machine itself.

Fix Implemented

  1. Added deterministic peer reconnect with exponential backoff in NetworkActor.
  2. Backoff is seeded when:
      • peer disconnect is observed, or
      • DialerError happens.
  3. Backoff is only enabled when the peer still has direct active channels.
  4. Guardrails:
      • skip reconnect for user-requested disconnect,
      • skip reconnect when there is no direct active channel.
  5. Backoff state is cleared after successful peer reconnect.
  6. Kept minimal debug events for reconnect lifecycle observability.
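
The reconnect rules listed above can be sketched roughly as follows. This is an illustrative, std-only model and not the code in network.rs: `ReconnectBackoff`, `DisconnectReason`, the string peer key, and the 1s base / 60s cap are all hypothetical names and values chosen for the example.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Why the connection was lost (hypothetical enum for this sketch).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DisconnectReason {
    /// The local user explicitly asked to disconnect.
    UserRequested,
    /// The peer dropped, restarted, or a dial attempt failed.
    Unexpected,
}

/// Per-peer backoff bookkeeping: peer id -> number of failed attempts so far.
#[derive(Debug, Default)]
pub struct ReconnectBackoff {
    attempts: HashMap<String, u32>,
}

impl ReconnectBackoff {
    // Assumed values; the PR description does not state the real ones.
    const BASE: Duration = Duration::from_secs(1);
    const MAX: Duration = Duration::from_secs(60);

    /// Called on peer disconnect or dial error. Returns the delay before the
    /// next dial, or `None` when a guardrail says not to reconnect.
    pub fn on_failure(
        &mut self,
        peer: &str,
        reason: DisconnectReason,
        has_direct_active_channel: bool,
    ) -> Option<Duration> {
        // Guardrails: never fight a user-requested disconnect, and do not
        // keep dialing a peer we share no direct active channel with.
        if reason == DisconnectReason::UserRequested || !has_direct_active_channel {
            self.attempts.remove(peer);
            return None;
        }
        let attempt = self.attempts.entry(peer.to_string()).or_insert(0);
        let delay = Self::delay_for(*attempt);
        *attempt = attempt.saturating_add(1);
        Some(delay)
    }

    /// Called after a successful reconnect: clears the backoff state.
    pub fn on_connected(&mut self, peer: &str) {
        self.attempts.remove(peer);
    }

    /// BASE * 2^attempt, capped at MAX and safe against overflow.
    fn delay_for(attempt: u32) -> Duration {
        let factor = 1u32.checked_shl(attempt).unwrap_or(u32::MAX);
        Self::BASE.saturating_mul(factor).min(Self::MAX)
    }
}
```

In this model the delays for one peer grow as 1s, 2s, 4s, and so on up to the cap, and a successful reconnect resets the sequence. A production version would likely add jitter and schedule the dial through the actor's timer facilities, which this sketch leaves out.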

Test Improvements

  • Added targeted reconnect behavior tests:
      • test_peer_disconnect_with_active_channel_enters_backoff_reconnect
      • test_startup_dial_error_with_active_channel_enters_backoff_reconnect
      • test_peer_disconnect_without_active_channel_skips_backoff_reconnect
  • Updated the ring restart repro test to replace fixed sleep with state-based wait:
      • proceed immediately once conditions are met,
      • keep a 120s timeout ceiling for diagnostics.
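
To illustrate the state-based wait described above, here is a minimal synchronous sketch. The helper name `wait_until`, its signature, and the polling approach are assumptions made for this example; the actual test presumably runs on an async runtime and checks real channel state.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Polls `condition` until it returns true or `timeout` elapses.
/// Returns `Ok(elapsed)` on success and `Err(timeout)` when the ceiling is hit,
/// so the caller can print diagnostics instead of hanging.
pub fn wait_until<F>(
    mut condition: F,
    poll_interval: Duration,
    timeout: Duration,
) -> Result<Duration, Duration>
where
    F: FnMut() -> bool,
{
    let start = Instant::now();
    loop {
        // Check first so an already-satisfied condition returns immediately.
        if condition() {
            return Ok(start.elapsed());
        }
        if start.elapsed() >= timeout {
            return Err(timeout);
        }
        sleep(poll_interval);
    }
}
```

A test would call it with a closure that inspects node state and a ceiling of `Duration::from_secs(120)`. Compared with a fixed sleep, the fast path finishes as soon as the condition holds, and the slow path fails with a clear timeout rather than a flaky assertion.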

Validation

Verified with:

  • cargo check -p fnn --tests --quiet
  • the 3 reconnect tests above
  • test_ring_self_payments_then_restart_two_nodes (--run-ignored ignored-only)

GitHub Diff Link (base vs PR branch)
jjyr/fiber@repro-restart-upstream-develop...pr/ring-restart-clean

@jjyr force-pushed the pr/ring-restart-clean branch from a4a25e2 to db3c067 on March 4, 2026 07:08
@quake requested a review from Copilot on March 4, 2026 07:14
Copilot AI (Contributor) left a comment

Pull request overview

This PR aims to prevent “TLC stuck” during peer disconnect/restart by making reconnect deterministic (exponential backoff) and by improving reestablish replay determinism via persisted CommitDiff.

Changes:

  • Add peer reconnect backoff scheduling/guardrails in NetworkActor (seeded on disconnect and dial errors).
  • Persist and replay pending commitment state (CommitDiff) during channel reestablish, including deterministic replay ordering and deferred peer TLC update handling.
  • Expand test coverage with targeted reconnect and commit-diff tests, plus additional restart/reestablish scenarios.

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.

| File | Description |
| --- | --- |
| crates/fiber-lib/src/fiber/network.rs | Implements reconnect-backoff state machine and triggers on disconnect/dial errors. |
| crates/fiber-lib/src/fiber/channel.rs | Persists CommitDiff, replays it on reestablish, and adds deferred peer TLC update handling. |
| crates/fiber-lib/src/store/store_impl/mod.rs | Adds KV persistence for pending CommitDiff. |
| crates/fiber-lib/src/store/schema.rs | Introduces DB prefix for PendingCommitDiff. |
| crates/fiber-lib/src/fiber/tests/channel_commit_diff.rs | New unit tests for CommitDiff validation/ordering helpers. |
| crates/fiber-lib/src/fiber/tests/channel.rs | Adds reconnect-backoff tests and multiple restart/reestablish scenarios (incl. ignored ring repro). |
| crates/fiber-lib/src/fiber/tests/network.rs | Adds test ensuring reconnect-backoff is skipped without direct active channels. |
| crates/fiber-lib/src/store/tests/store.rs | Updates fixtures for newly added ChannelActorState fields. |
| crates/fiber-lib/src/store/sample/sample_channel.rs | Updates sample ChannelActorState builders for new fields. |
| crates/fiber-lib/src/fiber/tests/settle_tlc_set_command_tests.rs | Updates mock store + state builders for new CommitDiff/state fields. |
| crates/fiber-lib/src/fiber/tests/mod.rs | Registers new channel_commit_diff test module. |

@jjyr force-pushed the pr/ring-restart-clean branch from db3c067 to d6ae16c on March 6, 2026 09:18
@jjyr force-pushed the pr/ring-restart-clean branch from d6ae16c to 2b9579c on March 6, 2026 09:31