fix: peer backoff reconnect to prevent restart TLC stuck #1166
Open
jjyr wants to merge 7 commits into nervosnetwork:develop
Pull request overview
This PR aims to prevent “TLC stuck” during peer disconnect/restart by making reconnect deterministic (exponential backoff) and by improving reestablish replay determinism via persisted CommitDiff.
Changes:
- Add peer reconnect backoff scheduling/guardrails in `NetworkActor` (seeded on disconnect and dial errors).
- Persist and replay pending commitment state (`CommitDiff`) during channel reestablish, including deterministic replay ordering and deferred peer TLC update handling.
- Expand test coverage with targeted reconnect and commit-diff tests, plus additional restart/reestablish scenarios.
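
As a rough illustration of the reconnect backoff described above, the following is a minimal sketch of exponential backoff gated on the peer having an active channel. It is not the code from this PR: `ReconnectBackoff`, `schedule_reconnect`, the field names, and the base/cap values are all hypothetical and chosen only for the example.

```rust
use std::time::Duration;

/// Hypothetical per-peer reconnect state (illustrative names, not from the PR).
#[derive(Debug, Clone, PartialEq)]
struct ReconnectBackoff {
    /// Number of reconnect attempts scheduled since the last reset.
    attempts: u32,
    /// Delay used for the first attempt.
    base: Duration,
    /// Upper bound on any single delay.
    max: Duration,
}

impl ReconnectBackoff {
    fn new(base: Duration, max: Duration) -> Self {
        Self { attempts: 0, base, max }
    }

    /// Returns the delay before the next dial: base * 2^attempts, capped at `max`.
    fn next_delay(&mut self) -> Duration {
        let factor = 2u32.saturating_pow(self.attempts);
        let delay = self.base.saturating_mul(factor).min(self.max);
        self.attempts = self.attempts.saturating_add(1);
        delay
    }

    /// Clears the attempt counter, e.g. once the peer is connected again.
    fn reset(&mut self) {
        self.attempts = 0;
    }
}

/// Schedules a reconnect only when the peer has an active channel
/// (assumption: peers without one are not redialed automatically).
fn schedule_reconnect(
    has_active_channel: bool,
    backoff: &mut ReconnectBackoff,
) -> Option<Duration> {
    if has_active_channel {
        Some(backoff.next_delay())
    } else {
        None
    }
}
```

With a 1 s base and an 8 s cap, successive calls for a peer with an active channel yield 1 s, 2 s, 4 s, 8 s, and then stay at 8 s; a peer without an active channel yields `None`. Because the delay depends only on the attempt counter, the sequence is deterministic, which is the property the overview emphasizes.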
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| crates/fiber-lib/src/fiber/network.rs | Implements reconnect-backoff state machine and triggers on disconnect/dial errors. |
| crates/fiber-lib/src/fiber/channel.rs | Persists CommitDiff, replays it on reestablish, and adds deferred peer TLC update handling. |
| crates/fiber-lib/src/store/store_impl/mod.rs | Adds KV persistence for pending CommitDiff. |
| crates/fiber-lib/src/store/schema.rs | Introduces DB prefix for PendingCommitDiff. |
| crates/fiber-lib/src/fiber/tests/channel_commit_diff.rs | New unit tests for CommitDiff validation/ordering helpers. |
| crates/fiber-lib/src/fiber/tests/channel.rs | Adds reconnect-backoff tests and multiple restart/reestablish scenarios (incl. ignored ring repro). |
| crates/fiber-lib/src/fiber/tests/network.rs | Adds test ensuring reconnect-backoff is skipped without direct active channels. |
| crates/fiber-lib/src/store/tests/store.rs | Updates fixtures for newly added ChannelActorState fields. |
| crates/fiber-lib/src/store/sample/sample_channel.rs | Updates sample ChannelActorState builders for new fields. |
| crates/fiber-lib/src/fiber/tests/settle_tlc_set_command_tests.rs | Updates mock store + state builders for new CommitDiff/state fields. |
| crates/fiber-lib/src/fiber/tests/mod.rs | Registers new channel_commit_diff test module. |
Blocked by #1111
Scope
This PR is based on `repro-restart-upstream-develop` and only includes commits after that base branch.
Root Cause of TLC Stuck
The primary issue is a reconnect gap after peer disconnect/restart:
- `Init` exchange.
- `waiting_ack` can remain gated for a long time, which manifests as stuck TLCs.

In short: the first failure point is reconnect not becoming ready in time, not the core TLC state machine itself.
Fix Implemented
- `NetworkActor`.
- `DialerError` happens.
Test Improvements
- `test_peer_disconnect_with_active_channel_enters_backoff_reconnect`
- `test_startup_dial_error_with_active_channel_enters_backoff_reconnect`
- `test_peer_disconnect_without_active_channel_skips_backoff_reconnect`
Validation
Verified with:
- `cargo check -p fnn --tests --quiet`
- `test_ring_self_payments_then_restart_two_nodes` (`--run-ignored ignored-only`)
GitHub Diff Link (base vs PR branch)
jjyr/fiber@repro-restart-upstream-develop...pr/ring-restart-clean