Do not track HTLC IDs as separate MPP parts which need claiming #3680

Conversation

TheBlueMatt
Collaborator

When we claim an MPP payment, we need to track which channels have
had the preimage durably added to their `ChannelMonitor` to ensure
we don't remove the preimage from any `ChannelMonitor`s until all
`ChannelMonitor`s have the preimage.

Previously, we tracked each MPP part, down to the HTLC ID, as a
part which we needed to get the preimage on disk for. However, this
is not necessary - once a `ChannelMonitor` has a preimage, it
applies it to all inbound HTLCs with the same payment hash.

Further, this can cause a channel to wait on itself in cases of
high-latency synchronous persistence -
 * If we receive an MPP payment for which multiple parts came
   to us over the same channel,
 * and claim the MPP payment, creating a `ChannelMonitorUpdate` for
   the first part but enqueueing the remaining HTLC claim(s) in the
   channel's holding cell,
 * and we receive a `revoke_and_ack` for the same channel before
   the `ChannelManager::claim_payment` method completes (as each
   claim waits for the `ChannelMonitorUpdate` persistence),
 * we will cause the `ChannelMonitorUpdate` for that
   `revoke_and_ack` to go into the blocked set, waiting on the MPP
   parts to be fully claimed,
 * but when `claim_payment` goes to add the next
   `ChannelMonitorUpdate` for the MPP claim, it will be placed in
   the blocked set, since the blocked set is non-empty.

Thus, we'll end up with a `ChannelMonitorUpdate` in the blocked set
which is itself required to unblock the channel, since it is part of
the very MPP claim which blocked the channel in the first place.
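For illustration only, here is a minimal sketch of the fixed tracking model with hypothetical type and field names (the real LDK structures differ): the pending-claim set is keyed per channel rather than per (channel, HTLC ID), since once a `ChannelMonitor` has the preimage it covers every inbound HTLC with that payment hash on the channel.

```rust
use std::collections::HashSet;

/// Hypothetical identifier types standing in for LDK's real ones.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChannelId([u8; 32]);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct PublicKey([u8; 33]);

/// Tracks which channels still need the payment preimage persisted in their
/// `ChannelMonitor` before it is safe to forget the preimage entirely.
struct PendingMppClaim {
    /// Before the fix: one entry per (channel, HTLC ID) pair, so several
    /// entries could point at the same channel.
    /// After the fix: one entry per channel, because once a `ChannelMonitor`
    /// has the preimage it applies it to every inbound HTLC with the same
    /// payment hash.
    channels_awaiting_preimage: HashSet<(PublicKey, ChannelId)>,
}

impl PendingMppClaim {
    /// Marks a channel's monitor as having durably persisted the preimage.
    /// Returns true once no channel is still waiting, i.e. the preimage can
    /// safely be removed from all monitors.
    fn monitor_updated(&mut self, node_id: PublicKey, chan_id: ChannelId) -> bool {
        self.channels_awaiting_preimage.remove(&(node_id, chan_id));
        self.channels_awaiting_preimage.is_empty()
    }
}
```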

@ldk-reviews-bot

ldk-reviews-bot commented Mar 22, 2025

👋 Thanks for assigning @wpaulino as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

Member

@shaavan shaavan left a comment


Looks great on first pass! Will take another look at the test to make sure I understand everything correctly.

@ldk-reviews-bot

🔔 1st Reminder

Hey @wpaulino! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Contributor

@wpaulino wpaulino left a comment


Great find!

Would an alternative way of fixing this be by only queueing a single monitor update for all the HTLC parts belonging to the same channel?

I'd also like to, if possible, prevent future cases like this where the channel suddenly becomes blocked resulting in a force close. Can/should we trigger a panic if a queued monitor update that we need to process to unblock a channel becomes blocked?

Contributor

@G8XSU G8XSU left a comment


Diff looks good to me overall.

@TheBlueMatt
Collaborator Author

Would an alternative way of fixing this be by only queueing a single monitor update for all the HTLC parts belonging to the same channel?

Yes, that would also fix this. I thought about trying to do that first, but it's not as nice for a backport in that it's a fairly nontrivial difference. The advantage that approach would have, of course, is that it'd be a nice performance win on top (MPP claims in one commitment update and one `ChannelMonitorUpdate` rather than several). Of course, the real way to do this is to finally fix the old `// TODO: Delay the claimed_funds relaying just like we do outbound relay!`, which would lean on the holding cell to queue up claims until a timer cycle, which also gives us privacy - but, again, that's a much larger patch for this fix and not something I want to backport.
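A rough sketch of that alternative, using hypothetical types rather than LDK's actual claim pipeline: group the MPP parts by the channel they arrived on so each channel gets a single claim batch, and hence a single commitment update and `ChannelMonitorUpdate`.

```rust
use std::collections::HashMap;

/// Hypothetical identifiers, standing in for LDK's real types.
type ChannelId = [u8; 32];
type HtlcId = u64;

/// Groups the HTLCs of a single MPP claim by the channel they arrived on, so
/// that each channel receives one claim batch instead of one claim per HTLC.
fn group_mpp_parts_by_channel(
    parts: Vec<(ChannelId, HtlcId)>,
) -> HashMap<ChannelId, Vec<HtlcId>> {
    let mut per_channel: HashMap<ChannelId, Vec<HtlcId>> = HashMap::new();
    for (chan_id, htlc_id) in parts {
        // All HTLCs on the same channel end up in one batch.
        per_channel.entry(chan_id).or_default().push(htlc_id);
    }
    per_channel
}
```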

I'd also like to, if possible, prevent future cases like this where the channel suddenly becomes blocked resulting in a force close. Can/should we trigger a panic if a queued monitor update that we need to process to unblock a channel becomes blocked?

Hmm, we definitely should, but it's not entirely clear to me how. We don't track explicit dependencies in the form of which specific monitor updates we're waiting on to release a monitor update, so I don't really see a clean assertion path here. I guess we could assert, for this specific bug, that we never have a channel with no pending monitor updates that is blocked on an MPP claim from itself (while a monitor update sits in its pending queue), but that's pretty tailored to exactly this problem, which doesn't seem all that useful.
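For illustration, the tailored check described above might look roughly like this; all names are hypothetical and this is not the actual `ChannelManager` bookkeeping:

```rust
/// Hypothetical per-channel bookkeeping, for illustration only.
struct ChannelUpdateState {
    /// Monitor updates currently sitting in the blocked queue.
    blocked_monitor_updates: Vec<u64>,
    /// Monitor updates handed to the persister but not yet completed.
    in_flight_monitor_updates: Vec<u64>,
    /// Whether this channel is waiting on an MPP claim that includes one of
    /// its own HTLCs.
    blocked_on_own_mpp_claim: bool,
}

impl ChannelUpdateState {
    /// The narrow assertion discussed above: a channel with nothing in flight
    /// should never be blocked on an MPP claim it is itself a part of while
    /// updates sit in its blocked queue.
    fn debug_check_not_self_blocked(&self) {
        debug_assert!(
            !(self.in_flight_monitor_updates.is_empty()
                && self.blocked_on_own_mpp_claim
                && !self.blocked_monitor_updates.is_empty()),
            "Channel is waiting on its own MPP claim with a queued monitor update"
        );
    }
}
```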

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from 3938d04 to 1d8f34a on March 26, 2025 at 21:47
@wpaulino
Contributor

CI is still failing

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 2 times, most recently from c3340a8 to 90acd18 on March 27, 2025 at 14:47
@TheBlueMatt
Collaborator Author

Yeah, it's hard to repro locally, so I'm kinda fighting CI directly :/

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from acad2b4 to 9c0be2e on March 28, 2025 at 17:04
@TheBlueMatt TheBlueMatt requested a review from wpaulino March 28, 2025 17:04
@TheBlueMatt
Collaborator Author

Okay, it passed the last time modulo rustfmt, so it should again this time.

@TheBlueMatt TheBlueMatt added the "weekly goal" label (Someone wants to land this this week) on Mar 28, 2025
@TheBlueMatt TheBlueMatt removed the request for review from wpaulino March 28, 2025 18:40
@TheBlueMatt
Collaborator Author

Nope, still hanging :(

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from 6b5e1f6 to 26e7e96 on March 29, 2025 at 11:58

codecov bot commented Mar 29, 2025

Codecov Report

Attention: Patch coverage is 91.39785% with 24 lines in your changes missing coverage. Please review.

Project coverage is 90.43%. Comparing base (1fc2726) to head (26e7e96).
Report is 45 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/chanmon_update_fail_tests.rs 94.18% 6 Missing and 4 partials ⚠️
lightning/src/sync/debug_sync.rs 77.50% 9 Missing ⚠️
lightning/src/util/test_utils.rs 82.35% 1 Missing and 2 partials ⚠️
lightning/src/ln/functional_test_utils.rs 86.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3680      +/-   ##
==========================================
+ Coverage   89.25%   90.43%   +1.18%     
==========================================
  Files         155      155              
  Lines      119708   129654    +9946     
  Branches   119708   129654    +9946     
==========================================
+ Hits       106845   117255   +10410     
+ Misses      10250     9864     -386     
+ Partials     2613     2535      -78     

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 2 times, most recently from 8a370bd to 1367c88 on March 31, 2025 at 00:21
@TheBlueMatt TheBlueMatt requested a review from wpaulino March 31, 2025 02:38
@TheBlueMatt
Collaborator Author

Okay, this is finally ready for review. The test is quite stable (tho doesn't work at all on windows).

@dunxen
Contributor

dunxen commented Mar 31, 2025

Okay, this is finally ready for review.

Nice. LGTM.

(tho doesn't work at all on windows)

Curious if it's some specific windows primitive that just won't play nice?

dunxen previously approved these changes Mar 31, 2025
@TheBlueMatt
Collaborator Author

Yeah, Windows is not a real operating system - it's still, somehow, doing cooperative multitasking in 2025 (only within the same process, but still)

In a coming commit we'll need to hold references to
`TestChannelManager` in threads, requiring that it be `Sync`.

In a coming commit we'll add a test that relies heavily on lock
fairness, which is not provided by the default Rust `Mutex`.
Luckily, `parking_lot` provides an `unlock_fair`, which we use
here, though it implies we have to manually implement lock
poisoning.
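A minimal sketch of the fairness-plus-manual-poisoning idea from the second commit message, assuming `parking_lot` as a dependency (LDK's real test-only lock wrappers in `debug_sync` are more involved):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A test-oriented mutex that hands the lock off fairly on unlock and
/// emulates std-style poisoning, which `parking_lot` does not provide.
struct FairMutex<T> {
    inner: parking_lot::Mutex<T>,
    poisoned: AtomicBool,
}

/// Guard wrapper so we can fair-unlock and record poisoning on drop.
struct FairGuard<'a, T> {
    guard: Option<parking_lot::MutexGuard<'a, T>>,
    poisoned: &'a AtomicBool,
}

impl<T> FairMutex<T> {
    fn new(value: T) -> Self {
        Self { inner: parking_lot::Mutex::new(value), poisoned: AtomicBool::new(false) }
    }

    /// Locks, panicking if a previous holder panicked while holding the lock.
    fn lock(&self) -> FairGuard<'_, T> {
        let guard = self.inner.lock();
        assert!(!self.poisoned.load(Ordering::Acquire), "FairMutex was poisoned");
        FairGuard { guard: Some(guard), poisoned: &self.poisoned }
    }
}

impl<'a, T> std::ops::Deref for FairGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { self.guard.as_ref().unwrap() }
}

impl<'a, T> std::ops::DerefMut for FairGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T { self.guard.as_mut().unwrap() }
}

impl<'a, T> Drop for FairGuard<'a, T> {
    fn drop(&mut self) {
        // Record poisoning if we're unwinding, then release the lock fairly so
        // the longest-waiting thread gets it next rather than the same thread
        // immediately re-acquiring it.
        if std::thread::panicking() {
            self.poisoned.store(true, Ordering::Release);
        }
        parking_lot::MutexGuard::unlock_fair(self.guard.take().unwrap());
    }
}
```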
@TheBlueMatt
Collaborator Author

Squashed.

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from 1367c88 to c83280a on March 31, 2025 at 16:30
@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from c83280a to d5e4829 on March 31, 2025 at 17:30
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Mar 31, 2025

Hmm, I think the patch fails on rebase:

$ git diff-tree -U1 c83280ab1 93b4479e4
diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index b386f6fee..e1c95def4 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -3877,3 +3877,4 @@ fn test_single_channel_multiple_mpp() {
        let node_cfgs = create_node_cfgs(9, &chanmon_cfgs);
-       let node_chanmgrs = create_node_chanmgrs(9, &node_cfgs, &[None; 9]);
+       let configs = [None, None, None, None, None, None, None, None, None];
+       let node_chanmgrs = create_node_chanmgrs(9, &node_cfgs, &configs);
        let mut nodes = create_network(9, &node_cfgs, &node_chanmgrs);

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from d5e4829 to 93b4479 on March 31, 2025 at 19:34
Contributor

@G8XSU G8XSU left a comment


Lgtm!

@TheBlueMatt TheBlueMatt merged commit dd4b580 into lightningdevkit:main Apr 1, 2025
25 of 27 checks passed
@TheBlueMatt
Collaborator Author

Backported in #3697

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Apr 3, 2025
v0.1.2 - Apr 02, 2025 - "Foolishly Edgy Cases"

API Updates
===========

 * `lightning-invoice` is now re-exported as `lightning::bolt11_invoice`
   (lightningdevkit#3671).

Performance Improvements
========================

 * `rapid-gossip-sync` graph parsing is substantially faster, resolving a
   regression in 0.1 (lightningdevkit#3581).
 * `NetworkGraph` loading is now substantially faster and does fewer
   allocations, resulting in a 20% further improvement in `rapid-gossip-sync`
   loading when initializing from scratch (lightningdevkit#3581).
 * `ChannelMonitor`s for closed channels are no longer always re-persisted
   immediately after startup, reducing on-startup I/O burden (lightningdevkit#3619).

Bug Fixes
=========

 * BOLT 11 invoices longer than 1023 bytes (and up to 7089 bytes) now
   parse properly (lightningdevkit#3665).
 * In some cases, when using synchronous persistence with higher latency than
   the latency to communicate with peers, when receiving an MPP payment with
   multiple parts received over the same channel, a channel could hang and not
   make progress, eventually leading to a force-closure due to timed-out HTLCs.
   This has now been fixed (lightningdevkit#3680).
 * Some rare cases where multi-hop BOLT 11 route hints or multiple redundant
   blinded paths could have led to the router creating invalid `Route`s were
   fixed (lightningdevkit#3586).
 * Corrected the decay logic in `ProbabilisticScorer`'s historical buckets
   model. Note that by default historical buckets are only decayed if no new
   datapoints have been added for a channel for two weeks (lightningdevkit#3562).
 * `{Channel,Onion}MessageHandler::peer_disconnected` will now be called if a
   different message handler refused connection by returning an `Err` from its
   `peer_connected` method (lightningdevkit#3580).
 * If the counterparty broadcasts a revoked state with pending HTLCs, those
   will now be claimed with other outputs which we consider to not be
   vulnerable to pinning attacks if they are not yet claimable by our
   counterparty, potentially reducing our exposure to pinning attacks (lightningdevkit#3564).