Do not track HTLC IDs as separate MPP parts which need claiming #3680

Conversation

TheBlueMatt
Collaborator

When we claim an MPP payment, we need to track which channels have
had the preimage durably added to their `ChannelMonitor` to ensure
we don't remove the preimage from any `ChannelMonitor`s until all
`ChannelMonitor`s have the preimage.

Previously, we tracked each MPP part, down to the HTLC ID, as a
part which we needed to get the preimage on disk for. However, this
is not necessary - once a `ChannelMonitor` has a preimage, it
applies it to all inbound HTLCs with the same payment hash.

Further, this can cause a channel to wait on itself in cases of
high-latency synchronous persistence -
 * If we receive an MPP payment for which multiple parts came
   to us over the same channel,
 * and claim the MPP payment, creating a `ChannelMonitorUpdate` for
   the first part but enqueueing the remaining HTLC claim(s) in the
   channel's holding cell,
 * and we receive a `revoke_and_ack` for the same channel before
   the `ChannelManager::claim_payment` method completes (as each
   claim waits for the `ChannelMonitorUpdate` persistence),
 * we will cause the `ChannelMonitorUpdate` for that
   `revoke_and_ack` to go into the blocked set, waiting on the MPP
   parts to be fully claimed,
 * but when `claim_payment` goes to add the next
   `ChannelMonitorUpdate` for the MPP claim, it will be placed in
   the blocked set, since the blocked set is non-empty.

Thus, we'll end up with a `ChannelMonitorUpdate` in the blocked set
which is itself required to unblock the channel, since it is part of
the very MPP claim which blocked the channel in the first place.
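For illustration only, here is a minimal sketch of the fixed tracking model with hypothetical type and field names (the real LDK structures differ): the pending-claim set is keyed per channel rather than per (channel, HTLC ID), since once a `ChannelMonitor` has the preimage it covers every inbound HTLC with that payment hash on the channel.

```rust
use std::collections::HashSet;

/// Hypothetical identifier types standing in for LDK's real ones.
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct ChannelId([u8; 32]);
#[derive(Clone, Copy, PartialEq, Eq, Hash)]
struct PublicKey([u8; 33]);

/// Tracks which channels still need the payment preimage persisted in their
/// `ChannelMonitor` before it is safe to forget the preimage entirely.
struct PendingMppClaim {
    /// Before the fix: one entry per (channel, HTLC ID) pair, so several
    /// entries could point at the same channel.
    /// After the fix: one entry per channel, because once a `ChannelMonitor`
    /// has the preimage it applies it to every inbound HTLC with the same
    /// payment hash.
    channels_awaiting_preimage: HashSet<(PublicKey, ChannelId)>,
}

impl PendingMppClaim {
    /// Marks a channel's monitor as having durably persisted the preimage.
    /// Returns true once no channel is still waiting, i.e. the preimage can
    /// safely be removed from all monitors.
    fn monitor_updated(&mut self, node_id: PublicKey, chan_id: ChannelId) -> bool {
        self.channels_awaiting_preimage.remove(&(node_id, chan_id));
        self.channels_awaiting_preimage.is_empty()
    }
}
```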

@ldk-reviews-bot

ldk-reviews-bot commented Mar 22, 2025

👋 Thanks for assigning @wpaulino as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

Member

@shaavan shaavan left a comment


Looks great on first pass! Will take another look at the test to make sure I understand everything correctly.

@ldk-reviews-bot

🔔 1st Reminder

Hey @wpaulino! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Contributor

@wpaulino wpaulino left a comment


Great find!

Would an alternative way of fixing this be by only queueing a single monitor update for all the HTLC parts belonging to the same channel?

I'd also like to, if possible, prevent future cases like this where the channel suddenly becomes blocked resulting in a force close. Can/should we trigger a panic if a queued monitor update that we need to process to unblock a channel becomes blocked?

Contributor

@G8XSU G8XSU left a comment


Diff looks good to me overall.

@TheBlueMatt
Collaborator Author

Would an alternative way of fixing this be by only queueing a single monitor update for all the HTLC parts belonging to the same channel?

Yes, that would also fix this. I thought about trying to do that first, but it's not as nice for a backport in that it's a fairly nontrivial difference. The advantage that approach would have, of course, is that it'd be a nice performance win on top (MPP claims in one commitment update and one `ChannelMonitorUpdate` rather than several). Of course, the real way to do this is to finally fix the old `// TODO: Delay the claimed_funds relaying just like we do outbound relay!`, which would lean on the holding cell to queue up claims until a timer cycle, which also gives us privacy - but, again, that's a much larger patch for this fix and not something I want to backport.
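A rough sketch of that alternative, using hypothetical types rather than LDK's actual claim pipeline: group the MPP parts by the channel they arrived on so each channel gets a single claim batch, and hence a single commitment update and `ChannelMonitorUpdate`.

```rust
use std::collections::HashMap;

/// Hypothetical identifiers, standing in for LDK's real types.
type ChannelId = [u8; 32];
type HtlcId = u64;

/// Groups the HTLCs of a single MPP claim by the channel they arrived on, so
/// that each channel receives one claim batch instead of one claim per HTLC.
fn group_mpp_parts_by_channel(
    parts: Vec<(ChannelId, HtlcId)>,
) -> HashMap<ChannelId, Vec<HtlcId>> {
    let mut per_channel: HashMap<ChannelId, Vec<HtlcId>> = HashMap::new();
    for (chan_id, htlc_id) in parts {
        // All HTLCs on the same channel end up in one batch.
        per_channel.entry(chan_id).or_default().push(htlc_id);
    }
    per_channel
}
```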

I'd also like to, if possible, prevent future cases like this where the channel suddenly becomes blocked resulting in a force close. Can/should we trigger a panic if a queued monitor update that we need to process to unblock a channel becomes blocked?

Hmm, we definitely should, but it's not entirely clear to me how. We don't track explicit dependencies in the form of which specific monitor updates we're waiting on to release a monitor update, so I don't really see a clean assertion path here. I guess we could assert, for this specific bug, that we never have a channel with no pending monitor updates that is blocked on an MPP claim from itself (while a monitor update sits in its pending queue), but that's pretty tailored to exactly this problem, which doesn't seem all that useful.
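For illustration, the tailored check described above might look roughly like this; all names are hypothetical and this is not the actual `ChannelManager` bookkeeping:

```rust
/// Hypothetical per-channel bookkeeping, for illustration only.
struct ChannelUpdateState {
    /// Monitor updates currently sitting in the blocked queue.
    blocked_monitor_updates: Vec<u64>,
    /// Monitor updates handed to the persister but not yet completed.
    in_flight_monitor_updates: Vec<u64>,
    /// Whether this channel is waiting on an MPP claim that includes one of
    /// its own HTLCs.
    blocked_on_own_mpp_claim: bool,
}

impl ChannelUpdateState {
    /// The narrow assertion discussed above: a channel with nothing in flight
    /// should never be blocked on an MPP claim it is itself a part of while
    /// updates sit in its blocked queue.
    fn debug_check_not_self_blocked(&self) {
        debug_assert!(
            !(self.in_flight_monitor_updates.is_empty()
                && self.blocked_on_own_mpp_claim
                && !self.blocked_monitor_updates.is_empty()),
            "Channel is waiting on its own MPP claim with a queued monitor update"
        );
    }
}
```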

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from 3938d04 to 1d8f34a on March 26, 2025 at 21:47
@wpaulino
Contributor

CI is still failing

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 2 times, most recently from c3340a8 to 90acd18 on March 27, 2025 at 14:47
@TheBlueMatt
Collaborator Author

Yeah, it's hard to repro locally, so I'm kinda fighting CI directly :/

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from acad2b4 to 9c0be2e on March 28, 2025 at 17:04
@TheBlueMatt TheBlueMatt requested a review from wpaulino March 28, 2025 17:04
@TheBlueMatt
Collaborator Author

Okay, it passed the last time modulo rustfmt, so it should again this time.

@TheBlueMatt TheBlueMatt added the "weekly goal" label (Someone wants to land this this week) on Mar 28, 2025
@TheBlueMatt TheBlueMatt removed the request for review from wpaulino March 28, 2025 18:40
@TheBlueMatt
Collaborator Author

Nope, still hanging :(

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 4 times, most recently from 6b5e1f6 to 26e7e96 on March 29, 2025 at 11:58

codecov bot commented Mar 29, 2025

Codecov Report

Attention: Patch coverage is 91.39785% with 24 lines in your changes missing coverage. Please review.

Project coverage is 90.43%. Comparing base (1fc2726) to head (26e7e96).
Report is 45 commits behind head on main.

Files with missing lines Patch % Lines
lightning/src/ln/chanmon_update_fail_tests.rs 94.18% 6 Missing and 4 partials ⚠️
lightning/src/sync/debug_sync.rs 77.50% 9 Missing ⚠️
lightning/src/util/test_utils.rs 82.35% 1 Missing and 2 partials ⚠️
lightning/src/ln/functional_test_utils.rs 86.66% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3680      +/-   ##
==========================================
+ Coverage   89.25%   90.43%   +1.18%     
==========================================
  Files         155      155              
  Lines      119708   129654    +9946     
  Branches   119708   129654    +9946     
==========================================
+ Hits       106845   117255   +10410     
+ Misses      10250     9864     -386     
+ Partials     2613     2535      -78     

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch 2 times, most recently from 8a370bd to 1367c88 on March 31, 2025 at 00:21
@TheBlueMatt TheBlueMatt requested a review from wpaulino March 31, 2025 02:38
@TheBlueMatt
Collaborator Author

Okay, this is finally ready for review. The test is quite stable (tho doesn't work at all on windows).

@dunxen
Contributor

dunxen commented Mar 31, 2025

Okay, this is finally ready for review.

Nice. LGTM.

(tho doesn't work at all on windows)

Curious if it's some specific windows primitive that just won't play nice?

dunxen previously approved these changes Mar 31, 2025
@TheBlueMatt
Collaborator Author

Yeah, Windows is not a real operating system - it's still, somehow, doing cooperative multitasking in 2025 (only within the same process, but still)

In a coming commit we'll need to hold references to
`TestChannelManager` in threads, requiring that it be `Sync`.

In a coming commit we'll add a test that relies heavily on lock
fairness, which is not provided by the default Rust `Mutex`.
Luckily, `parking_lot` provides an `unlock_fair`, which we use
here, though it implies we have to manually implement lock
poisoning.
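A minimal sketch of the fairness-plus-manual-poisoning idea from the second commit message, assuming `parking_lot` as a dependency (LDK's real test-only lock wrappers in `debug_sync` are more involved):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// A test-oriented mutex that hands the lock off fairly on unlock and
/// emulates std-style poisoning, which `parking_lot` does not provide.
struct FairMutex<T> {
    inner: parking_lot::Mutex<T>,
    poisoned: AtomicBool,
}

/// Guard wrapper so we can fair-unlock and record poisoning on drop.
struct FairGuard<'a, T> {
    guard: Option<parking_lot::MutexGuard<'a, T>>,
    poisoned: &'a AtomicBool,
}

impl<T> FairMutex<T> {
    fn new(value: T) -> Self {
        Self { inner: parking_lot::Mutex::new(value), poisoned: AtomicBool::new(false) }
    }

    /// Locks, panicking if a previous holder panicked while holding the lock.
    fn lock(&self) -> FairGuard<'_, T> {
        let guard = self.inner.lock();
        assert!(!self.poisoned.load(Ordering::Acquire), "FairMutex was poisoned");
        FairGuard { guard: Some(guard), poisoned: &self.poisoned }
    }
}

impl<'a, T> std::ops::Deref for FairGuard<'a, T> {
    type Target = T;
    fn deref(&self) -> &T { self.guard.as_ref().unwrap() }
}

impl<'a, T> std::ops::DerefMut for FairGuard<'a, T> {
    fn deref_mut(&mut self) -> &mut T { self.guard.as_mut().unwrap() }
}

impl<'a, T> Drop for FairGuard<'a, T> {
    fn drop(&mut self) {
        // Record poisoning if we're unwinding, then release the lock fairly so
        // the longest-waiting thread gets it next rather than the same thread
        // immediately re-acquiring it.
        if std::thread::panicking() {
            self.poisoned.store(true, Ordering::Release);
        }
        parking_lot::MutexGuard::unlock_fair(self.guard.take().unwrap());
    }
}
```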
@TheBlueMatt
Collaborator Author

Squashed.

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from 1367c88 to c83280a on March 31, 2025 at 16:30
@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from c83280a to d5e4829 on March 31, 2025 at 17:30
@TheBlueMatt
Collaborator Author

TheBlueMatt commented Mar 31, 2025

Hmm, I think the patch fails on rebase:

$ git diff-tree -U1 c83280ab1 93b4479e4
diff --git a/lightning/src/ln/chanmon_update_fail_tests.rs b/lightning/src/ln/chanmon_update_fail_tests.rs
index b386f6fee..e1c95def4 100644
--- a/lightning/src/ln/chanmon_update_fail_tests.rs
+++ b/lightning/src/ln/chanmon_update_fail_tests.rs
@@ -3877,3 +3877,4 @@ fn test_single_channel_multiple_mpp() {
        let node_cfgs = create_node_cfgs(9, &chanmon_cfgs);
-       let node_chanmgrs = create_node_chanmgrs(9, &node_cfgs, &[None; 9]);
+       let configs = [None, None, None, None, None, None, None, None, None];
+       let node_chanmgrs = create_node_chanmgrs(9, &node_cfgs, &configs);
        let mut nodes = create_network(9, &node_cfgs, &node_chanmgrs);

@TheBlueMatt TheBlueMatt force-pushed the 2025-03-high-latency-sync-persist-hang branch from d5e4829 to 93b4479 on March 31, 2025 at 19:34
Contributor

@G8XSU G8XSU left a comment


Lgtm!

@TheBlueMatt TheBlueMatt merged commit dd4b580 into lightningdevkit:main Apr 1, 2025
25 of 27 checks passed
@TheBlueMatt
Collaborator Author

Backported in #3697

TheBlueMatt added a commit to TheBlueMatt/rust-lightning that referenced this pull request Apr 3, 2025
v0.1.2 - Apr 02, 2025 - "Foolishly Edgy Cases"

API Updates
===========

 * `lightning-invoice` is now re-exported as `lightning::bolt11_invoice`
   (lightningdevkit#3671).

Performance Improvements
========================

 * `rapid-gossip-sync` graph parsing is substantially faster, resolving a
   regression in 0.1 (lightningdevkit#3581).
 * `NetworkGraph` loading is now substantially faster and does fewer
   allocations, resulting in a 20% further improvement in `rapid-gossip-sync`
   loading when initializing from scratch (lightningdevkit#3581).
 * `ChannelMonitor`s for closed channels are no longer always re-persisted
   immediately after startup, reducing on-startup I/O burden (lightningdevkit#3619).

Bug Fixes
=========

 * BOLT 11 invoices longer than 1023 bytes (and up to 7089 bytes) now
   parse properly (lightningdevkit#3665).
 * In some cases, when using synchronous persistence with higher latency than
   the latency to communicate with peers, when receiving an MPP payment with
   multiple parts received over the same channel, a channel could hang and not
   make progress, eventually leading to a force-closure due to timed-out HTLCs.
   This has now been fixed (lightningdevkit#3680).
 * Some rare cases where multi-hop BOLT 11 route hints or multiple redundant
   blinded paths could have led to the router creating invalid `Route`s were
   fixed (lightningdevkit#3586).
 * Corrected the decay logic in `ProbabilisticScorer`'s historical buckets
   model. Note that by default historical buckets are only decayed if no new
   datapoints have been added for a channel for two weeks (lightningdevkit#3562).
 * `{Channel,Onion}MessageHandler::peer_disconnected` will now be called if a
   different message handler refused connection by returning an `Err` from its
   `peer_connected` method (lightningdevkit#3580).
 * If the counterparty broadcasts a revoked state with pending HTLCs, those
   will now be claimed with other outputs which we consider to not be
   vulnerable to pinning attacks if they are not yet claimable by our
   counterparty, potentially reducing our exposure to pinning attacks (lightningdevkit#3564).