Correctly handle lost `MonitorEvent`s #3984
Conversation
Codecov Report (coverage diff):

@@            Coverage Diff             @@
##             main    #3984      +/-   ##
==========================================
- Coverage   88.76%   88.34%    -0.43%
==========================================
  Files         176      177        +1
  Lines      128521   131671     +3150
  Branches   128521   131671     +3150
==========================================
+ Hits       114088   116321     +2233
- Misses      11844    12691      +847
- Partials     2589     2659       +70
I think this is safe to backport, with the one exception of making …
Broke part of this out in #4004, so drafting this until that lands.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), and then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload.

In previous commits we ensured that HTLC resolutions which came to the `ChannelManager` via a `MonitorEvent` were replayed on startup if the `MonitorEvent` was lost. However, in cases where the `ChannelManager` was so stale that it didn't have the payment state for an HTLC at all, we only re-add it in cases where `ChannelMonitor::get_pending_or_resolved_outbound_htlcs` includes it. This is because constantly re-adding a payment state and then failing it would generate lots of noise for users on startup (not to mention the risk of confusing stale payment events for the latest state of a payment when the `PaymentId` has been reused to retry a payment). Thus, `get_pending_or_resolved_outbound_htlcs` does not include state for HTLCs which were resolved on chain with a preimage, or HTLCs which were resolved on chain with a timeout after `ANTI_REORG_DELAY` confirmations. These criteria match the criteria for generating a `MonitorEvent`, and work well under the assumption that `MonitorEvent`s are reliably delivered. However, if they are not, and our `ChannelManager` is lost or substantially old (or, in a future where we do not persist the `ChannelManager` at all), we will not end up seeing payment resolution events for an HTLC.

Instead, we really want to tell our `ChannelMonitor`s when the resolution of an HTLC is complete. Note that we don't particularly care about non-payment HTLCs, as there is no re-hydration of state to do there - `ChannelManager` load ignores forwarded HTLCs coming back from `get_pending_or_resolved_outbound_htlcs` as there's nothing to do - we always attempt to replay the success/failure and figure out if it mattered based on whether there was still an HTLC to claim/fail.

Here we take the first step towards that notification, adding a new `ChannelMonitorUpdateStep` for the completion notification, and tracking HTLCs which make it to the `ChannelMonitor` in such updates in a new map.
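To make the mechanism concrete, here is a minimal Rust sketch of the new update step and the monitor-side tracking. `SentHTLCId` and the `ReleasePaymentComplete { htlc }` variant come from the PR itself; the `Other` placeholder variant, the `ReleasedHtlcTracking` struct, and its method names are illustrative assumptions, not LDK's actual types.

```rust
use std::collections::HashSet;

/// Stand-in for LDK's `SentHTLCId` (an opaque identifier for an outbound HTLC).
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SentHTLCId([u8; 32]);

/// Simplified view of the update step described above; only the new variant is shown.
pub enum ChannelMonitorUpdateStep {
    /// The `ChannelManager` has durably processed the resolution of this HTLC, so the
    /// monitor no longer needs to resurface it on a future startup.
    ReleasePaymentComplete { htlc: SentHTLCId },
    /// Placeholder for the pre-existing variants (commitment updates, preimages, etc.).
    Other,
}

/// Hypothetical monitor-side bookkeeping: HTLCs whose resolution has already been
/// released all the way through to a user event.
#[derive(Default)]
pub struct ReleasedHtlcTracking {
    released_htlcs: HashSet<SentHTLCId>,
}

impl ReleasedHtlcTracking {
    /// Applies the new update step, recording the HTLC so startup re-hydration can skip it.
    pub fn apply_update_step(&mut self, step: &ChannelMonitorUpdateStep) {
        if let ChannelMonitorUpdateStep::ReleasePaymentComplete { htlc } = step {
            self.released_htlcs.insert(htlc.clone());
        }
    }

    /// Whether the HTLC's resolution was already handed to the user and persisted.
    pub fn is_released(&self, htlc: &SentHTLCId) -> bool {
        self.released_htlcs.contains(htlc)
    }
}
```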
Tagging 0.2 since this now resolves a regression (deliberately) introduced in #4004 where we generate tons of duplicate `PaymentSent`/`PaymentFailed` events on restart.
Force-pushed from 037a1d7 to a791a4d
Generally seems to make sense, will take a closer look shortly
/// [`Self::CommitmentSecret`] update which clears the payment information from all un-revoked
/// counterparty commitment transactions.
ReleasePaymentComplete {
	htlc: SentHTLCId,
nit: s/htlc/htlc_id
I mean isn't it clear from the type?
Not as clear as it could be at the callsites, IMO. Since this field isn't actually an HTLC. Not a big deal though.
Same rationale as the first commit message above: here we prepare to generate the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, adding a new `PaymentCompleteUpdate` struct to track the new update before we generate the `ChannelMonitorUpdate`, and passing it through to the right places in `ChannelManager`. The only cases where we want to generate the new update are after a `PaymentSent` or `PaymentFailed` event when the event was the result of a `MonitorEvent` or the equivalent read during startup.
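As a rough sketch of what such a struct might carry: `PaymentCompleteUpdate` is the PR's name, but the fields shown (`counterparty_node_id`, `channel_id`, `htlc`) and their types are assumptions for illustration rather than the real definition.

```rust
/// Stand-in for LDK's `SentHTLCId`.
#[derive(Clone)]
pub struct SentHTLCId([u8; 32]);

/// Hypothetical shape of `PaymentCompleteUpdate`: the information the `ChannelManager`
/// holds between handing a `PaymentSent`/`PaymentFailed` event to the user and writing
/// the corresponding `ChannelMonitorUpdateStep::ReleasePaymentComplete`.
pub struct PaymentCompleteUpdate {
    /// Identifies which peer the (closed) channel was with; a serialized public key is
    /// used here purely as a placeholder.
    pub counterparty_node_id: [u8; 33],
    /// The closed channel whose `ChannelMonitor` should record the completion
    /// (placeholder for LDK's `ChannelId`).
    pub channel_id: [u8; 32],
    /// The HTLC whose user-facing resolution has completed.
    pub htlc: SentHTLCId,
}
```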
Same rationale as above: here we begin generating the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, updating functional tests for the new `ChannelMonitorUpdate`s where required.
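A sketch of the manager-side flow this enables once the user event has actually been processed. The `ChannelMonitorUpdate` container here is heavily simplified and `release_payment_complete` is a hypothetical helper name, not an LDK API.

```rust
/// Placeholder identifier types, mirroring the sketch above.
#[derive(Clone)]
pub struct SentHTLCId([u8; 32]);
pub struct PaymentCompleteUpdate {
    pub channel_id: [u8; 32],
    pub htlc: SentHTLCId,
}

/// Simplified update container; the real `ChannelMonitorUpdate` also carries an update id.
pub enum ChannelMonitorUpdateStep {
    ReleasePaymentComplete { htlc: SentHTLCId },
}
pub struct ChannelMonitorUpdate {
    pub channel_id: [u8; 32],
    pub updates: Vec<ChannelMonitorUpdateStep>,
}

/// After the `PaymentSent`/`PaymentFailed` event for this HTLC has been handed to the
/// user (and only when it originated from a `MonitorEvent` or the startup equivalent),
/// turn the deferred record into a monitor update to be persisted.
pub fn release_payment_complete(pending: PaymentCompleteUpdate) -> ChannelMonitorUpdate {
    ChannelMonitorUpdate {
        channel_id: pending.channel_id,
        updates: vec![ChannelMonitorUpdateStep::ReleasePaymentComplete { htlc: pending.htlc }],
    }
}
```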
Same rationale as above: here we, finally, begin actually using the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, skipping re-hydration of pending payments once they have been fully resolved through to a user `Event`.
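The startup side could then skip re-hydrating already-released payments along these lines. `MonitorView`, `replay_unreleased_resolutions`, and the field names are all stand-ins for the real monitor/manager interaction.

```rust
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SentHTLCId([u8; 32]);

/// Minimal stand-in for the monitor state consulted during `ChannelManager` load.
pub struct MonitorView {
    /// HTLCs resolved on chain, paired with the preimage if they were claimed.
    pub resolved_htlcs: Vec<(SentHTLCId, Option<[u8; 32]>)>,
    /// HTLCs already marked complete via `ReleasePaymentComplete`.
    pub released_htlcs: HashSet<SentHTLCId>,
}

/// Sketch of the reload logic: replay each on-chain resolution into the pending-payments
/// machinery unless the monitor says its user event was already fully handled.
pub fn replay_unreleased_resolutions(
    monitor: &MonitorView, mut replay: impl FnMut(&SentHTLCId, Option<[u8; 32]>),
) {
    for (htlc, preimage) in &monitor.resolved_htlcs {
        if monitor.released_htlcs.contains(htlc) {
            // Already resolved through to a user `Event` and recorded in the monitor;
            // re-hydrating it would only generate duplicate events.
            continue;
        }
        replay(htlc, *preimage);
    }
}
```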
When a payment was sent and ultimately completed through an on-chain HTLC claim which we discover during startup, we previously deliberately broke the payment tracking logic to keep the payment state around forever, declining to send a `PaymentPathSuccessful` event but ensuring that we don't constantly replay the claim on every startup. However, we now have logic to complete a claim by marking it as completed in a `ChannelMonitor`, rather than replaying information about the claim on every startup. Thus, we no longer need to take the conservative stance and can correctly replay claims now, generating `PaymentPathSuccessful` events and allowing the state to be removed.
Force-pushed from a791a4d to c0ab9e3
// The alternative would be to wipe the state when claiming,
// generating a `PaymentPathSuccessful` event but regenerating
// it and the `PaymentSent` on every restart until the
// `ChannelMonitor` is removed.
pending_outbounds.claim_htlc(
There's a TODO in this method in the `from_onchain` handling that can be removed now.
Event::PaymentPathSuccessful { payment_hash, .. } => {
	assert_eq!(payment_hash, Some(hash_b));
},
Just to confirm my understanding, this event could still potentially get lost?
lightning/src/ln/payment_tests.rs (outdated)
// After reload, the ChannelManager identified the failed payment and queued up the
// PaymentSent and corresponding ChannelMonitorUpdate to mark the payment handled, but
// while processing the pending `MonitorEvent`s (which were not processed before the
// monitor was persisted) we will end up with a duplicate ChannelMonitorUpdate.
expect_payment_sent(&nodes[0], payment_preimage, None, true, false);
// PaymentSent (or not, if `persist_manager_post_event` resulted in us detecting we
// already did that) and corresponding ChannelMonitorUpdate to mark the payment
// handled, but while processing the pending `MonitorEvent`s (which were not processed
// before the monitor was persisted) we will end up with a duplicate
// ChannelMonitorUpdate.
This is all one sentence and I find it a bit incomprehensible... Is there any way to break it up and/or reword?
@@ -16601,8 +16601,7 @@ where
 		}
 
 		if is_channel_closed {
-			for (htlc_source, (htlc, _)) in monitor.get_pending_or_resolved_outbound_htlcs()
-			{
+			for (htlc_source, (htlc, _)) in monitor.get_all_current_outbound_htlcs() {
The previous code would also check HTLCs in holder commitment txs, depending on what was actually confirmed onchain. I think the current behavior is correct based on the comment on `fail_unbroadcast_htlcs`, but was it incorrect before?
	false
};
if !have_action {
	self.handle_post_event_actions(ev_completion_action);
Missing test coverage for this case
	pending_events.iter().any(|(_, act)| act.as_ref() == Some(&action))
};
if !have_action {
	self.handle_post_event_actions([action]);
Missing test coverage
// This should mostly only happen on startup, but it is possible to hit it in
// rare cases where a MonitorUpdate is replayed after restart because a
// ChannelMonitor wasn't persisted after it was applied (even though the
// ChannelManager was).
Added `assert!(self.background_events_processed_since_startup.load(Ordering::Acquire));` here and all tests passed. So I think we don't have coverage of the "normal" startup case that's being outlined here?
let events = node.node.get_and_clear_pending_events();
if conditions.from_mon_update {
	check_added_monitors(node, 1);
Pre-existing, but generally for our monitor update testing it would be nice if we could check that the particular `ChannelMonitorUpdateStep` is as expected.
let during_startup =
	!self.background_events_processed_since_startup.load(Ordering::Acquire);
if during_startup {
	let event = BackgroundEvent::MonitorUpdateRegeneratedOnStartup {
Missing test coverage
This PR fixes lost `MonitorEvent`s, first in two commits which handle effectively replaying them on startup (first the preimage case, then the timeout case), and then by tracking their completion via a new `ChannelMonitorUpdate`.
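As a closing illustration of the two replay paths (preimage vs. timeout), here is a sketch. `ANTI_REORG_DELAY` is a real LDK constant (6 today), while `OnchainResolution` and `replay_action` are illustrative stand-ins for the actual startup logic.

```rust
/// Placeholder for LDK's `ANTI_REORG_DELAY` (6 confirmations in LDK today).
pub const ANTI_REORG_DELAY: u32 = 6;

/// How an outbound HTLC was resolved on chain, in simplified form.
pub enum OnchainResolution {
    /// The payment preimage was revealed on chain.
    Preimage([u8; 32]),
    /// The HTLC timed out on chain; `confirmations` counts blocks since the timeout.
    Timeout { confirmations: u32 },
}

/// Sketch of the startup replay decision for a single resolved HTLC: claims are replayed
/// as successes, timeouts only once they are `ANTI_REORG_DELAY` deep, matching the
/// criteria for generating the corresponding `MonitorEvent`.
pub fn replay_action(res: &OnchainResolution) -> &'static str {
    match res {
        OnchainResolution::Preimage(_) => "replay claim -> PaymentSent/PaymentPathSuccessful",
        OnchainResolution::Timeout { confirmations } if *confirmations >= ANTI_REORG_DELAY => {
            "replay failure -> PaymentFailed"
        },
        OnchainResolution::Timeout { .. } => "wait for ANTI_REORG_DELAY",
    }
}
```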