Correctly handle lost `MonitorEvent`s #3984
Conversation
Codecov Report (coverage diff):

@@            Coverage Diff             @@
##             main    #3984      +/-   ##
==========================================
- Coverage   88.76%   88.34%    -0.43%
==========================================
  Files         176      177        +1
  Lines      128521   131671     +3150
  Branches   128521   131671     +3150
==========================================
+ Hits       114088   116321     +2233
- Misses      11844    12691      +847
- Partials     2589     2659       +70
I think this is safe to backport, with the one exception of making …
Broke part of this out in #4004, so drafting this until that lands.
`MonitorEvent`s aren't delivered to the `ChannelManager` in a durable fashion - if the `ChannelManager` fetches the pending `MonitorEvent`s, then the `ChannelMonitor` gets persisted (e.g. due to a block update), and then the node crashes prior to persisting the `ChannelManager` again, the `MonitorEvent` and its effects on the `ChannelManager` will be lost. This isn't likely in a sync persist environment, but in an async one this could be an issue.

Note that this is only an issue for closed channels - `MonitorEvent`s only inform the `ChannelManager` that a channel is closed (which the `ChannelManager` will learn on startup or when it next tries to advance the channel state), that `ChannelMonitorUpdate` writes completed (which the `ChannelManager` will detect on startup), or that HTLCs resolved on-chain post closure. Of the three, only the last is problematic to lose prior to a reload.

In previous commits we ensured that HTLC resolutions which came to the `ChannelManager` via a `MonitorEvent` were replayed on startup if the `MonitorEvent` was lost. However, in cases where the `ChannelManager` was so stale that it didn't have the payment state for an HTLC at all, we only re-add it in cases where `ChannelMonitor::get_pending_or_resolved_outbound_htlcs` includes it. This is because constantly re-adding a payment state and then failing it would generate lots of noise for users on startup (not to mention the risk of confusing stale payment events for the latest state of a payment when the `PaymentId` has been reused to retry a payment). Thus, `get_pending_or_resolved_outbound_htlcs` does not include state for HTLCs which were resolved on chain with a preimage, or HTLCs which were resolved on chain with a timeout after `ANTI_REORG_DELAY` confirmations. These criteria match the criteria for generating a `MonitorEvent`, and work well under the assumption that `MonitorEvent`s are reliably delivered. However, if they are not, and our `ChannelManager` is lost or substantially old (or, in a future where we do not persist the `ChannelManager` at all), we will not end up seeing payment resolution events for an HTLC.

Instead, we really want to tell our `ChannelMonitor`s when the resolution of an HTLC is complete. Note that we don't particularly care about non-payment HTLCs, as there is no re-hydration of state to do there - `ChannelManager` load ignores forwarded HTLCs coming back from `get_pending_or_resolved_outbound_htlcs` as there's nothing to do - we always attempt to replay the success/failure and figure out if it mattered based on whether there was still an HTLC to claim/fail.

Here we take the first step towards that notification, adding a new `ChannelMonitorUpdateStep` for the completion notification, and tracking HTLCs which make it to the `ChannelMonitor` in such updates in a new map.
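To make the mechanism concrete, here is a minimal Rust sketch of the new update step and the monitor-side tracking. `SentHTLCId` and the `ReleasePaymentComplete { htlc }` variant come from the PR itself; the `Other` placeholder variant, the `ReleasedHtlcTracking` struct, and its method names are illustrative assumptions, not LDK's actual types.

```rust
use std::collections::HashSet;

/// Stand-in for LDK's `SentHTLCId` (an opaque identifier for an outbound HTLC).
#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SentHTLCId([u8; 32]);

/// Simplified view of the update step described above; only the new variant is shown.
pub enum ChannelMonitorUpdateStep {
    /// The `ChannelManager` has durably processed the resolution of this HTLC, so the
    /// monitor no longer needs to resurface it on a future startup.
    ReleasePaymentComplete { htlc: SentHTLCId },
    /// Placeholder for the pre-existing variants (commitment updates, preimages, etc.).
    Other,
}

/// Hypothetical monitor-side bookkeeping: HTLCs whose resolution has already been
/// released all the way through to a user event.
#[derive(Default)]
pub struct ReleasedHtlcTracking {
    released_htlcs: HashSet<SentHTLCId>,
}

impl ReleasedHtlcTracking {
    /// Applies the new update step, recording the HTLC so startup re-hydration can skip it.
    pub fn apply_update_step(&mut self, step: &ChannelMonitorUpdateStep) {
        if let ChannelMonitorUpdateStep::ReleasePaymentComplete { htlc } = step {
            self.released_htlcs.insert(htlc.clone());
        }
    }

    /// Whether the HTLC's resolution was already handed to the user and persisted.
    pub fn is_released(&self, htlc: &SentHTLCId) -> bool {
        self.released_htlcs.contains(htlc)
    }
}
```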
Tagging 0.2 since this now resolves a regression (deliberately) introduced in #4004 where we generate tons of duplicate `PaymentSent`/`PaymentFailed` events on restart.
Force-pushed from 037a1d7 to a791a4d
Generally seems to make sense, will take a closer look shortly
/// [`Self::CommitmentSecret`] update which clears the payment information from all un-revoked
/// counterparty commitment transactions.
ReleasePaymentComplete {
	htlc: SentHTLCId,
nit: s/htlc/htlc_id
I mean isn't it clear from the type?
Not as clear as it could be at the callsites, IMO. Since this field isn't actually an HTLC. Not a big deal though.
Same rationale as the first commit message above: here we prepare to generate the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, adding a new `PaymentCompleteUpdate` struct to track the new update before we generate the `ChannelMonitorUpdate`, and passing it through to the right places in `ChannelManager`. The only cases where we want to generate the new update are after a `PaymentSent` or `PaymentFailed` event when the event was the result of a `MonitorEvent` or the equivalent read during startup.
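As a rough sketch of what such a struct might carry: `PaymentCompleteUpdate` is the PR's name, but the fields shown (`counterparty_node_id`, `channel_id`, `htlc`) and their types are assumptions for illustration rather than the real definition.

```rust
/// Stand-in for LDK's `SentHTLCId`.
#[derive(Clone)]
pub struct SentHTLCId([u8; 32]);

/// Hypothetical shape of `PaymentCompleteUpdate`: the information the `ChannelManager`
/// holds between handing a `PaymentSent`/`PaymentFailed` event to the user and writing
/// the corresponding `ChannelMonitorUpdateStep::ReleasePaymentComplete`.
pub struct PaymentCompleteUpdate {
    /// Identifies which peer the (closed) channel was with; a serialized public key is
    /// used here purely as a placeholder.
    pub counterparty_node_id: [u8; 33],
    /// The closed channel whose `ChannelMonitor` should record the completion
    /// (placeholder for LDK's `ChannelId`).
    pub channel_id: [u8; 32],
    /// The HTLC whose user-facing resolution has completed.
    pub htlc: SentHTLCId,
}
```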
Same rationale as above: here we begin generating the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, updating functional tests for the new `ChannelMonitorUpdate`s where required.
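A sketch of the manager-side flow this enables once the user event has actually been processed. The `ChannelMonitorUpdate` container here is heavily simplified and `release_payment_complete` is a hypothetical helper name, not an LDK API.

```rust
/// Placeholder identifier types, mirroring the sketch above.
#[derive(Clone)]
pub struct SentHTLCId([u8; 32]);
pub struct PaymentCompleteUpdate {
    pub channel_id: [u8; 32],
    pub htlc: SentHTLCId,
}

/// Simplified update container; the real `ChannelMonitorUpdate` also carries an update id.
pub enum ChannelMonitorUpdateStep {
    ReleasePaymentComplete { htlc: SentHTLCId },
}
pub struct ChannelMonitorUpdate {
    pub channel_id: [u8; 32],
    pub updates: Vec<ChannelMonitorUpdateStep>,
}

/// After the `PaymentSent`/`PaymentFailed` event for this HTLC has been handed to the
/// user (and only when it originated from a `MonitorEvent` or the startup equivalent),
/// turn the deferred record into a monitor update to be persisted.
pub fn release_payment_complete(pending: PaymentCompleteUpdate) -> ChannelMonitorUpdate {
    ChannelMonitorUpdate {
        channel_id: pending.channel_id,
        updates: vec![ChannelMonitorUpdateStep::ReleasePaymentComplete { htlc: pending.htlc }],
    }
}
```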
Same rationale as above: here we, finally, begin actually using the new `ChannelMonitorUpdateStep::ReleasePaymentComplete` updates, skipping re-hydration of pending payments once they have been fully resolved through to a user `Event`.
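The startup side could then skip re-hydrating already-released payments along these lines. `MonitorView`, `replay_unreleased_resolutions`, and the field names are all stand-ins for the real monitor/manager interaction.

```rust
use std::collections::HashSet;

#[derive(Clone, PartialEq, Eq, Hash)]
pub struct SentHTLCId([u8; 32]);

/// Minimal stand-in for the monitor state consulted during `ChannelManager` load.
pub struct MonitorView {
    /// HTLCs resolved on chain, paired with the preimage if they were claimed.
    pub resolved_htlcs: Vec<(SentHTLCId, Option<[u8; 32]>)>,
    /// HTLCs already marked complete via `ReleasePaymentComplete`.
    pub released_htlcs: HashSet<SentHTLCId>,
}

/// Sketch of the reload logic: replay each on-chain resolution into the pending-payments
/// machinery unless the monitor says its user event was already fully handled.
pub fn replay_unreleased_resolutions(
    monitor: &MonitorView, mut replay: impl FnMut(&SentHTLCId, Option<[u8; 32]>),
) {
    for (htlc, preimage) in &monitor.resolved_htlcs {
        if monitor.released_htlcs.contains(htlc) {
            // Already resolved through to a user `Event` and recorded in the monitor;
            // re-hydrating it would only generate duplicate events.
            continue;
        }
        replay(htlc, *preimage);
    }
}
```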
When a payment was sent and ultimately completed through an on-chain HTLC claim which we discover during startup, we previously deliberately broke the payment tracking logic to keep the payment state around forever, declining to send a `PaymentPathSuccessful` event but ensuring that we don't constantly replay the claim on every startup. However, we now have logic to complete a claim by marking it as completed in a `ChannelMonitor`, rather than replaying information about the claim on every startup. Thus, we no longer need to take the conservative stance and can correctly replay claims now, generating `PaymentPathSuccessful` events and allowing the state to be removed.
Force-pushed from a791a4d to c0ab9e3
// The alternative would be to wipe the state when claiming,
// generating a `PaymentPathSuccessful` event but regenerating
// it and the `PaymentSent` on every restart until the
// `ChannelMonitor` is removed.
pending_outbounds.claim_htlc(
There's a TODO in this method in the `from_onchain` handling that can be removed now.
Event::PaymentPathSuccessful { payment_hash, .. } => {
	assert_eq!(payment_hash, Some(hash_b));
},
Just to confirm my understanding, this event could still potentially get lost?
lightning/src/ln/payment_tests.rs (outdated)
// After reload, the ChannelManager identified the failed payment and queued up the
// PaymentSent and corresponding ChannelMonitorUpdate to mark the payment handled, but
// while processing the pending `MonitorEvent`s (which were not processed before the
// monitor was persisted) we will end up with a duplicate ChannelMonitorUpdate.
expect_payment_sent(&nodes[0], payment_preimage, None, true, false);
// PaymentSent (or not, if `persist_manager_post_event` resulted in us detecting we
// already did that) and corresponding ChannelMonitorUpdate to mark the payment
// handled, but while processing the pending `MonitorEvent`s (which were not processed
// before the monitor was persisted) we will end up with a duplicate
// ChannelMonitorUpdate.
This is all one sentence and I find it a bit incomprehensible... Is there any way to break it up and/or reword?
@@ -16601,8 +16601,7 @@ where
 		}
 
 		if is_channel_closed {
-			for (htlc_source, (htlc, _)) in monitor.get_pending_or_resolved_outbound_htlcs()
-			{
+			for (htlc_source, (htlc, _)) in monitor.get_all_current_outbound_htlcs() {
The previous code would also check HTLCs in holder commitment txs, depending on what was actually confirmed onchain. I think the current behavior is correct based on the comment on `fail_unbroadcast_htlcs`, but was it incorrect before?
	false
};
if !have_action {
	self.handle_post_event_actions(ev_completion_action);
Missing test coverage for this case
	pending_events.iter().any(|(_, act)| act.as_ref() == Some(&action))
};
if !have_action {
	self.handle_post_event_actions([action]);
Missing test coverage
// This should mostly only happen on startup, but it is possible to hit it in
// rare cases where a MonitorUpdate is replayed after restart because a
// ChannelMonitor wasn't persisted after it was applied (even though the
// ChannelManager was).
Added `assert!(self.background_events_processed_since_startup.load(Ordering::Acquire));` here and all tests passed. So I think we don't have coverage of the "normal" startup case that's being outlined here?
let events = node.node.get_and_clear_pending_events();
if conditions.from_mon_update {
	check_added_monitors(node, 1);
Pre-existing, but generally for our monitor update testing it would be nice if we could check that the particular `ChannelMonitorUpdateStep` is as expected.
let during_startup =
	!self.background_events_processed_since_startup.load(Ordering::Acquire);
if during_startup {
	let event = BackgroundEvent::MonitorUpdateRegeneratedOnStartup {
Missing test coverage
This PR fixes lost `MonitorEvent`s, first in two commits which handle effectively replaying them on startup (first the preimage case, then the timeout case), and then by tracking their completion via a new `ChannelMonitorUpdate`.
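As a closing illustration of the two replay paths (preimage vs. timeout), here is a sketch. `ANTI_REORG_DELAY` is a real LDK constant (6 today), while `OnchainResolution` and `replay_action` are illustrative stand-ins for the actual startup logic.

```rust
/// Placeholder for LDK's `ANTI_REORG_DELAY` (6 confirmations in LDK today).
pub const ANTI_REORG_DELAY: u32 = 6;

/// How an outbound HTLC was resolved on chain, in simplified form.
pub enum OnchainResolution {
    /// The payment preimage was revealed on chain.
    Preimage([u8; 32]),
    /// The HTLC timed out on chain; `confirmations` counts blocks since the timeout.
    Timeout { confirmations: u32 },
}

/// Sketch of the startup replay decision for a single resolved HTLC: claims are replayed
/// as successes, timeouts only once they are `ANTI_REORG_DELAY` deep, matching the
/// criteria for generating the corresponding `MonitorEvent`.
pub fn replay_action(res: &OnchainResolution) -> &'static str {
    match res {
        OnchainResolution::Preimage(_) => "replay claim -> PaymentSent/PaymentPathSuccessful",
        OnchainResolution::Timeout { confirmations } if *confirmations >= ANTI_REORG_DELAY => {
            "replay failure -> PaymentFailed"
        },
        OnchainResolution::Timeout { .. } => "wait for ANTI_REORG_DELAY",
    }
}
```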