Fix spurious MPP pathfinding failure #3707

Open · wants to merge 1 commit into main from 2025-03-lexe-router-bugfix

Conversation

@valentinewallace (Contributor) commented Apr 3, 2025

This bug was recently surfaced to us by a user who wrote a test where the sender is attempting to route an MPP payment split across the two channels that it has with its LSP, where the LSP has a single large channel to the recipient.

Previously this led to a pathfinding failure because our router was not sending the maximum possible value over the first MPP path it found, due to overestimating the fees needed to cover a later hop in the route. Because the path that had just been found was not maxed out, our router assumed that we had already found enough paths to cover the full payment amount and that we were finding additional paths for the purpose of redundant path selection. This caused the router to mark the recipient's only channel as exhausted, with the intention of choosing more unique paths in future iterations. In reality, the recipient's only channel ended up disabled, and the router subsequently failed to find a route entirely.

Update the router to fully utilize the capacity of any paths it finds in this situation, preventing the "redundant path selection" behavior from kicking in.
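
To make the fee overestimate concrete, here is a small, hypothetical model of the aggregation idea (the function names and exact arithmetic are illustrative, not LDK's actual `compute_max_final_value_contribution`): the router inverts the aggregated fee formula to find the largest amount the payee can receive through a hop, and integer division makes that result undershoot the true maximum by a few msat.

```rust
/// Standard Lightning per-hop fee: base plus a proportional part in
/// millionths, with integer (floor) division.
fn hop_fee_msat(value_msat: u64, base_msat: u64, prop_millionths: u64) -> u64 {
    base_msat + value_msat * prop_millionths / 1_000_000
}

/// Largest value the payee can receive if at most `hop_max_msat` can be sent
/// into the remaining hops, which together charge `agg_base`/`agg_prop`.
/// Solves value + fee(value) <= hop_max_msat for value, rounding down, so
/// the result can undershoot the true maximum by a few msat.
fn max_final_value_contribution(hop_max_msat: u64, agg_base: u64, agg_prop: u64) -> u64 {
    hop_max_msat.saturating_sub(agg_base).saturating_mul(1_000_000) / (1_000_000 + agg_prop)
}
```

With a 1,000,000 msat hop, a 1,000 msat base fee, and 100 ppm proportional fee, the inversion yields 998,900 msat even though 998,901 msat plus its 1,099 msat fee fits the hop exactly; that one-msat shortfall is the kind of rounding gap discussed in review below.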

Closes #3685

@ldk-reviews-bot commented Apr 3, 2025

👋 Thanks for assigning @TheBlueMatt as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.


// Aggregate the fees of the hops that come after this one, and use those fees to compute the
// maximum amount that this hop can contribute to the final value received by the payee.
let (next_hops_aggregated_base, next_hops_aggregated_prop) =
Collaborator

Doesn't using the aggregate version have rounding issues? Sadly if we're even one msat off I think the bug will persist :/

Contributor Author

Yeah, I'm not sure there's an easy way to get rid of the rounding issues :(

Will move forward with adding a fudge factor when checking whether we've hit prevent_redundant_path_selection. It might also be good to prevent that code from exhausting last-hop channels.

@valentinewallace force-pushed the 2025-03-lexe-router-bugfix branch from ce7b12b to 9f0eb8f on April 4, 2025 at 22:47
@valentinewallace (Contributor Author) commented:

Updated some docs and added a fudge factor when deciding if we've fully utilized a path.

diff --git a/lightning/src/routing/router.rs b/lightning/src/routing/router.rs
index 1d529b9b9..73cc6d687 100644
--- a/lightning/src/routing/router.rs
+++ b/lightning/src/routing/router.rs
@@ -2111,9 +2111,10 @@ impl<'a> PaymentPath<'a> {
                                        cur_hop.hop_use_fee_msat = new_fee;
                                        total_fee_paid_msat += new_fee;
                                } else {
-                                       // It should not be possible because this function is called only to reduce the
-                                       // value. In that case, compute_fee was already called with the same fees for
-                                       // larger amount and there was no overflow.
+                                       // It should not be possible because this function is only called either to reduce the
+                                       // value or with a larger amount that was already checked for overflow in
+                                       // `compute_max_final_value_contribution`. In the former case, compute_fee was already
+                                       // called with the same fees for larger amount and there was no overflow.
                                        unreachable!();
                                }
                        }
@@ -2121,6 +2122,9 @@ impl<'a> PaymentPath<'a> {
                value_msat + extra_contribution_msat
        }

+       // Returns the maximum contribution that this path can make to the final value of the payment. May
+       // be slightly lower than the actual max due to rounding errors when aggregating fees along the
+       // path.
        fn compute_max_final_value_contribution(
                &self, used_liquidities: &HashMap<CandidateHopId, u64>, channel_saturation_pow_half: u8
        ) -> u64 {
@@ -3331,9 +3335,15 @@ where L::Target: Logger {
                                                .or_insert(spent_on_hop_msat);
                                        let hop_capacity = hop.candidate.effective_capacity();
                                        let hop_max_msat = max_htlc_from_capacity(hop_capacity, channel_saturation_pow_half);
-                                       if *used_liquidity_msat == hop_max_msat {
-                                               // If this path used all of this channel's available liquidity, we know
-                                               // this path will not be selected again in the next loop iteration.
+
+                                       // If this path used all of this channel's available liquidity, we know this path will not
+                                       // be selected again in the next loop iteration.
+                                       //
+                                       // Allow the used amount to be slightly below the max when deciding whether a channel is
+                                       // fully utilized, to account for rounding errors in
+                                       // `PaymentPath::compute_max_final_value_contribution`.
+                                       let maxed_out_hop_liquidity_range = hop_max_msat.saturating_sub(1000)..(hop_max_msat.saturating_add(1));
+                                       if maxed_out_hop_liquidity_range.contains(used_liquidity_msat) {
                                                prevented_redundant_path_selection = true;
                                        }
                                        debug_assert!(*used_liquidity_msat <= hop_max_msat);
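
Pulled out as a standalone helper (a hypothetical refactor, not part of the PR), the new fudge-factor check behaves like this:

```rust
// Hypothetical extraction of the fudge-factor check from the hunk above:
// treat a hop as fully utilized if the used liquidity is within 1000 msat
// below its max. The range is half-open, hence `saturating_add(1)` so the
// exact maximum itself is still included.
fn hop_is_maxed_out(used_liquidity_msat: u64, hop_max_msat: u64) -> bool {
    let maxed_out_hop_liquidity_range =
        hop_max_msat.saturating_sub(1000)..hop_max_msat.saturating_add(1);
    maxed_out_hop_liquidity_range.contains(&used_liquidity_msat)
}
```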

@valentinewallace marked this pull request as ready for review on April 4, 2025 at 22:49
@ldk-reviews-bot commented
🔔 1st Reminder

Hey @TheBlueMatt! This PR has been waiting for your review.
Please take a look when you have a chance. If you're unable to review, please let us know so we can find another reviewer.

Comment on lines +3314 to +3317
let max_path_contribution_msat = payment_path.compute_max_final_value_contribution(
&used_liquidities, channel_saturation_pow_half
);
let desired_value_contribution = cmp::min(max_path_contribution_msat, final_value_msat);
Collaborator

Rather than being fuzzy, can we not have these determine whether we need to limit at a random hop?

Contributor Author

That does simplify things a bit. Pushed with the following diff:

diff --git a/lightning/src/routing/router.rs b/lightning/src/routing/router.rs
index 73cc6d687..bc4210b5c 100644
--- a/lightning/src/routing/router.rs
+++ b/lightning/src/routing/router.rs
@@ -3326,7 +3326,6 @@ where L::Target: Logger {
                                // might have been computed considering a larger value.
                                // Remember that we used these channels so that we don't rely
                                // on the same liquidity in future paths.
-                               let mut prevented_redundant_path_selection = false;
                                for (hop, _) in payment_path.hops.iter() {
                                        let spent_on_hop_msat = value_contribution_msat + hop.next_hops_fee_msat;
                                        let used_liquidity_msat = used_liquidities
@@ -3335,20 +3334,9 @@ where L::Target: Logger {
                                                .or_insert(spent_on_hop_msat);
                                        let hop_capacity = hop.candidate.effective_capacity();
                                        let hop_max_msat = max_htlc_from_capacity(hop_capacity, channel_saturation_pow_half);
-
-                                       // If this path used all of this channel's available liquidity, we know this path will not
-                                       // be selected again in the next loop iteration.
-                                       //
-                                       // Allow the used amount to be slightly below the max when deciding whether a channel is
-                                       // fully utilized, to account for rounding errors in
-                                       // `PaymentPath::compute_max_final_value_contribution`.
-                                       let maxed_out_hop_liquidity_range = hop_max_msat.saturating_sub(1000)..(hop_max_msat.saturating_add(1));
-                                       if maxed_out_hop_liquidity_range.contains(used_liquidity_msat) {
-                                               prevented_redundant_path_selection = true;
-                                       }
                                        debug_assert!(*used_liquidity_msat <= hop_max_msat);
                                }
-                               if !prevented_redundant_path_selection {
+                               if max_path_contribution_msat > value_contribution_msat {
                                        // If we weren't capped by hitting a liquidity limit on a channel in the path,
                                        // we'll probably end up picking the same path again on the next iteration.
                                        // Decrease the available liquidity of a hop in the middle of the path.

@valentinewallace force-pushed the 2025-03-lexe-router-bugfix branch from 9f0eb8f to 209cb2a on April 7, 2025 at 18:06
@MaxFangX (Contributor) commented Apr 8, 2025

Nice to see a fix in the works, thanks @valentinewallace!

> Because the path that had just been found was not maxed out

A question that's somewhat related - will this PR also have the effect of generally reducing the # of shards over which MPPs are sent, since found paths are maxed out? We're currently dealing with issues in prod where sending an MPP payment which requires only two of the sender's outbound channels (i.e. two MPP shards) is actually producing an MPP route with seven shards. With a per-path success rate of 85% (our observed probing success rate), the overall payment success rate drops to a measly 0.85^7 ≈ 32%.
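
Assuming shard failures are independent, the success-rate arithmetic here can be sanity-checked with a one-liner:

```rust
// Overall MPP success rate when each of `n` shards independently succeeds
// with probability `p`: the payment completes only if every shard does.
fn mpp_success_rate(p: f64, n: u32) -> f64 {
    p.powi(n as i32)
}
```

At p = 0.85, two shards succeed about 72% of the time while seven shards succeed only about 32% of the time.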

And the computed maximum per-path value contribution accounts for the scorer's liquidity bounds, correct?

@valentinewallace (Contributor Author) commented:

> Nice to see a fix in the works, thanks @valentinewallace!
>
> > Because the path that had just been found was not maxed out
>
> A question that's somewhat related - will this PR also have the effect of generally reducing the # of shards over which MPPs are sent, since found paths are maxed out? We're currently dealing with issues in prod where sending an MPP payment which requires only two of the sender's outbound channels (i.e. two MPP shards) is actually producing an MPP route with seven shards. With a per-path success rate of 85% (our observed probing success rate), the overall payment success rate drops to a measly 0.85^7 ≈ 32%.

Hey @MaxFangX, this PR should reduce the number of shards in some cases but I'm not sure that it fully accounts for 7 shards vs 2... Let me know if you have logs for this case or any way to reproduce 👀

> And the computed maximum per-path value contribution accounts for the scorer's liquidity bounds, correct?

Actually no, the liquidity bounds are used when sorting paths to pick the paths that are the most likely to succeed. Interesting point though, maybe something to consider.

@MaxFangX (Contributor) commented Apr 8, 2025

> Let me know if you have logs for this case or any way to reproduce 👀

I opened an issue to continue the discussion at #3717

> > And the computed maximum per-path value contribution accounts for the scorer's liquidity bounds, correct?
>
> Actually no, the liquidity bounds are used when sorting paths to pick the paths that are the most likely to succeed. Interesting point though, maybe something to consider.

I think for us it's not imperative that the liquidity bounds are incorporated while computing the max per-path value contribution since the limiting factor is usually the channel that the user has with our LSP, rather than a public channel in the network.

Just speculating here, but I wonder if it would be a good strategy to:

  • Compute a number of possible MPP paths to the target
  • Sort the paths according to their score according to liquidity bounds
  • Iteratively increase the value contributed in the highest probability path until it runs out of capacity it can contribute, or is no longer the highest probability path
  • Continue doing this until there is sufficient value contribution across all paths to meet the payment amount.
  • Remove any MPP paths which were not necessary to meet the payment amount.

This would have the effect of simultaneously sending the highest value along the highest probability paths, as well as reducing the number of paths overall.
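
As a rough sketch (hypothetical code, not LDK's router, with the simplification that each path's success probability stays fixed rather than being re-scored as liquidity is consumed), the strategy might look like:

```rust
#[derive(Clone, Debug)]
struct CandidatePath {
    success_prob: f64, // scorer-derived probability this path succeeds
    max_msat: u64,     // capacity this path can contribute
    used_msat: u64,    // value assigned so far
}

/// Assign `amount_msat` across `paths`, always topping up the currently most
/// probable path that still has room, in fixed-size increments. Paths that
/// end up contributing nothing are dropped. Returns None if the combined
/// capacity is insufficient.
fn assign_greedy(
    mut paths: Vec<CandidatePath>, mut amount_msat: u64,
) -> Option<Vec<CandidatePath>> {
    const STEP_MSAT: u64 = 10_000; // iterate in chunks, per the sketch above
    while amount_msat > 0 {
        // Pick the highest-probability path with remaining capacity; if no
        // path has room left, the payment amount cannot be met.
        let best = paths
            .iter_mut()
            .filter(|p| p.used_msat < p.max_msat)
            .max_by(|a, b| a.success_prob.partial_cmp(&b.success_prob).unwrap())?;
        let add = STEP_MSAT.min(best.max_msat - best.used_msat).min(amount_msat);
        best.used_msat += add;
        amount_msat -= add;
    }
    // Remove any MPP shards that were not needed to meet the amount.
    paths.retain(|p| p.used_msat > 0);
    Some(paths)
}
```

For example, with a 50,000 msat path at probability 0.9 and a 100,000 msat path at probability 0.8, a 60,000 msat payment fills the first path completely and puts only the 10,000 msat remainder on the second.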

@ldk-reviews-bot commented
✅ Added second reviewer: @wpaulino

Development

Successfully merging this pull request may close these issues.

Router disables central node within route even if there is only one path
4 participants