
Path Payment Memoization POC #4644

Draft: wants to merge 9 commits into base: master
Conversation

@SirTyson (Contributor) commented Feb 7, 2025

Description

This PR is a proof-of-concept optimization for path payments that significantly decreases the duration of exceptional "slow ledgers" and modestly improves average ledger close times.

Observation

Sometimes, stellar-core takes a very long time to apply transactions, on the scale of 5-10 seconds. Initially, I analyzed 2 such slow ledgers, 53514768 and 53514801. These ledgers take 3.01 seconds and 2.05 seconds to apply on my laptop, and even longer on validator hardware.

Like most transaction sets, these ledgers contained a large number of arbitrage path payments, most of them very similar or identical and most of them failing. I initially suspected an improper offer cache causing extra disk IO, but the offer table caches work as expected. Upon profiling, I found that these ledgers spent over 1 second just constructing and committing non-root LedgerTxn objects. Despite not going to disk, the churn of constantly creating, committing, and eventually rolling back LedgerTxn objects for repetitive path payments was very expensive.

Solution

The issue lies with our current "exit early" strategy. To prevent recomputing the same failed offer pairs, we use the worstBestOfferCache. The problem with this cache is that it reasons about individual asset pairs instead of the payment path as a whole. For path payments, many asset pairs along the path often succeed while an asset pair very deep in the order book eventually fails. worstBestOfferCache is useless for preventing us from traversing such a failed path, since we do not hit the cached asset pair until we have already done most of the work.

The solution is to apply dynamic-programming-style memoization at the asset path level. For each failed path payment, I cache a mapping from the path hash to the source sell amount and destination amount that failed. Before walking the path of another payment op, we check its path hash against the cache. The cache is conservative such that we fail early iff:

  1. The cache contains a source amount equal to or larger than the current op's source amount
  2. The corresponding destination amount for that source amount is less than or equal to the current op's destination amount

Intuitively, the reasoning is "If a previous failed transaction gave away more and received less than me, I must also fail."
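The fail-early check can be sketched as follows. This is an illustrative reconstruction, not the PR's actual code: `PathPaymentCache`, `recordFailure`, and `shouldFailEarly` are hypothetical names, and the path hash is simplified to a string key.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>

// For each failed path payment, remember how much it offered to give away
// (source sell amount) and how much it asked to receive (destination amount).
struct FailedAttempt
{
    int64_t srcAmount;
    int64_t destAmount;
};

class PathPaymentCache
{
    // Keyed by a hash of the full asset path (src asset -> ... -> dest asset).
    std::unordered_map<std::string, FailedAttempt> mFailures;

  public:
    void
    recordFailure(std::string const& pathHash, int64_t src, int64_t dest)
    {
        auto it = mFailures.find(pathHash);
        // Keep the most "generous" failure seen for this path: the one
        // that offered the largest source amount.
        if (it == mFailures.end() || src > it->second.srcAmount)
        {
            mFailures[pathHash] = {src, dest};
        }
    }

    // Fail early iff a cached failed op on the same path gave away at least
    // as much (cached src >= op src) and asked to receive no more than we do
    // (cached dest <= op dest): if it failed, we must fail too.
    bool
    shouldFailEarly(std::string const& pathHash, int64_t src,
                    int64_t dest) const
    {
        auto it = mFailures.find(pathHash);
        return it != mFailures.end() && it->second.srcAmount >= src &&
               it->second.destAmount <= dest;
    }
};
```

Note that both comparisons must hold simultaneously; an op that offers less source but also demands a smaller destination amount could still succeed, so the cache conservatively lets it through.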

These conditions assume that no offers along the path have been modified since the failed path payment was cached. To achieve this, we invalidate the cache as follows:

  1. Whenever an offer is updated or created, we invalidate all cached paths that contain the pair {sheep, wheat} and {wheat, sheep}. We must invalidate both sides of the trade due to rounding conditions.
  2. Whenever a liquidity pool is deposited to or withdrawn from, we invalidate all paths that contain the pair {assetA, assetB} and {assetB, assetA}. Note this constraint can probably be tightened, but I didn't bother because it happens so rarely.
  3. Whenever a path payment succeeds, invalidate the counter party to each asset trade pair.

Reasoning for point 3: consider two path payments with the same path. The first fails, so we cache it. The second succeeds. Because the two payments target the same path, the success of op 2 has made the market strictly more competitive for path payments along that path. More formally, for some path N, if a path payment along N executes, no op with path N that previously failed could now succeed, so we do not need to invalidate the cache for N itself. However, we do need to invalidate the counterparty direction: if the path assetA -> assetB -> assetC executes, we must invalidate the cache for the counterparty pairs {assetC -> assetB} and {assetB -> assetA}.
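Invalidation rules 1 and 2 above can be sketched with a secondary index from each directed asset pair to the cached path hashes that traverse it. This is a hypothetical illustration, assuming assets and path hashes are plain strings; `InvalidationIndex`, `addPath`, and `invalidatePair` are made-up names, not the PR's API.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

using AssetPair = std::pair<std::string, std::string>;

class InvalidationIndex
{
    // Directed asset pair -> set of cached (failed) path hashes crossing it.
    std::map<AssetPair, std::set<std::string>> mPairToPaths;
    std::set<std::string> mCachedPaths;

  public:
    // Register a failed path (its hash plus the ordered assets it traverses).
    void
    addPath(std::string const& pathHash,
            std::vector<std::string> const& assets)
    {
        mCachedPaths.insert(pathHash);
        for (size_t i = 0; i + 1 < assets.size(); ++i)
        {
            mPairToPaths[{assets[i], assets[i + 1]}].insert(pathHash);
        }
    }

    // Rules 1 and 2: an offer or liquidity pool touching {a, b} changed, so
    // drop every cached path containing {a, b} or {b, a}. Both directions
    // must go, due to rounding conditions.
    void
    invalidatePair(std::string const& a, std::string const& b)
    {
        for (auto const& key : {AssetPair{a, b}, AssetPair{b, a}})
        {
            auto it = mPairToPaths.find(key);
            if (it != mPairToPaths.end())
            {
                for (auto const& h : it->second)
                {
                    mCachedPaths.erase(h);
                }
                mPairToPaths.erase(it);
            }
        }
    }

    bool
    isCached(std::string const& pathHash) const
    {
        return mCachedPaths.count(pathHash) != 0;
    }
};
```

Rule 3 (a successful path payment) would then reduce to calling `invalidatePair` for each hop of the executed path, which clears the counterparty direction as a side effect.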

Results

Disclosure: I performed all of these tests on my laptop. Due to the memory consumption of tracy, I could only record in 500-ledger chunks. These are rough, preliminary estimates; I need to follow up with more ranges and use dev boxes to test. Disclosure over, you've been warned.

Replaying range 53514477 -> 53514977

  • Total apply time reduced by 20%.
  • PathPaymentStrictSend reduced by 32%.
  • PathPaymentStrictReceive reduced by 26%.

Most importantly, "slow ledger" spikes were significantly reduced. In particular:

  • ledger 53514768 close time went from 3.01s -> 812 ms (73% decrease)
  • ledger 53514801 close time went from 2.05s -> 305 ms (85% decrease)

On more recent ledgers, I get the following results from replaying 55633887 -> 55634887 (1000 ledgers from today):

  • Total apply time decreased by 8%.
  • There were no significant outliers, so max ledger close time is about the same, ~300 ms worst ledger for both.

TLDR

Modest improvement in the average case. Significant improvement in mitigating path payments that cross many offers. No apparent downside, except that this optimization is technically a protocol change (error codes returned may differ from before) and there is general risk in adding complexity to the order book (though in my opinion this cache is fairly straightforward, all things considered).

Checklist

  • Reviewed the contributing document
  • Rebased on top of master (no merge commits)
  • Ran clang-format v8.0.0 (via make format or the Visual Studio extension)
  • Compiles
  • Ran all tests
  • If change impacts performance, include supporting evidence per the performance document
