Description
This PR is a proof-of-concept optimization for path payments that significantly decreases the duration of exceptional "slow ledgers" and modestly improves average ledger close times.
Observation
Sometimes, stellar-core takes a very long time to apply transactions, on the scale of 5-10 seconds. Initially, I analyzed 2 such slow ledgers, 53514768 and 53514801. These ledgers take 3.01 seconds and 2.05 seconds to apply on my laptop, and even longer on validator hardware.
Like most transaction sets, these ledgers contained a large number of arbitrage path payments, most of which were very similar or identical and most of which failed. I initially suspected an improper offer cache causing extra disk IO, but the offer table caches work as expected. Upon profiling, I found that these ledgers spent over 1 second just constructing and committing non-root LedgerTxn objects. Despite not going to disk, the churn of constantly creating, committing, and eventually rolling back LedgerTxn objects for repetitive path payments was very expensive.
Solution
The issue has to do with our current "exit early" strategy. To avoid recomputing the same failed offer pairs, we use the `worstBestOfferCache`. The problem with this cache is that it reasons about individual asset pairs instead of the payment path as a whole. For path payments, many asset pairs along the path often succeed, and an asset pair very deep in the order book eventually fails. The `worstBestOfferCache` is useless for preventing us from traversing this failed path, since we do not hit the cached asset pair until we have already done most of the work.

The solution is dynamic programming memoization at the asset path level. For each failed path payment, I cache the path hash together with the source sell amount and destination amount that failed. Before walking the path of another payment op, we check that op's path hash against the cache. The cache is conservative such that we fail early iff:

- the cached failed attempt was willing to give away at least as much of the source asset as the current op, and
- the cached failed attempt was trying to receive no more of the destination asset than the current op.

Intuitively, the reasoning is "if a previous failed transaction gave away more and received less than me, I must also fail."
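To make the idea concrete, below is a minimal, self-contained sketch of such a path-level failure cache. This is not the actual implementation: the names (`FailedPathCache`, `recordFailure`, `mustFail`) are hypothetical, `Asset` is a placeholder for the XDR asset type, and the real cache keys on a path hash rather than the path itself.

```cpp
// Hypothetical sketch of a path-level failure cache (not stellar-core code).
#include <cstddef>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

using Asset = std::string;          // placeholder for the XDR Asset type
using PathKey = std::vector<Asset>; // source asset, intermediates, destination asset

struct FailedAttempt
{
    int64_t srcSellAmount; // source amount the failed op was willing to give away
    int64_t destAmount;    // destination amount the failed op tried to receive
};

class FailedPathCache
{
    std::map<PathKey, FailedAttempt> mFailed;

  public:
    // Record a path payment that walked the whole path and still failed.
    void
    recordFailure(PathKey const& path, int64_t srcSellAmount, int64_t destAmount)
    {
        mFailed[path] = FailedAttempt{srcSellAmount, destAmount};
    }

    // Conservative early exit: if a cached op offered at least as much of the
    // source asset and demanded no more of the destination asset than this op,
    // and still failed, this op must fail too (assuming unchanged offers).
    bool
    mustFail(PathKey const& path, int64_t srcSellAmount, int64_t destAmount) const
    {
        auto it = mFailed.find(path);
        if (it == mFailed.end())
        {
            return false;
        }
        return it->second.srcSellAmount >= srcSellAmount &&
               it->second.destAmount <= destAmount;
    }

    // Invalidation helpers, defined in the sketch after the invalidation
    // discussion below.
    void invalidateHop(Asset const& from, Asset const& to);
    void onPathPaymentSuccess(PathKey const& path);
};
```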
These conditions assume that no offers along the path have been modified since the failed path payment was cached. To achieve this, we invalidate the cache as follows:

Reasoning for point 3: consider two path payments with the same path. The first path payment fails, so we cache it. The second path payment succeeds. Because the two payments target the same path, the success of op 2 has made the market strictly more competitive for path payments over the same path. More formally, for some path N, if a path payment over path N is executed, no op with path N that previously failed could now succeed, so we do not need to invalidate the cache. However, we do need to invalidate the counterparty direction: if the path assetA -> assetB -> assetC executes, we must invalidate the cache for the counterparty pairs {assetC -> assetB} and {assetB -> assetA}.
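Continuing the hypothetical `FailedPathCache` sketch above, the invalidation rules could look roughly like this; the exact hooks into offer creation, modification, and deletion in stellar-core are not shown.

```cpp
// Continuation of the hypothetical FailedPathCache sketch above.

// Erase every cached failed path that traverses the directed hop from -> to.
// Intended to be called whenever an offer on that side of the book is
// created, modified, or removed.
void
FailedPathCache::invalidateHop(Asset const& from, Asset const& to)
{
    for (auto it = mFailed.begin(); it != mFailed.end();)
    {
        auto const& path = it->first;
        bool touchesHop = false;
        for (std::size_t i = 0; i + 1 < path.size(); ++i)
        {
            if (path[i] == from && path[i + 1] == to)
            {
                touchesHop = true;
                break;
            }
        }
        if (touchesHop)
        {
            it = mFailed.erase(it);
        }
        else
        {
            ++it;
        }
    }
}

// A successful payment over `path` makes that direction strictly more
// competitive (no previously failed op on the same path could now succeed),
// so cached failures for the same direction stay valid; only the counterparty
// (reverse) hops need to be invalidated. E.g. if assetA -> assetB -> assetC
// executes, invalidate assetC -> assetB and assetB -> assetA.
void
FailedPathCache::onPathPaymentSuccess(PathKey const& path)
{
    for (std::size_t i = 0; i + 1 < path.size(); ++i)
    {
        invalidateHop(path[i + 1], path[i]);
    }
}
```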
Results
Disclosure: I performed all these tests on my laptop. Due to the memory consumption of tracy, I could only record in 500-ledger chunks. These are rough, preliminary estimates; I need to follow up with more ranges and use dev boxes to test. Disclosure over, you've been warned.
Replaying range 53514477 -> 53514977
Most importantly, "slow ledger" spikes were significantly reduced. In particular:
On more recent ledgers, I get the following results. From 55633887 -> 55634887 (1000 ledgers replayed from today)
TLDR
Modest improvement in the average case. Significant improvement in mitigating path payments that cross many offers. No apparent downside, except that this optimization is technically a protocol change (error codes returned may differ from before) and the general risk of adding additional complexity to the order book (though in my opinion this cache is fairly straightforward, all things considered).
Checklist
clang-format v8.0.0 (via `make format` or the Visual Studio extension)