Move persist into async part of the sweeper #3819
base: main
Conversation
👋 Thanks for assigning @tnull as a reviewer!
Took a first look and left some comments.
Besides, I still think that if we go this way, we should also switch to using a `Notifier` to wake the background processor to trigger persistence.
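For illustration, a minimal self-contained sketch of that notifier idea. rust-lightning has its own notifier primitive; the shape below (a `Mutex`/`Condvar` pair with `notify()`/`wait()`) is a simplified assumption, not the actual API:

```rust
use std::sync::{Condvar, Mutex};

/// Simplified stand-in for a notifier primitive (hypothetical API).
struct PersistNotifier {
    pending: Mutex<bool>,
    condvar: Condvar,
}

impl PersistNotifier {
    fn new() -> Self {
        PersistNotifier { pending: Mutex::new(false), condvar: Condvar::new() }
    }

    /// Called by the sweeper after a state change that needs persisting.
    fn notify(&self) {
        *self.pending.lock().unwrap() = true;
        self.condvar.notify_all();
    }

    /// Called by the background processor: blocks until notified, then
    /// resets the flag so one wake triggers exactly one persist pass.
    fn wait(&self) {
        let mut pending = self.pending.lock().unwrap();
        while !*pending {
            pending = self.condvar.wait(pending).unwrap();
        }
        *pending = false;
    }
}
```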
lightning/src/util/sweep.rs (Outdated)
```diff
@@ -783,11 +788,13 @@ where
 struct SweeperState {
 	outputs: Vec<TrackedSpendableOutput>,
 	best_block: BestBlock,
+	dirty: bool,
```
We had that discussion before: I'd really prefer it if we don't mix runtime state into `SweeperState`, which is precisely the object we use to isolate the persisted state from the non-persisted state, and which also avoids having to hand a mutable state to `persist_state`.
Last time, I created that isolation, but then reverted in favor of an atomic boolean. Which direction would you suggest taking with the `dirty` flag? I don't think I'd like another atomic boolean. I already didn't like the first one, and two independent sync primitives expand the state space even further.
Yes, I would much prefer to just have another `needs_persist: AtomicBool` on `OutputSweeper` directly.
The type of flow that I'd like to avoid is stuff like: update state, unlock state, mark dirty, and then concurrently a persist happening in between the unlock and the mark-dirty, ultimately leading to clean state being marked as dirty and re-persisted without changes. Of course the re-persist isn't the biggest problem, but I am cautious of requiring devs to reason through scenarios like the one above.
Try it out here: main...joostjager:rust-lightning:sweeper-async-persist-atomicbool
I definitely feel all those cases popping up when I use that atomic bool.
I don't know this area of the code well, but I tend to agree with Joost that if we put both flags inside the `SweeperState`, it would be easier to reason about -- everything would have to be changed under the same lock, so there would definitely be no concerns about concurrency. At face value, having a separate lock seems like it asks for a race condition?
So there are two aspects to why I dislike the idea of pushing the runtime flags into `SweeperState`:
- We intentionally had `SweeperState` hold the 'actual'/persisted state of the sweeper, not any runtime-specific behavior. The `(_unused, dirty, (static_value, false)),` in the persistence logic really just shows that this unnecessarily breaks the separation of data and logic we had here. If we think it should all be locked under a single `Mutex`, we'd need to create a wrapper struct holding both the `SweeperState` and the runtime-specific `bool` to maintain that separation.
- However, secondly, I don't think we should introduce lock contention and block the background processor, which was woken and is processing an 'I need persist' notification, just to check whether it actually still needs to re-persist. We don't have strong guarantees about when the BP responds to a notification, so if it's mid-loop already it might take a while until it gets back to actually process the persist. Also note that what we do in this PR is effectively split the persistence in two: sync in-line persistence for stuff that really needs to happen before we return (`track_spendable_outputs`), and 'lazy'/async persistence that will happen some time after block connection. For the latter we have relaxed consistency guarantees anyways, and we basically increase the chances of missing a persistence. So I don't quite understand where the concern for race conditions in this 'lazy' case comes from. I don't see why we'd favor lock contention over (theoretical) relaxed consistency guarantees in a case where we already knowingly opt into the latter.
It might also be noteworthy that post-async `KVStore` we might need to rework the current pattern anyways, as we wouldn't be able to hold the `MutexGuard` across the `write().await` boundary. We'll figure that out when we get there, but it could mean that we need to clone the to-be-persisted state before dropping the lock and actually making the call, which would be another reason not to include ~unrelated fields in the state object.
TLDR: I'd prefer to continue having the runtime `bool`s live as `AtomicBool`s on `OutputSweeper` directly, but if you guys really worry about any races for the already-lazy case, we should at the very least solve it by wrapping the two fields and `SweeperState` in yet another struct, maintaining the data/logic separation.
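To make the clone-before-await point concrete, a sketch under assumed names (the real async `KVStore` trait is still being shaped in #3778): the persisted part of the state is cloned under the lock, and the guard is gone before the `.await`:

```rust
use std::sync::Mutex;

#[derive(Clone)]
struct PersistentSweeperState {
    // outputs, best_block, ... (elided)
}

struct Sweeper {
    state: Mutex<PersistentSweeperState>,
}

/// Hypothetical async store trait, standing in for the one in #3778.
trait AsyncKVStore {
    async fn write(&self, key: &str, value: Vec<u8>) -> Result<(), std::io::Error>;
}

fn encode(_state: &PersistentSweeperState) -> Vec<u8> {
    Vec::new() // stand-in for the real serialization
}

async fn persist_state<K: AsyncKVStore>(
    sweeper: &Sweeper, store: &K,
) -> Result<(), std::io::Error> {
    // Clone the to-be-persisted state while holding the lock; the
    // temporary guard is dropped at the end of this statement, so no
    // MutexGuard lives across the `.await` below.
    let snapshot = sweeper.state.lock().unwrap().clone();
    store.write("sweeper_state", encode(&snapshot)).await
}
```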
After more exploration of the atomic bool direction, I couldn't get rid of the suggested or real race conditions. I kept the dirty flag as part of the state, and separated it from the persistent fields. Let me know what you think.
Regarding that separation, I do want to point to #3618 (comment). Opinions and also current practices vary.
@tnull re: (1) can you elaborate on why you see the `(_unused, dirty, (static_value, false))` as bad? It's not obvious to me why this is different from the other places in the codebase where we do this, but I might be missing something.
Re: (2) I'm not sure I'm following, because even if we use an atomic bool we'll still take the sweeper lock at least once in `regenerate_and_broadcast_spend_if_necessary` -- this additional instance doesn't seem unique? I'm not saying we'll definitely have races with the atomic bool, just that readers have to think through whether we'll miss a persist or have an extra persist unless everything is changed under one lock, so I want to make sure I understand why it's worth it not to.
```rust
let result = {
	self.regenerate_and_broadcast_spend_if_necessary_internal().await?;

	// If there is still dirty state, we need to persist it.
```
This is a weird pattern. Why not move persistence out of `regenerate_and_broadcast_spend_if_necessary_internal` and just set the dirty flag there?
I looked at that, but I think we have to persist before we broadcast? Or is that not necessary?
> I looked at that, but I think we have to persist before we broadcast? Or is that not necessary?

Hmm, not sure if it's necessary, but yes, it's probably cleaner to persist that we broadcasted before we attempt it.
However, I think you can avoid the entire 'if it's still dirty' pattern if you trigger the repersistence via a `Notifier` rather than through the call to `regenerate_and_broadcast_if_necessary`, as discussed below.
Discussed offline. Probably still need a dirty flag to prevent unnecessary persists when only sweeps need to be checked.
> Discussed offline. Probably still need a dirty flag to prevent unnecessary persists when only sweeps need to be checked.

Well, this was never the question; the question was whether we need to run the 'if it's still dirty' pattern after we may have just persisted. And to avoid that, we should just switch to using the notifier, as we intend to do anyways.
👋 The first review has been submitted! Do you think this PR is ready for a second reviewer?
You mean as part of this PR? I agree that would be nicer than a timer, but it seems orthogonal to what we are doing here?
Yes, I presume it would just be another (~20 LoC?) commit that I don't consider orthogonal to changing the persistence scheme of the sweeper.
It is of course related, but it is not necessary to do it in this PR? For unblocking the async kv store, what's in this PR is all I need.
See #3819 (comment): I think you can avoid that 'double-check' pattern if you have repersistence triggered via a notifier.
Force-pushed from 7cfec6a to 6138980.
@tnull @TheBlueMatt and I have also been looking ahead to the follow-up to this where the kv store is made async. We need to ensure that an await doesn't happen inside the sweeper state lock. One way of dealing with that is to just obtain the future inside the lock and then await it outside of it, and to document on the trait that the call order needs to be preserved in the implementation of the kv store.
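A sketch of that idea with hypothetical signatures: the write future is created while the state lock is held (fixing the order in which the store sees writes, which the trait documentation would require implementations to preserve), but it is awaited only after the guard is dropped:

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Mutex;

type WriteFuture = Pin<Box<dyn Future<Output = Result<(), std::io::Error>> + Send>>;

/// Hypothetical async kv store: `write` returns a future immediately,
/// and implementations must apply writes in the order in which the
/// futures were handed out.
trait KVStore {
    fn write(&self, key: &str, value: Vec<u8>) -> WriteFuture;
}

async fn persist(
    state: &Mutex<Vec<u8>>, store: &dyn KVStore,
) -> Result<(), std::io::Error> {
    let write_fut = {
        let guard = state.lock().unwrap();
        // Future is created under the lock, pinning the write order...
        store.write("sweeper_state", guard.clone())
    }; // ...and the guard is dropped here, before any await.
    write_fut.await
}
```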
lightning/src/util/sweep.rs (Outdated)
```rust
let sweeper_state =
	Mutex::new(SweeperState { persistent: PersistentSweeperState { outputs, best_block } });
```
If we're gonna add the wrapper struct, we should expand the commit message here to explain why.
Added that we want to avoid the `_unused` construct:

> Prepare for adding runtime state while avoiding the `_unused` serialization macro config.

> To prepare for an async kv store trait that must be awaited, this commit moves the kv store calls from the chain notification handlers to the background process. It uses a dirty flag to communicate that there is something to persist. The block height is part of the persisted data. If that data does not make it to disk, the chain notifications are replayed after restart.
Force-pushed from 6138980 to f71c795.
Codecov Report
Attention: Patch coverage is
```
@@           Coverage Diff            @@
##             main    #3819    +/-   ##
========================================
  Coverage   89.88%   89.88%
========================================
  Files         160      160
  Lines      129654   129668    +14
  Branches   129654   129668    +14
========================================
+ Hits       116534   116547    +13
- Misses      10425    10428     +3
+ Partials     2695     2693     -2
```
Prepares for making the kv store async in #3778. Otherwise it might be necessary to use `block_on` in the sweeper, and for `block_on`, a runtime would be needed.
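A sketch of the situation being avoided, under assumptions (tokio is used purely as an example runtime; the kv store trait is the same hypothetical one as in the sketches above): chain notification handlers are synchronous, so calling an async store from them would mean blocking on the future, which needs a runtime handle:

```rust
use std::future::Future;
use std::pin::Pin;

type WriteFuture = Pin<Box<dyn Future<Output = Result<(), std::io::Error>> + Send>>;

/// Hypothetical async kv store, as in the sketches above.
trait KVStore {
    fn write(&self, key: &str, value: Vec<u8>) -> WriteFuture;
}

/// A synchronous chain notification handler cannot `.await`. Without
/// moving persistence into the async background task, it would have to
/// block on the write, dragging a runtime dependency into the sweeper.
fn block_connected(store: &dyn KVStore, runtime: &tokio::runtime::Handle, state: Vec<u8>) {
    runtime
        .block_on(store.write("sweeper_state", state))
        .expect("failed to persist sweeper state");
}
```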