Move persist into async part of the sweeper #3819

joostjager · 2025-06-02T12:27:00Z

Prepares for making the kv store async in #3778. Otherwise it might be necessary to use block_on in the sweeper. For block_on, a runtime would be needed.

ldk-reviews-bot · 2025-06-02T12:27:03Z

👋 Thanks for assigning @tnull as a reviewer!
I'll wait for their review and will help manage the review process.
Once they submit their review, I'll check if a second reviewer would be helpful.

tnull

Took a first look and left some comments.

Besides, I still think if we go this way we should just also switch to use a Notifier to wake the background processor to trigger persistence.

tnull · 2025-06-04T11:15:36Z

lightning/src/util/sweep.rs

@@ -783,11 +788,13 @@ where
 struct SweeperState {
 	outputs: Vec<TrackedSpendableOutput>,
 	best_block: BestBlock,
+	dirty: bool,


We had that discussion before: I'd really prefer it if we don't mix in runtime state with the SweeperState, which is precisely the object we use to isolated the persisted state from the non-persisted state, which also avoid having to hand a mutable state to persist_state.

Last time, I created that isolation, but then reverted in favor of an atomic boolean. Which direction would you suggest taking it with the dirty flag? I don't think I'd like another atomic boolean. Already didn't like the first one, but two independent sync primitives is expanding the state space even further.

Yes, I would much prefer to just have another needs_persist: AtomicBool on OutputSweeper directly.

The type of flows that I'd like to avoid is stuff like: update state, unlock state, mark dirty and then concurrently a persist is happening in between unlock and mark dirty, ultimately leading to clean state marked as dirty that will be re-persisted without changes. Ofc the re-persist isn't the biggest problem, but I am cautious of requiring devs to reason through scenarios like the one above.

Try out here: main...joostjager:rust-lightning:sweeper-async-persist-atomicbool

I definitely feel all those cases pop up within me if I use that atomic bool.

So there a two aspects here why I dislike the idea of pushing the runtime flags into SweeperState:

We intentionally had SweeperState hold the 'actual'/persisted state of the sweeper, not any runtime-specific behavior. The (_unused, dirty, (static_value, false)), in the persistence logic really just shows that you unnecessarily broke the separation of data and logic we had here. If we think that all should be locked under a single Mutex, we'd need to create a wrapper struct holding both the SweeperState and the runtime-specific bool to maintain that.

However, secondly, I don't think we should introduce the lock contention and block the background processor that is woken and processing a 'I need persist' notification just to check if it actually still needs to re-persist. We don't have strong guarantees when the BP responds to a notification, so if it's mid-loop already it might take a while until it gets back to actually process the persist. Also note that what we do in this PR is effectively splitting the persistence in two: sync in-line persistence for stuff that really needs to happen before we return (track_spendable_outputs) and 'lazy'/async persistence that will happen some time after block connection. For the latter we have relaxed consistency guarantees anyways, and we basically increase chances of missing a persistence anyways. So I don't quite understand where the concern for race conditions in this 'lazy' case comes from. I don't see why we favor lock contention over (theoretical) relaxed consistency guarantees for a case where we already opt into the latter knowingly.

It might also be noteworthy that post-async KVStore we might need to rework the current pattern anyways, as we wouldn't be able to hold the MutexGuard across the write().await boundary. We'll figure that out when we get there, but it could mean that we need to clone the to-be persisted state before dropping the lock, and actually making the call, which would be another reason to not include ~unrelated fields in the state object.

TLDR: I'd prefer to continue to have the runtime bools live as AtomicBools on OutputSweeper directly, but if you guys really worry about any races for the already-lazy case, we should at the very least solve it by wrapping the two fields and SweeperState in yet another struct, maintaining the data/logic separation.

After more exploration of the atomic bool direction, I couldn't get rid of the suggested or real race conditions. I kept the dirty flag as part of the state, and separated it from the persistent fields. Let me know what you think.

Regarding that separation, I do want to point to #3618 (comment). Opinions and also current practices vary.

@tnull re: (1) can you elaborate on why you see the (_unused, dirty, (static_value, false)) as bad? Not obvious to me why this is different from the other places in the codebase where we do this but might be missing something.

Re: (2) I'm not sure I'm following because even if we use an atomic bool we'll still take the sweeper lock at least once in regenerate_and_broadcast_spend_if_necessary -- this additional instance doesn't seem unique? Not saying we'll definitely have races with the atomic bool, just that readers have to think through whether we'll miss a persist or have an extra persist unless everything is changed under one lock, so I want to make sure I understand why it's worth it to not.

After more exploration of the atomic bool direction, I couldn't get rid of the suggested or real race conditions. I kept the dirty flag as part of the state, and separated it from the persistent fields. Let me know what you think.

Yeah, as mentioned above and discussed offline, we can go this direction if you guys prefer it, as long as we maintain the separation of concerns, and drop the mutable reference from persist_state.

@tnull re: (1) can you elaborate on why you see the (_unused, dirty, (static_value, false)) as bad? Not obvious to me why this is different from the other places in the codebase where we do this but might be missing something.

The separation of data and logic is a nice principle to uphold, as it often makes reasoning about the code and later refactorings easier. I agree that we rarely follow that principle, but the sweeper is one of the few places where we did, and consciously chose to do so during the original review discussion. Would be great to maintain that.

Re: (2) I'm not sure I'm following because even if we use an atomic bool we'll still take the sweeper lock at least once in regenerate_and_broadcast_spend_if_necessary -- this additional instance doesn't seem unique?

Right, but block connections might be driven by another concurrent task, which might lead to unnecessarily re-acrquiring the mutex guard, just to check if we still have to persist once the BP loop gets around to it.

the sweeper is one of the few places where we did

I don't think this is the case, because before #3734 (where state separation was also a topic of discussion), there was no runtime-only state in the sweeper?

In the last push on this PR, I've added the state separation but not the atomic boolean.

tnull · 2025-06-04T11:20:35Z

lightning/src/util/sweep.rs

+		let result = {
+			self.regenerate_and_broadcast_spend_if_necessary_internal().await?;
+
+			// If there is still dirty state, we need to persist it.


This is a weird pattern. Why not move persistence out of regenerate_and_broadcast_spend_if_necessary_internal and just set the dirty flag there?

I looked at that, but I think we have to persist before we broadcast? Or is that not necessary?

I looked at that, but I think we have to persist before we broadcast? Or is that not necessary?

Hmm, not sure if necessary, but yes, it's probably cleaner to persist that we broadcasted before we attempt it.

However, I think you can avoid the entire 'if it's still dirty'-pattern if you'd trigger the repersistence via a Notifier rather than through the call to regenerate_and_broadcast_if_necessary, as discussed below.

Discussed offline. Probably still need a dirty flag to prevent unnecessary persists when only sweeps need to be checked.

Discussed offline. Probably still need a dirty flag to prevent unnecessary persists when only sweeps need to be checked.

Well, this was never the question, the question was around whether we need to run the 'if it's still dirty'-pattern after we may have just persisted. And to avoid that, we should just switch to use the notifier, as we intend to do that anyways.

ldk-reviews-bot · 2025-06-04T11:22:51Z

👋 The first review has been submitted!

Do you think this PR is ready for a second reviewer? If so, click here to assign a second reviewer.

joostjager · 2025-06-04T11:48:03Z

Besides, I still think if we go this way we should just also switch to use a Notifier to wake the background processor to trigger persistence.

You mean as part of this PR? I agree that that would be nicer than a timer, but it seems orthogonal to what we are doing here?

tnull · 2025-06-04T11:54:55Z

You mean as part of this PR? I agree that that would be nicer than a timer, but it seems orthogonal to what we are doing here?

Yes, I presume it would just be another (~ 20 LoC ?) commit that I don't consider orthogonal to changing the persistence scheme of the OutputSweeper, but very much in-line with / related to the effort in this PR.

joostjager · 2025-06-04T12:00:16Z

It is of course related, but it is not necessary to do it in this PR? For unblocking the async kv store, what's in this PR is all I need.

tnull · 2025-06-04T12:02:56Z

It is of course related, but it is not necessary to do it in this PR? For unblocking the async kv store, what's in this PR is all I need.

See #3819 (comment): I think you can avoid that 'double-check' pattern if you have repersistence triggered via a notifier.

lightning/src/util/sweep.rs

valentinewallace · 2025-06-04T17:45:03Z

lightning/src/util/sweep.rs

@@ -783,11 +788,13 @@ where
 struct SweeperState {
 	outputs: Vec<TrackedSpendableOutput>,
 	best_block: BestBlock,
+	dirty: bool,


I don't know this area of the code well, but I tend to agree with Joost that if we had put both flags inside the SweeperState then it would be easier to reason about -- everything would have to be changed under the same lock so would definitely be no concerns about concurrency. At face value, having a separate lock seems like it asks for a race condition?

joostjager · 2025-06-09T14:46:47Z

@tnull @TheBlueMatt and I have also been looking ahead to the follow up to this where the kv store is made async. We need to ensure that await doesn't happen inside the sweeper state lock.

One way of dealing with that is to just get the future inside the lock, and then await outside of it. And document on the trait that the call order needs to be preserved in the implementation of the kv store.

lightning/src/util/sweep.rs

valentinewallace · 2025-06-09T16:00:19Z

lightning/src/util/sweep.rs

@@ -783,11 +788,13 @@ where
 struct SweeperState {
 	outputs: Vec<TrackedSpendableOutput>,
 	best_block: BestBlock,
+	dirty: bool,


@tnull re: (1) can you elaborate on why you see the (_unused, dirty, (static_value, false)) as bad? Not obvious to me why this is different from the other places in the codebase where we do this but might be missing something.

Re: (2) I'm not sure I'm following because even if we use an atomic bool we'll still take the sweeper lock at least once in regenerate_and_broadcast_spend_if_necessary -- this additional instance doesn't seem unique? Not saying we'll definitely have races with the atomic bool, just that readers have to think through whether we'll miss a persist or have an extra persist unless everything is changed under one lock, so I want to make sure I understand why it's worth it to not.

lightning/src/util/sweep.rs

codecov · 2025-06-09T16:53:45Z

Codecov Report

Attention: Patch coverage is 62.50000% with 24 lines in your changes missing coverage. Please review.

Project coverage is 89.88%. Comparing base (0848e7a) to head (f71c795).

Files with missing lines	Patch %	Lines
lightning/src/util/sweep.rs	62.50%	20 Missing and 4 partials ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##             main    #3819   +/-   ##
=======================================
  Coverage   89.88%   89.88%           
=======================================
  Files         160      160           
  Lines      129654   129668   +14     
  Branches   129654   129668   +14     
=======================================
+ Hits       116534   116547   +13     
- Misses      10425    10428    +3     
+ Partials     2695     2693    -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

joostjager · 2025-06-10T11:05:20Z

lightning/src/util/sweep.rs

@@ -616,6 +633,9 @@ where
 				);
 				e
 			})
+			.map(|_| {
+				sweeper_state.dirty = false;


This is also again problematic when converting to async. If we don't hold the lock across the write await, we can't just update that dirty flag.

Right, see above, we'd need to keep persist_state take a non-mutable reference and re-acquire the Mutex to unset the flag. However, that of course might just be as race-y as setting/unsetting the AtomicBool.

Indeed, no more atomic update. But regardless of an in-state or independent dirty flag, this doesn't seem like a great direction either way.

But also it may be unavoidable?

tnull · 2025-06-10T11:53:57Z

lightning/src/util/sweep.rs

 	}

-	fn persist_state(&self, sweeper_state: &SweeperState) -> Result<(), io::Error> {
+	/// Flushes the current state to the persistence layer and marks the state as clean.
+	fn flush_state(&self, sweeper_state: &mut SweeperState) -> Result<(), io::Error> {


Not sure I'm onboard with this name change, as I have no intuition what 'flush_state' would mean. Plus, I still think this shouldn't take &mut SweeperState, but rather &PersistentState now.

Made the change. Name back to persist_state and taking an immutable ref. It does lead, as mentioned before, to code duplication. The dirty flag needs to be reset at multiple locations.

joostjager · 2025-06-10T15:50:48Z

I pushed a commit to verify whether this PR will actually work with an async kv store, and not holding the mutex across awaits. Also not that easy.

lightning/src/util/sweep.rs

tnull · 2025-06-11T09:02:40Z

I pushed a commit to verify whether this PR will actually work with an async kv store, and not holding the mutex across awaits. Also not that easy.

Okay, but I think it would be preferable to make the Async-KVStore changes in the Async-KVStore PR. Adding them here just makes following the changes harder.

joostjager · 2025-06-11T09:56:45Z

Just pushed the commit to discuss whether this is the direction that we want to go to. Should have made that more clear.

joostjager · 2025-06-11T14:59:15Z

To avoid confusion, I've parked the gist of the follow up here: joostjager/rust-lightning@sweeper-async-persist...joostjager:rust-lightning:sweeper-async-kvstore

Prepare for adding runtime state while avoiding the _unused serialization macro config.

To prepare for an async kv store trait that must be awaited, this commit moves the kv store calls from the chain notification handlers to the background process. It uses a dirty flag to communicate that there is something to persist. The block height is part of the persisted data. If that data does not make it to disk, the chain notifications are replayed after restart.

tnull

I think the current approach is fine, but if we'd want to make setting/unsetting the dirty flag safer (i.e., ensure that we don't forget something going forward), we could consider using an RAII pattern or similar to isolate the modifications and re-persists of the state further.

tnull · 2025-06-13T08:47:12Z

lightning/src/util/sweep.rs

@@ -382,7 +382,8 @@ where
 		output_spender: O, change_destination_source: D, kv_store: K, logger: L,
 	) -> Self {
 		let outputs = Vec::new();
-		let sweeper_state = Mutex::new(SweeperState { outputs, best_block });
+		let sweeper_state =
+			Mutex::new(SweeperState { persistent: PersistentSweeperState { outputs, best_block } });


nit: Persistent is a bit clunky terminology, IMO, but nbd.

Discussed offline, couldn't come up with something that is clearly better. Leaving as is.

tnull · 2025-06-13T08:58:38Z

lightning/src/util/sweep.rs

@@ -595,6 +610,8 @@ where
 			}
 			true
 		});
+
+		sweeper_state.dirty = true;


Seems this is redundant, but probably also can't hurt.

Leaving as is, to at least flag dirty in the functions that change state.

Also not doing the RAII pattern. There are just a few places where dirty needs to be cleared, and seems fine for now.

valentinewallace · 2025-06-13T14:27:19Z

lightning/src/util/sweep.rs

+		let sweeper_state = Mutex::new(SweeperState {
+			persistent: PersistentSweeperState { outputs, best_block },
+			dirty: false,
+		});


IMO adding the wrapper struct causes us to litter the code with .persistent in a lot of places where it's not relevant what's written to disk and what's not, and I'm not sure the concrete benefit besides the principle. Not worth holding up the PR though.

joostjager requested a review from tnull June 2, 2025 13:53

joostjager marked this pull request as ready for review June 4, 2025 09:50

tnull reviewed Jun 4, 2025

View reviewed changes

joostjager requested a review from valentinewallace June 4, 2025 15:53

valentinewallace reviewed Jun 4, 2025

View reviewed changes

joostjager force-pushed the sweeper-async-persist branch 4 times, most recently from 7cfec6a to 6138980 Compare June 9, 2025 14:44

joostjager requested a review from tnull June 9, 2025 14:44

valentinewallace reviewed Jun 9, 2025

View reviewed changes

joostjager force-pushed the sweeper-async-persist branch from 6138980 to f71c795 Compare June 9, 2025 16:44

joostjager commented Jun 10, 2025

View reviewed changes

joostjager removed the request for review from tnull June 10, 2025 11:29

tnull reviewed Jun 10, 2025

View reviewed changes

joostjager commented Jun 10, 2025

View reviewed changes

lightning/src/util/sweep.rs Outdated Show resolved Hide resolved

joostjager commented Jun 10, 2025

View reviewed changes

lightning/src/util/sweep.rs Outdated Show resolved Hide resolved

joostjager force-pushed the sweeper-async-persist branch 2 times, most recently from 67d9f8f to eec1c6b Compare June 11, 2025 14:55

joostjager force-pushed the sweeper-async-persist branch 2 times, most recently from 36ab2ee to 663254e Compare June 12, 2025 07:09

joostjager requested a review from tnull June 12, 2025 07:11

joostjager added 2 commits June 13, 2025 10:23

Separate sweeper persistent state

c09d41c

Prepare for adding runtime state while avoiding the _unused serialization macro config.

joostjager force-pushed the sweeper-async-persist branch from 663254e to 84ce2f2 Compare June 13, 2025 08:23

tnull reviewed Jun 13, 2025

View reviewed changes

joostjager requested a review from valentinewallace June 13, 2025 09:24

valentinewallace approved these changes Jun 13, 2025

View reviewed changes

joostjager requested a review from tnull June 13, 2025 14:39

Move persist into async part of the sweeper #3819

Are you sure you want to change the base?

Move persist into async part of the sweeper #3819

Uh oh!

Conversation

joostjager commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ldk-reviews-bot commented Jun 2, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

tnull left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

tnull Jun 5, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joostjager Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

ldk-reviews-bot commented Jun 4, 2025

Uh oh!

joostjager commented Jun 4, 2025

Uh oh!

tnull commented Jun 4, 2025

Uh oh!

joostjager commented Jun 4, 2025

Uh oh!

tnull commented Jun 4, 2025

Uh oh!

Uh oh!

valentinewallace Jun 4, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

joostjager commented Jun 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

codecov bot commented Jun 9, 2025

Codecov Report

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

joostjager commented Jun 2, 2025 •

edited

Loading

ldk-reviews-bot commented Jun 2, 2025 •

edited

Loading

tnull Jun 5, 2025 •

edited

Loading

joostjager Jun 9, 2025 •

edited

Loading

valentinewallace Jun 4, 2025 •

edited

Loading

joostjager commented Jun 9, 2025 •

edited

Loading