TQ: Support persisting state to ledger #9310

andrewjstone · 2025-10-29T21:17:50Z

Builds on #9296

This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the `PersistentState` itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, and so we recreate the ledger every time we write it.

A follow-up PR will deal with the early networking information saved by the bootstore, and will be very similar.
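As a rough illustration of the split described above, here is a minimal sketch in which the sans-io layer only reports that its state changed, and the task layer builds a fresh ledger for each write. Everything in it is an assumption made for illustration: the `Node` type, the fields of `PersistentState`, the `Ledger` type and its `new_with`/`commit` methods, and `handle_and_persist` are hypothetical stand-ins, not the actual trust quorum or ledger code. It is also synchronous and uses `std::fs` for brevity, whereas the real task layer is async.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Stand-in for the state owned by the sans-io layer (hypothetical fields).
#[derive(Clone, Debug, PartialEq)]
pub struct PersistentState {
    pub generation: u64,
    pub payload: String,
}

/// Sans-io layer: mutates state in memory and performs no I/O itself.
pub struct Node {
    state: PersistentState,
}

impl Node {
    pub fn new() -> Self {
        Node { state: PersistentState { generation: 0, payload: String::new() } }
    }

    /// Handle an input; returns true if the persistent state changed.
    pub fn handle(&mut self, payload: &str) -> bool {
        if self.state.payload == payload {
            return false;
        }
        self.state.generation += 1;
        self.state.payload = payload.to_string();
        true
    }

    pub fn persistent_state(&self) -> &PersistentState {
        &self.state
    }
}

/// Hypothetical ledger: one copy per path; commit succeeds if any copy is written.
pub struct Ledger {
    paths: Vec<PathBuf>,
    data: PersistentState,
}

impl Ledger {
    pub fn new_with(paths: Vec<PathBuf>, data: PersistentState) -> Self {
        Ledger { paths, data }
    }

    pub fn commit(&self) -> io::Result<()> {
        let contents = format!("{}\n{}", self.data.generation, self.data.payload);
        let mut written = 0;
        let mut last_err = None;
        for path in &self.paths {
            match fs::write(path, &contents) {
                Ok(()) => written += 1,
                Err(e) => last_err = Some(e),
            }
        }
        if written > 0 {
            Ok(())
        } else {
            Err(last_err
                .unwrap_or_else(|| io::Error::new(io::ErrorKind::Other, "no ledger paths")))
        }
    }
}

/// Task layer: drives the sans-io node and recreates the ledger for each write.
pub fn handle_and_persist(
    node: &mut Node,
    paths: &[PathBuf],
    payload: &str,
) -> io::Result<bool> {
    let changed = node.handle(payload);
    if changed {
        Ledger::new_with(paths.to_vec(), node.persistent_state().clone()).commit()?;
    }
    Ok(changed)
}
```

In this sketch the caller decides what a failed `commit` means; the diff under review treats it as fatal via `expect`.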


trust-quorum/src/ledgers.rs

pietroalbini · 2025-10-31T18:22:06Z


+        ledger
+            .commit()
+            .await
+            .expect("Critical: Failed to save bootstore ledger for Fsm::State");


I agree that failing to persist the ledger is a critical failure (especially because it seems like writing fails only if it can't write to both drives). Panicking feels wrong though. You have a better overview of all of the moving pieces than me: are we sure panicking here won't mess things up?

Unfortunately ledgers are very fragile and have numerous problems due to the impossibility of consensus with only 2 nodes. One of those problems was pointed out in the initial PR, but there are others. To be on the safe side, I generally consider a ledger failure fatal to the sled.

For trust quorum in particular: if we can't write a ledger, then we can't subsequently send any messages that depend on that state being persisted. The protocol requires persistence for safety, similar to Raft and Paxos, even though it is not a consensus protocol.

That being said, in this particular case maybe it isn't necessary to panic. Instead we could raise an `Alarm`. This still allows querying the node, but is supposed to stop mutating operations. Unfortunately, alarms exist inside the protocol state machine at a lower level, so we may need a wrapper for things like this. Then we have to ensure that we permanently stop performing any trust quorum operations and respond only to queries about the status of the node. Those can be used to trigger a support call.

On the other hand, having the node just go down due to a panic will also trigger a support call :)

I'm somewhat torn on the situation, but guaranteeing nothing bad happens by accidentally sending messages is much easier to do when you just panic. The alarm situation requires maintaining state and checking it in all appropriate places.
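To make the bookkeeping cost of the alarm option concrete, here is a minimal sketch of what a task-level wrapper might look like. It is purely hypothetical: `Task`, `Request`, `Response`, and the `persist` callback are invented for illustration and are not the real `Alarm` type, which lives inside the protocol state machine. The point is only that every mutating path has to check the latched alarm, while status queries keep working.

```rust
/// Hypothetical requests handled by the task layer.
#[derive(Debug, PartialEq)]
pub enum Request {
    Status,
    Mutate(String),
}

#[derive(Debug, PartialEq)]
pub enum Response {
    Status { alarmed: bool, applied: usize },
    Applied,
    Rejected,
}

/// Hypothetical wrapper that latches an alarm after a failed persist.
pub struct Task {
    alarm: Option<String>,
    applied: usize,
}

impl Task {
    pub fn new() -> Self {
        Task { alarm: None, applied: 0 }
    }

    /// `persist` stands in for the ledger commit; an error latches the alarm.
    pub fn handle(
        &mut self,
        req: Request,
        persist: impl FnOnce(&str) -> Result<(), String>,
    ) -> Response {
        match req {
            // Status queries are always answered, even after an alarm.
            Request::Status => Response::Status {
                alarmed: self.alarm.is_some(),
                applied: self.applied,
            },
            // Once alarmed, every mutating request is refused permanently.
            Request::Mutate(_) if self.alarm.is_some() => Response::Rejected,
            Request::Mutate(value) => match persist(&value) {
                Ok(()) => {
                    self.applied += 1;
                    Response::Applied
                }
                Err(reason) => {
                    self.alarm = Some(reason);
                    Response::Rejected
                }
            },
        }
    }
}
```

With a panic, none of this state or checking is needed, which is the trade-off weighed above.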

andrewjstone force-pushed the tq-sprockets-3 branch from e727dea to b19f5cf Compare October 29, 2025 21:22

This was referenced Oct 29, 2025

Trust Quorum Tracking #8262

Open

TQ: Implement network config replication #9317

Open

andrewjstone requested a review from pietroalbini October 30, 2025 15:00

andrewjstone force-pushed the tq-sprockets-2 branch from a505cda to 1ac30a3 Compare October 31, 2025 16:13

andrewjstone force-pushed the tq-sprockets-3 branch from b19f5cf to d1d409c Compare October 31, 2025 16:17

pietroalbini reviewed Oct 31, 2025


fix comment

b93522f
