Skip to content

Conversation

@andrewjstone
Copy link
Contributor

Builds on #9296

This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the PersistentState itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, and so we recreate the ledger for every time we write it.

A follow up will PR will deal with the early networking information saved by the bootstore, and will be very similar.

Builds on #9296

This commit persists state to a ledger, following the pattern used in
the bootstore. It's done this way because the `PersistentState` itself
is contained in the sans-io layer, but we must save it in the async task
layer. The sans-io layer shouldn't know how the state is persisted, just
that it  is, and so we recreate the ledger for every time we write it.

A follow up will PR will deal with the early networking information saved
by the bootstore, and will be very similar.
ledger
.commit()
.await
.expect("Critical: Failed to save bootstore ledger for Fsm::State");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that failing to persist the ledger is a critical failure (especially because it seems like writing fails only if it can't write to both drives). Panicking feels wrong though. You have a better overview of all of the moving pieces than me: are we sure panicking here won't mess things up?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately ledgers are very fragile and have numerous problems due to the impossibility of consensus with only 2 nodes. One of those problems was pointed out in the initial PR, but there are others. To be on the safe side, I generally consider a ledger failure fatal to the sled.

For trust quorum in particular: if we can't write a ledger, then we can't send messages after that fact that depend on that state being persisted. The protocol requires persistence for safety similar to raft and paxos, even though it is not a consensus protocol.

That being said, in this particular case maybe it isn't necessary to panic. Instead we could raise an [Alarm]. This allows querying of the node still, but is supposed to stop mutating operations. Unfortunately, alarms exist inside the protocol state machine at a lower level and so we may need a wrapper for things like this. Then we have to ensure that we permanently stop performing any trust quorum operations and respond only for queries about the status of the node. Those can be used to trigger a support call.

On the other hand, having the node just go down due to a panic will also trigger a support call :)

I'm somewhat torn on the situation, but guaranteeing nothing bad happens by accidentally sending messages is much easier to do when you just panic. The alarm situation requires maintaining state and checking it in all appropriate places.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants