TQ: Integrate protocol with NodeTask #9296
Conversation
Force-pushed 4e7f80b to a505cda
Builds on #9296. This commit persists state to a ledger, following the pattern used in the bootstore. It's done this way because the `PersistentState` itself is contained in the sans-io layer, but we must save it in the async task layer. The sans-io layer shouldn't know how the state is persisted, just that it is, and so we recreate the ledger every time we write it. A follow-up PR will deal with the early networking information saved by the bootstore, and will be very similar.
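A minimal sketch of that pattern, assuming a `Ledger::new_with`/`commit` API like the one the bootstore uses and a `PersistentState` that implements `Ledgerable`; names and signatures here are illustrative, not the committed code:

```rust
use camino::Utf8PathBuf;
use omicron_common::ledger::Ledger;
use slog::Logger;

// Recreate the ledger on every write: the sans-io layer owns the
// `PersistentState`, so the async task layer holds nothing long-lived.
// (`PersistentState` is assumed to implement `Ledgerable`.)
async fn save_persistent_state(
    log: &Logger,
    paths: Vec<Utf8PathBuf>,
    state: PersistentState,
) {
    let mut ledger = Ledger::new_with(log, paths, state);
    if let Err(err) = ledger.commit().await {
        slog::error!(log, "failed to persist trust quorum state"; "err" => %err);
    }
}
```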
Didn't look too closely at the tests, but the code itself looks great! Just a few minor comments that I'll trust you to resolve :)
```rust
for envelope in self.ctx.drain_envelopes() {
    self.conn_mgr.send(envelope).await;
}
```
Do we want to do this concurrently, or is serially okay? I guess this shouldn't be cancelled, since there's an instruction to make `run` a top-level task.
I think serially is fine. At most it's 31 messages being sent during the Prepare phase of a reconfiguration. Most of the time it's going to be 1 message. There won't be any cancellation, and in all cases it should just push onto a channel buffer.
```rust
    }
}

// TODO: Process `ctx`: save persistent state
```
What's `ctx` here?
This is the internal `NodeCtx` of a trust-quorum node. It's a structure that contains the `PersistentState`. The next PR in the series removes this comment and saves the `PersistentState` to a ledger.
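For orientation, a rough sketch of the shape described; all field names beyond `PersistentState` are assumptions:

```rust
// The sans-io context a trust-quorum `Node` mutates as it processes
// messages. The async task layer inspects it after each call, draining
// outgoing envelopes and persisting state when it has changed.
pub struct NodeCtx {
    persistent_state: PersistentState,
    // Outgoing messages queued by the protocol, drained by the task
    // layer and handed to the connection manager.
    outgoing: Vec<Envelope>,
}
```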
```rust
/// Return the status of this node if it is a coordinator
CoordinatorStatus { responder: oneshot::Sender<Option<CoordinatorStatus>> },

/// Load a rack secret for the given epoch
LoadRackSecret {
    epoch: Epoch,
    responder: oneshot::Sender<
        Result<Option<ReconstructedRackSecret>, LoadRackSecretError>,
    >,
},
```
Would consider calling all of the oneshot channels `tx` or similar.
Done in 7beb55d
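With that rename, the `LoadRackSecret` variant above would read roughly as follows (a sketch, not the exact committed diff):

```rust
/// Load a rack secret for the given epoch
LoadRackSecret {
    epoch: Epoch,
    tx: oneshot::Sender<
        Result<Option<ReconstructedRackSecret>, LoadRackSecretError>,
    >,
},
```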
```rust
&poll_interval,
&poll_max,
```
Hmm, honestly this should take a `Duration`, not a reference to it. Worth fixing at some point.
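For context, `Duration` is a small `Copy` type, so taking it by value costs nothing and reads better. A sketch with an illustrative function name:

```rust
use std::time::Duration;

// `Duration` is `Copy`, so pass it by value; `&Duration` only adds an
// indirection at every call site for no benefit.
fn poll_settings(poll_interval: Duration, poll_max: Duration) {
    let _ = (poll_interval, poll_max); // placeholder body
}
```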
Builds on #9232. This is the first step in wrapping the `trust_quorum::Node` so that it can be used in an async context and integrated with sled-agent. Only the sprockets networking has been fully integrated so far, such that each `NodeTask` has a `ConnMgr` that sets up a full mesh of sprockets connections. A test for this connectivity behavior has been written, but the code is not wired into the production code yet.

Messages can be sent between `NodeTask`s over sprockets connections. Each connection exists in its own task managed by an `EstablishedConn`. The main `NodeTask` task sends messages to and receives messages from this task to interact with the outside world via sprockets. Currently only `Ping` messages are sent over the wire, as a means to keep the connections alive and detect disconnects.

A `NodeHandle` allows one to interact with the `NodeTask`. Currently only three operations are implemented, with messages defined in `NodeApiRequest`. The user can instruct the node who its peers are on the bootstrap network to establish connectivity, can poll for connectivity status, and can shut down the node. All of this functionality is used in the accompanying test.

It's important to reiterate that this code only implements connectivity between trust quorum nodes; no actual trust quorum messages are sent. They can't be, as a handle cannot yet initiate a reconfiguration or LRTQ upgrade. That behavior will come in a follow up. This PR is large enough.

A lot of this code is similar to the LRTQ connection management code, except that it operates over sprockets rather than TCP channels. This introduces some complexity, but it is mostly abstracted away into the `SprocketsConfig`.
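A self-contained sketch of what those three `NodeApiRequest` operations might look like, based only on the description above; all field names and payload types are assumptions:

```rust
use std::collections::BTreeSet;
use std::net::SocketAddrV6;
use tokio::sync::oneshot;

/// Illustrative placeholder for whatever connectivity info the node
/// reports back when polled.
pub struct ConnectivityStatus {
    pub connected: BTreeSet<SocketAddrV6>,
}

pub enum NodeApiRequest {
    /// Tell the node who its peers are on the bootstrap network so it
    /// can establish the full mesh of sprockets connections.
    LoadPeerAddresses { addrs: BTreeSet<SocketAddrV6> },
    /// Poll for connectivity status.
    ConnectivityStatus { responder: oneshot::Sender<ConnectivityStatus> },
    /// Shut down the node task.
    Shutdown,
}
```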
Force-pushed 809559c to 99e5192
`NodeTask` now uses the `trust_quorum_protocol::Node` and `trust_quorum_protocol::NodeCtx` to send and receive trust quorum messages. An API to drive this was added to the `NodeTaskHandle`. The majority of code in this PR is tests using the API. A follow-up will deal with saving persistent state to a `Ledger`.
Force-pushed a505cda to 1ac30a3
```rust
pub async fn send(&self, envelope: Envelope) {
    let Envelope { to, msg, .. } = envelope;
    info!(self.log, "Sending {msg:?}"; "peer_id" => %to);
    if let Some(handle) = self.established.get1(&to) {
```
I was quite confused when I first saw this, since it silently discards the message if a connection with the recipient is not established.

Originally I was going to suggest renaming the function to `try_send` or similar, but poking at the rest of the code I learned that errors are discarded everywhere (`EstablishedConn::run` only logs the error message and kills the connection, without reporting the failure down the stack).

This makes sense, as in general RFD 238 is designed to be resilient to nodes disappearing at any point in time. I'm not sure I would do anything in response to this comment. Just leaving this as a note for future me.
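For the record, a `try_send`-shaped alternative would hand the envelope back rather than dropping it. A self-contained sketch using a plain map and string IDs in place of the real connection bookkeeping; every name here is an assumption:

```rust
use std::collections::BTreeMap;
use tokio::sync::mpsc;

struct Envelope {
    to: String, // stand-in for `BaseboardId`
    msg: String,
}

/// Returns the envelope to the caller instead of silently dropping it
/// when there is no established connection (or its buffer is full).
fn try_send(
    established: &BTreeMap<String, mpsc::Sender<Envelope>>,
    envelope: Envelope,
) -> Result<(), Envelope> {
    let Some(tx) = established.get(&envelope.to) else {
        return Err(envelope);
    };
    tx.try_send(envelope).map_err(|e| match e {
        mpsc::error::TrySendError::Full(env)
        | mpsc::error::TrySendError::Closed(env) => env,
    })
}
```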
```rust
async fn disconnect_client(&mut self, addr: SocketAddrV6) {

///
/// Return the `BaseboardId` of the peer if an established connection is
// torn down.
```
```diff
-// torn down.
+/// torn down.
```
```rust
// Tell all but the last node how to reach each other
for h in &setup.node_handles {
    h.load_peer_addresses(setup.listen_addrs.iter().cloned().collect())
        .await
        .unwrap();
}
```
While in this case the code comment doesn't reflect the test, this is a more general review comment on these tests.

There is a lot of boilerplate copy/pasted between tests, which makes it hard to see at a glance what a test is actually testing and how the tests differ.
We have a `TestSetup` struct we can add methods to. As an example, replacing connecting nodes with:

```rust
setup.connect_nodes(..).await;
setup.connect_nodes(1..).await;
```
...and similar for the rest of the large boilerplate blocks would make it way easier to review tests and make sure we cover every case.
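A sketch of what such a helper might look like, reusing the loop from the snippet above; `node_handles`, `listen_addrs`, and `load_peer_addresses` come from the test code, while the range-based slicing and the `BTreeSet` address type are assumptions:

```rust
use std::collections::BTreeSet;

impl TestSetup {
    /// Tell the selected nodes how to reach every listener, capturing
    /// the boilerplate block repeated across tests.
    async fn connect_nodes<R>(&self, range: R)
    where
        R: std::slice::SliceIndex<[NodeTaskHandle], Output = [NodeTaskHandle]>,
    {
        let addrs: BTreeSet<_> = self.listen_addrs.iter().cloned().collect();
        for h in &self.node_handles[range] {
            h.load_peer_addresses(addrs.clone()).await.unwrap();
        }
    }
}
```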
I fully agree. Will do.
I started to do this and then realized why I didn't do it in the first place. There actually isn't a ton of duplication inside the `wait_for_condition` futures; if you look closely you'll see that we check different conditions. Really shortening things would also require coupling "informing nodes about addresses to connect to" with "waiting for nodes to connect".
The one place it did make sense to dedup was in the loading of rack secrets. I went ahead and did that in aaad0c4.
This builds on #9258
`NodeTask` now uses the `trust_quorum_protocol::Node` and `trust_quorum_protocol::NodeCtx` to send and receive trust quorum messages. An API to drive this was added to the `NodeTaskHandle`. The majority of code in this PR is tests using the API.

A follow-up will deal with saving persistent state to a `Ledger`.