Instance here:
https://github.com/microsoft/CCF/actions/runs/22304875267/job/64521474040
We're doing this inside run_ledger_chunk_bytes_check:
for node, chunk_size in nodes_and_sizes[1:]:
    force_become_primary(network, args, node)
    with node.client("user0") as c:
        written = 0
        # Stop just before this node completes the chunk
        target_chunk_size = chunk_size - unit_size
        while written < target_chunk_size:
            r = c.post(
                "/app/log/public",
                {"id": chunk_size, "msg": "X" * unit_size},
            )
            assert r.status_code == http.HTTPStatus.OK, r
            written += unit_size
            c.wait_for_commit(r)
force_become_primary(network, args, smallest_node) # <<< Error in here
The first thing that force_become_primary() does is network.wait_for_node_commit_sync(), to ensure we start from an agreed state, but in this instance we didn't get there. Node 0 persistently reports commit is at 6.56, while the other 2 know it has advanced to 7.66.
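For context, that sync check boils down to polling each node's /node/commit until every reported TxID agrees, up to a deadline. A minimal sketch of the idea (the helper name, client shape, and timeout here are illustrative assumptions, not the actual infra code):

```python
import time

def parse_tx_id(tx_id):
    # TxIDs like "7.66" are "view.seqno" pairs
    view, seqno = tx_id.split(".")
    return int(view), int(seqno)

def wait_for_commit_sync(get_commit_fns, timeout=3.0, interval=0.1):
    # Poll every node's reported commit until they all agree, or give
    # up after `timeout` seconds - which is what happened in this run,
    # with Node 0 stuck at 6.56 while the other two reported 7.66.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        tx_ids = [parse_tx_id(get()) for get in get_commit_fns]
        if len(set(tx_ids)) == 1:
            return tx_ids[0]
        time.sleep(interval)
    raise TimeoutError(f"Nodes disagree on commit: {tx_ids}")
```

With all three nodes agreeing the call returns immediately; with one lagging node it spins until the deadline, which is exactly the repeated-poll pattern in the wait_for_node_commit_sync log lines.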
12:17:18.814 | INFO | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
12:17:18.814 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}
12:17:18.830 | INFO | infra.network:wait_for_node_commit_sync:1680 - [1] GET /node/commit
12:17:18.830 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.846 | INFO | infra.network:wait_for_node_commit_sync:1680 - [2] GET /node/commit
12:17:18.846 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.963 | INFO | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
12:17:18.963 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}
12:17:18.979 | INFO | infra.network:wait_for_node_commit_sync:1680 - [1] GET /node/commit
12:17:18.979 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.995 | INFO | infra.network:wait_for_node_commit_sync:1680 - [2] GET /node/commit
12:17:18.995 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
During this, there is one worrying line from CTest interspersed:
https://github.com/microsoft/CCF/actions/runs/22304875267/job/64521474040#step:10:3375
12:17:19.263 | INFO | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
Errors while running CTest
12:17:19.263 | INFO | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}
Is that just a print/flush ordering mixup, or has something been corrupted under our feet?
Anyway, the node logs don't show much more. There's a lot of election churn beforehand, but that's expected. We see the big writes arriving on Node 2. But then Node 0 seems not to hear from the primary for a while? Becomes a pre-vote candidate? But then becomes a follower just before we time out?
2026-02-23T12:17:15.532867Z 0 [info ] CCF/src/consensus/aft/raft.h:2249 | Becoming follower n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7.56
2026-02-23T12:17:15.543530Z 0 [info ] CCF/src/consensus/aft/raft.h:1899 | Send request_pre_vote_response from n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] to n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: false
2026-02-23T12:17:15.543613Z 0 [info ] CCF/src/consensus/aft/raft.h:1819 | Recv request_vote to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: granted vote to candidate at 6.56 with local state at 6.56
2026-02-23T12:17:15.543643Z 0 [info ] CCF/src/consensus/aft/raft.h:1884 | Send request_vote_response from n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] to n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: true
2026-02-23T12:17:19.496238Z 0 [info ] CCF/src/consensus/aft/raft.h:2098 | Becoming pre-vote candidate n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7
2026-02-23T12:17:19.497479Z 0 [info ] CCF/src/host/socket.h:54 | TCP Node Outgoing connected
2026-02-23T12:17:19.498583Z 0 [info ] CCF/src/consensus/aft/raft.h:2010 | Recv request vote response to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[a0d8d254d76dd21f17b6ff28fcbe62e26ef9c22bbc54a0f907d8eba8b6344c60]: they voted no
2026-02-23T12:17:19.498787Z 0 [info ] CCF/src/consensus/aft/raft.h:2010 | Recv request vote response to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: they voted no
2026-02-23T12:17:19.548420Z 0 [info ] CCF/src/consensus/aft/raft.h:2249 | Becoming follower n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7.56
2026-02-23T12:17:23.099462Z 0 [info ] CCF/src/host/sig_term.h:39 | SIGTERM: Shutting down enclave gracefully...
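For anyone less familiar with the pre-vote mechanism seen in these lines: before bumping its term and forcing a real election, a suspicious node first asks its peers whether they would vote for it, and only a majority of yeses triggers the election; otherwise it drops back to follower, which is what Node 0 does here. A toy sketch of that decision (greatly simplified, not CCF's actual raft.h logic):

```python
def pre_vote_succeeds(peer_granted):
    # peer_granted: one bool per peer - "would you vote for me?"
    # The node counts its own implicit vote and needs a strict
    # majority of the full cluster (peers + itself) to proceed to a
    # real election. Crucially, no term is incremented, so a node
    # that merely fell behind (like Node 0 at 6.56) cannot disrupt
    # the healthy majority - it just returns to follower.
    votes = 1 + sum(peer_granted)
    cluster_size = len(peer_granted) + 1
    majority = cluster_size // 2 + 1
    return votes >= majority
```

In the three-node run above, both peers answered "they voted no", so Node 0 never started a real election and became a follower again.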
This may be a sign that Node 2 is essentially stalled - spinning on writing these huge entries to disk and leaving a lull in consensus traffic that we don't wait long enough to ride out?
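To put rough numbers on that theory (figures purely illustrative, not measured from this run): if a single synchronous append takes longer than the election timeout, followers see a heartbeat gap and start (pre-)elections even though the primary is healthy.

```python
def append_outlasts_election_timeout(entry_bytes, disk_bytes_per_sec, election_timeout_s):
    # If writing one huge ledger entry blocks the primary for longer
    # than the election timeout, consensus traffic pauses and the
    # followers' timers fire - the lull hypothesised above.
    write_time_s = entry_bytes / disk_bytes_per_sec
    return write_time_s > election_timeout_s

# e.g. a 500 MB entry at 100 MB/s takes 5 s, far past a 1 s timeout
```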
A plausible explanation, but also a single repro - let's wait and see if this occurs again.