Failure in partitions_test - "Multiple commit values" during force_become_primary #7687

@eddyashton

Description

Instance here:
https://github.com/microsoft/CCF/actions/runs/22304875267/job/64521474040

We're doing this inside run_ledger_chunk_bytes_check:

        for node, chunk_size in nodes_and_sizes[1:]:
            force_become_primary(network, args, node)
            with node.client("user0") as c:
                written = 0
                # Stop just before this node completes the chunk
                target_chunk_size = chunk_size - unit_size
                while written < target_chunk_size:
                    r = c.post(
                        "/app/log/public",
                        {"id": chunk_size, "msg": "X" * unit_size},
                    )
                    assert r.status_code == http.HTTPStatus.OK, r
                    written += unit_size
                c.wait_for_commit(r)

            force_become_primary(network, args, smallest_node)   # <<< Error in here

The first thing that force_become_primary() does is call network.wait_for_node_commit_sync(), to ensure we start from an agreed state, but in this instance we never got there. Node 0 persistently reports a commit at 6.56, while the other two know it has advanced to 7.66.
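The wait in question amounts to something like the following - a simplified sketch, not the real infra.network implementation; `get_commit_tx_id` is a hypothetical stand-in for the GET /node/commit request:

```python
import time


def wait_for_commit_sync(nodes, get_commit_tx_id, timeout=3.0):
    """Poll every node until they all report the same commit TxID.

    Simplified sketch of what wait_for_node_commit_sync is doing here;
    `get_commit_tx_id` stands in for the GET /node/commit call.
    """
    deadline = time.time() + timeout
    while True:
        tx_ids = [get_commit_tx_id(node) for node in nodes]
        if len(set(tx_ids)) == 1:
            # All nodes agree - this is the "agreed state" we start from
            return tx_ids[0]
        if time.time() > deadline:
            raise TimeoutError(f"Nodes never converged: {tx_ids}")
        time.sleep(0.1)
```

In the failing run, Node 0 never catches up, so this loop spins until the test harness gives up.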

12:17:18.814 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
12:17:18.814 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}
12:17:18.830 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [1] GET /node/commit
12:17:18.830 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.846 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [2] GET /node/commit
12:17:18.846 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.963 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
12:17:18.963 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}
12:17:18.979 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [1] GET /node/commit
12:17:18.979 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
12:17:18.995 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [2] GET /node/commit
12:17:18.995 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"7.66"}
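For reference, these transaction IDs are view.seqno pairs, so Node 0 is stuck a whole view (and 10 seqnos) behind its peers. A tiny hypothetical helper (`parse_tx_id` is not part of the test infra) makes the comparison explicit:

```python
def parse_tx_id(tx_id: str) -> tuple[int, int]:
    # Hypothetical helper, not part of the CCF test infra: split a
    # "view.seqno" transaction ID into a comparable (view, seqno) tuple.
    view, seqno = tx_id.split(".")
    return int(view), int(seqno)
```

Tuple comparison orders by view first, then seqno, so `parse_tx_id("7.66") > parse_tx_id("6.56")` holds.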

During this, there is one worrying line from CTest interspersed:
https://github.com/microsoft/CCF/actions/runs/22304875267/job/64521474040#step:10:3375

12:17:19.263 | INFO     | infra.network:wait_for_node_commit_sync:1680 - [0] GET /node/commit
Errors while running CTest
12:17:19.263 | INFO     | infra.network:wait_for_node_commit_sync:1680 - 200 {"transaction_id":"6.56"}

Is that just a print/flush ordering mixup, or has something been corrupted under our feet?

Anyway, the node logs don't show much more. There's a lot of election churn beforehand, but that's expected. We see the big writes arriving on Node 2. But then Node 0 seems not to hear from the primary for a while? It becomes a pre-vote candidate? And then becomes a follower again just before we time out?

2026-02-23T12:17:15.532867Z 0   [info ] CCF/src/consensus/aft/raft.h:2249    | Becoming follower n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7.56
2026-02-23T12:17:15.543530Z 0   [info ] CCF/src/consensus/aft/raft.h:1899    | Send request_pre_vote_response from n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] to n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: false
2026-02-23T12:17:15.543613Z 0   [info ] CCF/src/consensus/aft/raft.h:1819    | Recv request_vote to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: granted vote to candidate at 6.56 with local state at 6.56
2026-02-23T12:17:15.543643Z 0   [info ] CCF/src/consensus/aft/raft.h:1884    | Send request_vote_response from n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] to n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: true
2026-02-23T12:17:19.496238Z 0   [info ] CCF/src/consensus/aft/raft.h:2098    | Becoming pre-vote candidate n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7
2026-02-23T12:17:19.497479Z 0   [info ] CCF/src/host/socket.h:54             | TCP Node Outgoing connected
2026-02-23T12:17:19.498583Z 0   [info ] CCF/src/consensus/aft/raft.h:2010    | Recv request vote response to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[a0d8d254d76dd21f17b6ff28fcbe62e26ef9c22bbc54a0f907d8eba8b6344c60]: they voted no
2026-02-23T12:17:19.498787Z 0   [info ] CCF/src/consensus/aft/raft.h:2010    | Recv request vote response to n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648] from n[833f127625454ddf92072473da9851f938f4765b112e97a83ba1bdca792a3c5b]: they voted no
2026-02-23T12:17:19.548420Z 0   [info ] CCF/src/consensus/aft/raft.h:2249    | Becoming follower n[4a5a5afd3706f6c601a15bf8cbe612b0b69103fa45c1b282d00c16aa399ac648]: 7.56
2026-02-23T12:17:23.099462Z 0   [info ] CCF/src/host/sig_term.h:39           | SIGTERM: Shutting down enclave gracefully...

This may be a sign that Node 2 is essentially stalled - spinning while writing these huge entries to disk, leaving a lull in consensus traffic that lasts longer than we're prepared to wait?

A plausible explanation, but also only a single repro so far - let's wait and see whether it occurs again.
