Test_handleMultipleTx/Test_handleMultipleTx_with_valid_sibling_transactions (services/propagation/Server_test.go:686) is racy under -race on multi-core runners. It passes a single iteration but fails reproducibly under stress.
Reproduction
go test -race -tags testtxmetacache -count=1 -timeout 600s \
-run 'Test_handleMultipleTx$' -cpu=8 ./services/propagation/
A single -count=1 run is an unreliable reproducer; raising -count to somewhere between 200 and 2000 exposes the failure within ~30–60 seconds on an 8-core box. Locally I saw the first failure around iteration 492 of a 2000-iteration loop; in CI it surfaced on a single run of the test (see e.g. job 76589759556 on PR #828).
Symptom
=== FAIL: services/propagation Test_handleMultipleTx/Test_handleMultipleTx_with_valid_sibling_transactions
Server_test.go:747:
Error: expected 200, actual 500
Server_test.go:752:
expected: "OK"
actual: "Failed to process transactions:
PROCESSING (4): [ProcessTransaction][<txid>] failed to validate transaction
..."
One or more of the 20 sibling txs fail validation. The number of failing siblings varies (often 1, sometimes up to ~6), and the failing TxIDs are non-deterministic between runs.
Setup the test exercises
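sqlitememory:///test UTXO store, chain height set to 101.
A coinbase tx with 20 P2PKH outputs created at block height 1 (matures at 101).
20 sibling txs, each spending a different vout of the coinbase, are submitted as a single /txs batch.
The handler (handleMultipleTx, Server.go:679) dispatches one goroutine per tx (gated by the server-wide batchWorkerPool semaphore introduced in #879) and writes per-tx errors into pre-allocated slots.
The expected response is HTTP 200 with body "OK".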
Diagnostic notes
Re-validating each failed sibling synchronously against the same Validator and Store after the batch completes always succeeds, so the bad state is transient.
The underlying validator error is wrapped at the propagation layer and stripped by errors.UserMessage (Server.go:833) before being written to the response, so the message describing what actually failed isn't visible in CI logs. Surfacing it via a debug log of the inner error would speed up further triage.
The failure does NOT reproduce without -race, or with a verbose test logger (whose logging volume changes goroutine scheduling enough to hide it).
As far as the testify/require assertions are concerned, happens-before is fine: each errSlots slot is written by a distinct goroutine, and processingWg.Wait() precedes the reads. The race appears to be inside the validator / sqlitememory store path itself, when 20 goroutines concurrently read the same parent coinbase record and write 20 distinct spending tx records.
Pre-PR-828 evidence
The flake was not introduced by PR #828. I checked out services/propagation/ at the pre-cherry-pick commit aae62aa95 (the merge commit on PR #828 before any of PR #886 was applied) and reproduced the same failure, with identical TxIDs, under the same stress harness.
It likely first appeared with #879 (perf(propagation): process /txs batch concurrently with ordered errors), the commit that turned handleMultipleTx from sequential to concurrent.
Suggested next steps (in order)
1. Surface the wrapped validator error in the test output so the actual failure mode is visible; at minimum, log it via t.Logf when the response code is non-200.
2. Determine whether the race is in:
   a. the validator's parent-tx lookup path (single reader vs. multiple readers of the same UTXO record), or
   b. the sqlitememory store's concurrent write path (multiple Create/Spend calls in flight), or
   c. the propagation handler's tx-store sequencing (storeTransaction before Validate).
3. Decide whether the fix belongs in the validator, the store, or handleMultipleTx (e.g., serialising txs that share a parent in the same batch).