
Flake: services/propagation Test_handleMultipleTx fails under -race + -cpu=8 stress #890

@icellan

Summary

Test_handleMultipleTx/Test_handleMultipleTx_with_valid_sibling_transactions (services/propagation/Server_test.go:686) is racy under -race on multi-core runners. It passes a single iteration but fails reproducibly under stress.

Reproduction

go test -race -tags testtxmetacache -count=1 -timeout 600s \
  -run 'Test_handleMultipleTx$' -cpu=8 ./services/propagation/

A single -count=1 run rarely fails; bumping to somewhere between -count=200 and -count=2000 exposes the failure within ~30–60 seconds on an 8-core box. Locally I saw the first failure around iteration 492 of a 2000-iteration loop; in CI it surfaced on a single run of the test (see e.g. job 76589759556 on PR #828).

Symptom

=== FAIL: services/propagation Test_handleMultipleTx/Test_handleMultipleTx_with_valid_sibling_transactions
    Server_test.go:747:
        Error: expected 200, actual 500
    Server_test.go:752:
        expected: "OK"
        actual:   "Failed to process transactions:
                   PROCESSING (4): [ProcessTransaction][<txid>] failed to validate transaction
                   ..."

One or more of the 20 sibling txs fail validation. The number of failing siblings varies (often 1, sometimes up to ~6), and the failing TxIDs are non-deterministic between runs.

Setup the test exercises

  • sqlitememory:///test UTXO store, chain height set to 101.
  • A coinbase tx with 20 P2PKH outputs created at block height 1 (matures at 101).
  • 20 sibling tx, each spending a different vout of the coinbase, are submitted as a single /txs batch.
  • The handler (handleMultipleTx, Server.go:679) dispatches one goroutine per tx (gated by the server-wide batchWorkerPool semaphore introduced in perf(propagation): process /txs batch concurrently with ordered errors #879) and writes per-tx errors into pre-allocated slots.
  • Test expects HTTP 200 + body "OK".

Diagnostic notes

  • Re-validating each failed sibling synchronously against the same Validator and Store after the batch completes always succeeds. The bad state is transient.
  • The underlying validator error is wrapped at the propagation layer and stripped by errors.UserMessage (Server.go:833) before being written to the response, so the message describing the actual failure isn't visible in CI logs. Surfacing it via a debug log of the inner error would speed up further triage.
  • The failure does NOT reproduce without -race, or with a verbose-test logger (whose logging volume changes goroutine scheduling enough to hide it).
  • Happens-before for the values checked by the testify/require assertions is fine: each errSlots slot is written by a distinct goroutine, and processingWg.Wait() precedes the reads. The race appears to be inside the validator / sqlitememory store path itself when 20 goroutines concurrently read the same parent coinbase record and write 20 distinct spending tx records.

Pre-PR-828 evidence

The flake is not introduced by PR #828. I checked out services/propagation/ at the pre-cherry-pick commit aae62aa95 (the merge commit on PR #828 before any of PR #886 was applied) and reproduced the same failure, with identical TxIDs, under the same stress harness.

It likely first appeared with #879 (perf(propagation): process /txs batch concurrently with ordered errors), the commit that turned handleMultipleTx from sequential to concurrent.

Suggested next steps (in order)

  1. Surface the wrapped validator error in the test output so the actual failure mode is visible — at least log it via t.Logf when the response code is non-200.
  2. Determine whether the race is in:
    • The validator's parent-tx lookup path (single reader vs. multiple readers of the same UTXO record), or
    • The sqlitememory store's concurrent write path (multiple Create/Spend calls in flight), or
    • The propagation handler's tx-store sequencing (storeTransaction before Validate).
  3. Decide whether the fix belongs in the validator, the store, or in handleMultipleTx (e.g., serialising txs that share a parent in the same batch).

Labels: bug