fix(chain-state): avoid engine deadlock from locks held across the trie compute (v2.2.5 base) #24871

Draft
opsuperchain wants to merge 1 commit into paradigmxyz:main from opsuperchain:fix/chain-state-overlay-deferred-trie-deadlock-v2.2.5

@opsuperchain

Same fix as #24870, but rooted at v2.2.5 (81c026181e96ef33a823f3ef4d2a28940e9fa4fe) instead of current main — that's the commit the affected op-reth deployments pin. The two touched functions (StateTrieOverlayManager::get_overlay, DeferredTrieData::{sort,wait_cloned}) are byte-identical between 81c0261 and main, so it's the same change; this branch just applies directly to the v2.2.x line. Opened as a separate branch for backporting / for anyone running v2.2.x (there's no release branch to target, so the base shows as main).

The bug: a cyclic deadlock that wedges the engine under sustained Engine-API block import (op-node / op-supernode CL-sync). Symptoms: Block added to canonical chain N is logged with no following Canonical chain committed N, forkchoiceUpdated times out, libmdbx logs Long-lived read transaction has been timed out open_duration=300.000s, and the process stays up. Two locks are held across a blocking rayon compute: get_overlay's overlays shard-write guard (contended by remove_blocks → overlays.retain), and DeferredTrieData::wait_cloned's state Mutex across Self::sort (a rayon::join that gets re-entered through work stealing). Fix: don't hold a lock across the compute.
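
To make the shape of the site-1 fix concrete, here is a minimal std-only sketch of "compute without the guard, then recheck and insert". It is not the reth code: OverlayCache, the u64 anchor key, and the Vec<u8> overlay value are hypothetical stand-ins, and a single RwLock<HashMap> replaces the sharded DashMap.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for the overlay cache (not reth's real type).
pub struct OverlayCache {
    overlays: RwLock<HashMap<u64, Vec<u8>>>,
}

impl OverlayCache {
    pub fn new() -> Self {
        Self { overlays: RwLock::new(HashMap::new()) }
    }

    /// Returns the cached overlay for `anchor`, computing it on a miss.
    /// No lock is held while `compute` runs.
    pub fn get_overlay(&self, anchor: u64, compute: impl FnOnce() -> Vec<u8>) -> Vec<u8> {
        // Fast path: the read guard is a temporary dropped at the end of this statement.
        let cached = self.overlays.read().unwrap().get(&anchor).cloned();
        if let Some(hit) = cached {
            return hit;
        }

        // Slow path: the expensive (possibly blocking) compute runs lock-free.
        let computed = compute();

        // Recheck and insert: if another thread inserted first, keep its value.
        let mut map = self.overlays.write().unwrap();
        let value = map.entry(anchor).or_insert(computed).clone();
        value
    }

    /// Drops every overlay anchored at or below `up_to`; needs the write lock.
    pub fn remove_blocks(&self, up_to: u64) {
        self.overlays.write().unwrap().retain(|anchor, _| *anchor > up_to);
    }
}
```

The trade-off in this sketch is that two threads missing on the same anchor may both run the compute, with one result discarded; in exchange, remove_blocks never waits on a compute in progress.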

Full writeup, validation (6 h soak + repeated from-genesis CL-sync resyncs + a deterministic, timeout-guarded regression test), and a laziness-preserving alternative are in #24870. Downstream issue: ethereum-optimism/optimism#21076.

…ie compute

Same fix as paradigmxyz#24870, rooted at v2.2.5 (81c0261);
the target functions are identical to main. Eager variant: get_overlay computes
the overlay without holding the overlays DashMap entry guard, then recheck-and-
inserts; DeferredTrieData computes the sort eagerly in pending() so wait_cloned
never sorts on (or parks) a worker on the hot path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
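
As a rough illustration of the eager DeferredTrieData variant described in the commit message, the sketch below sorts once at construction so that the later read only clones a handle under a briefly held lock. EagerSorted and its Vec<u64> payload are hypothetical; the real type and its fields are not shown in this PR text.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in: data that is sorted eagerly at construction.
pub struct EagerSorted {
    state: Mutex<Arc<Vec<u64>>>,
}

impl EagerSorted {
    /// Sorts up front, on the constructing thread, before any lock exists.
    pub fn pending(mut unsorted: Vec<u64>) -> Self {
        unsorted.sort_unstable();
        Self { state: Mutex::new(Arc::new(unsorted)) }
    }

    /// Holds the mutex only long enough to clone the Arc; never sorts here.
    pub fn wait_cloned(&self) -> Arc<Vec<u64>> {
        let guard = self.state.lock().unwrap();
        Arc::clone(&*guard)
    }
}
```

This gives up laziness (the sort always runs, even if nobody reads the result), which is why the PR description points to a laziness-preserving alternative in #24870.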
@opsuperchain

Author

Self-contained repro for maintainers

Branch repro/overlay-deferred-trie-deadlock on the fork = main + a single deterministic, timeout-guarded regression test. No network, no devnet, no secrets — runs in seconds:

cargo test -p reth-chain-state --features rayon \
    remove_blocks_makes_progress_while_overlay_computes -- --nocapture
  • On unpatched main: fails after a 15 s watchdog with
    DEADLOCK: remove_blocks() did not complete within 15s while an overlay was computing.
    (confirmed — panics at ~15.03 s).
  • With this PR's fix: passes in < 1 s.

The 15 s watchdog + unconditional teardown mean it never hangs CI; it fails fast.

How it forces the deadlock (site 1): a 1-thread WorkerPool runs get_overlay, which (unpatched) holds the overlays vacant-entry shard-write guard across the compute; a #[cfg(test)] hook (keyed on the test's anchor hash, so it fires only for this test) parks compute_overlay on that worker; a second thread calls remove_blocks → overlays.retain(...), which must write-lock every shard and wedges on the held one.
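
The same test shape can be sketched with only the standard library: a worker parks its "compute" on a channel (optionally with the lock held), a second thread tries to take the lock, and a watchdog bounds the wait before an unconditional teardown. All names here are hypothetical, a plain Mutex<Vec<u64>> stands in for the overlays map, and the 500 ms limit is an arbitrary choice for the sketch rather than the 15 s used by the real test.

```rust
use std::sync::{mpsc, Arc, Mutex};
use std::thread;
use std::time::Duration;

/// Runs `op` on a helper thread; true if it finished within `limit`.
pub fn completes_within<F>(limit: Duration, op: F) -> bool
where
    F: FnOnce() + Send + 'static,
{
    let (done_tx, done_rx) = mpsc::channel();
    thread::spawn(move || {
        op();
        let _ = done_tx.send(()); // receiver may already be gone after a timeout
    });
    done_rx.recv_timeout(limit).is_ok()
}

/// Returns whether the "remove_blocks" stand-in made progress while the
/// "compute" stand-in was parked.
pub fn remove_blocks_progresses(hold_lock_across_compute: bool) -> bool {
    let overlays = Arc::new(Mutex::new(Vec::<u64>::new()));
    let (gate_tx, gate_rx) = mpsc::channel::<()>(); // parks the compute
    let (started_tx, started_rx) = mpsc::channel::<()>();

    let worker = {
        let overlays = Arc::clone(&overlays);
        thread::spawn(move || {
            if hold_lock_across_compute {
                // Unpatched shape: guard taken first, then the parked compute.
                let mut guard = overlays.lock().unwrap();
                started_tx.send(()).unwrap();
                let _ = gate_rx.recv();
                guard.push(1);
            } else {
                // Patched shape: parked compute first, lock only for the insert.
                started_tx.send(()).unwrap();
                let _ = gate_rx.recv();
                overlays.lock().unwrap().push(1);
            }
        })
    };

    started_rx.recv().unwrap(); // worker is now parked
    let progressed = {
        let overlays = Arc::clone(&overlays);
        completes_within(Duration::from_millis(500), move || {
            overlays.lock().unwrap().retain(|n| *n > 0);
        })
    };

    // Unconditional teardown: unpark the worker so nothing hangs.
    let _ = gate_tx.send(());
    worker.join().unwrap();
    progressed
}
```

Calling remove_blocks_progresses(true) models the unpatched behavior and is expected to report no progress once the watchdog expires, while remove_blocks_progresses(false) models the fix.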

Full writeup + the gdb-confirmed mechanism — including the DeferredTrieData::wait_cloned rayon::join work-steal re-entrancy (site 2) — is in REPRO_OVERLAY_DEADLOCK.md on that branch.

