fix(chain-state): avoid engine deadlock from locks held across the trie compute (v2.2.5 base) #24871
…ie compute

Same fix as paradigmxyz#24870, rooted at v2.2.5 (81c0261); the target functions are identical to main.

Eager variant: get_overlay computes the overlay without holding the overlays DashMap entry guard, then rechecks and inserts; DeferredTrieData computes the sort eagerly in pending() so wait_cloned never sorts on (or parks) a worker on the hot path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
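The compute-then-recheck-and-insert shape described for get_overlay can be sketched with standard-library types only. This is a minimal illustration, not the reth code: `OverlayCacheSketch`, the `u64` key, and the `Vec<u64>` payload are hypothetical stand-ins, and a plain `Mutex<HashMap>` replaces the DashMap shard guard.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for an overlay cache; not the real reth type.
pub struct OverlayCacheSketch {
    overlays: Mutex<HashMap<u64, Arc<Vec<u64>>>>,
}

impl OverlayCacheSketch {
    pub fn new() -> Self {
        Self { overlays: Mutex::new(HashMap::new()) }
    }

    /// Returns the cached overlay for `key`, computing it on a miss.
    /// The lock is never held while `compute` runs.
    pub fn get_overlay(&self, key: u64, compute: impl FnOnce() -> Vec<u64>) -> Arc<Vec<u64>> {
        // 1. Short lookup: the guard is a temporary and is dropped at the
        //    end of this statement.
        let cached = self.overlays.lock().unwrap().get(&key).cloned();
        if let Some(hit) = cached {
            return hit;
        }

        // 2. Expensive compute with no lock held, so a concurrent or
        //    re-entrant caller that needs the map cannot block on us.
        let fresh = Arc::new(compute());

        // 3. Recheck and insert: if another caller inserted the same key
        //    in the meantime, keep their value and drop ours.
        let mut map = self.overlays.lock().unwrap();
        map.entry(key).or_insert(fresh).clone()
    }

    /// Drops every entry at or below `up_to` (takes the same lock).
    pub fn remove_blocks(&self, up_to: u64) {
        self.overlays.lock().unwrap().retain(|k, _| *k > up_to);
    }
}
```

The trade-off is that two callers racing on the same key may both run the compute; one result is discarded. In this sketch, a compute closure that itself calls `remove_blocks` completes normally, whereas holding the lock across step 2 would make that call block forever.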
Self-contained repro for maintainers
The 15 s watchdog + unconditional teardown mean it never hangs CI; it fails fast. How it forces the deadlock (site 1): a 1-thread … Full writeup + the gdb-confirmed mechanism, including the …
Same fix as #24870, but rooted at v2.2.5 (`81c026181e96ef33a823f3ef4d2a28940e9fa4fe`) instead of current `main`; that's the commit the affected op-reth deployments pin. The two touched functions (`StateTrieOverlayManager::get_overlay`, `DeferredTrieData::{sort,wait_cloned}`) are byte-identical between `81c0261` and `main`, so it's the same change; this branch just applies directly to the v2.2.x line. Opened as a separate branch for backporting / for anyone running v2.2.x (there's no release branch to target, so the base shows as `main`).

The bug: a cyclic deadlock that wedges the engine under sustained Engine-API block import (op-node / op-supernode CL-sync). Symptoms: `Block added to canonical chain N` with no following `Canonical chain committed N`, `forkchoiceUpdated` times out, libmdbx logs `Long-lived read transaction has been timed out open_duration=300.000s`, and the process stays up. Two locks are held across a blocking `rayon` compute: `get_overlay`'s `overlays` shard-write guard (vs `remove_blocks` → `overlays.retain`), and `DeferredTrieData::wait_cloned`'s `state` Mutex across `Self::sort` (a `rayon::join` that gets work-steal-re-entered).

Fix: don't hold a lock across the compute.

Full writeup, validation (6 h soak + repeated from-genesis CL-sync resyncs + a deterministic, timeout-guarded regression test), and a laziness-preserving alternative are in #24870. Downstream issue: ethereum-optimism/optimism#21076.
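For the second site, the eager variant can be illustrated the same way. This is a sketch under stated assumptions, not the actual `DeferredTrieData`: `DeferredSketch` and its `Vec<u64>` payload are hypothetical, and a plain sort stands in for the `rayon::join`-based one. The point is only that the sort happens once at construction, so the mutex in `wait_cloned` guards nothing more than a pointer clone.

```rust
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for deferred trie data; not the real reth type.
pub struct DeferredSketch {
    state: Mutex<Arc<Vec<u64>>>,
}

impl DeferredSketch {
    /// Eager variant: the sort runs here, on the constructing thread,
    /// before any lock exists that a reader could contend on.
    pub fn pending(mut unsorted: Vec<u64>) -> Self {
        unsorted.sort_unstable();
        Self { state: Mutex::new(Arc::new(unsorted)) }
    }

    /// Readers only clone the `Arc` under the lock; no compute runs
    /// while the guard is held, so the critical section stays tiny.
    pub fn wait_cloned(&self) -> Arc<Vec<u64>> {
        self.state.lock().unwrap().clone()
    }
}
```

The cost of this choice is that the sort is always paid, even when no reader ever asks for the result; per the description above, the laziness-preserving alternative is covered in #24870.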