op-reth committed blocks later become unreadable, leaving local gap and blocking safe derivation #182

@jcortejoso

Summary

A Celo mainnet celo-reth / op-reth datadir developed a local gap where blocks 67211634 through 67211637 became unreadable even though logs show they were received, executed, added, and committed earlier. The unsafe chain continued past the gap and matched Forno at later blocks, but op-node could not advance safe/finalized because the engine could not read the next safe block after 67211633.

This appears to be a storage/provider/static-file consistency issue, not a canonical chain divergence.

Observed local gap

On the affected op-reth RPC:

  • 67211633 was present.
  • 67211634 through 67211637 returned null.
  • 67211638 was present.
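The check above can be reproduced with a small range scan. This is a minimal sketch, not the tooling used in the investigation: `make_rpc_fetcher` and `find_gaps` are hypothetical helper names, and the endpoint URL in the usage note is an assumption. It relies only on the standard `eth_getBlockByNumber` JSON-RPC method, which returns `null` for a block the node cannot serve.

```python
import json
import urllib.request
from typing import Callable, List, Optional, Tuple

# Hypothetical type: takes a block number, returns the block dict or None.
BlockFetcher = Callable[[int], Optional[dict]]


def make_rpc_fetcher(url: str) -> BlockFetcher:
    """Build a fetcher that calls eth_getBlockByNumber on a JSON-RPC endpoint."""
    def fetch(number: int) -> Optional[dict]:
        payload = json.dumps({
            "jsonrpc": "2.0",
            "id": 1,
            "method": "eth_getBlockByNumber",
            "params": [hex(number), False],  # False: transaction hashes only
        }).encode()
        req = urllib.request.Request(
            url, data=payload, headers={"Content-Type": "application/json"}
        )
        with urllib.request.urlopen(req, timeout=10) as resp:
            return json.load(resp).get("result")
    return fetch


def find_gaps(fetch: BlockFetcher, start: int, end: int) -> List[Tuple[int, int]]:
    """Return inclusive (first, last) ranges in [start, end] where fetch gives None."""
    gaps: List[Tuple[int, int]] = []
    gap_start: Optional[int] = None
    for n in range(start, end + 1):
        if fetch(n) is None:
            if gap_start is None:
                gap_start = n  # first unreadable block of a new gap
        elif gap_start is not None:
            gaps.append((gap_start, n - 1))  # gap closed by a readable block
            gap_start = None
    if gap_start is not None:
        gaps.append((gap_start, end))  # gap runs to the end of the scanned range
    return gaps
```

For example, `find_gaps(make_rpc_fetcher("http://localhost:8545"), 67211630, 67211640)` would scan the window around the incident, assuming the affected node exposes its RPC at that address.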

Forno had the missing blocks, and local 67211638.parentHash matched Forno block 67211637, so the local node did not appear to diverge from Celo mainnet. Instead, the local store had an unreadable gap inside an otherwise canonical chain.

Known hashes from the investigation:

  • local/Forno 67211633: 0x96d6fe1e1b81f43e786cde3d44f4c0d5c9e0da930fb4e49611b45cc89d06f3bc
  • Forno 67211634: 0x2fbd551dea2d53f20df1966a339d417291a207161f906a935c61f506a3f1fa06
  • Forno 67211635: 0xbae1c8dd32ad6b250e42c95d34ac2067646a131cab3c5d66478ab06c8a3d4a53
  • Forno 67211636: 0x66593291d1ced88758edb74b3be6015c4c768ec08f8cf6938a63edbd7744fda4
  • Forno 67211637: 0xb1cdd7ef255a42f555169ca41d7cab87e4a93e7e2cd18dbe6009a3df45533f4f
  • local/Forno 67211638: 0xd022d7c4517d3c76f2b70a81f7baf793a8ae091d41af2ea00e5861a4cb2d97b1
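The parent-hash comparison used to rule out a divergence can be sketched as follows. `links_to` is a hypothetical helper; it assumes both arguments are block objects as returned by `eth_getBlockByNumber` (hex-encoded `number`, plus `hash` and `parentHash` fields).

```python
from typing import Optional


def links_to(child: Optional[dict], parent: Optional[dict]) -> bool:
    """True when child directly follows parent: consecutive numbers and
    child.parentHash equal to parent.hash."""
    if child is None or parent is None:
        return False  # an unreadable block cannot prove continuity
    consecutive = int(child["number"], 16) == int(parent["number"], 16) + 1
    same_hash = child["parentHash"].lower() == parent["hash"].lower()
    return consecutive and same_hash


# Values taken from the hashes listed in this report.
local_67211638 = {
    "number": hex(67211638),
    "hash": "0xd022d7c4517d3c76f2b70a81f7baf793a8ae091d41af2ea00e5861a4cb2d97b1",
    "parentHash": "0xb1cdd7ef255a42f555169ca41d7cab87e4a93e7e2cd18dbe6009a3df45533f4f",
}
forno_67211637 = {
    "number": hex(67211637),
    "hash": "0xb1cdd7ef255a42f555169ca41d7cab87e4a93e7e2cd18dbe6009a3df45533f4f",
}

print(links_to(local_67211638, forno_67211637))
```

Applying the same check to each adjacent pair from 67211633 to 67211638, with the missing blocks taken from Forno, would confirm that the local blocks on both sides of the gap belong to the same canonical chain.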

Logs indicate the missing blocks were committed

The missing blocks had approximately these timestamps:

  • 67211634: 2026-05-18T13:53:12Z
  • 67211635: 2026-05-18T13:53:13Z
  • 67211636: 2026-05-18T13:53:14Z
  • 67211637: 2026-05-18T13:53:15Z

op-reth logs showed each of these blocks being received, added, and committed:

  • 67211634: received 13:53:12.035324Z, added 13:53:12.072262Z, committed 13:53:12.074077Z
  • 67211635: received 13:53:13.043232Z, added 13:53:13.162668Z, committed 13:53:13.164519Z
  • 67211636: received 13:53:14.037280Z, added 13:53:14.146090Z, committed 13:53:14.147863Z
  • 67211637: received 13:53:15.034068Z, added 13:53:15.047042Z, committed 13:53:15.048456Z
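To put the log timings above in perspective, the received-to-committed latency of each block can be computed directly from the quoted timestamps. This is a small sketch; `elapsed_ms` is a hypothetical helper, and it assumes both timestamps fall on the same day.

```python
from datetime import datetime


def elapsed_ms(start: str, end: str) -> float:
    """Milliseconds between two HH:MM:SS.ffffffZ timestamps on the same day."""
    fmt = "%H:%M:%S.%f"
    t0 = datetime.strptime(start.rstrip("Z"), fmt)
    t1 = datetime.strptime(end.rstrip("Z"), fmt)
    return (t1 - t0).total_seconds() * 1000.0


# (received, committed) pairs copied from the op-reth log lines above.
events = {
    67211634: ("13:53:12.035324Z", "13:53:12.074077Z"),
    67211635: ("13:53:13.043232Z", "13:53:13.164519Z"),
    67211636: ("13:53:14.037280Z", "13:53:14.147863Z"),
    67211637: ("13:53:15.034068Z", "13:53:15.048456Z"),
}

for number, (received, committed) in events.items():
    print(f"{number}: {elapsed_ms(received, committed):.3f} ms")
```

All four blocks were committed between roughly 14 ms and 121 ms after being received, so the logs show no stalled or unusually slow commit for the blocks that later became unreadable.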

op-node also inserted/processed those payloads around the same timestamps.

Safe derivation symptom

Later, op-node advanced safe to 67211633 and then immediately reset because it could not read the next block from the engine:

Deriver system is resetting
err="expected engine was synced and had unsafe block to reconcile, but cannot find the block: not found"

The first reset happened around 2026-05-18T13:56:37Z, shortly after op-node recorded safe head 67211633. This lines up with the first missing block being 67211634.

Static-file/provider symptoms

A db get against the static-file headers reported an inconsistent static-file segment:

Error: File is in an inconsistent state

The affected segment was the headers segment covering 67000000_67499999.

A later recovery attempt with a Celo-aware stage command made it past the earlier CIP-64 transaction decoding problem (tracked separately in #180), but failed during Execution unwind because header 67211634 is missing:

2026-05-19T15:54:57.421615Z  INFO Unwinding{stage=Execution}: Starting unwind from=67293475 to=67211633 bad_block=None
2026-05-19T16:02:41.526782Z ERROR shutting down due to error
Error: database integrity error occurred: no header found for Number(67211634)

Caused by:
    no header found for Number(67211634)

Location:
    /usr/local/cargo/git/checkouts/reth-e231042ee7db3fb7/d6324d6/crates/cli/commands/src/stage/unwind.rs:74:9

This reinforces that the local database/static files are missing the header for 67211634, despite earlier commit logs.

Host/process signals checked

Grafana/host checks did not show an obvious OOM, process restart, or filesystem device error around the incident window:

  • op-reth/op-node process start timestamps appeared constant around the failure window.
  • node_vmstat_oom_kill did not indicate an OOM event.
  • node_filesystem_device_error was zero for /var/lib/op-reth and other checked mounts.

ExEx relevance

The node had proofs-history ExEx enabled, and the ExEx later logged Missing block 67211634, but that looks like an additional detector rather than the root cause. Similar behavior may have been observed on a Celo Sepolia node without ExEx enabled, so this should probably be investigated first as a storage/static-file/provider consistency issue independent of ExEx.

Impact

The node can continue serving/holding later unsafe blocks that match the canonical chain, but op-node safe/finalized derivation cannot cross the missing local block. Recovery by unwind is also blocked because the execution unwind expects the missing header.

Questions / investigation points

  • How can op-reth log a block as committed and later be unable to read its header by number?
  • Can static-file production or pruning leave a small header/body gap while later headers remain available?
  • Is there a known failure mode where static-file segment metadata/indexes become inconsistent without a process restart/OOM/filesystem error?
  • Should stage unwind have a repair path for this kind of partial static-file/header gap?
  • Is this specific to Celo primitives/storage routing, or inherited from upstream reth static-file behavior?
