fix(metrics): run collection hooks off the /metrics request path#24827

Open
0xMars42 wants to merge 1 commit into
paradigmxyz:mainfrom
0xMars42:fix/metrics-offload-static-file-report

@0xMars42 0xMars42 commented Jun 3, 2026

Contributor

Description

The three metric collection hooks registered by metrics_hooks (the db,
static_file_provider and rocksdb report_metrics hooks) run synchronously
inside the /metrics request handler. Each is wrapped in
throttle!(5 min), and when it fires it walks its backing store. The
static file provider in particular iterates every jar of every segment,
so its walk gets slower as the node accumulates static files and takes
several seconds on large datasets.
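
To make the throttling behaviour concrete, here is a minimal sketch of a time-based gate. It is not reth's throttle! macro: the Throttle type, run_if_due and the "first call always runs" behaviour are assumptions made for illustration only.

```rust
use std::time::{Duration, Instant};

/// Hypothetical stand-in for a throttle gate (not reth's `throttle!` macro):
/// lets the wrapped work run at most once per `interval`.
struct Throttle {
    interval: Duration,
    last_run: Option<Instant>,
}

impl Throttle {
    fn new(interval: Duration) -> Self {
        Self { interval, last_run: None }
    }

    /// Runs `work` if `interval` has elapsed since the last run.
    /// Returns `true` when the work ran and `false` when it was skipped.
    /// Assumption: the very first call always runs.
    fn run_if_due(&mut self, now: Instant, work: impl FnOnce()) -> bool {
        let due = match self.last_run {
            None => true,
            Some(last) => now.saturating_duration_since(last) >= self.interval,
        };
        if due {
            self.last_run = Some(now);
            // The expensive walk happens here, on whichever thread called us.
            work();
        }
        due
    }
}
```

The point of the sketch is the last comment: a gate like this limits how often the walk happens, not where it happens, so a caller on the request path still pays for the full walk whenever the gate opens.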

Because that work runs on the HTTP handler future, the scrape is blocked
for the whole walk. As reported in #24566, a Prometheus/VictoriaMetrics
scraper with a scrape_timeout shorter than the walk drops a sample every
5 min (periodic dashboard gaps), and the timeout needed to avoid those gaps
keeps rising as static files accumulate.

Fix

Decouple collection from rendering, as the issue suggests:

  • Add a periodic background task (start_metrics_collection_task) that
    runs the hooks on the blocking pool (spawn_blocking), so the walk can
    never starve the async runtime.
  • /metrics and the push gateway now only render the already-collected
    gauge values; neither runs hooks inline anymore. The request handler has
    bounded, dataset-independent latency.
  • The hooks keep their existing throttle!(5 min) gate, so the
    report_metrics cadence is unchanged; only the thread it runs on
    differs. Collection stays owned by the hooks, and the task just pumps
    them off the request path.
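
The split between collection and rendering can be sketched as follows. This is a simplified model under stated assumptions, not the code in this PR: it uses a plain std::thread in place of the blocking pool, and Gauge, start_collection, render and the metric name are hypothetical names chosen for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical gauge: the collector writes it, the render path only reads it.
#[derive(Default)]
struct Gauge(AtomicU64);

impl Gauge {
    fn set(&self, value: u64) {
        self.0.store(value, Ordering::Relaxed);
    }

    fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Runs `hook` for `rounds` iterations on a dedicated thread, sleeping
/// `period` after each round. A std thread stands in for the blocking pool.
fn start_collection(
    gauge: Arc<Gauge>,
    rounds: u32,
    period: Duration,
    hook: impl Fn() -> u64 + Send + 'static,
) -> JoinHandle<()> {
    thread::spawn(move || {
        for _ in 0..rounds {
            // The potentially slow walk runs here, off the request path.
            gauge.set(hook());
            thread::sleep(period);
        }
    })
}

/// Render path: formats the last collected value and never calls a hook.
/// The metric name is made up for this example.
fn render(gauge: &Gauge) -> String {
    format!("static_files_size_bytes {}\n", gauge.get())
}
```

With this shape, the cost of render depends only on the number of gauges, not on how long a hook takes, which is the property the bullets above describe for the request handler.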

The push gateway task already ran these same hooks on a background task via
the same TaskExecutor, so this reuses an established pattern and also
stops that path from blocking on them.

Testing

Added metrics_endpoint_is_not_blocked_by_slow_hook: it registers a hook
that sleeps 3 s and asserts /metrics still responds within a 500 ms
client timeout and under 1 s wall clock. It fails on the previous
synchronous path (the request times out) and passes with the background
collection.
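
The idea behind that test can be illustrated without an HTTP server. The sketch below is not the test added in this PR: render_latency_during_slow_hook is a hypothetical helper, the slow hook is a sleeping std thread, and "rendering" is reduced to reading one atomic value.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;
use std::time::{Duration, Instant};

/// Starts a collector whose hook sleeps for `hook_delay`, then measures how
/// long one read of the gauge takes while that hook is still running.
fn render_latency_during_slow_hook(hook_delay: Duration) -> Duration {
    let gauge = Arc::new(AtomicU64::new(0));
    let writer = Arc::clone(&gauge);

    // Background collector: the sleep stands in for the slow walk.
    let collector = thread::spawn(move || {
        thread::sleep(hook_delay);
        writer.store(1, Ordering::Relaxed);
    });

    // "Render": read the current gauge value without waiting for the hook.
    let started = Instant::now();
    let _body = format!("gauge {}\n", gauge.load(Ordering::Relaxed));
    let latency = started.elapsed();

    collector.join().expect("collector thread panicked");
    latency
}
```

In this model the measured latency stays far below the hook delay, because the read never waits on the collector. On the synchronous path the read would sit behind the sleep instead.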

cargo test -p reth-node-metrics (4 passed), cargo +nightly fmt --check
and cargo +nightly clippy --all-targets --all-features -D warnings are
green.

Closes #24566

The three metric collection hooks (db, static file provider and rocksdb
`report_metrics`) ran synchronously inside the `/metrics` request handler.
On large datasets the static file walk takes several seconds, so each time
the 5 min throttle fired the scrape blocked for the whole walk, and
scrapers with a shorter `scrape_timeout` dropped a sample.

Run the hooks in a periodic background task on the blocking pool instead,
and let `/metrics` and the push gateway only render the already-collected
gauge values. The hooks keep their `throttle!(5 min)` gate, so the
`report_metrics` cadence is unchanged; only the thread it runs on differs.

Closes paradigmxyz#24566
@github-project-automation github-project-automation Bot moved this to Backlog in Reth Tracker Jun 3, 2026
@0xMars42 0xMars42 marked this pull request as ready for review June 4, 2026 12:29

Development

Successfully merging this pull request may close these issues.

/metrics handler hangs while running throttled report_metrics() hooks; blocks scrapers with multi-second timeouts on large datasets
