fix(metrics): run collection hooks off the /metrics request path#24827
Open
0xMars42 wants to merge 1 commit into
Open
fix(metrics): run collection hooks off the /metrics request path#248270xMars42 wants to merge 1 commit into
0xMars42 wants to merge 1 commit into
Conversation
The three metric collection hooks (db, static file provider and rocksdb `report_metrics`) ran synchronously inside the `/metrics` request handler. On large datasets the static file walk takes several seconds, so each time the 5 min throttle fired the scrape blocked for the whole walk, and scrapers with a shorter `scrape_timeout` dropped a sample. Run the hooks in a periodic background task on the blocking pool instead, and let `/metrics` and the push gateway only render the already-collected gauge values. The hooks keep their `throttle!(5 min)` gate, so the `report_metrics` cadence is unchanged; only the thread it runs on differs. Closes paradigmxyz#24566
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
The three metric collection hooks registered by
metrics_hooks(db,static_file_providerandrocksdbreport_metrics) run synchronouslyinside the
/metricsrequest handler. Each is wrapped inthrottle!(5 min), and when it fires it walks its backing store. Thestatic file provider in particular iterates every jar of every segment,
which grows as the node accumulates static files and takes several seconds
on large datasets.
Because that work runs on the HTTP handler future, the scrape is blocked
for the whole walk. As reported in #24566, a Prometheus/VictoriaMetrics
scraper with a
scrape_timeoutshorter than the walk drops a sample every5 min (periodic dashboard gaps), and the floor keeps rising as static files
accumulate.
Fix
Decouple collection from rendering, as the issue suggests:
start_metrics_collection_task) thatruns the hooks on the blocking pool (
spawn_blocking), so the walk cannever starve the async runtime.
/metricsand the push gateway now only render the already-collectedgauge values; neither runs hooks inline anymore. The request handler has
bounded, dataset-independent latency.
throttle!(5 min)gate, so thereport_metricscadence is unchanged; only the thread it runs ondiffers. Collection stays owned by the hooks, and the task just pumps
them off the request path.
The push gateway task already ran these same hooks on a background task via
the same
TaskExecutor, so this reuses an established pattern and alsostops that path from blocking on them.
Testing
Added
metrics_endpoint_is_not_blocked_by_slow_hook: it registers a hookthat sleeps 3 s and asserts
/metricsstill responds within a 500 msclient timeout and under 1 s wall clock. It fails on the previous
synchronous path (the request times out) and passes with the background
collection.
cargo test -p reth-node-metrics(4 passed),cargo +nightly fmt --checkand
cargo +nightly clippy --all-targets --all-features -D warningsaregreen.
Closes #24566