fix(hermes): disconnect slow WS/SSE consumers to prevent OOM #3769

Merged
ali-behjati merged 4 commits into main from hermes/streaming-slow-consumer-disconnect on May 29, 2026
Conversation

ali-behjati (Collaborator) commented May 29, 2026

Summary

  • Add configurable slow-consumer protection for Hermes streaming endpoints via RPC_DISCONNECT_SLOW_CONSUMERS (default: true) and RPC_WS_MAX_WRITE_BUFFER_BYTES (default: 2 MiB).
  • Cap WebSocket write buffers when enabled and disconnect clients on WriteBufferFull instead of allowing tungstenite's unlimited outbound buffer to grow under TCP backpressure.
  • Disconnect lagging SSE clients with a terminal error event when broadcast lag is detected, instead of keeping connections open for up to 24h.
  • Add streaming observability metrics: active connections, slow-consumer disconnects, and SSE broadcast lag events.
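The two settings in the summary could be read roughly as follows. This is a minimal sketch, not the PR's code: only the environment variable names and their defaults come from the summary, while `StreamingSettings`, `from_lookup`, and `effective_write_buffer_cap` are hypothetical names, and accepting only `true`/`false` for the flag is an assumption.

```rust
/// Default WebSocket write-buffer cap named in the summary: 2 MiB.
pub const DEFAULT_WS_MAX_WRITE_BUFFER_BYTES: usize = 2 * 1024 * 1024;

/// Hypothetical settings struct; only the env var names come from the PR.
#[derive(Debug, Clone, PartialEq)]
pub struct StreamingSettings {
    pub disconnect_slow_consumers: bool,
    pub ws_max_write_buffer_bytes: usize,
}

impl StreamingSettings {
    /// Builds the settings from any key lookup, so tests need not touch the
    /// process environment. In a binary, pass `|k| std::env::var(k).ok()`.
    pub fn from_lookup<F>(lookup: F) -> Result<Self, String>
    where
        F: Fn(&str) -> Option<String>,
    {
        // Unset flag falls back to the documented default (true).
        let disconnect_slow_consumers = match lookup("RPC_DISCONNECT_SLOW_CONSUMERS") {
            Some(raw) => raw
                .trim()
                .parse::<bool>()
                .map_err(|e| format!("RPC_DISCONNECT_SLOW_CONSUMERS: {e}"))?,
            None => true,
        };
        // Unset cap falls back to the documented default (2 MiB).
        let ws_max_write_buffer_bytes = match lookup("RPC_WS_MAX_WRITE_BUFFER_BYTES") {
            Some(raw) => raw
                .trim()
                .parse::<usize>()
                .map_err(|e| format!("RPC_WS_MAX_WRITE_BUFFER_BYTES: {e}"))?,
            None => DEFAULT_WS_MAX_WRITE_BUFFER_BYTES,
        };
        Ok(Self {
            disconnect_slow_consumers,
            ws_max_write_buffer_bytes,
        })
    }

    /// `None` models the legacy behavior: no cap is applied when the
    /// protection is switched off.
    pub fn effective_write_buffer_cap(&self) -> Option<usize> {
        self.disconnect_slow_consumers
            .then_some(self.ws_max_write_buffer_bytes)
    }
}
```

With nothing set, `StreamingSettings::from_lookup(|_| None)` yields protection enabled with a 2,097,152-byte cap. Setting the flag to `false` makes `effective_write_buffer_cap` return `None`.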

Test plan

  • cargo test in apps/hermes/server (38 tests pass)
  • Manual: connect WS client, subscribe, throttle reads → connection closes and stream_slow_consumer_disconnects_total{protocol="ws"} increments
  • Manual: connect SSE client, throttle reads while slots update → stream ends with Slow consumer: disconnected and stream_slow_consumer_disconnects_total{protocol="sse"} increments
  • Manual: run with RPC_DISCONNECT_SLOW_CONSUMERS=false and confirm legacy behavior (WS unlimited buffer, SSE continues after lag errors)
  • Prod canary: watch stream_active_connections, stream_slow_consumer_disconnects_total, and pod memory after deploy

Made with Cursor


Cap WebSocket write buffers and close lagging streaming clients behind
RPC_DISCONNECT_SLOW_CONSUMERS so slow consumers cannot grow unbounded
in-process queues and hold long-lived connections.

Co-authored-by: Cursor <cursoragent@cursor.com>
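
To illustrate what capping a write buffer means, here is a toy model using only the standard library. It is not tungstenite's implementation: `BoundedWriteBuffer` and `SendError` are hypothetical stand-ins, and the exact accounting (rejecting a frame that would push the queue past the cap) is an assumption chosen for clarity.

```rust
use std::collections::VecDeque;

/// Hypothetical error mirroring the idea of a "write buffer full" failure.
#[derive(Debug, PartialEq)]
pub enum SendError {
    WriteBufferFull,
}

/// Toy outbound queue: frames wait here until the socket accepts them.
pub struct BoundedWriteBuffer {
    queued: VecDeque<Vec<u8>>,
    queued_bytes: usize,
    /// `None` models the legacy unlimited buffer.
    max_bytes: Option<usize>,
}

impl BoundedWriteBuffer {
    pub fn new(max_bytes: Option<usize>) -> Self {
        Self {
            queued: VecDeque::new(),
            queued_bytes: 0,
            max_bytes,
        }
    }

    /// Queues a frame, or reports `WriteBufferFull` if the cap would be
    /// exceeded. The caller is expected to close the connection on error.
    pub fn push(&mut self, frame: Vec<u8>) -> Result<(), SendError> {
        if let Some(cap) = self.max_bytes {
            if self.queued_bytes + frame.len() > cap {
                return Err(SendError::WriteBufferFull);
            }
        }
        self.queued_bytes += frame.len();
        self.queued.push_back(frame);
        Ok(())
    }

    /// Simulates the client reading one frame off the wire.
    pub fn drain_one(&mut self) -> Option<Vec<u8>> {
        let frame = self.queued.pop_front()?;
        self.queued_bytes -= frame.len();
        Some(frame)
    }

    pub fn queued_bytes(&self) -> usize {
        self.queued_bytes
    }
}
```

A client that stops reading never triggers `drain_one`, so with a cap the next `push` fails quickly. Without a cap, the queue grows until the process runs out of memory, which is the failure mode this PR targets.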

vercel Bot commented May 29, 2026

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| api-reference | Ready | Preview, Comment | May 29, 2026 4:35pm |
| component-library | Ready | Preview, Comment | May 29, 2026 4:35pm |
| developer-hub | Ready | Preview, Comment | May 29, 2026 4:35pm |
| entropy-explorer | Ready | Preview, Comment | May 29, 2026 4:35pm |
| insights | Error | | May 29, 2026 4:35pm |
| proposals | Ready | Preview, Comment | May 29, 2026 4:35pm |
| staking | Ready | Preview, Comment | May 29, 2026 4:35pm |

Minor release for slow-consumer disconnect and streaming backpressure changes.

Co-authored-by: Cursor <cursoragent@cursor.com>

Skip the 24h timeout SSE event after slow-consumer disconnect, allow
setting disconnect_slow_consumers=false via CLI, and suppress dead_code
warnings in generated wormhole protobuf code.

Co-authored-by: Cursor <cursoragent@cursor.com>
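
The SSE behavior described in this commit and in the summary can be sketched as a pure function over receiver outcomes. This is an illustration under stated assumptions: `RecvOutcome`, `SseEvent`, and `render_stream` are hypothetical names. The text `Slow consumer: disconnected` comes from the test plan, but the legacy lag message and the timeout message are invented placeholders.

```rust
/// Message the test plan says a slow SSE client sees before the stream ends.
pub const SLOW_CONSUMER_MESSAGE: &str = "Slow consumer: disconnected";
/// Placeholder for the event sent when the connection lifetime expires.
pub const TIMEOUT_MESSAGE: &str = "connection timeout (placeholder text)";

/// Hypothetical view of what a broadcast receiver can yield.
#[derive(Debug, Clone, PartialEq)]
pub enum RecvOutcome {
    Update(String),
    /// The receiver fell behind and skipped this many messages.
    Lagged(u64),
}

#[derive(Debug, PartialEq)]
pub enum SseEvent {
    Data(String),
    Error(String),
}

/// Turns receiver outcomes into the events a client would observe.
/// The end of `outcomes` stands in for the connection lifetime expiring.
pub fn render_stream(outcomes: Vec<RecvOutcome>, disconnect_slow_consumers: bool) -> Vec<SseEvent> {
    let mut events = Vec::new();
    let mut disconnected = false;
    for outcome in outcomes {
        match outcome {
            RecvOutcome::Update(payload) => events.push(SseEvent::Data(payload)),
            RecvOutcome::Lagged(_) if disconnect_slow_consumers => {
                // Terminal error event: nothing else is sent afterwards.
                events.push(SseEvent::Error(SLOW_CONSUMER_MESSAGE.to_string()));
                disconnected = true;
                break;
            }
            RecvOutcome::Lagged(skipped) => {
                // Legacy behavior: report the lag and keep streaming.
                events.push(SseEvent::Error(format!("lagged by {skipped} messages")));
            }
        }
    }
    // The timeout event is skipped after a slow-consumer disconnect.
    if !disconnected {
        events.push(SseEvent::Error(TIMEOUT_MESSAGE.to_string()));
    }
    events
}
```

With protection enabled, a lag makes the slow-consumer error the final event. With it disabled, updates continue after the lag and the stream still ends with the timeout event.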

…tection

is_write_buffer_full downcast WsError from tokio-tungstenite 0.26, but axum
0.6 wraps tungstenite 0.20 errors, so the check never matched. Use tungstenite
0.20.1 directly and add a test for the axum::Error wrapping path.

Co-authored-by: Cursor <cursoragent@cursor.com>
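
The version mismatch described in this commit can be reproduced with standard-library types alone. In the sketch below, `ws_v020` and `ws_v026` are hypothetical stand-ins for the same error type as defined by two versions of one crate, and `FrameworkError` stands in for a framework error that wraps its source. None of these are the real axum or tungstenite types. The point is that `downcast_ref` compares concrete types, so an identically named type from another crate version never matches.

```rust
use std::error::Error;
use std::fmt;

/// Stand-in for the error type from the version the framework actually wraps.
pub mod ws_v020 {
    #[derive(Debug)]
    pub enum WsError {
        WriteBufferFull,
        ConnectionClosed,
    }
    impl std::fmt::Display for WsError {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            write!(f, "{self:?}")
        }
    }
    impl std::error::Error for WsError {}
}

/// Stand-in for the same-named type from a newer, unrelated version.
pub mod ws_v026 {
    #[derive(Debug)]
    pub enum WsError {
        WriteBufferFull,
    }
    impl std::fmt::Display for WsError {
        fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
            write!(f, "{self:?}")
        }
    }
    impl std::error::Error for WsError {}
}

/// Hypothetical framework error that boxes the underlying cause.
#[derive(Debug)]
pub struct FrameworkError {
    inner: Box<dyn Error + Send + Sync + 'static>,
}

impl FrameworkError {
    pub fn new<E: Error + Send + Sync + 'static>(inner: E) -> Self {
        Self { inner: Box::new(inner) }
    }
}

impl fmt::Display for FrameworkError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "framework error: {}", self.inner)
    }
}

impl Error for FrameworkError {
    fn source(&self) -> Option<&(dyn Error + 'static)> {
        Some(&*self.inner)
    }
}

/// Walks the source chain looking for the wrapped version's WriteBufferFull.
pub fn is_write_buffer_full(err: &(dyn Error + 'static)) -> bool {
    let mut current = Some(err);
    while let Some(e) = current {
        if matches!(
            e.downcast_ref::<ws_v020::WsError>(),
            Some(ws_v020::WsError::WriteBufferFull)
        ) {
            return true;
        }
        current = e.source();
    }
    false
}
```

Wrapping `ws_v020::WsError::WriteBufferFull` is detected, while wrapping `ws_v026::WsError::WriteBufferFull` is not. This mirrors why the check never matched and why the commit adds a test for the wrapping path.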

devin-ai-integration Bot (Contributor) left a comment

Devin Review found 1 new potential issue.

View 8 additional findings in Devin Review.

🚩 WS broadcast channel Lagged errors are not handled as slow consumer disconnects

For WebSocket, slow consumer detection relies on tungstenite::Error::WriteBufferFull (the send-side buffer to the client filling up), not on tokio::sync::broadcast::RecvError::Lagged (the server-side broadcast channel falling behind). At ws.rs:400-404, a Lagged error from self.notify_receiver.recv() is converted to anyhow!("Failed to receive update from store: {:?}", e), which won't match is_write_buffer_full. This means WS clients that lag on the broadcast channel are disconnected silently without the slow consumer metric being recorded. This is an asymmetry with the SSE handler which explicitly tracks sse_broadcast_lagged for the same condition. Consider whether WS should also record this metric for observability parity.

(Refers to lines 400-404)
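
One way to get the observability parity this finding asks about is to classify both conditions as slow-consumer disconnects before recording the metric. The sketch below is only an illustration of that option and is not what the merged PR does. `StreamError`, `Protocol`, and `SlowConsumerMetrics` are hypothetical types, and a plain `HashMap` stands in for a real metrics registry.

```rust
use std::collections::HashMap;

/// Label values for a hypothetical `protocol` metric label.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Protocol {
    Ws,
    Sse,
}

/// Hypothetical unified view of why a streaming connection ended.
#[derive(Debug, PartialEq)]
pub enum StreamError {
    /// The client-facing write buffer hit its cap.
    WriteBufferFull,
    /// The broadcast receiver fell behind by this many messages.
    BroadcastLagged(u64),
    Other(String),
}

/// Treats both send-side and broadcast-side lag as a slow consumer.
pub fn is_slow_consumer(err: &StreamError) -> bool {
    matches!(
        err,
        StreamError::WriteBufferFull | StreamError::BroadcastLagged(_)
    )
}

/// Toy counter keyed by protocol, standing in for a labelled metric.
#[derive(Default)]
pub struct SlowConsumerMetrics {
    disconnects: HashMap<Protocol, u64>,
}

impl SlowConsumerMetrics {
    /// Increments the counter only for slow-consumer errors and returns
    /// whether the error was counted.
    pub fn record_disconnect(&mut self, protocol: Protocol, err: &StreamError) -> bool {
        if is_slow_consumer(err) {
            *self.disconnects.entry(protocol).or_insert(0) += 1;
            true
        } else {
            false
        }
    }

    pub fn total(&self, protocol: Protocol) -> u64 {
        self.disconnects.get(&protocol).copied().unwrap_or(0)
    }
}
```

Under this classification a WS client dropped for broadcast lag would increment the same counter as one dropped for a full write buffer, so dashboards would see both.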

ali-behjati merged commit 419d938 into main on May 29, 2026
14 of 15 checks passed
ali-behjati deleted the hermes/streaming-slow-consumer-disconnect branch on May 29, 2026 16:43

keyvankhademi added a commit that referenced this pull request on May 29, 2026:

main already had #3769 ("disconnect slow WS/SSE consumers to prevent OOM"),
which solves the same problem this branch does but with a different mechanism
(tungstenite write-buffer cap + an RPC_DISCONNECT_SLOW_CONSUMERS config flag +
protocol-labelled metrics).

Per request, this branch's solution is kept for the overlap:
- ws.rs, sse.rs, metrics_middleware.rs: kept this branch's versions
  (per-write WS_SEND_TIMEOUT; SSE producer task + bounded channel; the
  sse_slow_consumer_disconnects / sse_connection_timeouts counters).
- api.rs, config/rpc.rs, rest.rs: reverted #3769's StreamingConfig scaffolding,
  since this solution is always-on and does not read it (a config flag that
  silently did nothing would be a footgun).
- Cargo.toml: kept the version bump to 0.11.0; dropped #3769's now-unused
  direct `tungstenite` dependency (it remains transitively via axum).
- network/wormhole.rs: kept #3769's `dead_code` allow (orthogonal CI fix).

All other incoming main changes (CI workflow bumps, fortuna/quorum/etc.) are
taken as-is. Verified: cargo check + clippy clean, 33/33 tests pass.
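
The "producer task + bounded channel" idea mentioned in this commit message can be sketched with the standard library's synchronous channel. This is a simplified stand-in, not the branch's code, which presumably uses async tasks. `run_producer`, `bounded_channel`, and `ProducerExit` are hypothetical names, and using a non-blocking `try_send` as the disconnect trigger is an assumption.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Why the hypothetical producer loop stopped.
#[derive(Debug, PartialEq)]
pub enum ProducerExit {
    /// Every update was queued for the client.
    SourceExhausted,
    /// The bounded queue was full: the client is not keeping up.
    SlowConsumer,
    /// The receiving side was dropped: the client went away.
    ClientGone,
}

/// Creates the bounded queue between the producer and one client.
pub fn bounded_channel(capacity: usize) -> (SyncSender<String>, Receiver<String>) {
    sync_channel(capacity)
}

/// Pushes updates without ever blocking. A full queue ends the loop so the
/// caller can close the connection instead of buffering without limit.
pub fn run_producer<I>(updates: I, tx: &SyncSender<String>) -> ProducerExit
where
    I: IntoIterator<Item = String>,
{
    for update in updates {
        match tx.try_send(update) {
            Ok(()) => {}
            Err(TrySendError::Full(_)) => return ProducerExit::SlowConsumer,
            Err(TrySendError::Disconnected(_)) => return ProducerExit::ClientGone,
        }
    }
    ProducerExit::SourceExhausted
}
```

With a capacity of 2 and a client that never reads, the third update returns `ProducerExit::SlowConsumer`. Memory per connection is bounded by the channel capacity rather than by how far the client lags.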