fix(hermes): disconnect slow WS/SSE consumers to prevent OOM#3769
Conversation
Cap WebSocket write buffers and close lagging streaming clients behind RPC_DISCONNECT_SLOW_CONSUMERS so slow consumers cannot grow unbounded in-process queues and hold long-lived connections. Co-authored-by: Cursor <cursoragent@cursor.com>
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
Minor release for slow-consumer disconnect and streaming backpressure changes. Co-authored-by: Cursor <cursoragent@cursor.com>
Skip the 24h timeout SSE event after slow-consumer disconnect, allow setting disconnect_slow_consumers=false via CLI, and suppress dead_code warnings in generated wormhole protobuf code. Co-authored-by: Cursor <cursoragent@cursor.com>
…tection is_write_buffer_full downcast WsError from tokio-tungstenite 0.26, but axum 0.6 wraps tungstenite 0.20 errors, so the check never matched. Use tungstenite 0.20.1 directly and add a test for the axum::Error wrapping path. Co-authored-by: Cursor <cursoragent@cursor.com>
There was a problem hiding this comment.
🚩 WS broadcast channel Lagged errors are not handled as slow consumer disconnects
For WebSocket, slow consumer detection relies on tungstenite::Error::WriteBufferFull (the send-side buffer to the client filling up), not on tokio::sync::broadcast::RecvError::Lagged (the server-side broadcast channel falling behind). At ws.rs:400-404, a Lagged error from self.notify_receiver.recv() is converted to anyhow!("Failed to receive update from store: {:?}", e), which won't match is_write_buffer_full. This means WS clients that lag on the broadcast channel are disconnected silently without the slow consumer metric being recorded. This is an asymmetry with the SSE handler which explicitly tracks sse_broadcast_lagged for the same condition. Consider whether WS should also record this metric for observability parity.
(Refers to lines 400-404)
Was this helpful? React with 👍 or 👎 to provide feedback.
main already had #3769 ("disconnect slow WS/SSE consumers to prevent OOM"), which solves the same problem this branch does but with a different mechanism (tungstenite write-buffer cap + an RPC_DISCONNECT_SLOW_CONSUMERS config flag + protocol-labelled metrics). Per request, this branch's solution is kept for the overlap: - ws.rs, sse.rs, metrics_middleware.rs: kept this branch's versions (per-write WS_SEND_TIMEOUT; SSE producer task + bounded channel; the sse_slow_consumer_disconnects / sse_connection_timeouts counters). - api.rs, config/rpc.rs, rest.rs: reverted #3769's StreamingConfig scaffolding, since this solution is always-on and does not read it (a config flag that silently did nothing would be a footgun). - Cargo.toml: kept the version bump to 0.11.0; dropped #3769's now-unused direct `tungstenite` dependency (it remains transitively via axum). - network/wormhole.rs: kept #3769's `dead_code` allow (orthogonal CI fix). All other incoming main changes (CI workflow bumps, fortuna/quorum/etc.) are taken as-is. Verified: cargo check + clippy clean, 33/33 tests pass.
Summary
RPC_DISCONNECT_SLOW_CONSUMERS(default: true) andRPC_WS_MAX_WRITE_BUFFER_BYTES(default: 2 MiB).WriteBufferFullinstead of allowing tungstenite's unlimited outbound buffer to grow under TCP backpressure.Test plan
cargo testinapps/hermes/server(38 tests pass)stream_slow_consumer_disconnects_total{protocol="ws"}incrementsSlow consumer: disconnectedandstream_slow_consumer_disconnects_total{protocol="sse"}incrementsRPC_DISCONNECT_SLOW_CONSUMERS=falseand confirm legacy behavior (WS unlimited buffer, SSE continues after lag errors)stream_active_connections,stream_slow_consumer_disconnects_total, and pod memory after deployMade with Cursor