
Fix hanging requests with filtered steal #3016

Open · Razz4780 wants to merge 36 commits into main from michals/mbe-649-filtered-steal-hangs-on-playground

Conversation

@Razz4780 (Contributor) commented on Jan 13, 2025

So...

This started as a small refactor with some hope of fixing the hanging requests issue. I could not find any bug that could cause the problem, and only later did I find out that there was no problem. The repro application I was using handled only one connection at a time, and the first HTTP connection was not closed by the k8s proxy (most probably so it could be reused later). So the second request would hang on intproxy's HTTP handshake attempt. Since we want to be user friendly, this PR introduces reuse of local HTTP connections, which solves the problem. However, since it started as a refactor, it's big. Sorry.
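To make this concrete, a repro server of roughly this shape triggers the hang (a minimal sketch, not the actual app I used):

```rust
use tokio::io::{AsyncReadExt, AsyncWriteExt};
use tokio::net::TcpListener;

// Sketch of a server that handles only one connection at a time. If the first
// client keeps its connection open (e.g. the k8s proxy keeping it for reuse),
// the second client's connection is never accepted, so its HTTP handshake hangs.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:8080").await?;
    loop {
        let (mut stream, _) = listener.accept().await?;
        let mut buf = [0_u8; 4096];
        // Serve this connection until the peer closes it; only then accept the next one.
        while stream.read(&mut buf).await? > 0 {
            stream
                .write_all(b"HTTP/1.1 200 OK\r\ncontent-length: 2\r\n\r\nok")
                .await?;
        }
    }
}
```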

Changes summarized:

  1. StreamingBody was moved from mirrord-protocol to mirrord-intproxy without any notable changes. There was no need for it to be in the protocol crate.
  2. The BodyExt trait in mirrord-protocol was renamed to BatchedBody. The only notable change is moving from a custom Future implementation (FramesFut) to using now_or_never (see the sketch after this list). I was afraid of it in the past, now I'm not. I tested this under heavy load and did not detect any difference. Using now_or_never simplifies things, because some code no longer needs to be async.
  3. All requests in the intproxy are now of the HttpRequest<StreamingBody> type, to remove ugly generics and match expressions. The HttpRequestFallback enum, along with lots of conversion code, was removed from mirrord-protocol.
  4. The HttpResponseFallback type was moved to the agent without any notable changes. There was no need for it to be in the protocol crate.
  5. ReversePortForwarder and its tests were fixed. It was never streaming responses' bodies, because IncomingProxy was not notified about the agent protocol version. This change is not related to the issue, but the problem came up in the CI.
  6. Removed the h2::Error::is_reset check and the dependency on h2 completely. Instead of checking whether the HTTP error is transient, we check whether it's not transient (using hyper::Error methods, e.g. hyper::Error::is_user). I think it's simpler and safer, since retrying a request is not harmful.
  7. Added a simple BoundTcpSocket struct to the incoming proxy, which wraps the logic for binding the same interface as the user socket. Now we can actually see the bound socket address in tracing.
  8. Added a ClientStore struct that caches unused local HTTP connections and cleans them up after some timeout.
  9. HTTP requests stolen with a filter are now handled completely independently on the local side. Since HTTP is stateless, this is fine. Each HTTP request has its own dedicated HttpGatewayTask inside IncomingProxy. To reuse connections, the tasks share a ClientStore instance.
  10. Improved how connections stolen/mirrored in whole are handled in the IncomingProxy. Each connection is handled by its own TcpProxyTask. The task knows whether the connection is stolen or mirrored. If it's mirrored, the data is no longer sent to the main IncomingProxy task; it is immediately discarded. If it's stolen, the connection is no longer artificially kept alive until silent for a second (this mechanism makes sense only in mirror mode and can introduce weird behavior in steal mode).
  11. The Interceptor task was removed completely; now we have two separate tasks: HttpGatewayTask and TcpProxyTask.
  12. MetadataStore was moved to its own module without any notable changes.
  13. Added a unit test that verifies connection reuse (the original issue).
  14. IncomingProxy now optimizes the HTTP response variant. If the whole response body is available when the response head is received, we no longer send the chunked response variant. Instead we respond with the framed variant. This allows us to use only one mirrord_protocol message.
  15. IncomingProxy now does subscription checks when receiving a new connection/request. If we receive a connection/request on a remote port to which we no longer subscribe, we unsubscribe immediately, without attempting to connect to the user application.
  16. Improved tracing around IncomingProxy, e.g. added time spent on polling response frames.
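
A quick illustration of the now_or_never behavior mentioned in point 2 (a minimal sketch using the futures crate, not code from this PR):

```rust
use futures::FutureExt;

// `now_or_never` polls a future exactly once: it returns `Some(output)` if the
// future is already complete, and `None` if it is still pending. This is what
// lets `BatchedBody` collect all frames that are ready *right now* without the
// calling code having to be async.
fn main() {
    let ready = async { 42 };
    assert_eq!(ready.now_or_never(), Some(42));

    let pending = futures::future::pending::<i32>();
    assert_eq!(pending.now_or_never(), None);
}
```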

@Razz4780 force-pushed the michals/mbe-649-filtered-steal-hangs-on-playground branch from edb6727 to 22ce712 on January 13, 2025 at 11:04
@Razz4780 marked this pull request as ready for review on January 13, 2025 at 11:06
@Razz4780 changed the title from "Fix hanging requests with filtered steal" to "Fix hanging requests with filtered steal (WIP)" on January 13, 2025
@Razz4780 force-pushed the michals/mbe-649-filtered-steal-hangs-on-playground branch from 0c90fbf to 7508902 on January 16, 2025 at 21:32
@Razz4780 changed the title from "Fix hanging requests with filtered steal (WIP)" to "Fix hanging requests with filtered steal" on January 16, 2025
@meowjesty self-assigned this on Jan 17, 2025
@Razz4780 requested a review from meowjesty on January 17, 2025 at 17:00
@meowjesty (Member) left a comment:

Mostly some doc requests while I'm still going through stuff.

@@ -0,0 +1 @@
+Fixed an issue where HTTP requests stolen with a filter would hang with a single-threaded local HTTP server.
meowjesty (Member):

Seems like an understatement. I know that this is what users would care about, but I think this PR warrants at least adding an "Also improved internal handling of incoming connections", or something like that.

Razz4780 (Contributor, author):

Done

match self {
    HttpResponseFallback::Framed(req) => req
        .internal_response
        .map_body(|body| body.map_err(|_| unreachable!()).boxed())
meowjesty (Member):

Why were we doing these map_err(unreachable)? We should just unwrap if it's a panic anyway.

Razz4780 (Contributor, author):

Because we need a specific body type. Some requests are passed to the clients, and the response body implements Body<Data = Bytes, Error = Infallible>. Other requests are passed to the original destination, and the response body implements Body<Data = Bytes, Error = hyper::Error>. With this trick we unify the two (the function passed to Body::map_err is of type Fn(Infallible) -> hyper::Error).
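
A minimal sketch of the trick as a free function (hypothetical helper, using http_body_util; in the PR this happens inline):

```rust
use std::convert::Infallible;

use http_body::Body;
use http_body_util::{combinators::BoxBody, BodyExt};
use hyper::body::Bytes;

// A body that can never fail is given `hyper::Error` as its error type, so it
// matches the bodies coming from the original destination. The closure passed
// to `map_err` can never actually run, because `Infallible` has no values.
fn unify_error_type<B>(body: B) -> BoxBody<Bytes, hyper::Error>
where
    B: Body<Data = Bytes, Error = Infallible> + Send + Sync + 'static,
{
    body.map_err(|infallible| match infallible {}).boxed()
}
```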

@@ -40,7 +39,7 @@ pub(crate) struct TcpStealerApi {
     /// View on the stealer task's status.
     task_status: TaskStatus,

-    response_body_txs: HashMap<(ConnectionId, RequestId), Sender<hyper::Result<Frame<Bytes>>>>,
+    response_body_txs: HashMap<(ConnectionId, RequestId), ResponseBodyTx>,
meowjesty (Member):

Suggested change:
-    response_body_txs: HashMap<(ConnectionId, RequestId), ResponseBodyTx>,
+    /// Keeps track of the TCP stealer connections, associating them with the [`RequestId`] of
+    /// the HTTP response. The `RequestId` is the id of the [`MatchedHttpRequest`], so we create
+    /// an association between a request and its response.
+    response_body_txs: HashMap<(ConnectionId, RequestId), ResponseBodyTx>,

This one is a mess, my suggestion is confusing even to me, but I have faith you can build on top of it and make this the best doc ever.

pub(crate) use self::reversible_stream::ReversibleStream;
pub(crate) use filter::HttpFilter;
pub(crate) use response_fallback::{HttpResponseFallback, ReceiverStreamBody};
pub(crate) use reversible_stream::ReversibleStream;

/// Handy alias due to [`ReversibleStream`] being generic, avoiding value mismatches.
pub(crate) type DefaultReversibleStream = ReversibleStream<{ HttpVersion::MINIMAL_HEADER_SIZE }>;
meowjesty (Member):

Considering that this refactor is already changing a bunch of stuff around these areas, what do you think of us dropping most of this in favor of TcpStream::peek?

Razz4780 (Contributor, author):

Unfortunately, we can't use peek. It may return too few bytes for us to determine the connection type, and successive calls return the same data :<
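
To illustrate (a hypothetical helper, not this PR's code): a fixed-size prefix can be gathered reliably with read_exact, but that consumes the bytes, so the stream then has to be wrapped in something that replays the prefix first, which is the role of ReversibleStream:

```rust
use tokio::io::AsyncReadExt;
use tokio::net::TcpStream;

// `TcpStream::peek` may return fewer than `N` bytes, and calling it again just
// yields the same data, so it cannot reliably gather a fixed-size prefix.
// `read_exact` waits for all `N` bytes, but removes them from the stream, so
// the prefix must later be replayed before the remaining data.
async fn read_prefix<const N: usize>(stream: &mut TcpStream) -> std::io::Result<[u8; N]> {
    let mut prefix = [0_u8; N];
    stream.read_exact(&mut prefix).await?;
    Ok(prefix)
}
```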

}
}

pub struct Frames<D> {
meowjesty (Member):

Please add some docs, and while at it, either put in the docs what D stands for, or change the parameter to Data or something byte related.

Razz4780 (Contributor, author):

Done

Comment on lines 63 to 80
    loop {
        match self.frame().now_or_never() {
            None => {
                frames.is_last = false;
                break;
            }
            Some(None) => {
                frames.is_last = true;
                break;
            }
            Some(Some(result)) => {
                frames.frames.push(result?);
            }
        }
    }

    Ok(frames)
}
meowjesty (Member):

Considering this is pretty much the same as the other method, maybe create a helper function that takes the Frames we're pushing?

Razz4780 (Contributor, author):

Done
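
Such a deduplicated helper might look roughly like this (a sketch, not the PR's actual code):

```rust
use futures::FutureExt;
use http_body::{Body, Frame};
use http_body_util::BodyExt;

// Drains every frame that is ready *right now* into `frames`. Returns
// `Ok(true)` if the body has ended, `Ok(false)` if more frames may arrive.
fn drain_ready_frames<B>(
    body: &mut B,
    frames: &mut Vec<Frame<B::Data>>,
) -> Result<bool, B::Error>
where
    B: Body + Unpin,
{
    loop {
        match body.frame().now_or_never() {
            // The next frame is not ready yet; the body may produce more.
            None => return Ok(false),
            // The body yielded `None`: it is finished.
            Some(None) => return Ok(true),
            // A frame was ready: store it and keep polling.
            Some(Some(result)) => frames.push(result?),
        }
    }
}
```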

@meowjesty (Member) left a comment:

Moar stuff

// construct IncomingMode from config file
let incoming = background_tasks.register(IncomingProxy::default(), (), 512);

agent_connection
meowjesty (Member):

Considering how finicky this is (and how easy it is to forget to do this when starting an agent connection), I wonder if we shouldn't refactor how we do the version negotiation. Like, you only get an AgentConnection to send other messages after this whole thing has been done.

Maybe not for this PR, just some thoughts for the future.
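
For example, a typestate-style API could make the handshake impossible to skip (all names here are hypothetical, not mirrord's actual types; assumes the semver crate):

```rust
// The only way to obtain a connection that can send arbitrary messages is to
// first complete the protocol version negotiation.
pub struct PendingAgentConnection;

pub struct NegotiatedAgentConnection {
    pub protocol_version: semver::Version,
}

impl PendingAgentConnection {
    /// Consumes the raw connection, so forgetting the handshake is a compile
    /// error rather than a runtime bug.
    pub async fn negotiate_version(self) -> NegotiatedAgentConnection {
        // ... exchange version negotiation messages with the agent here ...
        unimplemented!("sketch only")
    }
}
```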

@@ -609,21 +580,18 @@ impl ReversePortForwarder {
                 )
             }
         },
-        (MainTaskId::IncomingProxy, TaskUpdate::Finished(result)) => match result {
+        TaskUpdate::Finished(result) => match result {
             Ok(()) => {
meowjesty (Member):

This has major "The operation completed successfully" Windows error vibes.

How do we get here?

Razz4780 (Contributor, author):

We should never get here, unless there's a bug in the IncomingProxy. I'm up for doing unreachable! here, if you think it's better.

) {
    loop {
        let Some(message) = rx.recv().await else {
            break;
meowjesty (Member):

Instead of these breaks, can't we just unwrap?

This one might require the break on None if I'm understanding it correctly, but the others we could just unwrap instead of is_err, no?

Razz4780 (Contributor, author):

Done, unwrapping on send errors.

/// Agent supports only
/// [`LayerTcpSteal::HttpResponse`](mirrord_protocol::tcp::LayerTcpSteal::HttpResponse)
#[default]
Basic,
meowjesty (Member):

This is basically the legacy version, right? Do we know if there are mirrord-agents in the wild still using this? Considering that HTTP_FRAMED_VERSION has been a thing since the end of 2023, I feel like we could drop this.

Razz4780 (Contributor, author):

I don't see a reason to drop this now. Like, this introduces ~100 lines of code in the whole project maybe 🤷‍♂️

.into_data()
.map(|bytes| Self::Data(bytes.into()))
.or_else(|frame| frame.into_trailers().map(Self::Trailers))
.expect("malformed frame type")
meowjesty (Member):

Dropping if/else AND fixing a typo! This is like a hidden sweet you put here just for me.

let now = Instant::now();
let mut min_last_used = None;
let notified = {
    let mut guard = clients.lock().unwrap();
meowjesty (Member):

Maybe expect here with a message that's more user friendly than the usual Rust unwrap?

Razz4780 (Contributor, author):

Done

std::mem::drop(store);

loop {
    let Some(clients) = clients.upgrade() else {
meowjesty (Member):

Suggested change:
-    let Some(clients) = clients.upgrade() else {
+    // A failed `upgrade` here means we have no more clients, so we can stop this task.
+    let Some(clients) = clients.upgrade() else {

Razz4780 (Contributor, author):

Added comments

let mut guard = clients.lock().unwrap();
let notified = notify.notified();
guard.retain(|client| {
    if client.last_used + idle_client_timeout > now {
meowjesty (Member):

Suggested change:
-    if client.last_used + idle_client_timeout > now {
+    // We keep only the clients that have not gone beyond their `idle_client_timeout`, and update their `last_used` time.
+    if client.last_used + idle_client_timeout > now {

Razz4780 (Contributor, author):

Added comments
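
For context, the cleanup task pattern from the snippets above, put together (a simplified sketch; only the identifiers quoted in this review are real, the rest is assumption):

```rust
use std::sync::{Arc, Mutex, Weak};
use std::time::{Duration, Instant};

use tokio::sync::Notify;

// Simplified stand-in for the cached connections held by the real `ClientStore`.
struct IdleClient {
    last_used: Instant,
}

async fn cleanup_task(
    clients: Weak<Mutex<Vec<IdleClient>>>,
    notify: Arc<Notify>,
    idle_client_timeout: Duration,
) {
    loop {
        // A failed `upgrade` here means we have no more clients, so we can stop this task.
        let Some(clients) = clients.upgrade() else {
            return;
        };

        let now = Instant::now();
        {
            let mut guard = clients
                .lock()
                .expect("ClientStore mutex is poisoned, this is a bug");
            // Keep only the clients that have not gone beyond their `idle_client_timeout`.
            guard.retain(|client| client.last_used + idle_client_timeout > now);
        }
        // Drop the strong reference before sleeping, so the store can be freed.
        drop(clients);

        // Wake up either when notified about new idle clients or after the timeout.
        tokio::select! {
            _ = notify.notified() => {}
            _ = tokio::time::sleep(idle_client_timeout) => {}
        }
    }
}
```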

}
}

pub struct Closed<T: BackgroundTask>(Sender<T::MessageOut>);
meowjesty (Member):

Docs.

Should probably add why we're using this, and how it helps.

meowjesty (Member):

Do we need this stuff to be pub?

pub struct Closed<T: BackgroundTask>(Sender<T::MessageOut>);

impl<T: BackgroundTask> Closed<T> {
    pub async fn cancel_on_close<F: Future>(&self, future: F) -> Option<F::Output> {
meowjesty (Member):

Docs

meowjesty (Member):

Do we need this to be pub?
