tcp conn pool: fix wrong connection pool deletion #37944

zmiklank · 2025-01-09T09:46:38Z

Commit Message: Fixes an issue where an active TCP conn pool is incorrectly deleted. After an empty TCP conn pool is added to the deferred deletion list, it is removed from the pool map. If a new conn pool for the same host with the same hash_key is then created before the deferred deletion list is cleared, the new pool is incorrectly deleted, because tcpConnPoolIsIdle erases the pool based on its hash_key. See the detailed description in #37679.
Additional Description: N/A
Risk Level: Low
Testing: unit
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Fixes #37679

fixes envoyproxy#37679
Co-authored-by: Marko Lukša <[email protected]>
Signed-off-by: Zuzana Miklankova <[email protected]>
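The sequence in the commit message can be modeled outside of Envoy. The sketch below is a simplified illustration, not Envoy's code: `Pool`, `PoolContainer`, `onIdle`, and `newPoolSurvives` are hypothetical names, and the identity check is only one possible guard, which may differ from the change in this PR.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a TCP connection pool; not Envoy's real class.
struct Pool {};

// Simplified model of a per-host pool map plus a deferred deletion list.
struct PoolContainer {
  std::map<std::string, std::unique_ptr<Pool>> pools; // keyed by hash_key
  std::vector<std::unique_ptr<Pool>> deferred;        // destroyed later

  Pool* create(const std::string& key) {
    auto result = pools.insert_or_assign(key, std::make_unique<Pool>());
    return result.first->second.get();
  }

  // Idle callback. With check_identity == false it removes whatever pool is
  // currently stored under `key`, which models the behavior described above.
  void onIdle(const std::string& key, Pool* idle_pool, bool check_identity) {
    auto it = pools.find(key);
    if (it == pools.end()) {
      return;
    }
    if (check_identity && it->second.get() != idle_pool) {
      return; // A different pool now owns this key; leave it alone.
    }
    deferred.push_back(std::move(it->second));
    pools.erase(it);
  }
};

// Replays the sequence and reports whether the newly created pool is still
// in the map afterwards.
bool newPoolSurvives(bool check_identity) {
  PoolContainer c;
  Pool* old_pool = c.create("key");
  c.onIdle("key", old_pool, check_identity); // old pool moves to the deferred list
  Pool* new_pool = c.create("key");          // same hash_key, new pool
  c.onIdle("key", old_pool, check_identity); // late idle event for the old pool
  auto it = c.pools.find("key");
  const bool survives = it != c.pools.end() && it->second.get() == new_pool;
  c.deferred.clear();                        // deferred deletion finally runs
  return survives;
}
```

In this model, `newPoolSurvives(false)` reports that the new pool was removed by the late idle event, while `newPoolSurvives(true)` reports that it stayed in the map.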

zuercher · 2025-01-09T18:53:09Z

As a drive by during triage: I think this PR warrants a unit test. Also, in the event this occurs, I think the pool_to_erase should still be passed to deferredDelete (but to do that you'd need a reference to the InstancePtr that can be moved).
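To illustrate why a movable owning pointer is needed here: a deferred deletion list takes ownership, so the caller has to hand over the owner of the pool, not a raw reference to it. The following is a minimal sketch under assumptions: `std::unique_ptr` stands in for InstancePtr, and `CountedPool`, `DeferredList`, and `eraseViaDeferredDelete` are hypothetical names that do not come from Envoy.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical pool type that counts destructions so the timing is observable.
struct CountedPool {
  explicit CountedPool(int& destroyed) : destroyed_(destroyed) {}
  ~CountedPool() { ++destroyed_; }
  int& destroyed_;
};

using PoolPtr = std::unique_ptr<CountedPool>; // stand-in for an owning InstancePtr

// Hypothetical deferred deletion list: owns objects until clear() is called.
struct DeferredList {
  std::vector<PoolPtr> items;
  void deferredDelete(PoolPtr&& p) { items.push_back(std::move(p)); }
  void clear() { items.clear(); }
};

// Moves the owning pointer out of the map before erasing the entry, so the
// pool outlives the erase and is destroyed only when the list is cleared.
// Returns the number of pools already destroyed right after the erase.
int eraseViaDeferredDelete(int& destroyed) {
  std::map<std::string, PoolPtr> pools;
  DeferredList deferred;
  pools.emplace("key", std::make_unique<CountedPool>(destroyed));

  auto it = pools.find("key");
  PoolPtr pool_to_erase = std::move(it->second); // needs the owner, not a raw reference
  pools.erase(it);                               // the entry now holds nullptr
  deferred.deferredDelete(std::move(pool_to_erase));

  const int destroyed_after_erase = destroyed;
  deferred.clear();                              // destruction happens here
  return destroyed_after_erase;
}
```

Called with a counter initialized to zero, the function returns the count observed right after the erase, and the counter is incremented only once the deferred list is cleared.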

zmiklank · 2025-01-10T10:04:10Z

@zuercher Thanks. I will work on that.

ggreenway

In addition to adding a test, please add comments describing what is happening (basically the same content as the analysis in the linked issue).

/wait

Signed-off-by: Zuzana Miklankova <[email protected]>

zmiklank · 2025-01-17T15:07:30Z

Hello, I added a test and an explanatory comment to the code. The comment also explains why, in my opinion, pool_to_erase should not be passed to deferredDelete in this case.
Is the comment sufficient, or should I make it more descriptive?

zmiklank · 2025-01-21T08:55:09Z

The tests that failed in CI seem to be unrelated to this PR.

ggreenway

Thanks for working on a test, but this test will pass with or without the functional change in this PR. Can you make the test so that it would fail without your change?

/wait

ggreenway · 2025-01-21T17:19:33Z

test/common/upstream/cluster_manager_impl_test.cc

@@ -7079,6 +7079,70 @@ TEST_P(ClusterManagerLifecycleTest, ConnPoolsIdleDeleted) {
  }
 }

+TEST_P(ClusterManagerLifecycleTest, ConnPoolsCorrectDeleted) {
+  TestScopedRuntime scoped_runtime;


This isn't used; delete

zmiklank · 2025-01-22T09:45:20Z

Thank you for looking at the test @ggreenway. The problem is that we cannot easily
reproduce the issue because it happens only in specific conditions and it is a
race condition. Even in the environment where it can occur, it happens
"randomly". That is why the test I provided is only checking that the analogous
flow works properly after applying the patch.

The conditions for the issue are described in #37679 in detail. To trigger the
problem, we need to create a situation where deferred deletion happens after
the idle pool is already removed by a Local/RemoteClose event, but at the same
time, a new connection pool with the same hash_key already exists. The test I
wrote tries to create this situation, in a less granular way than seems to be
needed.

Because this issue is a race condition, timing is important to create a test
that would fail with an unpatched Envoy. I looked at other race condition
tests. One of them tries to reproduce the race organically. Should I use the
same approach here, or should I try to adjust the test so it reproduces the
race on the first run (if this is possible)? I would appreciate any hint on how
to approach such a test.

ggreenway · 2025-01-22T18:11:23Z

In the unit test you wrote, there are no other threads, and all actions happen in a deterministic order, so you should be able to write the test so that it will fail without your fix. I think you just need to make all the events that you wrote in your diagnosis of the issue happen in the described order.

zmiklank · 2025-01-28T15:48:31Z

Thanks.
I think this PR might be a duplicate of #30807.

tcp conn pool: fix wrong connection pool deletion

d585d83

fixes envoyproxy#37679
Co-authored-by: Marko Lukša <[email protected]>
Signed-off-by: Zuzana Miklankova <[email protected]>

zuercher requested a review from ggreenway January 9, 2025 18:53

zuercher assigned ggreenway Jan 9, 2025

ggreenway requested changes Jan 13, 2025

repokitteh-read-only bot added the waiting label Jan 13, 2025

zmiklank added 2 commits January 17, 2025 09:17

tcp conn pool: add explanatory comment

2d151cb

Signed-off-by: Zuzana Miklankova <[email protected]>

tcp conn pool test: add test to check that correct tcp pool is destroyed

da05fa3

Signed-off-by: Zuzana Miklankova <[email protected]>

repokitteh-read-only bot removed the waiting label Jan 17, 2025

ggreenway requested changes Jan 21, 2025

repokitteh-read-only bot added the waiting label Jan 21, 2025
