Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tcp conn pool: fix wrong connection pool deletion #37944

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

zmiklank
Copy link
Contributor

@zmiklank zmiklank commented Jan 9, 2025

Commit Message: Fixes an issue when active tcp conn pool is incorrectly deleted: After an empty tcp conn pool is added to the deferred deletion list the connection pool is removed from the pool map. Then, if new conn pool for the same host with the same hash_key is created before the deferred deletion is cleared it will be incorrectly deleted, because tcpConnPoolIsIdle erases the pool based on its hash_key. See detailed description in #37679.
Additional Description: N/A
Risk Level: Low
Testing: unit
Docs Changes: N/A
Release Notes: N/A
Platform Specific Features: N/A
Fixes #37679

fixes envoyproxy#37679

Co-authored-by: Marko Lukša <[email protected]>
Signed-off-by: Zuzana Miklankova <[email protected]>
@zuercher
Copy link
Member

zuercher commented Jan 9, 2025

As a drive by during triage: I think this PR warrants a unit test. Also, in the event this occurs, I think the pool_to_erase should still be passed to deferredDelete (but to do that you'd need a reference to the InstancePtr that can be moved).

@zmiklank
Copy link
Contributor Author

@zuercher Thanks. I will work on that.

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to adding a test, please add comments describing what is happening (basically the same content as the analysis in the linked issue).

/wait

@zmiklank
Copy link
Contributor Author

Hello, I added a test and explanatory comment to the code. This comment also explains why in my opinion pool_to_erase should not be passed to deferredDelete in this case.
Is the comment sufficient, or should I write it more descriptive?

@zmiklank
Copy link
Contributor Author

The tests that failed in CI seem to be unrelated to this PR.

Copy link
Contributor

@ggreenway ggreenway left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on a test, but this test will pass with or without the functional change in this PR. Can you make the test so that it would fail without your change?

/wait

@@ -7079,6 +7079,70 @@ TEST_P(ClusterManagerLifecycleTest, ConnPoolsIdleDeleted) {
}
}

TEST_P(ClusterManagerLifecycleTest, ConnPoolsCorrectDeleted) {
TestScopedRuntime scoped_runtime;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This isn't used; delete

@zmiklank
Copy link
Contributor Author

Thank you for looking at the test @ggreenway. The problem is that we cannot easily
reproduce the issue because it happens only in specific conditions and it is a
race condition. Even in the environment where it can occur, it happens
"randomly". That is why the test I provided is only checking that the analogous
flow works properly after applying the patch.

The conditions for the issue are described in #37944 in detail. To trigger the
problem, we need to create a situation where deferred deletion happens after
the idle pool is already removed by a Local/RemoteClose event, but at the same
time, a new connection pool with the same hash_key already exists. The test I
wrote tries to create this situation, in a less granular way that seems to be
needed.

Because this issue is a race condition, timing is important to create a test
that would fail with an unpatched Envoy. I looked at other race condition
tests. One of them tries to reproduce the race organically. Should I use the
same approach here, or should I try to adjust the test so it reproduces the
race on the first run (if this is possible)? I would appreciate any hint on how
to approach such a test.

@ggreenway
Copy link
Contributor

In the unit test you wrote, there are no other threads, and all actions happen in a deterministic order, so you should be able to write the test so that it will fail without your fix. I think you just need to make all the events that you wrote in your diagnosis of the issue happen in the described order.

@zmiklank
Copy link
Contributor Author

Thanks.
I think this PR might be duplicate of #30807.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ActiveTcpClient destructor deletes the wrong connection pool
3 participants