DRIVERS-2884 Avoid connection churn when operations timeout #1675

Open: wants to merge 33 commits into base: master

Member

@prestonvasquez prestonvasquez commented Oct 14, 2024

This PR implements the design for connection pooling improvements described in DRIVERS-2884, based on the CSOT (Client-Side Operation Timeout) spec. It addresses connection churn caused by network timeouts during operations, especially in environments with low client-side timeouts and high latency.

When a connection is checked out after a network timeout, the driver now attempts to resume and complete reading any pending server response (instead of closing and discarding the connection). This may require multiple checkouts.
Each pending response read is subject to a cumulative 3-second static timeout. The timeout is refreshed after each successful read, acknowledging that progress is being made. If no data is read and the timeout is exceeded, the connection is closed.

To reduce unnecessary latency, if the timeout has expired while the connection was idle in the pool, a non-blocking single-byte read is performed; if no data is available, the connection is closed immediately.
This update introduces new CMAP events and logging messages (PendingResponseStarted, PendingResponseSucceeded, PendingResponseFailed) to improve observability of this path.
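
As a rough illustration of the flow described above (not the spec's pseudocode), the check-out path could be sketched in Python as follows. The `PendingResponse` class, the three string results, and the `checkout_timeout` parameter are hypothetical names chosen for this sketch; it also assumes a plain TCP socket rather than a TLS-wrapped one.

```python
import socket
import time
from dataclasses import dataclass

PENDING_RESPONSE_TIMEOUT = 3.0  # seconds; the static timeout from the description


@dataclass
class PendingResponse:
    remaining_bytes: int   # bytes of the server reply that are still unread
    last_progress: float   # monotonic time of the timeout or of the last successful read


def await_pending_response(sock, pending, checkout_timeout, clock=time.monotonic):
    """Return "drained", "pending" (retry on a later checkout), or "close"."""
    # The timeout expired while the connection sat idle in the pool:
    # do a non-blocking single-byte read and close if nothing is there.
    if clock() - pending.last_progress >= PENDING_RESPONSE_TIMEOUT:
        sock.setblocking(False)
        try:
            probe = sock.recv(1)
        except BlockingIOError:
            return "close"
        if not probe:  # peer closed the socket
            return "close"
        pending.remaining_bytes -= 1
        pending.last_progress = clock()

    checkout_deadline = clock() + checkout_timeout
    while pending.remaining_bytes > 0:
        # Wait no longer than this checkout allows, and no longer than
        # 3 seconds past the last moment progress was made.
        stall_deadline = pending.last_progress + PENDING_RESPONSE_TIMEOUT
        wait = min(checkout_deadline, stall_deadline) - clock()
        chunk = None
        if wait > 0:
            sock.settimeout(wait)
            try:
                chunk = sock.recv(min(pending.remaining_bytes, 4096))
            except socket.timeout:
                pass
        if chunk is None:
            stalled = clock() - pending.last_progress >= PENDING_RESPONSE_TIMEOUT
            return "close" if stalled else "pending"
        if not chunk:  # peer closed the socket
            return "close"
        pending.remaining_bytes -= len(chunk)
        pending.last_progress = clock()  # progress was made, so refresh the timeout
    return "drained"
```

A "pending" result means the checkout fails but the connection stays in the pool, which is why a single response may take multiple checkouts to drain.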

# after maxTimeMS, whereas mongod returns it after
# max(blockTimeMS, maxTimeMS). Until this ticket is resolved, these tests
# will not pass on sharded clusters.
topologies: ["standalone", "replicaset"]
Member

standalone -> single

- name: findOne
object: *collection
arguments:
timeoutMS: 50
Member

In Python this timeout is too small and causes this find to fail before sending anything to the server. The same problem exists in the other tests too. Perhaps all of these tests should run a setup command (e.g., ping) to ensure a connection is created and available in the pool, then run the finds. What do you think?

@prestonvasquez prestonvasquez marked this pull request as ready for review April 25, 2025 21:36
@prestonvasquez prestonvasquez requested a review from a team as a code owner April 25, 2025 21:36
@prestonvasquez prestonvasquez requested review from a team as code owners April 25, 2025 21:36
@prestonvasquez prestonvasquez requested review from alcaeus, stIncMale, baileympearson and ShaneHarvey and removed request for a team April 25, 2025 21:36
codeowners-service-app bot commented Apr 25, 2025

Assigned qingyang-hu for team dbx-spec-owners-csot because ShaneHarvey is out of office.

@prestonvasquez prestonvasquez removed the request for review from qingyang-hu April 25, 2025 22:08
Member

@alcaeus alcaeus left a comment

I'd like to wait until #1792 is merged to review the schema changes. From what I can see in the UTF specification, those changes look good.

@@ -3555,6 +3579,8 @@ other specs *and* collating spec changes developed in parallel or during the sam

## Changelog

- 2025-04-25: **Schema version 1.24**
Member

Can you please add what was changed here?

@@ -576,7 +576,106 @@ other threads from checking out [Connections](#connection) while establishing a
Before a given [Connection](#connection) is returned from checkOut, it must be marked as "in use", and the pool's
Contributor

I cannot comment on this in the diff. Line 236 in this file outlines a few states that a connection can be in. I think it makes sense to add the pending read state to this field.

Related to this: "pending" is already used to indicate a connection that has been created but not established. Can we choose a clear name for this new state? "Pending read" seems like a clearer name.

Contributor

Another comment out-of-diff: should we allow drivers with background threads to close timed-out pending connections with the background thread? Or attempt the non-blocking read? This is a micro-optimization, but in the scenario where a connection might have timed out and needs to be closed, the background thread could close the connection instead of a checkout request.

Member Author

@prestonvasquez prestonvasquez May 2, 2025

> Can we choose a clear name for this new state?

Absolutely, great point! "pending response" seems apt.

> Should we allow drivers with background threads to close timed-out pending connections with the background thread?

IIUC this would require polling connections and then performing aliveness checks. This would not be more performant than awaiting a pending read on check-out.

> Or attempt the non-blocking read?

From the design document:

The major flaw in this approach is that when an application runs two operations consecutively, it’s possible for 2 connections to be created in the same pool. For example:

from pymongo import MongoClient
from pymongo.errors import PyMongoError

# Connect directly so we can assume only 1 pool.
client = MongoClient(directConnection=True, timeoutMS=500)
try:
    client.t.t.insert_one({})
except PyMongoError as exc:
    if exc.timeout:
        print(f'Operation timed out: {exc}')
    else:
        raise
# With background reads, this operation could require a 2nd connection.
client.t.t.insert_one({})

In the above code, there should only be one connection in the pool at any given time because the operations are not run concurrently. A foreground reading approach guarantees this constraint. However, using background reads could result in two connections being opened. It is unacceptable for this code to open two pooled connections in this case because connections are expensive and a limited resource. Worse, the extra connection(s) will remain open forever by default as maxIdleTimeMS defaults to unlimited. Imagine a customer with 1000 such app servers: after this change, they could end up using 1000 extra connections. At best this decreases performance and at worst it can cause a connection storm.

Another consideration is that even if the background approach is implemented, the foreground solution still needs to be implemented as well. So the background read approach adds additional implementation complexity.


#### Connection Aliveness Check Fails

1. Initialize a mock TCP listener to simulate the server-side behavior. The listener should write at least 5 bytes to
Contributor

Thoughts on adding a mock server to drivers-evergreen-tools for these tests? I could go either way - there are only two, so the burden on drivers isn't too great but it might be nice if drivers didn't need to worry about the mock server logic themselves.

Member Author

I’m concerned that this solution will require drivers to spin up a server when trying to test locally. I’ve suggested DRIVERS-3183 to support raw-TCP connection test entities which will allow us to convert these prose tests to a unified spec test in the future.
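
For a sense of scale, a per-driver mock listener can be fairly small. The following Python sketch is only an illustration and is not what the prose tests mandate: `start_mock_listener`, `chunks`, and `delay` are hypothetical names, and the listener writes raw bytes without parsing or even reading the client's request.

```python
import socket
import threading
import time


def start_mock_listener(chunks, delay):
    """Serve one client on an ephemeral localhost port.

    Each item of `chunks` is written to the client, with `delay` seconds
    between writes, to simulate a slow or stalled server response.
    Returns the (host, port) address and the serving thread.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    server.listen(1)
    address = server.getsockname()

    def serve():
        conn, _ = server.accept()
        # Close both sockets once every chunk has been written.
        with conn, server:
            for index, chunk in enumerate(chunks):
                if index:
                    time.sleep(delay)
                conn.sendall(chunk)

    thread = threading.Thread(target=serve, daemon=True)
    thread.start()
    return address, thread
```

A test could point a connection pool at the returned address and use a `delay` longer than the socket timeout to force the pending-response path.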

@@ -576,7 +576,106 @@ other threads from checking out [Connections](#connection) while establishing a
Before a given [Connection](#connection) is returned from checkOut, it must be marked as "in use", and the pool's
availableConnectionCount MUST be decremented.

```text
If an operation times out the socket while awaiting a server response and CSOT is enabled and `maxTimeMS` was added to
Contributor

Suggest: purely organizational, maybe a header would be nice here to help visually separate this section from the other checkIn information?

Suggested change
If an operation times out the socket while awaiting a server response and CSOT is enabled and `maxTimeMS` was added to
##### Awaiting Pending Read (CSOT-only)

The next time the connection is checked out, the driver MUST attempt to read and discard the remaining response from the
socket. The workflow for this is as follows:

- The connection MUST persist the current time recorded immediately after the original socket timeout, and this
Contributor

Not opposed to having the logic in the connection, but we do have precedent for other connection-state-related actions in checkIn (destroying the connection, etc.). Would recording the start time make sense at checkIn instead of after the timeout in the driver's connection abstraction?

Member Author

I have two thoughts on this. First, some drivers will likely add an object to the connection to maintain state for a pending response, for example:

type pendingResponseState struct {
	remainingBytes int32
	requestID      int32
	start          time.Time
}

type connection struct {
	// pendingResponseState contains information required to attempt a pending read
	// in the event of a socket timeout for an operation that has appended
	// maxTimeMS to the wire message.
	pendingResponseState   *pendingResponseState
}

It would be more pragmatic to update the current time where remainingBytes and requestID are assigned (which is when the socket times out).

Additionally, we want to start this “countdown” ASAP in case the connection is “dead”: in such cases the “aliveness check” will be a non-blocking failure while awaiting a pending response. Delaying when we set the current time reduces the likelihood (albeit small) that the full 3-second pending response timeout has been exceeded while the connection remains idle in the pool.

@@ -576,7 +576,106 @@ other threads from checking out [Connections](#connection) while establishing a
Before a given [Connection](#connection) is returned from checkOut, it must be marked as "in use", and the pool's
availableConnectionCount MUST be decremented.

```text
If an operation times out the socket while awaiting a server response and CSOT is enabled and `maxTimeMS` was added to
the command, the driver MUST mark the connection as "pending" and record the current time in a way that can be updated.
Contributor

So - I do understand this is the approach outlined in the design and that it is generally valuable to have example algorithms and pseudocode in the specification. However, I do think this approach makes assumptions about connection layer implementations and socket APIs that don't hold true for all drivers (at least in Node, they certainly don't). For example - Node's socket API is push-based and we collect chunks when they're available. Immediate reads do not make sense for our connection layer implementation and so the existing implementation of await_pending_response doesn't either.

I'm happy to work with you on the phrasing here but can we try to phrase these requirements in a way that outlines the requirements, and then outlines a particular implementation that satisfies the requirements? Ex: in Node, the socket API we use pushes chunks of data to us automatically (there is no read(n) method). We collect this into a buffer automatically when they're available. So when we implement these changes in Node, what we will likely do instead of this algorithm is:

  • on timeout, fail the current request and record the time of timeout but don't stop receiving data chunks from the socket.
  • set the socket's timeout to 3s (time out the socket if no chunks arrive in 3s)
  • in checkout, check if the pending connection has finished reading the response from the server. If it has, discard and continue. If not, calculate the wait time and wait. On success, proceed. On timeout, close the connection.

I think this approach still satisfies the goals of the spec changes but as the spec is currently written, our implementation would not be spec compliant. Thoughts?
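
To make the three bullets concrete, here is a rough sketch of that push-based bookkeeping. It is written in Python rather than Node to match the other example in this thread, and it is an illustration only: `PushBasedPendingResponse`, `on_data`, and `on_checkout` are hypothetical names, and times are passed in explicitly instead of being read from a clock.

```python
from dataclasses import dataclass

PENDING_RESPONSE_TIMEOUT = 3.0  # seconds without a chunk before giving up


@dataclass
class PushBasedPendingResponse:
    remaining_bytes: int   # bytes of the timed-out reply not yet received
    last_activity: float   # time of the operation timeout or of the latest chunk

    def on_data(self, chunk: bytes, now: float) -> None:
        """Called by the socket layer whenever a chunk is pushed to us."""
        self.remaining_bytes -= len(chunk)
        self.last_activity = now

    def on_checkout(self, now: float):
        """Return (action, seconds the checkout may still wait for data)."""
        if self.remaining_bytes <= 0:
            return ("ready", 0.0)   # response fully received; discard and proceed
        idle = now - self.last_activity
        if idle >= PENDING_RESPONSE_TIMEOUT:
            return ("close", 0.0)   # no chunk arrived within 3s; close the connection
        return ("wait", PENDING_RESPONSE_TIMEOUT - idle)
```

Under this sketch, checkout never issues a read itself; it only inspects state that the data callback keeps current.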

Member Author

IIUC in the Node case, bytes that have not been consumed will still sit on the socket after a timeout. So you would still need to read and discard any buffered data when checking out a connection that has been pinned to a pending response. The three bullet points you note do not conflict with the algorithm, AFAICT. I would be happy to troubleshoot this offline with you, though.

connectionId: int64;

/**
* The time it took to complete the pending read.
Contributor

So long as data is still coming back from the socket in intervals of <3s, it is possible for the same connection to require multiple checkout requests to fully exhaust. So - is this duration the total time it took to read all of the data off of the socket (now() - time of timeout) or the amount of time that the checkout request waited on the final pending read wait?

Contributor

(same comment for logging events)

Member Author

I would anticipate this duration to be within the context of ConnectionPendingResponseStarted, i.e. 1 call to await_pending_response.

*/
connectionId: int64;

/**
Contributor

Do you think it would be valuable to include a duration here as well, indicating how long the request waited for the pending read before failing?

Contributor

(same comment for logging events)

Member Author

I do think this is a good idea; this could highlight the case where you are trickling 1-byte aliveness checks but the response continues to time out while attempting to discard the input TCP buffer. Good call.

`ConnectionPendingResponseFailed` events.
3. Instantiate a connection pool using the mock listener’s address, ensuring readiness without error. Attach the event
monitor to observe the connection’s state.
4. Check out a connection from the pool and initiate a read operation with an appropriate socket timeout (e.g., 10ms)
Contributor

This comment is related to my other comment about a less socket-API-specific implementation. Is it possible to write these tests in a way that doesn't require explicit use of a read API? Node's connection layer doesn't expose a read method; we only expose command(), which performs write+read on the underlying socket.

Member Author

So Node only has access to a round-trip API? Even in the non-public API? Could you not just discard the write half in the mock listener?

7. Verify that one event for each `ConnectionPendingResponseStarted` and `ConnectionPendingResponseFailed` was emitted.
Also verify that the fields were correctly set for each event.

#### Connection Aliveness Check Succeeds
Contributor

Can we add a test that demonstrates that multiple aliveness checks might be required to fully read the response from the socket? I'm imagining the mock server emitting chunks of data every second for longer than 6s (2 * the static timeout). Each checkout should fail, but we'll continue to read

Member Author

It’s unclear to me what the goal of this would be. Are you just wanting to make sure the driver does not prematurely close the connection if the aliveness check succeeds? We could use event monitoring for that instead.


- description: "force a pending response read, fail first try, succeed second try"
operations:
- name: createEntities
Contributor

If possible, can we add a test that demonstrates that when the pending read checkout has no timeoutMS set, we use socket_timeout_ms (if it is <3s)?

Member Author

@prestonvasquez prestonvasquez May 2, 2025

Great catch! The Go Driver doesn’t support socket timeouts, which are technically a deprecated option. Perhaps @ShaneHarvey can opine. If we decide to add this test, would you mind implementing it, since the Go Driver has no way of verifying it?

4 participants