Avoid retrying on IO errors when it’s unclear if the server received the request. #192
Main Change:
Until now, on I/O errors, the client reconnected to the failed node and retried the request if any retries were left. This approach had a critical issue: we couldn’t reliably determine whether the server had already received the request before the connection broke. Retrying in such cases could result in duplicate command execution.
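Example: if the connection breaks after the server has already executed `INCR key`, a retry sends `INCR key` again and the key is incremented twice.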
This PR differentiates between errors where it’s safe to retry and those where it’s not. Specifically, with multiplexed connections, if the `send` function returns an error, the server is guaranteed never to have received the data, so retrying is safe (see https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send). For other errors, where we can’t be certain, retries are unsafe and are no longer attempted automatically. Instead, these errors are now returned to the user, who must retry manually if they determine it’s safe.
Test Changes:
Since I/O errors are now returned to the user, tests that previously killed the server now loop to retry the request, simulating the handling of I/O errors on the user side.
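To illustrate the retry rule from the Main Change section and the user-side loop the tests rely on, here is a minimal sketch. It is not the crate's actual code: `RequestError`, `is_safe_to_retry`, `enqueue`, and `retry_if_caller_allows` are hypothetical names, and `std::sync::mpsc` stands in for the tokio channel used by the multiplexed connection so the example has no dependencies.

```rust
use std::sync::mpsc;

/// Hypothetical classification of a failed request (not the crate's real types).
#[derive(Debug, PartialEq)]
enum RequestError {
    /// Handing the request to the connection task failed, so it never reached the socket.
    NotSent,
    /// The connection broke after the request may have reached the server.
    MaybeSent,
}

/// Only errors that prove the server never saw the request are retried automatically.
fn is_safe_to_retry(err: &RequestError) -> bool {
    matches!(err, RequestError::NotSent)
}

/// Queue a command for the connection task.
/// A channel send error hands the value back to the sender, so nothing was written.
fn enqueue(tx: &mpsc::Sender<String>, cmd: &str) -> Result<(), RequestError> {
    tx.send(cmd.to_string()).map_err(|_| RequestError::NotSent)
}

/// User-side handling: retry up to `max_attempts` times. Ambiguous errors are
/// retried only when the caller has decided that is safe (assumed flag).
fn retry_if_caller_allows<T>(
    max_attempts: usize,
    caller_allows_retry: bool,
    mut request: impl FnMut() -> Result<T, RequestError>,
) -> Result<T, RequestError> {
    let mut attempt = 1;
    loop {
        match request() {
            Ok(value) => return Ok(value),
            Err(e) if attempt < max_attempts
                && (is_safe_to_retry(&e) || caller_allows_retry) =>
            {
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
```

With `caller_allows_retry` set to `false`, a `MaybeSent` error is returned after the first attempt, which matches the new default described above; a caller issuing an idempotent command could pass `true`.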
Refresh Slots Change:
While testing this fix, I found that when all connections were unavailable and `refresh_slots` was called, it didn’t raise the expected `allConnectionsUnavailable` error. This has been fixed by updating the `random_connections` function to return an `Option`. Now, if no connections are found, `refresh_slots` raises the `allConnectionsUnavailable` error immediately. The state then shifts to reconnecting to the initial nodes, and slot refreshes are attempted using the new connections.
Out of Scope:
A future PR could introduce a new configuration option to enable retries on connection error, allowing users to control this behavior at the client level.
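Returning to the Refresh Slots change, the sketch below shows the general shape of having `random_connections` return an `Option` and mapping `None` to an error in `refresh_slots`. The signatures, the `ClusterError` type, and the connection map are assumptions for illustration only; the real function samples connections randomly, while this version simply takes the first entries.

```rust
use std::collections::HashMap;

/// Hypothetical error type; only the variant name mirrors the error described above.
#[derive(Debug, PartialEq)]
enum ClusterError {
    AllConnectionsUnavailable,
}

/// Pick up to `amount` connections, or `None` when there are none to pick from.
/// (Assumed signature; no real randomness, to keep the sketch dependency-free.)
fn random_connections<'a, C>(
    connections: &'a HashMap<String, C>,
    amount: usize,
) -> Option<Vec<(&'a String, &'a C)>> {
    if connections.is_empty() {
        return None;
    }
    Some(connections.iter().take(amount).collect())
}

/// Fail fast when no connection is available instead of continuing with an empty set.
fn refresh_slots<C>(connections: &HashMap<String, C>) -> Result<usize, ClusterError> {
    let candidates = random_connections(connections, 3)
        .ok_or(ClusterError::AllConnectionsUnavailable)?;
    // A real refresh would query each candidate for the slot map;
    // this sketch only reports how many nodes would be queried.
    Ok(candidates.len())
}
```

Returning `None` rather than an empty collection makes the "no connections" case explicit at the call site, so the caller can raise the error immediately.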