Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid retrying on IO errors when it’s unclear if the server received the request. #192

Closed

Conversation

barshaul
Copy link

@barshaul barshaul commented Oct 1, 2024

Main Change:

ATM, on I/O errors, we would reconnect to the failed node and retry the request if there were more retries left. This approach had a critical issue: we couldn’t reliably determine if the server had already received the request before the connection was broken. Retrying in such cases could result in duplicate command execution.

Example:

  1. Client sends INCR key.
  2. Server receives the request and increments the key (e.g., key = 1).
  3. A network issue disconnects the client before the server can respond.
  4. The client reconnects and retries the INCR key.
  5. Server increments the key again (now key = 2). BAD outcome.

This PR differentiates between errors where it’s safe to retry and those where it’s not. Specifically, with multiplexed connections, if the send function returns an error, it guarantees that the server never received the data, making retries safe (see https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send). For other errors, where we can’t be certain, retries are unsafe and will not be automatically attempted. Instead, these errors will now be returned to the user, who must manually retry if they determine it’s safe.

Test Changes:

Since I/O errors are now returned to the user, tests that previously killed the server now loop to retry the request, simulating the handling of I/O errors on the user side.

Refresh Slots Change:

While testing this fix, I found that when all connections were unavailable and refresh_slots was called, it didn’t raise the expected allConnectionsUnavailable error. This has been fixed by updating random_connections function to return an Option. Now, if no connections are found, refresh_slots raises the allConnectionsUnavailable error immediately. The state then shifts to reconnecting to the initial nodes, and slot refreshes are attempted using the new connections.

Out of Scope:

A future PR could introduce a new configuration option to enable retries on connection error, allowing users to control this behavior at the client level.

@barshaul barshaul force-pushed the dont_retry_on_timeout branch from e5cb712 to dbb41ff Compare October 9, 2024 13:53
@barshaul barshaul changed the title WIP: Dont retry on timeout WIP: Avoid retrying on IO errors when it’s unclear if the server received the request. Oct 9, 2024
@barshaul barshaul force-pushed the dont_retry_on_timeout branch 2 times, most recently from 0059c35 to 8522882 Compare October 9, 2024 19:54
@barshaul barshaul changed the title WIP: Avoid retrying on IO errors when it’s unclear if the server received the request. Avoid retrying on IO errors when it’s unclear if the server received the request. Oct 9, 2024
@barshaul barshaul force-pushed the dont_retry_on_timeout branch from 8522882 to aaadbf2 Compare October 9, 2024 20:03
@barshaul barshaul requested a review from eifrah-aws October 9, 2024 20:10
@barshaul barshaul marked this pull request as ready for review October 9, 2024 20:10
@barshaul
Copy link
Author

opened in valkey-glide:
valkey-io/valkey-glide#2479

@barshaul barshaul closed this Oct 20, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant