Avoid retrying on IO errors when it’s unclear if the server received the request. #192

barshaul · 2024-10-01T15:22:33Z

Main Change:

Until now, on I/O errors, we reconnected to the failed node and retried the request if any retries remained. This approach had a critical flaw: we couldn’t reliably determine whether the server had already received the request before the connection broke. Retrying in such cases could execute the same command twice.

Example:

1. Client sends INCR key.
2. Server receives the request and increments the key (e.g., key = 1).
3. A network issue disconnects the client before the server can respond.
4. The client reconnects and retries the INCR key.
5. Server increments the key again (now key = 2), so a single logical request has been applied twice.
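The sequence above can be reproduced with a small in-memory simulation. This is only a sketch: `FakeServer`, `drop_next_response`, and `incr_with_blind_retry` are hypothetical names invented for illustration and are not part of the client or of this PR.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a server. It applies INCR and can then
/// "lose" the response once, imitating a connection that breaks
/// after the command has already been executed.
struct FakeServer {
    data: HashMap<String, i64>,
    drop_next_response: bool,
}

impl FakeServer {
    fn incr(&mut self, key: &str) -> Result<i64, &'static str> {
        // The increment is applied before any failure can be reported.
        let entry = self.data.entry(key.to_string()).or_insert(0);
        *entry += 1;
        let value = *entry;
        if self.drop_next_response {
            self.drop_next_response = false;
            return Err("connection lost before the response arrived");
        }
        Ok(value)
    }
}

/// Retries on any error without knowing whether the command was applied.
fn incr_with_blind_retry(
    server: &mut FakeServer,
    key: &str,
    retries: u32,
) -> Result<i64, &'static str> {
    let mut last = server.incr(key);
    for _ in 0..retries {
        if last.is_ok() {
            break;
        }
        last = server.incr(key);
    }
    last
}
```

With `drop_next_response` set to `true` and one retry allowed, the caller sees a single successful INCR, yet the stored value is 2.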

This PR differentiates between errors where it’s safe to retry and those where it’s not. Specifically, with multiplexed connections, if the send function returns an error, the data will never be received on the other end of the channel, so the request never reached the server and a retry is safe (see https://docs.rs/tokio/latest/tokio/sync/mpsc/struct.Sender.html#method.send). For all other errors we can’t be certain, so retrying is unsafe and is no longer attempted automatically. Instead, these errors are returned to the user, who must retry manually if they determine it’s safe.
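A minimal sketch of this distinction follows. It uses `std::sync::mpsc` as a stand-in for the tokio channel, since both document that an `Err` from `send` means the data will never be received. `RequestFailure`, `is_safe_to_retry`, and `queue_command` are hypothetical names for illustration, not the types used in this PR.

```rust
use std::sync::mpsc;

/// Hypothetical classification of a failed request.
#[derive(Debug, PartialEq)]
enum RequestFailure {
    /// The channel to the connection task was closed, so the request
    /// was never written to the socket.
    NotSent,
    /// The request was handed to the connection, but no response
    /// arrived. The server may or may not have executed it.
    ResponseLost,
}

/// Only failures that provably happened before sending are retried.
fn is_safe_to_retry(failure: &RequestFailure) -> bool {
    matches!(failure, RequestFailure::NotSent)
}

/// Queues a command for a (hypothetical) connection task.
/// `send` fails only when the receiver has been dropped, in which
/// case the command can never be delivered.
fn queue_command(tx: &mpsc::Sender<String>, cmd: &str) -> Result<(), RequestFailure> {
    tx.send(cmd.to_string()).map_err(|_| RequestFailure::NotSent)
}
```

Dropping the receiver before calling `queue_command` yields `RequestFailure::NotSent`, which is the only case this sketch treats as retryable.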

Test Changes:

Since I/O errors are now returned to the user, tests that previously killed the server now loop to retry the request, simulating the handling of I/O errors on the user side.
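The kind of user-side loop the tests now rely on might look roughly like this. It is a sketch under assumptions: `retry_on_io_error` is a hypothetical helper and the request is modeled as a plain closure returning `std::io::Result`, not the client’s real API. A loop like this is only appropriate when repeating the command is harmless.

```rust
use std::io;

/// Calls `request` until it succeeds or `max_attempts` is reached.
/// The request is always attempted at least once.
fn retry_on_io_error<T>(
    max_attempts: u32,
    mut request: impl FnMut() -> io::Result<T>,
) -> io::Result<T> {
    let mut attempt = 1;
    loop {
        match request() {
            Ok(value) => return Ok(value),
            // Out of attempts: surface the last error to the caller.
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => attempt += 1,
        }
    }
}
```

A test would wrap a request in this helper after killing the server, so that the first I/O errors are absorbed and a later attempt succeeds once the connection is back.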

Refresh Slots Change:

While testing this fix, I found that when all connections were unavailable and refresh_slots was called, it didn’t raise the expected allConnectionsUnavailable error. This has been fixed by changing the random_connections function to return an Option. Now, if no connections are found, refresh_slots raises the allConnectionsUnavailable error immediately. The state then shifts to reconnecting to the initial nodes, and slot refreshes are attempted using the new connections.
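The shape of that fix can be sketched as follows. The signatures here are assumptions made for illustration: connections are modeled as plain strings, the selection takes the first entries instead of picking at random, and `ClusterError` is a hypothetical error type rather than the crate’s own.

```rust
/// Hypothetical error type for this sketch.
#[derive(Debug, PartialEq)]
enum ClusterError {
    AllConnectionsUnavailable,
}

/// Returns `None` when there is nothing to pick from, instead of an
/// empty collection that callers could silently iterate over.
/// (Takes the first `amount` entries as a stand-in for random choice.)
fn random_connections(pool: &[String], amount: usize) -> Option<Vec<String>> {
    if pool.is_empty() {
        return None;
    }
    Some(pool.iter().take(amount).cloned().collect())
}

/// Fails fast when no connection is available; otherwise returns how
/// many connections would be queried for the slot map.
fn refresh_slots(pool: &[String]) -> Result<usize, ClusterError> {
    let connections =
        random_connections(pool, 3).ok_or(ClusterError::AllConnectionsUnavailable)?;
    Ok(connections.len())
}
```

Returning an `Option` forces the caller to handle the empty case explicitly, which is what lets the error surface immediately.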

Out of Scope:

A future PR could introduce a configuration option that enables retries on connection errors, letting users control this behavior at the client level.

barshaul · 2024-10-20T10:41:38Z

opened in valkey-glide:
valkey-io/valkey-glide#2479

barshaul added 3 commits October 9, 2024 13:52

Dont retry on connection errors where it is unclear if the server received the request

fa483bf

Fixed test

32f6466

Fixes

3fd8428

barshaul force-pushed the dont_retry_on_timeout branch from e5cb712 to dbb41ff Compare October 9, 2024 13:53

barshaul changed the title ~~WIP: Dont retry on timeout~~ WIP: Avoid retrying on IO errors when it’s unclear if the server received the request. Oct 9, 2024

doc fixes

5f96843

barshaul force-pushed the dont_retry_on_timeout branch 2 times, most recently from 0059c35 to 8522882 Compare October 9, 2024 19:54

barshaul changed the title ~~WIP: Avoid retrying on IO errors when it’s unclear if the server received the request.~~ Avoid retrying on IO errors when it’s unclear if the server received the request. Oct 9, 2024

some fixes

aaadbf2

barshaul force-pushed the dont_retry_on_timeout branch from 8522882 to aaadbf2 Compare October 9, 2024 20:03

barshaul requested a review from eifrah-aws October 9, 2024 20:10

barshaul marked this pull request as ready for review October 9, 2024 20:10

barshaul closed this Oct 20, 2024
