
Conversation

@hachikuji
Contributor

Retries for WriteWalSegments are not as straightforward as some of the blocking calls because we are using an asynchronous stream. The simplest approach I could come up with was to buffer the writes as they are sent. If an UNAVAILABLE error occurs while writing to the stream, we keep writing to the buffer but stop writing to the stream. Once all WAL entries have been received, the retry is performed using the synchronous writeWalSegments API.
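To make this concrete, here is a rough sketch of the idea. It reuses names that appear elsewhere in this diff (WalEntry, StreamSender, RS3TransientException, writeWalSegmentSync), but the surrounding class, the factory type, and the exact signatures are illustrative only, not the actual implementation.

import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Sketch only: buffer every entry, stop streaming on a transient failure,
// and replay the buffer through the synchronous API once the segment is done.
class BufferedWalSegmentWriter {
  private final List<WalEntry> retryBuffer = new ArrayList<>();
  private final StreamSender<WalEntry> streamSender;   // type taken from the diff
  private final WalSegmentStreamFactory streamFactory; // hypothetical factory type
  private boolean isStreamActive = true;

  BufferedWalSegmentWriter(
      final StreamSender<WalEntry> streamSender,
      final WalSegmentStreamFactory streamFactory
  ) {
    this.streamSender = streamSender;
    this.streamFactory = streamFactory;
  }

  void write(final WalEntry entry) {
    retryBuffer.add(entry);           // always buffer so a retry can replay everything
    if (isStreamActive) {
      try {
        streamSender.sendNext(entry);
      } catch (final RS3TransientException e) {
        isStreamActive = false;       // UNAVAILABLE: keep buffering, stop streaming
      }
    }
  }

  Optional<Long> complete() {
    if (!isStreamActive) {
      // All entries received; retry the whole segment with the blocking API.
      return streamFactory.writeWalSegmentSync(retryBuffer);
    }
    streamSender.finish();
    return Optional.empty();          // the flushed offset arrives via the receiver
  }
}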

import org.junit.jupiter.api.TestInfo;
import org.junit.jupiter.api.extension.RegisterExtension;

public class RS3KVTableIntegrationTest {
Contributor Author

This shows up as an addition, but it is a simple rename. I thought IntegrationTest seemed more appropriate since we are using the RS3 container, and I wanted to keep the RS3KVTableTest name for a test that allows mocking dependencies. If the rename seems reasonable, I can submit it in a separate PR so that it doesn't show up in this diff.

Contributor

@rodesai left a comment

Thanks! My main question is how we actually get errors returned back to us when doing a streaming RPC. Is it always through the response callback, or can the stream sender throw exceptions directly? If it's the latter, then we wouldn't be handling those exceptions.

    return new RS3Exception(statusRuntimeException);
  }
} else if (t instanceof StatusException) {
  return new RS3Exception(t);
Contributor

should we check for status UNAVAILABLE here?
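For example, something along these lines, assuming we map UNAVAILABLE to RS3TransientException so the retry path can tell retriable failures apart (just a sketch; the method shape and constructors are assumed, not the PR's code):

import io.grpc.Status;
import io.grpc.StatusException;

// Sketch: map UNAVAILABLE to the transient exception, everything else to RS3Exception.
static RuntimeException wrap(final Throwable t) {
  if (t instanceof StatusException) {
    final StatusException statusException = (StatusException) t;
    if (statusException.getStatus().getCode() == Status.Code.UNAVAILABLE) {
      return new RS3TransientException(statusException);
    }
    return new RS3Exception(statusException);
  }
  return new RS3Exception(t);
}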

throw wrappedException;
}
}

Contributor

we should log something here about the retries we're doing
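For example, something like this inside the retry loop (LOG, attempt, and remainingMs are placeholder names, not from the patch; only opDescription appears in the diff):

// Sketch: make each retry visible before the timeout exception is thrown.
LOG.warn("Retrying RS3 operation {} after transient error (attempt {}, {} ms remaining)",
    opDescription.get(), attempt, remainingMs);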

throw new RS3TimeoutException("Timeout while attempting operation " + opDescription.get());
}


Contributor

should we wrap asyncStub.writeWALSegmentStream in withRetry to catch failures establishing the stream? I'm not sure if those get raised there or in the callbacks
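For example, something like this, assuming withRetry can take a supplier plus an operation description (I'm guessing at the signature here; only asyncStub.writeWALSegmentStream appears in the diff itself):

// Sketch: wrap stream creation so failures establishing the stream are retried too.
final var requestObserver = withRetry(
    () -> asyncStub.writeWALSegmentStream(responseObserver),
    () -> "open writeWALSegmentStream"
);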

if (isStreamActive) {
  try {
    streamSender.sendNext(entry);
  } catch (IllegalStateException e) {
Contributor

why are we catching IllegalStateException here? I think streamSender should throw grpc exceptions if it can't send a frame

Contributor Author

Yeah, I think you are right. I let OpenAI mislead me here: it suggested that onError could be invoked internally by gRPC, in which case subsequent calls would fail with IllegalStateException. Instead, it looks like onNext throws the status exception and the caller is expected to either retry onNext or invoke onError to signal that it is giving up on the stream. I will revise the patch.
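A sketch of what the revised handling might look like under that reading of the gRPC contract. The grpcObserver and protoFactory names come from the snippet below; the RS3TransientException/RS3Exception constructors are assumptions mirroring other parts of this thread.

import io.grpc.Status;
import io.grpc.StatusRuntimeException;

// Sketch: let onNext surface the gRPC status, call onError to give up on the
// stream, and rethrow a typed exception for the caller's fallback logic.
public void sendNext(final WalEntry entry) {
  try {
    grpcObserver.onNext(protoFactory.apply(entry));
  } catch (final StatusRuntimeException e) {
    grpcObserver.onError(e);  // signal that we are abandoning the stream
    if (e.getStatus().getCode() == Status.Code.UNAVAILABLE) {
      throw new RS3TransientException(e);
    }
    throw new RS3Exception(e);
  }
}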

}

@Test
public void shouldRetryPutWithNetworkInterruption() {
Contributor Author

These tests get us a little closer to simulating bad network behavior, but still rely on assumptions about how network errors are raised in the grpc stack.
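As one example of how a test can force such errors without a real network fault (an assumption about approach, not necessarily what these tests do), a gRPC ServerInterceptor can reject the first few calls with UNAVAILABLE and then let traffic through so the retry succeeds:

import io.grpc.Metadata;
import io.grpc.ServerCall;
import io.grpc.ServerCallHandler;
import io.grpc.ServerInterceptor;
import io.grpc.Status;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: simulate a transient outage by failing the first N calls.
class FlakyServerInterceptor implements ServerInterceptor {
  private final AtomicInteger remainingFailures;

  FlakyServerInterceptor(final int failures) {
    this.remainingFailures = new AtomicInteger(failures);
  }

  @Override
  public <ReqT, RespT> ServerCall.Listener<ReqT> interceptCall(
      final ServerCall<ReqT, RespT> call,
      final Metadata headers,
      final ServerCallHandler<ReqT, RespT> next
  ) {
    if (remainingFailures.getAndDecrement() > 0) {
      call.close(Status.UNAVAILABLE.withDescription("simulated network interruption"), new Metadata());
      return new ServerCall.Listener<ReqT>() { };
    }
    return next.startCall(call, headers);
  }
}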

return sendRecv.receiver().handle((result, throwable) -> {
  Optional<Long> flushedOffset = result;
  if (throwable instanceof RS3TransientException) {
    flushedOffset = streamFactory.writeWalSegmentSync(retryBuffer);
Contributor

the retries for this call happen in the grpc client right?

Contributor Author

Yes, that's right. I was trying to keep the retry logic encapsulated in the gRPC client, but I couldn't think of a way to encapsulate retries for the async API.

return RemoteWriteResult.success(kafkaPartition);
}
);
ifActiveStream(StreamSender::finish);
Contributor

we probably need to catch exceptions from finish as well

Contributor Author

In ifActiveStream, we catch RS3TransientException.
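For reference, a minimal sketch of what ifActiveStream could look like consistent with that: run the operation only while the stream is active, and deactivate it on a transient failure so the synchronous fallback takes over. The actual helper in the PR may differ.

import java.util.function.Consumer;

// Sketch: swallow the transient failure here and let the fallback path replay the buffer.
private void ifActiveStream(final Consumer<StreamSender<WalEntry>> operation) {
  if (!isStreamActive) {
    return;
  }
  try {
    operation.accept(streamSender);
  } catch (final RS3TransientException e) {
    isStreamActive = false;
  }
}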

);
ifActiveStream(StreamSender::finish);

return sendRecv.receiver().handle((result, throwable) -> {
Contributor

If there's a failure when sending the stream, is it guaranteed to be propagated to the GrpcMessageReceiver and therefore result in handle being called with the error here? I can't find any docs that say that's the case, so we probably want to fall back to the sync path if we ever hit any errors while sending the stream.

Contributor Author

I handled this in the latest commit by tracking a completion for both the send and recv sides. If either fails, then the combined completion fails as well.
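A sketch of that combined completion using CompletableFuture (streamFactory and retryBuffer mirror earlier snippets in this thread; this is an illustration, not the commit itself):

import java.util.Optional;
import java.util.concurrent.CompletableFuture;

// Sketch: a failure on either the send side or the receive side fails the
// combined future, so the handle() callback always observes the error and can
// fall back to the synchronous write path.
CompletableFuture<Optional<Long>> combineSendAndRecv(
    final CompletableFuture<Void> sendCompletion,
    final CompletableFuture<Optional<Long>> recvCompletion
) {
  return sendCompletion
      .thenCombine(recvCompletion, (ignored, flushedOffset) -> flushedOffset)
      .handle((result, throwable) -> {
        if (throwable != null) {
          return streamFactory.writeWalSegmentSync(retryBuffer);
        }
        return result;
      });
}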

try {
  grpcObserver.onNext(protoFactory.apply(msg));
} catch (Exception e) {
  grpcObserver.onError(e);
Contributor

Related to my question below on how errors are surfaced: this grpcObserver is the gRPC side of the client->server stream. Does calling onError on it propagate the error to our side of the server->client stream?

Contributor

@rodesai left a comment

LGTM! The e2e test looks nice

responsive.rs3.port=50051
responsive.rs3.logical.store.mapping=e2e:b1a45157-e2f0-4698-be0e-5bf3a9b8e9d1
responsive.rs3.tls.enabled=false
responsive.rs3.retry.timeout.ms=1800000
Contributor

What default are we using? It should probably be set to roughly half the max poll interval.

Contributor Author

I had it at 30s, which might be on the low side. Do we use the default poll interval of 5min?

Contributor Author

I changed the default to 2 minutes. That is close to half the poll interval and matches the default producer delivery timeout.
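For reference, explicitly setting that 2-minute value in the same properties format as the e2e config above would be:

responsive.rs3.retry.timeout.ms=120000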

@hachikuji merged commit 0820608 into main on Mar 20, 2025
1 check passed
@hachikuji deleted the rs3-flush-retries branch on March 20, 2025 at 17:02