feat: client reconnection management #1179

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

Sign up for GitHub

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jump to bottom

Open

andygolay wants to merge 6 commits into feature/da-sequencer-node from andygolay/client-connection-tests

Contributor

andygolay commented Apr 3, 2025 •

edited

Loading

Summary

RFCs: Issue da-squencer: Client reconnection management #1169
Categories: any of protocol-units, da sequencer.

Changelog

Add two tests to da sequencer client crate:

test_client_reconnect_if_connection_fails
test_reopen_block_stream_at_correct_height

with mock RPC methods.

Testing

cargo test -p movement-da-sequencer-client

andygolay added 3 commits

April 3, 2025 07:20


          chore: initial commit

04d5762


          fix: add all files

d648444


          feat: tests pass

1442aee

andygolay linked an issue

that may be closed by this pull request

da-squencer: Client reconnection management #1169

Closed

andygolay self-assigned this

andygolay requested a review from sebtomba

April 3, 2025 11:35

andygolay marked this pull request as ready for review

April 3, 2025 11:35

andygolay requested review from 0xmovses, mzabaluev and musitdev as code owners

April 3, 2025 11:35

musitdev reviewed

View reviewed changes

Contributor

musitdev left a comment

The test is a good start but:

the server start and shutdown doesn't work
the client reconnection logic must be implemented in the client. It's not done currently.
So you should update the test that failed, then implements the logic to make the test pass.

protocol-units/da-sequencer/client/tests/connection.rs Outdated


		let service = DaSequencerNodeServiceServer::new(MockService);

		tokio::spawn(async move {

Contributor

musitdev Apr 3, 2025

I don't see the purpose of the oneshot channel. To shut down the server you've to abort the task or stop the server but here, the channel just waits before starting the server.

andygolay marked this pull request as draft

April 3, 2025 14:22

andygolay added 2 commits

April 3, 2025 17:28


          Merge remote-tracking branch 'origin/feature/da-sequencer-node' into …

6c86d97

…andygolay/client-connection-tests


          feat: add reconnection logic with block height to client, update tests

e2f8f63

andygolay changed the title ~~feat: client connection tests~~ feat: client reconnection management

andygolay marked this pull request as ready for review

April 3, 2025 22:16

andygolay requested a review from musitdev

April 3, 2025 22:16


          chore: doc comments

5a9bbc6

musitdev reviewed

View reviewed changes

Contributor

musitdev left a comment

The code is becoming complicated and doesn’t fully address our needs. I think some parts should be managed by the caller (the fullnode), while others should be handled here. I need to think about it more.

Simple methods like batch_write can be handled easily, but if batch_write fails, it means the stream has an issue and should be restarted. Perhaps we should use a heartbeat from the client to the server in parallel with the current one to detect issues and bypass connection errors.

In parallel, we also need to manage the current server heartbeat, which could be used to detect an idle connection that arrives intermittently.

Let’s keep this PR for now. I’ll open a specific issue to describe the entire behavior. We will see what we do with the PR after.

protocol-units/da-sequencer/client/src/lib.rs

    
              				Err(err) => {

              					tracing::warn!(

              						"DA sequencer Http2 connection failed: {}. Retrying in 10s...",

              						"DA sequencer HTTP/2 connection failed: {}. Retrying in 2s...",

Contributor

musitdev Apr 4, 2025

Set to 10 sec because it takes more than 2s to restart the DA.

protocol-units/da-sequencer/client/src/lib.rs

-              	client: DaSequencerNodeServiceClient<tonic::transport::Channel>,
+              	client: DaSequencerNodeServiceClient<Channel>,
+              	connection_string: String,
+              	last_received_height: Arc<Mutex<Option<u64>>>,

Contributor

musitdev Apr 4, 2025

last_received_height is a u64 to you can use AtomicU64 that is more efficient than a mutex.

Contributor Author

andygolay Apr 4, 2025 •

edited

Loading

It should still use Arc though? In that case, in try_connect, what value should be used in the returned client?

We had

last_received_height: Arc::new(Mutex::new(None)),

So I'm thinking either

last_received_height: Arc::new(AtomicU64::new(0)),

or

last_received_height: Arc::new(AtomicU64::new(u64::MAX))

would make sense?

I'm thinking MAX would be better, to represent that no value is set.

Contributor

musitdev Apr 4, 2025

Yes Max is better for a non Value. You still need the Arc to share the Atomic reference.

protocol-units/da-sequencer/client/src/lib.rs

+              			)),
+              			Err(e) => {
+              				tracing::warn!("stream_read_from_height failed, trying reconnect: {e}");
+              				self.reconnect()

Contributor

musitdev Apr 4, 2025

You try only one time immediately. You should have more retry and wait a little before retrying.

protocol-units/da-sequencer/client/src/lib.rs

+              				tracing::warn!("stream_read_from_height failed, trying reconnect: {e}");
+              				self.reconnect()
+              					.await
+              					.map_err(|e| Status::unavailable(format!("Reconnect failed: {e}")))?;

Contributor

musitdev Apr 4, 2025

No error should be return (ie: no ?) until the end of the reconnect loop.

protocol-units/da-sequencer/client/src/lib.rs

+              		stream: Streaming<StreamReadFromHeightResponse>,
+              		last_received_height: Arc<Mutex<Option<u64>>>,
+              	) -> Pin<Box<dyn Stream<Item = Result<StreamReadFromHeightResponse, Status>> + Send>> {
+              		let wrapped = unfold((stream, last_received_height), |(mut s, tracker)| async move {

Contributor

musitdev Apr 4, 2025

You should call tracker last_received_height

protocol-units/da-sequencer/client/src/lib.rs

+              					Some((Ok(msg), (s, tracker)))
+              				}
+              				Ok(None) => None,
+              				Err(e) => Some((Err(e), (s, tracker))),

Contributor

musitdev Apr 4, 2025

If the stream break or return an error, we should try to reconnect. But it starts to be very complicated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet