Skip to content

Resend Replication Requests After a Timeout #143

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 41 commits into from
Apr 24, 2025
Merged

Resend Replication Requests After a Timeout #143

merged 41 commits into from
Apr 24, 2025

Conversation

samliok
Copy link
Collaborator

@samliok samliok commented Apr 14, 2025

This PR introduces a TimeoutHandler, that tracks outgoing replication requests and resends them once a timeout has occurred.

The logic for handling incomplete requests is as follows

  • we send out replication requests, and not the timeouts in TimeoutHandler
  • when we receive a ReplicationResponse we check if the response data contains all the expected sequences
    • if so, we can return early and remove the timeout task
    • if not, we create new requests by figuring out the missing sequences and distributing them across nodes

This approach is nice because it keeps the logic fairly simple. However, because we only remove 1 TimeoutTask every time we receive a response, it's possible we could send duplicate replication requests over the network. For example say we have 2 TimeoutTasks from a node one expects seqs 5-10 and another with 11-15. If we receive seqs 5-13 from the node, we will only remove the timeout with 5-10 from the tasks maps, and resend a request for 11-15 after a timeout. To keep the logic simpler, I chose to not worry about this case since it's highly unlikely that a node would have two separate timeouts for consecutive sequences.

  • Create TimeoutHandler Struct
  • Make tests for TimeoutHandler
  • Integrate TimeoutHandler into ReplicationState
  • Make tests for replication when nodes timeout
  • Ensure incomplete responses don't get forgotten

@samliok samliok self-assigned this Apr 14, 2025
@samliok samliok marked this pull request as draft April 14, 2025 02:53
}

// ----------------------------------------------------------------------
type TaskHeap []*TimeoutTask
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assume that time moves only forward and not backwards, and we always add a task with the same timeout, then why do we need a heap and not a list? Is it ever possible that we add an element and it won't be the last one?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

was thinking this gives us more flexibility. if we decide to send timeout requests with different timeouts(for example exponential backoff on certain nodes) then using a heap will properly order them

@samliok samliok marked this pull request as ready for review April 14, 2025 19:18
}
}

// this is done from normalNode2 since the lagging node will request
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this normalNode2 and not the first node? We don't have a time advancement yet, so shouldn't the lagging node ask for the first sequences from the first node?

The first node has its filter set to reject replication requests, but in order for the lagging node to ask from the second node, time needs to pass in the test, no?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is helpful to wait for normalNode2 to send out its first replication response before we get its second response. Otherwise, the test would occasionally flake.

Copy link
Collaborator

@yacovm yacovm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some nits, LGTM otherwise.

@samliok samliok merged commit 6588b0d into main Apr 24, 2025
5 checks passed
@samliok samliok deleted the timeout branch April 24, 2025 16:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Distribute requests across all nodes
2 participants