Skip to content

feat: replication 2.0#215

Merged
xDarksome merged 21 commits into
mainfrom
feat/replication-2.0
Jul 28, 2025
Merged

feat: replication 2.0#215
xDarksome merged 21 commits into
mainfrom
feat/replication-2.0

Conversation

@xDarksome

@xDarksome xDarksome commented Jun 10, 2025

Copy link
Copy Markdown
Member

Description

WCN 2.0 replication machinery

Is primary/secondary replica sets a new concept?

New concept - no, but we haven't used it explicitly in replication before, previously if there's a migration then both replica sets were just merged into one, which resulted in us being overly restrictive on the required amount of successful responses.
https://github.com/WalletConnectFoundation/wcn/blob/14e1e6e961678df845a5912b84751c85b0b619ed/crates/core/src/cluster.rs#L363
Here for each additional node in the second replica set (the one within new keyspace, migration), we were adding +1 required successful response.
Imagine we have primary replica set {1, 2, 3, 4, 5} and we've removed/added 5+ node operators at the same time. Then the secondary replica set can be completely different (eg. {6, 7, 8, 9, 10}.)
For the operation to be successful we would require (5/2+1) + 5 = 8 successful responses. But realistically we only need 3 from each replica set.

How Has This Been Tested?

Unit

Due Diligence

  • Breaking change
  • Requires a documentation update
  • Requires a e2e/integration test update

@xDarksome xDarksome force-pushed the feat/replication-2.0 branch from c5d36c6 to da6afed Compare July 17, 2025 16:07
@xDarksome xDarksome changed the title feat: replication::DriverV2 feat: replication 2.0 Jul 17, 2025
@xDarksome xDarksome marked this pull request as ready for review July 22, 2025 15:31
@xDarksome xDarksome requested review from heilhead and nopestack July 22, 2025 15:31
std::net::SocketAddrV4,
};

pub fn node_keypair(operator_id: u8, node_id: u8) -> Keypair {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

consider prefixing all of these with fake such that they dont show up on symbol search in some IDEs when looking for similarly named attributes and methods

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are already within the module fake and intended to be used as fake::node_keypair etc.
What use-case do you have in mind? I never had such issues with rust-analyzer.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if intended to be called this way, please add doc-comments pointing that out

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is general code-style of preferring foo::Bar over FooBar. That way if your context is narrow it's fine to import foo::Bar directly and use it as Bar.

If you don't like the naming or it looks un-intuitive to you, we can consider other options.
Another alternative I considered was testing::fake_node_keypair, which to me looked more verbose.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that proposal although verbose, seems perfect

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@heilhead please pick between this 2, or propose yours

@heilhead heilhead Jul 24, 2025

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't particularly like the module name fake, and would prefer using testing instead. As for the prefixing/renaming the function, I don't think it's necessary, as this module would only exist in #[cfg(test)] code or behind the testing feature I assume.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll rename all fake modules to testing. fake::SmartContract will be testing::FakeSmartContract tho, as it's an actual "fake" by definition


// If both replica sets reached quorum, but the responses are different, then we
// return `None` indicating that quorum hasn't been reached.
(Some(a), Some(b)) => break Some(a).filter(|_| a == b),

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not if a == b { Some(a) } else { None }?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There was probably a more complicated chain at some point.


// If we didn't get the quorum and this is a read operation, then try to reconcile
// the responses.
None => reconciliation::reconcile(operation, &responses[..RF])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this apply only to 'collection' responses or to all of them?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as before, only map page and cardinality, the match on the operation type is withing the fn

this.execute_callback(operation.into(), ResponseChannel(tx))
.await
});
drop(handle);

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of explicitly dropping the handle here?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clippy complains otherwise

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So doing simply tokio::spawn(...); triggers a warning? If that's the case, I'd still prefer let _ = tokio::spawn(...);.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I was getting a warning on let _ = as well, will check

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

error: non-binding `let` on a future
   --> crates/replication2/src/coordinator/mod.rs:183:9
    |
183 | /         let _ = tokio::spawn(async move {
184 | |             this.execute_callback(operation.into(), ResponseChannel(tx))
185 | |                 .await
186 | |         });
    | |___________^

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not binding it at all works, I thought I had issues with it as well, weird

operator: &NodeOperator<N>,
operation: &Operation<'_>,
) -> storage_api::Result<operation::Output> {
let mut retries_left = operator.nodes.len();

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get an empty set of nodes here under any conditions? I suppose it's not, but since you're already returning a result here, maybe make sure that it's never empty? Otherwise it may panic with an overflow on subtraction.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an invariant of NodeOperator and it's being checked in the respective constructor. NodeOperator::next_node can also panic, but this would be a bug.

It's worth a comment tho, I'll add one.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also the field needs to be private, which it isn't rn. Thanks for pointing out.

operation: Operation<'_>,
) -> impl Future<Output = Result<operation::Output>> + Send;
) -> impl Future<Output = Result<operation::Output>> + Send {
async move { self.execute_ref(&operation).await }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the default implementation of the trait would result in a stack overflow due to recursion? Not sure it's a good idea.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a neat trick I didn't know about either. Stack overflow is not possible when dealing with statically typed futures. Compiler is smart enough to throw an error forcing you to implement one of the functions.

match response {
Ok(output) if &output != self.output => {}

// Errors are not being repaired.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it worth logging those regardless?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We won't see anything useful there, only the ErrorKind which will be available as a metric from wcn_rpc.

pub(super) fn is_repairable(operation: &Operation<'_>) -> bool {
use operation::{Borrowed, Owned};

match operation {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe simplify calling matches!

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matches! won't give the same variant exhaustion guarantees

) -> Option<operation::Output> {
use operation::{Borrowed, Owned};

match operation {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same as before here

@xDarksome xDarksome merged commit f48efef into main Jul 28, 2025
12 checks passed
@xDarksome xDarksome deleted the feat/replication-2.0 branch July 28, 2025 18:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants