
Conversation

@kostasrim kostasrim (Contributor) commented Sep 28, 2025

Following #5774, this PR removes locking of replica_of_mu_ from the INFO command and uses the thread locals plus the ReplicaOf v2 algorithm.

No more locking and blocking of the INFO command during REPLICAOF or takeover 🥳 🎉 🌮

@kostasrim kostasrim self-assigned this Sep 28, 2025
@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 620f35c to 7e36655 on September 28, 2025 09:37
}
}

void ServerFamily::RoleV2(CmdArgList args, const CommandContext& cmd_cntx) {
@kostasrim kostasrim (Contributor Author) Sep 28, 2025

Once the regression tests have run for some time, we will gradually remove Role(), ReplicaOf(), etc. (all the variants of the previous versions).

@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 7cee991 to 59b3350 on September 29, 2025 09:23
int, replica_priority, 100,
"Published by info command for sentinel to pick replica based on score during a failover");
ABSL_FLAG(bool, experimental_replicaof_v2, true,
ABSL_FLAG(bool, experimental_replicaof_v2, false,
@kostasrim (Contributor Author)

Will revert back to true; I just want to make sure I did not break anything, in case we want to switch back to the old implementation.

ProtocolClient::~ProtocolClient() {
exec_st_.JoinErrorHandler();

// FIXME: We should close the socket explictly outside of the destructor. This currently
@kostasrim kostasrim (Contributor Author) Oct 2, 2025

After spending a few hours on this, I still don't understand why, if we keep this code, we get a segfault in the destructor of std::shared_ptr<Replica>. It seems to happen during the preemption, but the sock_ resources are already deallocated, so Close() should return early because fd_ < 0.

What is more, the core dump shows that tl_replica and its copy have different ref-counted objects: one appears expired while the other has a ref count of 7. I added a CHECK() before the crash to make sure that both copies of the shared_ptr point to the exact same control block. The checks passed, yet the core dump showed otherwise, which makes me think this is somehow a memory corruption error.

The good thing is that we don't need this code anymore, as we now handle closing the socket outside of the destructor.

While writing this, the only case I can think of is that the last instance of tl_replica gets destructed, but it needs to preempt, and an INFO command comes in and grabs a copy while the shared_ptr is being destructed, which could lead to a race condition.

I will verify this theory once I am back from the holidays.

P.S. The test that failed is test_cancel_replication_immediately (it fails about once every 300 runs, so it is rather time-consuming to reproduce).

Collaborator

I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me instructions on how to reproduce the crash? I would like to inspect it myself.

Collaborator

"What is more, the core dump shows that tl_replica and its copy have a different ref counted object" - what is the copy of tl_replica?

@kostasrim kostasrim (Contributor Author) Oct 15, 2025

"I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me the instructions on how to cause the crash? I would like to inspect it myself"

I agree, it's a symptom, and I believe it is a byproduct of UB. As I wrote, I think I understand why it crashed (see below).

"What is more, the core dump shows that tl_replica and its copy have a different ref counted object - what is the copy of tl_replica?"

Pardon me, I did not provide context. The crash always happens here:

  2013 optional<Replica::Summary> ServerFamily::GetReplicaSummary() const {
  2014   /// Without lock
  2015   if (use_replica_of_v2_) {
  2016     if (tl_replica == nullptr) {
  2017       return nullopt;
  2018     }
  2019     auto replica = tl_replica;
  2020     return replica->GetSummary();
  2021   }

Specifically, it happens when we return and destruct the Replica object (which in turn destructs the ProtocolClient).

When I used gdb to inspect the core dump, I saw that the segfault happened in the destructor of the ProtocolClient (on line 2020, the return statement). Then I printed both tl_replica and replica, and they both pointed to the same object but with different ref counts (one was zero, the other was some arbitrary value I don't remember).

Then I thought that maybe we somehow end up with a different control block because of how we copy. I added a CHECK that both replica and tl_replica point to the exact same control block. It did not trigger.

Lastly, as I wrote above, I think I understand what happens:

  if (sock_) {
    std::error_code ec;
    sock_->proactor()->Await([this, &ec]() { ec = sock_->Close(); });
    LOG_IF(ERROR, ec) << "Error closing socket " << ec;
  }

This is the destructor of ProtocolClient. Now imagine we update all the tl_replica pointers (say, because of a REPLICAOF command). The last assignment to tl_replica (on whichever thread) drops the ref count to 0. At that point it preempts (because we dispatch to call Close() on sock_). Now INFO ALL gets called, we grab a copy of tl_replica while it is being destructed, and we trigger UB. This is the timeline:

  1. tl_replica is assigned a new object, drops the previous ref count to zero, and preempts in the destructor of Replica.
  2. INFO ALL gets called; we grab a copy of tl_replica, which is being destructed.
  3. Boom.

All in all, let me first verify this hypothesis, and if it is wrong I will set up an environment for us to tag team.

@kostasrim (Contributor Author)

And I did not have time to verify this because I literally left for the holidays the evening I came up with this hypothesis 😄

@kostasrim (Contributor Author)

After looking at the core dump and adding a few more logs:

server_family.cc:4205] proactor: 0x33138180800setting 0x3313c0d0890 to 0x3313c0d0f90
server_family.cc:4205] proactor: 0x33138180300setting 0x3313c0d0890 to 0x3313c0d0f90

Within the body of the ReplicaOf() command we update tl_replica on each proactor to the new replica_ object. Proactor 0x33138180800 drops the use count from 2 to 1. Then the assignment tl_replica = repl observes that the use count is 1 and calls ~Replica, which fiber-preempts as it calls Await(). tl_replica is now in an intermediate state.

  void ServerFamily::UpdateReplicationThreadLocals(SharedPtrW<Replica> repl) {
    service_.proactor_pool().AwaitFiberOnAll([repl](auto index, auto* context) {
      LOG(INFO) << "proactor: " << ProactorBase::me() << "setting " << tl_replica << " to " << repl;
      tl_replica = repl;

and deep inside shared_ptr::operator=

  _Sp_counted_base<_Lp>* __tmp = __r._M_pi;
  if (__tmp != _M_pi)
    {
      if (__tmp != nullptr)
        __tmp->_M_add_ref_copy();
      if (_M_pi != nullptr)
        _M_pi->_M_release();  // We release and fiber-preempt here
      _M_pi = __tmp;
    }
  return *this;

replica.cc:101] Destructing Replica: 0x3313c0d0890 on thread 0x33138180300
protocol_client.cc:122] Destructing Protocol: 0x3313c0d0890
I20251016 13:16:15.320950 64356 server_family.cc:2020] copying 0x3313c0d0f90 on thread 0x33138180300

We just copied tl_replica in the middle of the assignment.

protocol_client.cc:130] Destructing Protocol complete: 0x3313c0d0890
This is after _M_pi->_M_release(); the crash follows.

Then, to verify that this is indeed the case, I made sure that tl_replica is never in a partial state even if its destructor preempts (basically just auto tmp = tl_replica; followed by tl_replica = repl), and the test never fails. It was indeed the case I imagined 🤓
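For reference, a minimal sketch of that one-liner fix, assuming tl_replica is a thread_local std::shared_ptr<Replica> and UpdateReplicationThreadLocals has the shape quoted above (an illustration, not the exact PR diff):

  void ServerFamily::UpdateReplicationThreadLocals(std::shared_ptr<Replica> repl) {
    service_.proactor_pool().AwaitFiberOnAll([repl](auto index, auto* context) {
      // Keep the old object alive in a local first: the assignment below can then
      // never drop the last reference, so tl_replica is never observed in a
      // half-assigned state even if ~Replica preempts.
      auto keep_alive = tl_replica;
      tl_replica = repl;
      // If keep_alive held the last reference, the old Replica is released here,
      // after tl_replica already holds a fully valid value.
    });
  }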

Collaborator

Can we change the Replica design so that its destructor won't preempt? Basically, pull out all the non-trivial code into a Shutdown method that is explicitly called, like we do in many other cases.
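A sketch of that shape, reusing the teardown code currently in ~ProtocolClient (an illustration under the names quoted above, not the PR's actual diff):

  // Explicit, possibly preempting teardown; callers invoke it before releasing
  // the object, so the destructor itself stays trivial.
  void ProtocolClient::Shutdown() {
    if (sock_) {
      std::error_code ec;
      sock_->proactor()->Await([this, &ec]() { ec = sock_->Close(); });
      LOG_IF(ERROR, ec) << "Error closing socket " << ec;
    }
  }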

@kostasrim (Contributor Author)

So:

  1. I will add the one-liner fix in this PR and remove the changes to the destructor (and outgoing_slot_migration) to reduce the PR size.
  2. I will follow up with a second PR. It is simply buggy to preempt and do IO in the destructor.

@kostasrim (Contributor Author)

"Can we change the Replica design so that its destructor won't preempt? Basically, pull out all the non-trivial code into a Shutdown method that is explicitly called, like we do in many other cases."

This is exactly what I did by removing the preemptive code (the Await()) from the destructor, and that is why you see the explicit calls to Close(), which also shut down the socket outside of the destructor.

As I said, to keep the review simple, I will split this into a follow-up PR, since it also requires the changes in outgoing_slot_migration. This is also the answer to your other question: #5864 (comment)

@kostasrim kostasrim requested a review from romange October 2, 2025 07:47
@kostasrim (Contributor Author)

With all the changes around replication, I will follow up with a tidy PR 😄

add_replica->StartMainReplicationFiber(nullopt);
cluster_replicas_.push_back(std::move(add_replica));
service_.proactor_pool().AwaitFiberOnAll(
[this](auto index, auto* cntx)
Collaborator

This is not the right way to do it: cluster_replicas_ can change in between. Instead, you should capture it in the callback by value rather than capturing this.
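A minimal sketch of the suggested by-value capture (the thread-local name tl_cluster_replicas below is hypothetical, used only for illustration):

  // Copy the vector once, then capture the copy by value so the callback never
  // reads cluster_replicas_ through `this` while it may be mutated elsewhere.
  auto replicas = cluster_replicas_;
  service_.proactor_pool().AwaitFiberOnAll(
      [replicas](auto index, auto* cntx) { tl_cluster_replicas = replicas; });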

@kostasrim kostasrim (Contributor Author) Oct 15, 2025

How can cluster_replicas_ change in between if all modifications to this vector happen under the same mutex, replicaof_mu_? 🤔

ReplicaOfNoOne, ReplicaOf, Shutdown, etc. all lock the same mutex before the object gets modified.

// Please note that we do not use the Metrics object here.
{
if (use_replica_of_v2_) {
// Deep copy because tl_replica might be overwritten inbetween
@romange romange (Collaborator) Oct 14, 2025

I find the "Deep copy" term confusing here.

  1. You copy a shared_ptr, which is not a deep copy, imho.
  2. But then you copy cluster_replicas, which is indeed a (deep) copy of a vector.

@kostasrim (Contributor Author)

I agree the comment is misleading, will edit

}

bool should_cancel_flows = false;
absl::Cleanup on_exit([this]() { CloseSocket(); });
Collaborator

Is this an existing bug that is orthogonal to this change?

using strings::HumanReadableNumBytes;

// Initialized by REPLICAOF
thread_local std::shared_ptr<Replica> tl_replica = nullptr;
Collaborator

Please use ServerState for this.
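A hedged sketch of what that could look like, assuming a new field is added to ServerState (the replica member below does not exist today; it is only an illustration of the suggestion):

  // In ServerState (hypothetical addition):
  //   std::shared_ptr<Replica> replica;  // written by REPLICAOF, read by INFO
  //
  // At the update site, instead of a file-level thread_local:
  service_.proactor_pool().AwaitFiberOnAll(
      [repl](auto index, auto* context) { ServerState::tlocal()->replica = repl; });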

@romange romange (Collaborator) left a comment

  1. Please prepare a "crash" branch for me to investigate
  2. If the outgoing_slot_migration change fixes a bug, I would like to ask you to send it in another PR. Also, retire the info_replication_valkey_compatible flag in the same PR by removing the "false" branch. Let's submit the changes we are more confident about first.

auto* Proactor() const {
return sock_->proactor();
DCHECK(socket_thread_);
return socket_thread_;
Collaborator

Why do you need socket_thread_?
Do we access ProtocolClient from multiple threads?
Why do we need last_io_time_ to be atomic?

Collaborator

I see now. The only method we access from multiple threads is Replica::GetSummary, and you are fixing existing bugs. See my other PR comments. Let's group all the fixes in a preliminary PR first.

@romange romange (Collaborator) commented Oct 14, 2025

There is a problem with this design. You use shared_ptr to be able to access the object even if it is "released" in another thread, but ProtocolClient is designed to be destructed in its own thread. So suppose ProtocolClient is created in thread A and thread B holds a pointer to that object; if A then destroys its reference to the ProtocolClient, thread B will hold the only reference and will essentially call the destructor of ProtocolClient in the wrong thread.

@kostasrim (Contributor Author)

"There is a problem with this design. You use shared_ptr to be able to access the object even if it is "released" in another thread, but ProtocolClient is designed to be destructed in its own thread. So suppose ProtocolClient is created in thread A and thread B holds a pointer to that object; if A then destroys its reference to the ProtocolClient, thread B will hold the only reference and will essentially call the destructor of ProtocolClient in the wrong thread."

I don't see any issue with destructing in a separate thread, because by the time we call ~ProtocolClient we have already cleaned up our resources. Specifically:

  1. shard_flows_ is empty
  2. sync_fb_ and acks_fb_ should have joined (I can add DCHECKS for this)
  3. the base class member ProtocolClient::sock_ has already been properly Shutdown() and Close()d on the right thread, so there is no IO involved or anything else.

The destructor is "trivial" in that respect (of course, we previously had preemption because of the Await(), which I removed).
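For instance, the DCHECKs mentioned in point 2 could look roughly like this (member names taken from the list above; a sketch, not the PR's code):

  Replica::~Replica() {
    // By the time the last reference is dropped, all teardown has already been
    // done explicitly on the correct thread, so the destructor stays trivial.
    DCHECK(shard_flows_.empty());
    DCHECK(!sync_fb_.IsJoinable());
    DCHECK(!acks_fb_.IsJoinable());
  }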

@romange romange (Collaborator) commented Oct 16, 2025

I prefer the opposite:

  1. Submit a PR that does not change behavior: bug fixes, refactoring, etc. Release it in 1.35.
  2. Submit a clean PR that adds lockless replication. Release it in 1.36.
