
Conversation

@kostasrim kostasrim (Contributor) commented Sep 28, 2025

Following #5774, this PR removes locking of replica_of_mu_ from the INFO command and uses the thread locals plus the ReplicaOf v2 algorithm.

No more locking and blocking of the INFO command during REPLICAOF or takeover 🥳 🎉 🌮

@kostasrim kostasrim self-assigned this Sep 28, 2025
@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 620f35c to 7e36655 on September 28, 2025 09:37
}
}

void ServerFamily::RoleV2(CmdArgList args, const CommandContext& cmd_cntx) {
@kostasrim kostasrim (Contributor Author) Sep 28, 2025

Once the regression tests have run for some time, we will gradually remove Role(), ReplicaOf(), etc. (all the variants of the previous versions).

@kostasrim kostasrim force-pushed the kpr2 branch 2 times, most recently from 7cee991 to 59b3350 on September 29, 2025 09:23
int, replica_priority, 100,
"Published by info command for sentinel to pick replica based on score during a failover");
ABSL_FLAG(bool, experimental_replicaof_v2, true,
ABSL_FLAG(bool, experimental_replicaof_v2, false,
@kostasrim (Contributor Author)

Will revert back to true; I just want to make sure I did not break anything, in case we want to switch back to the old implementation.

ProtocolClient::~ProtocolClient() {
exec_st_.JoinErrorHandler();

// FIXME: We should close the socket explictly outside of the destructor. This currently
@kostasrim kostasrim (Contributor Author) Oct 2, 2025

After spending a few hours on this, I still don't understand why, if we keep this code, we get a segfault in the destructor of std::shared_ptr<Replica>. It seems to happen during the preemption, but the sock_ resources are already deallocated, so Close() should return early because fd_ < 0.

What is more, the core dump shows that tl_replica and its copy have different ref-counted objects: one appears expired while the other has a ref count of 7. I added a CHECK() before the crash to make sure that both copies of the shared_ptr point to the exact same control block. The checks passed, yet the core dump showed otherwise, which makes me think this is somehow a memory corruption error.

The good thing is that we don't need this code anymore, as we now handle closing the socket outside of the destructor.

While writing this, the only case I can think of is that the last instance of tl_replica gets destructed, but it needs to preempt, and an INFO command comes in and grabs a copy while the shared_ptr is being destructed, which could lead to a race condition.

I will verify this theory once I am back from the holidays.

P.S. The test that failed is test_cancel_replication_immediately (it fails about once every 300 runs, so it is rather time-consuming to reproduce).

Collaborator

I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me instructions on how to reproduce the crash? I would like to inspect it myself.

Collaborator

"What is more, the core dump shows that tl_replica and its copy have a different ref counted object" - what is the copy of tl_replica?

@kostasrim kostasrim (Contributor Author) Oct 15, 2025

"I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me the instructions on how to cause the crash? I would like to inspect it myself"

I agree, it's a symptom, and I believe it is a byproduct of UB. As I wrote, I think I understand why it crashed (see below).

"What is more, the core dump shows that tl_replica and its copy have a different ref counted object - what is the copy of tl_replica?"

Pardon me, I did not provide context. The crash always happens here:

  2013 optional<Replica::Summary> ServerFamily::GetReplicaSummary() const {
  2014   /// Without lock
  2015   if (use_replica_of_v2_) {
  2016     if (tl_replica == nullptr) {
  2017       return nullopt;
  2018     }
  2019     auto replica = tl_replica;
  2020     return replica->GetSummary();
  2021   }

Specifically, it happens when we return and destruct the Replica object (which in turn destructs the ProtocolClient).

When I used gdb to inspect the core dump, I saw that the segfault happened in the destructor of the ProtocolClient (on line 2020, the return statement). Then I printed both tl_replica and replica, and they both pointed to the same object but with different ref counts (one was zero, the other was some arbitrary value I don't remember).

Then I thought that maybe we somehow end up with a different control block because of how we copy. I added a CHECK that both replica and tl_replica point to the exact same control block. It did not trigger.

Lastly, as I wrote above, I think I understand what happens:

  if (sock_) {
    std::error_code ec;
    sock_->proactor()->Await([this, &ec]() { ec = sock_->Close(); });
    LOG_IF(ERROR, ec) << "Error closing socket " << ec;
  }

This is the destructor of ProtocolClient. Now imagine we update all the tl_replica pointers (say, because of a REPLICAOF command). The last assignment to tl_replica (on whichever thread) drops the ref count to 0. At that point it preempts (because we dispatch to call Close() on sock_). Now INFO ALL gets called, we grab a copy of tl_replica while it is being destructed, and we trigger UB. This is the timeline:

  1. tl_replica is assigned a new object, drops the previous ref count to zero, and preempts in the destructor of Replica.
  2. INFO ALL gets called; we grab a copy of tl_replica, which is being destructed.
  3. Boom.

All in all, let me first verify this hypothesis, and if it is wrong I will set up an environment for us to tag team.

@kostasrim (Contributor Author)

And I did not have time to verify this because I literally left for the holidays the evening I came up with this hypothesis 😄

@kostasrim (Contributor Author)

After looking at the core dump and adding a few more logs:

server_family.cc:4205] proactor: 0x33138180800setting 0x3313c0d0890 to 0x3313c0d0f90
server_family.cc:4205] proactor: 0x33138180300setting 0x3313c0d0890 to 0x3313c0d0f90

Within the body of the ReplicaOf() command we update tl_replica on each proactor to the new replica_ object. Proactor 0x33138180800 drops the use count from 2 to 1. Then the assignment tl_replica = repl observes that the use count is 1 and calls ~Replica, which fiber-preempts as it calls Await(). tl_replica is now in an intermediate state.

  void ServerFamily::UpdateReplicationThreadLocals(SharedPtrW<Replica> repl) {
    service_.proactor_pool().AwaitFiberOnAll([repl](auto index, auto* context) {
      LOG(INFO) << "proactor: " << ProactorBase::me() << "setting " << tl_replica << " to " << repl;
      tl_replica = repl;

and deep inside shared_ptr::operator=

  _Sp_counted_base<_Lp>* __tmp = __r._M_pi;
  if (__tmp != _M_pi)
    {
      if (__tmp != nullptr)
        __tmp->_M_add_ref_copy();
      if (_M_pi != nullptr)
        _M_pi->_M_release();  // We release and fiber-preempt here
      _M_pi = __tmp;
    }
  return *this;

replica.cc:101] Destructing Replica: 0x3313c0d0890 on thread 0x33138180300
protocol_client.cc:122] Destructing Protocol: 0x3313c0d0890
I20251016 13:16:15.320950 64356 server_family.cc:2020] copying 0x3313c0d0f90 on thread 0x33138180300

We just copied tl_replica in the middle of the assignment.

protocol_client.cc:130] Destructing Protocol complete: 0x3313c0d0890
This is after _M_pi->_M_release(); the crash follows.

Then, to verify that this is indeed the case, I made sure that tl_replica is never in a partial state even if its destructor preempts (basically just auto tmp = tl_replica; followed by tl_replica = repl), and the test never fails. It was indeed the case I imagined 🤓
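For reference, a minimal sketch of that one-liner fix, assuming tl_replica is a thread_local std::shared_ptr<Replica> and UpdateReplicationThreadLocals has the shape quoted above (an illustration, not the exact PR diff):

  void ServerFamily::UpdateReplicationThreadLocals(std::shared_ptr<Replica> repl) {
    service_.proactor_pool().AwaitFiberOnAll([repl](auto index, auto* context) {
      // Keep the old object alive in a local first: the assignment below can then
      // never drop the last reference, so tl_replica is never observed in a
      // half-assigned state even if ~Replica preempts.
      auto keep_alive = tl_replica;
      tl_replica = repl;
      // If keep_alive held the last reference, the old Replica is released here,
      // after tl_replica already holds a fully valid value.
    });
  }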

Collaborator

Can we change the Replica design so that its destructor won't preempt? Basically, pull out all the non-trivial code into a Shutdown method that is explicitly called, like we do in many other cases.
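A sketch of that shape, reusing the teardown code currently in ~ProtocolClient (an illustration under the names quoted above, not the PR's actual diff):

  // Explicit, possibly preempting teardown; callers invoke it before releasing
  // the object, so the destructor itself stays trivial.
  void ProtocolClient::Shutdown() {
    if (sock_) {
      std::error_code ec;
      sock_->proactor()->Await([this, &ec]() { ec = sock_->Close(); });
      LOG_IF(ERROR, ec) << "Error closing socket " << ec;
    }
  }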

@kostasrim (Contributor Author)

So:

  1. I will add the one-liner fix in this PR and remove the changes to the destructor (and outgoing_slot_migration) to reduce the PR size.
  2. I will follow up with a second PR. It is simply buggy to preempt and do IO in the destructor.

@kostasrim (Contributor Author)

"Can we change the Replica design so that its destructor won't preempt? Basically, pull out all the non-trivial code into a Shutdown method that is explicitly called, like we do in many other cases."

This is exactly what I did by removing the preemptive code (the Await()) from the destructor, and that is why you see the explicit calls to Close(), which also shut down the socket outside of the destructor.

As I said, to keep the review simple, I will split this into a follow-up PR, since it also requires the changes in outgoing_slot_migration. This is also the answer to your other question: #5864 (comment)

@kostasrim kostasrim requested a review from romange October 2, 2025 07:47
@kostasrim (Contributor Author)

With all the changes around replication, I will follow up with a tidy PR 😄

add_replica->StartMainReplicationFiber(nullopt);
cluster_replicas_.push_back(std::move(add_replica));
service_.proactor_pool().AwaitFiberOnAll(
[this](auto index, auto* cntx)
Collaborator

This is not the right way to do it: cluster_replicas_ can change in between. Instead, you should capture it in the callback by value rather than capturing this.
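A minimal sketch of the suggested by-value capture (the thread-local name tl_cluster_replicas below is hypothetical, used only for illustration):

  // Copy the vector once, then capture the copy by value so the callback never
  // reads cluster_replicas_ through `this` while it may be mutated elsewhere.
  auto replicas = cluster_replicas_;
  service_.proactor_pool().AwaitFiberOnAll(
      [replicas](auto index, auto* cntx) { tl_cluster_replicas = replicas; });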

@kostasrim kostasrim (Contributor Author) Oct 15, 2025

How can cluster_replicas_ change in between if all modifications to this vector happen under the same mutex, replicaof_mu_? 🤔

ReplicaOfNoOne, ReplicaOf, Shutdown, etc. all lock the same mutex before the object gets modified.

// Please note that we do not use the Metrics object here.
{
if (use_replica_of_v2_) {
// Deep copy because tl_replica might be overwritten inbetween
@romange romange (Collaborator) Oct 14, 2025

I find the "Deep copy" term confusing here.

  1. You copy a shared_ptr, which is not a deep copy, imho.
  2. But then you copy cluster_replicas, which is indeed a (deep) copy of a vector.

@kostasrim (Contributor Author)

I agree the comment is misleading, will edit

}

bool should_cancel_flows = false;
absl::Cleanup on_exit([this]() { CloseSocket(); });
Collaborator

Is this an existing bug that is orthogonal to this change?

using strings::HumanReadableNumBytes;

// Initialized by REPLICAOF
thread_local std::shared_ptr<Replica> tl_replica = nullptr;
Collaborator

Please use ServerState for this.
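A hedged sketch of what that could look like, assuming a new field is added to ServerState (the replica member below does not exist today; it is only an illustration of the suggestion):

  // In ServerState (hypothetical addition):
  //   std::shared_ptr<Replica> replica;  // written by REPLICAOF, read by INFO
  //
  // At the update site, instead of a file-level thread_local:
  service_.proactor_pool().AwaitFiberOnAll(
      [repl](auto index, auto* context) { ServerState::tlocal()->replica = repl; });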

@romange romange (Collaborator) left a comment

  1. Please prepare a "crash" branch for me to investigate
  2. If the outgoing_slot_migration change fixes a bug, I would like to ask you to send it in another PR. Also, retire the info_replication_valkey_compatible flag in the same PR by removing the "false" branch. Let's submit the changes we are more confident about first.

auto* Proactor() const {
return sock_->proactor();
DCHECK(socket_thread_);
return socket_thread_;
Collaborator

Why do you need socket_thread_?
Do we access ProtocolClient from multiple threads?
Why do we need last_io_time_ to be atomic?

Collaborator

I see now. The only method we access from multiple threads is Replica::GetSummary, and you are fixing existing bugs. See my other PR comments. Let's group all the fixes in a preliminary PR first.

@romange romange (Collaborator) commented Oct 14, 2025

There is a problem with this design. You use shared_ptr to be able to access the object even if it is "released" in another thread, but ProtocolClient is designed to be destructed in its own thread. So suppose ProtocolClient is created in thread A and thread B holds a pointer to that object; if A then destroys its reference to the ProtocolClient, thread B will hold the only reference and will essentially call the destructor of ProtocolClient in the wrong thread.

@kostasrim (Contributor Author)

"There is a problem with this design. You use shared_ptr to be able to access the object even if it is "released" in another thread, but ProtocolClient is designed to be destructed in its own thread. So suppose ProtocolClient is created in thread A and thread B holds a pointer to that object; if A then destroys its reference to the ProtocolClient, thread B will hold the only reference and will essentially call the destructor of ProtocolClient in the wrong thread."

I don't see any issue with destructing in a separate thread, because by the time we call ~ProtocolClient we have already cleaned up our resources. Specifically:

  1. shard_flows_ is empty
  2. sync_fb_ and acks_fb_ should have joined (I can add DCHECKS for this)
  3. the base class member ProtocolClient::sock_ has already been properly Shutdown() and Close()d on the right thread, so there is no IO involved or anything else.

The destructor is "trivial" in that respect (of course, we previously had preemption because of the Await(), which I removed).
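For instance, the DCHECKs mentioned in point 2 could look roughly like this (member names taken from the list above; a sketch, not the PR's code):

  Replica::~Replica() {
    // By the time the last reference is dropped, all teardown has already been
    // done explicitly on the correct thread, so the destructor stays trivial.
    DCHECK(shard_flows_.empty());
    DCHECK(!sync_fb_.IsJoinable());
    DCHECK(!acks_fb_.IsJoinable());
  }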

@romange romange (Collaborator) commented Oct 16, 2025

I prefer the opposite:

  1. Submit a PR that does not change behavior: bug fixes, refactoring, etc. Release it in 1.35.
  2. Submit a clean PR that adds lockless replication. Release it in 1.36.
