chore: lock free info command with replicaof v2 #5864
Conversation
620f35c to 7e36655 (force-pushed)
  }
}

void ServerFamily::RoleV2(CmdArgList args, const CommandContext& cmd_cntx) {
Once the regression tests run for some time we will slowly remove Role() and ReplicaOf etc (all the variants of the previous versions)
7cee991 to 59b3350 (force-pushed)
Signed-off-by: kostas <[email protected]>
    int, replica_priority, 100,
    "Published by info command for sentinel to pick replica based on score during a failover");
- ABSL_FLAG(bool, experimental_replicaof_v2, true,
+ ABSL_FLAG(bool, experimental_replicaof_v2, false,
Will revert back to true; I just want to make sure I did not break anything in case we want to switch back to the old implementation.
ProtocolClient::~ProtocolClient() {
  exec_st_.JoinErrorHandler();

// FIXME: We should close the socket explicitly outside of the destructor. This currently
After spending a few hours on this, I still don't understand why, if we keep this code, we get a segfault on the destructor of std::shared_ptr<Replica>. It seems that it happens during the preemption but the sock_ resources are already deallocated so Close() should early return because fd_ < 0.
What is more, the core dump shows that tl_replica and its copy have a different ref-counted object: one shows that it is expired while the other one has a ref count of 7. I added a CHECK() before the crash to make sure that both copies of the shared_ptr point to the exact same control block. The checks passed, yet the core dump showed otherwise, which makes me think that this is somehow a memory corruption error.
The good thing is that we don't need this code anymore, as we handle closing the socket outside of the destructor now.
While writing this, the only case I can think of is that the last instance of tl_replica gets destructed, but it needs to preempt, and an INFO command comes in and grabs a copy while the shared_ptr is destructing, which could lead to a race condition.
I will verify this theory once I am back from the holidays.
P.S. the test that failed is test_cancel_replication_immediately (it fails roughly once every 300 runs, so it's kind of time-consuming to reproduce).
I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me the instructions on how to cause the crash? I would like to inspect it myself
What is more, the core dump shows that tl_replica and its copy, have a different ref counted object - what is the copy of tl_replica?
I am afraid the crash is a symptom of a bug in another place. Can you please create a branch that crashes and give me the instructions on how to cause the crash? I would like to inspect it myself
I agree, it's a symptom and I believe it's a byproduct of UB. As I wrote I think I understand why it crashed (see below)
What is more, the core dump shows that tl_replica and its copy, have a different ref counted object - what is the copy of tl_replica?
Pardon me, I did not provide context. The crash always happens here:
2013 optional<Replica::Summary> ServerFamily::GetReplicaSummary() const {
2014 /// Without lock
2015 if (use_replica_of_v2_) {
2016 if (tl_replica == nullptr) {
2017 return nullopt;
2018 }
2019 auto replica = tl_replica;
2020 return replica->GetSummary();
2021 }
Specifically, when we return and when we destruct the Replica object (which destructs the ProtocolClient).
When I used gdb to inspect the core dump, I saw the segfault happened in the destructor of the ProtocolClient (in line 2020 --> on the return statement). Then I printed both tl_replica and replica and they both pointed to the same object with a different ref count (one was zero, the other one some arbitrary value which I don't remember).
Then I said, maybe we somehow end up with a different control block because of how we copy. I added a CHECK that both replica and tl_replica point to the exact same control block. It did not trigger.
Lastly, as I wrote above, I think I understand what happens:
if (sock_) {
std::error_code ec;
sock_->proactor()->Await([this, &ec]() { ec = sock_->Close(); });
LOG_IF(ERROR, ec) << "Error closing socket " << ec;
}
This is the destructor of ProtocolClient. Now imagine we update all the tl_replicas (let's say because of a ReplicaOf command). The last assignment to tl_replica (on whichever thread) will drop the ref count to 0. At this point it will preempt (because we dispatch to call Close on sock_). Now INFO ALL gets called and we grab a copy of tl_replica while it's being destructed, and we trigger UB. This is the timeline:
- tl_replica is assigned to a new object, drops the previous ref count to zero, and preempts in the destructor of Replica
- INFO ALL gets called, we grab a copy of tl_replica which is being destructed - boom
All in all, let me first verify this hypothesis and if it's wrong I will setup an environment for us to tag team.
And I did not have the time to verify this because I literally left for holidays the evening I came up with this hypothesis 😄
After looking at the core dump and adding a few more logs:
server_family.cc:4205] proactor: 0x33138180800 setting 0x3313c0d0890 to 0x3313c0d0f90
server_family.cc:4205] proactor: 0x33138180300 setting 0x3313c0d0890 to 0x3313c0d0f90
Within the body of ReplicaOf() command we update the tl_replica on each proactor to the new replica_ object. Proactor 0x33138180800 drops the use count from 2 to 1. Then the assignment tl_replica = repl observes the use count is 1 and calls ~Replica which fiber preempts as it calls Await(). tl_replica is now in an intermediate state.
void ServerFamily::UpdateReplicationThreadLocals(SharedPtrW<Replica> repl) {
  service_.proactor_pool().AwaitFiberOnAll([repl](auto index, auto* context) {
    LOG(INFO) << "proactor: " << ProactorBase::me() << " setting " << tl_replica << " to " << repl;
    tl_replica = repl;
and deep inside shared_ptr::operator=
_Sp_counted_base<_Lp>* __tmp = __r._M_pi;
if (__tmp != _M_pi)
  {
    if (__tmp != nullptr)
      __tmp->_M_add_ref_copy();
    if (_M_pi != nullptr)
      _M_pi->_M_release();  /* We release and fiber preempt here */
    _M_pi = __tmp;
  }
return *this;
replica.cc:101] Destructing Replica: 0x3313c0d0890 on thread 0x33138180300
protocol_client.cc:122] Destructing Protocol: 0x3313c0d0890
I20251016 13:16:15.320950 64356 server_family.cc:2020] copying 0x3313c0d0f90 on thread 0x33138180300
We just copied tl_replica in the middle of assignment.
protocol_client.cc:130] Destructing Protocol complete: 0x3313c0d0890
After _M_pi->_M_release()
crash follows.
Then, to verify this is indeed the case, I made sure that tl_replica is never in a partial state even if its destructor preempts (basically just auto tmp = tl_replica; and then tl_replica = repl) and the test never fails. It was indeed the case I imagined 🤓
Can we change the Replica design so that its destructor won't preempt? Basically, pull out all the non-trivial code into a Shutdown method that is explicitly called, like we do in many other cases.
So
- will add the one-liner fix in this PR and remove the changes to the destructor (and outgoing_slot_migration) to reduce the PR size
- will follow up with a second PR. It's just buggy to preempt and do IO in the destructor.
change the Replica design so that its destructor won't preempt? Basically pull out all the non-trivial code into a Shutdown method that is explicitly called like we do in many other cases.
This is exactly what I did by removing the preemptive code (the Await()) in the destructor, and that's why you see the explicit calls of Close(), which also shuts down the socket outside of the destructor.
As I said, to keep it simple to review, I will split this in a follow up PR since it also requires the changes in outgoing_slot_migration. This is the answer to your other question also: #5864 (comment)
With all the changes around replication, I will follow up with a tidy PR 😄
add_replica->StartMainReplicationFiber(nullopt);
cluster_replicas_.push_back(std::move(add_replica));
service_.proactor_pool().AwaitFiberOnAll(
    [this](auto index, auto* cntx)
this is not the right way to do it. cluster_replicas_ can change in between.
instead you should capture it in the callback by value instead of this.
How can cluster_replicas_ change in between if all modifications to this vector happen under the same mutex replicaof_mu_? 🤔
ReplicaOfNoOne, ReplicaOf, Shutdown etc. all lock the same mutex before the object gets modified.
// Please note that we do not use the Metrics object here.
{
  if (use_replica_of_v2_) {
    // Deep copy because tl_replica might be overwritten in between
I find the "Deep copy" term confusing here.
- You copy a shared_ptr, which is not a deep copy, imho.
- But then you copy cluster_replicas, which is indeed a (deep) copy of a vector.
I agree the comment is misleading, will edit
}

bool should_cancel_flows = false;
absl::Cleanup on_exit([this]() { CloseSocket(); });
is it an existing bug that is orthogonal to this change?
using strings::HumanReadableNumBytes;

// Initialized by REPLICAOF
thread_local std::shared_ptr<Replica> tl_replica = nullptr;
please use ServerState for this
- Please prepare a "crash" branch for me to investigate
- If the outgoing_slot_migration change is a bug fix, I would like to ask you to send it in another PR. Also, retire the info_replication_valkey_compatible flag in the same PR by removing the "false" branch. Let's submit changes we are more confident about first.
  auto* Proactor() const {
-   return sock_->proactor();
+   DCHECK(socket_thread_);
+   return socket_thread_;
why do you need socket_thread_ ?
do we access ProtocolClient from multiple threads?
why do we need last_io_time_ to be atomic?
I see now. The only method we access from multiple threads is Replica::GetSummary, and you are fixing existing bugs. See my other PR comments. Let's group all the fixes in a preliminary PR first.
There is a problem with this design. You use shared_ptr to be able to access the object even if it's "released" in another thread. But

I don't see any issue in destructing in a separate thread because by the time we call the destructor of
The destructor is "trivial" from that aspect (of course we had preemption because of the Await, which I removed)

I prefer the opposite:
Following #5774, this PR removes locking the replica_of_mu_ from the info command and uses the thread locals and the replicaof v2 algorithm. No more locking and blocking for the INFO command during replicaof or takeover 🥳 🎉 🌮