No metadata leader because expected cluster size is larger than real size #6403
Comments
Do the servers have stable storage, i.e. they haven't had their volumes ripped out from under them during a rolling restart?
Yes, they are attached to persistent volumes.
Can you please get the output of …?
Sure, this is the output (gotten from peer nats-0):
Have you experienced some disk corruption? Those surplus peer IDs look suspiciously mangled. NATS never generates peer IDs that contain non-alphanumeric characters (and never has done).
No disk corruption as far as I know. However, this issue started happening after some chaos tests were performed on the network, which included the NATS cluster. These tests introduced packet corruption (single-bit errors at random offsets) into the network packets to/from the containers.
By what method did you introduce packet corruption? Was it by a proxy process, or were the packets being rewritten with "fixed" TCP checksums? Ordinarily packet corruption (i.e. from flipped bits, faulty NICs/cables, wireless interference) would be detected by the network driver or operating system due to failing checksums, so the mangled data would never normally reach the NATS Server process; it would either be retransmitted or the connection would be reset. So whatever method was used here seems to have both corrupted the data in transit and fixed up the checksums, so that no other part of the stack noticed. That would indeed explain how this happened, though: the Raft append entries would have had their origin IDs mangled, and the cluster would have treated those new IDs as newly-participating nodes, in effect "scaling up" the cluster.
It was via a Steadybit action (https://hub.steadybit.com/action/com.steadybit.extension_container.network_package_corruption), which is supposed to test network resilience to bit errors. In principle, I would expect this not to change any checksums.
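From what I can tell, the action applies the corruption at the container network level, presumably with Linux netem or something similar under the hood; a rule of that kind would look roughly like the sketch below (the interface name and corruption rate are just illustrative, not Steadybit's actual implementation).

```sh
# Illustrative only: netem-style corruption inside a container's network
# namespace. "eth0" and the 1% rate are placeholders.
tc qdisc add dev eth0 root netem corrupt 1%

# Remove the rule again once the experiment is over.
tc qdisc del dev eth0 root netem
```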
Will need to look into the exact mechanism there, but for now your best bet to fix the existing cluster is to add a 6th node, which should give you quorum (6/11 > 50%) and therefore it should elect a metaleader. Then peer-remove the corrupted IDs from the metalayer. You may find that this has also affected some replicated assets, but once the metaleader is up, you can probably resolve that by scaling the affected assets down to R1 and back up again.
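Roughly along these lines; the StatefulSet name, peer ID and stream name below are placeholders, and exact flags may vary by CLI version:

```sh
# 1. Scale up to 6 replicas so the meta group can reach quorum again.
kubectl scale statefulset nats --replicas=6

# 2. Once a metaleader has been elected, remove each mangled ghost peer.
nats server raft peer-remove <corrupted-peer-id>

# 3. For any affected replicated assets, drop them to R1 and scale back up.
nats stream edit ORDERS --replicas=1
nats stream edit ORDERS --replicas=3
```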
Thank you. Is there any specific action I should take to safely remove the 6th peer at the end of the process? Or do I just scale the StatefulSet back down and remove the peer with peer-remove?
It should be safe to just scale it back down and peer-remove it, so long as no new replicated assets have been created in that time that could have ended up on the 6th node (i.e. new streams).
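Something along these lines, assuming the 6th pod is nats-5 and the server name matches the pod name:

```sh
# Scale the StatefulSet back down to the original 5 replicas...
kubectl scale statefulset nats --replicas=5

# ...then remove the now-offline 6th server from the meta group.
nats server raft peer-remove nats-5
```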
Thank you. Adding a 6th node allowed the cluster to elect a new metadata leader again. However, there was an issue: it looks like the streams / KVs were orphaned due to the corruption issue and ended up being recreated. Given that these streams have R=1, I wonder if that could be related to this: #5767. Many of these streams / KVs ended up being recreated on the new 6th node. Since the streams were recreated anyway, we just took the 6th peer offline, removed it, and let the streams be recreated again.
To give a bit more detail about what happened, we got lines like this in the logs right after adding the 6th node:
Is that expected? Is there a way we can stop streams from being cleaned up automatically in inconsistency scenarios like this one?
@pcsegal on v2.10.25 we added some protections to prevent bad snapshots from causing orphan streams to be deleted.
Observed behavior
I'm seeing an issue in a NATS cluster (a Kubernetes StatefulSet) with 5 pods. It's a simple NATS cluster with JetStream enabled, with each peer having routes to all 5 peers.
An unexpected behavior started happening: JetStream can no longer find a meta leader.
This is the output of `nats server report jetstream`, run with the latest version of the NATS CLI:

So, it looks like the meta cluster is expected to have 10 peers, even though only 5 peers are shown. No offline peers are shown.
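If the meta layer follows the usual Raft majority rule (which I assume it does), an expected cluster size of 10 would mean a leader needs at least floor(10/2) + 1 = 6 votes, so the 5 real peers could never reach quorum on their own.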
Everything indicates that the number of replicas didn't change at any point. The StatefulSet has no Horizontal Pod Autoscaler, and according to the Prometheus metrics that track the number of StatefulSet replicas, the replica count has remained at 5.
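For reference, this is roughly how the replica count can be checked; the namespace, StatefulSet name and Prometheus address below are placeholders, and the metric comes from kube-state-metrics:

```sh
# Desired replica count straight from the Kubernetes API.
kubectl -n nats get statefulset nats -o jsonpath='{.spec.replicas}'

# Replica count over time, as recorded by kube-state-metrics in Prometheus.
curl -sG 'http://prometheus:9090/api/v1/query' \
  --data-urlencode 'query=kube_statefulset_status_replicas{namespace="nats",statefulset="nats"}'
```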
This is the output of `nats server request jetstream`:

Expected behavior
It looks like the expected meta cluster size should have remained at 5. If there were additional offline nodes that can no longer be contacted, the JetStream report from the NATS CLI should show the offline peers so that we can remove them to fix the expected cluster size.
Server and client version
NATS server version: 2.10.22.
NATS client version: 0.1.6.
Host environment
Server was running in Kubernetes 1.29.9 on Debian 11.
Steps to reproduce
I don't have any information on how to reproduce it as of yet.