No metadata leader because expected cluster size is larger than real size #6403

Open
pcsegal opened this issue Jan 24, 2025 · 14 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@pcsegal

pcsegal commented Jan 24, 2025

Observed behavior

I'm seeing an issue in a NATS cluster (a Kubernetes StatefulSet) with 5 pods. It's a simple cluster with JetStream enabled, with each peer having routes to all 5 peers.

Unexpected behavior started happening: JetStream can no longer find a meta leader.

This is the output of nats server report jetstream, run with the latest version of the NATS CLI:

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                       JetStream Summary                                                       │
├────────┬──────────────────────┬────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬──────────────┤
│ Server │ Cluster              │ Domain │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err      │
├────────┼──────────────────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼──────────────┤
│ nats-0 │ xxxxxxxxxxxxxxx_nats │ ...    │ 15      │ 4         │ 484,777  │ 53 MiB  │ 0 B    │ 53 MiB  │ 9       │ 8 / 88.888%  │
│ nats-1 │ xxxxxxxxxxxxxxx_nats │ ...    │ 20      │ 7         │ 135,958  │ 20 MiB  │ 0 B    │ 20 MiB  │ 1       │ 1 / 100%     │
│ nats-2 │ xxxxxxxxxxxxxxx_nats │ ...    │ 26      │ 8         │ 28,407   │ 3.7 MiB │ 0 B    │ 3.7 MiB │ 2       │ 2 / 100%     │
│ nats-3 │ xxxxxxxxxxxxxxx_nats │ ...    │ 10      │ 4         │ 244,338  │ 21 MiB  │ 0 B    │ 21 MiB  │ 14      │ 12 / 85.714% │
│ nats-4 │ xxxxxxxxxxxxxxx_nats │ ...    │ 33      │ 14        │ 25,097   │ 2.0 MiB │ 0 B    │ 2.0 MiB │ 3       │ 3 / 100%     │
├────────┼──────────────────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼──────────────┤
│        │                      │        │ 104     │ 37        │ 918,577  │ 100 MIB │ 0 B    │ 100 MIB │ 29      │ 26           │
╰────────┴──────────────────────┴────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴──────────────╯


WARNING: No cluster meta leader found. The cluster expects 10 nodes but only 5 responded. JetStream operation require at least 6 up nodes.

So it looks like the meta cluster is expected to have 10 peers, even though only 5 are listed, and no offline peers are shown.

Everything indicates that the number of replicas didn't change at any point. The StatefulSet has no Horizontal Pod Autoscaler, and according to the Prometheus metrics for StatefulSet replica counts, the number of replicas has remained at 5.

This is the output of nats server request jetstream:

{"server":{"name":"nats-0","host":"0.0.0.0","id":"NDFD72V63NHS6KBRYRC3T2F5IZSHOMLSTWMD4CVAUBQMRLOTCL3PBCIM","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":259,"time":"2025-01-24T10:18:45.329248609Z"},"data":{"server_id":"NDFD72V63NHS6KBRYRC3T2F5IZSHOMLSTWMD4CVAUBQMRLOTCL3PBCIM","now":"2025-01-24T10:18:45.329216868Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":56017416,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":9,"errors":8},"streams":15,"consumers":4,"messages":484777,"bytes":56017416,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-2","host":"0.0.0.0","id":"NCSRCWMYELMB2Q5FWQ23HS36ATQSRNPAQ5T6TBKQ5OEVXCKI5ZCFLSKC","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":277,"time":"2025-01-24T10:18:45.329497212Z"},"data":{"server_id":"NCSRCWMYELMB2Q5FWQ23HS36ATQSRNPAQ5T6TBKQ5OEVXCKI5ZCFLSKC","now":"2025-01-24T10:18:45.329465369Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":3878208,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":2,"errors":2},"streams":26,"consumers":8,"messages":28407,"bytes":3878208,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-1","host":"0.0.0.0","id":"NC5ZXUOB4DKVDJRCYFSAR2B2L4S4MYZUBTVLIT3563SGKMQQOKBSG5PV","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":233,"time":"2025-01-24T10:18:45.329540793Z"},"data":{"server_id":"NC5ZXUOB4DKVDJRCYFSAR2B2L4S4MYZUBTVLIT3563SGKMQQOKBSG5PV","now":"2025-01-24T10:18:45.329505955Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":21412910,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":1,"errors":1},"streams":20,"consumers":7,"messages":135958,"bytes":21412910,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-4","host":"0.0.0.0","id":"NDG77CEBU5M5LOFC6PEXG7EHKPI7IMMGKC3GHXIBLW6RSMDIELGHB7GR","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":384,"time":"2025-01-24T10:18:45.330000931Z"},"data":{"server_id":"NDG77CEBU5M5LOFC6PEXG7EHKPI7IMMGKC3GHXIBLW6RSMDIELGHB7GR","now":"2025-01-24T10:18:45.32996075Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":2053477,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":3,"errors":3},"streams":33,"consumers":14,"messages":25097,"bytes":2053477,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-3","host":"0.0.0.0","id":"NCK7RDZUVTB3L3XIDOJTQRDW5OFM2W4ZXDD5QC5NSD7MNC5LXYZWCBW3","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":105,"time":"2025-01-24T10:18:45.327529105Z"},"data":{"server_id":"NCK7RDZUVTB3L3XIDOJTQRDW5OFM2W4ZXDD5QC5NSD7MNC5LXYZWCBW3","now":"2025-01-24T10:18:45.327491681Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":22000539,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":0,"errors":0},"streams":10,"consumers":4,"messages":244338,"bytes":22000539,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}

Expected behavior

The expected meta cluster size should have remained at 5. If there are additional offline nodes that can no longer be contacted, the JetStream report from the NATS CLI should show them as offline peers so that we can remove them and fix the expected cluster size.

Server and client version

NATS server version: 2.10.22.
NATS client version: 0.1.6.

Host environment

Server was running in Kubernetes 1.29.9 on Debian 11.

Steps to reproduce

I don't have any information on how to reproduce it yet.

@pcsegal pcsegal added the defect (Suspected defect such as a bug or regression) label Jan 24, 2025
@neilalexander
Member

Do the servers have stable storage? i.e. they haven't had their volumes ripped out from under them during a rolling restart?

@pcsegal
Author

pcsegal commented Jan 24, 2025

Yes, they are attached to persistent volumes.
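
(For context, this can be double-checked with something along the following lines; the label selector is illustrative and assumes the chart's default labels:

kubectl get pvc -l app.kubernetes.io/name=nats

Each pod shows a Bound PVC backed by its own persistent volume.)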

@neilalexander
Member

Can you please get the output of /raftz?group=_meta_ from the monitoring port of one of the servers?
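
(A minimal sketch of how to fetch this, assuming the default monitoring port 8222 and port-forwarding to one of the pods; adjust names and ports to your deployment:

kubectl port-forward pod/nats-0 8222:8222
curl -s 'http://localhost:8222/raftz?group=_meta_'
)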

@pcsegal
Author

pcsegal commented Jan 24, 2025

Sure, this is the output (obtained from peer nats-0):

{
   "$SYS": {
      "_meta_": {
         "id": "S1Nunr6R",
         "state": "FOLLOWER",
         "size": 10,
         "quorum_needed": 6,
         "committed": 675182,
         "applied": 675182,
         "ever_had_leader": false,
         "term": 1191202101440508490,
         "voted_for": "HuYMtjaW",
         "pterm": 2251799813947597,
         "pindex": 675182,
         "ipq_proposal_len": 0,
         "ipq_entry_len": 0,
         "ipq_resp_len": 0,
         "ipq_apply_len": 0,
         "wal": {
            "messages": 0,
            "bytes": 0,
            "first_seq": 675183,
            "first_ts": "0001-01-01T00:00:00Z",
            "last_seq": 675182,
            "last_ts": "2025-01-17T15:18:02.207512974Z",
            "consumer_count": 0
         },
         "peers": {
            "\"kCGheKT": {
               "name": "",
               "known": true
            },
            "HuYMtjaW": {
               "name": "nats-4",
               "known": true,
               "last_seen": "1.640344676s"
            },
            "bkCGheKT": {
               "name": "nats-3",
               "known": true,
               "last_seen": "6.461742963s"
            },
            "cnr4t3eg": {
               "name": "",
               "known": true
            },
            "cnrtt3eg": {
               "name": "nats-2",
               "known": true,
               "last_seen": "6.462911285s"
            },
            "cnrtu3eg": {
               "name": "",
               "known": true
            },
            "yrzKKR@u": {
               "name": "",
               "known": true
            },
            "yrzKKRBt": {
               "name": "",
               "known": true
            },
            "yrzKKRBu": {
               "name": "nats-1",
               "known": true,
               "last_seen": "5.647085947s"
            }
         }
      }
   }
}

@neilalexander
Member

Have you experienced some disk corruption? Those surplus peer IDs look suspiciously mangled. NATS never generates peer IDs that contain non-alphanumeric characters (and never has done).

@pcsegal
Author

pcsegal commented Jan 24, 2025

No disk corruption as far as I know; however, this issue started happening after some chaos tests were performed on the network, which included the NATS cluster. These tests introduced packet corruption (single-bit errors at random offsets) into the network packets to/from the containers.

@neilalexander
Member

By what method did you introduce packet corruption? Was it by a proxy process or were the packets being rewritten with "fixed" TCP checksums?

Ordinarily, packet corruption (e.g. from flipped bits, faulty NICs/cables, or wireless interference) would be detected by the network driver or operating system due to failing checksums, so the mangled data would never normally reach the NATS Server process; it would either be retransmitted or the connection would be reset.

So whatever method was used here appears to have corrupted the data in transit while also fixing up the checksums, so that no other part of the stack noticed.

That would indeed explain how this happened, though: the Raft append entries would have had their origin IDs mangled, and the cluster would have treated those new IDs as newly participating nodes, in effect "scaling up" the cluster.

@pcsegal
Author

pcsegal commented Jan 24, 2025

It was via a Steadybit action (https://hub.steadybit.com/action/com.steadybit.extension_container.network_package_corruption), which is supposed to test network resilience to bit errors. In principle, I would expect this not to change any checksums.

@neilalexander
Member

We'll need to look into the exact mechanism there, but for now your best bet for fixing the existing cluster is to add a 6th node, which should give you quorum (6/11 > 50%) and therefore allow a metaleader to be elected.

Then peer-remove the corrupted IDs from the metalayer using nats server cluster peer-remove -f <ID>, where <ID> is each of the corrupted/unseen peer IDs from the raftz output. Afterwards, you should be able to peer-remove and take down the new 6th server to clean up.

You may find that this has also affected some replicated assets, but once the metaleader is up, you can probably resolve that by scaling the affected assets down to R1 and back up again.
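
For reference, a rough sketch of that sequence using the phantom peer IDs visible in the raftz output above; it assumes the StatefulSet is named nats and is an illustration rather than an exact procedure:

# temporarily add a 6th node so the meta group can reach quorum (6/11)
kubectl scale statefulset nats --replicas=6

# once a metaleader has been elected, remove each corrupted/unseen peer ID
nats server cluster peer-remove -f '"kCGheKT'
nats server cluster peer-remove -f 'cnr4t3eg'
nats server cluster peer-remove -f 'cnrtu3eg'
nats server cluster peer-remove -f 'yrzKKR@u'
nats server cluster peer-remove -f 'yrzKKRBt'

# for any replicated assets that were affected, scale them down to R1 and back up, e.g.
nats stream edit <stream> --replicas=1
nats stream edit <stream> --replicas=3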

@pcsegal
Author

pcsegal commented Jan 24, 2025

Thank you. Is there any specific action I should take to safely remove the 6th peer at the end of the process? Or do I just scale the StatefulSet back down and remove the peer with nats server cluster peer-remove?

@neilalexander
Member

It should be safe to just scale it back down and peer-remove it, so long as no new replicated assets have been created in that time that could have ended up on the 6th node (i.e. new streams).
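
(Roughly, and assuming the 6th pod is named nats-5:

kubectl scale statefulset nats --replicas=5
nats server cluster peer-remove -f nats-5   # or use the peer ID shown in raftz
)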

@pcsegal
Author

pcsegal commented Jan 24, 2025

Thank you. Adding a 6th node allowed the cluster to elect a metadata leader again. However, there was an issue: it looks like the streams / KVs were orphaned as a result of the corruption and ended up being recreated. Given that these streams have R=1, I wonder if that could be related to #5767.

So, many of these streams / KVs ended up being recreated on the new 6th node. In this case, since the streams were recreated anyway, we just took the 6th peer offline, removed it and let the streams be recreated again.

@pcsegal
Author

pcsegal commented Jan 24, 2025

To give a bit more detail on what happened, we got lines like this in the logs right after adding the 6th node:

Detected orphaned stream xxxxxxxx, will cleanup

Is that expected? Is there a way to stop streams from being cleaned up automatically in inconsistency scenarios like this one?

@wallyqs
Member

wallyqs commented Jan 24, 2025

@pcsegal In v2.10.25 we added some protections to prevent bad snapshots from causing orphaned streams to be deleted.
