No metadata leader because expected cluster size is larger than real size #6403

Open
pcsegal opened this issue Jan 24, 2025 · 14 comments
Labels
defect Suspected defect such as a bug or regression

Comments

@pcsegal

pcsegal commented Jan 24, 2025

Observed behavior

I'm seeing an issue in a NATS cluster (a Kubernetes StatefulSet) with 5 pods. It's a simple cluster with JetStream enabled, with each peer having routes to all 5 peers.

Unexpected behavior started happening: JetStream can no longer find a meta leader.

This is the output of nats server report jetstream, run with the latest version of the NATS CLI:

╭───────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│                                                       JetStream Summary                                                       │
├────────┬──────────────────────┬────────┬─────────┬───────────┬──────────┬─────────┬────────┬─────────┬─────────┬──────────────┤
│ Server │ Cluster              │ Domain │ Streams │ Consumers │ Messages │ Bytes   │ Memory │ File    │ API Req │ API Err      │
├────────┼──────────────────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼──────────────┤
│ nats-0 │ xxxxxxxxxxxxxxx_nats │ ...    │ 15      │ 4         │ 484,777  │ 53 MiB  │ 0 B    │ 53 MiB  │ 9       │ 8 / 88.888%  │
│ nats-1 │ xxxxxxxxxxxxxxx_nats │ ...    │ 20      │ 7         │ 135,958  │ 20 MiB  │ 0 B    │ 20 MiB  │ 1       │ 1 / 100%     │
│ nats-2 │ xxxxxxxxxxxxxxx_nats │ ...    │ 26      │ 8         │ 28,407   │ 3.7 MiB │ 0 B    │ 3.7 MiB │ 2       │ 2 / 100%     │
│ nats-3 │ xxxxxxxxxxxxxxx_nats │ ...    │ 10      │ 4         │ 244,338  │ 21 MiB  │ 0 B    │ 21 MiB  │ 14      │ 12 / 85.714% │
│ nats-4 │ xxxxxxxxxxxxxxx_nats │ ...    │ 33      │ 14        │ 25,097   │ 2.0 MiB │ 0 B    │ 2.0 MiB │ 3       │ 3 / 100%     │
├────────┼──────────────────────┼────────┼─────────┼───────────┼──────────┼─────────┼────────┼─────────┼─────────┼──────────────┤
│        │                      │        │ 104     │ 37        │ 918,577  │ 100 MIB │ 0 B    │ 100 MIB │ 29      │ 26           │
╰────────┴──────────────────────┴────────┴─────────┴───────────┴──────────┴─────────┴────────┴─────────┴─────────┴──────────────╯


WARNING: No cluster meta leader found. The cluster expects 10 nodes but only 5 responded. JetStream operation require at least 6 up nodes.

So it looks like the meta cluster is expected to have 10 peers, even though only 5 are listed, and no offline peers are shown.

Everything indicates that the number of replicas didn't change at any point. The StatefulSet has no Horizontal Pod Autoscaler, and according to the Prometheus metrics for StatefulSet replica counts, the number of replicas has remained at 5.

This is the output of nats server request jetstream:

{"server":{"name":"nats-0","host":"0.0.0.0","id":"NDFD72V63NHS6KBRYRC3T2F5IZSHOMLSTWMD4CVAUBQMRLOTCL3PBCIM","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":259,"time":"2025-01-24T10:18:45.329248609Z"},"data":{"server_id":"NDFD72V63NHS6KBRYRC3T2F5IZSHOMLSTWMD4CVAUBQMRLOTCL3PBCIM","now":"2025-01-24T10:18:45.329216868Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":56017416,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":9,"errors":8},"streams":15,"consumers":4,"messages":484777,"bytes":56017416,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-2","host":"0.0.0.0","id":"NCSRCWMYELMB2Q5FWQ23HS36ATQSRNPAQ5T6TBKQ5OEVXCKI5ZCFLSKC","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":277,"time":"2025-01-24T10:18:45.329497212Z"},"data":{"server_id":"NCSRCWMYELMB2Q5FWQ23HS36ATQSRNPAQ5T6TBKQ5OEVXCKI5ZCFLSKC","now":"2025-01-24T10:18:45.329465369Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":3878208,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":2,"errors":2},"streams":26,"consumers":8,"messages":28407,"bytes":3878208,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-1","host":"0.0.0.0","id":"NC5ZXUOB4DKVDJRCYFSAR2B2L4S4MYZUBTVLIT3563SGKMQQOKBSG5PV","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":233,"time":"2025-01-24T10:18:45.329540793Z"},"data":{"server_id":"NC5ZXUOB4DKVDJRCYFSAR2B2L4S4MYZUBTVLIT3563SGKMQQOKBSG5PV","now":"2025-01-24T10:18:45.329505955Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":21412910,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":1,"errors":1},"streams":20,"consumers":7,"messages":135958,"bytes":21412910,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-4","host":"0.0.0.0","id":"NDG77CEBU5M5LOFC6PEXG7EHKPI7IMMGKC3GHXIBLW6RSMDIELGHB7GR","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":384,"time":"2025-01-24T10:18:45.330000931Z"},"data":{"server_id":"NDG77CEBU5M5LOFC6PEXG7EHKPI7IMMGKC3GHXIBLW6RSMDIELGHB7GR","now":"2025-01-24T10:18:45.32996075Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":2053477,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":3,"errors":3},"streams":33,"consumers":14,"messages":25097,"bytes":2053477,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}
{"server":{"name":"nats-3","host":"0.0.0.0","id":"NCK7RDZUVTB3L3XIDOJTQRDW5OFM2W4ZXDD5QC5NSD7MNC5LXYZWCBW3","cluster":"xxxxxxxxxxxxxxx_nats","domain":"...","ver":"2.10.22","jetstream":true,"flags":3,"seq":105,"time":"2025-01-24T10:18:45.327529105Z"},"data":{"server_id":"NCK7RDZUVTB3L3XIDOJTQRDW5OFM2W4ZXDD5QC5NSD7MNC5LXYZWCBW3","now":"2025-01-24T10:18:45.327491681Z","config":{"max_memory":1073741824,"max_storage":10737418240,"store_dir":"/data/jetstream","sync_interval":120000000000,"domain":"...","compress_ok":true},"memory":0,"storage":22000539,"reserved_memory":0,"reserved_storage":0,"accounts":2,"ha_assets":1,"api":{"total":0,"errors":0},"streams":10,"consumers":4,"messages":244338,"bytes":22000539,"meta_cluster":{"name":"xxxxxxxxxxxxxxx_nats","peer":"fqA4S4SK","cluster_size":10,"pending":0}}}

Expected behavior

The expected meta cluster size should have remained at 5. If there are additional offline nodes that can no longer be contacted, the JetStream report from the NATS CLI should show them as offline peers so that we can remove them and fix the expected cluster size.

Server and client version

NATS server version: 2.10.22.
NATS client version: 0.1.6.

Host environment

Server was running in Kubernetes 1.29.9 on Debian 11.

Steps to reproduce

I don't have any information on how to reproduce it yet.

@pcsegal pcsegal added the defect (Suspected defect such as a bug or regression) label Jan 24, 2025
@neilalexander
Member

Do the servers have stable storage? i.e. they haven't had their volumes ripped out from under them during a rolling restart?

@pcsegal
Author

pcsegal commented Jan 24, 2025

Yes, they are attached to persistent volumes.
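
(For context, this can be double-checked with something along the following lines; the label selector is illustrative and assumes the chart's default labels:

kubectl get pvc -l app.kubernetes.io/name=nats

Each pod shows a Bound PVC backed by its own persistent volume.)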

@neilalexander
Member

Can you please get the output of /raftz?group=_meta_ from the monitoring port of one of the servers?
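
(A minimal sketch of how to fetch this, assuming the default monitoring port 8222 and port-forwarding to one of the pods; adjust names and ports to your deployment:

kubectl port-forward pod/nats-0 8222:8222
curl -s 'http://localhost:8222/raftz?group=_meta_'
)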

@pcsegal
Author

pcsegal commented Jan 24, 2025

Sure, this is the output (obtained from peer nats-0):

{
   "$SYS": {
      "_meta_": {
         "id": "S1Nunr6R",
         "state": "FOLLOWER",
         "size": 10,
         "quorum_needed": 6,
         "committed": 675182,
         "applied": 675182,
         "ever_had_leader": false,
         "term": 1191202101440508490,
         "voted_for": "HuYMtjaW",
         "pterm": 2251799813947597,
         "pindex": 675182,
         "ipq_proposal_len": 0,
         "ipq_entry_len": 0,
         "ipq_resp_len": 0,
         "ipq_apply_len": 0,
         "wal": {
            "messages": 0,
            "bytes": 0,
            "first_seq": 675183,
            "first_ts": "0001-01-01T00:00:00Z",
            "last_seq": 675182,
            "last_ts": "2025-01-17T15:18:02.207512974Z",
            "consumer_count": 0
         },
         "peers": {
            "\"kCGheKT": {
               "name": "",
               "known": true
            },
            "HuYMtjaW": {
               "name": "nats-4",
               "known": true,
               "last_seen": "1.640344676s"
            },
            "bkCGheKT": {
               "name": "nats-3",
               "known": true,
               "last_seen": "6.461742963s"
            },
            "cnr4t3eg": {
               "name": "",
               "known": true
            },
            "cnrtt3eg": {
               "name": "nats-2",
               "known": true,
               "last_seen": "6.462911285s"
            },
            "cnrtu3eg": {
               "name": "",
               "known": true
            },
            "yrzKKR@u": {
               "name": "",
               "known": true
            },
            "yrzKKRBt": {
               "name": "",
               "known": true
            },
            "yrzKKRBu": {
               "name": "nats-1",
               "known": true,
               "last_seen": "5.647085947s"
            }
         }
      }
   }
}

@neilalexander
Member

Have you experienced some disk corruption? Those surplus peer IDs look suspiciously mangled. NATS never generates peer IDs that contain non-alphanumeric characters (and never has done).

@pcsegal
Author

pcsegal commented Jan 24, 2025

No disk corruption as far as I know; however, this issue started happening after some chaos tests were performed on the network, which included the NATS cluster. These tests introduced packet corruption (single-bit errors at random offsets) into the network packets to/from the containers.

@neilalexander
Member

By what method did you introduce packet corruption? Was it by a proxy process or were the packets being rewritten with "fixed" TCP checksums?

Ordinarily, packet corruption (e.g. from flipped bits, faulty NICs/cables, or wireless interference) would be detected by the network driver or operating system due to failing checksums, so the mangled data would never normally reach the NATS Server process; it would either be retransmitted or the connection would be reset.

So whatever method was used here appears to have corrupted the data in transit while also fixing up the checksums, so that no other part of the stack noticed.

That would indeed explain how this happened, though: the Raft append entries would have had their origin IDs mangled, and the cluster would have treated those new IDs as newly participating nodes, in effect "scaling up" the cluster.

@pcsegal
Author

pcsegal commented Jan 24, 2025

It was via a Steadybit action (https://hub.steadybit.com/action/com.steadybit.extension_container.network_package_corruption), which is supposed to test network resilience to bit errors. In principle, I would expect this not to change any checksums.

@neilalexander
Member

We'll need to look into the exact mechanism there, but for now your best bet for fixing the existing cluster is to add a 6th node, which should give you quorum (6/11 > 50%) and therefore allow a metaleader to be elected.

Then peer-remove the corrupted IDs from the metalayer using nats server cluster peer-remove -f <ID>, where <ID> is each of the corrupted/unseen peer IDs from the raftz output. Afterwards, you should be able to peer-remove and take down the new 6th server to clean up.

You may find that this has also affected some replicated assets, but once the metaleader is up, you can probably resolve that by scaling the affected assets down to R1 and back up again.
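
For reference, a rough sketch of that sequence using the phantom peer IDs visible in the raftz output above; it assumes the StatefulSet is named nats and is an illustration rather than an exact procedure:

# temporarily add a 6th node so the meta group can reach quorum (6/11)
kubectl scale statefulset nats --replicas=6

# once a metaleader has been elected, remove each corrupted/unseen peer ID
nats server cluster peer-remove -f '"kCGheKT'
nats server cluster peer-remove -f 'cnr4t3eg'
nats server cluster peer-remove -f 'cnrtu3eg'
nats server cluster peer-remove -f 'yrzKKR@u'
nats server cluster peer-remove -f 'yrzKKRBt'

# for any replicated assets that were affected, scale them down to R1 and back up, e.g.
nats stream edit <stream> --replicas=1
nats stream edit <stream> --replicas=3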

@pcsegal
Author

pcsegal commented Jan 24, 2025

Thank you. Is there any specific action I should take to safely remove the 6th peer at the end of the process? Or do I just scale the StatefulSet back down and remove the peer with nats server cluster peer-remove?

@neilalexander
Member

It should be safe to just scale it back down and peer-remove it, so long as no new replicated assets have been created in that time that could have ended up on the 6th node (i.e. new streams).
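
(Roughly, and assuming the 6th pod is named nats-5:

kubectl scale statefulset nats --replicas=5
nats server cluster peer-remove -f nats-5   # or use the peer ID shown in raftz
)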

@pcsegal
Author

pcsegal commented Jan 24, 2025

Thank you. Adding a 6th node allowed the cluster to elect a metadata leader again. However, there was an issue: it looks like the streams / KVs were orphaned as a result of the corruption and ended up being recreated. Given that these streams have R=1, I wonder if that could be related to #5767.

So, many of these streams / KVs ended up being recreated on the new 6th node. In this case, since the streams were recreated anyway, we just took the 6th peer offline, removed it and let the streams be recreated again.

@pcsegal
Author

pcsegal commented Jan 24, 2025

To give a bit more detail on what happened, we got lines like this in the logs right after adding the 6th node:

Detected orphaned stream xxxxxxxx, will cleanup

Is that expected? Is there a way to stop streams from being cleaned up automatically in inconsistency scenarios like this one?

@wallyqs
Member

wallyqs commented Jan 24, 2025

@pcsegal In v2.10.25 we added some protections to prevent bad snapshots from causing orphaned streams to be deleted.
