[DOCS] Master cluster formation troubleshooting. Opster Migration #950

Open
wants to merge 3 commits into main

Conversation

thekofimensah
Contributor

  1. Added a section on issues with forming a cluster during initial setup
  2. Added other causes of instability

@thekofimensah
Contributor Author

@georgewallace

@leemthompo
Contributor

Language LGTM, @DaveCTurner could you confirm the technical facts are straight? :)


If your cluster has never successfully formed before and you see this message in the logs:

`Master node not discovered yet this node has not previously joined a bootstrapped cluster`

Contributor

The message is `master not discovered yet, this node has not previously joined a bootstrapped cluster, and ...` (always with additional information).

- 192.168.1.2
- nodes.mycluster.com
```
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.

Contributor

Suggested change
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and must be removed on subsequent starts.

Please also link to the docs about the `initial_master_nodes` setting here.
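
For reference, a minimal `elasticsearch.yml` sketch combining the seed-host list above with the `cluster.initial_master_nodes` setting might look like the following (the host entries are taken from the snippet above; the node names are placeholders):

```
# Hosts to contact for discovery.
discovery.seed_hosts:
  - 192.168.1.2
  - nodes.mycluster.com

# Only needed when bootstrapping a brand-new cluster; per the draft above,
# list node names that match each node's node.name, not IP addresses.
cluster.initial_master_nodes:
  - master-node-1
  - master-node-2
  - master-node-3
```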


Only nodes with `node.master: true` are eligible to become master nodes and participate in elections. Make sure the nodes listed in `cluster.initial_master_nodes` are properly configured as master-eligible. Nodes with `node.voting_only: true` can participate in voting but cannot become master themselves. See [this guide](/deploy-manage/distributed-architecture/discovery-cluster-formation/discovery-hosts-providers.md) for more information.

An {{es}} cluster requires a quorum of master-eligible nodes to elect a master. A quorum is defined as `(N/2 + 1)`, where N is the number of master-eligible nodes. If fewer than this number are available, the cluster will not elect a master and will not form. This quorum mechanism helps prevent split-brain scenarios where multiple nodes mistakenly believe they are the master. For more details, see [Quorum-based decision making](../../deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md).

Contributor

Did you mean (N+1)/2? Normally N is odd so this is probably the most useful way to express it. Otherwise to be fully technically correct the formula is ⌈(N+1)/2⌉. Or we can just say "majority" instead of "quorum" and avoid all this.
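
As a quick illustration of the majority requirement: with three master-eligible nodes a majority is two, and with five it is three, matching ⌈(N+1)/2⌉. For the eligibility settings the draft names, a rough per-node sketch (using the same setting names as the draft; node names are placeholders) could be:

```
# On an ordinary master-eligible node:
node.name: master-node-1
node.master: true

# On a voting-only node (votes in elections but is never elected master):
# node.name: voting-only-node-1
# node.master: true
# node.voting_only: true
```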


`Master node not discovered yet this node has not previously joined a bootstrapped cluster`

This usually indicates a misconfiguration in your initial cluster settings. Note this is for self-hosted instances. In this case, verify the following:

Contributor

I'm not sure about "usually". Often I see clusters failing to form due to connectivity issues.

@@ -42,7 +69,7 @@ If the logs suggest that discovery or master elections are failing due to timeou

The threads involved in discovery and cluster membership are mainly `transport_worker` and `cluster_coordination` threads, for which there should never be a long wait. There may also be evidence of long waits for threads in the {{es}} logs, particularly looking at warning logs from `org.elasticsearch.transport.InboundHandler`. See [Networking threading model](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md#modules-network-threading-model) for more information.


If your cluster has recently lost one or more master-eligible nodes and the logs indicate that no master can be elected, verify that a quorum still exists. A master election requires a majority of the master-eligible nodes to be available (for example, 2 out of 3, or 3 out of 5). If the quorum cannot be met, the cluster will remain unformed until enough nodes are restored. This quorum mechanism is essential for ensuring consistency and preventing split-brain conditions.

Contributor

I think this duplicates the information above:

If the logs or the health report indicate that {{es}} can’t discover enough nodes to form a quorum...

and

If the logs or the health report indicate that {{es}} has discovered a possible quorum of nodes...

No need for the user to have to work out what a quorum/majority really is, and indeed they often get confused because they need a majority of the master-eligible nodes that previously made up the cluster; it's not enough to start some new nodes, because those nodes' votes won't yet count in the election. I'd rather we didn't add this paragraph.

@@ -53,6 +80,10 @@ When a node wins the master election, it logs a message containing `elected-as-m
* Packet captures will reveal system-level and network-level faults, especially if you capture the network traffic simultaneously at all relevant nodes and analyse it alongside the {{es}} logs from those nodes. You should be able to observe any retransmissions, packet loss, or other delays on the connections between the nodes.
* Long waits for particular threads to be available can be identified by taking stack dumps of the main {{es}} process (for example, using `jstack`) or a profiling trace (for example, using Java Flight Recorder) in the few seconds leading up to the relevant log message.

If your master node is also acting as a data node under heavy indexing or search load, this can cause instability. In clusters under high demand, it is recommended to use [dedicated master nodes](/deploy-manage/distributed-architecture/clusters-nodes-shards.md/node-roles#dedicated-master-node) (nodes configured with `node.master: true` and `node.data: false`) to reduce load and improve election reliability.

Contributor

I don't think this is true any more. We don't use dedicated master nodes at all in serverless, for instance.
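
For reference, the dedicated-master configuration the draft paragraph describes would look roughly like this with the setting names it uses (the node name is a placeholder):

```
# Dedicated master-eligible node: master role only, holds no data.
node.name: dedicated-master-1
node.master: true
node.data: false
```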
