[DOCS] Master cluster formation troubleshooting. Opster Migration #950

Open
wants to merge 3 commits into main

Conversation

thekofimensah
Contributor

  1. Added a section on issues with forming a cluster during initial setup
  2. Added other causes of instability

@thekofimensah
Contributor Author

@georgewallace

@leemthompo
Contributor

Language LGTM, @DaveCTurner could you confirm the technical facts are straight? :)


If your cluster has never successfully formed before and you see this message in the logs:

`Master node not discovered yet this node has not previously joined a bootstrapped cluster`

Contributor

The message is `master not discovered yet, this node has not previously joined a bootstrapped cluster, and ...` (always with additional information).

- 192.168.1.2
- nodes.mycluster.com
```
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.

Contributor

Suggested change
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and is ignored on subsequent starts.
2. For the first cluster startup, you must also configure `cluster.initial_master_nodes` with the node names (not IPs) of the initial set of master-eligible nodes. This setting is required when bootstrapping a new cluster and must be removed on subsequent starts.

Please also link to the docs about the `initial_master_nodes` setting here.
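
For reference, a minimal `elasticsearch.yml` sketch combining the seed-host list above with the `cluster.initial_master_nodes` setting might look like the following (the host entries are taken from the snippet above; the node names are placeholders):

```
# Hosts to contact for discovery.
discovery.seed_hosts:
  - 192.168.1.2
  - nodes.mycluster.com

# Only needed when bootstrapping a brand-new cluster; per the draft above,
# list node names that match each node's node.name, not IP addresses.
cluster.initial_master_nodes:
  - master-node-1
  - master-node-2
  - master-node-3
```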


Only nodes with `node.master: true` are eligible to become master nodes and participate in elections. Make sure the nodes listed in `cluster.initial_master_nodes` are properly configured as master-eligible. Nodes with `node.voting_only: true` can participate in voting but cannot become master themselves. See [this guide](/deploy-manage/distributed-architecture/discovery-cluster-formation/discovery-hosts-providers.md) for more information.

An {{es}} cluster requires a quorum of master-eligible nodes to elect a master. A quorum is defined as `(N/2 + 1)`, where N is the number of master-eligible nodes. If fewer than this number are available, the cluster will not elect a master and will not form. This quorum mechanism helps prevent split-brain scenarios where multiple nodes mistakenly believe they are the master. For more details, see [Quorum-based decision making](../../deploy-manage/distributed-architecture/discovery-cluster-formation/modules-discovery-quorums.md).

Contributor

Did you mean (N+1)/2? Normally N is odd so this is probably the most useful way to express it. Otherwise to be fully technically correct the formula is ⌈(N+1)/2⌉. Or we can just say "majority" instead of "quorum" and avoid all this.
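
As a quick illustration of the majority requirement: with three master-eligible nodes a majority is two, and with five it is three, matching ⌈(N+1)/2⌉. For the eligibility settings the draft names, a rough per-node sketch (using the same setting names as the draft; node names are placeholders) could be:

```
# On an ordinary master-eligible node:
node.name: master-node-1
node.master: true

# On a voting-only node (votes in elections but is never elected master):
# node.name: voting-only-node-1
# node.master: true
# node.voting_only: true
```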


`Master node not discovered yet this node has not previously joined a bootstrapped cluster`

This usually indicates a misconfiguration in your initial cluster settings. Note this is for self-hosted instances. In this case, verify the following:

Contributor

I'm not sure about "usually". Often I see clusters failing to form due to connectivity issues.

@@ -42,7 +69,7 @@ If the logs suggest that discovery or master elections are failing due to timeou

The threads involved in discovery and cluster membership are mainly `transport_worker` and `cluster_coordination` threads, for which there should never be a long wait. There may also be evidence of long waits for threads in the {{es}} logs, particularly looking at warning logs from `org.elasticsearch.transport.InboundHandler`. See [Networking threading model](elasticsearch://reference/elasticsearch/configuration-reference/networking-settings.md#modules-network-threading-model) for more information.


If your cluster has recently lost one or more master-eligible nodes and the logs indicate that no master can be elected, verify that a quorum still exists. A master election requires a majority of the master-eligible nodes to be available (for example, 2 out of 3, or 3 out of 5). If the quorum cannot be met, the cluster will remain unformed until enough nodes are restored. This quorum mechanism is essential for ensuring consistency and preventing split-brain conditions.

Contributor

I think this duplicates the information above:

If the logs or the health report indicate that {{es}} can’t discover enough nodes to form a quorum...

and

If the logs or the health report indicate that {{es}} has discovered a possible quorum of nodes...

No need for the user to have to work out what a quorum/majority really is, and indeed they often get confused because they need a majority of the master-eligible nodes that previously made up the cluster; it's not enough to start some new nodes, because those nodes' votes won't yet count in the election. I'd rather we didn't add this paragraph.

@@ -53,6 +80,10 @@ When a node wins the master election, it logs a message containing `elected-as-m
* Packet captures will reveal system-level and network-level faults, especially if you capture the network traffic simultaneously at all relevant nodes and analyse it alongside the {{es}} logs from those nodes. You should be able to observe any retransmissions, packet loss, or other delays on the connections between the nodes.
* Long waits for particular threads to be available can be identified by taking stack dumps of the main {{es}} process (for example, using `jstack`) or a profiling trace (for example, using Java Flight Recorder) in the few seconds leading up to the relevant log message.

If your master node is also acting as a data node under heavy indexing or search load, this can cause instability. In clusters under high demand, it is recommended to use [dedicated master nodes](/deploy-manage/distributed-architecture/clusters-nodes-shards.md/node-roles#dedicated-master-node) (nodes configured with `node.master: true` and `node.data: false`) to reduce load and improve election reliability.

Contributor

I don't think this is true any more. We don't use dedicated master nodes at all in serverless, for instance.
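
For reference, the dedicated-master configuration the draft paragraph describes would look roughly like this with the setting names it uses (the node name is a placeholder):

```
# Dedicated master-eligible node: master role only, holds no data.
node.name: dedicated-master-1
node.master: true
node.data: false
```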
