Skip to content

Update verify-zookeeper-sync-status.md #949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft
wants to merge 1 commit into
base: main
Choose a base branch
from
Draft

Conversation

rseldner
Copy link

@rseldner rseldner commented Mar 27, 2025

Revisiting this similar PR that fell through the cracks on my end https://github.com/elastic/cloud/pull/108690

Proposing an alternative zk sync status command that will provide a cleaner output and not rely on the use of additional system packages like nc or telnet. We can achieve the same thing using curl which is widely available on most Linux hosts.

Current commit is written as an inline shell script. So technically not a one-liner, but still easily copy/pasteable and much more readable. But here are the equivalent one-liners if we wish to stick to that format:

container level

docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done'

host level

for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done

I've used a grep that will closely match the one-liner output without the noisey trying port lines...
Leader output:

ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
zk_synced_non_voting_followers  0
zk_leader_uptime        795608083
zk_avg_leader_unavailable_time  293.0
zk_min_leader_unavailable_time  293
zk_max_leader_unavailable_time  293
zk_cnt_leader_unavailable_time  1
zk_sum_leader_unavailable_time  293
zk_avg_follower_sync_time       0.0
zk_min_follower_sync_time       0
zk_max_follower_sync_time       0
zk_cnt_follower_sync_time       0
zk_sum_follower_sync_time       0

But we could consider this grep for an even cleaner output

grep -E "^zk_server_state|^zk_followers|^zk_synced_followers|not currently serving|^zk_znode_count"

Leader Output:

ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0

Revisiting this PR that fell through the cracks on my end elastic/cloud#108690

Proposing an alternative command that will provide a cleaner output and not rely on the use of additional system packages.  
We can achieve the same thing using `curl` which I think is just about ubiquitously available on linux hosts.

The new version is written as an inline shell script.  So technically not a `one-liner`, but still easily copy/pasteable and much more readable.
Equivalent one-liner in the current comment:
container level
```
docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done'
```
host level
```
for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done
```

I've used a grep that will closely match the one-liner output without the noisey `trying port` lines...
```
ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
zk_synced_non_voting_followers  0
zk_leader_uptime        795608083
zk_avg_leader_unavailable_time  293.0
zk_min_leader_unavailable_time  293
zk_max_leader_unavailable_time  293
zk_cnt_leader_unavailable_time  1
zk_sum_leader_unavailable_time  293
zk_avg_follower_sync_time       0.0
zk_min_follower_sync_time       0
zk_max_follower_sync_time       0
zk_cnt_follower_sync_time       0
zk_sum_follower_sync_time       0
```

but we could consider this grep for an even cleaner output
```
grep -E "^zk_server_state|^zk_followers|^zk_synced_followers|not currently serving|^zk_znode_count"
```
Leader Output:
```
ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
```
Copy link
Contributor

@kunisen kunisen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rseldner thanks for the doc PR.

I have some thoughts:

[1]

I checked chatgpt and found curl telnet:// isn't a standard approach for interacting with raw TCP services. Instead, use nc (netcat) or telnet directly was recommended.

WDYT?

[2]

I understand it might be in the old days we didn't have nc bundled in container (e.g. https://github.com/elastic/support-dev-help/issues/7458#issuecomment-513885696 - nc: command not found) but nowadays we already have nc there by default.

I don't have strong opinion but IMHO, there's no absolute need for us to replace nc with curl nowadays.
If you strongly suggest this, it's fine.

[3]

This PR only works for ECE 4+ but not for ECE 3.x.
Do we also want to backport this to ECE 3.x docs?
(If so, then filing a pairing doc PR in https://github.com/elastic/cloud/issues but using asciidoc format is required)


FWIW, this is the preview page for ECE 4:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/949/troubleshoot/deployments/cloud-enterprise/verify-zookeeper-sync-status

Copy link
Contributor

@florent-leborgne florent-leborgne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a docs standpoint, with minor styling suggestions.

Not explicitly approving until Kuni & team validate this technically. Thanks!


* `zk_followers 2`
* `zk_synced_followers 2`
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems.
If the inline shell script command doesn’t work, you can run the check directly from the director host. This can happen for example when your user lacks permissions to access Docker. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems.

Small tweak to avoid parenthesis and phrasing like "e.g."

Copy link
Contributor

@kunisen kunisen Apr 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems.
### Alternative: Check at host level
If the inline shell script command doesn’t work, you can run the check directly from the director host. This can happen for example when your user lacks permissions to access Docker. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems.

1. Run the equivalent inline shell script directly on the host terminal (outside of the zookeeper container)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
1. Run the equivalent inline shell script directly on the host terminal (outside of the zookeeper container)
1. Run the equivalent inline shell script directly on the host terminal, outside of the zookeeper container

fi
done
```
2. Look for the following lines in the output (just as noted above)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Look for the following lines in the output (just as noted above)
2. Look for the following lines in the output

@rseldner
Copy link
Author

Thanks for the feedback @kunisen

[1]

In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin). Requesting to install telnet might get laughed out of the room by some security teams 😅 . On the other hand, curl is almost always pre-installed on ECE hosts, making it reliable for ad-hoc troubleshooting. The chatgpt recommendation you received likely didn’t account for these sorts of operational realities. Security constraints aside, using curl telnet:// works perfectly fine for issuing a simple mntr command to Zookeeper. I can give chatgpt a prompt myself that acknowledges this. (but why even rely on chatgpt?)

[2]

My main goal here was to offer a host-level curl alternative to replace the "use telnet instead" section. I also updated the container-level command as well for the sake of consistency, but we could absolutely leave that version using nc if preferred. We could alternatively document both: nc for the container and curl for the host. That said, switching to host-level curl makes the dilemma a moot point. It's portable and practical.

Also, the change I proposed also returns a slightly cleaner output and output lines sometimes request from support.

[3]
Understood. I started with repo given the upcoming doc changes (at the time)

@@ -16,10 +16,18 @@ It is recommended to check the ZooKeeper sync status before starting any mainten

To check that ZooKeeper is in sync with the correct number of followers, run the following steps:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
To check that ZooKeeper is in sync with the correct number of followers, run the following steps:
### Check at container level
To check that ZooKeeper is in sync with the correct number of followers, run the following steps:

Copy link
Contributor

@kunisen kunisen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rseldner for the detailed explanation.
LGTM overall, I left a small suggestion which is to add the section headings to make the structure more logically visible.

:: Side note - 1
I wasn't aware that doing in curl telnet:// doesn't need to install telnet package. TIL. Thanks!

Also I had a test and found we still have the same issue - some OS doesn't have telnet and nc package installed by default, but the error was hidden by 2>/dev/null which was 🤦 very unfortunate, it doesn't tell you the reason even there's no nc in container image (which I tested and confirmed the behavior).

# No error
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} 2>/dev/null | grep "not currently serving";echo mntr | nc localhost ${i} 2>/dev/null| grep leader; echo mntr | $(which nc) localhost ${i} 2>/dev/null | grep follower ; done'
trying port: 2191
trying port: 2192
trying port: 2193
...

# But actually there's no nc package
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec -it frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} ; done'
trying port: 2191
sh: nc: not found
trying port: 2192
sh: nc: not found
trying port: 2193
...

:: Side note 2

In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin).

I think I would disagree.

In lots of situations we need customer's understanding and cooperation to take the information that is essential for troubleshooting.

In many cases, we argue with them about getting ECE diags, or ES diags for us, which I understand the security concern, however, this is crucial to us.
Also, one more example, sar is another thing we sometimes need, but this is not shipped by most OS packages by default.

I get your point - we should use the libraries and packages that are commonly bundled by most OS by default, as much as we can, but that doesn't mean we have to close the door due to i.e. security reason.

(One recent thing that hit this is we deprecated the built-in ECE diag and now we need to download it from Internet. For most ECE platform related cases, we have to let customer understand we need this.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants