Skip to content

Update verify-zookeeper-sync-status.md #949

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 7 commits into from
Apr 22, 2025
Merged

Conversation

rseldner
Copy link
Contributor

@rseldner rseldner commented Mar 27, 2025

Revisiting this similar PR that fell through the cracks on my end https://github.com/elastic/cloud/pull/108690

Proposing an alternative zk sync status command that will provide a cleaner output and not rely on the use of additional system packages like nc or telnet. We can achieve the same thing using curl which is widely available on most Linux hosts.

Current commit is written as an inline shell script. So technically not a one-liner, but still easily copy/pasteable and much more readable. But here are the equivalent one-liners if we wish to stick to that format:

container level

docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done'

host level

for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done

I've used a grep that will closely match the one-liner output without the noisey trying port lines...
Leader output:

ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
zk_synced_non_voting_followers  0
zk_leader_uptime        795608083
zk_avg_leader_unavailable_time  293.0
zk_min_leader_unavailable_time  293
zk_max_leader_unavailable_time  293
zk_cnt_leader_unavailable_time  1
zk_sum_leader_unavailable_time  293
zk_avg_follower_sync_time       0.0
zk_min_follower_sync_time       0
zk_max_follower_sync_time       0
zk_cnt_follower_sync_time       0
zk_sum_follower_sync_time       0

But we could consider this grep for an even cleaner output

grep -E "^zk_server_state|^zk_followers|^zk_synced_followers|not currently serving|^zk_znode_count"

Leader Output:

ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0

Revisiting this PR that fell through the cracks on my end elastic/cloud#108690

Proposing an alternative command that will provide a cleaner output and not rely on the use of additional system packages.  
We can achieve the same thing using `curl` which I think is just about ubiquitously available on linux hosts.

The new version is written as an inline shell script.  So technically not a `one-liner`, but still easily copy/pasteable and much more readable.
Equivalent one-liner in the current comment:
container level
```
docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done'
```
host level
```
for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done
```

I've used a grep that will closely match the one-liner output without the noisey `trying port` lines...
```
ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
zk_synced_non_voting_followers  0
zk_leader_uptime        795608083
zk_avg_leader_unavailable_time  293.0
zk_min_leader_unavailable_time  293
zk_max_leader_unavailable_time  293
zk_cnt_leader_unavailable_time  1
zk_sum_leader_unavailable_time  293
zk_avg_follower_sync_time       0.0
zk_min_follower_sync_time       0
zk_max_follower_sync_time       0
zk_cnt_follower_sync_time       0
zk_sum_follower_sync_time       0
```

but we could consider this grep for an even cleaner output
```
grep -E "^zk_server_state|^zk_followers|^zk_synced_followers|not currently serving|^zk_znode_count"
```
Leader Output:
```
ZK mntr Response from port 2191:
zk_server_state leader
zk_znode_count  783
zk_synced_followers     0
```
@rseldner rseldner requested review from jakommo and kunisen March 27, 2025 00:38
Copy link
Contributor

@kunisen kunisen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rseldner thanks for the doc PR.

I have some thoughts:

[1]

I checked chatgpt and found curl telnet:// isn't a standard approach for interacting with raw TCP services. Instead, use nc (netcat) or telnet directly was recommended.

WDYT?

[2]

I understand it might be in the old days we didn't have nc bundled in container (e.g. https://github.com/elastic/support-dev-help/issues/7458#issuecomment-513885696 - nc: command not found) but nowadays we already have nc there by default.

I don't have strong opinion but IMHO, there's no absolute need for us to replace nc with curl nowadays.
If you strongly suggest this, it's fine.

[3]

This PR only works for ECE 4+ but not for ECE 3.x.
Do we also want to backport this to ECE 3.x docs?
(If so, then filing a pairing doc PR in https://github.com/elastic/cloud/issues but using asciidoc format is required)


FWIW, this is the preview page for ECE 4:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/949/troubleshoot/deployments/cloud-enterprise/verify-zookeeper-sync-status

Copy link
Contributor

@florent-leborgne florent-leborgne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM from a docs standpoint, with minor styling suggestions.

Not explicitly approving until Kuni & team validate this technically. Thanks!

@rseldner
Copy link
Contributor Author

Thanks for the feedback @kunisen

[1]

In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin). Requesting to install telnet might get laughed out of the room by some security teams 😅 . On the other hand, curl is almost always pre-installed on ECE hosts, making it reliable for ad-hoc troubleshooting. The chatgpt recommendation you received likely didn’t account for these sorts of operational realities. Security constraints aside, using curl telnet:// works perfectly fine for issuing a simple mntr command to Zookeeper. I can give chatgpt a prompt myself that acknowledges this. (but why even rely on chatgpt?)

[2]

My main goal here was to offer a host-level curl alternative to replace the "use telnet instead" section. I also updated the container-level command as well for the sake of consistency, but we could absolutely leave that version using nc if preferred. We could alternatively document both: nc for the container and curl for the host. That said, switching to host-level curl makes the dilemma a moot point. It's portable and practical.

Also, the change I proposed also returns a slightly cleaner output and output lines sometimes request from support.

[3]
Understood. I started with repo given the upcoming doc changes (at the time)

Copy link
Contributor

@kunisen kunisen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @rseldner for the detailed explanation.
LGTM overall, I left a small suggestion which is to add the section headings to make the structure more logically visible.

:: Side note - 1
I wasn't aware that doing in curl telnet:// doesn't need to install telnet package. TIL. Thanks!

Also I had a test and found we still have the same issue - some OS doesn't have telnet and nc package installed by default, but the error was hidden by 2>/dev/null which was 🤦 very unfortunate, it doesn't tell you the reason even there's no nc in container image (which I tested and confirmed the behavior).

# No error
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} 2>/dev/null | grep "not currently serving";echo mntr | nc localhost ${i} 2>/dev/null| grep leader; echo mntr | $(which nc) localhost ${i} 2>/dev/null | grep follower ; done'
trying port: 2191
trying port: 2192
trying port: 2193
...

# But actually there's no nc package
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec -it frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} ; done'
trying port: 2191
sh: nc: not found
trying port: 2192
sh: nc: not found
trying port: 2193
...

:: Side note 2

In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin).

I think I would disagree.

In lots of situations we need customer's understanding and cooperation to take the information that is essential for troubleshooting.

In many cases, we argue with them about getting ECE diags, or ES diags for us, which I understand the security concern, however, this is crucial to us.
Also, one more example, sar is another thing we sometimes need, but this is not shipped by most OS packages by default.

I get your point - we should use the libraries and packages that are commonly bundled by most OS by default, as much as we can, but that doesn't mean we have to close the door due to i.e. security reason.

(One recent thing that hit this is we deprecated the built-in ECE diag and now we need to download it from Internet. For most ECE platform related cases, we have to let customer understand we need this.)

Copy link
Contributor

@jakommo jakommo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM and love the simplification with curl telnet:// !
I did a quick search and it looks like some version of RHEL 9 might have libcurl minimal only (without telnet protocol support), but from a quick test with a RHEL 9.5 GCP VM it worked fine, so that might only be an issue with RHEL minimal image or the default was changed with a point release.
In any case, I think a minimal image would also have telnet and nc missing by default.

@leemthompo
Copy link
Contributor

@florent-leborgne @kunisen @rseldner is this good to merge?

Copy link
Contributor

@florent-leborgne florent-leborgne left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@florent-leborgne florent-leborgne marked this pull request as ready for review April 22, 2025 12:52
@florent-leborgne florent-leborgne enabled auto-merge (squash) April 22, 2025 12:52
@florent-leborgne florent-leborgne merged commit 9379365 into main Apr 22, 2025
3 checks passed
@florent-leborgne florent-leborgne deleted the rseldner-patch-1 branch April 22, 2025 12:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants