-
Notifications
You must be signed in to change notification settings - Fork 58
Update verify-zookeeper-sync-status.md #949
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Revisiting this PR that fell through the cracks on my end elastic/cloud#108690 Proposing an alternative command that will provide a cleaner output and not rely on the use of additional system packages. We can achieve the same thing using `curl` which I think is just about ubiquitously available on linux hosts. The new version is written as an inline shell script. So technically not a `one-liner`, but still easily copy/pasteable and much more readable. Equivalent one-liner in the current comment: container level ``` docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done' ``` host level ``` for i in $(seq 2191 2199); do output=$(echo mntr | curl -s telnet://localhost:$i | grep -E "server_state|leader|follower|not currently serving|zk_znode_count"); [ -n "$output" ] && echo "ZK mntr Response from port $i:" && echo "$output" && break; done ``` I've used a grep that will closely match the one-liner output without the noisey `trying port` lines... ``` ZK mntr Response from port 2191: zk_server_state leader zk_znode_count 783 zk_synced_followers 0 zk_synced_non_voting_followers 0 zk_leader_uptime 795608083 zk_avg_leader_unavailable_time 293.0 zk_min_leader_unavailable_time 293 zk_max_leader_unavailable_time 293 zk_cnt_leader_unavailable_time 1 zk_sum_leader_unavailable_time 293 zk_avg_follower_sync_time 0.0 zk_min_follower_sync_time 0 zk_max_follower_sync_time 0 zk_cnt_follower_sync_time 0 zk_sum_follower_sync_time 0 ``` but we could consider this grep for an even cleaner output ``` grep -E "^zk_server_state|^zk_followers|^zk_synced_followers|not currently serving|^zk_znode_count" ``` Leader Output: ``` ZK mntr Response from port 2191: zk_server_state leader zk_znode_count 783 zk_synced_followers 0 ```
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rseldner thanks for the doc PR.
I have some thoughts:
[1]
I checked chatgpt and found curl telnet://
isn't a standard approach for interacting with raw TCP services. Instead, use nc (netcat)
or telnet
directly was recommended.
WDYT?
[2]
I understand it might be in the old days we didn't have nc bundled in container (e.g. https://github.com/elastic/support-dev-help/issues/7458#issuecomment-513885696 - nc: command not found
) but nowadays we already have nc
there by default.
I don't have strong opinion but IMHO, there's no absolute need for us to replace nc
with curl
nowadays.
If you strongly suggest this, it's fine.
[3]
This PR only works for ECE 4+ but not for ECE 3.x.
Do we also want to backport this to ECE 3.x docs?
(If so, then filing a pairing doc PR in https://github.com/elastic/cloud/issues but using asciidoc
format is required)
FWIW, this is the preview page for ECE 4:
https://docs-v3-preview.elastic.dev/elastic/docs-content/pull/949/troubleshoot/deployments/cloud-enterprise/verify-zookeeper-sync-status
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM from a docs standpoint, with minor styling suggestions.
Not explicitly approving until Kuni & team validate this technically. Thanks!
|
||
* `zk_followers 2` | ||
* `zk_synced_followers 2` | ||
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems. | |
If the inline shell script command doesn’t work, you can run the check directly from the director host. This can happen for example when your user lacks permissions to access Docker. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems. |
Small tweak to avoid parenthesis and phrasing like "e.g."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the inline shell script command doesn’t work (e.g. your user lacks permissions to access docker), you can run the check directly from the director host. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems. | |
### Alternative: Check at host level | |
If the inline shell script command doesn’t work, you can run the check directly from the director host. This can happen for example when your user lacks permissions to access Docker. This approach avoids entering the container and doesn't require installing additional tools like `telnet` or `nc`, relying instead on `curl`, which is typically available by default on most Linux systems. |
1. Run the equivalent inline shell script directly on the host terminal (outside of the zookeeper container) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1. Run the equivalent inline shell script directly on the host terminal (outside of the zookeeper container) | |
1. Run the equivalent inline shell script directly on the host terminal, outside of the zookeeper container |
fi | ||
done | ||
``` | ||
2. Look for the following lines in the output (just as noted above) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Look for the following lines in the output (just as noted above) | |
2. Look for the following lines in the output |
Thanks for the feedback @kunisen [1] In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin). Requesting to install telnet might get laughed out of the room by some security teams 😅 . On the other hand, curl is almost always pre-installed on ECE hosts, making it reliable for ad-hoc troubleshooting. The chatgpt recommendation you received likely didn’t account for these sorts of operational realities. Security constraints aside, using [2] My main goal here was to offer a host-level curl alternative to replace the "use telnet instead" section. I also updated the container-level command as well for the sake of consistency, but we could absolutely leave that version using Also, the change I proposed also returns a slightly cleaner output and output lines sometimes request from support. [3] |
@@ -16,10 +16,18 @@ It is recommended to check the ZooKeeper sync status before starting any mainten | |||
|
|||
To check that ZooKeeper is in sync with the correct number of followers, run the following steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check that ZooKeeper is in sync with the correct number of followers, run the following steps: | |
### Check at container level | |
To check that ZooKeeper is in sync with the correct number of followers, run the following steps: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @rseldner for the detailed explanation.
LGTM overall, I left a small suggestion which is to add the section headings to make the structure more logically visible.
:: Side note - 1
I wasn't aware that doing in curl telnet://
doesn't need to install telnet package. TIL. Thanks!
Also I had a test and found we still have the same issue - some OS doesn't have telnet
and nc
package installed by default, but the error was hidden by 2>/dev/null
which was 🤦 very unfortunate, it doesn't tell you the reason even there's no nc
in container image (which I tested and confirmed the behavior).
# No error
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} 2>/dev/null | grep "not currently serving";echo mntr | nc localhost ${i} 2>/dev/null| grep leader; echo mntr | $(which nc) localhost ${i} 2>/dev/null | grep follower ; done'
trying port: 2191
trying port: 2192
trying port: 2193
...
# But actually there's no nc package
elastic@sl-kuniyasusen-d66e44db-host1:~$ docker exec -it frc-zookeeper-servers-zookeeper sh -c 'for i in $(seq 2191 2199); do echo trying port: $i;echo mntr | nc localhost ${i} ; done'
trying port: 2191
sh: nc: not found
trying port: 2192
sh: nc: not found
trying port: 2193
...
:: Side note 2
In more security-constrained environments, running commands inside containers or installing new packages (nc, telnet) isn’t always feasible (even for an ECE admin).
I think I would disagree.
In lots of situations we need customer's understanding and cooperation to take the information that is essential for troubleshooting.
In many cases, we argue with them about getting ECE diags, or ES diags for us, which I understand the security concern, however, this is crucial to us.
Also, one more example, sar
is another thing we sometimes need, but this is not shipped by most OS packages by default.
I get your point - we should use the libraries and packages that are commonly bundled by most OS by default, as much as we can, but that doesn't mean we have to close the door due to i.e. security reason.
(One recent thing that hit this is we deprecated the built-in ECE diag and now we need to download it from Internet. For most ECE platform related cases, we have to let customer understand we need this.)
Revisiting this similar PR that fell through the cracks on my end https://github.com/elastic/cloud/pull/108690
Proposing an alternative zk sync status command that will provide a cleaner output and not rely on the use of additional system packages like
nc
ortelnet
. We can achieve the same thing usingcurl
which is widely available on most Linux hosts.Current commit is written as an inline shell script. So technically not a
one-liner
, but still easily copy/pasteable and much more readable. But here are the equivalent one-liners if we wish to stick to that format:container level
host level
I've used a grep that will closely match the one-liner output without the noisey
trying port
lines...Leader output:
But we could consider this grep for an even cleaner output
Leader Output: