Description
I've seen this happen intermittently and the cause is still not clear to me.
If you track the "Remote clusters successfully initialized" message in the Keess logs (and related messages), you will sometimes see initialization fail at first, then pick up the missing cluster later. Example:
kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:54:29Z","msg":"Remote clusters successfully initialized: [mc-prod-hq wc-beta-gm wc-beta-hq wc-prod-gm wc-prod-hq]"}
kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:55:27Z","msg":"Remote clusters successfully initialized: [wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq wc-beta-gm]"}
But sometimes it ignores the failed cluster forever, until the pod is restarted. I've seen a case like that where the logs showed a timeout on the first attempt. I spotted one occurrence on PAC-v1 too, after the HQ shutdown, but it seems more frequent on PAC-v2.
One thing that may or may not be related: on PAC-v2 the secrets used for sync are bound to the pod lifetime and need to be synchronized to/from storagegrid. If we restart all Keess pods at the same time, it seems easy to reproduce this situation.
To reproduce:
# Restart all keess pods at the same time:
for x in app-beta-px app-prod-hq app-prod-gm app-beta-hq app-beta-gm wc-beta-gm wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq; do kubectl --context $x -n keess rollout restart deployment keess; done
# wait 1 to 5 min and get logs from all
mkdir /tmp/keess-logs
for x in app-beta-px app-prod-hq app-prod-gm app-beta-hq app-beta-gm wc-beta-gm wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq; do kubectl --context $x -n keess logs deploy/keess --container keess > /tmp/keess-logs/$x; done
# check logs
cd /tmp/keess-logs/
grep initialize *
Failed initializations are easy to spot because of the line length, but there may be multiple matches for each Keess pod.
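Besides eyeballing line length, the bracketed cluster list in the log line can be counted to spot a pod that came up with fewer remote clusters than expected. A minimal sketch (the sample line is copied from the example above; the `sed`/`wc` pipeline is just one way to extract the count):

```shell
# Sample "successfully initialized" log line (format copied from the example above):
line='kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:55:27Z","msg":"Remote clusters successfully initialized: [wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq wc-beta-gm]"}'

# Extract the bracketed cluster list and count its entries;
# a pod stuck with a failed remote will report fewer clusters than expected.
count=$(printf '%s\n' "$line" | sed 's/.*\[\(.*\)\].*/\1/' | wc -w | tr -d ' ')
echo "$count clusters initialized"   # prints: 6 clusters initialized
```

The same pipeline can be pointed at the files in /tmp/keess-logs/ to compare counts across pods; the pod(s) with the smallest count are the ones that dropped a remote cluster.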