Description
I've seen this happen intermittently and the cause is still not clear to me.
If you track the "Remote clusters successfully initialized" message in the Keess logs (and related messages), you will sometimes see initialization fail at first, then pick up the missing cluster later. Example:
kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:54:29Z","msg":"Remote clusters successfully initialized: [mc-prod-hq wc-beta-gm wc-beta-hq wc-prod-gm wc-prod-hq]"}
kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:55:27Z","msg":"Remote clusters successfully initialized: [wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq wc-beta-gm]"}
But sometimes it ignores the failed cluster forever, until the pod is restarted. I've seen a case like that where the logs showed a timeout on the first attempt. I spotted one occurrence on PAC-v1 too, after the HQ shutdown, but it seems more frequent on PAC-v2.
One thing that may or may not be related: on PAC-v2 the secrets used for sync are bound to the pod lifetime and need to be synchronized to/from storagegrid. If we restart all Keess pods at the same time, it seems easy to reproduce this situation.
To reproduce:
# Restart all keess pods at the same time:
for x in app-beta-px app-prod-hq app-prod-gm app-beta-hq app-beta-gm wc-beta-gm wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq; do kubectl --context $x -n keess rollout restart deployment keess; done
# wait 1 to 5 min and get logs from all
mkdir /tmp/keess-logs
for x in app-beta-px app-prod-hq app-prod-gm app-beta-hq app-beta-gm wc-beta-gm wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq; do kubectl --context $x -n keess logs deploy/keess --container keess > /tmp/keess-logs/$x; done
# check logs
cd /tmp/keess-logs/
grep initialize *
Failed initializations are easy to spot because of the line length, but there may be multiple matches for each Keess pod.
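Besides eyeballing line length, the bracketed cluster list in the log line can be counted to spot a pod that came up with fewer remote clusters than expected. A minimal sketch (the sample line is copied from the example above; the `sed`/`wc` pipeline is just one way to extract the count):

```shell
# Sample "successfully initialized" log line (format copied from the example above):
line='kess-wc-prod-gm:{"level":"info","timestamp":"2025-11-17T20:55:27Z","msg":"Remote clusters successfully initialized: [wc-beta-hq wc-beta-px wc-prod-gm wc-prod-hq mc-prod-hq wc-beta-gm]"}'

# Extract the bracketed cluster list and count its entries;
# a pod stuck with a failed remote will report fewer clusters than expected.
count=$(printf '%s\n' "$line" | sed 's/.*\[\(.*\)\].*/\1/' | wc -w | tr -d ' ')
echo "$count clusters initialized"   # prints: 6 clusters initialized
```

The same pipeline can be pointed at the files in /tmp/keess-logs/ to compare counts across pods; the pod(s) with the smallest count are the ones that dropped a remote cluster.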