Summary
check-nllb-ipv6 and check-nllb-traefik-ipv6 fail intermittently across unrelated PRs including pushes to main. The failure is always the same: context deadline exceeded inside the cluster_ready_without_controller1 sub-test.
Evidence
Across 22 sampled go.yml runs, these two tests account for the most cross-branch failures — appearing on main, renovate/main-go-1.x, renovate/main-containerd, renovate/main-konnectivity, and resource-patches (none of which touch NLLB code):
check-nllb-ipv6 failures (6 runs):
check-nllb-traefik-ipv6 failures (7 runs): same branches plus resource-patches and renovate/main-konnectivity.
Root cause
The test spins up a 3-controller, 2-worker cluster and performs 3 complete controller failover cycles. Each cycle calls checkClusterReadiness, which verifies konnectivity tunnel health by polling for pod logs. The konnectivity tunnel takes ~150–180s to reconnect after a controller stops.
All checkClusterReadiness calls share a single 15-minute context (from s.Context()). By the time the second failover cycle starts, ~7 minutes are already consumed. The second cycle needs ~180s minimum to pass, which should fit — but any runner slowness (a few extra seconds anywhere) burns through the remainder.
From the logs of the main branch failure (run/26908220612):
cluster_ready_without_controller0: PASS (181s)
...
cluster_ready_without_controller1: FAIL after 390s — context deadline exceeded
failed to get pod logs from kube-system/nllb-worker0:
Get "https://localhost:32780/api/v1/.../nllb-worker0/log": context deadline exceeded
The test was failing because konnectivity hadn't re-established the tunnel to the remaining controllers, and the outer context ran out before it did.
A contributing factor: the Envoy NLLB health check uses interval: 5s, unhealthy_threshold: 5, so it takes 25 seconds to mark a stopped controller as unhealthy. During those 25s, konnectivity reconnection attempts bounce off the dead controller unnecessarily (tracked separately in #7743).
Proposed fix
inttest/nllb/nllb_test.go — give each checkClusterReadiness call its own 5-minute sub-context instead of the shared global one:
s.Run("cluster_ready_without_"+controllerName, func() {
readinessCtx, cancel := context.WithTimeout(s.Context(), 5*time.Minute)
defer cancel()
s.Require().NoError(s.checkClusterReadiness(readinessCtx, clients, s.ControllerCount, controllerName))
})
inttest/nllb/nllb_test.go — drop metric-server from the deployment. It is not needed to verify NLLB functionality. Removing it from the check would bring each cycle from ~181s to ~55s, cutting the happy path from ~15 min to ~9 min.
Summary
check-nllb-ipv6andcheck-nllb-traefik-ipv6fail intermittently across unrelated PRs including pushes tomain. The failure is always the same:context deadline exceededinside thecluster_ready_without_controller1sub-test.Evidence
Across 22 sampled
go.ymlruns, these two tests account for the most cross-branch failures — appearing onmain,renovate/main-go-1.x,renovate/main-containerd,renovate/main-konnectivity, andresource-patches(none of which touch NLLB code):check-nllb-ipv6 failures (6 runs):
main)renovate/main-go-1.x)renovate/main-go-1.x)renovate/main-containerd)renovate/main-kubernetes)renovate/main-kubernetes)check-nllb-traefik-ipv6 failures (7 runs): same branches plus
resource-patchesandrenovate/main-konnectivity.Root cause
The test spins up a 3-controller, 2-worker cluster and performs 3 complete controller failover cycles. Each cycle calls
checkClusterReadiness, which verifies konnectivity tunnel health by polling for pod logs. The konnectivity tunnel takes ~150–180s to reconnect after a controller stops.All
checkClusterReadinesscalls share a single 15-minute context (froms.Context()). By the time the second failover cycle starts, ~7 minutes are already consumed. The second cycle needs ~180s minimum to pass, which should fit — but any runner slowness (a few extra seconds anywhere) burns through the remainder.From the logs of the
mainbranch failure (run/26908220612):The test was failing because konnectivity hadn't re-established the tunnel to the remaining controllers, and the outer context ran out before it did.
A contributing factor: the Envoy NLLB health check uses
interval: 5s, unhealthy_threshold: 5, so it takes 25 seconds to mark a stopped controller as unhealthy. During those 25s, konnectivity reconnection attempts bounce off the dead controller unnecessarily (tracked separately in #7743).Proposed fix
inttest/nllb/nllb_test.go— give eachcheckClusterReadinesscall its own 5-minute sub-context instead of the shared global one:inttest/nllb/nllb_test.go— dropmetric-serverfrom the deployment. It is not needed to verify NLLB functionality. Removing it from the check would bring each cycle from ~181s to ~55s, cutting the happy path from ~15 min to ~9 min.