check-nllb-ipv6 and check-nllb-traefik-ipv6 integration tests are flaky

## Summary

`check-nllb-ipv6` and `check-nllb-traefik-ipv6` fail intermittently across unrelated PRs including pushes to `main`. The failure is always the same: `context deadline exceeded` inside the `cluster_ready_without_controller1` sub-test.

## Evidence

Across 22 sampled `go.yml` runs, these two tests account for the most cross-branch failures — appearing on `main`, `renovate/main-go-1.x`, `renovate/main-containerd`, `renovate/main-konnectivity`, and `resource-patches` (none of which touch NLLB code):

**check-nllb-ipv6 failures (6 runs):**
- https://github.com/k0sproject/k0s/actions/runs/26908220612/job/79382421780 (`main`)
- https://github.com/k0sproject/k0s/actions/runs/26914275246/job/79403124287 (`renovate/main-go-1.x`)
- https://github.com/k0sproject/k0s/actions/runs/26903598074/job/79366073449 (`renovate/main-go-1.x`)
- https://github.com/k0sproject/k0s/actions/runs/26877801685/job/79273285841 (`renovate/main-containerd`)
- https://github.com/k0sproject/k0s/actions/runs/26914329416 (`renovate/main-kubernetes`)
- https://github.com/k0sproject/k0s/actions/runs/26849117832 (`renovate/main-kubernetes`)

**check-nllb-traefik-ipv6 failures (7 runs):** same branches plus `resource-patches` and `renovate/main-konnectivity`.

## Root cause

The test spins up a 3-controller, 2-worker cluster and performs 3 complete controller failover cycles. Each cycle calls `checkClusterReadiness`, which verifies konnectivity tunnel health by polling for pod logs. The konnectivity tunnel takes ~150–180s to reconnect after a controller stops.

All `checkClusterReadiness` calls share a single 15-minute context (from `s.Context()`). By the time the second failover cycle starts, ~7 minutes are already consumed. The second cycle needs ~180s minimum to pass, which should fit — but any runner slowness (a few extra seconds anywhere) burns through the remainder.

From the logs of the `main` branch failure (`run/26908220612`):
```
cluster_ready_without_controller0: PASS (181s)
...
cluster_ready_without_controller1: FAIL after 390s — context deadline exceeded
  failed to get pod logs from kube-system/nllb-worker0:
  Get "https://localhost:32780/api/v1/.../nllb-worker0/log": context deadline exceeded
```

The test was failing because konnectivity hadn't re-established the tunnel to the remaining controllers, and the outer context ran out before it did.

A contributing factor: the Envoy NLLB health check uses `interval: 5s, unhealthy_threshold: 5`, so it takes 25 seconds to mark a stopped controller as unhealthy. During those 25s, konnectivity reconnection attempts bounce off the dead controller unnecessarily (tracked separately in #7743).

## Proposed fix

**`inttest/nllb/nllb_test.go`** — give each `checkClusterReadiness` call its own 5-minute sub-context instead of the shared global one:

```go
s.Run("cluster_ready_without_"+controllerName, func() {
    readinessCtx, cancel := context.WithTimeout(s.Context(), 5*time.Minute)
    defer cancel()
    s.Require().NoError(s.checkClusterReadiness(readinessCtx, clients, s.ControllerCount, controllerName))
})
```

**`inttest/nllb/nllb_test.go`** — drop `metric-server` from the deployment. It is not needed to verify NLLB functionality. Removing it from the check would bring each cycle from ~181s to ~55s, cutting the happy path from ~15 min to ~9 min. 

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

check-nllb-ipv6 and check-nllb-traefik-ipv6 integration tests are flaky #7742

Summary

Evidence

Root cause

Proposed fix

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

check-nllb-ipv6 and check-nllb-traefik-ipv6 integration tests are flaky #7742

Description

Summary

Evidence

Root cause

Proposed fix

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions