Skip to content

check-nllb-ipv6 and check-nllb-traefik-ipv6 integration tests are flaky #7742

@ncopa

Description

@ncopa

Summary

check-nllb-ipv6 and check-nllb-traefik-ipv6 fail intermittently across unrelated PRs including pushes to main. The failure is always the same: context deadline exceeded inside the cluster_ready_without_controller1 sub-test.

Evidence

Across 22 sampled go.yml runs, these two tests account for the most cross-branch failures — appearing on main, renovate/main-go-1.x, renovate/main-containerd, renovate/main-konnectivity, and resource-patches (none of which touch NLLB code):

check-nllb-ipv6 failures (6 runs):

check-nllb-traefik-ipv6 failures (7 runs): same branches plus resource-patches and renovate/main-konnectivity.

Root cause

The test spins up a 3-controller, 2-worker cluster and performs 3 complete controller failover cycles. Each cycle calls checkClusterReadiness, which verifies konnectivity tunnel health by polling for pod logs. The konnectivity tunnel takes ~150–180s to reconnect after a controller stops.

All checkClusterReadiness calls share a single 15-minute context (from s.Context()). By the time the second failover cycle starts, ~7 minutes are already consumed. The second cycle needs ~180s minimum to pass, which should fit — but any runner slowness (a few extra seconds anywhere) burns through the remainder.

From the logs of the main branch failure (run/26908220612):

cluster_ready_without_controller0: PASS (181s)
...
cluster_ready_without_controller1: FAIL after 390s — context deadline exceeded
  failed to get pod logs from kube-system/nllb-worker0:
  Get "https://localhost:32780/api/v1/.../nllb-worker0/log": context deadline exceeded

The test was failing because konnectivity hadn't re-established the tunnel to the remaining controllers, and the outer context ran out before it did.

A contributing factor: the Envoy NLLB health check uses interval: 5s, unhealthy_threshold: 5, so it takes 25 seconds to mark a stopped controller as unhealthy. During those 25s, konnectivity reconnection attempts bounce off the dead controller unnecessarily (tracked separately in #7743).

Proposed fix

inttest/nllb/nllb_test.go — give each checkClusterReadiness call its own 5-minute sub-context instead of the shared global one:

s.Run("cluster_ready_without_"+controllerName, func() {
    readinessCtx, cancel := context.WithTimeout(s.Context(), 5*time.Minute)
    defer cancel()
    s.Require().NoError(s.checkClusterReadiness(readinessCtx, clients, s.ControllerCount, controllerName))
})

inttest/nllb/nllb_test.go — drop metric-server from the deployment. It is not needed to verify NLLB functionality. Removing it from the check would bring each cycle from ~181s to ~55s, cutting the happy path from ~15 min to ~9 min.

Metadata

Metadata

Assignees

No one assigned

    Labels

    bugSomething isn't working

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions