The provision-cluster retry logic does not retry on transient errors #69

rick-a-lane-ii · 2024-02-15T20:36:56Z

The retry logic that is used when creating a cluster is not properly handling transient errors, and the primary purpose of retry logic is to handle transient errors.

infra-actions/.github/actions/provision-cluster/lib/kubeception.js

Line 75 in 919c007

} else if (response.message.statusCode == 425) {

The only status code being considered "transient" here is 425, which is actually an entirely expected status code per the Kubeception API design. The 425 is sent when the kubeconfig is not ready yet. This is questionable design/practice, but regardless, it is an expected situation as the design stands now.

A transient error is an unexpected and invalid response that will likely only happen for a very short period of time (and usually the next call succeeds). A 403, 404, or 500 would be the bare minimum to check for, as all of these could indicate an unexpected service outage. But typically it would be any error code, particularly in the 4xx and 5xx range.

We recently had issues that showcase this problem, as 403s were being briefly returned when the Kubeception API was being redeployed, but everything actually succeeded as expected, and it would have only sent this status code for a brief moment in time.

rick-a-lane-ii self-assigned this Mar 7, 2024

rick-a-lane-ii closed this as completed Mar 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

The provision-cluster retry logic does not retry on transient errors #69

The provision-cluster retry logic does not retry on transient errors #69

rick-a-lane-ii commented Feb 15, 2024

The provision-cluster retry logic does not retry on transient errors #69

The provision-cluster retry logic does not retry on transient errors #69

Comments

rick-a-lane-ii commented Feb 15, 2024