Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The provision-cluster retry logic does not retry on transient errors #69

Closed
rick-a-lane-ii opened this issue Feb 15, 2024 · 0 comments
Closed
Assignees

Comments

@rick-a-lane-ii
Copy link
Collaborator

The retry logic that is used when creating a cluster is not properly handling transient errors, and the primary purpose of retry logic is to handle transient errors.

} else if (response.message.statusCode == 425) {

The only status code being considered "transient" here is 425, which is actually an entirely expected status code per the Kubeception API design. The 425 is sent when the kubeconfig is not ready yet. This is questionable design/practice, but regardless, it is an expected situation as the design stands now.

A transient error is an unexpected and invalid response that will likely only happen for a very short period of time (and usually the next call succeeds). A 403, 404, or 500 would be the bare minimum to check for, as all of these could indicate an unexpected service outage. But typically it would be any error code, particularly in the 4xx and 5xx range.

We recently had issues that showcase this problem, as 403s were being briefly returned when the Kubeception API was being redeployed, but everything actually succeeded as expected, and it would have only sent this status code for a brief moment in time.

@rick-a-lane-ii rick-a-lane-ii self-assigned this Mar 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant