The retry logic used when creating a cluster does not properly handle transient errors, even though handling transient errors is the primary purpose of retry logic.
The only status code treated as "transient" here is 425, which is actually an entirely expected status code per the Kubeception API design: it is sent when the kubeconfig is not ready yet. This is a questionable design practice, but as the design stands now, it is an expected situation.
A transient error is an unexpected, invalid response that will likely occur only for a very short period of time (usually the next call succeeds). At a bare minimum, the retry logic should check for 403, 404, and 500, as any of these could indicate an unexpected service outage. Typically, though, it would cover any error code, particularly those in the 4xx and 5xx ranges.
We recently had issues that showcase this problem: 403s were briefly returned while the Kubeception API was being redeployed. Everything actually succeeded as expected, and the API would have sent this status code only for a brief moment.
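To make the distinction concrete, here is a minimal sketch of what broader retry handling might look like. It is not the code in `kubeception.js`: `classifyStatus`, `retryRequest`, the `doRequest` callback, and the attempt and delay defaults are all hypothetical names and values chosen for illustration. It follows the suggestion above of treating any 4xx or 5xx response as potentially transient while still treating 425 as the expected "not ready yet" signal. Treating any other non-2xx status as non-retryable is an additional assumption.

```javascript
// Hypothetical sketch; none of these names come from kubeception.js.
const EXPECTED_NOT_READY = 425; // kubeconfig not ready yet, per the Kubeception API design

// Classify a status code into: success, not-ready, transient, or fatal.
function classifyStatus(status) {
  if (status >= 200 && status < 300) return 'success';
  if (status === EXPECTED_NOT_READY) return 'not-ready';
  // Assumption: any 4xx/5xx may be a brief outage (e.g. a 403 during a redeploy).
  if (status >= 400 && status < 600) return 'transient';
  // Assumption: anything else (1xx, 3xx, ...) is not worth retrying.
  return 'fatal';
}

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Call doRequest() until it succeeds, hits a fatal status, or attempts run out.
// doRequest must return a promise of an object with a numeric `status` field.
async function retryRequest(
  doRequest,
  { maxAttempts = 5, delayMs = 1000, sleep = defaultSleep } = {}
) {
  let last;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    last = await doRequest();
    const kind = classifyStatus(last.status);
    if (kind === 'success') return last;
    if (kind === 'fatal') break;
    // Both 'not-ready' and 'transient' are retried, with a linearly growing delay.
    if (attempt < maxAttempts) await sleep(delayMs * attempt);
  }
  throw new Error(`request failed, last status ${last.status}`);
}
```

With this shape, a brief run of 403s during a redeploy would be retried in the same way as a 425, rather than failing the cluster creation on the first unexpected response. A real implementation would probably want to cap the total wait time and log each retried status so that persistent errors remain visible.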
Relevant code: infra-actions/.github/actions/provision-cluster/lib/kubeception.js (line 75 in 919c007)