-
Notifications
You must be signed in to change notification settings - Fork 543
Only try once in HTTP health check commands #3469
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
The BaseWgetHealthCommand is used to configure readiness and liveness probes that Kubernetes will execute against the ray-worker container. These checks should return fairly quickly so that _Kubernetes_ can get the result and possibly retry them. By default, wget retries _20 times_. The retrying should be left to Kubernetes, not wget, so that the command doesn't just hang for minutes while Kubernetes waits for the result.
@@ -178,7 +178,7 @@ const ( | |||
RayAgentRayletHealthPath = "api/local_raylet_healthz" | |||
RayDashboardGCSHealthPath = "api/gcs_healthz" | |||
RayServeProxyHealthPath = "-/healthz" | |||
BaseWgetHealthCommand = "wget -T %d -q -O- http://localhost:%d/%s | grep success" | |||
BaseWgetHealthCommand = "wget --tries 1 -T %d -q -O- http://localhost:%d/%s | grep success" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you please leave a comment why we only retry once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[discuss] I'm not sure how useful this change would be, because as long as
- the overall retry has a timeout
- retry operations follow exponential backoff
it doesn't matter how many times we attempt retry.
For readiness probe, failureThreshold, along with periodSeconds to define the readiness timeout.
You're right! This is a fairly small nit, but what's going on right now is that periodSeconds is causing multiple concurrent invocations of I absolutely owe ya a proper justification and test cases here. I just wanted to get this down to remember to come back to it, and to see whether it's worth pushing. While we're discussing...is the |
Thanks |
Yes, but I don't think the concurrent number would be high, the upper bound of which is |
The BaseWgetHealthCommand is used to configure readiness and liveness probes that Kubernetes will execute against the ray-worker container. These checks should return fairly quickly so that _Kubernetes_ can get the result and possibly retry them. By default, wget retries _20 times_. The retrying should be left to Kubernetes, not wget, so that the command doesn't just hang for minutes while Kubernetes waits for the result.
The BaseWgetHealthCommand is used to configure readiness and liveness probes that Kubernetes will execute against the ray-worker container. These checks should return fairly quickly so that _Kubernetes_ can get the result and possibly retry them. By default, wget retries _20 times_. The retrying should be left to Kubernetes, not wget, so that the command doesn't just hang for minutes while Kubernetes waits for the result.
The BaseWgetHealthCommand is used to configure readiness and liveness probes that Kubernetes will execute against the ray-worker container. These checks should return fairly quickly so that _Kubernetes_ can get the result and possibly retry them. By default, wget retries _20 times_. The retrying should be left to Kubernetes, not wget, so that the command doesn't just hang for minutes while Kubernetes waits for the result.
The BaseWgetHealthCommand is used to configure readiness and liveness probes that Kubernetes will execute against the ray-worker container. These checks should return fairly quickly so that Kubernetes can get the result and possibly retry them. By default, wget retries 20 times. The retrying should be left to Kubernetes, not wget, so that the command doesn't just hang for minutes while Kubernetes waits for the result.
Why are these changes needed?
Related issue number
Fixes #3468
Checks