No Fast Retry option to avoid retrying on the same IP indefinitely #6330
Conversation
Force-pushed c4ac021 to f483e2d
Thank you for the pull request. Are you sure this would solve your problem? We still have #1252 open. Ideally we should add tests demonstrating the issue and showing that it is solved by this change.
Thank you for considering the request; here is my analysis: I'm generally okay with the current connect() behavior of the UDP client and would prefer not to introduce any disruptive changes to Micrometer's default Reactor Netty configuration. That's why simply skipping retries in this context seems like a sufficient workaround to me. Alternative workarounds could include:

To properly unit test a "retry or ignore PortUnreachableException" setup, we would need to:

However, given the inherently network-related nature of the issue, I'm not convinced that writing such tests would be worthwhile.
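The skip-retry idea discussed above can be sketched as a plain predicate, using only the JDK (a hypothetical illustration; the actual change wires this decision into Reactor Netty's retry handling, and the class and field names here are made up):

```java
import java.io.IOException;
import java.net.PortUnreachableException;
import java.util.function.Predicate;

public class RetryPolicySketch {

    // Hypothetical predicate: retry generic I/O failures, but treat
    // PortUnreachableException as fatal so the whole client (including its
    // DNS resolution) is rebuilt instead of retrying the same dead IP.
    static final Predicate<Throwable> SHOULD_RETRY =
            t -> !(t instanceof PortUnreachableException);

    public static void main(String[] args) {
        System.out.println(SHOULD_RETRY.test(new IOException("transient")));   // true
        System.out.println(SHOULD_RETRY.test(new PortUnreachableException())); // false
    }
}
```

In Reactor terms the same decision would live in a `retryWhen` filter rather than a bare `Predicate`.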
Force-pushed f456f8a to afce740
I just reworded the setup to make it clearer.
I haven't had a chance yet to look deeply into this, but I'm still missing something. I could test more later when I have time, but in your reproduction steps, you have:

My understanding of #1252 is that this won't happen, because with Reactor Netty DNS resolution only happens when the client is created. But are you saying DNS resolution is happening when the client disconnects and reconnects?
Yes, IMHO you are right: the Netty client will not resolve the host twice, and the unresolved InetSocketAddress doesn't do the trick either. The related code in the StatsdMeterRegistry focuses on building the "remoteAddress" supplier using InetSocketAddress.createUnresolved. That call was added in commit 24b19c7 (UDS datagram support in StatsD (#2722)), replacing the host that was previously passed directly to the UdpClient builder. So maybe we should replace the createUnresolved call with a resolved address? What do you think?
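The distinction under discussion can be illustrated with plain JDK calls (a sketch; `localhost:8125` is only a stand-in for the StatsD host, and the `remoteAddress` supplier mirrors the naming used in this thread):

```java
import java.net.InetSocketAddress;
import java.util.function.Supplier;

public class AddressResolutionSketch {
    public static void main(String[] args) {
        // createUnresolved stores the hostname as-is: no DNS lookup happens
        // here, so it is up to the consumer to resolve it later (and Reactor
        // Netty only does so once, when the client is created).
        InetSocketAddress unresolved = InetSocketAddress.createUnresolved("localhost", 8125);
        System.out.println(unresolved.isUnresolved()); // true

        // The plain constructor resolves eagerly, at construction time.
        InetSocketAddress resolved = new InetSocketAddress("localhost", 8125);
        System.out.println(resolved.isUnresolved()); // false

        // Wrapping the constructor in a Supplier means every call re-runs
        // DNS resolution, so a reconnect can pick up a new pod IP.
        Supplier<InetSocketAddress> remoteAddress =
                () -> new InetSocketAddress("localhost", 8125);
        System.out.println(remoteAddress.get().isUnresolved()); // false
    }
}
```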
I will defer to @violetagg on this. I don't remember exactly why I used it.
Hello, I've just added a new update on the fork with two additions:

After some more local unit tests, the fallback (cutting the first UDP server's loopback directly, then changing the DNS mapping to a second UDP server) is passing. I'm also attaching my test scripts for anyone who wants to reproduce the testing steps (on macOS):

Finally, please note that since I haven't had time to do a full trace of the network exchanges, my approach is simply to make local, near end-to-end unit tests pass whenever I can reproduce the failing use case.
Force-pushed 673eeeb to 4fe90b3
Hello, following additional testing, the update as it stands works very well as long as the DNS rebalancing happens before the previous pods are killed. So maybe we should rename/update this pull request and aim to implement full "load balancing over CoreDNS" in the UdpClient, relying on the commits already done (don't retry on PortUnreachableException, and resolve the IP address on connecting). That could be done by refactoring the prepareUdpClient method and its callees: use a Reactor Flux.interval that periodically recreates a fresh UdpClient tied to the buffered Flux, or refactor the instantiation of that one too. What do you think?

PS: I'm aware of the failing unit test cases; I'm happy to work on them as soon as the PR is close to final.
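The periodic-rebuild idea can be sketched with plain JDK scheduling (a hypothetical illustration only; the proposal above would use Reactor's Flux.interval, and the class and method names here are invented):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class PeriodicRebuildSketch {

    // Schedules `times` rebuilds at the given period and waits for them all.
    // In the real proposal each tick would dispose the current UdpClient
    // connection and connect a fresh one, re-running DNS resolution so that
    // newly scheduled pod IPs are picked up.
    static boolean runRebuilds(int times, long periodMillis) throws InterruptedException {
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        CountDownLatch rebuilt = new CountDownLatch(times);
        Runnable rebuildClient = rebuilt::countDown; // stand-in for reconnect + re-resolve
        scheduler.scheduleAtFixedRate(rebuildClient, 0, periodMillis, TimeUnit.MILLISECONDS);
        boolean done = rebuilt.await(5, TimeUnit.SECONDS);
        scheduler.shutdownNow();
        return done;
    }

    public static void main(String[] args) throws InterruptedException {
        // 50 ms is for demonstration only; a realistic interval would be on
        // the order of the DNS record's TTL.
        System.out.println(runRebuilds(3, 50)); // true
    }
}
```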
Signed-off-by: Arthur Filliot
Force-pushed 4fe90b3 to 0a409f6
With UDP servers in Kubernetes (typically Datadog), pods are not shut down cleanly and their IPs are still reachable after autoscaling is applied. This causes the statsd module to retry indefinitely on the same IP, skipping name resolution even when the setup is based on the hostname and the pod has been killed (typically behind ClusterIP load balancing).
This setup simply skips retrying and rebuilds the whole configuration each time Netty loses the UDP server port.
The issue with Kubernetes autoscaled services is mentioned in #3563.
To reproduce:
With "FastRetry" set to false: