
Conversation

@arthuroparis

In Kubernetes, UDP servers (typically Datadog) are not always shut down cleanly, and their IPs remain reachable after autoscaling is applied. This causes the StatsD module to retry indefinitely on the same IP, skipping name resolution for a host configured by hostname, even when the pod is killed (typically behind ClusterIP load balancing).

This setup simply skips retrying and rebuilds the whole configuration each time Netty loses the UDP server port.

The issue with Kubernetes autoscaled services is mentioned in #3563.

To reproduce (a minimal stand-in UDP server is sketched after these steps):

  • Add two UDP servers on 127.0.0.2 and 127.0.0.3, with proper loopback aliases listening on 8125, and a local DNS cache entry "mylocalhost" pointing to 127.0.0.2
  • Start the Java service with the StatsD module sending to host mylocalhost:8125
  • Check that the 127.0.0.2 UDP server is receiving metrics
  • Terminate the 127.0.0.2 UDP server -> the Netty client logs PortUnreachableException
  • Update the local DNS cache to point to 127.0.0.3
  • The Netty client keeps logging PortUnreachableException indefinitely until the loopback is closed (then an IOException)
  • Only after that one IOException does the client resolve the hostname and get redirected to 127.0.0.3

With "FastRetry" setup to False :

  • The previous last step is skipped and PortUnreachableException is causing resolution and redirection every time
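
For anyone reproducing this, a minimal Java stand-in for the two UDP servers could look like the following (a sketch only; bind it to 127.0.0.2 or 127.0.0.3, then kill it to provoke the PortUnreachableException on the client side):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Minimal stand-in for a StatsD server: binds the given loopback alias
    // on port 8125 and prints every datagram it receives.
    public class FakeStatsdServer {

        public static void main(String[] args) throws Exception {
            InetAddress bindAddr = InetAddress.getByName(args.length > 0 ? args[0] : "127.0.0.2");
            try (DatagramSocket socket = new DatagramSocket(8125, bindAddr)) {
                byte[] buf = new byte[1500];
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet); // blocks until a metric line arrives
                    System.out.println(new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8));
                }
            }
        }
    }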

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch 2 times, most recently from c4ac021 to f483e2d on June 2, 2025 at 13:00
@shakuzen
Member

shakuzen commented Jun 4, 2025

Thank you for the pull request. Are you sure this would solve your problem? We still have #1252 open. Ideally we should add tests demonstrating the issue and that it is solved with this change.

@shakuzen shakuzen added the waiting for feedback label Jun 4, 2025
@arthuroparis
Author

arthuroparis commented Jun 7, 2025

Thank you for considering the request; here is my analysis:
The issue is related to the "connection" of the Reactor Netty UDP client: due to intermittent disruptions in the Kubernetes Service and a lack of readiness probes, some pods are terminated prematurely, and the Kubernetes Service then returns ICMP errors indefinitely, which in turn trigger PortUnreachableException.

I'm generally okay with the current connect() behavior of the UDP client and would prefer not to introduce any disruptive changes to Micrometer’s default Reactor Netty configuration. That’s why simply skipping retries in this context seems like a sufficient workaround to me.

Alternative workarounds could include:

  • Implementing a custom StatsD sender that avoids calling connect(), thereby eliminating the ICMP port-unreachable responses entirely (a rough sketch follows this list)
  • Adding a sidecar to the Java Micrometer container that also listens on UDP 8125: it would absorb the port-unreachable packets at the ICMP layer for the Java container, thus avoiding PortUnreachableException
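
A rough, hypothetical sketch of the first workaround in plain NIO (not Micrometer's actual sender API; the names are illustrative):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.nio.charset.StandardCharsets;

    // Sends StatsD lines over an unconnected DatagramChannel. Without connect(),
    // the kernel does not surface ICMP port-unreachable replies to this socket,
    // so no PortUnreachableException can reach the client.
    public class UnconnectedUdpSender {

        private final DatagramChannel channel;
        private final String host;
        private final int port;

        public UnconnectedUdpSender(String host, int port) throws IOException {
            this.channel = DatagramChannel.open(); // deliberately never connect()ed
            this.host = host;
            this.port = port;
        }

        public void send(String line) throws IOException {
            // Re-resolving per send picks up DNS changes after autoscaling;
            // if resolution fails, send() throws UnresolvedAddressException.
            InetSocketAddress target = new InetSocketAddress(host, port);
            channel.send(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)), target);
        }
    }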

To properly unit test a “RetryOrIgnorePortUnreachableException” setup, we would need to:

  • Mock the DNS resolution of the JVM (InetAddress)
  • Mock the Netty Client

However, given the inherently network-related nature of the issue, I’m not convinced that writing such tests would be worthwhile.

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch 2 times, most recently from f456f8a to afce740 on June 8, 2025 at 10:38
@arthuroparis
Author

I just reworded the setup to make it clearer.

@shakuzen
Member

I haven't had a chance yet to look deeply into this, but I'm still missing something. I could try testing more later when I have time, but in your reproduction steps, you have:

Only after IOException (one time), the client resolves the hostname and is redirected to 127.0.0.3

My understanding of #1252 is that that won't happen because DNS resolution only happens with Reactor Netty when the client is created. But are you saying DNS resolution is happening when the client disconnects and reconnects?
/cc @violetagg in case I'm misunderstanding anything or something has changed in Reactor Netty.

@arthuroparis
Author

arthuroparis commented Jun 12, 2025

Yes, that is what I'm saying.

IMHO you are right: the Netty client will not resolve the host twice, and the unresolved InetSocketAddress doesn't do the trick either.
However, when the flag is false, Micrometer loops over prepareUdpClient as long as the server sends port-unreachable errors.
We tested this extensively, and it does fix the issue, but only because we have set up an additional IP resolution within the configured host() property (sketched below).
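
For reference, that workaround is roughly the following shape (a sketch only, assuming the hostname "mylocalhost" from the repro steps; resolving inside host() means each rebuild of the UDP client picks up the current IP):

    import io.micrometer.statsd.StatsdConfig;

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Resolve the hostname inside host(), so every time Micrometer rebuilds the
    // UDP client it sees the current IP rather than a stale one.
    StatsdConfig config = new StatsdConfig() {

        @Override
        public String get(String key) {
            return null; // use defaults for everything else
        }

        @Override
        public String host() {
            try {
                return InetAddress.getByName("mylocalhost").getHostAddress();
            }
            catch (UnknownHostException e) {
                return "mylocalhost"; // fall back to the raw hostname
            }
        }
    };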

The related code in the StatsdMeterRegistry context focuses on building the "remoteAddress" supplier, where InetSocketAddress.createUnresolved is used. That call was added in commit 24b19c7 (UDS datagram support in StatsD (#2722)), replacing the host that was previously passed directly to the UdpClient builder.

So, maybe we should replace:

line 236:

    prepareUdpClient(publisher,
            () -> InetSocketAddress.createUnresolved(statsdConfig.host(), statsdConfig.port()));

with:

line 236:

    prepareUdpClient(publisher,
            () -> new InetSocketAddress(InetAddress.getByName(statsdConfig.host()), statsdConfig.port()));
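
One caveat with the snippet above, assuming the second argument is a plain Supplier<SocketAddress>: InetAddress.getByName throws the checked UnknownHostException, so the lambda would not compile as written. new InetSocketAddress(String, int) also resolves the hostname eagerly, without the checked exception (the address is simply flagged unresolved if the lookup fails), so an equivalent compilable variant would be:

    prepareUdpClient(publisher,
            () -> new InetSocketAddress(statsdConfig.host(), statsdConfig.port()));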

What do you think?


@shakuzen shakuzen added the waiting for team label and removed the waiting for feedback and feedback-reminder labels Jun 20, 2025
@shakuzen
Member

What do you think?

I will defer to @violetagg on this. I don't remember why exactly I used InetSocketAddress.createUnresolved then. Will the UdpClient resolve the address when connecting with the proposed change above?

@arthuroparis
Author

arthuroparis commented Jun 20, 2025

Hello,

I've just pushed an update to the fork with two additions:

  • As mentioned earlier, we need to resolve the IP address on restarts, so I've added InetAddress resolution within the SocketAddress resolution, plus a fallback to createUnresolved in case the hostname cannot be resolved (a sketch follows this list). Please note that in our setup we use a local cache of the IP/hostname pair, so the behavior here may differ.
  • The workaround configuration was initially put in place to fix a bug where the server is shut off but still returns ICMP port-unreachable packets. And yes, that makes the Micrometer module work better. But we still have other issues, typically on cluster restarts, when it seems our Datadog UDP pods come up after the Micrometer client, or are even OOM-killed. In that case, the "loopback close" step applies. So after a little reading of the Reactor Netty documentation, I figured out how to access the underlying channel pipeline and add an additional name-resolution step there too.
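
A sketch of the first addition, assuming a statsdConfig in scope as in StatsdMeterRegistry (hedged; the actual committed code may differ):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.net.UnknownHostException;
    import java.util.function.Supplier;

    // Re-resolve the hostname on every (re)connect; fall back to an unresolved
    // address if the lookup fails, matching the previous behavior.
    Supplier<SocketAddress> remoteAddress = () -> {
        try {
            return new InetSocketAddress(InetAddress.getByName(statsdConfig.host()), statsdConfig.port());
        }
        catch (UnknownHostException e) {
            return InetSocketAddress.createUnresolved(statsdConfig.host(), statsdConfig.port());
        }
    };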

After some further local unit tests, the fallback scenario (cutting the first UDP server's loopback directly, then changing the DNS mapping to a second UDP server) passes.

My test scripts are attached for anyone who wants to reproduce the testing steps (on macOS).

Finally, please note that since I don't have time to fully trace the network exchanges here, my approach, simply put, is to fix local, near end-to-end tests whenever I can reproduce the failing use case.

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch from 673eeeb to 4fe90b3 on June 20, 2025 at 12:43
@arthuroparis
Author

Hello,

Following additional testing, the update as it now stands works very well as long as the DNS rebalancing happens before the previous pods are killed.
Unfortunately, if that's not the case, the current UdpClient cannot address the issue, as it cannot loop or wait until the DNS redirects to a living UDP server. Worse, during HPA, if some pods are addressed by the DNS before the Datadog service is up, they are killed automatically for lack of readiness probes: the container receives UDP packets and starts connections, and then the Datadog service cannot start because the port is already in use.

So maybe we should rename/update this pull request and aim to implement a full "load balancing over CoreDNS" UdpClient, relying on the commits already done (don't retry on PortUnreachableException, and resolve the IP address on connect). That could be done by refactoring the prepareUdpClient method and its callees: use a Reactor Flux.interval that periodically recreates a fresh UdpClient wired to the BufferedFlux, or refactor the instantiation of that as well (a rough sketch of the interval idea follows below).
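
Very roughly, and purely as an untested shape (connectNewUdpClient() is a hypothetical stand-in for a refactored prepareUdpClient() returning a Mono of the connected Reactor Netty Connection, re-resolving DNS each time):

    import java.time.Duration;
    import reactor.core.publisher.Flux;
    import reactor.core.publisher.Mono;

    Flux.interval(Duration.ofSeconds(30))
        .onBackpressureDrop()
        // build a fresh client each tick; swallow connect failures and retry next tick
        .concatMap(tick -> connectNewUdpClient().onErrorResume(e -> Mono.empty()))
        // once the new client is connected, dispose the previous one
        .scan((previous, fresh) -> {
            previous.dispose();
            return fresh;
        })
        .subscribe();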

What do you think?
Maybe an additional pull request, or a follow-up on this one?

PS: I'm aware of the failing unit test cases; I'm happy to work on them as soon as the PR is close to finalized.

Arthur Filliot added 3 commits July 3, 2025 12:21
@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch from 4fe90b3 to 0a409f6 on July 3, 2025 at 10:22
