
Conversation

@arthuroparis

In Kubernetes, UDP servers (typically Datadog) are not always shut down cleanly, and their IPs remain reachable after autoscaling is applied. This causes the StatsD module to retry indefinitely on the same IP, skipping name resolution for a host configured by hostname, even when the pod is killed (typically behind ClusterIP load balancing).

This setup simply skips retrying and rebuilds the whole configuration each time Netty loses the UDP server port.

The issue with Kubernetes autoscaled services is mentioned in #3563.

To reproduce (a minimal stand-in UDP server is sketched after these steps):

  • Add two UDP servers on 127.0.0.2 and 127.0.0.3, with proper loopback aliases listening on 8125, and a local DNS cache entry "mylocalhost" pointing to 127.0.0.2
  • Start the Java service with the StatsD module sending to host mylocalhost:8125
  • Check that the 127.0.0.2 UDP server is receiving metrics
  • Terminate the 127.0.0.2 UDP server -> the Netty client logs PortUnreachableException
  • Update the local DNS cache to point to 127.0.0.3
  • The Netty client keeps logging PortUnreachableException indefinitely until the loopback is closed (then an IOException)
  • Only after that one IOException does the client resolve the hostname and get redirected to 127.0.0.3

With "FastRetry" setup to False :

  • The previous last step is skipped and PortUnreachableException is causing resolution and redirection every time
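
For anyone reproducing this, a minimal Java stand-in for the two UDP servers could look like the following (a sketch only; bind it to 127.0.0.2 or 127.0.0.3, then kill it to provoke the PortUnreachableException on the client side):

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.nio.charset.StandardCharsets;

    // Minimal stand-in for a StatsD server: binds the given loopback alias
    // on port 8125 and prints every datagram it receives.
    public class FakeStatsdServer {

        public static void main(String[] args) throws Exception {
            InetAddress bindAddr = InetAddress.getByName(args.length > 0 ? args[0] : "127.0.0.2");
            try (DatagramSocket socket = new DatagramSocket(8125, bindAddr)) {
                byte[] buf = new byte[1500];
                while (true) {
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    socket.receive(packet); // blocks until a metric line arrives
                    System.out.println(new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8));
                }
            }
        }
    }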

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch 2 times, most recently from c4ac021 to f483e2d on June 2, 2025 at 13:00
@shakuzen
Member

shakuzen commented Jun 4, 2025

Thank you for the pull request. Are you sure this would solve your problem? We still have #1252 open. Ideally we should add tests demonstrating the issue and that it is solved with this change.

@shakuzen shakuzen added the waiting for feedback label Jun 4, 2025
@arthuroparis
Author

arthuroparis commented Jun 7, 2025

Thank you for considering the request; here is my analysis:
The issue is related to the "connection" of the Reactor Netty UDP client: due to intermittent disruptions in the Kubernetes Service and a lack of readiness probes, some pods are terminated prematurely, and the Kubernetes Service then returns ICMP errors indefinitely, which in turn trigger PortUnreachableException.

I'm generally okay with the current connect() behavior of the UDP client and would prefer not to introduce any disruptive changes to Micrometer’s default Reactor Netty configuration. That’s why simply skipping retries in this context seems like a sufficient workaround to me.

Alternative workarounds could include:

  • Implementing a custom StatsD sender that avoids calling connect(), thereby eliminating the ICMP port-unreachable responses entirely (a rough sketch follows this list)
  • Adding a sidecar to the Java Micrometer container that also listens on UDP 8125: it would absorb the port-unreachable packets at the ICMP layer for the Java container, thus avoiding PortUnreachableException
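
A rough, hypothetical sketch of the first workaround in plain NIO (not Micrometer's actual sender API; the names are illustrative):

    import java.io.IOException;
    import java.net.InetSocketAddress;
    import java.nio.ByteBuffer;
    import java.nio.channels.DatagramChannel;
    import java.nio.charset.StandardCharsets;

    // Sends StatsD lines over an unconnected DatagramChannel. Without connect(),
    // the kernel does not surface ICMP port-unreachable replies to this socket,
    // so no PortUnreachableException can reach the client.
    public class UnconnectedUdpSender {

        private final DatagramChannel channel;
        private final String host;
        private final int port;

        public UnconnectedUdpSender(String host, int port) throws IOException {
            this.channel = DatagramChannel.open(); // deliberately never connect()ed
            this.host = host;
            this.port = port;
        }

        public void send(String line) throws IOException {
            // Re-resolving per send picks up DNS changes after autoscaling;
            // if resolution fails, send() throws UnresolvedAddressException.
            InetSocketAddress target = new InetSocketAddress(host, port);
            channel.send(ByteBuffer.wrap(line.getBytes(StandardCharsets.UTF_8)), target);
        }
    }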

To properly unit test a “RetryOrIgnorePortUnreachableException” setup, we would need to:

  • Mock the DNS resolution of the JVM (InetAddress)
  • Mock the Netty Client

However, given the inherently network-related nature of the issue, I’m not convinced that writing such tests would be worthwhile.

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch 2 times, most recently from f456f8a to afce740 on June 8, 2025 at 10:38
@arthuroparis
Author

I just reworded the setup to make it clearer.

@shakuzen
Member

I haven't had a chance yet to look deeply into this, but I'm still missing something. I could try testing more later when I have time, but in your reproduction steps, you have:

Only after IOException (one time), the client resolves the hostname and is redirected to 127.0.0.3

My understanding of #1252 is that that won't happen because DNS resolution only happens with Reactor Netty when the client is created. But are you saying DNS resolution is happening when the client disconnects and reconnects?
/cc @violetagg in case I'm misunderstanding anything or something has changed in Reactor Netty.

@arthuroparis
Author

arthuroparis commented Jun 12, 2025

Yes, that is what I'm saying.

IMHO you are right: the Netty client will not resolve the host twice, and the unresolved InetSocketAddress doesn't do the trick either.
However, when the flag is false, Micrometer loops over prepareUdpClient as long as the server sends port-unreachable errors.
We tested this extensively, and it does fix the issue, but only because we have set up an additional IP resolution within the configured host() property (sketched below).
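
For reference, that workaround is roughly the following shape (a sketch only, assuming the hostname "mylocalhost" from the repro steps; resolving inside host() means each rebuild of the UDP client picks up the current IP):

    import io.micrometer.statsd.StatsdConfig;

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Resolve the hostname inside host(), so every time Micrometer rebuilds the
    // UDP client it sees the current IP rather than a stale one.
    StatsdConfig config = new StatsdConfig() {

        @Override
        public String get(String key) {
            return null; // use defaults for everything else
        }

        @Override
        public String host() {
            try {
                return InetAddress.getByName("mylocalhost").getHostAddress();
            }
            catch (UnknownHostException e) {
                return "mylocalhost"; // fall back to the raw hostname
            }
        }
    };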

The related code in the StatsdMeterRegistry context focuses on building the "remoteAddress" supplier, where InetSocketAddress.createUnresolved is used. That call was added in commit 24b19c7 (UDS datagram support in StatsD (#2722)), replacing the host that was previously passed directly to the UdpClient builder.

So, maybe we should replace:

line 236:

    prepareUdpClient(publisher,
            () -> InetSocketAddress.createUnresolved(statsdConfig.host(), statsdConfig.port()));

with:

line 236:

    prepareUdpClient(publisher,
            () -> new InetSocketAddress(InetAddress.getByName(statsdConfig.host()), statsdConfig.port()));
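
One caveat with the snippet above, assuming the second argument is a plain Supplier<SocketAddress>: InetAddress.getByName throws the checked UnknownHostException, so the lambda would not compile as written. new InetSocketAddress(String, int) also resolves the hostname eagerly, without the checked exception (the address is simply flagged unresolved if the lookup fails), so an equivalent compilable variant would be:

    prepareUdpClient(publisher,
            () -> new InetSocketAddress(statsdConfig.host(), statsdConfig.port()));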

What do you think?


@shakuzen shakuzen added the waiting for team label and removed the waiting for feedback and feedback-reminder labels Jun 20, 2025
@shakuzen
Member

What do you think?

I will defer to @violetagg on this. I don't remember why exactly I used InetSocketAddress.createUnresolved then. Will the UdpClient resolve the address when connecting with the proposed change above?

@arthuroparis
Author

arthuroparis commented Jun 20, 2025

Hello,

I've just pushed an update to the fork with two additions:

  • As mentioned earlier, we need to resolve the IP address on restarts, so I've added InetAddress resolution within the SocketAddress resolution, plus a fallback to createUnresolved in case the hostname cannot be resolved (a sketch follows this list). Please note that in our setup we use a local cache of the IP/hostname pair, so the behavior here may differ.
  • The workaround configuration was initially put in place to fix a bug where the server is shut off but still returns ICMP port-unreachable packets. And yes, that makes the Micrometer module work better. But we still have other issues, typically on cluster restarts, when it seems our Datadog UDP pods come up after the Micrometer client, or are even OOM-killed. In that case, the "loopback close" step applies. So after a little reading of the Reactor Netty documentation, I figured out how to access the underlying channel pipeline and add an additional name-resolution step there too.
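
A sketch of the first addition, assuming a statsdConfig in scope as in StatsdMeterRegistry (hedged; the actual committed code may differ):

    import java.net.InetAddress;
    import java.net.InetSocketAddress;
    import java.net.SocketAddress;
    import java.net.UnknownHostException;
    import java.util.function.Supplier;

    // Re-resolve the hostname on every (re)connect; fall back to an unresolved
    // address if the lookup fails, matching the previous behavior.
    Supplier<SocketAddress> remoteAddress = () -> {
        try {
            return new InetSocketAddress(InetAddress.getByName(statsdConfig.host()), statsdConfig.port());
        }
        catch (UnknownHostException e) {
            return InetSocketAddress.createUnresolved(statsdConfig.host(), statsdConfig.port());
        }
    };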

After some further local unit tests, the fallback scenario (cutting the first UDP server's loopback directly, then changing the DNS mapping to a second UDP server) passes.

My test scripts are attached for anyone who wants to reproduce the testing steps (on macOS).

Finally, please note that since I don't have time to fully trace the network exchanges here, my approach, simply put, is to fix local, near end-to-end tests whenever I can reproduce the failing use case.

@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch from 673eeeb to 4fe90b3 on June 20, 2025 at 12:43
@arthuroparis
Author

Hello,

Following additional testing, the update as it now stands works very well as long as the DNS rebalancing happens before the previous pods are killed.
Unfortunately, if that's not the case, the current UdpClient cannot address the issue, as it cannot loop or wait until the DNS redirects to a living UDP server. Worse, during HPA, if some pods are addressed by the DNS before the Datadog service is up, they are killed automatically for lack of readiness probes: the container receives UDP packets and starts connections, and then the Datadog service cannot start because the port is already in use.

So maybe we should rename/update this pull request and aim to implement a full "load balancing over CoreDNS" UdpClient, relying on the commits already done (don't retry on PortUnreachableException, and resolve the IP address on connect). That could be done by refactoring the prepareUdpClient method and its callees: use a Reactor Flux.interval that periodically recreates a fresh UdpClient wired to the BufferedFlux, or refactor the instantiation of that as well (a rough sketch of the interval idea follows below).
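
Very roughly, and purely as an untested shape (connectNewUdpClient() is a hypothetical stand-in for a refactored prepareUdpClient() returning a Mono of the connected Reactor Netty Connection, re-resolving DNS each time):

    import java.time.Duration;
    import reactor.core.publisher.Flux;
    import reactor.core.publisher.Mono;

    Flux.interval(Duration.ofSeconds(30))
        .onBackpressureDrop()
        // build a fresh client each tick; swallow connect failures and retry next tick
        .concatMap(tick -> connectNewUdpClient().onErrorResume(e -> Mono.empty()))
        // once the new client is connected, dispose the previous one
        .scan((previous, fresh) -> {
            previous.dispose();
            return fresh;
        })
        .subscribe();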

What do you think?
Maybe an additional pull request, or a follow-up on this one?

PS: I'm aware of the failing unit test cases; I'm happy to work on them as soon as the PR is close to finalized.

Arthur Filliot added 3 commits July 3, 2025 12:21
@arthuroparis arthuroparis force-pushed the v.1.15.0-statsd-nofastretry branch from 4fe90b3 to 0a409f6 on July 3, 2025 at 10:22
