test: fix flaky https tests on windows #62451
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #62451 +/- ##
=======================================
Coverage 89.71% 89.71%
=======================================
Files 692 692
Lines 213988 214008 +20
Branches 41054 41047 -7
=======================================
+ Hits 191976 192000 +24
- Misses 14086 14088 +2
+ Partials 7926 7920 -6
assert.strictEqual(ports[ports.length - 1], port);
makeRequest(url, agent, common.mustCall((port) => {
assert.strictEqual(ports[ports.length - 1], port);
server.closeAllConnections();
Why isn't agent.destroy() below sufficient?
That's part of what I'm investigating. I'm running a stress test of this change on Windows to determine whether it deals with the flakiness. If it does, a follow-up will be to figure out why the destroy isn't sufficient.
What's apparent is that there is a race condition on cleanup. Where and exactly why still needs to be determined along with a long term fix.
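For reference, here is a minimal sketch of the two cleanup calls under discussion. It assumes a plain `http` server on loopback rather than the `https` fixtures the real tests use, and `get` and `runOnce` are hypothetical names, not helpers from the test suite. `agent.destroy()` tears down the sockets the client agent is holding, while `server.closeAllConnections()` destroys the sockets the server side still has open.

```javascript
'use strict';
const http = require('node:http');

// Hypothetical helper: issue one GET through the given agent and resolve
// with the status code once the response body has been fully consumed.
function get(port, agent) {
  return new Promise((resolve, reject) => {
    const req = http.get({ host: '127.0.0.1', port, agent }, (res) => {
      res.resume();
      res.on('end', () => resolve(res.statusCode));
    });
    req.on('error', reject);
  });
}

// Hypothetical test body: one keep-alive request, then explicit cleanup
// on both the client side and the server side.
async function runOnce() {
  const server = http.createServer((req, res) => res.end('ok'));
  const agent = new http.Agent({ keepAlive: true });
  await new Promise((resolve) => server.listen(0, '127.0.0.1', resolve));
  try {
    return await get(server.address().port, agent);
  } finally {
    // Client side: destroy sockets currently in use by the agent.
    agent.destroy();
    // Server side: destroy every connection the server still tracks.
    server.closeAllConnections();
    await new Promise((resolve) => server.close(resolve));
  }
}
```

Calling both is redundant when cleanup behaves as expected; the sketch only illustrates which side each call acts on, not why one of them alone would be insufficient on Windows.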
}, common.mustCall((res) => {
assert.strictEqual(res.statusCode, 200);
res.on('end', () => {
server.closeAllConnections();
What is preventing the socket from being closed normally? Is the keep-alive timeout longer than the test timeout? If that is the case, can keep-alive be disabled? I don't think it is needed for this test.
Thinking through this... disabling keep-alive might address the issue for the various test-http(s)-* tests but there are tcp/tls tests also failing with the same crash that follow the same pattern, suggesting that the issue is a bit more fundamental than just using keep-alive.
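As a rough illustration of what "disabling keep-alive" could look like in a single test, here is a sketch assuming a plain `http` server on loopback. `requestWithoutKeepAlive` is a hypothetical name. The request uses a non-keep-alive agent and also sends an explicit `Connection: close` header, so the server closes the socket once the response is written and `server.close()` can complete without a forced teardown.

```javascript
'use strict';
const http = require('node:http');

// Hypothetical helper: make one request with keep-alive disabled and
// resolve with the Connection header value the server observed.
function requestWithoutKeepAlive() {
  return new Promise((resolve, reject) => {
    const server = http.createServer((req, res) => {
      // Echo back the Connection header the server received.
      res.end(req.headers.connection);
    });
    server.listen(0, '127.0.0.1', () => {
      const agent = new http.Agent({ keepAlive: false });
      const req = http.get({
        host: '127.0.0.1',
        port: server.address().port,
        agent,
        // Explicit header so the intent does not depend on agent defaults.
        headers: { Connection: 'close' },
      }, (res) => {
        let body = '';
        res.setEncoding('utf8');
        res.on('data', (chunk) => { body += chunk; });
        // The server ends the socket after responding, so a plain
        // server.close() is enough here.
        res.on('end', () => server.close(() => resolve(body)));
      });
      req.on('error', reject);
    });
  });
}
```

This only covers the HTTP layer. As noted above, the `test-net-*` and `test-tls-*` failures follow the same pattern without any keep-alive involvement, so this would not address them.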
Hmm... the CI stress test job seems to not actually be working correctly on most of the Windows configurations.
@joyeecheung or @addaleax ... curious if either of you have thoughts on the root cause here. The flakes are just |
I don't have a Windows machine at hand at the moment but can you reproduce locally or is it only in CI? Did you try with |
I haven't even booted my Windows machine in over two years. Once I get it back up and updated I'll see if I can reproduce locally and, if so, I'll try to get a usable dump from it.
89cd5ab to 374cc83
This is an attempt to fix the flaky failures on http(s) and tcp tests on Windows.
Signed-off-by: James M Snell <jasnell@gmail.com>
Assisted-by: Opencode/Opus 4.6
374cc83 to 588d179
socket.destroy();
}
};
process.on('beforeExit', onBeforeExit);
If the issue is Windows only, the hook and tracking should be added only on Windows, no? Anyway do you know when the issue started? Is it a regression, and did you manage to reproduce it locally?
Yes, I was able to locally repro. It is a regression but I haven't yet been able to track down exactly when it started. It looks like a libuv issue.
mcollina left a comment
I'm running this PR, I'll report back on what I could find.
I'm also OK with skipping the flaky tests.
}

self._connections++;
self._connectionSockets.add(socket);
This will unfortunately add overhead. I would prefer if we could avoid this entirely/find the root cause of this regression or adjust the tests.
Adjusting the tests means adding closeAllConnections() to every test-http(s) test and that still leaves flakiness for all of the test-tls-* and test-net-* tests that have also been flaky.
Unfortunately, any of the |
Looking at https://github.com/nodejs/reliability it looks like the issue started on 10/03/2026, so I think it is a regression (or an update in CI runners).
I think we should revert the commit that caused this (5bebd7e?) then. Anyway, 5bebd7e landed on 16/03/2026, six days later than the first appearance, so I don't think it is the culprit. We need to bisect.
Oh nice, I was just about to start digging into the reports to check on that. Before considering a revert, is there anything else critical that the libuv update fixed? Want to make sure we don't re-introduce another problem trying to solve this one.
OK... the key purpose of this PR was to experiment with a couple of approaches to dealing with the flakiness, but I've been unable to find a single approach that is workable. Either reverting the libuv update or identifying the root cause in libuv and patching it is likely the only path forward.
Closing this in favor of a revert of the libuv update #62497
The test-https-server-options-incoming-message and
test-https-server-options-server-response tests can
be flaky on Windows.
Opencode/Opus is helping figure out why.
If this PR makes the tests non-flaky, it's just a band-aid. There appears to be a race condition on cleanup on Windows, where shutting down connections races with shutting down the server. We'd still need to identify exactly where that race occurs and where to fix it. Right now, however, my goal is to stop the flakes from blocking CI.
The flakiness does seem generalized to any test-http(s)-* and test-net-* or test-tls-* test, not just one... which makes me strongly suspect the race is quite low in the stack, possibly even in libuv, but that's yet to be determined precisely.
This PR adds a pre-exit hook that will force open connections to be destroyed when the process is exiting, short-circuiting the race condition that's causing the process to crash on Windows.
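The idea of a pre-exit hook can be sketched in userland as follows. This is not the PR's actual change, which tracks sockets inside the server implementation; `trackSockets` and its return shape are hypothetical names used only for illustration. The sketch keeps a `Set` of accepted sockets and destroys whatever is left when `'beforeExit'` fires.

```javascript
'use strict';
const net = require('node:net');

// Hypothetical helper (not the PR's code): remember every accepted socket
// and destroy the remaining ones when the process is about to exit.
function trackSockets(server) {
  const sockets = new Set();

  server.on('connection', (socket) => {
    sockets.add(socket);
    // Stop tracking a socket once it has closed on its own.
    socket.once('close', () => sockets.delete(socket));
  });

  const destroyAll = () => {
    for (const socket of sockets) socket.destroy();
    sockets.clear();
  };

  // 'beforeExit' is emitted only when the event loop has no work left,
  // so this catches sockets that are no longer keeping the loop alive.
  process.once('beforeExit', destroyAll);

  return { sockets, destroyAll };
}
```

A test could call `trackSockets(server)` right after creating the server, or call the returned `destroyAll()` directly during teardown. The per-connection `Set` bookkeeping is the kind of overhead raised as a concern in the review above.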