
test: fix flaky https tests on windows #62451

Closed
jasnell wants to merge 1 commit into nodejs:main from
jasnell:jasnell/hopefully-fix-flaky-windows-tests

Conversation

@jasnell
Member

@jasnell jasnell commented Mar 27, 2026

The test-https-server-options-incoming-message and
test-https-server-options-server-response tests can
be flaky on Windows.

Opencode/Opus is helping figure out why.

Even if this PR makes the tests non-flaky, it's just a bandaid. There appears to be a race condition in cleanup on Windows, where shutting down connections races with shutting down the server. We'd still need to identify specifically where that race occurs and where to fix it. Right now, however, my goal is to unblock CI flakes.

The flakiness appears generalized to any test-http(s)-*, test-net-*, or test-tls-* test, not just one, which makes me strongly suspect the race is quite low in the stack, possibly even in libuv, but that's yet to be determined precisely.

This PR adds a pre-exit hook that will force open connections to be destroyed when the process is exiting, short-circuiting the race condition that's causing the process to crash on Windows.
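The approach can be sketched roughly as follows. This is a simplified, hypothetical illustration of the mechanism, not the actual `lib/net.js` change; the names `connectionSockets`, `trackConnection`, and `onBeforeExit` are illustrative.

```javascript
// Sketch: the server tracks every accepted connection in a Set, and a
// 'beforeExit' hook force-destroys any socket still open, short-circuiting
// the shutdown race. (Identifier names here are illustrative.)
const connectionSockets = new Set();

function trackConnection(socket) {
  connectionSockets.add(socket);
  // Stop tracking once the connection goes away on its own.
  socket.on('close', () => connectionSockets.delete(socket));
}

function onBeforeExit() {
  // Tear down anything still open just before the process exits.
  for (const socket of connectionSockets) {
    if (!socket.destroyed) socket.destroy();
  }
  connectionSockets.clear();
}

process.on('beforeExit', onBeforeExit);
```

The hook only runs when the event loop is draining, so in the normal case (all connections already closed) it is a no-op over an empty Set.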

@nodejs-github-bot nodejs-github-bot added needs-ci PRs that need a full CI run. test Issues and PRs related to the tests. labels Mar 27, 2026
@jasnell jasnell requested a review from mcollina March 27, 2026 05:15
@codecov

codecov bot commented Mar 27, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.71%. Comparing base (bdf75a6) to head (588d179).
⚠️ Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main   #62451   +/-   ##
=======================================
  Coverage   89.71%   89.71%           
=======================================
  Files         692      692           
  Lines      213988   214008   +20     
  Branches    41054    41047    -7     
=======================================
+ Hits       191976   192000   +24     
- Misses      14086    14088    +2     
+ Partials     7926     7920    -6     
Files with missing lines Coverage Δ
lib/net.js 94.70% <100.00%> (+0.04%) ⬆️

... and 25 files with indirect coverage changes


assert.strictEqual(ports[ports.length - 1], port);
makeRequest(url, agent, common.mustCall((port) => {
assert.strictEqual(ports[ports.length - 1], port);
server.closeAllConnections();
Member

Why isn't agent.destroy() below sufficient?

Member Author

That's part of what I'm investigating. I'm running a stress test of this change on Windows to determine whether it deals with the flakiness. If it does, a follow-up will be to figure out why the destroy isn't sufficient.

What's apparent is that there is a race condition on cleanup. Where it is and exactly why it happens still need to be determined, along with a long-term fix.

}, common.mustCall((res) => {
assert.strictEqual(res.statusCode, 200);
res.on('end', () => {
server.closeAllConnections();
Member

@lpinca lpinca Mar 27, 2026

What is preventing the socket from being closed normally? Is the keep-alive timeout longer than the test timeout? If so, can keep-alive be disabled? I don't think it is needed for this test.

Member Author

Thinking through this... disabling keep-alive might address the issue for the various test-http(s)-* tests, but TCP/TLS tests that follow the same pattern are also failing with the same crash, suggesting the issue is more fundamental than just the use of keep-alive.


@jasnell
Member Author

jasnell commented Mar 27, 2026

Hmm... the CI stress test job seems not to be working correctly on most of the Windows configurations.

@nodejs-github-bot
Collaborator

nodejs-github-bot commented Mar 27, 2026


@jasnell
Member Author

jasnell commented Mar 27, 2026

@joyeecheung or @addaleax ... curious if either of you have thoughts on the root cause here. The flakes span test-http(s)-*, test-tls-*, and test-net-* tests in general on Windows. It's not a single or particular set of tests failing each time; they all appear to follow the same pattern, and all fail with the same error code. It's definitely a race condition somewhere. I'm thinking it's most likely in libuv, but it's not entirely clear. The closeAllConnections approach in this PR appears to resolve the flakiness in these particular tests, but it's just a bandaid.

@lpinca
Member

lpinca commented Mar 27, 2026

I don't have a Windows machine at hand at the moment but can you reproduce locally or is it only in CI? Did you try with autoSelectFamily disabled?

@jasnell
Member Author

jasnell commented Mar 27, 2026

I haven't even booted my Windows machine in over two years. Once I get it back up and updated I'll see if I can reproduce locally and, if so, I'll try to get a usable dump from it.

@jasnell jasnell force-pushed the jasnell/hopefully-fix-flaky-windows-tests branch 2 times, most recently from 89cd5ab to 374cc83 Compare March 28, 2026 16:31

@nodejs-github-bot
Collaborator

nodejs-github-bot commented Mar 29, 2026

This is an attempt to fix the flaky failures on http(s) and
tcp tests on Windows.

Signed-off-by: James M Snell <jasnell@gmail.com>
Assisted-by: Opencode/Opus 4.6
@jasnell jasnell force-pushed the jasnell/hopefully-fix-flaky-windows-tests branch from 374cc83 to 588d179 Compare March 29, 2026 13:53

@jasnell jasnell requested review from aduh95, panva and targos March 29, 2026 14:13
panva previously approved these changes Mar 29, 2026
socket.destroy();
}
};
process.on('beforeExit', onBeforeExit);
Member

If the issue is Windows only, the hook and tracking should be added only on Windows, no? Anyway do you know when the issue started? Is it a regression, and did you manage to reproduce it locally?

Member Author

Yes, I was able to reproduce locally. It is a regression, but I haven't yet been able to track down exactly when it started. It looks like a libuv issue.

Member

@mcollina mcollina left a comment

I'm running this PR; I'll report back on what I find.

I'm also OK with skipping the flaky tests.

}

self._connections++;
self._connectionSockets.add(socket);
Member

This will unfortunately add overhead. I would prefer that we avoid this entirely by finding the root cause of this regression, or by adjusting the tests.

Member Author

Adjusting the tests means adding closeAllConnections() to every test-http(s)-* test, and that still leaves flakiness in the test-tls-* and test-net-* tests that have also been flaky.

@jasnell
Member Author

jasnell commented Mar 29, 2026

I'm also OK with skipping the flaky tests

Unfortunately, any of the test-http(s)-* tests can be flaky on Windows now. It's not isolated to a specific set of tests that can be skipped.

@lpinca
Member

lpinca commented Mar 29, 2026

Looking at https://github.com/nodejs/reliability it looks like the issue started on 10/03/2026, so I think it is a regression (or an update in CI runners).

@lpinca
Member

lpinca commented Mar 29, 2026

Unfortunately, any of the test-http(s)-* tests are flaky on windows now. It's not isolated to a specific set of tests.

I think we should revert the commit that caused this (5bebd7e?) then. Anyway, 5bebd7e landed on 16/03/2026, six days later than the first appearance, so I don't think it is the culprit. We need to bisect.

@jasnell
Member Author

jasnell commented Mar 29, 2026

I think we should revert the commit that caused this (5bebd7e?) then. Anyway, 5bebd7e landed on 16/03/2026, six days later than the first appearance, so I don't think it is the culprit. We need to bisect.

Oh nice, I was just about to start digging into the reports to check on that. Before considering a revert: is there anything else critical that the libuv update fixed? I want to make sure we don't reintroduce another problem while trying to solve this one.

@jasnell
Member Author

jasnell commented Mar 29, 2026

OK... the key purpose of this PR was to experiment with a couple of approaches to dealing with the flakiness, but I've been unable to find a single approach that is workable. Either reverting the libuv update or identifying the root cause in libuv and patching it are likely the only paths forward.

@jasnell
Member Author

jasnell commented Mar 29, 2026

Closing this in favor of a revert of the libuv update #62497

@jasnell jasnell closed this Mar 29, 2026