fix: propagate context deadline to readPrelogin to prevent hangs #360

Open
dlevy-msft-sql wants to merge 22 commits into microsoft:main from dlevy-msft-sql:fix/prelogin-context-deadline

dlevy-msft-sql commented Apr 17, 2026

Summary

Propagate the context deadline through prelogin without relying on SetReadDeadline, close the expired-deadline hole before readPrelogin(), and tighten the timeout regression test to match the driver's real cancellation outcomes.

Problem

The current branch reduced toconn.timeout only when time.Until(deadline) > 0. That still leaves a narrow hang window if the context deadline expires after the prelogin write but before the blocking prelogin read starts. With connection timeout=0, readPrelogin() can still block indefinitely.

The timeout regression test was also widened too far and could accept unrelated failures by matching any error string containing timeout.

Fix

  • add preloginTimeout() to compute the effective prelogin read timeout
  • return the context error immediately when the deadline is already expired before readPrelogin()
  • keep the existing timeout reduction and unconditional restore around the prelogin read
  • add direct unit coverage for no deadline, tighter context deadline, shorter connection timeout, and expired deadline
  • tighten TestQueryTimeout to accept only context.DeadlineExceeded, timeout-capable net.Error, SQL error 3980, or the driver's explicit cancel-confirmation failure
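
The timeout computation described above can be sketched roughly as follows. This is a reconstruction from the PR text, not the code in tds.go: the signature of `preloginTimeout`, the helper names `describe` and `cases`, and the convention that a zero duration means "no timeout" are all assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// preloginTimeout is a hypothetical reconstruction from the PR text, not the
// driver's code. connTimeout == 0 is assumed to mean "no connection timeout".
func preloginTimeout(ctx context.Context, connTimeout time.Duration) (time.Duration, error) {
	// Fail fast on a canceled or expired context, with or without a deadline.
	if err := ctx.Err(); err != nil {
		return 0, err
	}
	deadline, ok := ctx.Deadline()
	if !ok {
		return connTimeout, nil // nothing tighter than the connection timeout
	}
	remaining := time.Until(deadline)
	if remaining <= 0 {
		// The deadline passed after the ctx.Err() check above.
		return 0, context.DeadlineExceeded
	}
	if connTimeout == 0 || remaining < connTimeout {
		return remaining, nil // the context deadline is the tighter bound
	}
	return connTimeout, nil
}

// describe names which bound won, so the cases are easy to compare.
func describe(ctx context.Context, connTimeout time.Duration) string {
	d, err := preloginTimeout(ctx, connTimeout)
	switch {
	case err != nil:
		return "error: " + err.Error()
	case d == 0:
		return "unbounded"
	case d == connTimeout:
		return "connection timeout"
	default:
		return "context deadline"
	}
}

// cases runs the four situations listed in the unit coverage bullet.
func cases() []string {
	bg := context.Background()
	tight, cancelTight := context.WithTimeout(bg, time.Second)
	defer cancelTight()
	loose, cancelLoose := context.WithTimeout(bg, time.Hour)
	defer cancelLoose()
	expired, cancelExpired := context.WithDeadline(bg, time.Now().Add(-time.Second))
	defer cancelExpired()
	return []string{
		describe(bg, 30*time.Second),    // no deadline
		describe(tight, 30*time.Second), // tighter context deadline
		describe(loose, time.Second),    // shorter connection timeout
		describe(expired, 0),            // expired deadline
	}
}

func main() {
	for _, c := range cases() {
		fmt.Println(c)
	}
}
```

The expired-deadline case matters most when the connection timeout is 0: without the early error return there would be no bound at all on the blocking read.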

Testing

  • go test -run '^TestPrelogin' -count=1 .
  • go test -run '^TestQueryTimeout$' -count=1 .
  • go test -run '^(TestPreloginTimeout|TestPreloginRespectsContextDeadline|TestQueryTimeout)$' -count=1 .
  • go build .

Coverage

Two defensive branches in connect() remain uncovered by unit tests because they guard nanosecond-window race conditions that cannot be reliably triggered without mocking time or context internals:

  1. return nil, err from preloginTimeout in connect() (tds.go line ~1249): This requires the context deadline to expire in the microsecond gap between writePrelogin completing and preloginTimeout being called. The preloginTimeout function itself has 100% coverage via direct unit tests; only the error-return site in connect() is not hit.

  2. return nil, context.DeadlineExceeded in the net.Error timeout guard (tds.go line ~1290): This fires when the socket read returns a timeout error and the context deadline has passed, but ctx.Err() has not yet propagated (checked a few lines earlier). Triggering this requires the context deadline and socket timeout to expire at the exact same instant with ctx.Err() returning nil on one check and the deadline being past on the next, a sub-microsecond race.

Both are intentional safety nets that convert ambiguous I/O timeouts into clear context errors. They are tested indirectly through the preloginTimeout unit tests (100% covered) and the TestPreloginDeadlineAndSocketTimeoutRace integration test.

Fixes #254

codecov-commenter commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 87.50000% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 96.64%. Comparing base (c10fa99) to head (4d3b4ac).
⚠️ Report is 2 commits behind head on main.

Files with missing lines | Patch % | Lines
tds.go | 87.50% | 6 Missing and 1 partial ⚠️
Additional details and impacted files

@@             Coverage Diff             @@
##             main     #360       +/-   ##
===========================================
+ Coverage   80.66%   96.64%   +15.97%     
===========================================
  Files          35       92       +57     
  Lines        6842    74405    +67563     
===========================================
+ Hits         5519    71907    +66388     
- Misses       1055     2156     +1101     
- Partials      268      342       +74     
Flag | Coverage Δ
unittests | 96.56% <89.74%> (?)

Files with missing lines | Coverage Δ
tds.go | 71.57% <87.50%> (+1.89%) ⬆️

... and 60 files with indirect coverage changes

dlevy-msft-sql added this to the v1.11.0 milestone Apr 17, 2026
dlevy-msft-sql force-pushed the fix/prelogin-context-deadline branch 8 times, most recently from dba4ed8 to 0d5021f, April 24, 2026 23:44
TestLoginTimeout uses a very tight deadline (latency+200ms) which can
cause the cancel-confirmation path to fire, producing a ServerError
instead of a timeout. Apply the same flexible error matching used in
TestQueryTimeout so the test passes regardless of which timing-dependent
error surfaces.

Copilot AI left a comment

Pull request overview

This PR addresses a connect-time hang in the TDS prelogin phase by ensuring the prelogin read is bounded by the caller’s context deadline (even when connection timeout=0), and tightens timeout-related tests to match the driver’s real cancellation behaviors.

Changes:

  • Add preloginTimeout() and use it to temporarily reduce timeoutConn.timeout around readPrelogin() based on the context deadline (including immediate failure for already-expired deadlines).
  • Tighten TestQueryTimeout / TestLoginTimeout assertions to accept only specific, expected timeout/cancel outcomes.
  • Add new unit tests covering preloginTimeout() behavior and a regression test ensuring prelogin honors context deadlines.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File | Description
tds.go | Adds preloginTimeout() and applies context-derived timeout to prevent readPrelogin() hangs.
queries_test.go | Narrows accepted timeout errors to specific expected driver/server/network outcomes.
prelogin_deadline_test.go | Adds direct unit/regression coverage for prelogin timeout computation and deadline enforcement.

Comment thread tds.go
Comment thread tds.go
Comment thread queries_test.go
…ove coverage

- close toconn on preloginTimeout and readPrelogin error paths to prevent
  socket leaks (addresses Copilot review comments)
- extract isAcceptableTimeoutErr to deduplicate TestQueryTimeout and
  TestLoginTimeout error checking
- simplify preloginTimeout expired-deadline path: return ctx.Err() directly
  instead of branching on nil (the dead branch was unreachable)
- add test case for connTimeout==0 to reach 100% coverage on preloginTimeout
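
A shared matcher along these lines might look like the sketch below. The name `isAcceptableTimeoutErr` comes from the commit message, but its body is not shown in this conversation, so everything here is an assumption: `serverError` is a hypothetical stand-in for the driver's SQL error type, `errCancelNotConfirmed` is a hypothetical sentinel for the cancel-confirmation failure, and `fakeTimeout` exists only to exercise the `net.Error` branch.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
)

// serverError is a hypothetical stand-in for the driver's SQL error type;
// only the error number matters for this sketch.
type serverError struct{ Number int32 }

func (e serverError) Error() string { return fmt.Sprintf("sql error %d", e.Number) }

// errCancelNotConfirmed is a hypothetical sentinel for the driver's
// cancel-confirmation failure; the real error value is not shown in the PR.
var errCancelNotConfirmed = errors.New("cancel not confirmed (hypothetical)")

// fakeTimeout is a minimal net.Error that reports a timeout.
type fakeTimeout struct{}

func (fakeTimeout) Error() string   { return "i/o timeout" }
func (fakeTimeout) Timeout() bool   { return true }
func (fakeTimeout) Temporary() bool { return false }

// isAcceptableTimeoutErr accepts only the four outcomes the PR lists and
// deliberately ignores the error text.
func isAcceptableTimeoutErr(err error) bool {
	if errors.Is(err, context.DeadlineExceeded) || errors.Is(err, errCancelNotConfirmed) {
		return true
	}
	var ne net.Error
	if errors.As(err, &ne) && ne.Timeout() {
		return true
	}
	var se serverError
	return errors.As(err, &se) && se.Number == 3980
}

// verdicts evaluates representative errors, including ones that must be rejected.
func verdicts() map[string]bool {
	return map[string]bool{
		"deadline":     isAcceptableTimeoutErr(fmt.Errorf("query: %w", context.DeadlineExceeded)),
		"net timeout":  isAcceptableTimeoutErr(fakeTimeout{}),
		"sql 3980":     isAcceptableTimeoutErr(serverError{Number: 3980}),
		"cancel":       isAcceptableTimeoutErr(errCancelNotConfirmed),
		"sql 208":      isAcceptableTimeoutErr(serverError{Number: 208}),
		"eof":          isAcceptableTimeoutErr(io.EOF),
		"timeout text": isAcceptableTimeoutErr(errors.New("timeout parsing config")),
	}
}

func main() {
	v := verdicts()
	for _, k := range []string{"deadline", "net timeout", "sql 3980", "cancel", "sql 208", "eof", "timeout text"} {
		fmt.Printf("%-12s %v\n", k, v[k])
	}
}
```

Matching on error types and values rather than on the substring "timeout" is what keeps unrelated failures, such as the last case, from passing the test.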

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tds.go
Addresses Copilot review: if the context is already canceled
(e.g. context.WithCancel) but has no deadline, preloginTimeout
previously returned connTimeout with nil error, allowing
readPrelogin to block indefinitely when connTimeout==0.

Now checks ctx.Err() first regardless of deadline presence.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread prelogin_deadline_test.go Outdated
Comment thread tds.go
- Add goroutine to watch ctx.Done() and close the connection to
  unblock readPrelogin when context is canceled without a deadline
- Check ctx.Err() after readPrelogin to avoid using a conn that
  may have been closed by the cancel watcher (race safety)
- Add TestPreloginRespectsContextCancel integration test
- Align test comment with actual 5s bound per review feedback
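
The cancel-watcher pattern from the first two bullets can be illustrated with a minimal sketch. The helper `readUntilCanceled` and the `net.Pipe` demo are hypothetical and are not taken from tds.go; they only show how closing the connection from a goroutine unblocks a pending read, and why the context is re-checked afterwards.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// readUntilCanceled is a hypothetical helper: it performs one blocking Read
// and closes conn if ctx ends first, which is the only way to interrupt a
// read when there is no deadline to set.
func readUntilCanceled(ctx context.Context, conn net.Conn, buf []byte) (int, error) {
	stop := make(chan struct{})
	exited := make(chan struct{})
	go func() {
		defer close(exited)
		select {
		case <-ctx.Done():
			conn.Close() // unblocks the pending Read
		case <-stop:
		}
	}()

	n, err := conn.Read(buf)
	close(stop)
	<-exited // after this point the watcher can no longer touch conn

	// If the context ended, the watcher may have closed conn: report the
	// context error so the caller does not reuse the connection.
	if ctxErr := ctx.Err(); ctxErr != nil {
		return n, ctxErr
	}
	return n, err
}

// demo blocks on a peer that never answers, then cancels after 50ms.
func demo() string {
	client, server := net.Pipe()
	defer client.Close()
	defer server.Close()

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()
	time.AfterFunc(50*time.Millisecond, cancel)

	_, err := readUntilCanceled(ctx, client, make([]byte, 8))
	return err.Error()
}

func main() {
	fmt.Println(demo()) // context canceled
}
```

Waiting for the watcher to exit before checking `ctx.Err()` is a design choice in this sketch: it guarantees that a nil context error means the connection was not closed behind the caller's back.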

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread tds.go Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

Comment thread tds.go Outdated
Comment thread prelogin_deadline_test.go Outdated
Comment thread prelogin_deadline_test.go
Comment thread prelogin_deadline_test.go
Address Copilot review comments:
- Only override read error with context.DeadlineExceeded when the
  error is actually a net timeout, not EOF/connection reset
- Assert net.Error with Timeout()=true in socket timeout test
- Verify server-side close detection in routing redirect test
- Remove TestPreloginTimeoutErrorPath (context expires before dial,
  does not exercise preloginTimeout; covered by unit tests)
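
The first bullet, overriding the read error only for genuine socket timeouts, can be sketched as below. `normalizeReadErr` is a hypothetical name and the logic is an assumption based on the commit message; the actual guard in tds.go may check more conditions. `fakeTimeout` is a stand-in error used only for the demonstration.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"time"
)

// fakeTimeout is a minimal net.Error that reports a timeout.
type fakeTimeout struct{}

func (fakeTimeout) Error() string   { return "i/o timeout" }
func (fakeTimeout) Timeout() bool   { return true }
func (fakeTimeout) Temporary() bool { return false }

// normalizeReadErr is a hypothetical guard: it rewrites a read error to
// context.DeadlineExceeded only when the error is a network timeout and the
// context deadline has already passed.
func normalizeReadErr(ctx context.Context, err error) error {
	var ne net.Error
	if !errors.As(err, &ne) || !ne.Timeout() {
		return err // nil, EOF, connection reset: passed through untouched
	}
	if deadline, ok := ctx.Deadline(); ok && !time.Now().Before(deadline) {
		return context.DeadlineExceeded
	}
	return err // a socket timeout with time still left on the context
}

// outcomes shows the three interesting combinations.
func outcomes() []string {
	expired, cancelExpired := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancelExpired()
	live, cancelLive := context.WithTimeout(context.Background(), time.Hour)
	defer cancelLive()
	return []string{
		normalizeReadErr(expired, fakeTimeout{}).Error(), // timeout + expired deadline
		normalizeReadErr(expired, io.EOF).Error(),        // EOF is never rewritten
		normalizeReadErr(live, fakeTimeout{}).Error(),    // timeout, deadline still ahead
	}
}

func main() {
	for _, o := range outcomes() {
		fmt.Println(o)
	}
}
```

Keeping EOF and connection-reset errors intact preserves the real cause when the server drops the connection, even if the deadline happens to have expired by then.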

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread tds.go
Comment thread prelogin_deadline_test.go
Replaces goodPreloginSequence (which calls t.Fatal) with inline code using
t.Errorf+return. Adds t.Errorf on Accept failure.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

Comment thread prelogin_deadline_test.go
Comment thread prelogin_deadline_test.go Outdated

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

- Add preloginTimeout test for ctxTimeout <= 0 race condition using
  custom context (pastDeadlineContext) to reach the defensive branch
  where Err() returns nil but time.Until(deadline) is negative
- Add TestConnectSuccessfulPreloginAndLogin with mock TDS server that
  completes the full prelogin+login handshake, covering the success
  path (toconn = nil) without requiring a real SQL Server instance
- Add TestPreloginDeadlineAndSocketTimeoutRace with matched timeouts
  to exercise the code path where socket timeout and context deadline
  fire nearly simultaneously
- Extract sendLoginResponse helper for reuse in mock server tests

preloginTimeout coverage: 90.9% -> 100%
connect() success path now covered by unit tests
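
A custom context of the kind named in the first bullet could look like the sketch below. The type name `pastDeadlineContext` comes from the commit message, but this definition and the helper `inspect` are assumptions. The point is the contrast with a real expired context, which always reports an error and therefore cannot reach the `time.Until(deadline) <= 0` branch on its own.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// pastDeadlineContext mimics the test double named in the commit message;
// this definition is an assumption. Err() is inherited from the embedded
// context and stays nil, while Deadline() reports an instant in the past.
type pastDeadlineContext struct {
	context.Context
	deadline time.Time
}

func (c pastDeadlineContext) Deadline() (time.Time, bool) { return c.deadline, true }

// inspect reports whether Err() is still nil and whether the deadline has passed.
func inspect(ctx context.Context) (errIsNil, deadlinePassed bool) {
	deadline, ok := ctx.Deadline()
	return ctx.Err() == nil, ok && time.Until(deadline) <= 0
}

// fake builds the test double: nil error, deadline one second ago.
func fake() (bool, bool) {
	return inspect(pastDeadlineContext{
		Context:  context.Background(),
		deadline: time.Now().Add(-time.Second),
	})
}

// real builds a standard context with the same past deadline.
func real() (bool, bool) {
	ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(-time.Second))
	defer cancel()
	return inspect(ctx)
}

func main() {
	fmt.Println(fake()) // true true
	fmt.Println(real()) // false true
}
```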

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread prelogin_deadline_test.go Outdated
…Login

Inline the prelogin/login handshake logic instead of calling
goodPreloginSequence, which uses t.Fatal. Calling t.Fatal from a
non-test goroutine only terminates that goroutine via runtime.Goexit,
which could leave the test hanging on <-serverErr.
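
The safer shape for a mock-server goroutine can be sketched as follows. The names `reporter`, `recorder`, and `serveHandshake` are hypothetical; `reporter` stands in for the `Errorf` method of `*testing.T` so the sketch runs outside `go test`, and a deferred `close(done)` is used here in place of the `serverErr` channel from the commit message.

```go
package main

import (
	"fmt"
	"io"
	"net"
)

// reporter is the part of *testing.T this sketch needs; a real test would
// pass t itself.
type reporter interface {
	Errorf(format string, args ...interface{})
}

// recorder is a stand-in reporter so the sketch runs outside "go test".
type recorder struct{ msgs []string }

func (r *recorder) Errorf(format string, args ...interface{}) {
	r.msgs = append(r.msgs, fmt.Sprintf(format, args...))
}

// serveHandshake is a hypothetical mock-server body. It never calls Fatal:
// it records the failure and returns, and the deferred close always signals
// the waiting test.
func serveHandshake(t reporter, conn net.Conn, done chan<- struct{}) {
	defer close(done)
	defer conn.Close()

	header := make([]byte, 8) // a TDS packet header is 8 bytes
	if _, err := io.ReadFull(conn, header); err != nil {
		t.Errorf("read prelogin header: %v", err)
		return
	}
	// A fuller mock would parse the packet and write a response here.
}

// demo simulates a client that gives up before sending anything.
func demo() []string {
	client, server := net.Pipe()
	rec := &recorder{}
	done := make(chan struct{})
	go serveHandshake(rec, server, done)

	client.Close()
	<-done // returns promptly because done is always closed
	return rec.msgs
}

func main() {
	fmt.Println(demo())
}
```

Recording the failure with `Errorf` and returning keeps the failure visible to the test runner while leaving the main test goroutine in charge of ending the test.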

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comment thread prelogin_deadline_test.go Outdated
preloginTimeout checks ctx.Err() first, then time.Until(deadline),
not the other way around.

Copilot AI left a comment

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated no new comments.

Development

Successfully merging this pull request may close these issues.

connect() hangs when server does not respond to pre-login

3 participants