fix: refresh upstream tokens transparently instead of forcing re-auth#4036
fix: refresh upstream tokens transparently instead of forcing re-auth#4036aron-muon wants to merge 10 commits intostacklok:mainfrom
Conversation
efdbe42 to
cd44bbf
Compare
cd44bbf to
0dd2c1c
Compare
The upstreamswap middleware returned 401 when upstream access tokens expired, forcing users through full re-authentication even though valid refresh tokens existed in storage. This happened because: 1. Redis/memory storage TTL was set to access token expiry, deleting the entry (and refresh token) when the access token expired 2. Storage returned nil on ErrExpired, discarding the refresh token 3. The middleware had no refresh path — only 401 Fix all three layers: - Add DefaultRefreshTokenTTL (30 days) to storage entry TTL so refresh tokens survive past access token expiry - Return token data alongside ErrExpired from storage so callers can use the refresh token - Add UpstreamTokenRefresher interface and implementation that wraps the upstream OAuth2Provider and storage - Plumb the refresher through Server → EmbeddedAuthServer → Runner → MiddlewareRunner - Update upstreamswap middleware to attempt refresh before returning 401, only requiring re-auth when the refresh token itself fails Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
0dd2c1c to
ab9807b
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #4036 +/- ##
==========================================
+ Coverage 68.61% 68.65% +0.03%
==========================================
Files 445 446 +1
Lines 45374 45462 +88
==========================================
+ Hits 31135 31210 +75
- Misses 11841 11844 +3
- Partials 2398 2408 +10 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
Add comprehensive tests for RefreshAndStore (6 cases) and middleware refresh paths (4 cases: successful refresh, failed refresh, no refresh token, defense-in-depth expired-without-error). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
hey @aron-muon thanks a lot for the patches this was actually something we had on our roadmap so the feature makes complete sense 🙏🏻 I'll make sure to review the PR as soon as possible. |
jhrozek
left a comment
There was a problem hiding this comment.
Nice work on transparent token refresh! A few minor comments below.
| stor storage.UpstreamTokenStorage, | ||
| sessionID string, | ||
| refresherGetter RefresherGetter, | ||
| ) (*storage.UpstreamTokens, error) { |
There was a problem hiding this comment.
Not a blocker for this PR, but worth flagging: when multiple concurrent requests hit getOrRefreshUpstreamTokens with the same expired session, each one will independently call RefreshAndStore. With providers that rotate refresh tokens (issuing a single-use replacement), all but the first caller will use a stale refresh token and fail.
A singleflight.Group keyed on sessionID would collapse concurrent refreshes into one. I've actually been working on something similar in a parallel branch, so I'll address this in a follow-up rather than piling onto this PR.
There was a problem hiding this comment.
Nice call, added a singleflight.Group keyed on sessionID to deduplicate concurrent refreshes. Scoped per middleware instance to avoid test interference and also added TestSingleFlightRefresh_ConcurrentRequests to verify only one RefreshAndStore fires with 10 concurrent expired-session requests
jhrozek
left a comment
There was a problem hiding this comment.
Second pass: a few more findings from an adversarial review.
|
This is great work. I submitted a bunch of nits but take them really as nits, none of htem are really blocking, if there's one I'd love to have fixed before merging, it's the tokens.IsExpired, feel free to just dismiss the rest. |
Co-authored-by: Jakub Hrozek <jakub.hrozek@posteo.se>
Co-authored-by: Jakub Hrozek <jakub.hrozek@posteo.se>
- Fix step numbering: renumber step 6 → 5 after step 5 removal - Update redis integration test: assert returned token data is non-nil on ErrExpired, consistent with the unit test contract - Fix test closures: pass subtest t to setupStorage/setupProvider to ensure assertion failures are attributed to the correct subtest Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
16bddba to
d1415a7
Compare
There was a problem hiding this comment.
Large PR Detected
This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.
How to unblock this PR:
Add a section to your PR description with the following format:
## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformationAlternative:
Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.
See our Contributing Guidelines for more details.
This review will be automatically dismissed once you add the justification section.
Wrap upstream token refresh in singleflight.Group keyed on sessionID to collapse concurrent refreshes into one call. Prevents providers with single-use refresh tokens from failing all but the first concurrent caller. Added TestSingleFlightRefresh_ConcurrentRequests to verify the fix. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Large PR justification has been provided. Thank you!
|
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review. |
Thanks for the review! I included a fix for each nit and also added a fix for the multiple concurrent requests issue. |
Summary
Users are forced to fully re-authenticate with upstream OAuth providers every time the upstream access token expires (controlled by
accessTokenLifespan), even though valid refresh tokens exist in storage.The root cause is that upstream access tokens and refresh tokens are stored together in a single storage entry, with the entry's TTL/expiry set to the access token's expiry. When the access token expires, the entry is deleted (Redis) or marked expired (memory) — losing the refresh token, which is typically valid for 30-90 days or longer depending on the provider. The
upstreamswapmiddleware has no refresh path and returns 401, forcing full re-authentication.This affects both Redis and in-memory storage backends — Redis deletes the key via TTL, and memory storage's cleanup goroutine removes the expired entry.
Type of change
Root cause
The storage model bundles access + refresh tokens in a single entry (
UpstreamTokensstruct). The entry TTL is derived from the access token'sExpiresAt, so when the access token expires:GetUpstreamTokensreturnsnil, ErrExpired— discarding the token dataIdeally, access and refresh tokens would be stored separately with independent TTLs matching their actual lifetimes. This fix extends the bundled entry's TTL as a pragmatic solution; a future refactor could separate them for cleaner lifecycle management.
Changes
Storage layer (
pkg/authserver/storage/)DefaultRefreshTokenTTL(30 days) in both Redis and memory storage, so refresh tokens survive past access token expiryGetUpstreamTokensto return token data alongsideErrExpired(instead ofnil) so callers can use the refresh tokenExpiresAt(access token expiry) rather than the entry'sexpiresAt(storage TTL) for the expired checkToken refresher (
pkg/authserver/refresher.go)UpstreamTokenRefresherinterface instorage/types.goupstream.OAuth2Provider.RefreshTokens()+UpstreamTokenStorage.StoreUpstreamTokens()Plumbing
Server→EmbeddedAuthServer→Runner→MiddlewareRunnerusing the same lazy accessor pattern asGetUpstreamTokenStorageMiddleware (
pkg/auth/upstreamswap/)getOrRefreshUpstreamTokenshelper to keep cyclomatic complexity under lint thresholdProduction validation
Deployed to a production cluster with Redis (AWS Valkey) storage (we use a sentinel emulator which basically just returns the Valkey URL in each case. One of the little clever ways we could use Valkey). All four upstream providers successfully refreshed tokens transparently — no user re-authentication required:
cf.mcp.atlassian.com/v1/tokenapp.asana.com/-/oauth_tokenslack-gov.com/api/oauth.v2.accessoauth2.googleapis.com/tokenRedis TTLs confirmed updated to ~30 days (previously ~1 hour). GitHub has not yet expired its 8-hour access token but uses the same code path.
Test plan
ErrExpiredDoes this introduce a user-facing change?
Yes — upstream OAuth sessions now persist beyond the access token lifetime. Users will no longer be forced to re-authenticate as long as their upstream refresh token is valid (typically 30 days to indefinite depending on the provider).
Large PR Justification
Generated with Claude Code