
fix: refresh upstream tokens transparently instead of forcing re-auth #4036

Open
aron-muon wants to merge 10 commits into stacklok:main from aron-muon:aron/retoken-issue

@aron-muon aron-muon commented Mar 6, 2026

Summary

Users are forced to fully re-authenticate with upstream OAuth providers every time the upstream access token expires (controlled by accessTokenLifespan), even though valid refresh tokens exist in storage.

The root cause is that upstream access tokens and refresh tokens are stored together in a single storage entry, with the entry's TTL/expiry set to the access token's expiry. When the access token expires, the entry is deleted (Redis) or marked expired (memory) — losing the refresh token, which is typically valid for 30-90 days or longer depending on the provider. The upstreamswap middleware has no refresh path and returns 401, forcing full re-authentication.

This affects both Redis and in-memory storage backends — Redis deletes the key via TTL, and memory storage's cleanup goroutine removes the expired entry.

Type of change

  • Bug fix

Root cause

The storage model bundles access + refresh tokens in a single entry (UpstreamTokens struct). The entry TTL is derived from the access token's ExpiresAt, so when the access token expires:

  1. Storage deletes or expires the entry — losing the still-valid refresh token
  2. GetUpstreamTokens returns nil, ErrExpired — discarding the token data
  3. Middleware returns 401 — no refresh path exists

Ideally, access and refresh tokens would be stored separately with independent TTLs matching their actual lifetimes. This fix extends the bundled entry's TTL as a pragmatic solution; a future refactor could separate them for cleaner lifecycle management.

Changes

Storage layer (pkg/authserver/storage/)

  • Extended upstream token entry TTL by DefaultRefreshTokenTTL (30 days) in both Redis and memory storage, so refresh tokens survive past access token expiry
  • Changed GetUpstreamTokens to return token data alongside ErrExpired (instead of nil) so callers can use the refresh token
  • Memory storage now checks the token's own ExpiresAt (access token expiry) rather than the entry's expiresAt (storage TTL) for the expired check
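A minimal sketch of the storage behaviour described above, assuming a simplified in-memory store. The type and field names other than UpstreamTokens, GetUpstreamTokens, StoreUpstreamTokens, ErrExpired and DefaultRefreshTokenTTL are hypothetical, ErrNotFound is an illustrative sentinel, and the real methods take more parameters (for example a context):

```go
package main

import (
	"errors"
	"sync"
	"time"
)

// DefaultRefreshTokenTTL mirrors the 30-day extension described in this PR.
const DefaultRefreshTokenTTL = 30 * 24 * time.Hour

var (
	// ErrExpired reports that the access token is past its expiry.
	ErrExpired = errors.New("upstream tokens expired")
	// ErrNotFound is a hypothetical sentinel for a missing or evicted entry.
	ErrNotFound = errors.New("upstream tokens not found")
)

// UpstreamTokens is a trimmed-down stand-in for the real struct.
type UpstreamTokens struct {
	AccessToken  string
	RefreshToken string
	ExpiresAt    time.Time // access token expiry
}

// entry pairs the tokens with the storage TTL (hypothetical layout).
type entry struct {
	tokens    UpstreamTokens
	expiresAt time.Time // access token expiry + DefaultRefreshTokenTTL
}

type memoryStore struct {
	mu      sync.Mutex
	entries map[string]entry
}

func newMemoryStore() *memoryStore {
	return &memoryStore{entries: make(map[string]entry)}
}

// StoreUpstreamTokens keeps the entry alive past the access token expiry
// so the refresh token is still available when it is needed.
func (s *memoryStore) StoreUpstreamTokens(sessionID string, t UpstreamTokens) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.entries[sessionID] = entry{
		tokens:    t,
		expiresAt: t.ExpiresAt.Add(DefaultRefreshTokenTTL),
	}
}

// GetUpstreamTokens returns the token data alongside ErrExpired when only
// the access token has expired, so the caller can use the refresh token.
func (s *memoryStore) GetUpstreamTokens(sessionID string) (*UpstreamTokens, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	now := time.Now()
	e, ok := s.entries[sessionID]
	if !ok || now.After(e.expiresAt) {
		delete(s.entries, sessionID) // storage TTL passed: the entry is gone
		return nil, ErrNotFound
	}
	t := e.tokens
	if now.After(t.ExpiresAt) {
		return &t, ErrExpired // expired access token, data still returned
	}
	return &t, nil
}

// demoExpiredRead stores tokens whose access token expired an hour ago
// and reads them back.
func demoExpiredRead() (refreshToken string, err error) {
	store := newMemoryStore()
	store.StoreUpstreamTokens("session-1", UpstreamTokens{
		AccessToken:  "old-access",
		RefreshToken: "refresh-1",
		ExpiresAt:    time.Now().Add(-time.Hour),
	})
	tokens, err := store.GetUpstreamTokens("session-1")
	if tokens == nil {
		return "", err
	}
	return tokens.RefreshToken, err
}
```

In this sketch, demoExpiredRead returns the refresh token together with ErrExpired, which is the contract the updated storage tests check for.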

Token refresher (pkg/authserver/refresher.go)

  • New UpstreamTokenRefresher interface in storage/types.go
  • Implementation wraps upstream.OAuth2Provider.RefreshTokens() + UpstreamTokenStorage.StoreUpstreamTokens()
  • Preserves binding fields (ProviderID, UserID, UpstreamSubject, ClientID) across refresh
  • Handles refresh token rotation (keeps old refresh token if provider doesn't issue a new one)
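The refresh-and-store flow above could look roughly like the following. This is an illustrative sketch, not the code in refresher.go: the provider and tokenStore interfaces, the refreshResult type, and the fakes are assumptions made to keep the example self-contained, and the real RefreshTokens and StoreUpstreamTokens signatures may differ.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// UpstreamTokens is a simplified stand-in for the stored struct.
type UpstreamTokens struct {
	ProviderID, UserID, UpstreamSubject, ClientID string // binding fields
	AccessToken, RefreshToken                     string
	ExpiresAt                                     time.Time
}

// refreshResult is a hypothetical provider response. RefreshToken is empty
// when the provider does not rotate refresh tokens.
type refreshResult struct {
	AccessToken, RefreshToken string
	ExpiresAt                 time.Time
}

// provider and tokenStore are assumed, minimal interfaces.
type provider interface {
	RefreshTokens(ctx context.Context, refreshToken string) (*refreshResult, error)
}

type tokenStore interface {
	StoreUpstreamTokens(ctx context.Context, sessionID string, t *UpstreamTokens) error
}

// refreshAndStore exchanges the refresh token, carries the binding fields
// over, and persists the result.
func refreshAndStore(ctx context.Context, p provider, s tokenStore,
	sessionID string, expired *UpstreamTokens) (*UpstreamTokens, error) {
	if expired == nil || expired.RefreshToken == "" {
		return nil, errors.New("no refresh token available")
	}
	res, err := p.RefreshTokens(ctx, expired.RefreshToken)
	if err != nil {
		return nil, err
	}
	fresh := *expired // copying keeps ProviderID, UserID, UpstreamSubject, ClientID
	fresh.AccessToken = res.AccessToken
	fresh.ExpiresAt = res.ExpiresAt
	if res.RefreshToken != "" {
		fresh.RefreshToken = res.RefreshToken // provider rotated the token
	}
	if err := s.StoreUpstreamTokens(ctx, sessionID, &fresh); err != nil {
		return nil, err
	}
	return &fresh, nil
}

// fakeProvider and mapStore are test doubles for the demo below.
type fakeProvider struct{ rotate bool }

func (f fakeProvider) RefreshTokens(_ context.Context, _ string) (*refreshResult, error) {
	res := &refreshResult{AccessToken: "new-access", ExpiresAt: time.Now().Add(time.Hour)}
	if f.rotate {
		res.RefreshToken = "refresh-2"
	}
	return res, nil
}

type mapStore map[string]*UpstreamTokens

func (m mapStore) StoreUpstreamTokens(_ context.Context, id string, t *UpstreamTokens) error {
	m[id] = t
	return nil
}

// demoRefresh runs one refresh against the fakes.
func demoRefresh(rotate bool) (*UpstreamTokens, error) {
	expired := &UpstreamTokens{
		ProviderID: "atlassian", UserID: "user-1",
		AccessToken: "old-access", RefreshToken: "refresh-1",
		ExpiresAt: time.Now().Add(-time.Minute),
	}
	return refreshAndStore(context.Background(), fakeProvider{rotate: rotate},
		mapStore{}, "session-1", expired)
}
```

With rotate set to false the old refresh token is kept; with rotate set to true the stored entry picks up the replacement.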

Plumbing

  • Exposed refresher through Server → EmbeddedAuthServer → Runner → MiddlewareRunner using the same lazy accessor pattern as GetUpstreamTokenStorage

Middleware (pkg/auth/upstreamswap/)

  • Middleware now attempts transparent refresh before returning 401
  • Extracted getOrRefreshUpstreamTokens helper to keep cyclomatic complexity under lint threshold
  • Only requires re-auth when the refresh token itself is invalid/revoked
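As a rough sketch of the helper's decision flow, assuming simplified interfaces: the real getOrRefreshUpstreamTokens takes a RefresherGetter rather than a refresher directly, and the tokenStorage and tokenRefresher interfaces, the IsExpired signature, and the fakes below are assumptions for illustration only.

```go
package main

import (
	"context"
	"errors"
	"time"
)

var ErrExpired = errors.New("upstream tokens expired")

type UpstreamTokens struct {
	AccessToken, RefreshToken string
	ExpiresAt                 time.Time
}

// IsExpired uses a hypothetical signature; the real method may differ.
func (t *UpstreamTokens) IsExpired(now time.Time) bool { return !now.Before(t.ExpiresAt) }

type tokenStorage interface {
	GetUpstreamTokens(ctx context.Context, sessionID string) (*UpstreamTokens, error)
}

type tokenRefresher interface {
	RefreshAndStore(ctx context.Context, sessionID string, expired *UpstreamTokens) (*UpstreamTokens, error)
}

// getOrRefreshUpstreamTokens returns valid tokens, refreshing them first if
// the access token has expired. An error here means the caller answers 401.
func getOrRefreshUpstreamTokens(ctx context.Context, stor tokenStorage,
	sessionID string, refresher tokenRefresher) (*UpstreamTokens, error) {
	tokens, err := stor.GetUpstreamTokens(ctx, sessionID)
	if err != nil && !errors.Is(err, ErrExpired) {
		return nil, err // not found or storage failure: nothing to refresh
	}
	// Defense in depth: re-check expiry even when storage reported no error.
	if err == nil && tokens != nil && !tokens.IsExpired(time.Now()) {
		return tokens, nil
	}
	if tokens == nil || tokens.RefreshToken == "" || refresher == nil {
		return nil, ErrExpired // no way to refresh: re-auth is required
	}
	return refresher.RefreshAndStore(ctx, sessionID, tokens)
}

// Test doubles for the demo below.
type fakeStorage struct {
	tokens *UpstreamTokens
	err    error
}

func (f fakeStorage) GetUpstreamTokens(context.Context, string) (*UpstreamTokens, error) {
	return f.tokens, f.err
}

type fakeRefresher struct{ calls int }

func (f *fakeRefresher) RefreshAndStore(_ context.Context, _ string, old *UpstreamTokens) (*UpstreamTokens, error) {
	f.calls++
	return &UpstreamTokens{
		AccessToken:  "new-access",
		RefreshToken: old.RefreshToken,
		ExpiresAt:    time.Now().Add(time.Hour),
	}, nil
}

// demoGetOrRefresh simulates an expired session with the given refresh token.
func demoGetOrRefresh(refreshToken string) (accessToken string, refreshCalls int, err error) {
	stor := fakeStorage{
		tokens: &UpstreamTokens{
			AccessToken:  "old-access",
			RefreshToken: refreshToken,
			ExpiresAt:    time.Now().Add(-time.Minute),
		},
		err: ErrExpired,
	}
	ref := &fakeRefresher{}
	tokens, err := getOrRefreshUpstreamTokens(context.Background(), stor, "session-1", ref)
	if err != nil {
		return "", ref.calls, err
	}
	return tokens.AccessToken, ref.calls, nil
}
```

Calling demoGetOrRefresh with a non-empty refresh token yields the new access token after exactly one refresh call, while an empty refresh token yields ErrExpired without calling the refresher.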

Production validation

Deployed to a production cluster with Redis (AWS Valkey) storage. We use a sentinel emulator that simply returns the Valkey URL in each case, which is a small trick that lets us use Valkey here. All four upstream providers successfully refreshed tokens transparently, with no user re-authentication required:

| Provider | Token Endpoint | Access Token Lifetime | Refresh Token Rotated |
|---|---|---|---|
| Atlassian | cf.mcp.atlassian.com/v1/token | 1 hour | Yes |
| Asana | app.asana.com/-/oauth_token | 1 hour | Yes |
| Slack (GovSlack) | slack-gov.com/api/oauth.v2.access | 12 hours | Yes |
| Google | oauth2.googleapis.com/token | 1 hour | Yes |

Redis TTLs confirmed updated to ~30 days (previously ~1 hour). GitHub has not yet expired its 8-hour access token but uses the same code path.

Test plan

  • Updated storage tests to verify tokens returned alongside ErrExpired
  • Updated cleanup tests for extended TTL
  • Updated middleware tests with refresher parameter
  • All existing unit tests pass
  • Build clean, golangci-lint clean
  • Deployed to production — verified transparent refresh for Atlassian, Asana, Slack, and Google

Does this introduce a user-facing change?

Yes — upstream OAuth sessions now persist beyond the access token lifetime. Users will no longer be forced to re-authenticate as long as their upstream refresh token is valid (typically 30 days to indefinite depending on the provider).

Large PR Justification

  • Multiple related changes that would break if separated

Generated with Claude Code

@github-actions github-actions bot added the size/M Medium PR: 300-599 lines changed label Mar 6, 2026
@aron-muon aron-muon changed the title from "Aron/retoken issue" to "Refresh upstream tokens transparently instead of forcing re-auth" Mar 6, 2026
@aron-muon aron-muon changed the title from "Refresh upstream tokens transparently instead of forcing re-auth" to "fix: refresh upstream tokens transparently instead of forcing re-auth" Mar 6, 2026
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from efdbe42 to cd44bbf March 6, 2026 14:07
@aron-muon aron-muon marked this pull request as ready for review March 6, 2026 14:09
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from cd44bbf to 0dd2c1c March 6, 2026 14:11
The upstreamswap middleware returned 401 when upstream access tokens
expired, forcing users through full re-authentication even though
valid refresh tokens existed in storage. This happened because:

1. Redis/memory storage TTL was set to access token expiry, deleting
   the entry (and refresh token) when the access token expired
2. Storage returned nil on ErrExpired, discarding the refresh token
3. The middleware had no refresh path — only 401

Fix all three layers:

- Add DefaultRefreshTokenTTL (30 days) to storage entry TTL so
  refresh tokens survive past access token expiry
- Return token data alongside ErrExpired from storage so callers
  can use the refresh token
- Add UpstreamTokenRefresher interface and implementation that wraps
  the upstream OAuth2Provider and storage
- Plumb the refresher through Server → EmbeddedAuthServer → Runner →
  MiddlewareRunner
- Update upstreamswap middleware to attempt refresh before returning
  401, only requiring re-auth when the refresh token itself fails

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from 0dd2c1c to ab9807b March 6, 2026 14:12
codecov bot commented Mar 6, 2026

Codecov Report

❌ Patch coverage is 81.81818% with 20 lines in your changes missing coverage. Please review.
✅ Project coverage is 68.65%. Comparing base (85c5f3e) to head (2f1c6dd).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| pkg/auth/upstreamswap/middleware.go | 87.23% | 4 Missing and 2 partials ⚠️ |
| pkg/authserver/server_impl.go | 0.00% | 6 Missing ⚠️ |
| pkg/runner/runner.go | 0.00% | 5 Missing ⚠️ |
| pkg/authserver/runner/embeddedauthserver.go | 0.00% | 2 Missing ⚠️ |
| pkg/authserver/storage/redis.go | 88.88% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4036      +/-   ##
==========================================
+ Coverage   68.61%   68.65%   +0.03%     
==========================================
  Files         445      446       +1     
  Lines       45374    45462      +88     
==========================================
+ Hits        31135    31210      +75     
- Misses      11841    11844       +3     
- Partials     2398     2408      +10     


Add comprehensive tests for RefreshAndStore (6 cases) and middleware
refresh paths (4 cases: successful refresh, failed refresh, no refresh
token, defense-in-depth expired-without-error).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Mar 6, 2026
jhrozek commented Mar 7, 2026

Hey @aron-muon, thanks a lot for the patches. This was actually something we had on our roadmap, so the feature makes complete sense 🙏🏻

I'll make sure to review the PR as soon as possible.

@jhrozek jhrozek left a comment

Nice work on transparent token refresh! A few minor comments below.

stor storage.UpstreamTokenStorage,
sessionID string,
refresherGetter RefresherGetter,
) (*storage.UpstreamTokens, error) {
Not a blocker for this PR, but worth flagging: when multiple concurrent requests hit getOrRefreshUpstreamTokens with the same expired session, each one will independently call RefreshAndStore. With providers that rotate refresh tokens (issuing a single-use replacement), all but the first caller will use a stale refresh token and fail.

A singleflight.Group keyed on sessionID would collapse concurrent refreshes into one. I've actually been working on something similar in a parallel branch, so I'll address this in a follow-up rather than piling onto this PR.

Contributor Author

Nice call. I added a singleflight.Group keyed on sessionID to deduplicate concurrent refreshes. It is scoped per middleware instance to avoid test interference, and I also added TestSingleFlightRefresh_ConcurrentRequests to verify that only one RefreshAndStore fires with 10 concurrent expired-session requests.
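To illustrate the deduplication idea, here is a dependency-free sketch. It uses a small hand-written stand-in (flightGroup) rather than golang.org/x/sync/singleflight, so flightGroup, its waiting helper, and demoConcurrentRefresh are all hypothetical names. The waiting helper exists only so the demo can hold the refresh open until every caller has joined.

```go
package main

import (
	"sync"
	"sync/atomic"
	"time"
)

type UpstreamTokens struct{ AccessToken string }

// call tracks one in-flight refresh for a key.
type call struct {
	done    chan struct{} // closed when the refresh finishes
	waiters int           // callers that joined instead of refreshing
	tokens  *UpstreamTokens
	err     error
}

// flightGroup is a minimal stand-in for a singleflight group.
type flightGroup struct {
	mu    sync.Mutex
	calls map[string]*call
}

// Do runs fn once per key at a time; concurrent callers share its result.
func (g *flightGroup) Do(key string, fn func() (*UpstreamTokens, error)) (*UpstreamTokens, error) {
	g.mu.Lock()
	if c, ok := g.calls[key]; ok {
		c.waiters++
		g.mu.Unlock()
		<-c.done // wait for the leader's result
		return c.tokens, c.err
	}
	if g.calls == nil {
		g.calls = make(map[string]*call)
	}
	c := &call{done: make(chan struct{})}
	g.calls[key] = c
	g.mu.Unlock()

	c.tokens, c.err = fn() // only the leader calls the provider

	g.mu.Lock()
	delete(g.calls, key)
	g.mu.Unlock()
	close(c.done)
	return c.tokens, c.err
}

// waiting reports how many callers are parked on key (demo helper only).
func (g *flightGroup) waiting(key string) int {
	g.mu.Lock()
	defer g.mu.Unlock()
	if c, ok := g.calls[key]; ok {
		return c.waiters
	}
	return 0
}

// demoConcurrentRefresh fires n concurrent refreshes for one session and
// reports how many times the refresh function ran and whether every caller
// received the same token.
func demoConcurrentRefresh(n int) (refreshCalls int64, sameToken bool) {
	var g flightGroup
	var count int64
	release := make(chan struct{})
	refresh := func() (*UpstreamTokens, error) {
		atomic.AddInt64(&count, 1)
		<-release // keep the flight open until all callers have joined
		return &UpstreamTokens{AccessToken: "new-access"}, nil
	}

	results := make([]*UpstreamTokens, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i], _ = g.Do("session-1", refresh)
		}(i)
	}
	for g.waiting("session-1") < n-1 {
		time.Sleep(time.Millisecond)
	}
	close(release)
	wg.Wait()

	sameToken = true
	for _, r := range results {
		if r == nil || r != results[0] {
			sameToken = false
		}
	}
	return atomic.LoadInt64(&count), sameToken
}
```

With 10 callers, the refresh function runs once and all callers share the same result, which is the property that protects providers issuing single-use refresh tokens.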

@jhrozek jhrozek left a comment

Second pass: a few more findings from an adversarial review.

jhrozek commented Mar 8, 2026

This is great work. I submitted a bunch of nits, but take them really as nits; none of them are really blocking. If there's one I'd love to have fixed before merging, it's the tokens.IsExpired. Feel free to just dismiss the rest.

aron-muon and others added 2 commits March 9, 2026 10:46
Co-authored-by: Jakub Hrozek <jakub.hrozek@posteo.se>
Co-authored-by: Jakub Hrozek <jakub.hrozek@posteo.se>
- Fix step numbering: renumber step 6 → 5 after step 5 removal
- Update redis integration test: assert returned token data is non-nil
  on ErrExpired, consistent with the unit test contract
- Fix test closures: pass subtest t to setupStorage/setupProvider to
  ensure assertion failures are attributed to the correct subtest

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@aron-muon aron-muon force-pushed the aron/retoken-issue branch from 16bddba to d1415a7 March 9, 2026 11:12
@github-actions github-actions bot added size/XL Extra large PR: 1000+ lines changed and removed size/L Large PR: 600-999 lines changed labels Mar 9, 2026
@github-actions github-actions bot left a comment

Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

Wrap upstream token refresh in singleflight.Group keyed on sessionID
to collapse concurrent refreshes into one call. Prevents providers with
single-use refresh tokens from failing all but the first concurrent
caller.

Added TestSingleFlightRefresh_ConcurrentRequests to verify the fix.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions bot removed the size/XL Extra large PR: 1000+ lines changed label Mar 9, 2026
@github-actions github-actions bot dismissed their stale review March 9, 2026 11:20

Large PR justification has been provided. Thank you!

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Mar 9, 2026
github-actions bot commented Mar 9, 2026

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@aron-muon
Contributor Author

> This is great work. I submitted a bunch of nits, but take them really as nits; none of them are really blocking. If there's one I'd love to have fixed before merging, it's the tokens.IsExpired. Feel free to just dismiss the rest.

Thanks for the review! I included a fix for each nit and also added a fix for the multiple concurrent requests issue.

@aron-muon aron-muon requested a review from jhrozek March 9, 2026 11:21