Skip to content

Conversation

@jasonlin45
Copy link

Description

Helps avoid issues like:

Error connecting: timed out trying to acquire lease after 1m30s: /home/jason/.cache/snowflake/credential_cache.lease
exit status 1

Under high concurrency scenarios such as a dbt project executing with aggressive concurrency, our credentials cache is getting absolutely HAMMERED with requests. Connections at the moment require exclusive access via lease to both read and write from the cache.

However, architecting an inter-process read-write lock for credentials via the file system is a task that has great complexity and therefore risk.

This solution adds a simple intraprocess cache layer that allows us to cache in memory within a process with a pass through to the file based cache. This way, only cache hydration / token refresh (with jitter) will lock our cache file.
It does not resolve high multi-processing cache contention, but it resolves cache contention due to aggressive multi-threading.

Basic Perf Testing

The following tests was conducted with 100 concurrent threads (goroutines) within a single process.

Current State

First attempt:

Error connecting: timed out trying to acquire lease after 1m30s: /home/jason/.cache/snowflake/credential_cache.lease
exit status 1

Second attempt:

=== Summary Statistics over 100 concurrent trials ===
First Connection:
  Min:      490.795489ms
  Q1:       606.904188ms
  Median:   861.536127ms
  Q3:       15.507705703s
  Max:      1m14.859322079s
  Average:  10.128332142s

Second Connection:
  Min:      164.254038ms
  Q1:       623.820647ms
  Median:   2.847455471s
  Q3:       20.431970229s
  Max:      1m19.480456457s
  Average:  13.111362603s
This PR:
=== Summary Statistics over 100 concurrent trials ===
First Connection:
  Min:      176.643098ms
  Q1:       220.754675ms
  Median:   281.826397ms
  Q3:       667.38235ms
  Max:      869.882234ms
  Average:  435.704222ms

Second Connection:
  Min:      112.619557ms
  Q1:       148.539442ms
  Median:   166.516176ms
  Q3:       264.936479ms
  Max:      535.121657ms
  Average:  224.196105ms

Based on this test... a monumental performance improvement and increased stability.

}

func deleteCredentialWithLease(tokenSpec *secureTokenSpec) error {
lease, err := credentialsStorage.acquireLease()
Copy link

@VersusFacit VersusFacit Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to think more about this but my initial concern is multiple connections each calling delete when only one should.

the original paths in the auth function one layer up have invariants based on a lease being acquired. We think to rethink those invariants under this new setup to make sure the behavior is correct.


var credentialsStorage = newSecureStorageManager()

// Helper fast-paths to minimize lease contention for common operations.
Copy link

@VersusFacit VersusFacit Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned by all the errors we swallow up from these functions as opposed to handling them.

If there are certain we want to whitelist/swallow let's make that explicit -- at callsites or in this layer -- and fail the rest of them.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of these, and moved the relaxed version of get to the manager itself

leaseHandler *LeaseHandler
// in-memory fast path cache to reduce file lock contention within a process
memMu sync.RWMutex
mem map[string]string
Copy link

@VersusFacit VersusFacit Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A whole map for a pair? Or do you expect this to grow over time. In the runtime there's only one auth method...I think that's a claim I can make. The map time feels heavy handed where a tuple would do.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just meant to mirror what we are already storing on disk.
A map can also handle different credential caches. I don't think there's too much overhead being added by the map

Copy link

@VersusFacit VersusFacit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pretty big fan of how surgical this feels although I have logic concerns.

Tomorrow or Wednesday I can spend real time tracing execution paths / thinking of failure cases. For now left some comments for you to consider.

Copy link
Collaborator

@felipecrv felipecrv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can see how this is faster and reduces contention, but I need to see an argument for correctness that is sound.

One model we could follow:

auth_function_using_shared_mutable_data(optional_lease):
  data = relaxed read;
  use the data
  if fail
     lease = acquire the lease
     return auth_function_using_shared_mutable_data(lease)
     
  do_something_more_with_the_data();
  if reason to update or delete but optionalLease == nil:
     lease = acquire the lease
     return auth_function_using_shared_mutable_data(lease)

This is very similar to double-cheked locking, but instead of copy-and pasting the check of the credentials stored we are recursing once with the lease acquired after the relaxed check.

https://en.wikipedia.org/wiki/Double-checked_locking

auth.go Outdated
credentialsStorage.deleteCredential(lease, newOAuthAccessTokenSpec(sc.cfg.OauthTokenRequestURL, sc.cfg.User))
}
if sessionParameters[clientRequestMfaToken] == true {
_ = deleteCredentialWithLease(newMfaTokenSpec(sc.cfg.Host, sc.cfg.User))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is more like "delete credential without lease" because you're not requiring a lease, right?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this work since this function has a lease acquired already? I don't remember if I made lease acquisition recursive (even if I did, it's bad to rely on recursive locks/leases).

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Deleted and deferring to existing deleteCredential on the storage manager

The lease provided is nil a lot of the time. There is indeed a risk of deadlock. A lease was acquired within the function itself.

auth.go Outdated
}
if sessionParameters[clientStoreTemporaryCredential] == true && sc.cfg.Authenticator.isOauthNativeFlow() {
_ = deleteCredentialWithLease(newOAuthAccessTokenSpec(sc.cfg.OauthTokenRequestURL, sc.cfg.User))
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for all of these here you could just pass the lease. deletion is a very destructive operation.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree - got rid of this and went back to the existing one

auth.go Outdated
if token != "" {
_ = setCredentialWithLease(newIDTokenSpec(sc.cfg.Host, sc.cfg.User), token)
}
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same concern with setters. You already have a lease here. Pass the lease.

auth.go Outdated
}, func() string {
return ""
})
token, err := awaitValue(valueAwaiter, func() (string, error) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You need to go fmt to make these spaces become tabs. Make the diff cleaner.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is my bad, messed up my editor extensions.
ran a manual go fmt and this is a lot cleaner now.

auth.go Outdated
return ""
})
token, err := awaitValue(valueAwaiter, func() (string, error) {
return getCredentialFast(newOAuthAccessTokenSpec(oauthClient.tokenURL(), sc.cfg.User))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add getCredentialFast to the same interface that contains getCredential? You would still take the lease, but not try to renew it right before making the query.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UPDATE: getCredentialRelaxed is a better name

Copy link
Author

@jasonlin45 jasonlin45 Nov 25, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to getCredentialRelaxed and moved to the storage manager itself

auth.go Outdated
sc.cfg.IDToken = tok
}
} else if sc.cfg.ClientStoreTemporaryCredential == ConfigBoolTrue {
tok, _ := getCredentialFast(newIDTokenSpec(sc.cfg.Host, sc.cfg.User))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Call it getCredentialsStale or getCredentialRelaxed. You're implementing something that is analogous to relaxed atomic reads.

https://www.google.com/search?hl=en&q=relaxed%20memory%20model

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a "Relaxed" read. All the writes sets and delete must be "Release" (i.e. they must have already "Acquire"d a lease).

If the "Relaxed" read is bad, we "Acquire" a lease before querying it again.

auth.go Outdated
Comment on lines 836 to 837
if lease == nil {
lease, err = credentialsStorage.acquireLease()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you got all the way here without a lease, what you should probably start from the beginning, but with a lease. This can be easily implemented with recursion (max depth will be 2).

You change func authenticateWithConfig(sc *snowflakeConn) error { to take an optionalLease parameter and you change the code here to be:

if optionalLease == nil {
   lease = acquire the lease
   return authenticateWithConfig(sc, lease)
}

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Recursion can't be done here because of interactions with the valueAwaiter the Snowflake folks added. Recursing out of the function releases other goroutine to go ahead, and then I had a deadlock.

auth.go Outdated
proofKey)
authData, err = authenticate(
sc.ctx,
lease,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're passing a nil lease to a function that didn't expect nil

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you might have to rename the parameter to optionalLease

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. This can be nil in some cases. Renamed accordingly!

@jasonlin45 jasonlin45 requested a review from felipecrv November 25, 2025 09:33
@jasonlin45
Copy link
Author

jasonlin45 commented Nov 25, 2025

@felipecrv I made as many of the changes as I could that were requested.

However, the 2-depth recursion does not play well with the value awaiter which forces single execution. I consolidated as much as I could, with some defensive lease acquisition for sensitive operations since you pointed out that authenticate in the current state can be called with a nil lease.

With the requested changes, I'm seeing even better performance.

defer lease.Release()
// Lease for coordination - may be provided or acquired locally
var lease *Lease = optionalLease
leaseAcquiredLocally := false
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The lease being nil indicates it's not acquired, right? With the bool we now have 4 different states to handle.

sc.cleanup()
return err
}
leaseAcquiredLocally = true
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

optionalLease being non-nil speaks for itself

// Case 1: cached ID token failed -> clear + try one interactive refresh
case sc.cfg.Authenticator == AuthTypeExternalBrowser && sc.cfg.IDToken != "":
credentialsStorage.deleteCredential(lease, newIDTokenSpec(sc.cfg.Host, sc.cfg.User))
_ = credentialsStorage.deleteCredential(lease, newIDTokenSpec(sc.cfg.Host, sc.cfg.User))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can't delete in the filesystem if the lease is nil. As an invariant, no mutation should derive from a relaxed read of the cached credentials.

}
defer lease.Release()
// Lease for coordination - may be provided or acquired locally
var lease *Lease = optionalLease
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can keep mutating optionalLease. You have a variable named lease that is just as optional as optionalLease

// When using parallel login coordination, we need the lease but cannot recurse
// due to the valueAwaiter
if lease == nil {
lease, err = credentialsStorage.acquireLease()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
lease, err = credentialsStorage.acquireLease()
optionalLease, err = credentialsStorage.acquireLease()

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You acquire a lease here and proceed with the authentication process, but since your decision to proceed with auth comes from a relaxed read, you can't be sure the credentials cache is still empty. Between the relaxed read and this lease acquisition, some other process might have populated the credentials cache.

That's why the double-checked locking pattern requires two checks: one relaxed check that is fast and then another check after the lease/lock acquisition.

}
}
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change

@felipecrv
Copy link
Collaborator

felipecrv commented Nov 25, 2025

However, the 2-depth recursion does not play well with the value awaiter which forces single execution. I consolidated as much as I could, with some defensive lease acquisition for sensitive operations since you pointed out that authenticate in the current state can be called with a nil lease.

This is a tail-call, so we can implement with a goto, but we don't have goto, so you can use a loop:

optionalLease = nil

for {
  // initial checks based on relaxed reads
  
  if situation_that_requires_a_mutation {
    if optionalLease == nil {
      optionalLease = acquireLease()
      continue  // equivalent to recursion
    }
    perform_the_mutation(optionalLease) // optionalLease is always set here
  }

  if any_other_situation_that_requires_a_lease {
    if optionalLease == nil {
      optionalLease = acquireLease()
      continue  // equivalent to recursion
    }
    run_the_situation_that_requires_a_lease(optionalLease) // optionalLease is always set here
  }

  // ...  
}

You need to have this "guard" before every operation that needs a lease. It's not OK to proceed without a lease, but it's also not OK to get a lease just keep going with the function. You can't be sure you're in the right path because you've got there due to relaxed reads. You need strong reads to base the initial checks.

authenticate might have to change to return an error when you call it with optionalLease=nil and internally it needs a lease. Because when it hits a point where it needs a lease, it has to return that error so that the caller, can acquire the lease and do continue (effectively a recursion or goto to the beginning of the process).

Empirically, you might not be seeing problems with your current code, but you can't prove that it prevents multiple attempts to login even when something is already in cache. This is all time-dependent, so you won't see the problems in your tests. You have to prove correctness at least informally.

With the requested changes, I'm seeing even better performance.

Nice.

@felipecrv
Copy link
Collaborator

One relaxation that should reduce the "number of things that require a lease" is allowing this:

cached = relaxed_read();
if new_value == cached:
   skip // no lease needed, but don't mutate cause that would risk overriding a later value
else:
   lease = acquire_lease();
   store.setNewValue(lease, new_value)

So you implement my stronger requirements (e.g. never mutate without a lease and checks repeated from the beginning) without making the code slow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants