-
Notifications
You must be signed in to change notification settings - Fork 0
Intraprocess credential caching #21
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: v1.17.0+dbt
Are you sure you want to change the base?
Conversation
secure_storage_manager.go
Outdated
| } | ||
|
|
||
| func deleteCredentialWithLease(tokenSpec *secureTokenSpec) error { | ||
| lease, err := credentialsStorage.acquireLease() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to think more about this but my initial concern is multiple connections each calling delete when only one should.
the original paths in the auth function one layer up have invariants based on a lease being acquired. We think to rethink those invariants under this new setup to make sure the behavior is correct.
secure_storage_manager.go
Outdated
|
|
||
| var credentialsStorage = newSecureStorageManager() | ||
|
|
||
| // Helper fast-paths to minimize lease contention for common operations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned by all the errors we swallow up from these functions as opposed to handling them.
If there are certain we want to whitelist/swallow let's make that explicit -- at callsites or in this layer -- and fail the rest of them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I got rid of these, and moved the relaxed version of get to the manager itself
| leaseHandler *LeaseHandler | ||
| // in-memory fast path cache to reduce file lock contention within a process | ||
| memMu sync.RWMutex | ||
| mem map[string]string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A whole map for a pair? Or do you expect this to grow over time. In the runtime there's only one auth method...I think that's a claim I can make. The map time feels heavy handed where a tuple would do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is just meant to mirror what we are already storing on disk.
A map can also handle different credential caches. I don't think there's too much overhead being added by the map
VersusFacit
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty big fan of how surgical this feels although I have logic concerns.
Tomorrow or Wednesday I can spend real time tracing execution paths / thinking of failure cases. For now left some comments for you to consider.
felipecrv
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see how this is faster and reduces contention, but I need to see an argument for correctness that is sound.
One model we could follow:
auth_function_using_shared_mutable_data(optional_lease):
data = relaxed read;
use the data
if fail
lease = acquire the lease
return auth_function_using_shared_mutable_data(lease)
do_something_more_with_the_data();
if reason to update or delete but optionalLease == nil:
lease = acquire the lease
return auth_function_using_shared_mutable_data(lease)
This is very similar to double-cheked locking, but instead of copy-and pasting the check of the credentials stored we are recursing once with the lease acquired after the relaxed check.
auth.go
Outdated
| credentialsStorage.deleteCredential(lease, newOAuthAccessTokenSpec(sc.cfg.OauthTokenRequestURL, sc.cfg.User)) | ||
| } | ||
| if sessionParameters[clientRequestMfaToken] == true { | ||
| _ = deleteCredentialWithLease(newMfaTokenSpec(sc.cfg.Host, sc.cfg.User)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is more like "delete credential without lease" because you're not requiring a lease, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How does this work since this function has a lease acquired already? I don't remember if I made lease acquisition recursive (even if I did, it's bad to rely on recursive locks/leases).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Deleted and deferring to existing deleteCredential on the storage manager
The lease provided is nil a lot of the time. There is indeed a risk of deadlock. A lease was acquired within the function itself.
auth.go
Outdated
| } | ||
| if sessionParameters[clientStoreTemporaryCredential] == true && sc.cfg.Authenticator.isOauthNativeFlow() { | ||
| _ = deleteCredentialWithLease(newOAuthAccessTokenSpec(sc.cfg.OauthTokenRequestURL, sc.cfg.User)) | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for all of these here you could just pass the lease. deletion is a very destructive operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree - got rid of this and went back to the existing one
auth.go
Outdated
| if token != "" { | ||
| _ = setCredentialWithLease(newIDTokenSpec(sc.cfg.Host, sc.cfg.User), token) | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same concern with setters. You already have a lease here. Pass the lease.
auth.go
Outdated
| }, func() string { | ||
| return "" | ||
| }) | ||
| token, err := awaitValue(valueAwaiter, func() (string, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to go fmt to make these spaces become tabs. Make the diff cleaner.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is my bad, messed up my editor extensions.
ran a manual go fmt and this is a lot cleaner now.
auth.go
Outdated
| return "" | ||
| }) | ||
| token, err := awaitValue(valueAwaiter, func() (string, error) { | ||
| return getCredentialFast(newOAuthAccessTokenSpec(oauthClient.tokenURL(), sc.cfg.User)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add getCredentialFast to the same interface that contains getCredential? You would still take the lease, but not try to renew it right before making the query.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UPDATE: getCredentialRelaxed is a better name
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to getCredentialRelaxed and moved to the storage manager itself
auth.go
Outdated
| sc.cfg.IDToken = tok | ||
| } | ||
| } else if sc.cfg.ClientStoreTemporaryCredential == ConfigBoolTrue { | ||
| tok, _ := getCredentialFast(newIDTokenSpec(sc.cfg.Host, sc.cfg.User)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Call it getCredentialsStale or getCredentialRelaxed. You're implementing something that is analogous to relaxed atomic reads.
https://www.google.com/search?hl=en&q=relaxed%20memory%20model
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a "Relaxed" read. All the writes sets and delete must be "Release" (i.e. they must have already "Acquire"d a lease).
If the "Relaxed" read is bad, we "Acquire" a lease before querying it again.
auth.go
Outdated
| if lease == nil { | ||
| lease, err = credentialsStorage.acquireLease() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you got all the way here without a lease, what you should probably start from the beginning, but with a lease. This can be easily implemented with recursion (max depth will be 2).
You change func authenticateWithConfig(sc *snowflakeConn) error { to take an optionalLease parameter and you change the code here to be:
if optionalLease == nil {
lease = acquire the lease
return authenticateWithConfig(sc, lease)
}There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Recursion can't be done here because of interactions with the valueAwaiter the Snowflake folks added. Recursing out of the function releases other goroutine to go ahead, and then I had a deadlock.
auth.go
Outdated
| proofKey) | ||
| authData, err = authenticate( | ||
| sc.ctx, | ||
| lease, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you're passing a nil lease to a function that didn't expect nil
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you might have to rename the parameter to optionalLease
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. This can be nil in some cases. Renamed accordingly!
|
@felipecrv I made as many of the changes as I could that were requested. However, the 2-depth recursion does not play well with the value awaiter which forces single execution. I consolidated as much as I could, with some defensive lease acquisition for sensitive operations since you pointed out that With the requested changes, I'm seeing even better performance. |
| defer lease.Release() | ||
| // Lease for coordination - may be provided or acquired locally | ||
| var lease *Lease = optionalLease | ||
| leaseAcquiredLocally := false |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The lease being nil indicates it's not acquired, right? With the bool we now have 4 different states to handle.
| sc.cleanup() | ||
| return err | ||
| } | ||
| leaseAcquiredLocally = true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
optionalLease being non-nil speaks for itself
| // Case 1: cached ID token failed -> clear + try one interactive refresh | ||
| case sc.cfg.Authenticator == AuthTypeExternalBrowser && sc.cfg.IDToken != "": | ||
| credentialsStorage.deleteCredential(lease, newIDTokenSpec(sc.cfg.Host, sc.cfg.User)) | ||
| _ = credentialsStorage.deleteCredential(lease, newIDTokenSpec(sc.cfg.Host, sc.cfg.User)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can't delete in the filesystem if the lease is nil. As an invariant, no mutation should derive from a relaxed read of the cached credentials.
| } | ||
| defer lease.Release() | ||
| // Lease for coordination - may be provided or acquired locally | ||
| var lease *Lease = optionalLease |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can keep mutating optionalLease. You have a variable named lease that is just as optional as optionalLease
| // When using parallel login coordination, we need the lease but cannot recurse | ||
| // due to the valueAwaiter | ||
| if lease == nil { | ||
| lease, err = credentialsStorage.acquireLease() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| lease, err = credentialsStorage.acquireLease() | |
| optionalLease, err = credentialsStorage.acquireLease() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You acquire a lease here and proceed with the authentication process, but since your decision to proceed with auth comes from a relaxed read, you can't be sure the credentials cache is still empty. Between the relaxed read and this lease acquisition, some other process might have populated the credentials cache.
That's why the double-checked locking pattern requires two checks: one relaxed check that is fast and then another check after the lease/lock acquisition.
| } | ||
| } | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a tail-call, so we can implement with a goto, but we don't have goto, so you can use a loop: optionalLease = nil
for {
// initial checks based on relaxed reads
if situation_that_requires_a_mutation {
if optionalLease == nil {
optionalLease = acquireLease()
continue // equivalent to recursion
}
perform_the_mutation(optionalLease) // optionalLease is always set here
}
if any_other_situation_that_requires_a_lease {
if optionalLease == nil {
optionalLease = acquireLease()
continue // equivalent to recursion
}
run_the_situation_that_requires_a_lease(optionalLease) // optionalLease is always set here
}
// ...
}You need to have this "guard" before every operation that needs a lease. It's not OK to proceed without a lease, but it's also not OK to get a lease just keep going with the function. You can't be sure you're in the right path because you've got there due to relaxed reads. You need strong reads to base the initial checks.
Empirically, you might not be seeing problems with your current code, but you can't prove that it prevents multiple attempts to login even when something is already in cache. This is all time-dependent, so you won't see the problems in your tests. You have to prove correctness at least informally.
Nice. |
|
One relaxation that should reduce the "number of things that require a lease" is allowing this: So you implement my stronger requirements (e.g. never mutate without a lease and checks repeated from the beginning) without making the code slow. |
Description
Helps avoid issues like:
Under high concurrency scenarios such as a dbt project executing with aggressive concurrency, our credentials cache is getting absolutely HAMMERED with requests. Connections at the moment require exclusive access via lease to both read and write from the cache.
However, architecting an inter-process read-write lock for credentials via the file system is a task that has great complexity and therefore risk.
This solution adds a simple intraprocess cache layer that allows us to cache in memory within a process with a pass through to the file based cache. This way, only cache hydration / token refresh (with jitter) will lock our cache file.
It does not resolve high multi-processing cache contention, but it resolves cache contention due to aggressive multi-threading.
Basic Perf Testing
The following tests was conducted with 100 concurrent threads (goroutines) within a single process.
Current State
First attempt:
Second attempt:
This PR:
Based on this test... a monumental performance improvement and increased stability.