feat: add metric pgrst_jwt_cache_size in admin server #3802

Open
wants to merge 1 commit into
base: main

taimoorzaeem
Collaborator

@taimoorzaeem taimoorzaeem commented Nov 26, 2024

Add metric pgrst_jwt_cache_size in admin server which shows the cache size in bytes.

@steve-chavez
Member

Don't forget the feedback about the actual size in bytes #3801 (comment)

Ultimately, I believe we're going to need an LRU cache for the JWT cache to be production ready. So having this metric in bytes will be useful now and later.

@taimoorzaeem
Collaborator Author

Don't forget the feedback about the actual size in bytes #3801 (comment)

Calculating the "actual" byte size of the cache is tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out yet. In the meantime, I have written some code to approximate the cache size in bytes.

It works as follows:

Data.Cache provides a function toList, which returns the cache entries as a list of tuples ([(ByteString, AuthResult, Maybe TimeSpec)] in our case).

Now, we can use code from the ghc-datasize library to calculate the byte size of the ByteString and the AuthResult, but not of the Maybe TimeSpec, because recursiveSizeNF only works on types that are instances of the NFData typeclass. Hence I am calling it an "approximation".
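
The approximation described above can be sketched roughly as follows. This is a minimal, self-contained model and not the code from this PR: `Entry`, `approxSize`, `entrySize` and `approxCacheSize` are hypothetical names, the cache is modelled as the plain list a toList-style call would return, and character counts stand in for the byte sizes that recursiveSizeNF would report. The expiry component is skipped, which is what makes the result an approximation.

```haskell
-- Hypothetical model of one cache entry: (key, value, optional expiry).
-- In the PR these are ByteString, AuthResult and Maybe TimeSpec; plain
-- Strings and an Integer are stand-ins so the sketch needs only base.
type Entry = (String, String, Maybe Integer)

-- Stand-in for a per-value size function such as recursiveSizeNF.
-- It only counts characters, so the numbers are illustrative.
approxSize :: String -> Int
approxSize = length

-- Size of one entry: key plus value. The expiry is deliberately ignored,
-- mirroring the limitation described in the comment.
entrySize :: Entry -> Int
entrySize (key, value, _expiry) = approxSize key + approxSize value

-- Approximate size of the whole cache: the sum over all entries.
approxCacheSize :: [Entry] -> Int
approxCacheSize = sum . map entrySize
```

Note that a function of this shape walks every entry each time it is called, so its cost grows with the number of cached tokens.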

@taimoorzaeem taimoorzaeem force-pushed the metric/jwt-cache-size branch 2 times, most recently from 1254ed5 to 11727ab on December 17, 2024 06:50
@steve-chavez
Member

This is pretty cool, so I do see the size starting at 0 then increasing as I do requests:

# $ nix-shell
# $ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run

$ curl localhost:3001/metrics
# HELP pgrst_jwt_cache_size The number of cached JWTs
# TYPE pgrst_jwt_cache_size gauge
pgrst_jwt_cache_size 0.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 72.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 144.0

$ curl localhost:3000/authors_only -H "Authorization: Bearer $(postgrest-gen-jwt --exp 10 postgrest_test_author)"
[]
$ curl localhost:3001/metrics
..
pgrst_jwt_cache_size 216.0

Of course this doesn't drop down after a while because we need #3801 for that.

One issue that I've noticed is that we're printing empty log lines for each request:

[nix-shell:~/Projects/postgrest]$ PGRST_ADMIN_SERVER_PORT=3001  PGRST_JWT_CACHE_MAX_LIFETIME=30000 postgrest-with-postgresql-16  -f test/spec/fixtures/load.sql postgrest-run
...
18/Jan/2025:18:16:32 -0500: Schema cache loaded in 17.1 milliseconds
18/Jan/2025:18:16:34 -0500: 
18/Jan/2025:18:16:38 -0500: 
18/Jan/2025:18:16:42 -0500: 

This is due to the addition of the new Observation

And the following line here:

This is surprising behavior. I'll try to refactor this in a new PR.

For now, how about printing a message like:

JWTCache sz -> "The JWT Cache size increased to " <> sz <> " bytes"

This should only happen for a log-level of debug or higher; check how this is done in:

observationLogger :: LoggerState -> LogLevel -> ObservationHandler
observationLogger loggerState logLevel obs = case obs of
  o@(PoolAcqTimeoutObs _) -> do
    when (logLevel >= LogError) $ do
      logWithDebounce loggerState $
        logWithZTime loggerState $ observationMessage o
  o@(QueryErrorCodeHighObs _) -> do
    when (logLevel >= LogError) $ do
      logWithZTime loggerState $ observationMessage o
  o@(HasqlPoolObs _) -> do
    when (logLevel >= LogDebug) $ do
      logWithZTime loggerState $ observationMessage o
  PoolRequest ->
    pure ()
  PoolRequestFullfilled ->
    pure ()
  o ->
    logWithZTime loggerState $ observationMessage o
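
A minimal sketch of how the new observation could follow the same gating pattern. The types here are simplified stand-ins, not the real PostgREST definitions: `OtherObs` and `logLine` are hypothetical, and the decision is modelled as a pure function returning `Maybe String` so it can be checked without a logger.

```haskell
-- Simplified stand-in for the log levels; constructor order gives the
-- ordering used by the >= comparisons below.
data LogLevel = LogCrit | LogError | LogWarn | LogInfo | LogDebug
  deriving (Eq, Ord, Show)

-- Simplified stand-in for the observation type.
data Observation
  = JWTCache Int     -- cache size in bytes
  | OtherObs String  -- hypothetical catch-all for every other observation

observationMessage :: Observation -> String
observationMessage (JWTCache sz) =
  "The JWT Cache size increased to " <> show sz <> " bytes"
observationMessage (OtherObs msg) = msg

-- Returns the line to log, or Nothing when the observation is suppressed.
-- The cache-size message is only produced at debug level.
logLine :: LogLevel -> Observation -> Maybe String
logLine logLevel obs = case obs of
  o@(JWTCache _)
    | logLevel >= LogDebug -> Just (observationMessage o)
    | otherwise            -> Nothing
  o -> Just (observationMessage o)
```

With a branch like this, lower log levels would produce no line at all for the cache observation instead of an empty one.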

@steve-chavez
Member

Calculating the "actual" byte size of the cache is tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out yet. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good enough approximation so we have an order-of-magnitude understanding to see if the cache size is in KB, MB or GB. So far this seems enough. We should definitely document why we do an approximation though.

@@ -15,7 +15,7 @@ module PostgREST.App
, run
) where


import Control.Monad
Member

This seems unused?

@steve-chavez
Member

I've just noticed that perf badly drops down on this PR:

param       v12.2.3  head  main
throughput  448      121   399

https://github.com/PostgREST/postgrest/pull/3802/checks?check_run_id=34517395528

The recursiveSize function says:

This function works very quickly on small data structures, but can be slow on large and complex ones. If speed is an issue it's probably possible to get the exact size of a small portion of the data structure and then estimate the total size from that.

@taimoorzaeem What could we do to avoid this drop? Maybe calculate the cache size periodically on a background thread? Any thoughts?
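
The idea from the quoted recursiveSize documentation (measure a small portion exactly, then extrapolate) could look roughly like this. It is a sketch under assumptions: `estimateTotalSize` is a hypothetical name, `measure` stands in for an exact but possibly slow size function, and the sample is simply the first few elements rather than a random selection.

```haskell
-- Estimate the total size of a collection by measuring only the first
-- `sampleN` elements exactly and scaling their average to the full count.
-- Returns 0 when there is nothing to sample (empty input or sampleN <= 0).
estimateTotalSize :: Int -> (a -> Int) -> [a] -> Int
estimateTotalSize sampleN measure xs
  | null sample = 0
  | otherwise   = (sum (map measure sample) * total) `div` length sample
  where
    sample = take sampleN xs  -- the portion that is measured exactly
    total  = length xs        -- counting elements is cheap next to a deep size walk
```

The estimate is only as good as the sample: it works best when entries are of similar size, which may be a reasonable assumption for tokens issued by the same provider.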

Comment on lines 19 to 20
-- Code in this module is taken from:
-- https://hackage.haskell.org/package/ghc-datasize-0.2.7
Member

@taimoorzaeem Could you remind me why we need to vendor this code here? Maybe it's possible to use the dependency directly?

Collaborator Author

@taimoorzaeem taimoorzaeem Jan 19, 2025

Adding the dependency directly would have been perfect. The only issue is that ghc-datasize-0.2.7 is marked broken in nixpkgs. Any workaround ideas?

error: Package ‘ghc-datasize-0.2.7’ in /nix/store/hyy4vjyamr7pj1br9y8r1ssssqp570y2-source/pkgs/development/haskell-modules/hackage-packages.nix:120611 is marked as broken, refusing to evaluate.

       a) To temporarily allow broken packages, you can use an environment variable
          for a single invocation of the nix tools.

            $ export NIXPKGS_ALLOW_BROKEN=1

Member

It builds fine for me on latest nixpkgs master. So we probably just need to mark it as unbroken.

For now, it should be enough to add this to overlays/haskell-packages.nix:

      ghc-datasize = lib.markUnbroken prev.ghc-datasize;

Collaborator Author

Works! Thanks.

src/PostgREST/Auth.hs (outdated)
Comment on lines 192 to 193
jwtCacheSize <- calcApproxCacheSizeInBytes jwtCache
observer $ JWTCache jwtCacheSize
Collaborator Author

This function works very quickly on small data structures, but can be slow on large and complex ones. If speed is an issue it's probably possible to get the exact size of a small portion of the data structure and then estimate the total size from that.

@taimoorzaeem What could we do to avoid this drop? Maybe calculate the cache size periodically on a background thread? Any thoughts?

@steve-chavez In this PR, we are calculating the size of the complete cache on EVERY request, which is frankly horrible (can't believe I wrote this). How about this: whenever we add a cache entry, we calculate the size of only that single entry, which would be small and quick, and increment the cache size accordingly.

Then later in #3801, we can think about a background thread to purge expired entries or however we would like to handle purging.
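
A rough sketch of the incremental approach proposed here, assuming the running total lives in an IORef. `CacheSize`, `newCacheSize`, `recordInsert` and `currentSize` are hypothetical names, and `measure` stands in for a per-entry size function; none of this is taken from the PR's code.

```haskell
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)

-- Running total of the cache size in bytes.
newtype CacheSize = CacheSize (IORef Int)

newCacheSize :: IO CacheSize
newCacheSize = CacheSize <$> newIORef 0

-- Called when a new entry is inserted: measure only that entry and add
-- its size to the running total. Returns the updated total, which could
-- then be handed to the metrics observer.
recordInsert :: CacheSize -> (entry -> Int) -> entry -> IO Int
recordInsert (CacheSize ref) measure entry =
  atomicModifyIORef' ref $ \total ->
    let total' = total + measure entry
    in (total', total')

-- Read the current total without touching the cache itself.
currentSize :: CacheSize -> IO Int
currentSize (CacheSize ref) = readIORef ref
```

The cost per request is then bounded by the size of one entry. A matching decrement would be needed once entries are purged, otherwise the total only ever grows.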

Member

@taimoorzaeem Sounds good. Let's try that and see the impact on perf.

@taimoorzaeem taimoorzaeem force-pushed the metric/jwt-cache-size branch from 11727ab to 145c236 on January 19, 2025 14:41
@taimoorzaeem
Collaborator Author

Calculating the "actual" byte size of the cache is tricky (Haskell laziness is a lot to deal with sometimes) and I still haven't figured it out yet. In the meantime, I have written some code to approximate the cache size in bytes.

The above is understandable. What we really need is a good enough approximation so we have an order-of-magnitude understanding to see if the cache size is in KB, MB or GB. So far this seems enough. We should definitely document why we do an approximation though.

Now that I have gotten better at Haskell 🚀, I solved the issue and we don't need an "approximation". We CAN calculate the full cache size in bytes.

@taimoorzaeem taimoorzaeem force-pushed the metric/jwt-cache-size branch 3 times, most recently from 4b1afd1 to d9de43a on January 19, 2025 18:11
@taimoorzaeem
Collaborator Author

taimoorzaeem commented Jan 19, 2025

Load test and memory test failing with:

src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)

Is there any additional configuration that needs to be added for profiling?

Comment on lines 53 to 55
# nixpkgs have ghc-datasize-0.2.7 marked as broken
ghc-datasize = lib.markUnbroken prev.ghc-datasize;
Member

Suggested change
- # nixpkgs have ghc-datasize-0.2.7 marked as broken
- ghc-datasize = lib.markUnbroken prev.ghc-datasize;
+ # TODO: Remove this once https://github.com/NixOS/nixpkgs/pull/375121
+ # has made it to us.
+ ghc-datasize = lib.markUnbroken prev.ghc-datasize;

@wolfgangwalther
Member

Load test and memory test failing with:

Does it happen locally, too?

I am confused, can't exactly spot what's wrong right now.

@taimoorzaeem
Collaborator Author

Does it happen locally, too?

I am confused, can't exactly spot what's wrong right now.

Yes, it does happen locally.

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:
    Could not find module ‘GHC.DataSize’
    Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?
    Use -v (or `:set -v` in ghci) to see a list of the files searched for.
   |
44 | import GHC.DataSize            (recursiveSizeNF)
   | ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

@taimoorzaeem taimoorzaeem force-pushed the metric/jwt-cache-size branch from d9de43a to 7ff02db on January 20, 2025 17:16
@taimoorzaeem taimoorzaeem force-pushed the metric/jwt-cache-size branch from 7ff02db to 6eb7d3c on January 20, 2025 17:19
@taimoorzaeem
Collaborator Author

Need some way to run postgrest-loadtest on the CI. Currently it is failing because of the Nix build. Running PGRST_BUILD_CABAL=1 postgrest-loadtest works locally, but I am not sure how to set this up on CI temporarily to check the loadtest results. Is running the loadtest on CI equivalent to running it locally? Would I get the same results?

@wolfgangwalther
Member

wolfgangwalther commented Jan 20, 2025

Yeah, running the loadtest in CI is not working when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies, yet... and then cabal just breaks it somehow.

You can run something like this locally to get the same markdown output:

          postgrest-loadtest-against main v12.2.5
          postgrest-loadtest-report

(but it's likely that it fails the same way... :D)

@wolfgangwalther
Member

Perhaps you haven't installed the profiling libraries for package ‘ghc-datasize-0.2.7’?

The thing I don't understand about this error message is that it appears in contexts where we don't use profiling libs. We do for the memory test, so there the error message makes "kind of sense" (I still don't understand why it happens, though). But for loadtest and the regular build on macOS... those don't use profiling.

Hm... actually - we don't do a regular dynamically linked linux build via nix, I think. So in fact it fails for every nix build except the static build. Still don't know what's happening, though.

@steve-chavez
Member

[nix-shell]$ postgrest-loadtest
...
src/PostgREST/Auth.hs:44:1: error:

I get a similar error when trying this locally too.

Yeah, running the loadtest in CI is not working when dependencies are changed, because it needs to run against the base branch, which doesn't have the dependencies, yet... and then cabal just breaks it somehow.

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

@wolfgangwalther
Member

To unblock the PR, how about only adding the dependency in another PR and then merging it? Then I assume this PR would run the loadtest?

No, I don't think so. It seems the dependency issue was fixed a while ago in 0c5d2e5. Also, the fact that all nix builds fail (the memory test, on darwin, etc.) indicates that there is something else going on.
