fix: Start listening after schema cache load#4880

Merged
steve-chavez merged 1 commit into
PostgREST:mainfrom
mkleczek:push-tynrmqwlwwus
Jun 16, 2026

@mkleczek

@mkleczek mkleczek commented May 5, 2026

Collaborator

This change ensures PostgREST starts listening on a server socket only after it has loaded the schema cache and is ready to handle requests. It will no longer return 503 errors during startup while the schema cache is still loading.

DISCLAIMER:
This commit was authored entirely by a human without the assistance of LLMs.

Comment thread CHANGELOG.md Outdated
Comment thread src/PostgREST/Admin.hs
Comment thread src/PostgREST/App.hs
@steve-chavez

steve-chavez commented May 5, 2026

Member

Previous discussion on the motivation of the change on #4703 (comment).

@steve-chavez steve-chavez requested a review from laurenceisla May 5, 2026 16:38
@steve-chavez

Member

@mkleczek As per #4703 (comment), this would clearly benefit the case of SO_REUSEPORT given 2 PostgREST instances running.

But let's consider the scenario of a single instance managed by systemd behind a proxy:

  • The service restarts (for any reason, could be manual restart because somehow the schema cache failed reloading).
  • Right now our startup is fast (milliseconds) and we start responding with 503s. During this time clients will get the 503s with a clear error message that says we're "Retrying.." plus a Retry-After header.

With this change, we'll no longer respond, and clients will get a connection refused with no error message. This state can last multiple seconds now that we wait for the schema cache to load.

So under this scenario, it looks like this new behavior is worse?


Thinking about it more, what we need is zero-downtime restarts, which I guess is easier under this new behavior since we could rely on systemd socket activation?

@mkleczek

mkleczek commented May 5, 2026

Collaborator Author

So under this scenario, it looks like this new behavior is worse?

Not really.

From the point of view of the client there is not much gain from these 503 errors compared to a connect timeout or similar. The client has to handle connection issues anyway because there are many more cases when they can happen (for example, the whole server might have crashed). With a reverse proxy in front of PostgREST (i.e. always), the client will get some 50x anyway.
The only reasonable way for the client to handle network issues is to retry. Well-behaved clients will use an exponential backoff with jitter retry policy to avoid overwhelming a freshly started server (i.e. to avoid a thundering herd).
Retry-After is not very useful because it is not reliable. What's worse: if all clients retry according to this header then... boom - thundering herd - I would even say Retry-After is more harmful than good.
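
For readers unfamiliar with the retry policy mentioned in this comment, here is a minimal client-side sketch of exponential backoff with "full jitter". It is an illustration only: the function names, the default values, and the choice to retry on `ConnectionError` are assumptions, not anything PostgREST or its clients prescribe.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=6, rng=random.random):
    """Yield 'full jitter' delays: uniform in [0, min(cap, base * 2**n))."""
    for attempt in range(attempts):
        ceiling = min(cap, base * (2 ** attempt))
        yield rng() * ceiling


def call_with_retry(operation, delays, sleep=time.sleep):
    """Retry `operation` on connection errors, sleeping between attempts."""
    for delay in delays:
        try:
            return operation()
        except ConnectionError:  # includes ConnectionRefusedError
            sleep(delay)
    # Final attempt after the delays are exhausted; errors propagate.
    return operation()
```

Because each client draws its delay at random, retries from many clients spread out over time instead of arriving together at a freshly started server.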

Comment thread test/io/test_io.py
@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from 5bb48c5 to 99554cb Compare May 5, 2026 18:51
@steve-chavez

Member

What's worse: if all clients retry according to this header then... boom - thundering herd - I would even say Retry-After is more harmful than good.

Thought about removing the Retry-After, but its docs say:

[...] Retry-After indicates the minimum time that the user agent is asked to wait

So it's a minimum, not an exact time. I think it should be fine to be clear about this in the docs and recommend jitter.
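
One way a client could honor Retry-After as a minimum while still adding jitter is sketched below. The helper name and parameters are hypothetical, and the sketch only handles the delay-seconds form of the header, treating an HTTP-date value or a missing header as "no minimum".

```python
import random


def retry_delay(retry_after_header, jitter_window, rng=random.random):
    """Wait at least Retry-After seconds, plus random jitter on top."""
    try:
        minimum = max(0, int(retry_after_header))
    except (TypeError, ValueError):
        # Header absent, or an HTTP-date, which this sketch ignores.
        minimum = 0
    return minimum + rng() * jitter_window
```

With this shape, no client retries earlier than the server asked, yet clients that received the same header value do not all retry at the same instant.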

@steve-chavez

Member

@mkleczek The direction here is good, make sure to address the comments and then we can merge this.

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from 99554cb to 6a927aa Compare May 7, 2026 08:45
@mkleczek mkleczek marked this pull request as draft May 8, 2026 06:05
@mkleczek

mkleczek commented May 8, 2026

Collaborator Author

I am marking this PR as draft to address concerns related to handling schema cache loading errors during start-up.

It seems the right course of action cannot be either of these two extremes:

  • always return 503 during schema cache loading on startup
  • start listening only after successful schema cache load

The first one forces clients to handle normal conditions as errors.
The second one makes the clients unaware of errors that might happen during schema cache loading which makes diagnostics more difficult.

It seems like the best (ultimate?) startup sequence should be:

  1. Start admin server.
  2. Try to load schema cache once.
  3. Start listening on main socket
  4. If there was an error in 2, enter retry loop.

That way we achieve both:

  • happy path (ie. successful startup sequence) does not cause any error responses
  • errors are properly reported to clients
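
The four-step sequence above can be sketched as follows. This is only an ordering illustration in Python with placeholder callables; PostgREST itself is written in Haskell, and none of these names come from its code.

```python
def start(start_admin, load_schema_cache, start_main, retry_loop):
    """Hypothetical startup order; all four callables are placeholders."""
    start_admin()                # 1. admin server comes up first
    error = None
    try:
        load_schema_cache()      # 2. a single load attempt
    except Exception as exc:
        error = exc
    start_main()                 # 3. main socket opens either way
    if error is not None:
        retry_loop(error)        # 4. retry; the main server can now report the failure
    return error
```

On the happy path the main socket only opens once the schema cache is ready, so no error responses are produced. On failure the socket still opens, so clients receive an explicit error rather than a refused connection.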

The above requires wider refactoring - today the whole schema cache loading loop is implemented in a single function without any means to introspect the state of the loading process. Clients can only trigger asynchronous schema load and wait for it to finish.
It makes it related to #4856, which in turn is a prerequisite to implement loading the schema cache using listener connection to fix #4842.

@steve-chavez WDYT?

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch 2 times, most recently from f577ea6 to 3eed89b Compare May 8, 2026 13:24
@wolfgangwalther

Member

  1. Start admin server.
  2. Try to load schema cache once.
  3. Start listening on main socket
  4. If there was an error in 2, enter retry loop.

I wrote up two different proposals but threw them away, because I always came to the conclusion that this is the sensible thing to do.

So 👍

@steve-chavez

Member

It seems like the best (ultimate?) startup sequence should be:

Looks much better. Also 👍 from me.

@laurenceisla

Member

  2. Try to load schema cache once.
  3. Start listening on main socket

So between these two steps, we'd still return the connection error, however after that we'd retry and get the 503. I agree with this.

@steve-chavez Not sure if it was discussed elsewhere, but this would mean that the proposal to wait until the schema cache is loaded on startup is no longer desired, right?

@steve-chavez

Member

@laurenceisla The waiting is being discussed on #4873 (comment). #4129 won't be solved here.

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch 3 times, most recently from e653c00 to 55c3e8c Compare May 12, 2026 05:01
@mkleczek mkleczek marked this pull request as ready for review May 12, 2026 06:07
@mkleczek mkleczek requested a review from steve-chavez May 12, 2026 06:07
@mkleczek

Collaborator Author

It seems like the best (ultimate?) startup sequence should be:

  1. Start admin server.
  2. Try to load schema cache once.
  3. Start listening on main socket
  4. If there was an error in 2, enter retry loop.

That way we achieve both:

  • happy path (ie. successful startup sequence) does not cause any error responses
  • errors are properly reported to clients

Updated the code to implement the above.

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from c442894 to feffcfd Compare May 25, 2026 04:40
@mkleczek

Collaborator Author

Needs a rebase after 1a6ba20.

The reason I didn't do refactoring first was to avoid hard to resolve conflicts. I'd be grateful if we collaborated more on PRs to make our job easier instead of harder.

Rebased

@wolfgangwalther

Member

The reason I didn't do refactoring first was to avoid hard to resolve conflicts.

Same reasoning here - but with an eye on our future selves, when we need to maintain things. It's much more likely we'd like to revert this fix compared to the refactor. If we do the refactor first, then the fix, it's easy to revert. If we do it the other way around, we'd then need to be very careful at that time.

btw rebasing your changeset over it should not have been hard. It should have been as easy as:

The result after the two commits is the same, so that part is really easy. The harder to resolve conflict, which included actually looking at the code, was the one that I did when I cherry-picked it. That's why I did it and didn't force it onto you.

Comment thread test/io/test_io.py

@steve-chavez steve-chavez left a comment

Member

Looked at all the changes in tests, they look fine.

All that is left is resolving https://github.com/PostgREST/postgrest/pull/4880/changes#r3306654090.

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch 4 times, most recently from 8297bfd to 28f283d Compare June 2, 2026 07:13
@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch 3 times, most recently from 01b020d to 0cb9e0f Compare June 14, 2026 10:59
@mkleczek

Collaborator Author

Looked at all the changes in tests, they look fine.

All that is left is resolving https://github.com/PostgREST/postgrest/pull/4880/changes#r3306654090.

Done

@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from 0cb9e0f to 15b2665 Compare June 15, 2026 05:51
@mkleczek mkleczek requested a review from steve-chavez June 15, 2026 05:54
@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from 15b2665 to 1adea05 Compare June 15, 2026 05:59
@mkleczek mkleczek force-pushed the push-tynrmqwlwwus branch from 1adea05 to 3a0356f Compare June 15, 2026 10:25
@steve-chavez

Member

Was about to merge but noticed the loadtest failed: https://github.com/PostgREST/postgrest/actions/runs/27539944616?pr=4880

201 POST /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl ❌ 5.7 %

Is it just a sporadic failure?

@mkleczek

Collaborator Author

Was about to merge but noticed the loadtest failed: https://github.com/PostgREST/postgrest/actions/runs/27539944616?pr=4880

201 POST /films?columns=id,title,year,runtime,genres,director,actors,plot,posterUrl ❌ 5.7 %

Is it just a sporadic failure?

It does look like some spurious failure, indeed. The changes in this PR do not touch request processing code paths. @wolfgangwalther WDYT?

@steve-chavez

Member

Hm, now I also see that we have an improvement in perf?

404 GET /actoxs?actor=eq.1 -5.9 %
https://github.com/PostgREST/postgrest/actions/runs/27539944616?pr=4880

But as you mentioned, that doesn't make sense for this PR. So I guess the change column is misleading; perhaps we should only report the change when it has surpassed a threshold (like 10%)? @wolfgangwalther WDYT?

@steve-chavez steve-chavez merged commit 8fa26ee into PostgREST:main Jun 16, 2026
47 of 48 checks passed
@wolfgangwalther

Member

So I guess the change column is misleading; perhaps we should only report the change when it has surpassed a threshold (like 10%)? @wolfgangwalther WDYT?

I don't think that would help - the opposite in fact. We'd be unable to detect small improvements like the one in #5008.

Reading loadtest results just isn't straightforward. It has never been and will never be. To properly assess these results, you always need to look at a lot of context. For example:

  • The code changes. Is any change in performance expected / sensible?
  • All the different loadtests, not just one of them. Do they show the same pattern or conflicting patterns? Of course, this will not help if we're dealing with a change in performance for a very specific request.
  • The other metrics, not just P0. Do they show the same or does it look more like noise?
  • Previous (or future, by running again) runs on the same commit. Just repeat the whole thing, does the pattern persist? Or does it vanish?
  • Recent loadtests on the main branch and/or other PRs. These can give an indication whether GitHub's runners are just overloaded right now, noisier than normal, something like that.

For this PR we have seen some crazy numbers in both directions on recent runs. Same after merge on main. And on the next commit on main. This indicates there is just a high level of baseline noise in GHA right now. You should get much better results by running the loadtest locally, ideally with a mostly idle machine otherwise.
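
As a rough illustration of the "repeat the run and see whether the pattern persists" advice, the sketch below compares a reported change against the spread of repeated baseline runs. It is a simple heuristic with hypothetical function names and an arbitrary two-sigma threshold, not the method PostgREST's loadtest tooling uses.

```python
from statistics import mean, stdev


def relative_change(baseline, candidate):
    """Percent change of the candidate mean relative to the baseline mean."""
    return (mean(candidate) - mean(baseline)) / mean(baseline) * 100.0


def looks_like_noise(baseline, candidate, sigmas=2.0):
    """True when the means differ by less than `sigmas` baseline std devs."""
    return abs(mean(candidate) - mean(baseline)) < sigmas * stdev(baseline)
```

Each list holds one measurement per repeated run (for example, a latency figure for a single request type). A change that stays inside the baseline's own run-to-run spread is weak evidence of a real regression or improvement.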

@steve-chavez

steve-chavez commented Jun 17, 2026

Member

This indicates there is just a high level of baseline noise in GHA right now. You should get much better results by running the loadtest locally, ideally with a mostly idle machine otherwise.

Hm, that's no good TBH. It makes me lose confidence in the loadtests. I'd rather always run them locally, but that wastes too much time.

What if we use https://docs.github.com/en/actions/concepts/runners/self-hosted-runners for the loadtests? We did use a custom AWS instance for aarch64 builds before, so it should work. Edit: we could use a euronodes VPS.
