Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Store startup fixes #5926

Open
wants to merge 13 commits into
base: master
Choose a base branch
from
Open

Store startup fixes #5926

wants to merge 13 commits into from

Conversation

lutter
Copy link
Collaborator

@lutter lutter commented Apr 3, 2025

With multiple nodes, they could race each other and cause database errors.

I've reworked the startup code to be (a) race-proof (I hope) and (b) easier to follow. Any node that wants to run database setup now gets a lock on the primary, and runs all code needed for setup while holding that lock. That way, nodes can't interfere with each other. In the common case, where there are no database changes, be it because of migrations, configuration changes, or code changes that map different tables, the time during which any node holds the lock is very brief. This could be further optimized but let's first see how this performs in practice.

@lutter lutter requested a review from incrypto32 April 3, 2025 17:25
@lutter lutter force-pushed the lutter/sharded branch 11 times, most recently from 2199de5 to 6bc5d84 Compare April 6, 2025 05:10
lutter added 8 commits April 6, 2025 11:07
It should be up to the operator if they use it or not, and when they want
to reset it
The current database setup code was inerently racy when several nodes were
starting up as it relied on piecemeal locking of individual steps. This
change completely revamps the strategy we use: setup now takes a lock on
the primary, so that only one node at a time will run the setup code.
Before, PoolState was just an enum and code all over the place dealt with
its interior mutability. Now, we encapsulate that to simplify code using
the PoolState
Instead of dealing with disabled shards (shards that have a pool size of 0
configured), filter those shards out on startup and warn about them.

The end effect is that for that configuration, users will get an error of
'unkown shard' rather than 'shard disabled'. Since configuring a shard to
have no connections is kinda pathological, and leads to an error when it is
used either way, the code simplification is worth the slightly less helpful
error message.

Removing the 'disabled' state from pools has ripple effects to quite a few
other places, simplifying them a bit
With the previous code, we would run setup initially when creating all
pools, but they would not be marked as set up. On the first access to the
pool we would try to run setup again, which is not needed. This change
makes it so that we remember that we ran setup successfully when pools are
created
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant