docs: add hot secondaries rfc #11227
base: main
Conversation
7953 tests run: 7569 passed, 0 failed, 384 skipped (full report)
Flaky tests (3): Postgres 17, Postgres 14
Code coverage* (full report)
* collected from Rust tests only. The comment gets automatically updated with the latest test results.
ce3d23e at 2025-03-14T20:27:59.970Z :recycle:
docs/rfcs/043-hot-secondaries.md (Outdated)
> ## Purpose
>
> We aim to provide a sub-second RTO for pageserver failures, for mission
> critical workloads. To do this, we should enable the postgres client
I see a second benefit that hot secondaries bring: scaling read traffic. Say someone runs a lot of analytic workloads on some database in parallel. For OLTP workloads this is probably already handled by caches, but I'm not sure.
docs/rfcs/043-hot-secondaries.md (Outdated)
> The average total disk write bandwidth is the sum of WAL generation rate plus L1/image generation rate: this is about the same as a normal attached location. The average disk _read_ bandwidth of a hot secondary is far lower than an attached location because it is not reading back layers to compact them -- layers are only read in periods where the attached location was unavailable, so computes started reading from a hot secondary.
>
> The trigger for virtual compaction can be similar to the existing trigger
> for L1 compaction on attached locations: once we build up a deep stack of L0s, then we do virtual compaction to trim it. This assumes that the attached location has kept up with compaction. The hot secondary can be
What if the primary and the hot secondary are in perfect sync, so they have the same number of L0s?
Then the moment comes when the hot secondary and the primary both think about doing compaction. At that point, the secondary will look for remote layers immediately, while the primary is not ready yet: it hasn't uploaded any files.
Edit: what I'm trying to say is that there is a risk of the hot secondary lagging behind in a similar fashion to the warm secondary. The warm secondary misses out on new layers until they make it into the layer map; the hot secondary doesn't miss out on them, but it accrues a larger compaction debt.
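To make the race concrete, here is a minimal sketch of the L0-depth trigger the RFC describes and the point where the secondary's check can fire before the primary has uploaded its L1s. All names and the threshold value are illustrative, not the pageserver's actual API:

```rust
// Illustrative threshold, mirroring the existing L0 -> L1 compaction trigger
// on attached locations; the real value and names differ in the pageserver.
const COMPACTION_THRESHOLD: usize = 10;

struct LayerMap {
    l0_count: usize,
}

enum CompactionDecision {
    // Enough L0s stacked up and the attached location's L1s are in remote
    // storage: virtually compact by referencing them.
    VirtualCompact,
    // Remote L1s not uploaded yet: keep waiting, accruing compaction debt.
    WaitForRemote,
    Idle,
}

fn decide(local: &LayerMap, remote_l1s_available: bool) -> CompactionDecision {
    if local.l0_count < COMPACTION_THRESHOLD {
        return CompactionDecision::Idle;
    }
    if remote_l1s_available {
        CompactionDecision::VirtualCompact
    } else {
        // The race described above: primary and secondary hit the threshold
        // at the same time, but the primary hasn't uploaded its L1s yet, so
        // the secondary's L0 stack keeps growing in the meantime.
        CompactionDecision::WaitForRemote
    }
}

fn main() {
    let secondary = LayerMap { l0_count: 12 };
    // Primary hit the same threshold but has not uploaded its L1s yet:
    match decide(&secondary, false) {
        CompactionDecision::WaitForRemote => println!("secondary accrues compaction debt"),
        _ => println!("trigger satisfied"),
    }
}
```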
How would the transition from hot secondary to primary work? We now have some data in remote storage that might be inconsistent with local state: the primary might be ahead or behind, might have fewer layer files, etc.
In general, the design so far has been that S3 is the pristine version of the state that other places, like secondaries or primaries, are downstream of. But once the hot secondary becomes a primary, it might need a step to delete files that are in S3 but not needed locally, because it holds a slightly differently cut local copy of them, and we probably don't want to redownload anything during attach in order to become operational (that was the goal of the hot secondary after all).
I'm also wondering about backpressure: should hot secondaries failing to catch up cause backpressure? We can probably answer this later too, but if there is no backpressure, we might end up in situations where the hot secondary is behind but has different L0s, so it might be smarter to ditch those L0s instead of ditching what's in S3.
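To make the promotion question concrete, here is a rough sketch of the reconciliation step being asked about, under the assumption that the promoted location keeps its own cut of layers and makes S3 follow it rather than re-downloading. This is purely illustrative, not a proposed implementation, and the layer naming is simplified (real layers are keyed by key range, LSN range, and generation):

```rust
use std::collections::HashSet;

// Simplified stand-in for a layer file name.
type LayerName = String;

// One possible policy: upload local layers that S3 has never seen, and
// delete (or defer to GC) remote layers that the promoted location's local
// cut has superseded, instead of re-downloading to match S3 exactly.
fn reconcile(
    local: &HashSet<LayerName>,
    remote_index: &HashSet<LayerName>,
) -> (Vec<LayerName>, Vec<LayerName>) {
    let to_upload: Vec<_> = local.difference(remote_index).cloned().collect();
    let to_delete: Vec<_> = remote_index.difference(local).cloned().collect();
    (to_upload, to_delete)
}

fn main() {
    let local: HashSet<LayerName> =
        ["L0_a", "L1_local_cut"].iter().map(|s| s.to_string()).collect();
    let remote: HashSet<LayerName> =
        ["L0_a", "L1_primary_cut"].iter().map(|s| s.to_string()).collect();
    let (to_upload, to_delete) = reconcile(&local, &remote);
    println!("upload: {to_upload:?}, delete or defer: {to_delete:?}");
}
```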
> - after some short timeout (100s of ms), compute gives up on getpage requests to the primary and sends
>   them to the hot secondary.
How does the compute learn about the pageserver hosting the hot secondary location? The RFC does not mention anything about it, so I'm assuming the current apply-config mechanism is implied.
I think that's fine to start with, but it implies an unbounded availability gap when faced with notification delivery issues (of which we've seen quite a few lately).
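For reference, a minimal sketch of the timeout-then-failover flow the quoted RFC text describes. The real compute-side logic lives in the Postgres extension, and the endpoint struct, timeout values, and function names here are all hypothetical:

```rust
use std::time::Duration;

// Hypothetical: in practice these addresses would arrive via apply-config
// (or whatever notification mechanism the RFC settles on).
struct PageserverEndpoints {
    primary: String,
    hot_secondary: String,
}

#[derive(Debug)]
enum GetPageError {
    Timeout,
    Other(String),
}

// Stand-in for the real getpage@lsn request; always times out here so the
// fallback path below is exercised.
fn getpage(_endpoint: &str, _timeout: Duration) -> Result<Vec<u8>, GetPageError> {
    Err(GetPageError::Timeout)
}

// The flow described in the RFC: give the primary a short budget
// (100s of ms), then retry against the hot secondary.
fn getpage_with_failover(eps: &PageserverEndpoints) -> Result<Vec<u8>, GetPageError> {
    let primary_budget = Duration::from_millis(300); // illustrative value
    match getpage(&eps.primary, primary_budget) {
        Ok(page) => Ok(page),
        Err(GetPageError::Timeout) => {
            // A sub-second RTO depends on this path needing no external
            // coordination: the compute must already know the secondary.
            getpage(&eps.hot_secondary, Duration::from_millis(700))
        }
        Err(e) => Err(e),
    }
}

fn main() {
    let eps = PageserverEndpoints {
        primary: "pageserver-a:6400".into(),
        hot_secondary: "pageserver-b:6400".into(),
    };
    println!("{:?}", getpage_with_failover(&eps));
}
```

If the apply-config mechanism is indeed the only way the compute learns the secondary's address, then the availability gap mentioned above is exactly the window where `hot_secondary` in a sketch like this would be stale or missing.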