
docs: add hot secondaries rfc #11227

Draft · wants to merge 3 commits into main
Conversation

jcsp (Contributor) commented Mar 13, 2025

Problem

Summary of changes

github-actions bot commented Mar 13, 2025

7953 tests run: 7569 passed, 0 failed, 384 skipped (full report)


Flaky tests (3)

  • Postgres 17
  • Postgres 14

Code coverage* (full report)

  • functions: 32.3% (8732 of 27002 functions)
  • lines: 48.4% (74806 of 154645 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
ce3d23e at 2025-03-14T20:27:59.970Z :recycle:

## Purpose

We aim to provide a sub-second RTO for pageserver failures, for mission
critical workloads. To do this, we should enable the postgres client
Member commented:

I see a second benefit that hot secondaries bring: scaling read traffic. Say someone runs a lot of analytic workloads against one database in parallel. For OLTP workloads this is probably already handled by caches, but I'm not sure.

The average total disk write bandwidth is the sum of WAL generation rate plus L1/image generation rate: this is about the same as a normal attached location. The average disk _read_ bandwidth of a hot secondary is far lower than an attached location because it is not reading back layers to compact them -- layers are only read in periods where the attached location was unavailable, so computes started reading from a hot secondary.
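Restating the quoted bandwidth claim as a rough relation (my paraphrase, not text from the RFC):

$$
BW_{\text{write}}^{\text{hot sec.}} \approx r_{\text{WAL}} + r_{\text{L1/image}} \approx BW_{\text{write}}^{\text{attached}},
\qquad
BW_{\text{read}}^{\text{hot sec.}} \ll BW_{\text{read}}^{\text{attached}}
$$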

The trigger for virtual compaction can be similar to the existing trigger
for L1 compaction on attached locations: once we build up a deep stack of L0s, then we do virtual compaction to trim it. This assumes that the attached location has kept up with compaction. The hot secondary can be
arpad-m (Member) commented Mar 13, 2025:

What if both the primary and the hot secondary are in 100% perfect sync, so they have the same number of L0s?

Then the moment comes when the hot secondary and the primary both decide to compact. At that point the secondary will look for remote layers immediately, while the primary is not ready yet: it hasn't uploaded any files.

Edit: what I'm trying to say is that there is a risk of the hot secondary lagging behind in a similar fashion to the warm secondary. The warm secondary misses out on new layers until they make it into the layer map; the hot secondary doesn't miss out on them, but it accumulates a larger compaction debt.
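As a rough illustration of the trigger described in the quoted RFC text (a deep stack of L0s kicks off virtual compaction), here is a minimal sketch. The type, constant, and method names are hypothetical; the real pageserver configuration and compaction code differ:

```rust
// Hypothetical sketch of the "deep L0 stack" trigger for virtual compaction
// on a hot secondary. Names and the threshold are illustrative only.

/// Threshold mirroring the existing L1-compaction trigger on attached locations.
const VIRTUAL_COMPACTION_THRESHOLD: usize = 10;

struct L0Stack {
    /// Number of L0 delta layers ingested since the last (virtual) compaction.
    depth: usize,
}

impl L0Stack {
    /// Called whenever WAL ingest freezes another L0 delta layer.
    fn on_l0_frozen(&mut self) {
        self.depth += 1;
    }

    /// Once the L0 stack is deep enough, "virtually" compact: look for the
    /// L1/image layers the attached location has uploaded for this LSN range
    /// and drop the local L0s they replace.
    fn should_virtually_compact(&self) -> bool {
        self.depth >= VIRTUAL_COMPACTION_THRESHOLD
    }
}
```

The race described in the comment above would show up here as the secondary's depth crossing the threshold before the attached location has uploaded the corresponding L1/image layers, so a depth check alone cannot guarantee that the remote layers exist yet.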

arpad-m (Member) left a comment:

How would the transition from hot secondary to primary work? We now have some data in remote storage that might be inconsistent with local state; the new primary might be ahead or behind, might have fewer layer files, etc.

In general, the design so far has been that S3 is the pristine version of the state that other places, like secondaries and primaries, are downstream of. But once the hot secondary becomes a primary, it might need a step to delete files that are in S3 but not needed locally, because it has a slightly differently cut local copy of them, and we probably don't want to re-download anything during an attach in order to become operational (that was the goal of the hot secondary, after all).

I'm also wondering about backpressure: should hot secondaries failing to catch up cause backpressure? We can probably answer this later, but if there is no backpressure, we might end up in situations where the hot secondary is behind but has different L0s, so it might be smarter to ditch those L0s rather than ditch what's in S3.

Comment on lines +132 to +133
- after some short timeout (100s of ms), compute gives up on getpage requests to the primary and sends
them to the hot secondary.
Contributor commented:

How does the compute learn about the pageserver hosting the hot secondary location? The RFC does not say, so I'm assuming the current apply-config mechanism is implied.

I think that's fine to start with, but it implies an unbounded availability gap when faced with notification delivery issues (of which we've seen quite a few lately).
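A minimal sketch of the compute-side behaviour the quoted diff lines describe (short timeout against the primary, then the same getpage request is sent to the hot secondary). Everything here is hypothetical: the real compute is C code inside Postgres, and the request/response types, trait, and timeout values below are stand-ins:

```rust
use std::time::Duration;

// Hypothetical stand-ins for the real getpage protocol types.
struct GetPageRequest; // rel, block number, request LSN, ...
struct PageImage(Vec<u8>);

trait Pageserver {
    fn get_page(&self, req: &GetPageRequest, timeout: Duration) -> Result<PageImage, String>;
}

/// Try the primary first; after a short timeout ("100s of ms" in the RFC),
/// give up and send the same request to the hot secondary.
fn get_page_with_failover(
    primary: &dyn Pageserver,
    hot_secondary: &dyn Pageserver,
    req: &GetPageRequest,
) -> Result<PageImage, String> {
    let primary_timeout = Duration::from_millis(300);
    match primary.get_page(req, primary_timeout) {
        Ok(page) => Ok(page),
        // Timed out or errored: fail over. The secondary is assumed warm,
        // so a longer timeout is acceptable here.
        Err(_) => hot_secondary.get_page(req, Duration::from_secs(5)),
    }
}
```

This also makes the question above concrete: for the failover path to exist at all, the compute needs an up-to-date address for the hot secondary, which today would arrive via the apply-config notification path.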
