docs(compute): RFC for compute rolling restart with prewarm #11294
Conversation
When `lfc_dump_interval_sec` is set to `N`, `compute_ctl` will periodically dump the LFC state
and store it in S3, so that it could be used either for auto-prewarm after restart or by replica
during the rolling restart.
I don't like this. S3 writes are expensive to scale, and if we're about to write every N seconds it's going to be expensive to host read replicas.
I'd expected CPlane-triggered writes; either during shutdown (for warming up after start) or before shutdown (for hot restart); not systematic writes.
+1 for CPlane-triggered writes.
Alternatively, it can be done as part of a checkpoint.
I don't think checkpoint is a good place for this; it's much too frequent on highly loaded systems.
Well, seconds is just a unit; it can be set to 5 or 15 minutes. For cplane-orchestrated restarts there is a separate API. I imagined periodic dumping to be useful for
i) auto-prewarm, i.e. the compute periodically dumps the LFC content, so it can later be used at restart. In theory, we could dump only at graceful shutdown, but then it wouldn't help with an accidental compute crash/restart, as there might be no LFC state to warm up from;
ii) later, for having a hot standby, i.e. it can periodically fetch fresh content from S3 and do prewarming.
I don't want to wire too complex logic via cplane, so TBH I don't see other options for a robust auto-prewarm without periodic dumping of the LFC state; pg_prewarm does the same via pg_prewarm.autoprewarm_interval.
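To make the periodic-dump idea concrete, here is a minimal sketch of what such a loop inside `compute_ctl` could look like, assuming a tokio runtime; `dump_lfc_state` and `store_in_s3` are hypothetical helpers, not the actual API:

```rust
use std::time::Duration;

// Hypothetical helpers; the real compute_ctl interfaces differ.
async fn dump_lfc_state() -> anyhow::Result<Vec<u8>> { unimplemented!() }
async fn store_in_s3(_state: &[u8]) -> anyhow::Result<()> { unimplemented!() }

// Periodically dump the LFC state and offload it, as `lfc_dump_interval_sec` describes.
async fn lfc_dump_loop(interval_sec: u64) {
    let mut ticker = tokio::time::interval(Duration::from_secs(interval_sec));
    loop {
        ticker.tick().await;
        match dump_lfc_state().await {
            Ok(state) => {
                if let Err(e) = store_in_s3(&state).await {
                    eprintln!("failed to offload LFC state: {e}");
                }
            }
            Err(e) => eprintln!("failed to dump LFC state: {e}"),
        }
    }
}
```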
@MMeent @knizhnik do you have any specific suggestions of how we can implement it without periodic dumping?
We can make it part of endpoint shutdown procedures?
Note that "i.e. it can periodically fetch fresh content from S3 and do prewarming" won't work as you seem to expect it to, as prewarming is explicitly designed to never evict existing pages from LFC, and thus won't do much if the size of LFC doesn't change for the hot standby. It's prewarming by replacing unused pages with potentially useful pages, and explicitly not LFC state synchronization.
As for hot standby, it's probably good enough to "just" disable the replay filter and require replay of all WAL (rather than only those records which we already have in our caches)
Thanks for the comment about hot standby
We can make it part of endpoint shutdown procedures?
Yes, this is what I meant by 'In theory, we can only dump at graceful shutdown'. That'd work in most cases, but what I don't like is that it doesn't cover abnormal terminations like OOM, VM restarts, etc.
With your cost estimation, dumping every 5 minutes is completely reasonable
When `lfc_auto_prewarm` is set to `true`, `compute_ctl` will start prewarming the LFC upon restart
While this is happening, the compute is not able to receive queries, right? If so, we shouldn't use this in normal compute starts which are not part of the restart logic.
No, autoprewarm can happen in the background while there are user queries active.
Yeah, as Matthias mentioned, it should be completely async. Furthermore, even failures during auto-prewarm shouldn't be critical; this is mentioned in the failure modes.
I see, thanks. I think I asked my question incorrectly. While prewarming the LFC from where it's stored, is the compute able to accept queries?
Yes, it should be fully functional, so auto-prewarm doesn't block anything
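For illustration, a minimal sketch of this non-blocking behaviour, assuming a tokio runtime; both helper functions are hypothetical names, not the real compute_ctl API:

```rust
// Sketch only: auto-prewarm runs as a background task and does not gate compute startup.
async fn start_compute() -> anyhow::Result<()> {
    start_postgres_and_accept_connections().await?; // compute is fully functional from here on
    tokio::spawn(async {
        if let Err(e) = prewarm_lfc_from_s3().await {
            // A failed auto-prewarm is not critical; just log and carry on.
            eprintln!("auto-prewarm failed: {e}");
        }
    });
    Ok(())
}

async fn start_postgres_and_accept_connections() -> anyhow::Result<()> { unimplemented!() }
async fn prewarm_lfc_from_s3() -> anyhow::Result<()> { unimplemented!() }
```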
2. Control plane promotion request timed out or hit network issues. If it never reached the
compute, the control plane should just repeat it. If it did reach the compute, then during
the retry the control plane can hit `409`, as the previous request already triggered the promotion.
In this case, the control plane needs to retry until either `200` or
the permanent error `500` is returned.
In case of 500, I assume we are going to suspend the compute and start a new one without pre-warm, right?
Yeah, 500 means a permanent failure, so it's likely unsafe to retry without a full restart.
I see; this links to my question above: in that case, if there is a prewarmed LFC and we are able to accept queries while prewarming (I doubt that's the case though), we could still achieve computes with a prewarmed LFC on start.
There are two separate flows:
1. Auto-prewarm, i.e. just start the compute normally, then prewarm it in the background. Clients can connect to it right after start, even if prewarm hasn't finished, so we trade performance for HA.
2. Restart via promotion of the already warm replica. This is the preferred way, as clients get guaranteed performance.
Yet, if 2. fails, we can always fall back to just 1., so that's what this section is about (a rough sketch of this fallback follows below). Does my comment make it any clearer?
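A rough control-plane-side sketch of this "promote, otherwise fall back to auto-prewarm" logic; the function names and the simplified status handling are illustrative assumptions, not the actual cplane code:

```rust
use std::time::Duration;

async fn rolling_restart() -> anyhow::Result<()> {
    for _ in 0..10 {
        match promote_secondary().await {
            Ok(200) => return Ok(()), // flow 2: replica promoted, clients get a warm primary
            // Promotion already in flight (409) or a transient network error: retry.
            Ok(409) | Err(_) => tokio::time::sleep(Duration::from_secs(1)).await,
            // Any other code (e.g. permanent 500): stop retrying.
            Ok(_) => break,
        }
    }
    // Flow 1: fall back to a normal start with background auto-prewarm.
    start_primary_normally(/* lfc_auto_prewarm = */ true).await
}

async fn promote_secondary() -> anyhow::Result<u16> { unimplemented!() }
async fn start_primary_normally(_lfc_auto_prewarm: bool) -> anyhow::Result<()> { unimplemented!() }
```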
3.1. Terminate the primary compute. Starting from here, **this is a critical section**;
if anything goes wrong, the only option is to start the primary normally and proceed
with auto-prewarm.
What do you mean by proceed with auto-prewarm? Does it mean the primary will start without prewarming but will continue doing the automatic prewarm? In that case I don't see how we use auto-prewarm in the restart code path. Do we have any usage for it outside of this one?
I mean that we just start the primary from scratch with empty caches; the only option to improve the situation is to do async auto-prewarm while already accepting new connections. So from the cplane perspective, it's just a normal start with auto-prewarm enabled.
the only option to improve the situation is to do async auto-prewarm
I see. How does this improve the situation if the compute is already starting with empty caches? When would we use the auto-prewarmed LFC if we always try to trigger prewarm on restarts triggered from the cplane?
Well, it's debatable whether we need auto-prewarm at all. It exists in vanilla Postgres. The idea is that we can prewarm caches faster when we do it intentionally vs. when the user tries to prewarm by just running their normal workload.
Imagine cplane: it eventually accesses all non-deleted projects/endpoints/branches. If we just restart it at Neon, it will take some time to visit all objects. Yet, if we actively prewarm caches in the background, the chance that the next project read will hit the cache will be higher, as the data will already be there, even though cplane hasn't read it explicitly.
In practice, it may not suit all workloads, and we cannot answer for sure until we implement and battle-test it, but IMO it exists in vanilla Postgres for a reason, so there are use cases.
1. `POST /store_lfc_state` -- dump the LFC state using the Postgres SQL interface and store the result in S3.
This has to be a blocking call, i.e. it will return only after the state is stored in S3.
If there is any concurrent request in progress, we should return `429 Too Many Requests`
and let the caller retry.
This concurrency control, does that cover both/all lfc endpoints, or just this one?
I think it makes sense only for the blocking calls `/promote` and `/store_lfc_state` (a sketch of such a guard is below). For the other methods, it should be pretty safe to do parallel requests, although I don't expect that cplane will do that intentionally.
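A minimal sketch of such a single-flight guard for the blocking endpoints, assuming tokio; the handler shape and names are illustrative, not compute_ctl's actual HTTP layer:

```rust
use std::sync::Arc;
use tokio::sync::Mutex;

// Single-flight guard shared by the blocking endpoints (/promote, /store_lfc_state).
async fn store_lfc_state(guard: Arc<Mutex<()>>) -> u16 {
    // If another blocking request holds the lock, answer 429 and let the caller retry.
    let Ok(_permit) = guard.try_lock() else {
        return 429;
    };
    // ... dump the LFC state via SQL and upload it to S3 while holding the permit ...
    200
}
```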
pub enum PrewarmStatus {
    /// Prewarming was never requested on this compute
    Off,
    /// Prewarming was requested, but not started yet
    Pending,
    /// Prewarming is in progress. The caller should follow
    /// `PrewarmState::progress`.
    InProgress,
    /// Prewarming has been successfully completed
    Completed,
    /// Prewarming failed. The caller should look at
    /// `PrewarmState::error` for the reason.
    Failed,
    /// It is intended to be used by auto-prewarm if none of
    /// the previous LFC states is available in S3.
    /// This is a distinct state from `Failed` because
    /// technically it's not a failure and could happen if
    /// the compute was restarted before it dumped anything into S3,
    /// or just after the initial rollout of the feature.
    Skipped,
}
Shouldn't this be more along the lines of the following?
pub enum PrewarmStatus {
    /// Prewarming was never requested on this compute
    Off,
    /// Prewarming was requested, but not started yet
    Pending { segments: N },
    /// Prewarming is in progress. The caller should follow
    /// `PrewarmState::progress`.
    InProgress { segments: N, done: N },
    /// Prewarming has been successfully completed
    Completed { segments: N },
    /// Prewarming failed. The caller should look at
    /// `PrewarmState::error` for the reason.
    Failed { error: Option<String> },
    /// It is intended to be used by auto-prewarm if none of
    /// the previous LFC states is available in S3.
    /// This is a distinct state from `Failed` because
    /// technically it's not a failure and could happen if
    /// the compute was restarted before it dumped anything into S3,
    /// or just after the initial rollout of the feature.
    Skipped,
}
It could be, yeah; it's just that such non-uniformly-typed enums have a very bad representation in the OpenAPI spec and in JSON generally, so I'd prefer to split the status from any aux info. We could have a custom serializer, but that's more work for no good reason.
I'm afraid it will be harder to consume such a response from cplane, wdyt @mtyazici?
If we are able to present it in the spec, I think generators would handle it alright. But instead of using union types, I think we could have a status object that contains all the fields mentioned in the suggestion:
type PrewarmStatus struct {
Status string
Segments int64
DoneSegments int64
Error string
}
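For comparison, a hedged Rust equivalent of that flat shape; field names like `total_segments` and `done_segments` are assumptions for illustration, not the RFC's final definition:

```rust
use serde::Serialize;

#[derive(Serialize)]
pub enum PrewarmStatus { Off, Pending, InProgress, Completed, Failed, Skipped }

// Flat state object, equivalent to the Go struct above: the status stays a plain
// enum and the auxiliary fields live next to it, which maps cleanly to OpenAPI/JSON.
#[derive(Serialize)]
pub struct PrewarmState {
    pub status: PrewarmStatus,
    pub total_segments: Option<u64>, // known once the LFC state has been fetched
    pub done_segments: Option<u64>,  // progress while `InProgress`
    pub error: Option<String>,       // set only when `status == Failed`
}
```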
### Complete rolling restart flow

(mermaid sequence diagram in the RFC)
This is missing an important element: the primary has to be shut down first, but the (old) primary does not show up in this diagram.
The Primary in the diagram is the old primary, and it's shut down first; it's also mentioned in the text. Or what do you mean?
Oh, it didn't have the same actor labeling as those of the secondary, so I'd failed to notice this.
Either way, that needs additional details, as there are some compute_ctl-postgres interactions which we need to detail on the primary as well for this to work correctly and consistently.
My intent was to keep it reasonably high-level. Do you see some important interactions missing here?
Not "missing" per se, but I do find it confusing that the RFC does go into detail for one side (the promoting replica) but not the other (the primary), while both sides are critial for correct functioning of the system. I'd expected both sides to have the same amount of detail.
There are many things to consider here, but three items just off the top of my head:

1. How to properly start the `walproposer` inside Postgres.
We can wait for PostgreSQL to become primary before starting streaming; I don't think it needs to be all that difficult.
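A minimal sketch of that waiting step, assuming `tokio-postgres`; how the walproposer itself is then started is intentionally left out, and the polling interval is an arbitrary assumption:

```rust
use std::time::Duration;
use tokio_postgres::Client;

// Poll Postgres until it has left recovery (i.e. promotion finished), and only then
// let compute_ctl proceed with starting WAL streaming.
async fn wait_until_primary(client: &Client) -> Result<(), tokio_postgres::Error> {
    loop {
        let row = client.query_one("SELECT pg_is_in_recovery()", &[]).await?;
        if !row.get::<_, bool>(0) {
            return Ok(()); // promotion completed, safe to start the walproposer
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```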
3.1. Terminate the primary compute. Starting from here, **this is a critical section**;
if anything goes wrong, the only option is to start the primary normally and proceed
with auto-prewarm.
We terminate the old primary before promoting the secondary. If I'm reading this right, there's an availability gap between the termination and the promotion. Is this reading correct? If so, can we avoid it?
If I'm reading this right, there's an availability gap between the termination and promotion
The secondary could be made available for read-only queries before the primary shuts down. However, we cannot allow write queries to the secondary before the primary has shut down, so there will always be a gap where there is no writer available.
The system should be fast enough, however, that this gap is no longer than a second at most; and probably much less.
It would be great to aim for something where the happy path has no hiccups. If that's not possible, let's be explicit about it and say why.
Proxy learns about the new primary after the promotion of the new compute, so there should be no queries on it at that point (according to the diagram). It seems like we can swap promotion and termination around while maintaining this property.
What would happen to the client in this case? Let's say we have an in-progress query when the old primary terminates. The client would probably retry and eventually get routed to the new primary.
It seems like we can swap promotion and termination around while maintaining this property.
No, Vlad, we cannot do that.
Two computes can Never, NEVER, never both be Primary on the same timeline at the same time. CPlane is supposed to make sure of that, and Compute will fail if it doesn't. Promotion of the replica before the original Primary has shut down will cause errors, panics, data loss, shutdowns, and/or nasal daemons on the Primary, the promoting Secondary, or both.
We don't want to test that theory.
What would happen to the client in this case? Let's say we have an in-progress query when the old primary terminates.
Their session would and should get disconnected. How that client handles disconnections is not something we care about here.
Two computes can Never, NEVER, never both be Primary on the same timeline at the same time. CPlane is supposed to make sure of that, and Compute will fail if it doesn't. Promotion of the replica before the original Primary has shut down will cause errors, panics, data loss, shutdowns, and/or nasal daemons on the Primary, the promoting Secondary, or both.
At the risk of being annoying: why? I understand it doesn't work in the case where two primaries are being written to. But what if we could guarantee the new primary is idle? More specifically, by idle I mean it would have to not write any WAL and be in some sort of consensus observer role.
I'm sure you're right about this working with the current state of affairs, but I'm curious about what the technical roadblock is.
Postgres can write WAL not only as a result of query execution.
There are many background activities which write WAL: checkpoint, vacuum, ...
More specifically, by idle I mean it would have to not write any WAL and be in some sort of consensus observer role.
That's what a hot standby is for PostgreSQL - the hot standby is not allowed to write WAL, but could stop reading and applying WAL from the primary and start writing its own WAL at a moment's notice.
The issue is that you can't promote until you've replayed all acknowledged WAL from safekeepers, because otherwise you'd fail to replay commits that the primary may have already sent acknowledgements for to its clients, essentially losing commits.
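A sketch of that precondition check, assuming `tokio-postgres`; how the target LSN is obtained from the safekeepers is out of scope here, and the polling interval is an arbitrary assumption:

```rust
use std::time::Duration;
use tokio_postgres::Client;

// Before promoting, wait until the standby has replayed everything up to the last
// LSN acknowledged by the safekeepers, so no acknowledged commits are lost.
async fn wait_for_replay(client: &Client, target_lsn: &str) -> Result<(), tokio_postgres::Error> {
    loop {
        let row = client
            .query_one(
                "SELECT coalesce(pg_last_wal_replay_lsn() >= $1::text::pg_lsn, false)",
                &[&target_lsn],
            )
            .await?;
        if row.get::<_, bool>(0) {
            return Ok(()); // all acknowledged WAL replayed; safe to promote
        }
        tokio::time::sleep(Duration::from_millis(100)).await;
    }
}
```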
5. Any other unexpected failure or timeout during promotion. Unfortunately, at this moment
we already have the primary node stopped, so the only option is to start the primary again
and proceed with auto-prewarm.
The compute is available during pre-warm, correct? Wondering if there are any interactions between the user workload and prewarm that are worth considering.
Yes, I was wondering this as well. Will we try to prewarm computes when not restarting a compute; for example, a compute start due to a proxy connection?
If the compute is not available during prewarm, the answer is definitely no, but if it's available, then the LFC cache might have become stale over time.
Replied here #11294 (comment) and here #11294 (comment) as well
Wondering if there's any interactions between the user workload and prewarm that are worth considering.
if it's available then the LFC cache might have become stale over time.
This PR #10442 introduced additional locking when accessing LFC, so it's now considered safe to write there concurrently, so that's the base for all this work.
During prefetch, we always request the latest pages from the pageserver. If a page gets modified after loading, it will be either updated in the LFC (in the case of the primary) or evicted from the LFC after receiving the corresponding WAL record (in the case of a replica). In other words, if a page is not present in the LFC, we will fetch it from the pageserver; if someone (a backend running the normal client workload) tries to write it concurrently, the access will be synchronized, and we should still get a freshness guarantee. @knizhnik or @MMeent can correct me, as I'm not fluent in the underlying mechanism; I consider it as given here.
Thus, it should be safe to prewarm the LFC concurrently with user load. The only problem is performance; I wrote about it in other comments, but anyway: yes, if it's a highly intensive workload, then prewarm can compete for storage resources with the user workload, so we can consider making auto-prewarm a user-togglable feature. I wrote about it in the section about auto-prewarm concerns.
@VladLazar @mtyazici let me know if it makes it clearer
or evicted from LFC after receiving the corresponding WAL record (in case of replica).
This is incorrect: WAL will be replayed for every page that's currently in the LFC or shared buffers.
Wondering if there's any interactions between the user workload and prewarm that are worth considering.
if it's available then the LFC cache might have become stale over time.
The LFC can become stale, but only if the main page is still in shared buffers. The (modified) page from buffers will at some point be written out, which will make the LFC lose its staleness.
In any case, the staleness of a page in the LFC doesn't (shouldn't) matter for this RFC.
3.2. Send a cache invalidation message to all proxies, notifying them that all new connections
should request and wait for the new connection details. At this stage, the proxy also has to
drop any existing connections to the old primary, so they don't do stale reads.
Question: Is it possible to make the old primary close the connections instead?
It's easy enough to prevent proxy from starting a new connection (assuming at-least-once-delivery) but closing existing connections is not so easy at the moment, and with the upcoming pglb work this likely would not be practical. It's not a major blocker, but I want to explore the options.
This step is right after we terminate the primary, so yes: during normal termination, we can expect that at this moment the primary will already be terminated and all connections to it will be closed.
I added this item after talking to Stas, as he had a fair point that the old primary could be unresponsive during this promotion flow: we will send termination and k8s resource deletion requests, but we cannot generally guarantee that it will be dead by this time. So this step is more to protect against the situation where old connections are still connected to the old primary.
See also item 7 in the failure modes. I'm not quite sure how big the problem is. Safekeepers will guarantee that there is only one running writer at a time, so it's more of a nice-to-have than a must-have feature, just to prevent unnecessary interference and side effects (I worried about stale reads from the old primary and failing writes, because safekeepers should reject them).
Both of these must be validated by the S3 proxy and the JWT token used by `compute_ctl`.

### Unresolved questions
## Authentication and authorization
1. Control plane should generate a JWT token which would be used by compute for authorizing requests
to S3 proxy. This token should be passed in the `/compute/api/v2/computes/{id}/spec` route,
parsed by `compute_ctl` as an optional field (to preserve backward compatibility) and passed as-is
to proxy requests during prewarm or prewarm offload. If no token is passed, requests to the proxy
should return 505 Not Supported (other services have `--dev` flags which disable the token check or
config fields which bypass authentication when absent, but that's error-prone) and log the error.
2. S3 proxy should have `auth_public_key.pem`, similar to pageserver,
for spawning an https server for compute requests (this would also further help in getting
PCI-DSS certification) and verifying compute requests. The pemfile should be supplied by
infra/. As with pageserver, S3 proxy should have a `/reload_auth_validation_keys` endpoint
to reload the pemfile from config should it change.
## Metrics
In addition to changing `compute_ctl`'s /status, the proxy should provide request duration
metrics along with result server codes as labels.
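As an illustration of item 1, a sketch of `compute_ctl` attaching such a token to an S3-proxy request, assuming a `reqwest`-based client; the proxy URL and endpoint path below are made-up placeholders, not the real API:

```rust
// Illustrative only: shows compute_ctl passing the cplane-issued JWT as a bearer token.
async fn offload_lfc_state(
    client: &reqwest::Client,
    proxy_url: &str,
    jwt: &str,
    state: Vec<u8>,
) -> Result<(), reqwest::Error> {
    client
        .post(format!("{proxy_url}/lfc_state")) // hypothetical path
        .bearer_auth(jwt)                        // token received via the compute spec
        .body(state)
        .send()
        .await?
        .error_for_status()?;
    Ok(())
}
```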
I believe Matthias's RFC focuses more on the high-level overview, but OK, I'll copy the comments. BTW, this will probably be split into tasks for https://github.com/neondatabase/cloud/issues/24225 (see sub-issues).
Do we have a calculation on the network bandwidth needed on the compute k8s cluster for a set of (large) tenants with LFCs in the order of hundreds of GB either storing or retrieving their LFC state from S3?
As for the compute_ctl->proxy side, it would be regular async streaming file download/upload. Regarding network capacity, most clients will use general proxies (one per region); however, for some big ones with large CUs we'll probably have dedicated proxy instances running on the compute nodes (not confirmed, as I need to test the performance implications).
The question was regarding the capacity of the 25 Gib/s NIC on the physical hardware of the k8s nodes running computes.
Problem
Neon currently implements several features that guarantee high uptime of compute nodes:
This helps us stay well within the uptime SLO of 99.95% most of the time. Problems begin when we go up to multi-TB workloads and 32-64 CU computes. During restart, the compute loses all caches: LFC, shared buffers, file system cache. Depending on the workload, it can take a lot of time to warm up the caches, so performance could be degraded and might even be unacceptable for certain workloads. The latter means that although the current approach works well for small to
medium workloads, we still have to do some additional work to avoid performance degradation after restarting large instances.
Rendered version
Part of https://github.com/neondatabase/cloud/issues/19011