doc/design: platform v2, physical distributed architecture of the query/control layer #24879
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Draft
aljoscha wants to merge 11 commits into MaterializeInc:main from aljoscha:design-pv2-physical-architecture
11 commits
- 8c17aa2 doc/design: skeleton of pv2 physical architecture
- bb5f96c type out section about current architecture
- 266d107 write up multi-envd-multi-clusterd section
- 7d8aeb4 fix NOTE
- 6e6bdd5 add TODO
- 50f5d11 add section about only-clusterd architecture, questions only
- 598d7d1 fixup link to logical design doc
- be56d36 fixup title
- 7c38773 fix weird typo
- f79e15a rename multi-envd-multi-clusterd -> multi-adapterd-multi-clusterd
- 1a05143 sketch how only-clusterd architecture could work, plus some more ques…
doc/developer/design/20240131_pv2_physical_architecture.md (+291 lines)
# Platform v2: Physical, Distributed Architecture of the Query/Control Layer

> [!WARNING]
> Currently, mostly a draft! For exploration and discussion. I'm explaining the
> difficulties with a _clusterd-only_ architecture, mostly coming up with
> questions that we'd need to answer. And I show how we can easily evolve our
> current architecture to a _multi-envd-clusterd_ architecture.
>
> This has "I think" and "I" phrases, because this is really me (aljoscha)
> exploring questions.

## Context

As part of the platform v2 work (specifically use-case isolation) we want to
develop a scalable and isolated query/control layer that is made up of multiple
processes that interact with distributed primitives at the moments where
coordination is required.

In Platform v2: Logical Architecture of the Query/Control Layer [(in-progress
PR)](https://github.com/MaterializeInc/materialize/pull/23543) we lay out the
logical architecture and reason about how we _can_ provide correctness in a
more decoupled/distributed setting. Here, we lay out the physical
architecture: what kinds of processes make up an environment, how they are
connected, how queries are routed, etc.

## Goals

- Specify what processes we run, and how many
- Specify how those processes interact to form a distributed query/control layer
- More, TBD!

## Non-Goals

- TBD!

## Overview

TBD!

## Current Architecture

With our current architecture, one environment runs multiple `clusterd` and a
single `environmentd` (`envd`) process.

(figure: diagram of the current architecture)

The single `envd` process is responsible for/does:

- interacts with the durable Catalog
- hosts the adapter layer, including the Coordinator
- hosts Controllers, which instruct clusters and process their responses
- has read holds:
  - for storage objects, these are ultimately backed by persist `SinceHandles`,
    which the Coordinator holds via a `StorageController`
  - for compute objects, these are ephemeral; the `ComputeController` ensures
    these holds by choosing when to send `AllowCompaction` commands to clusters
  - a compute controller is the single point that is responsible for
    instructing a cluster about when it can do compaction; this is how it
    logically has "holds" on compute objects
- processes user queries: for this, it needs both types of read holds mentioned above
- downgrades read holds based on outstanding queries and policies, and
  responses about write frontiers from clusters

How are queries routed/processed:

1. simple!
2. all queries arrive at single `envd`, in the adapter layer
3. adapter instructs responsible controller to spin up computation on a cluster
4. responsible controller sends commands to all replicas of the cluster
5. responsible controller eventually receives responses from replicas
6. adapter sends results back to client

### Commentary

Pros:

- Empirically works quite well so far!

Cons:

- The single `envd` is the single point through which all query/control processing
  must pass, limiting scalability and isolation.
- We run more than one type of process, which cloud might find harder to
  operate. But it's what we've done for a while now.

## Multi-adapterd, Multi-clusterd Architecture

With this architecture, we would run multiple `clusterd` processes, which make up
replicas, which in turn make up clusters. Same as with our current architecture.

The difference is that we shard the work of `envd` by cluster (the logical
CLUSTER that users know, _not_ `clusterd`) and now call it `adapterd`. There
would likely have to remain a skeleton `envd` whose only job is to talk to
the Orchestrator/Kubernetes: it has to make sure that the processes that
_should_ be running _are_ running, namely `clusterd` and `adapterd` for
clusters that exist in the catalog.
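
To make the skeleton `envd`'s job a bit more concrete, here is a minimal
sketch of the reconciliation it would run. The `CatalogSnapshot` and
`Orchestrator` traits are hypothetical stand-ins, not the real interfaces; the
point is only that `envd` diffs the clusters that exist in the catalog against
the processes that are actually running.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a read-only view of the catalog.
trait CatalogSnapshot {
    /// Names of the clusters that currently exist in the catalog.
    fn clusters(&self) -> BTreeSet<String>;
}

/// Hypothetical stand-in for the Orchestrator/Kubernetes interface.
trait Orchestrator {
    /// Clusters for which `adapterd`/`clusterd` processes are running.
    fn running(&self) -> BTreeSet<String>;
    fn ensure_processes_for(&mut self, cluster: &str);
    fn drop_processes_for(&mut self, cluster: &str);
}

/// One reconciliation pass of the skeleton `envd`: make sure the processes
/// that _should_ be running _are_ running, and nothing else is.
fn reconcile(catalog: &impl CatalogSnapshot, orch: &mut impl Orchestrator) {
    let desired = catalog.clusters();
    let actual = orch.running();

    for missing in desired.difference(&actual) {
        orch.ensure_processes_for(missing);
    }
    for stale in actual.difference(&desired) {
        orch.drop_processes_for(stale);
    }
}
```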

(figure: doc/developer/design/static/pv2_physical_architecture/multi-adapterd-clusterd.png)

> [!NOTE]
> We assume there is a balancer layer that understands enough about
> selecting clusters and can route user queries to the correct `adapterd`.
>
> And we are in fact working on that: it's `balancerd`!

- the `adapterd` can all write to the Catalog
- the `adapterd` subscribe to the Catalog to learn about changes that other
  `adapterd` are adding, update their in-memory view of the Catalog, and
  instruct the controller for their cluster when necessary (see the sketch
  after this list)
- the `adapterd` use the distributed Timestamp Oracle to make reads/writes
  linearizable
- the `adapterd` use the Tables-Txn abstraction for writing to tables from multiple
  processes
- a given `adapterd` hosts one set of controllers (one for compute and one for
  storage) that is _only_ responsible for the objects running on that cluster
  - those controllers instruct the cluster (replicas), same as before, and
    receive responses
- the `adapterd` has read holds:
  - for storage objects, these are ultimately backed by persist `SinceHandles`,
    which the Coordinator holds via a `CollectionsController` (TODO: ref to
    design doc)
  - the single compute controller is the single point that is responsible for
    instructing _its_ cluster about when it can do compaction; this is how it
    logically has "holds" on compute objects
- the adapter downgrades read holds based on outstanding queries and policies, and
  responses about write frontiers from _its_ cluster
- a given `adapterd` only has read holds on compute objects in its cluster
- a given `adapterd` can only process queries that exclusively use compute objects
  on that cluster. This is the same today: a query cannot mix compute resources
  from different clusters!
- a given `adapterd` holds `SinceHandles` for all storage collections (at least
  those that it is interested in), because storage collections can be accessed
  from any cluster
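
A minimal sketch of the catalog-subscription behavior from the list above,
with hypothetical simplified types standing in for the real catalog and
controller interfaces: every `adapterd` applies every catalog change to its
in-memory view, but only forwards changes that target _its_ cluster to its
controller.

```rust
use std::collections::BTreeMap;

/// A catalog change, in hypothetical, heavily simplified form.
enum CatalogEvent {
    CreateIndex { cluster: String, index: String },
    DropIndex { cluster: String, index: String },
}

/// Hypothetical per-cluster controller: it only knows about its own cluster.
trait ClusterController {
    fn create_dataflow(&mut self, index: &str);
    fn drop_dataflow(&mut self, index: &str);
}

/// The slice of `adapterd` state that matters for this sketch.
struct Adapterd<C: ClusterController> {
    /// The cluster this `adapterd` is responsible for.
    cluster: String,
    /// In-memory view of the whole catalog (indexes per cluster).
    catalog: BTreeMap<String, Vec<String>>,
    /// The controller for _this_ cluster only.
    controller: C,
}

impl<C: ClusterController> Adapterd<C> {
    /// Apply one event from the catalog subscription. Every `adapterd` sees
    /// every event and updates its in-memory catalog, but only instructs its
    /// controller for events that target its own cluster.
    fn apply(&mut self, event: CatalogEvent) {
        match event {
            CatalogEvent::CreateIndex { cluster, index } => {
                self.catalog.entry(cluster.clone()).or_default().push(index.clone());
                if cluster == self.cluster {
                    self.controller.create_dataflow(&index);
                }
            }
            CatalogEvent::DropIndex { cluster, index } => {
                if let Some(indexes) = self.catalog.get_mut(&cluster) {
                    indexes.retain(|i| i != &index);
                }
                if cluster == self.cluster {
                    self.controller.drop_dataflow(&index);
                }
            }
        }
    }
}
```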

How are queries routed/processed:

1. simple-ish!
2. the balancer layer routes a query to the responsible `adapterd` (see the
   routing sketch after this list)
3. query arrives in the adapter layer
4. adapter instructs _its_ controller to spin up computation on _its_ cluster
5. controller sends commands to all replicas of _its_ cluster
6. controller eventually receives responses from replicas
7. adapter sends results back to client
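
A hedged sketch of step 2, assuming the balancer keeps a simple map from
cluster name to `adapterd` address; how `balancerd` actually discovers and
refreshes that mapping is not specified here, and the addresses below are made
up for illustration.

```rust
use std::collections::BTreeMap;

/// Hypothetical routing table inside the balancer layer: one `adapterd`
/// address per user-facing cluster.
struct RoutingTable {
    adapterd_by_cluster: BTreeMap<String, String>,
}

impl RoutingTable {
    /// Resolve the `adapterd` that must handle queries for `cluster`.
    /// Returns `None` if the cluster is unknown (e.g. it was just dropped).
    fn route(&self, cluster: &str) -> Option<&str> {
        self.adapterd_by_cluster.get(cluster).map(String::as_str)
    }
}

fn main() {
    let table = RoutingTable {
        adapterd_by_cluster: BTreeMap::from([
            ("quickstart".to_string(), "adapterd-quickstart:6875".to_string()),
            ("prod".to_string(), "adapterd-prod:6875".to_string()),
        ]),
    };
    // A session whose active cluster is `prod` gets forwarded to that
    // cluster's `adapterd`.
    assert_eq!(table.route("prod"), Some("adapterd-prod:6875"));
}
```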

### Commentary

Pros:

- Easy to migrate to from our current architecture.
- Easy-ish to think about how information flows in the system, who "holds" read
  holds, how queries are routed through adapter and to replicas.
- Nothing changes in the UX: users still interact with clusters, it's just that
  the query/control work is now sharded, meaning there is better
  isolation/scalability.

Cons:

- We run more than one type of process, which cloud might find harder to
  operate. But it's what we've done for a while now.
- We run more than one `envd`-shaped process (our new `adapterd`, one per
  cluster), which we previously didn't.

## Multi-clusterd, No-envd Architecture

It's certainly possible, but hard for me to see how we get there with how
controllers, read holds, and query processing work today.

Remember: a user-facing CLUSTER is made up of replicas, which themselves are
made up of `clusterd` processes.

Questions:

- How does Catalog content (the source of truth) get translated to instructions
  for clusters/replicas?
  - In the multi-adapterd-multi-clusterd design, there is still a single point
    (the controller, in that cluster's `adapterd`) that instructs all cluster
    replicas, based on the Catalog. Which is easy to reason about.
  - Do all replicas participate in a distributed protocol to figure out what to
    do amongst themselves?
- Who holds on to `SinceHandles` for storage collections? Is it all `clusterd`
  processes?
  - The `n*m` problem, where `n` is the number of `clusterd` processes and `m` is
    the number of storage collections, doesn't seem too problematic, because our
    `n` is small. This one seems tractable, at least.
- Who holds on to read holds for compute objects, and how?
  - In the multi-adapterd-multi-clusterd design, there is still a single point
    (the controller, in that cluster's `adapterd`) that instructs all cluster
    replicas and allows them to do compaction when desired. It can therefore
    logically keep read holds.
  - Would all `clusterd` processes somehow logically keep read holds for all
    replicas? They would need this to do query processing!
- How do queries get routed?
  - In the multi-adapterd-multi-clusterd design, there is still a single
    entrypoint _per cluster_: that cluster's `adapterd` process.
  - The balancer could choose one `clusterd` process, round-robin. And then
    that `clusterd` process (the piece of adapter logic in it) has to process
    the query and talk to the other replicas/processes?
  - There could be one "special" `clusterd` process that does the query/control
    work. Feels very iffy!

### Commentary

Pros:

- We only run one type of process.
  - Plus the one super-slim `envd` for bootstrapping things and talking to the
    Orchestrator/Kubernetes?

Cons:

- Hard to see how we evolve from our current architecture to this one. At
  least for me, aljoscha.

## Sketch of multi-clusterd architecture

In order to smear the responsibilities across the multiple `clusterd`, instead
of them being held by a single `adapterd`, we need to look at three main
responsibilities:

1. translating Catalog contents to commands for clusters/replicas
2. holding on to logical since handles for compute objects (across all the
   replicas) and synthesizing `AllowCompaction` commands
3. routing/processing queries

> [!NOTE]
> I'll use persist shards and their `compare_and_append` when talking about
> distributed coordination below, but these ideas can be mapped to other
> primitives. It's just what we're used to by now!

### Catalog -> cluster/replica commands

The inputs to synthesizing cluster/replica commands are: 1. Catalog changes,
and 2. currently held read holds. If all replicas hold read holds and can
listen to Catalog changes, they all have access to the same information.

They need to coordinate when writing down commands, for example in a durable
command log/persist shard. All processes could scramble and try to
`compare_and_append`, or they could elect a leader that does the appending,
with less contention.

All clusters/replicas would listen on the durable command log instead of being
instructed by controller commands.
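
To make the "scramble and `compare_and_append`" option concrete, here is a
minimal sketch with a hypothetical in-memory `CommandLog` standing in for a
persist shard: a process reads the current upper, tries to append at that
upper, and on a mismatch re-reads and retries. Electing a leader would simply
reduce how often this retry loop loses.

```rust
/// Hypothetical durable command log; in the real system a persist shard and
/// its `compare_and_append` would play this role.
#[derive(Default)]
struct CommandLog {
    commands: Vec<String>,
}

enum AppendError {
    /// Someone else appended first: the expected upper no longer matches.
    UpperMismatch { current_upper: usize },
}

impl CommandLog {
    fn upper(&self) -> usize {
        self.commands.len()
    }

    /// Append `command` only if the log still ends at `expected_upper`.
    fn compare_and_append(
        &mut self,
        expected_upper: usize,
        command: String,
    ) -> Result<(), AppendError> {
        if self.commands.len() != expected_upper {
            return Err(AppendError::UpperMismatch { current_upper: self.commands.len() });
        }
        self.commands.push(command);
        Ok(())
    }
}

/// What any process does to get a command into the log: read the upper, try
/// to append, and on conflict catch up and retry. In the real design it would
/// also re-check whether the command is still needed given what concurrent
/// writers appended in the meantime.
fn append_with_retry(log: &mut CommandLog, command: String) {
    let mut expected = log.upper();
    loop {
        match log.compare_and_append(expected, command.clone()) {
            Ok(()) => return,
            Err(AppendError::UpperMismatch { current_upper }) => {
                expected = current_upper;
            }
        }
    }
}
```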

### Read Holds

Read holds can also be facilitated by the durable command log: all processes
put in place an `AllowCompaction` command for the things they care about.
Clusters/replicas listen to this to know when they can downgrade. They can
downgrade only based on the minimum/meet over all read holds/`AllowCompaction`
commands.

Compute currently downgrades aggressively, because it's cheap to send
controller messages to clusters/replicas. If we use the durable command log for
read holds/downgrades, we would have to rate-limit writing new
`AllowCompaction` commands to the log. This means we potentially do less
compaction, but that could be fine.
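
A sketch of what such rate-limiting could look like, assuming we simply
suppress `AllowCompaction` writes unless the frontier has advanced far enough
or enough wall-clock time has passed since the last write. The thresholds are
made up for illustration, not tuned values.

```rust
use std::time::{Duration, Instant};

/// Decides whether a new `AllowCompaction` command is worth writing to the
/// durable command log.
struct CompactionWriteLimiter {
    last_written_frontier: u64,
    last_write: Instant,
    min_interval: Duration,
    min_frontier_advance: u64,
}

impl CompactionWriteLimiter {
    fn new() -> Self {
        Self {
            last_written_frontier: 0,
            last_write: Instant::now(),
            // Illustrative values: write at most every 10s, unless the
            // frontier jumped by a lot.
            min_interval: Duration::from_secs(10),
            min_frontier_advance: 1_000,
        }
    }

    /// Returns true if an `AllowCompaction(new_frontier)` command should be
    /// written to the log now, and records the write if so.
    fn should_write(&mut self, new_frontier: u64) -> bool {
        let advanced_enough =
            new_frontier >= self.last_written_frontier + self.min_frontier_advance;
        let waited_enough = self.last_write.elapsed() >= self.min_interval;
        if advanced_enough || (waited_enough && new_frontier > self.last_written_frontier) {
            self.last_written_frontier = new_frontier;
            self.last_write = Instant::now();
            return true;
        }
        false
    }
}
```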

A CLUSTER (the user-facing one) could also use this concept to collectively
"hold" a set of persist since handles for all collections: the different
processes put in place read holds in the command log, and any process can
downgrade the since handle to the minimum/meet of what we have in the durable
command log. This way, we multiplex multiple logical holds into one set of
"physical" since handles. Each CLUSTER logically holds one persist since handle
per collection.
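
A sketch of that multiplexing, with hypothetical stand-in types rather than
the real persist API: many logical holds recorded in the command log collapse
into one physical since handle per collection, and any process may downgrade
that handle, but only ever to the minimum/meet over all logical holds.

```rust
use std::collections::BTreeMap;

/// Stand-in for a persist since handle on one collection: the physical hold.
struct PhysicalSinceHandle {
    since: u64,
}

impl PhysicalSinceHandle {
    /// Downgrading may only ever move the since forward.
    fn downgrade(&mut self, new_since: u64) {
        self.since = self.since.max(new_since);
    }
}

/// The logical read holds for one collection, as recorded in the durable
/// command log, keyed by the process (or replica) that put them in place.
struct LogicalHolds {
    holds: BTreeMap<String, u64>,
}

impl LogicalHolds {
    /// The minimum (meet) over all logical holds: the furthest the physical
    /// since is allowed to advance.
    fn meet(&self) -> Option<u64> {
        self.holds.values().copied().min()
    }
}

/// Any process that has caught up on the command log can do this: downgrade
/// the single physical handle to the meet of everyone's logical holds.
fn maybe_downgrade(handle: &mut PhysicalSinceHandle, holds: &LogicalHolds) {
    if let Some(meet) = holds.meet() {
        handle.downgrade(meet);
    }
}
```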

### Query Routing

All processes can connect to all other clusters/replicas. The balancer routes a
query to one `clusterd` process.

**Question**: What about sessions?

- Equivalent to Postgres, where, in order to connect to a different database,
  you have to re-connect. Our equivalent would be that you have to re-connect
  if you want to talk to a different cluster.
- Still means that all the queries from one session have to be routed to the
  same `clusterd` process, or they somehow have to share session state.
- Makes the user experience slightly worse.

## Questions

- Jan: Who serves queries when there is no replica?
- Dan: Is there work that a cluster/replica needs to do when there are no replicas?

## Alternatives

TBD!

## Open questions

Binary file added: doc/developer/design/static/pv2_physical_architecture/multi-adapterd-clusterd.png (+127 KB)

Seems like a good thing to work through! This option seems to me to have the most upside, but certainly has the most open questions. I think until those are resolved a bit more, hard to have an opinion on which is the best end state.
Also I think there's a world in which we evolve to here through something like the Multi-cluster-controllerd, Multi-clusterd Architecture, but it'd be good to flesh out how that would work early on.