-
Notifications
You must be signed in to change notification settings - Fork 716
ct: End to end cluster recovery #29553
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: dev
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR implements L0 recovery functionality by adding the ability to bootstrap partitions with custom initial offsets and terms. This enables programmatic partition creation for cluster recovery scenarios where partitions need to start at specific known offsets.
Changes:
- Added
bootstrap_partition_statefunction to create partition state with custom offset/term - Introduced
partition_bootstrap_paramsto store and propagate bootstrap parameters through the system - Added commands and handlers to set bootstrap parameters on existing topics before partition materialization
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| src/v/raft/consensus_utils.h | Declares new bootstrap_partition_state function |
| src/v/raft/consensus_utils.cc | Implements bootstrap_partition_state to create Raft snapshot and set storage metadata |
| src/v/raft/tests/bootstrap_partition_state_test.cc | Tests bootstrap_partition_state functionality |
| src/v/raft/tests/BUILD | Adds build configuration for bootstrap test |
| src/v/cluster/types.h | Defines partition_bootstrap_params and related command data structures |
| src/v/cluster/types.cc | Implements output operator for partition_bootstrap_params |
| src/v/cluster/topics_frontend.h | Declares set_bootstrap_params frontend method |
| src/v/cluster/topics_frontend.cc | Implements set_bootstrap_params to replicate bootstrap command |
| src/v/cluster/topic_updates_dispatcher.h | Declares apply method for bootstrap params command |
| src/v/cluster/topic_updates_dispatcher.cc | Implements command dispatch for bootstrap params |
| src/v/cluster/topic_table.h | Adds bootstrap_params field to partition metadata |
| src/v/cluster/topic_table.cc | Implements storage and retrieval of bootstrap params |
| src/v/cluster/tests/topic_table_test.cc | Tests bootstrap params command application |
| src/v/cluster/partition_manager.h | Adds bootstrap_params parameter to manage method |
| src/v/cluster/partition_manager.cc | Implements bootstrap logic using bootstrap_partition_state |
| src/v/cluster/controller_snapshot.h | Updates snapshot schema to include bootstrap_params |
| src/v/cluster/controller_backend.h | Adds bootstrap_params parameter to create_partition |
| src/v/cluster/controller_backend.cc | Passes bootstrap params through partition creation pipeline |
| src/v/cluster/commands.h | Defines set_partition_bootstrap_params_cmd |
2c63bce to
b85c5ce
Compare
b85c5ce to
1fa4866
Compare
Add infrastructure to support bootstrapping partitions with custom initial offset and term. This is used for programmatic partition creation during cluster recovery. - Add partition_bootstrap_params struct in types.h - Add get_partition_bootstrap_params() API to topic_table - Pass bootstrap_params through controller_backend to partition_manager - Add bootstrap_existing_log() helpers in raft/consensus_utils Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
1fa4866 to
2c8cdfc
Compare
Add a controller command to set bootstrap parameters for partitions in an existing topic. This enables cluster recovery to: 1. Create topics without remote_topic_properties 2. Set bootstrap params via this command 3. Let controller_backend create partitions with known offsets - Add set_partition_bootstrap_params_cmd_data struct - Add set_partition_bootstrap_params_cmd command type - Add apply() method in topic_table and topic_updates_dispatcher - Add set_bootstrap_params() method in topics_frontend - Add unit tests Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
paramters propagation and use. The test registers partition bootstrap parameters (start offset and term id) and then creates the topic. The partition for which the bootstrap parameters were registered is then validated to have the right starting offset. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
2c8cdfc to
f4683d2
Compare
Retry command for Build#80344please wait until all jobs are finished before running the slash command |
CI test resultstest results on build#80344
test results on build#80380
test results on build#80435test results on build#80448
|
| add_random_topic(); // invalidates iterator | ||
| BOOST_REQUIRE_THROW((void)it->first, iterator_stability_violation); | ||
| } | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Create topics without remote_topic_properties
I get that we can avoid downloading them on autocreate, but I think we still want to preserve the initial revision id of the topic. Topic manifests (used for read replicas) paths will include the initial revisions, and if we aren't preserving it, every time we do recovery we'll start uploading manifests to a different path.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but why do we need them in L0?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't just for L0, it's for recovery of the entire topic. L1 state is discoverable (e.g. through read replicas) by topic manifests
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. L0 doesn't need it to function but polluting bucket with manifests is a no go. Do we store initial revision in the metastore?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd love to avoid pulling in all topic manifests if possible. Given that we already have the metastore it should be a main way to pull metadata. Using a mix of data from the metastore and the manifests feels a bit odd. The metastore is clearly superior for the recovery.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I don't think we need to download the manifests. But the caller of recovery (wcr) should have enough information from the controller to give the revision id (if a remote revision is set, use it, and if not, use the topic revision)
| model::topic_namespace, | ||
| absl::btree_map<model::partition_id, partition_bootstrap_params>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do we guarantee that the topic creation that consumes these is actually the topic that we care about? Should we be including a revision id somewhere in the key?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't know the revision id of the topic because the topic is created after the bootstrap params are set. This is only used during the full cluster recovery. I guess we want to clean this up after partitions are created to avoid a situation when the topic is re-created but after that it's picking up the old bootstrap params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the cleanup for that stuff. So now the cluster recovery backend is setting the bootstrap parameters and then when the recovery is completed and all partitions are reconciled and actually created on replicas bootstrap parameters are removed.
| // Params should still be available after topic creation | ||
| params0 = table.local().get_partition_bootstrap_params(ntp0); | ||
| BOOST_REQUIRE(params0.has_value()); | ||
| BOOST_REQUIRE_EQUAL(params0->start_offset, model::offset(1000)); | ||
|
|
||
| params1 = table.local().get_partition_bootstrap_params(ntp1); | ||
| BOOST_REQUIRE(params1.has_value()); | ||
| BOOST_REQUIRE_EQUAL(params1->start_offset, model::offset(2000)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an important property to maintain? I would have thought that once we create the topic, we might want the topic table to remove the bootstrap params.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, but at the moment there is no command that removes them
but the intention is to eventually remove them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added the command that removes the bootstrap parameters, there is a test for that below.
| /// Data for setting bootstrap parameters on existing topic partitions. | ||
| /// Used by cluster recovery to specify known offsets for partitions. | ||
| struct set_partition_bootstrap_params_cmd_data |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we're leaving the bootstrap params in the topics table anyway, what's the rationale for having another command for this, vs including it as an optional parameter for topic creation?
Having it be separate feels a bit odd, because e.g. cluster recovery could fail midway and we could be left with the boostrap params in the topic table (and maybe they'd be unintentionally used by some other topic creation?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's just a separation of concerns. The plan is to add a clean-up command that will remove bootstrap state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any thoughts about the case where WCR fails and we end up leaving state in the topics table?
WDYT about storing the bootstrap params in the cluster_recovery_state (also replicated on every node on every shard through the cluster_recovery_table). That way it makes it clear that these bootstrap params are only for recovery, and it becomes easy to clear state atomically with respect to WCR status (e.g. once we complete or fail WCR, the state transition could clear this map).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But that should be fine. If recovery fails the user will retry the recovery. The same set of bootstrap params will be used.
I can try to do this, but I'm a bit concerned about new dependencies.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the user doesn't retry recovery though, this seems like a surprise waiting to happen. It doesn't seem like an unreasonable sequence of events that a user tries to recover, it fails midway, but the user thinks inspects the cluster and thinks it's good enough to continue their jobs, and doesn't retry.
Re: dependencies, yea it's a fair concern. I'm hoping there aren't any surprises there. I do appreciate that the cluster recovery table state updates are deterministic, but it's worth thinking about if there are races between its updates and topic creation...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that there are races because we're waiting until the changes are applied to the topic table and the reconciliation loop is doing the same thing. Essentially, both commands are replicated with the same log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the easiest solution is to clear this state on recovery failure. I'll try this next.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, when the new recovery starts.
| // hasn't been constructed yet. | ||
| // TODO: implement a recovery primitive for cloud topics | ||
| if (!ntp_cfg.cloud_topic_enabled()) { | ||
| if (bootstrap_params.has_value()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we also condition this on having cloud_topics_enabled set? It feels risky to be permissive here, unless you have some other user of bootstrap params in mind?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mechanism is generic, it's not tied to the cloud topics and will work in any case.
| assert result[0] is not None | ||
| return result[0] | ||
|
|
||
| def _wait_for_metastore_start_offset( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not used
Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
to the topics_frontend. The method replciates the command that clears pending bootstrap state from the topic_table. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
how the clear_partition_bootstrap_params_cmd command is handled Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
68bd273 to
429feb3
Compare
When all topics/partitions are created and reconciled invoke the method to remove bootstrap state. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
When the cloud topic is recovered the revision id has to be populated. This is done in the cluster recovery reconciler. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
e149526 to
de21311
Compare
Retry command for Build#80435please wait until all jobs are finished before running the slash command |
Add two new commands to the offline_log_viewer tool. Signed-off-by: Evgeny Lazin <4lazin@gmail.com>
|
upd:
|
This PR enables cloud topics cluster recovery to bootstrap partitions with the correct start offset and term from the L1 metastore. When a cloud topics cluster is recovered, partitions now start at their known offsets rather than offset 0, ensuring data consistency without requiring cloud access during partition creation.
Changes
Core Infrastructure
New Controller Command:
set_partition_bootstrap_params_cmdAdds a new controller command to set bootstrap parameters (start_offset, initial_term) for partitions before topic creation
Parameters are stored in
topic_table._pending_bootstrap_paramsmap keyed by NTPtopics_frontend::set_bootstrap_params()provides the API for setting these parametersPartition Bootstrap Flow
controller_backendfetches bootstrap params fromtopic_tablewhen creating partitionspartition_manager::manage()uses bootstrap params to initialize partition state viabootstrap_partition_state()New
raft::bootstrap_partition_state()function creates initial raft state with known offset/termCluster Recovery Integration
Modified
cluster_recovery_backend.ccto query L1 metastore for each cloud topic partition'sstart_offsetandtermCalls
set_bootstrap_params()before creating recovered cloud topicsPartitions are created with correct offsets from metastore, avoiding cloud access during partition creation
Backports Required
Release Notes