ETCD-610: automated backups no config
---
title: automated-backups-no-config
authors:
  - "@elbehery"
reviewers:
  - "@soltysh"
  - "@dusk125"
  - "@hasbro17"
  - "@tjungblu"
  - "@williamcaban"
approvers:
  - "@soltysh"
  - "@dusk125"
  - "@hasbro17"
  - "@tjungblu"
  - "@williamcaban"
api-approvers:
  - "@soltysh"
creation-date: 2024-06-17
last-updated: 2024-06-17
tracking-link:
  - https://issues.redhat.com/browse/ETCD-609
see-also:
  - "https://issues.redhat.com/browse/OCPSTRAT-1411"
  - "https://issues.redhat.com/browse/OCPSTRAT-529"
---

# Automated Backups of etcd with No Config

## Summary

Enable automated etcd backups from day 1, without requiring any configuration from the user.

## Motivation

The [current automated backup of etcd](https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#creating-automated-etcd-backups_backup-etcd) of an OpenShift
cluster relies on user-provided configuration.

To improve the customer experience, this enhancement proposes taking etcd backups with a default configuration from day one, without additional configuration from the user.

### Goals
- Backups are taken without any configuration after cluster installation.
- Backups are saved to a default PersistentVolume, which can be overridden by the user.
- Backups are taken according to a default schedule, which can be overridden by the user.
- Backups are retained according to a default retention policy, which can be overridden by the user.

### Non-Goals
- Save cluster backups to cloud storage, e.g. S3.
  - This could be a future enhancement or extension to the API.
- Automate cluster restoration.
- Provide automated backups for non-self-hosted architectures like Hypershift.

### User Stories
- As a cluster administrator, I want cluster backups to be taken without configuration.
- As a cluster administrator, I want to schedule recurring cluster backups so that I have a recent cluster state to recover from in the event of quorum loss (i.e. losing a majority of control-plane nodes).
- As a cluster administrator, I want failures to take cluster backups for longer than a configurable period to be reported to me via critical alerts.

## Proposal
- Provide a default etcd backup configuration as an `EtcdBackup` CR.
- Provide a default, reliable storage mechanism to save the backup data on.
- Provide a separate feature gate for automated backups with the default config.

### Workflow Description
- The user enables the AutomatedBackupNoConfig feature gate.
- An `EtcdBackup` CR is installed on the cluster.
- A PVC and a PV are created to store the backup data reliably.

### API Extensions
- No API changes are required.
- See https://github.com/openshift/enhancements/blob/master/enhancements/etcd/automated-backups.md#api-extensions

### Topology Considerations
TBD
#### Hypershift / Hosted Control Planes
TBD
#### Standalone Clusters
TBD
#### Single-node Deployments or MicroShift
TBD

### Implementation Details/Notes/Constraints [optional]
- Need to agree on a default schedule.
- Need to agree on a default retention policy.

- Several options exist for the default PVCName.
  - Relying on `dynamic provisioning` is sufficient where it is available, but it is not an option for `SNO` or `BM` clusters.
  - Utilising the `local storage operator` is a proper solution, but installing a whole operator is too much overhead.
  - The most viable solution to cover all OCP variants is to use a `local volume`.
    Its pros and cons are listed below.
    - Pros:
      - The PV is bound to a specific master node.
      - The Backup Pod is always scheduled to this master node, since it uses a PVC bound to this PV.
      - Using an SSD is possible, since a local volume allows mounting it into a Pod.
      - Deterministic scheduling of the Backup Pod is vital, since the retention policy needs the recent backups in place to work.
    - Cons:
      - If the master node where the PV is mounted becomes unhealthy, unavailable, or unreachable:
        - The existing backups are no longer accessible, and taking new backups is no longer possible.
      - Ideally, backups should be taken in a round-robin fashion across the master nodes, to avoid loading one specific etcd member more than the others.
      - Spreading the backups among all etcd cluster members also provides guarantees for disaster recovery in case two members are lost at the same time.
      - However, using the `local volume` option across all master nodes would be complicated and error-prone.

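To make the round-robin idea from the cons above more concrete, here is a minimal Python sketch of how the backup target could rotate across master nodes while skipping unavailable ones. It only illustrates the trade-off and is not part of this proposal: the function name `next_backup_node`, the node names, and the `healthy` set are hypothetical, and a real implementation would live in the operator and read node status from the cluster.

```python
from typing import Optional, Sequence, Set


def next_backup_node(
    masters: Sequence[str],
    last_used: Optional[str],
    healthy: Set[str],
) -> Optional[str]:
    """Pick the next master node for a backup in round-robin order.

    Hypothetical helper: `masters` is the ordered list of master node names,
    `last_used` is the node that took the previous backup (or None), and
    `healthy` is the set of nodes currently able to run the backup Pod.
    Returns None when no master node is healthy.
    """
    if not masters:
        return None
    # Start right after the previously used node, or at index 0 otherwise.
    start = (masters.index(last_used) + 1) if last_used in masters else 0
    # Walk the ring once and return the first healthy node encountered.
    for offset in range(len(masters)):
        candidate = masters[(start + offset) % len(masters)]
        if candidate in healthy:
            return candidate
    return None


if __name__ == "__main__":
    nodes = ["master-0", "master-1", "master-2"]
    # master-1 is unavailable, so the rotation skips it.
    print(next_backup_node(nodes, "master-0", {"master-0", "master-2"}))
```

With a single `local` volume, by contrast, the target is fixed to the node the PV is bound to. That keeps scheduling deterministic, at the cost of the single point of failure listed in the cons.
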
### Risks and Mitigations

When the backups are configured to be saved to a `local` type PV, all backups are saved to a single master node, where the PV is provisioned on the local disk.

If that node becomes inaccessible or unschedulable, the recurring backups will not be scheduled. The periodic backup config would have to be recreated or updated with a different PVC, so that a new PV can be provisioned on a healthy node.

### Drawbacks

## Design Details

### Open Questions

## Test Plan

An e2e test will be added to exercise the following scenario:

- Enable the AutomatedBackupNoConfig feature gate.
- Verify that an `EtcdBackup` CR has been installed.
- Verify that a PV and a PVC have been created.
- Verify that a backup has been taken.
- Verify that the retained backups comply with the retention policy.

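The last verification step can be sketched as follows, assuming a count-based retention policy that keeps only the newest N backups. This is an assumption made for illustration, since the default retention policy is still to be agreed on. The helper names `backups_to_prune` and `retention_satisfied` and the backup names are hypothetical; the actual e2e test would inspect the backup files on the PV instead of an in-memory mapping.

```python
from datetime import datetime
from typing import Dict, List


def backups_to_prune(backups: Dict[str, datetime], max_backups: int) -> List[str]:
    """Return the names of backups that exceed a count-based retention limit.

    Hypothetical helper: `backups` maps a backup name to its creation time.
    The newest `max_backups` entries are kept; everything older is returned,
    oldest first.
    """
    if max_backups < 0:
        raise ValueError("max_backups must be non-negative")
    # Sort names from oldest to newest by creation time.
    oldest_first = sorted(backups, key=lambda name: backups[name])
    excess = len(oldest_first) - max_backups
    return oldest_first[:excess] if excess > 0 else []


def retention_satisfied(backups: Dict[str, datetime], max_backups: int) -> bool:
    """True when nothing is left to prune, i.e. the policy currently holds."""
    return not backups_to_prune(backups, max_backups)


if __name__ == "__main__":
    found = {
        "backup-a": datetime(2024, 6, 17, 0),
        "backup-b": datetime(2024, 6, 17, 1),
        "backup-c": datetime(2024, 6, 17, 2),
    }
    # With a limit of 2, the oldest backup should already have been removed.
    print(retention_satisfied(found, 2), backups_to_prune(found, 2))
```

A size-based policy could be checked the same way by summing file sizes instead of counting entries.
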
## Graduation Criteria
TBD

### Dev Preview -> Tech Preview
TBD

### Tech Preview -> GA
TBD

### Removing a deprecated feature
TBD

## Upgrade / Downgrade Strategy
TBD

## Version Skew Strategy
TBD

## Operational Aspects of API Extensions
TBD

#### Failure Modes
TBD

## Support Procedures
TBD

## Implementation History
TBD

## Alternatives
TBD

## Infrastructure Needed
TBD