From f54b7a3c93ac15e32694f07756e40f2f55bf715e Mon Sep 17 00:00:00 2001
From: Mustafa Elbehery
Date: Mon, 17 Jun 2024 23:44:35 +0200
Subject: [PATCH] ETCD-610: automated backups no config

---
 .../etcd/automated-backups-no-config.md | 165 ++++++++++++++++++
 1 file changed, 165 insertions(+)
 create mode 100644 enhancements/etcd/automated-backups-no-config.md

diff --git a/enhancements/etcd/automated-backups-no-config.md b/enhancements/etcd/automated-backups-no-config.md
new file mode 100644
index 00000000000..e2602971b04
--- /dev/null
+++ b/enhancements/etcd/automated-backups-no-config.md
@@ -0,0 +1,165 @@
+---
+title: automated-backups-no-config
+authors:
+  - "@elbehery"
+reviewers:
+  - "@soltysh"
+  - "@dusk125"
+  - "@hasbro17"
+  - "@tjungblu"
+  - "@williamcaban"
+approvers:
+  - "@soltysh"
+  - "@dusk125"
+  - "@hasbro17"
+  - "@tjungblu"
+  - "@williamcaban"
+api-approvers:
+  - "@soltysh"
+creation-date: 2024-06-17
+last-updated: 2024-06-17
+tracking-link:
+  - https://issues.redhat.com/browse/ETCD-609
+see-also:
+  - "https://issues.redhat.com/browse/OCPSTRAT-1411"
+  - "https://issues.redhat.com/browse/OCPSTRAT-529"
+---
+
+
+# Automated Backups of etcd with No Config
+
+## Summary
+
+Enable automated backups of etcd from day 1 without requiring any configuration from the user.
+
+## Motivation
+
+The [current automated backup of etcd](https://docs.openshift.com/container-platform/4.15/backup_and_restore/control_plane_backup_and_restore/backing-up-etcd.html#creating-automated-etcd-backups_backup-etcd) of an OpenShift
+cluster relies on user-provided configuration.
+
+To improve the customer experience, this enhancement proposes taking etcd backups using a default configuration from day one, without additional configuration from the user.
+
+
+### Goals
+- Backups should be taken without configuration after cluster installation.
+- Backups are saved to a default PersistentVolume, which the user can override.
+- Backups are taken according to a default schedule, which the user can override.
+- Backups are retained according to a default retention policy, which the user can override.
+
+
+### Non-Goals
+- Save cluster backups to cloud storage, e.g. S3.
+  - This could be a future enhancement or extension to the API.
+- Automate cluster restoration.
+- Provide automated backups for non-self-hosted architectures such as Hypershift.
+
+### User Stories
+- As a cluster administrator, I want cluster backups to be taken without configuration.
+- As a cluster administrator, I want to schedule recurring cluster backups so that I have a recent cluster state to recover from in the event of quorum loss (i.e. losing a majority of control-plane nodes).
+- As a cluster administrator, I want failures to take cluster backups for longer than a configurable period to be reported to me via critical alerts.
+
+
+## Proposal
+- Provide a default etcd backup configuration as an `EtcdBackup` CR.
+- Provide a default reliable storage mechanism to save the backup data on.
+- Provide a separate feature gate for the automated backup with default config.
+
+
+### Workflow Description
+- The user enables the AutomatedBackupNoConfig feature gate.
+- An `EtcdBackup` CR is installed on the cluster.
+- A PVC and a PV are created to store the backup data reliably.
+
+
+### API Extensions
+- No API changes are required.
+- See https://github.com/openshift/enhancements/blob/master/enhancements/etcd/automated-backups.md#api-extensions
+
+### Topology Considerations
+TBD
+#### Hypershift / Hosted Control Planes
+TBD
+#### Standalone Clusters
+TBD
+#### Single-node Deployments or MicroShift
+TBD
+
+### Implementation Details/Notes/Constraints [optional]
+- Need to agree on a default schedule.
+- Need to agree on a default retention policy.
+
+- Several options exist for the default PVCName.
+  - Relying on `dynamic provisioning` would be sufficient, but it is not an option for `SNO` or `BM` clusters.
+  - Utilising the `local storage operator` is a proper solution, but installing a whole operator is too much overhead.
+  - The most viable solution to cover all OCP variants is to use a `local volume`.
+    The pros and cons of this solution are listed below.
+    - Pros:
+      - The PV will be bound to a specific master node.
+      - The Backup Pod will always be scheduled to this master node, since it uses a PVC bound to this PV.
+      - Using an SSD is possible, since a local volume allows mounting it into a Pod.
+      - Deterministic scheduling of the Backup Pod is vital, since the retention policy needs the recent backups in place to work.
+    - Cons:
+      - If the master node where the PV is mounted becomes unhealthy, unavailable, or unreachable:
+        - The backups are no longer accessible, and taking new backups is no longer possible.
+      - Ideally, the backups should be taken in a round-robin fashion on each master node to avoid overwhelming a specific etcd member more than others.
+        - Spreading the backups among all etcd cluster members also provides guarantees for disaster recovery if two members are lost at the same time.
+        - However, using the `local volume` option over all the master nodes would be complicated and error-prone.
+
+### Risks and Mitigations
+
+When the backups are configured to be saved to a `local` type PV, all backups are saved to a single master node, where the PV is provisioned on the local disk.
+
+If that node becomes inaccessible or unschedulable, the recurring backups will not be scheduled. The periodic backup config would have to be recreated or updated with a different PVC that allows a new PV to be provisioned on a healthy node.
+
+### Drawbacks
+
+## Design Details
+
+### Open Questions
+
+
+## Test Plan
+
+An e2e test will be added to exercise the following scenario:
+
+- Enable the AutomatedBackupNoConfig FeatureGate.
+- Verify that an `EtcdBackup` CR has been installed.
+- Verify that a PV and a PVC have been created.
+- Verify that a backup has been taken.
+- Verify that the retained backups comply with the retention policy.
+
+## Graduation Criteria
+TBD
+
+### Dev Preview -> Tech Preview
+TBD
+
+### Tech Preview -> GA
+TBD
+
+### Removing a deprecated feature
+TBD
+
+## Upgrade / Downgrade Strategy
+TBD
+
+## Version Skew Strategy
+TBD
+
+## Operational Aspects of API Extensions
+TBD
+
+#### Failure Modes
+TBD
+
+## Support Procedures
+TBD
+
+## Implementation History
+TBD
+
+## Alternatives
+TBD
+
+## Infrastructure Needed
+TBD
\ No newline at end of file