Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add disabled PV re-provisioning by StorageClasses option on restore #8287

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

clcondorcet
Copy link

@clcondorcet clcondorcet commented Oct 11, 2024

Thank you for contributing to Velero!

Please add a summary of your change

This PR addresses a specific disaster recovery use case:

When restoring a cluster after a disaster, PV may or may not have associated snapshots.
However, the underlying volumes from the CSI driver remain intact. I want to relink PVs with existing volumes instead of creating new ones. This PR propose a way to retrieve existing volumes without recreating them.

In my case, the PV has a Reclaim Policy Delete enforced by the StorageClass.
Velero currently does not restore PVs with a Delete policy and no snapshots, which makes sense for regular backup/restore scenarios but can be limiting in disaster recovery situations.

Proposed solution:

This PR introduces a new feature that allows Velero to restore PVs as-is when they meet the following conditions:

  • No snapshot is available.
  • The PV's StorageClass is listed in a newly introduced field in the CRD specifications called disabledPVReprovisioningStorageClasses.

This ensures that Velero can relink the PV to its existing volume and rebind the associated PVC, similar to how it restores PVs with a Reclaim Policy Retain.

Implementation Details:

This solution uses StorageClass names rather than PV names, as some CSI drivers generate random PV names.
Also, using StorageClass makes sense in this context because PVs sharing the same StorageClass generally have similar configurations.

This solution does not bypass snapshots. If snapshots are available, restoring from them is preferred as it is generally more reliable. However, this is open for discussion.

Does your change fix a particular issue?

There is no direct issue but I found two issues that are close to what I've added.

Please indicate you've done the following:

@BarthV
Copy link

BarthV commented Oct 11, 2024

Awesome! We are really interested in this annotation in order to recover all our existing volumes.

@Lyndon-Li
Copy link
Contributor

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincident and unsteady situation -- the CSI driver just doesn't have chance to reclaim the storage volumes of the PVs because of the disaster?
If so, the consequence of volumes are decided by the specific storage and it is unsafe to reuse the volumes.

@Depyume
Copy link

Depyume commented Oct 12, 2024

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincident and unsteady situation -- the CSI driver just doesn't have chance to reclaim the storage volumes of the PVs because of the disaster? If so, the consequence of volumes are decided by the specific storage and it is unsafe to reuse the volumes.

I think the goal is here to support cases where the entire kubernetes cluster is lost (e.g. burnt bare metal servers) and you still want to recover Volumes that may still exist on the storage backend, regardless Reclaim Policies of backuped volumes objects.

Currently Velero doesn't allow "recovering" / "reattaching" such volumes and either ignore them, or force empty volume recreation (expecting it to be recovered from a snapshot)

@reasonerjt reasonerjt added the Needs triage We need discussion to understand problem and decide the priority label Oct 14, 2024
@felfa01
Copy link

felfa01 commented Nov 20, 2024

Stumbled upon this PR and it seems to be solving a use case for Disaster Recovery that I have. All my PVs are created with reclaimPolicy: Delete via the StorageClass but for DR scenarios where the backend storage is left intact I'd like to have the option of PVs reconnecting to the backend storage instead of being re-created.

I am able to do this with some manual intervention pre-backup:

  • Modify the PV and remove any external-provisioner finalizers and change to reclaimPolicy: Retain
  • Run the Backup
  • Mimic a disaster by wiping the cluster
  • Re-create the cluster and run a restore
  • See that all PVs are recreated but are reconnected to the instact backend storage

I have tried automatic the manual step of modiyfing the PVs pre-backup but this appears to be not supported but from what I can see this PR is in line with what I am after. Looking forward to it.

@Depyume
Copy link

Depyume commented Nov 20, 2024

A rebase on this PR would be great :)

@clcondorcet
Copy link
Author

Stumbled upon this PR and it seems to be solving a use case for Disaster Recovery that I have. All my PVs are created with reclaimPolicy: Delete via the StorageClass but for DR scenarios where the backend storage is left intact I'd like to have the option of PVs reconnecting to the backend storage instead of being re-created.

That's exactly the idea! I'm glad to see that this feature is wanted by others.

Any update from the maintaining team ?

Copy link
Member

@kaovilai kaovilai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

disabledPVReprovisioningStorageClasses:
description: |-
DisabledPVReprovisioningStorageClasses is a slice of StorageClasses names.
PV without snaptshot and having one of these StorageClass will not be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
PV without snaptshot and having one of these StorageClass will not be
PV without snapshot and having one of these StorageClass will not be

@@ -129,6 +129,13 @@ type RestoreSpec struct {
// +optional
// +nullable
UploaderConfig *UploaderConfigForRestore `json:"uploaderConfig,omitempty"`

// DisabledPVReprovisioningStorageClasses is a slice of StorageClasses names.
// PV without snaptshot and having one of these StorageClass will not be
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// PV without snaptshot and having one of these StorageClass will not be
// PV without snapshot and having one of these StorageClass will not be

obj *unstructured.Unstructured,
logger logrus.FieldLogger,
) (*unstructured.Unstructured, error) {
logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and it's storage class has re-provisionning disabled.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and it's storage class has re-provisionning disabled.")
logger.Infof("Restoring persistent volume as-is because it doesn't have a snapshot and restore has storage class re-provisionning disabled.")

@Lyndon-Li
Copy link
Contributor

However, the underlying volumes from the CSI driver remain intact

Does this mean a coincident and unsteady situation -- the CSI driver just doesn't have chance to reclaim the storage volumes of the PVs because of the disaster? If so, the consequence of volumes are decided by the specific storage and it is unsafe to reuse the volumes.

I think the goal is here to support cases where the entire kubernetes cluster is lost (e.g. burnt bare metal servers) and you still want to recover Volumes that may still exist on the storage backend, regardless Reclaim Policies of backuped volumes objects.

Currently Velero doesn't allow "recovering" / "reattaching" such volumes and either ignore them, or force empty volume recreation (expecting it to be recovered from a snapshot)

The thing is the volumes may or may not exist in the storage, or it is not a steady situation. Operations may succeed or may not.
From the perspective of Kubernetes, the volumes in the storage are already abandoned objects. Therefore, reclaiming a new volume and restoring data there is a more rational approach.

Could you clarify why this relink approach is must-have and what the problem is if you use the current create new volume-restore approach?

@illrill
Copy link

illrill commented Jan 9, 2025

Could you clarify why this relink approach is must-have and what the problem is if you use the current create new volume-restore approach?

Time to recovery is the key, at least for me.

Last time I had to do a full cluster restore which involved a Restic data restore of a couple of TB's from Azure Blob onto persistent volumes based on Azure Files NFS, the restoration took several hours due to throttling, bandwidth and I/O slowless.

If I'd had the ability to re-link restored PV's to the existing (unharmed) Azure Files NFS volumes, the restore would probably have taken seconds.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
has-changelog has-unit-tests Needs triage We need discussion to understand problem and decide the priority
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants