KEP-1847: Auto remove PVCs created by StatefulSet #1915

Closed · wants to merge 3 commits
275 changes: 275 additions & 0 deletions keps/sig-storage/1847-autoremove-statefulset-pvcs/README.md
# KEP-1847: Auto remove PVCs created by StatefulSet

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
- [Goals](#goals)
- [Non-Goals](#non-goals)
- [Proposal](#proposal)
- [Background](#background)
- [Changes required](#changes-required)
- [User Stories (optional)](#user-stories-optional)
- [Story 1](#story-1)
- [Story 2](#story-2)
- [Notes/Constraints/Caveats (optional)](#notesconstraintscaveats-optional)
- [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
- [Volume delete policy for the StatefulSet created PVCs](#volume-delete-policy-for-the-statefulset-created-pvcs)
- [Cluster role change for statefulset controller](#cluster-role-change-for-statefulset-controller)
- [Test Plan](#test-plan)
- [Graduation Criteria](#graduation-criteria)
- [Alpha release](#alpha-release)
- [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy)
- [Version Skew Strategy](#version-skew-strategy)
- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire)
- [Feature Enablement and Rollback](#feature-enablement-and-rollback)
- [Implementation History](#implementation-history)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements] (not the initial KEP PR)
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] Production readiness review approved
- [ ] "Implementation History" section is up-to-date for milestone
- [ ] User-facing documentation has been created in [kubernetes/website], for publication to [kubernetes.io]
- [ ] Supporting documentation e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes


[kubernetes.io]: https://kubernetes.io/
[kubernetes/enhancements]: https://git.k8s.io/enhancements
[kubernetes/kubernetes]: https://git.k8s.io/kubernetes
[kubernetes/website]: https://git.k8s.io/website

## Summary
The proposal is to add a feature to auto-delete the PVCs created by a StatefulSet.

## Motivation

Currently, the PVCs created automatically by a StatefulSet are not deleted when
the StatefulSet is deleted. As the discussion in issue
[55045](https://github.com/kubernetes/kubernetes/issues/55045) shows, there are several use
cases where it is desirable for the automatically created PVCs to be deleted as well. In many
other StatefulSet use cases, however, PVCs have a different lifecycle than the pods of the
StatefulSet and should not be deleted at the same time. Because of this, PVC
deletion will be opt-in for users.

### Goals

- Provide a feature to auto-delete the PVCs created by a StatefulSet.
- Ensure that pod restarts due to non-scale-down events, such as a rolling
  update or node drain, do not delete the PVCs.

### Non-Goals

This proposal does not plan to address how the underlying PVs are treated on PVC deletion.
That functionality will continue to be governed by the ReclaimPolicy of the storage class.

## Proposal

### Background

The `garbagecollector` controller is responsible for ensuring that when a StatefulSet
is deleted, the corresponding pods spawned from the StatefulSet are deleted as well.
The `garbagecollector` uses the `OwnerReference` added to the `Pod` by the StatefulSet controller
to delete the Pod. This proposal leverages a similar mechanism to automatically
delete the PVCs created by the StatefulSet controller.
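As an illustration, the ownership mechanism described above looks roughly like the following Pod metadata (a sketch; the `web`/`web-0` names and the UID are placeholders, not from this KEP):

```yaml
# Sketch of the metadata the StatefulSet controller sets on a Pod it creates.
# The garbage collector walks these ownerReferences when the owner is deleted.
apiVersion: v1
kind: Pod
metadata:
  name: web-0                  # illustrative name
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: web                  # illustrative owner
    uid: 123e4567-e89b-12d3-a456-426614174000   # illustrative UID
    controller: true
    blockOwnerDeletion: true
```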

### Changes required

The following changes are required:

1. Add a `PersistentVolumeClaimDeletePolicy` entry to the StatefulSet spec in order to make this feature opt-in.
2. Provide the following `PersistentVolumeClaimDeletePolicy` values:
* `Retain` - the default policy, used when no policy is specified. This preserves the existing behaviour - when a StatefulSet is deleted, no action is taken with
respect to the PVCs created by the StatefulSet.
* `RemoveOnScaledown` - when a pod is deleted on scale down, the corresponding PVC is deleted as well.
A scale up following a scale down will wait until the old PVC for the removed Pod is deleted, ensuring
that the PVC used is a freshly created one.
* `RemoveOnStatefulSetDeletion` - PVCs corresponding to the StatefulSet are deleted when the StatefulSet
itself is deleted.
3. Add `patch` on `persistentvolumeclaims` to the StatefulSet controller's RBAC cluster role.
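The opt-in field might look like the following in a StatefulSet manifest (a hedged sketch: the serialized field name `persistentVolumeClaimDeletePolicy` is assumed from the API type proposed in this KEP, and the names and image are illustrative):

```yaml
# Hypothetical sketch of the proposed opt-in field.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: web
spec:
  # Assumed serialization of the new field; one of:
  # Retain | RemoveOnScaledown | RemoveOnStatefulSetDeletion
  persistentVolumeClaimDeletePolicy: RemoveOnScaledown
  serviceName: web
  replicas: 2
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
      - name: web
        image: registry.k8s.io/nginx-slim:0.8   # illustrative image
  volumeClaimTemplates:
  - metadata:
      name: data
    spec:
      accessModes: ["ReadWriteOnce"]
      resources:
        requests:
          storage: 1Gi
```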

### User Stories (optional)

#### Story 1
The user's environment is such that the content of the PVCs created automatically during StatefulSet
creation need not be retained after the StatefulSet is deleted. The user also requires that scale
up/down occurs quickly and leverages any previously existing auto-created PVCs within the
lifetime of the StatefulSet. An option needs to be provided for the user to auto-delete the PVCs
once the StatefulSet is deleted.

The user would set the `PersistentVolumeClaimDeletePolicy` to `RemoveOnStatefulSetDeletion`, which would ensure that
the PVCs created automatically during StatefulSet activation are removed once the StatefulSet
is deleted.

#### Story 2
The user is cost conscious but can sustain slower scale-up (after a scale-down) speeds. They need
a provision whereby the PVC created for a pod (which is part of the StatefulSet) is removed when the Pod
is deleted as part of a scale down. Since the subsequent scale up needs to create fresh PVCs, it will
be slower than scale ups relying on existing PVCs (from earlier scale ups).

The user would set the `PersistentVolumeClaimDeletePolicy` to `RemoveOnScaledown`, ensuring PVCs are deleted when the corresponding
Pods are deleted. New Pods created during a scale up following a scale down will wait for freshly created PVCs.

> **Review thread:**
> - *Reviewer:* Is there a real-world example of such an application? If the PVC should be deleted when the pod is deleted, how is this different from a pod using an `emptyDir`? I would vote for some more concrete use cases here.
> - *Contributor:* The volume is retained if the pod is rescheduled, e.g. if the node goes down or is upgraded. +1 to adding that explicitly so it's clear.
> - *Member:* Or a StatefulSet rolling update.
> - *Author:* Will update the pod rescheduling note shortly.


### Notes/Constraints/Caveats (optional)
> **Review thread:**
> - *Member:* Perhaps add information on the applicability of the feature across configurations. Does the feature work with local PVCs, for instance?
> - *Contributor:* @kk-src Maybe something like the following? "This feature applies to PVs which are dynamically provisioned from the volumeClaimTemplate of a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature."
> - *Author:* Thank you @kow3ns. @mattcary - Thank you, added to the text. Will come out in the next commit.
> - *Member:* Does it have to be dynamically provisioned from the StatefulSet controller? Since this behavior is opt-in, the user should understand what they're getting into. For example, today you can manually create a PV object and set the Delete reclaim policy. The provisioner will delete the volume even though it didn't provision it.
> - *Contributor:* But what if pods have multiple PVCs? Should all of them be deleted? I don't think that would make sense; there might be a shared volume. There's a 1:1 PVC:PV relationship, so that doesn't come up in that case. I think it's much simpler to define reasonable behavior if we scope to the volumeClaimTemplates only.
> - *Member:* That's a fair point. Scoping the behavior seems reasonable to me. I would just clarify that this is about PVC objects being created by the StatefulSet controller, and not about PV objects being dynamically provisioned. You can have a StatefulSet create PVCs that can be bound to pre-created PVs, and those should still be in scope for the StatefulSet reclaim policy.
> - *Contributor:* +1. In the design section we are talking about the PVCs created from the VolumeClaimTemplate in the StatefulSet. That's exactly the set we're interested in. E.g., if a user created a PVC that matched the naming convention of the volumeClaimTemplate, a pod created for the StatefulSet would attach to it, so it seems unsurprising that the new reclaim policy would cause that PVC to be deleted.
> - *Member:* Oh, I think I misunderstood then. I thought you wanted PVCs manually created by the user with the same naming convention to NOT be part of the reclaim policy. But it sounds like we do want them part of the reclaim policy. And PVCs referenced outside of volumeClaimTemplates, i.e. in `volumes`, are out of scope.
> - *Contributor:* Yes, sorry, there was a discussion that only partially made it into the comments. I think that the version down in "design details" is correct: if the StatefulSet has a volumeClaimTemplate, the static naming scheme defines a PVC for each pod, which is called here the associated PVC of a pod. These associated PVCs are the ones that are deleted or not according to the policy.


This feature applies to PVCs which are dynamically provisioned from the volumeClaimTemplate of
a StatefulSet. Any PVC and PV provisioned from this mechanism will function with this feature.

### Risks and Mitigations
> **Review thread:**
> - *Member:* A discussion of durability risks with the changed behavior (even though it is opt-in) should be added here.
> - *Contributor:* @kk-src Maybe something like the following? "If the PVCReclaimPolicy is changed from its default of Retain, then PVs will be deleted on StatefulSet scaledown or deletion. A user will lose data on those PVs if proactive effort is not taken to replicate that data. However, PVs associated with the StatefulSet will be more durable than ephemeral volumes would be, as they are only deleted on scaledown or StatefulSet deletion, and not on other pod lifecycle events like being rescheduled to a new node, even with the new retain policies."


Currently the PVCs created by a StatefulSet are not deleted automatically. Using the
`RemoveOnScaledown` or `RemoveOnStatefulSetDeletion` policies will delete the PVCs
automatically. Since this involves persistent data being deleted, users should take
appropriate care using this feature. Having the `Retain` behaviour as the default
ensures that the PVCs remain intact by default, and only a conscious choice
made by the user will involve any persistent data being deleted. Also, PVCs associated with the StatefulSet will be more
durable than ephemeral volumes would be, as they are only deleted on scaledown or StatefulSet deletion, and not on other pod lifecycle events
like being rescheduled to a new node, even with the new policies.

## Design Details

> **Review thread:**
> - *Contributor:* Should specify that ownerRefs on PVCs are only set for those that are created for volumes provisioned as part of the StatefulSet, i.e. in VolumeClaimTemplates with a StorageClass specified, and no dataSource, etc.
> - *Member:* Can you tell? Someone could manually create a dynamically provisioned PVC with the same name that the StatefulSet controller expects.
> - *Contributor:* That's a good point. Does that mean we should add an annotation when a PVC is created by the SS controller for a pod? We don't want to add an owner reference, I think, as that would make it harder to not change behavior for a Retain policy. It seems to me that an annotation would work well here, but maybe there is an idiomatic alternative.
> - *Author:* Have added text indicating that the static naming convention used by the StatefulSet will be reused here to identify the PVCs in question.

### Volume delete policy for the StatefulSet created PVCs

When a StatefulSet spec has a `VolumeClaimTemplate`, PVCs are dynamically created
using a static naming scheme. A new field named `PersistentVolumeClaimDeletePolicy` of the
type `StatefulSetPersistentVolumeClaimDeletePolicy` will be added to the StatefulSet. This
field represents the user's indication of whether the associated PVCs can be automatically
deleted. The default policy will be `Retain`.

If `PersistentVolumeClaimDeletePolicy` is set to `RemoveOnScaledown`, the Pod is set as the owner of the PVCs created
from the `VolumeClaimTemplates` just before the scale down is performed by the StatefulSet controller.
When a Pod is deleted, the PVC owned by the Pod is also deleted. When the `RemoveOnScaledown`
policy is set and the StatefulSet gets deleted, the PVCs will also get deleted
(similar to the `RemoveOnStatefulSetDeletion` policy).

The current StatefulSet controller implementation ensures that manually deleted pods are restored
before the scale down logic is run. This, combined with the fact that the owner references are set
only before the scale down, ensures that manual deletions do not automatically delete the PVCs
in question.

During scale-up, if a PVC has an `OwnerRef` that does not match the Pod, it
potentially indicates that the PVC is referenced by a deleted Pod and is in the process of
being deleted. The controller will exit the current reconcile loop and attempt to reconcile in the
next iteration. This avoids a race with PVC deletion.

When `PersistentVolumeClaimDeletePolicy` is set to `RemoveOnStatefulSetDeletion`, the owner reference in
the PVC points to the StatefulSet. When a scale up or down occurs, the PVC remains the same.
PVCs previously in use before a scale down will be used again when the scale up occurs. PVC deletion
should happen only after the Pod gets deleted. Since the Pod ownership has `blockOwnerDeletion` set to
`true`, pods will get deleted before the StatefulSet is deleted. The `blockOwnerDeletion` for PVCs will
be set to `false`, which ensures that PVC deletion happens only after the StatefulSet is deleted. This
chain of ownership ensures that Pod deletion occurs before the PVCs are deleted.

> **Review comment (Contributor):** That's very elegant, nice!
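The `RemoveOnStatefulSetDeletion` ownership described above might look like this on an associated PVC (a sketch; the `data-web-0`/`web` names and the UID are placeholders):

```yaml
# Sketch of the ownerReference the controller would set on an associated PVC
# under RemoveOnStatefulSetDeletion.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-web-0            # static naming scheme: <template>-<statefulset>-<ordinal>
  ownerReferences:
  - apiVersion: apps/v1
    kind: StatefulSet
    name: web
    uid: 123e4567-e89b-12d3-a456-426614174000   # illustrative UID
    blockOwnerDeletion: false  # the PVC does not block StatefulSet deletion
```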
> **Review thread:**
> - *Member:* What would happen if I delete a `RemoveOnStatefulSetDeletion` StatefulSet with the orphan policy? Will the PVCs be gone while the Pods stay?
> - *Contributor:* By the orphan policy, do you mean `kubectl delete --cascade=false`? In that case I think nothing should be deleted other than the StatefulSet resource; the semantics for cascade seem clear. That would mean removing any ownership, IIUC. At any rate, even if the PVCs were deleted, the protection controller would prevent them from being finalized until the pods were deleted. But it might be unexpected to have the PVCs disappear when the pods are eventually deleted.


The `Retain` `PersistentVolumeClaimDeletePolicy` will preserve the current behaviour - no PVC deletion is performed
by the StatefulSet controller.

In the alpha release we intend to keep the `PersistentVolumeClaimDeletePolicy` immutable after creation.
Based on user feedback, we will consider making this field mutable in future releases.

### Cluster role change for statefulset controller
In order to update the PVC owner references, `buildControllerRoles` will be updated with
`patch` on the `persistentvolumeclaims` resource.
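The resulting rule in the controller's cluster role might look roughly like the following (a sketch; the exact set of pre-existing verbs on `persistentvolumeclaims` is assumed here, only `patch` is the addition proposed by this KEP):

```yaml
# Sketch of the RBAC policy rule for the statefulset controller role.
- apiGroups: [""]
  resources: ["persistentvolumeclaims"]
  verbs: ["get", "list", "watch", "update", "patch"]  # "patch" is the new verb
```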

### Test Plan

1. Unit tests

1. e2e tests
- RemoveOnScaleDown
1. Create 2 pod statefulset, scale to 1 pod, confirm PVC deleted
1. Create 2 pod statefulset, add data to PVs, scale to 1 pod, scale back to 2, confirm PV empty.
1. Create 2 pod statefulset, delete stateful set, confirm PVCs deleted.
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, confirm pod comes back and PV still has data (PVC not deleted).
1. As above, but manually delete all pods in stateful set.
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, immediately scale down to one pod, confirm PVC is deleted.
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, immediately scale down to one pod, scale back to two pods, confirm PV is empty.
1. Create 2 pod statefulset, add data to PVs, perform a rolling update, confirm PVCs don't get deleted and PV contents remain intact and usable in the updated pods.
- RemoveOnStatefulSetDeletion
1. Create 2 pod statefulset, scale to 1 pod, confirm PVC still exists,
1. Create 2 pod statefulset, add data to PVs, scale to 1 pod, scale back to 2, confirm PV has data (PVC not deleted).
1. Create 2 pod statefulset, delete stateful set, confirm PVCs deleted
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, confirm pod comes back and PV has data (PVC not deleted).
1. As above, but manually delete all pods in stateful set.
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, immediately scale down to one pod, confirm PVC exists.
1. Create 2 pod statefulset, add data to PVs, manually delete one pod, immediately scale down to one pod, scale back to two pods, confirm PV has data.
- Retain:
1. same tests as above, but PVCs not removed in any case and confirm data intact on the PV.
- Pod restart tests:
1. Create statefulset, perform rolling update
1. Upgrade/Downgrade tests
1. Create statefulset in previous version and upgrade to the version
supporting this feature. The PVCs should remain intact.
2. Downgrade to an earlier version and check that PVCs with the `Retain`
policy remain intact, while those with other policies set before the
downgrade get removed depending on whether the references were already set.
1. Feature disablement/enable test for alpha feature flag `statefulset-autodelete-pvcs`.


### Graduation Criteria
> **Review comment (Member):** This, and remaining sections, should be filled out.


#### Alpha release
- Complete the items in the 'Changes required' section.
- Add unit, functional, upgrade and downgrade tests to the automated k8s test suites.

### Upgrade / Downgrade Strategy

A new field is being added to the StatefulSet. The upgrade will not
change the previously expected behaviour of existing StatefulSets.

If the StatefulSet had been set with the `RemoveOnStatefulSetDeletion`
or `RemoveOnScaledown` policy and the kube-controller-manager is downgraded,
even though the `PersistentVolumeClaimDeletePolicy` field will go away, the references
would still be acted upon by the garbage collector and cleaned up
based on the settings before the downgrade.

> **Review comment (Contributor):** Maybe say: "On a downgrade, the PersistentVolumeClaimReclaimPolicy field will be removed from any StatefulSets. If a scaledown or delete is in process when the downgrade happens, any existing OwnerRefs on PVCs will not be removed, so that in most cases when the scaledown completes, unused PVCs will be deleted. However, there may be edge cases causing only some of the unused PVCs to be deleted. As unused PVCs remaining after a scaledown is the expected behavior of the downgraded clusters, no further effort will be made to remove them."

### Version Skew Strategy
Only kube-controller-manager changes are involved, hence version skew with
other components is not applicable.

## Production Readiness Review Questionnaire

### Feature Enablement and Rollback

* **How can this feature be enabled / disabled in a live cluster?**
- [x] Feature gate (also fill in values in `kep.yaml`)
- Feature gate name: statefulset-autodelete-pvcs
- Components depending on the feature gate: kube-controller-manager

* **Does enabling the feature change any default behavior?**
The default behaviour is only changed when the user explicitly specifies a `PersistentVolumeClaimDeletePolicy`.
Hence there is no user-visible behaviour change by default.

* **Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?**
Yes, but with side effects for users who already started using the feature by means of
specifying a non-retain `PersistentVolumeClaimDeletePolicy`. We will add an annotation to the
PVC indicating that the references have been set from the previous enablement. A reconcile
loop which goes through the required PVCs and removes the references will then be added.
The side effect is that if there was a pod deletion before the references were removed after the
feature flag was disabled, the PVCs could get deleted.

> **Review comment (Contributor):** I think it will be sufficient to say yes here, and refer to the downgrade section above on what happens on rollback. I think that disabling the feature would also mean removing the new policy field? I'm not sure what you mean by annotating the PVCs here.

* **What happens if we reenable the feature if it was previously rolled back?**
The reconcile loop which removes references on disablement will not come into action. Since the
StatefulSet field would persist through the disablement, we will have to ensure that the required
references get set in the next set of reconcile loops.

> **Review thread:**
> - *Contributor:* This should have no effect, as the policy fields were removed during the rollback, so when re-enabled all StatefulSets in the cluster will be using the retain policy.
> - *Member:* The way feature disabling/rollback works is that the field remains stored in etcd; however, controllers will not process the field (because it's protected by a feature gate). Then if you re-enable the feature again, you can start processing the field again.
> - *Contributor:* Thanks. I think that means that RemoveOnDeletion needs to be reconciled, that is, make sure associated PVCs have ownerRefs to the StatefulSet. I don't see this as too much of a problem if things aren't exactly consistent. For example, one might set RemoveOnScaledown on a 10-replica StatefulSet, disable the feature, scale down to 6 pods, then re-enable the feature; PVCs 7-10 are not going to have been deleted and we can't be expected to delete them. So if the disabling or re-enabling happens during a scaledown and some PVCs are missed or deleted, the behavior is almost the same as before, when we were okay with the orphaned PVCs 7-10. The main thing, I think, is to make sure "future" scaledowns/deletions work correctly, and as long as we get the StatefulSet ownerRef for RemoveOnDeletion that will be covered.

* **Are there any tests for feature enablement/disablement?**
Feature enablement/disablement tests will be added.

## Implementation History

## Drawbacks
A StatefulSet API field update is required.

## Alternatives
Users can delete the PVCs manually; avoiding that manual step is the motivation for this KEP.
43 changes: 43 additions & 0 deletions keps/sig-storage/1847-autoremove-statefulset-pvcs/kep.yaml
title: Auto remove PVCs created by StatefulSet
kep-number: 1847
authors:
- "@kk-src"
- "@dsu-igeek"
- "@mattcary"
owning-sig: sig-apps
participating-sigs:
- sig-storage
status: implementable
creation-date: 2020-06-04
reviewers:
- "@kow3ns"
- "@xing-yang"
- "@msau42"
- "@janetkuo"
approvers:
- "@msau42"
- "@janetkuo"

#The target maturity stage in the current dev cycle for this KEP.
stage: alpha

# The most recent milestone for which work toward delivery of this KEP has been
# done. This can be the current (upcoming) milestone, if it is being actively
# worked on.
latest-milestone: "v1.20"

# The milestone at which this feature was, or is targeted to be, at each stage.
milestone:
  alpha: "v1.20"
  beta: "v1.21"
  stable: "v1.22"

# The following PRR answers are required at alpha release
# List the feature gate name and the components for which it must be enabled
#feature-gates:
# - default is existing behaviour. Only if retention flags are enabled does
# the feature come into action, hence not adding additional feature gate.

# The following PRR answers are required at beta release
# metrics:
# Currently no metrics is planned for alpha release.