Skip to content

Conversation

huww98
Copy link
Contributor

@huww98 huww98 commented Jul 20, 2025

What type of PR is this?

/kind feature

Uncomment only one /kind <> line, hit enter to put that in a new line, and remove leading whitespaces from that line:

/kind api-change
/kind bug
/kind cleanup
/kind design
/kind documentation
/kind failing-test
/kind feature
/kind flake

What this PR does / why we need it:

Support rollback to VAC A if modifying from A to B failed with a final error.
This works just like we modifying it again to C on final error.

The significant changes in the sync logic:

  • Always retry if pvc.Status.ModifyVolumeStatus is not nil, which means the
    last transation does not finish successfully.
  • Keep reconciling to the previous target if spec is rolled back to nil, until
    it succeeds or we get an infeasible error. Then we just leave it at its
    current state and stop reconciling for it, since user may not care about it
    now.

See kubernetes/enhancements#5482 for details

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

The first 2 commits are already included in my other PRs. Please review the last commit. I will rebase when the previous PRs are merged.

Does this PR introduce a user-facing change?:

Support rolling back when modify failed

@k8s-ci-robot k8s-ci-robot added release-note Denotes a PR that will be considered when it comes time to generate release notes. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. kind/feature Categorizes issue or PR as related to a new feature. labels Jul 20, 2025
@k8s-ci-robot k8s-ci-robot requested review from humblec and jingxu97 July 20, 2025 09:23
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Jul 20, 2025
@k8s-ci-robot
Copy link
Contributor

Hi @huww98. Thanks for your PR.

I'm waiting for a kubernetes-csi member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@huww98
Copy link
Contributor Author

huww98 commented Aug 17, 2025

Implementation of kubernetes/enhancements#5482

@huww98 huww98 force-pushed the rollback branch 2 times, most recently from 56ff79c to 19cef74 Compare August 24, 2025 12:47
@sunnylovestiramisu
Copy link
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 2, 2025

// Check if we should change our target
_, inUncertainState := ctrl.uncertainPVCs.Load(pvcKey)
if (status != nil && status.Status == v1.PersistentVolumeClaimModifyVolumeInProgress && inUncertainState) || pvcSpecVacName == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If status.Status == v1.PersistentVolumeClaimModifyVolumeInProgress alone should be a sufficient condition to not modify volume anymore?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that condition exists to ensure that for transient failing modifications we keep trying to reconcile to whatever value was recorded in target and do not start with new value of vac.

@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Sep 16, 2025
t.Fatalf("expected error to be %v, got %v", finalErr, err)
}
// should clear uncertain state
assertUncertain(false)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we also add cases to verify what the conditions should be in each step?

Copy link
Contributor Author

@huww98 huww98 Sep 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now The ModifyVolumeError condition will overwrite ModifyingVolume, we will always get only ModifyVolumeError after a failed modification. So I think it maybe not very interesting to check it here.

Added stronger assertions to TestMarkControllerModifyVolumeStatus.

@gnufied
Copy link
Contributor

gnufied commented Sep 23, 2025

@huww98 let me know when this PR is ready for another round of review. thanks.

Copy link
Member

@torredil torredil left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Validated that the change introduced in this PR does not break existing modify volume workflows by building a custom resizer image including this change and deploying latest version of EBS CSI Driver. I ran the external storage test suite and manually tested a few edge cases.

lgtm besides the outstanding feedback from @gnufied and @sunnylovestiramisu that needs to be addressed.

We should keep retry the previously specified target.
@huww98 huww98 force-pushed the rollback branch 2 times, most recently from ef281b6 to eff5e89 Compare September 24, 2025 17:12
Support rollback to VAC A if modifying from A to B failed with a final error.
This works just like we modifying it again to C on final error.

The significant changes in the sync logic:
- Always retry if pvc.Status.ModifyVolumeStatus is not nil, which means the
  last transation does not finish successfully.
- Keep reconciling to the previous target if spec is rolled back to nil, until
  it succeeds or we get an infeasible error. Then we just leave it at its
  current state and stop reconciling for it, since user may not care about it
  now.
pvcModifier should return independent object from the start of the call chain
to avoid interference between test cases.
Start modify @ t1
Type               Status  LastProbeTime  LastTransitionTime  Reason    Message
----               ------  -------------  ------------------  ------    -------
ModifyingVolume    True    t1             t1                            Modifying volume to <VAC name> is in progress.

final error @ t2:
- Use gPRC code for Reason
- update LastTransitionTime only if Reason changes
Type               Status  LastProbeTime  LastTransitionTime  Reason    Message
----               ------  -------------  ------------------  ------    -------
ModifyingVolume    True    t2             t1                            Modifying volume to <VAC name> failed. Waiting for retry.
ModifyVolumeError  True    t2             t2                  Internal  Final error.

Retry @ t3:
- update ModifyingVolume LastProbeTime
Type               Status  LastProbeTime  LastTransitionTime  Reason    Message
----               ------  -------------  ------------------  ------    -------
ModifyingVolume    True    t3             t1                            Modifying volume to <VAC name> is in progress.
ModifyVolumeError  True    t2             t2                  Internal  Final error.

non-final error @ t4:
Type               Status  LastProbeTime  LastTransitionTime  Reason            Message
----               ------  -------------  ------------------  ------            -------
ModifyingVolume    True    t4             t1                                    Modifying volume to <VAC name> is still in progress.
ModifyVolumeError  True    t4             t4                  DeadlineExceeded  Progress: 10%.
@gnufied
Copy link
Contributor

gnufied commented Sep 29, 2025

@huww98 you didn't address:

When I apply a invalid VAC (which doesn't exist) and then revert back to empty/nil VAC the status.modifyStatus is not cleared. IMO it should be cleared:

@huww98
Copy link
Contributor Author

huww98 commented Sep 29, 2025

@gnufied I think we have reached agreement about this: the status.modifyStatus should not be cleared

Yes. I think we can clear the conditions. But status.modifyStatus should not be cleared.

That is fine. I was mainly thinking about conditions.

@gnufied
Copy link
Contributor

gnufied commented Sep 29, 2025

@gnufied I think we have reached agreement about this: the status.modifyStatus should not be cleared

Okay, lets move past this issue for now. I was mainly talking about case of VAC being pending.

But I think I found a bug in handling of non-existing VACs which can cause users to abuse quota trivially. Example flow:

  1. A PVC starts with a nil VAC.
  2. User applies VAC B to the PVC, which fails with final error.
  3. User applies VAC C to the PVC, which doesn't exist, now since this we are recording VAC C in target's status, any record of B being ever applied is erased.

Make it harder to abuse VAC quota by change spec to an non-existing VAC when the volume is partially modified. Basically reverting 6c8c10f.
@huww98
Copy link
Contributor Author

huww98 commented Sep 29, 2025

It is a bit weird if we left TargetVolumeAttributesClassName empty while setting Status to "Pending". We want them to be set or cleared at the same time since we put them in a struct, right? But as you wish, I reverted to the previous behavior, it is more correct in terms of quota.

I'm thinking that can we just deprecate the Pending status? Maybe we should left modifyVolumeStatus untouched if VAC not exists, and just communicate this situation with condition and event. Maybe we can have condition with type: ModifyingVolume, status: "False" for this.

@gnufied
Copy link
Contributor

gnufied commented Oct 1, 2025

/lgtm
/approve

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Oct 1, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: gnufied, huww98

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 1, 2025
@k8s-ci-robot k8s-ci-robot merged commit 099f112 into kubernetes-csi:master Oct 1, 2025
6 checks passed
@huww98
Copy link
Contributor Author

huww98 commented Oct 7, 2025

/test pull-kubernetes-csi-external-resizer-1-33-on-kubernetes-master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. kind/feature Categorizes issue or PR as related to a new feature. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. release-note Denotes a PR that will be considered when it comes time to generate release notes. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants