🐛 fix(crd-upgrade-safety): Safely handle changes to description fields #2023

Merged

Conversation

anik120
Contributor

@anik120 anik120 commented Jun 11, 2025

Description

Motivation:

When attempting to upgrade argocd-operator from v0.5.0 to v0.7.0, the upgrade process fails during the preflight CRD safety validation. The validation correctly detects that the argocds.argoproj.io CRD has been modified between the two versions.

The specific error reported is:

CustomResourceDefinition argocds.argoproj.io failed upgrade safety validation. "ChangeValidator" validation failed: version "v1alpha1", field "^.status.applicationController" has unknown change, refusing to determine that change is safe

However, the changes between the CRD versions in this instance are limited to non-functional updates in the description fields of various properties (e.g., status.applicationController). ChangeValidator lacks a specific rule to classify a description-only update as safe, which blocks legitimate and otherwise safe operator upgrades.

Solution:

This PR enhances the CRD upgrade safety validation logic to correctly handle changes to description fields by introducing a new ChangeValidation check for Description, and registering the check by adding it to the default list of ChangeValidations used by ChangeValidator.
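
A minimal sketch of how such a check could look, assuming simplified stand-in types. The `Schema` struct, its fields, and the body of `isHandled` are illustrative assumptions rather than the project's actual definitions; only the `Description(diff FieldDiff) (bool, error)` signature and the `isHandled(diff, reset)` call shape appear in this PR's diff.

```go
package main

import "fmt"

// Schema is a hypothetical, simplified stand-in for a CRD field schema.
type Schema struct {
	Type        string
	Description string
}

// FieldDiff holds the old and new schema of a single field.
type FieldDiff struct {
	Old, New Schema
}

// isHandled applies reset to the diff and reports whether the old and new
// schemas are identical afterwards, i.e. whether the check accounts for every
// difference. (Assumed behavior for this sketch.)
func isHandled(diff FieldDiff, reset func(FieldDiff) FieldDiff) bool {
	diff = reset(diff)
	return diff.Old == diff.New
}

// Description reports whether the diff is fully explained by a change to the
// description field.
func Description(diff FieldDiff) (bool, error) {
	reset := func(d FieldDiff) FieldDiff {
		// d is a copy, so the caller's schemas are left untouched.
		d.Old.Description = ""
		d.New.Description = ""
		return d
	}
	return isHandled(diff, reset), nil
}

func main() {
	diff := FieldDiff{
		Old: Schema{Type: "string", Description: "There are five possible values"},
		New: Schema{Type: "string", Description: "There are four possible values"},
	}
	handled, err := Description(diff)
	fmt.Println(handled, err) // prints: true <nil>
}
```

In this sketch a check only "handles" a diff when nothing is left over after blanking the field it owns, so a description edit combined with any other change still falls through to the remaining checks.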

Result:

Non-functional updates to documentation fields are now deemed safe (which resolves the upgrade failure for argocd-operator from v0.5.0 to v0.7.0).

Reviewer Checklist

  • API Go Documentation
  • Tests: Unit Tests (and E2E Tests, if appropriate)
  • Comprehensive Commit Messages
  • Links to related GitHub Issue(s)

@anik120 anik120 requested a review from a team as a code owner June 11, 2025 19:26

netlify bot commented Jun 11, 2025

Deploy Preview for olmv1 ready!

Name Link
🔨 Latest commit eebe373
🔍 Latest deploy log https://app.netlify.com/projects/olmv1/deploys/684af48eec38f6000856defb
😎 Deploy Preview https://deploy-preview-2023--olmv1.netlify.app

@openshift-ci openshift-ci bot requested review from kevinrizza and perdasilva June 11, 2025 19:26

codecov bot commented Jun 11, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 69.31%. Comparing base (b152c7b) to head (eebe373).
Report is 6 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##             main    #2023       +/-   ##
===========================================
+ Coverage   43.06%   69.31%   +26.25%     
===========================================
  Files          64       79       +15     
  Lines        5418     7059     +1641     
===========================================
+ Hits         2333     4893     +2560     
+ Misses       2741     1884      -857     
+ Partials      344      282       -62     
Flag Coverage Δ
e2e 42.90% <12.50%> (-0.16%) ⬇️
unit 60.23% <100.00%> (?)


Contributor

@everettraven everettraven left a comment

Saw this PR show up in passing and I am working on a validation very similar to this for kubernetes-sigs/crdify and thought it would be worth taking a look.

@@ -242,3 +242,13 @@ func Type(diff FieldDiff) (bool, error) {

return isHandled(diff, reset), err
}

// A change in a description is always considered safe and non-breaking.
Contributor

Just a note that this isn't always true.

While it is true that a basic description change is usually non-breaking, a change to the semantics of a field is a breaking change.

Unfortunately a semantics change isn't something that is easily detectable in a case like this, so this is probably something that OLM wants to allow for the most part (but maybe should warn about somehow?).

Contributor

This.
I came to take a peek and was immediately wary of a blanket statement that a description-only change could never be breaking.
Our position in the past has been "where there is no clear/easy evaluation, fail the safety check".
I think we could argue that this is too strict and we should -- for a few nebulous items -- adopt the position of "notify and proceed".
I could see this growing to participate in an eventual approval flow, but for now is there a third condition (not just pass/fail) where we could add a notification/caution?

Member

There has been some discussion recently about introducing some sort of revision API (not unlike our semi-internal Helm release secrets now), where there is a fairly obvious opportunity to compute a diff and potentially introduce an approval mechanism where this kind of change would show up in the diff.

If we're getting super-fancy, I could imagine some sort of opinionated diff UI that categorizes changes and could, for example, group the CRD field description diffs into a spot that would be easier to assess them for semantic changes.

My opinion for the here and now is:

  1. Users will almost definitely consider our current behavior (block upgrades for description changes) a bug.
  2. Therefore, we should merge this change.

In my opinion, OLM should not be trying -- at runtime -- to have opinions about semantic changes. That is where I think we should draw a clear line and say that we expect/assume upgrade graphs and version numbers from operator authors to be correct (and if they are not, that is a clear bug against the operator, not OLM).

Contributor

+1, agree with:

Users will almost definitely consider our current behavior (block upgrades for description changes) a bug.

Contributor

IIUC, OLM does technically have a workaround that could act as an approval method by disabling the checks to proceed with an upgrade. That being said, it certainly isn't an ideal workflow.

I tend to agree that OLM should not be too strict here - at the end of the day these checks are best effort. Hopefully as the ecosystem of tooling for extension authors progresses there will be more robust pipelines put in place on the author side to catch these potential regressions.

Contributor Author

@anik120 anik120 Jun 12, 2025

Fyi this is the description change:

argocd-operator v0.5.0:

ApplicationController is a simple, high-level summary of where the Argo CD application controller component is in its lifecycle. There are five possible ApplicationController values:

  • Pending: The Argo CD application controller component has been accepted by the Kubernetes system, but one or more of the required resources have not been created.
  • Running: All of the required Pods for the Argo CD application controller component are in a Ready state.
  • Failed: At least one of the Argo CD application controller component Pods had a failure.
  • Unknown: For some reason the state of the Argo CD application controller component could not be obtained.

argocd-operator v0.7.0:

ApplicationController is a simple, high-level summary of where the Argo CD application controller component is in its lifecycle. There are four possible ApplicationController values:

  • Pending: The Argo CD application controller component has been accepted by the Kubernetes system, but one or more of the required resources have not been created.
  • Running: All of the required Pods for the Argo CD application controller component are in a Ready state.
  • Failed: At least one of the Argo CD application controller component Pods had a failure.
  • Unknown: The state of the Argo CD application controller component could not be obtained.

(Note the changed portions: "five" became "four", and "For some reason the state" became "The state".)
I.e., in this very case the change was a correction, since there are only four possible values both before and after.

Contributor

@perdasilva perdasilva Jun 13, 2025

The question in my mind is less whether a description change can or cannot be breaking, and more how much responsibility OLM should take for gating author intention... I don't have an answer here, but it's an interesting question. Should we be controlling the release process of the APIs of others? If an upgrade of a package destroys the cluster, is it OLM's fault? The more we gate, the more we take on that responsibility, in a way... maybe? Should this be a catalog concern? Or maybe a policy/approval concern outside of OLM (i.e. some kind of approver type)? Sorry, just thinking out loud...

handled: false,
},
{
name: "different field changed with description, no error, not handled",
Member

What if both ID and Description change?

Contributor Author

handled will be false. Added a test case for it 👍🏽
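
The behavior described in this reply can be sketched with a small table of cases. This is an illustration under stated assumptions: `Schema` and `FieldDiff` are hypothetical simplified types, and the body of `Description` is assumed logic (blank the descriptions, then compare), not the code or the test added in the PR.

```go
package main

import "fmt"

// Schema and FieldDiff are hypothetical, simplified stand-ins.
type Schema struct {
	ID          string
	Description string
}

type FieldDiff struct{ Old, New Schema }

// Description blanks the description on both sides and reports whether
// nothing else differs. (Assumed logic for this sketch.)
func Description(diff FieldDiff) (bool, error) {
	diff.Old.Description, diff.New.Description = "", ""
	return diff.Old == diff.New, nil
}

func main() {
	cases := []struct {
		name string
		diff FieldDiff
	}{
		{"description only", FieldDiff{Schema{"foo", "old text"}, Schema{"foo", "new text"}}},
		{"ID and description", FieldDiff{Schema{"foo", "old text"}, Schema{"bar", "new text"}}},
	}
	for _, c := range cases {
		handled, err := Description(c.diff)
		fmt.Printf("%s: handled=%v err=%v\n", c.name, handled, err)
	}
	// Output:
	// description only: handled=true err=<nil>
	// ID and description: handled=false err=<nil>
}
```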

Contributor

@camilamacedo86 camilamacedo86 left a comment

Hi @anik120

Great work 🥇. I think we just need to address the comment: https://github.com/operator-framework/operator-controller/pull/2023/files#r2140938754

And also another nit.

// A change in a description is always considered safe and non-breaking.
func Description(diff FieldDiff) (bool, error) {

Note that the func name is Description.
So the comment (which is how we document Go code) should start with // Description ... Could we please change that and make sure it gives a better explanation of why and when the check is required?

Then, IMHO, it will be all great to 🪰

thank you for your contribution

@@ -242,3 +242,13 @@ func Type(diff FieldDiff) (bool, error) {

return isHandled(diff, reset), err
}

// A change in a description is always considered safe and non-breaking.
Contributor

Suggested change
// A change in a description is always considered safe and non-breaking.
// Description differences should be allowed. Otherwise, we will block upgrades due to changes in the description, which will be considered as a bug from the end user's perspective.

Contributor

@anik120 ^ I would suggest something
But we need to start the comment with Description.
We are not using the lint check that catches this, but it is better practice for documenting funcs.
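
For reference, Go doc comments conventionally open with the name of the item they document. A minimal sketch of a comment following that convention is shown here; the wording is illustrative rather than the text that was merged, and `FieldDiff` is a hypothetical simplified type.

```go
package main

import "fmt"

// FieldDiff is a hypothetical, simplified stand-in that only tracks the type
// and description of a field before and after an upgrade.
type FieldDiff struct {
	OldType, NewType               string
	OldDescription, NewDescription string
}

// Description reports whether diff is fully explained by a change to the
// field's description. Description-only changes are allowed so that upgrades
// are not blocked by documentation edits; any other difference leaves the
// diff unhandled.
func Description(diff FieldDiff) (bool, error) {
	return diff.OldType == diff.NewType, nil
}

func main() {
	handled, err := Description(FieldDiff{
		OldType: "string", NewType: "string",
		OldDescription: "old text", NewDescription: "new text",
	})
	fmt.Println(handled, err) // prints: true <nil>
}
```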

Contributor Author

That's a good point @camilamacedo86. I've changed the sentence around a little bit, PTAL, thank you!

@anik120 anik120 force-pushed the upgrade-safety-fix branch from f51b02f to c1f7279 Compare June 12, 2025 15:37
@anik120 anik120 force-pushed the upgrade-safety-fix branch from c1f7279 to eebe373 Compare June 12, 2025 15:38
Contributor

@camilamacedo86 camilamacedo86 left a comment

Thank you for the contribution 🎉
It seems fine to me, so I am /approve this one

That said, we should also get an OK from a second approver/reviewer

@perdasilva @tmshort @grokspawn WDYT?

@bentito @tylerslaton @rashmigottipati WDYT too?

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 12, 2025

openshift-ci bot commented Jun 12, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: camilamacedo86

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jun 12, 2025
@camilamacedo86
Contributor

camilamacedo86 commented Jun 12, 2025

/lgtm cancel

Should not add LGTM, only approve so that we give a chance for a second reviewer

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Jun 12, 2025
@perdasilva
Contributor

/lgtm

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jun 13, 2025
@openshift-merge-bot openshift-merge-bot bot merged commit 6bf1742 into operator-framework:main Jun 13, 2025
22 checks passed
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. lgtm Indicates that a PR is ready to be merged.

6 participants