OTA-541: enhancements/update/do-not-block-on-degraded: New enhancement proposal #1719
base: master
We observed another kind of upgrade blocker here. Applying the `infrastructures.config.openshift.io` manifest failed because the CRD had introduced new validations that required the API server to be upgraded to support them. Unfortunately, the upgrade didn't progress, and we had to step in manually and update the kube-apiserver to let the upgrade proceed. Is there a way to enhance these cases to at least let the API server upgrade before blocking?
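For illustration, the kind of schema addition involved might look like the hypothetical fragment below, assuming (this thread doesn't say) that the validation in question was a CEL rule via `x-kubernetes-validations`, a feature older kube-apiservers don't support; this is not the actual `infrastructures.config.openshift.io` change:

```yaml
# Hypothetical CRD schema fragment using a newer validation feature
# (a CEL rule via x-kubernetes-validations) that an older
# kube-apiserver may reject, failing the CVO's manifest apply.
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      x-kubernetes-validations:
      - rule: "self.platformSpec.type == oldSelf.platformSpec.type"
        message: platformSpec.type is immutable
```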
I've been trying to talk folks into the narrow `Degraded` handling pivot this enhancement currently covers since 2021. I accept that there may be other changes we could make to help updates go more smoothly, but I'd personally rather limit the scope of this enhancement to the `Degraded` handling.
Does this mean that, if no operator is unavailable, the upgrade should always complete?
ClusterOperators aren't the only CVO-manifested resources, and if something else breaks, like we fail to reconcile a RoleBinding or whatever, that will block further update progress. And for ClusterOperators, we'll still block on `status.versions` not being as far along as the manifest claims, in addition to blocking if `Available` isn't `True`. Personally, `status.versions` seems like the main thing that's relevant, e.g. a component coming after the Kube API server knows it can use 4.18 APIs if the Kube API server has declared 4.18 `versions`. As an example of what the 4.18 Kube API server asks the CVO to wait on, see the sketch at the end of this comment. A recent example of this being useful is openshift/machine-config-operator#4637, which got the CVO to block until the MCO had rolled out a single-arch -> multi-arch transition, without the MCO needing to touch its `Degraded` or `Available` conditions to slow the CVO down.
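A minimal sketch of such a release-payload manifest (the operator name, entry names, and version values here are illustrative placeholders, not the real 4.18 manifest contents):

```yaml
# Sketch of a ClusterOperator manifest in the release payload. The CVO
# blocks on this operator until the live ClusterOperator reports at
# least these declared versions in its status.
apiVersion: config.openshift.io/v1
kind: ClusterOperator
metadata:
  name: kube-apiserver
status:
  versions:
  - name: operator
    version: 4.18.0
  - name: kube-apiserver
    version: 1.31.0
```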
So could I say that, if `Failing=True` for an upgrade, the reason should not be `ClusterOperatorDegraded` only?
No, we'll still propagate `ClusterOperator(s)Degraded` through to `Failing`; it just will no longer block the update's progress. So if the only issue `Failing` is talking about is `ClusterOperator(s)Degraded`, we expect the update to be moving towards completion, and not stalling.
openshift/cluster-version-operator#482 is in flight with this change, if folks want to test pre-merge.
The enhancement and the tracking card OTA-541 are not targeted at a release. However, changes in the `dev-guide/cluster-version-operator/user/reconciliation.md` file suggest that the enhancement is targeted at the 4.19 release, and thus the `Test Plan` section should be addressed.
I'm not strongly opinionated on what the test plan looks like. We don't do a lot of intentional sad-path update testing in CI today, and I'm fuzzy on what QE does in that space that could be expanded into this new space (or maybe they already test pushing a ClusterOperator component to `Degraded=True` mid-update to see how the cluster handles that?).
+1, that's also what I want to explore during testing. I also had some other rough checkpoints in mind when I first read this enhancement doc, but I still need some input from @wking to help me tidy them up, for example #1719 (comment).
I asked this because there are already some `cv.conditions` checks in CI, and I'm thinking about whether we could update that logic to help catch issues once the feature is implemented.
I'm pretty new to the code base for cluster-authentication-operator, but scanning through the code, nothing stands out in this operator as concerning with this change.
An ack from @liouk or @ibihim would also be nice to have as an additional sanity check.
Checking internal org docs, the Auth team seems like they might be responsible for the `service-ca` ClusterOperator, in addition to this line's `authentication` ClusterOperator. Those maintainers may want to comment with something like "I'm a maintainer for `$CLUSTER_OPERATORS`, and I'm ok with this enhancement as it stands" or whatever, assuming they are ok making that assertion for the operators they maintain. Also fine if they want to say "I'm a maintainer for `$CLUSTER_OPERATORS`, and I'm not ok with this enhancement as it stands, because..." or whatever; I'm just trying to give folks a way to satisfy David's requested sign-off if they do happen to be on board.