[WIP] MCO-1669: add BootImageSkewEnforcement API #2357
@djoshy: This pull request references MCO-1669, which is a valid Jira issue. Warning: the referenced Jira issue has an invalid target version for the branch this PR targets: the story was expected to target version "4.20.0", but no target version was set.
@@ -56,8 +56,66 @@ type MachineConfigurationSpec struct {
// +openshift:enable:FeatureGate=NodeDisruptionPolicy
// +optional
NodeDisruptionPolicy NodeDisruptionPolicyConfig `json:"nodeDisruptionPolicy"`
// bootImageSkewEnforcement allows an admin to set the behavior of the boot image skew enforcement mechanism.
Why does an admin care about configuring this? What does configuring this allow them to achieve?
When boot images are out of skew, the cluster will very likely fail to scale. Our hope is that, when the MCO's skew enforcement mechanism detects that the boot image is out of skew, it will alert the cluster by disabling upgrades. The admin then has a few options to restore cluster upgrades:
- If they do not plan to scale their cluster, they can disable this mechanism completely by setting the API knob to `None`.
- If they do plan to scale, it could play out in two ways:
  - For platforms that support automatic boot image updates, the MCO will default this API knob to automatic and set the `ClusterBootImage` field to the current boot image of the cluster. Going forward, the boot image controller will update the `ClusterBootImage` field whenever a boot image update takes place. Ideally, this process is completely invisible to the admin and needs no manual action from their end.
  - For platforms that do not support automatic boot image updates, the MCO will default this API knob to manual and set the `ClusterBootImage` field to the current boot image of the cluster. The admin is expected to perform updates manually (via docs that we provide) and then update the `ClusterBootImage` field.
There are more details in the EP, but this is the gist.
This discussion does make me think: does it make sense to have a status version of the `BootImageSkewEnforcement` field? Perhaps that makes more sense to represent the default modes?
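To make the flow described above concrete, here is a minimal Go sketch of the defaulting and upgrade-gating logic. The type name `SkewEnforcementMode`, the helpers `defaultMode` and `upgradesBlocked`, and the `platformSupportsAutoUpdates` input are all hypothetical; they are not taken from openshift/api or the MCO, and the real controller may structure this differently.

```go
package main

import "fmt"

// SkewEnforcementMode is a hypothetical type; only the three mode values
// (Automatic, Manual, None) come from the discussion in this PR.
type SkewEnforcementMode string

const (
	ModeAutomatic SkewEnforcementMode = "Automatic"
	ModeManual    SkewEnforcementMode = "Manual"
	ModeNone      SkewEnforcementMode = "None"
)

// defaultMode picks the mode used when the admin has expressed no opinion.
// platformSupportsAutoUpdates is an assumed input, not a real MCO API.
func defaultMode(platformSupportsAutoUpdates bool) SkewEnforcementMode {
	if platformSupportsAutoUpdates {
		return ModeAutomatic
	}
	return ModeManual
}

// upgradesBlocked reports whether skew enforcement should hold back upgrades.
// With ModeNone the mechanism is disabled, so skew never blocks an upgrade.
func upgradesBlocked(mode SkewEnforcementMode, outOfSkew bool) bool {
	return mode != ModeNone && outOfSkew
}

func main() {
	mode := defaultMode(false)
	fmt.Println(mode, upgradesBlocked(mode, true))
	fmt.Println(ModeNone, upgradesBlocked(ModeNone, true))
}
```

The sketch only captures the decision table; detecting whether a boot image is actually out of skew is left out entirely.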
If end-users are going to be the ones configuring this, let's provide them with some of this additional context.
What do you think of something along these lines for the description of the field?
bootImageSkewEnforcement is an optional field that can be used to configure how version skew is enforced on the cluster.
When version skew is being enforced, cluster upgrades will be disabled until the version skew becomes acceptable for the release payload.
When omitted, ....
We should try to provide a clear explanation to end-users through the GoDoc why they may care about configuring this.
For the defaulting behavior, does this only happen if a user doesn't explicitly set this field?
How does one know if they are on a platform that supports manual vs automatic boot image updates?
If I as a user were to set the wrong one, what happens?
Also worth noting: because this field is not a pointer and doesn't have `omitempty`, omitting this field would result in the following serialization:
bootImageSkewEnforcement:
  mode: ""
which, according to the valid values of the `mode` field, would be invalid. You either need to allow the `""` value for `mode` OR make this a pointer with `omitempty` to distinguish between "not set" and the zero value here.
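The serialization difference can be reproduced with the standard library alone. The sketch below uses simplified stand-in types (`valueSpec`, `pointerSpec`, and a one-field `BootImageSkewEnforcementConfig`); they are illustrative and not the real openshift/api definitions.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BootImageSkewEnforcementConfig is a simplified stand-in; only the JSON
// tag behavior matters for this example.
type BootImageSkewEnforcementConfig struct {
	Mode string `json:"mode"`
}

// valueSpec holds the struct by value with no omitempty.
type valueSpec struct {
	BootImageSkewEnforcement BootImageSkewEnforcementConfig `json:"bootImageSkewEnforcement"`
}

// pointerSpec uses a pointer with omitempty, so "not set" (nil) is
// distinguishable from the zero value.
type pointerSpec struct {
	BootImageSkewEnforcement *BootImageSkewEnforcementConfig `json:"bootImageSkewEnforcement,omitempty"`
}

// marshal is a small helper that panics on error to keep the example short.
func marshal(v interface{}) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(marshal(valueSpec{}))   // {"bootImageSkewEnforcement":{"mode":""}}
	fmt.Println(marshal(pointerSpec{})) // {}
}
```

With the value field, an unset config still serializes an empty `mode`, which enum validation would reject; with the pointer, the key is absent altogether.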
> This discussion does make me think, does it make sense to have a status version of the BootImageSkewEnforcement field? Perhaps that makes more sense to represent the default modes?
I'm still wrapping my head around the "default" modes, but if I'm following this correctly, it does seem reasonable to me to have the cluster boot image state stored in a status condition, especially when set to `Automatic` and the MCO is keeping track of this.
In my experience, it can be a bit confusing when both a user and a controller are modifying the spec field of a resource, and I would generally advise against that.
> For the defaulting behavior, does this only happen if a user doesn't explicitly set this field?
Yes, the defaulting behavior will take place when the user hasn't specified anything. The user will always be able to override it.
> How does one know if they are on a platform that supports manual vs automatic boot image updates?
We plan to have this info available via documentation, but I want to note that we don't plan on GAing the skew enforcement API until we have implemented automatic boot image updates on a majority of the platforms that can support it. Based on our work so far, there will be a sizable number of platforms (or variants within platforms) that require manual intervention for boot image updates.
> If I as a user were to set the wrong one, what happens?
`Manual` and `None` can still be valid even when the platform supports automatic boot image updates, because the admin can still choose to be manually in control of their boot images. If the admin incorrectly sets `Automatic` for an unsupported scenario, the MCO could propagate an error.
> I'm still wrapping my head around the "default" modes, but if I'm following this correctly, it does seem reasonable to me to have the cluster boot image state stored in a status condition - especially when set to Automatic and the MCO is keeping track of this.
The `ManagedBootImages` field in this object operates in this manner with respect to defaulting behavior, and it has a mirrored field in Status. Let me try sketching out what that would look like.
> which according to the valid values of the mode field would be invalid. You either need to allow the "" value for the mode OR make this a pointer with omitempty to distinguish between "not set" and the zero value here.
I think I'll go with the pointer & `omitempty` route here, thanks!
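As a rough illustration of the "wrong mode" answer above, the following Go sketch accepts `Manual` and `None` everywhere and rejects `Automatic` only where the platform cannot support it. The function `validateRequestedMode`, the error value, and the `platformSupportsAutoUpdates` flag are hypothetical names introduced for this example; how the MCO would actually surface such an error is not specified in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// errAutomaticUnsupported is a hypothetical error for this sketch.
var errAutomaticUnsupported = errors.New("mode Automatic is not supported on this platform")

// validateRequestedMode checks an admin-supplied mode against an assumed
// platform capability flag.
func validateRequestedMode(mode string, platformSupportsAutoUpdates bool) error {
	switch mode {
	case "Manual", "None":
		// Always valid: the admin may choose to manage boot images manually
		// or to disable enforcement, regardless of platform support.
		return nil
	case "Automatic":
		if !platformSupportsAutoUpdates {
			return errAutomaticUnsupported
		}
		return nil
	default:
		return fmt.Errorf("unknown mode %q", mode)
	}
}

func main() {
	fmt.Println(validateRequestedMode("Manual", true))
	fmt.Println(validateRequestedMode("Automatic", false))
}
```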
Updated; let me know if the pointer+omitempty field looks right. I've also attempted to fix the union discriminator naming.
Thanks for the questions & review (sorry it took a while!); this should be ready for another look. Happy to hop on a call if that is easier. Update: Did another push to fix up some tests.
Leaving another handful of comments.
I'm also happy to hop on a call if you think it would be beneficial.
// clusterBootImage describes the current boot image of the cluster. This will be used to enforce the skew limit.
// This value will be compared against the cluster's skew limit to determine skew compliance.
// Required when mode is set to "Automatic" or "Manual" and forbidden otherwise.
// +optional
ClusterBootImage *ClusterBootImage `json:"clusterBootImage,omitempty"`
Usually, discriminated union members follow the name of their `mode` counterpart, i.e.:
mode: Automatic
automatic:
  ...
or
mode: Manual
manual:
  ...
Another thing I'm curious about now that I've got a bit more context: why do you want to require the `clusterBootImage` when set to `Automatic`?
Presumably, if the MCO is able to determine the cluster boot image by itself, should it just do that and perform the skew handling automatically?
If a user were to explicitly set `Automatic`, I imagine they want the MCO to handle all of that and likely don't have the cluster boot image information on hand. Whereas if they set `Manual`, they are explicitly stating that they want to manage that information manually, and I would expect them to have it on hand.
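The naming convention above could look roughly like this in Go. This is a sketch under assumptions: the type names `SkewEnforcement`, `AutomaticConfig`, and `ManualConfig`, the `OCPVersion` field, and the `validateUnion` helper are hypothetical, and the validation is written as plain Go rather than the declarative validation markers an API type would normally carry. Whether `automatic` should ever be required is exactly the open question in this comment, so the sketch leaves it optional.

```go
package main

import (
	"errors"
	"fmt"
)

// AutomaticConfig and ManualConfig are hypothetical union members, each
// named after its mode counterpart.
type AutomaticConfig struct{}

type ManualConfig struct {
	OCPVersion string `json:"ocpVersion"`
}

// SkewEnforcement is a hypothetical discriminated union: Mode is the
// discriminator and at most one member may be set.
type SkewEnforcement struct {
	Mode      string           `json:"mode"`
	Automatic *AutomaticConfig `json:"automatic,omitempty"`
	Manual    *ManualConfig    `json:"manual,omitempty"`
}

// validateUnion checks that the populated member matches the discriminator.
func validateUnion(s SkewEnforcement) error {
	switch s.Mode {
	case "Automatic":
		if s.Manual != nil {
			return errors.New("manual must not be set when mode is Automatic")
		}
	case "Manual":
		if s.Manual == nil {
			return errors.New("manual is required when mode is Manual")
		}
		if s.Automatic != nil {
			return errors.New("automatic must not be set when mode is Manual")
		}
	case "None":
		if s.Automatic != nil || s.Manual != nil {
			return errors.New("no union member may be set when mode is None")
		}
	default:
		return fmt.Errorf("unknown mode %q", s.Mode)
	}
	return nil
}

func main() {
	ok := SkewEnforcement{Mode: "Manual", Manual: &ManualConfig{OCPVersion: "4.18.2"}}
	bad := SkewEnforcement{Mode: "None", Manual: &ManualConfig{OCPVersion: "4.18.2"}}
	fmt.Println(validateUnion(ok))
	fmt.Println(validateUnion(bad))
}
```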
This is a good question. Determining the boot image version isn't straightforward and varies wildly per platform. There currently isn't a single source of truth in the cluster for the admin or the controllers to use, so I thought this would be a good way to represent that information in the API. Perhaps for the Automatic case, `clusterBootImage` makes more sense as a Status-only field, but for Manual we could have it in both `Spec` and `Status`?
Updated with this Spec/Status shape; `Automatic` only reports a version in the status. I'm envisioning something like the following examples.
On a cluster that defaults into Automatic mode (no admin opinion):
spec:
  ..
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
On a cluster that defaults into Manual mode (no admin opinion):
spec:
  ..
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
On a cluster where an admin explicitly sets Manual and performs updates:
spec:
  bootImageSkewEnforcement:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
status:
  bootImageSkewEnforcementStatus:
    mode: Manual
    manual:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
On a cluster where an admin disables this feature:
spec:
  bootImageSkewEnforcement:
    mode: None
status:
  bootImageSkewEnforcementStatus:
    mode: None
On a cluster where an admin explicitly sets Automatic:
spec:
  bootImageSkewEnforcement:
    mode: Automatic
status:
  bootImageSkewEnforcementStatus:
    mode: Automatic
    automatic:
      ocpVersion: "4.18.2"
      rhcosVersion: "9.6.20250523-1"
For this last case, I'm not entirely convinced that it needs to be supported. The user having the power to go to "Manual" and "None" via an explicit value makes sense.
Hmm, I guess a workflow to consider here would be a user going from Manual/None to Automatic mode; would deleting `spec.bootImageSkewEnforcement` be good UX in that case? If the MCO is able to determine automatically that the cluster can perform skew management in a hands-off fashion, it would default the status to Automatic (if spec is empty). Or would it be better to have an explicit `Automatic` setting in the spec?
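One way to read the Spec/Status examples above is as a single resolution function from spec to status. The Go sketch below is only an interpretation of those examples: the types `BootImageVersions`, `SkewSpec`, and `SkewStatus`, the function `resolveStatus`, and its `platformSupportsAutoUpdates` and `detected` inputs are all hypothetical, and validation of an unsupported `Automatic` request is deliberately left out.

```go
package main

import "fmt"

// BootImageVersions mirrors the ocpVersion/rhcosVersion pair in the examples.
type BootImageVersions struct {
	OCPVersion   string
	RHCOSVersion string
}

// SkewSpec and SkewStatus are hypothetical, trimmed-down shapes.
type SkewSpec struct {
	Mode   string
	Manual *BootImageVersions
}

type SkewStatus struct {
	Mode      string
	Automatic *BootImageVersions
	Manual    *BootImageVersions
}

// resolveStatus derives the status from an optional spec. A nil spec means
// the admin expressed no opinion, so the mode is defaulted per platform.
func resolveStatus(spec *SkewSpec, platformSupportsAutoUpdates bool, detected BootImageVersions) SkewStatus {
	mode := ""
	if spec != nil {
		mode = spec.Mode
	}
	if mode == "" {
		if platformSupportsAutoUpdates {
			mode = "Automatic"
		} else {
			mode = "Manual"
		}
	}
	switch mode {
	case "Automatic":
		// The controller owns the version; it appears in status only.
		return SkewStatus{Mode: mode, Automatic: &detected}
	case "Manual":
		if spec != nil && spec.Manual != nil {
			v := *spec.Manual // mirror the admin-provided value
			return SkewStatus{Mode: mode, Manual: &v}
		}
		return SkewStatus{Mode: mode, Manual: &detected}
	default:
		// None (or any other value): no version is recorded.
		return SkewStatus{Mode: mode}
	}
}

func main() {
	detected := BootImageVersions{OCPVersion: "4.18.2", RHCOSVersion: "9.6.20250523-1"}
	st := resolveStatus(nil, true, detected)
	fmt.Println(st.Mode, st.Automatic.OCPVersion)
}
```

Under this reading, deleting `spec.bootImageSkewEnforcement` returns the cluster to the platform default, which is one possible answer to the Manual/None to Automatic workflow question; an explicit `Automatic` value in the spec resolves to the same status.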
WIP boot image enforcement API, based on discussions from openshift/enhancements#1761