[WIP] MCO-1669: add BootImageSkewEnforcement API #2357

Open
wants to merge 1 commit into
base: master
from

@djoshy djoshy commented Jun 5, 2025

WIP boot image enforcement API, based on discussions from openshift/enhancements#1761

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 5, 2025

openshift-ci-robot commented Jun 5, 2025

@djoshy: This pull request references MCO-1669 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.

openshift-ci bot commented Jun 5, 2025

Hello @djoshy! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jun 5, 2025
@openshift-ci openshift-ci bot requested review from deads2k and everettraven June 5, 2025 16:51

openshift-ci bot commented Jun 5, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: djoshy
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@@ -56,8 +56,66 @@ type MachineConfigurationSpec struct {
// +openshift:enable:FeatureGate=NodeDisruptionPolicy
// +optional
NodeDisruptionPolicy NodeDisruptionPolicyConfig `json:"nodeDisruptionPolicy"`
// bootImageSkewEnforcement allows an admin to set the behavior of the boot image skew enforcement mechanism.
Contributor

Why does an admin care about configuring this? What does configuring this allow them to achieve?

Contributor Author

When boot images are out of skew, the cluster will very likely fail to scale. Our hope is that, when the MCO's skew enforcement mechanism detects that the boot image is out of skew, it will alert the cluster by disabling upgrades. So then the admin has a few options to restore cluster upgrades:

  • If they do not plan to scale their cluster, they could disable this mechanism completely by setting the API knob to None.
  • If they do plan to scale, it could play out in two ways:
    • For the platforms that support automatic boot image updates, the MCO will default this API knob to automatic and set the ClusterBootImage field to the current boot image of the cluster. Going forward, the boot image controller will then update the ClusterBootImage field when a boot image update takes place. Ideally, this process is completely invisible to the admin and needs no manual action on their end.
    • For the platforms that do not support automatic boot image updates, the MCO will default this API knob to manual and set the ClusterBootImage field to the current boot image of the cluster. The admin is expected to perform updates manually (via docs that we provide) and update the ClusterBootImage field.

There are more details in the EP, but this is the gist.
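The defaulting described above could be sketched roughly as follows. This is only an illustration, not the MCO's implementation: the `Mode` type, the `effectiveMode` helper, and the `platformSupportsAutoUpdates` flag are hypothetical names, and the capitalized mode values are assumed.

```go
package main

import "fmt"

// Mode is a hypothetical type for the skew enforcement mode.
type Mode string

const (
	ModeAutomatic Mode = "Automatic"
	ModeManual    Mode = "Manual"
	ModeNone      Mode = "None"
)

// effectiveMode returns the admin's explicit choice when one is present.
// Otherwise it defaults based on whether the platform supports automatic
// boot image updates (an assumed input, not a real MCO API).
func effectiveMode(explicit *Mode, platformSupportsAutoUpdates bool) Mode {
	if explicit != nil {
		return *explicit
	}
	if platformSupportsAutoUpdates {
		return ModeAutomatic
	}
	return ModeManual
}

func main() {
	none := ModeNone
	fmt.Println(effectiveMode(nil, true))   // no admin opinion, supported platform
	fmt.Println(effectiveMode(nil, false))  // no admin opinion, unsupported platform
	fmt.Println(effectiveMode(&none, true)) // admin opts out explicitly
}
```

An explicit value always wins in this sketch, which matches the intent that the admin can opt out with None regardless of platform.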

This discussion does make me think, does it make sense to have a status version of the BootImageSkewEnforcement field? Perhaps that makes more sense to represent the default modes?

Contributor

If end-users are going to be the ones configuring this, let's provide them some of this additional context.

What do you think of something along the lines of this for the description of the field?:

bootImageSkewEnforcement is an optional field that can be used to configure how version skew is enforced on the cluster.
When version skew is being enforced, cluster upgrades will be disabled until the version skew becomes acceptable for the release payload.
When omitted, ....

We should try to provide a clear explanation to end-users through the GoDoc why they may care about configuring this.

For the defaulting behavior, does this only happen if a user doesn't explicitly set this field?
How does one know if they are on a platform that supports manual vs automatic boot image updates?
If I as a user were to set the wrong one, what happens?

Also worth noting that because this field is not a pointer and doesn't have omitempty, omitting this field would result in the following serialization:

bootImageSkewEnforcement:
  mode: ""

which according to the valid values of the mode field would be invalid. You either need to allow the "" value for the mode OR make this a pointer with omitempty to distinguish between "not set" and the zero value here.
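A small Go program can make the difference concrete. It is a sketch with stand-in types (`SkewEnforcement`, `SpecByValue`, `SpecByPointer`, and `mustJSON` are hypothetical names, not the PR's types) and it prints JSON rather than YAML, but the same `encoding/json` tag rules apply.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// SkewEnforcement is a hypothetical stand-in for the PR's config type.
type SkewEnforcement struct {
	Mode string `json:"mode"`
}

// SpecByValue holds the struct by value with no omitempty.
type SpecByValue struct {
	BootImageSkewEnforcement SkewEnforcement `json:"bootImageSkewEnforcement"`
}

// SpecByPointer uses a pointer plus omitempty, so "not set" stays distinct
// from the zero value.
type SpecByPointer struct {
	BootImageSkewEnforcement *SkewEnforcement `json:"bootImageSkewEnforcement,omitempty"`
}

// mustJSON marshals v and panics on error; fine for a demo.
func mustJSON(v any) string {
	b, err := json.Marshal(v)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	fmt.Println(mustJSON(SpecByValue{}))   // {"bootImageSkewEnforcement":{"mode":""}}
	fmt.Println(mustJSON(SpecByPointer{})) // {}
}
```

With the pointer form, an unset field disappears from the serialized object, so an empty mode never reaches validation.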

Contributor

This discussion does make me think, does it make sense to have a status version of the BootImageSkewEnforcement field? Perhaps that makes more sense to represent the default modes?

I'm still wrapping my head around the "default" modes, but if I'm following this correctly, it does seem reasonable to me to have the cluster boot image state stored in a status condition - especially when set to Automatic and the MCO is keeping track of this.

In my experience, it can be a bit confusing when both a user and controller are modifying the spec field of a resource and I would generally advise against that.

Contributor Author

@djoshy djoshy Jul 17, 2025

For the defaulting behavior, does this only happen if a user doesn't explicitly set this field?

Yes, the defaulting behavior will take place when the user hasn't specified anything. The user will always be able to override it.

How does one know if they are on a platform that supports manual vs automatic boot image updates?

We plan to have this info available via documentation, but I want to note that we don't plan on GAing the skew enforcement API until we have implemented automatic boot image updates on a majority of the platforms that can support it. Based on our work so far, there are going to be a sizable number of platforms (or variants within platforms) that will require manual intervention for boot image updates.

If I as a user were to set the wrong one, what happens?

Manual and None can still be valid even when the platform supports automatic boot image updates because the admin can still choose to be manually in control of their boot images. If the admin incorrectly sets Automatic for an unsupported scenario, the MCO could propagate an error.
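As a rough sketch of that rule (an assumption about how such a check might look, not the MCO's actual code), a validation helper could accept Manual and None everywhere and reject Automatic only where automatic updates are unavailable. The names `validateExplicitMode`, `errAutomaticUnsupported`, and `platformSupportsAutoUpdates` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errAutomaticUnsupported is a hypothetical error the controller could surface.
var errAutomaticUnsupported = errors.New("Automatic mode is not supported on this platform")

// validateExplicitMode checks an admin-provided mode against platform support.
// Manual and None are always accepted; Automatic needs platform support.
func validateExplicitMode(mode string, platformSupportsAutoUpdates bool) error {
	switch mode {
	case "Manual", "None":
		return nil
	case "Automatic":
		if !platformSupportsAutoUpdates {
			return errAutomaticUnsupported
		}
		return nil
	default:
		return fmt.Errorf("unknown mode %q", mode)
	}
}

func main() {
	fmt.Println(validateExplicitMode("Manual", true))     // accepted on any platform
	fmt.Println(validateExplicitMode("Automatic", false)) // rejected: no automatic updates
}
```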

I'm still wrapping my head around the "default" modes, but if I'm following this correctly, it does seem reasonable to me to have the cluster boot image state stored in a status condition - especially when set to Automatic and the MCO is keeping track of this.

The ManagedBootImages field in this object operates in this manner wrt defaulting behavior, and it has a mirrored field in Status. Let me try sketching out what that would look like.

which according to the valid values of the mode field would be invalid. You either need to allow the "" value for the mode OR make this a pointer with omitempty to distinguish between "not set" and the zero value here.

I think I'll go with pointer & omitempty route here, thanks!

Contributor Author

Updated, let me know if the pointer + omitempty field looks right. I've also attempted to fix the union discriminator naming.

@djoshy djoshy force-pushed the skew-enforcement branch from eee6809 to 54938cf Compare July 14, 2025 20:21

djoshy commented Jul 14, 2025

Thanks for the questions & review (sorry it took a while!), this should be ready for another look. Happy to hop on a call if that is easier.

Update: Did another push to fix up some tests.

@djoshy djoshy force-pushed the skew-enforcement branch from 54938cf to 0344f1e Compare July 15, 2025 14:58

@everettraven everettraven left a comment

Leaving another handful of comments.

I'm also happy to hop on a call if you think it would be beneficial.

Comment on lines 81 to 85
// clusterBootImage describes the current boot image of the cluster. This will be used to enforce the skew limit.
// This value will be compared against the cluster's skew limit to determine skew compliance.
// Required when mode is set to "Automatic" or "Manual" and forbidden otherwise.
// +optional
ClusterBootImage *ClusterBootImage `json:"clusterBootImage,omitempty"`
Contributor

Usually, discriminated union members will follow the name of their mode counterpart, e.g.:

mode: Automatic
automatic:
  ...

or

mode: Manual
manual:
  ...
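In Go, that naming convention might look roughly like the sketch below. The type and field names are illustrative rather than the PR's final shape, the `BootImage` payload is hypothetical, and the union markers in the comments reflect my understanding of the convention rather than a verified listing. The `Validate` method hand-writes the "member matches the discriminator" rule that CRD validation would normally enforce.

```go
package main

import (
	"errors"
	"fmt"
)

// Mode is the union discriminator. All names here are illustrative.
type Mode string

const (
	ModeAutomatic Mode = "Automatic"
	ModeManual    Mode = "Manual"
	ModeNone      Mode = "None"
)

// BootImage is a hypothetical payload shared by both union members.
type BootImage struct {
	OCPVersion string `json:"ocpVersion,omitempty"`
}

// SkewEnforcementConfig is a discriminated union: each member field is named
// after the mode value that selects it.
// +union
type SkewEnforcementConfig struct {
	// +unionDiscriminator
	// +required
	Mode Mode `json:"mode"`
	// +unionMember
	// +optional
	Automatic *BootImage `json:"automatic,omitempty"`
	// +unionMember
	// +optional
	Manual *BootImage `json:"manual,omitempty"`
}

// Validate requires the member that matches Mode and forbids the others.
func (c SkewEnforcementConfig) Validate() error {
	switch c.Mode {
	case ModeAutomatic:
		if c.Automatic == nil || c.Manual != nil {
			return errors.New("mode Automatic requires automatic and forbids manual")
		}
	case ModeManual:
		if c.Manual == nil || c.Automatic != nil {
			return errors.New("mode Manual requires manual and forbids automatic")
		}
	case ModeNone:
		if c.Automatic != nil || c.Manual != nil {
			return errors.New("mode None forbids automatic and manual")
		}
	default:
		return fmt.Errorf("unsupported mode %q", c.Mode)
	}
	return nil
}

func main() {
	ok := SkewEnforcementConfig{Mode: ModeManual, Manual: &BootImage{OCPVersion: "4.18.2"}}
	fmt.Println(ok.Validate())
	bad := SkewEnforcementConfig{Mode: ModeNone, Manual: &BootImage{}}
	fmt.Println(bad.Validate())
}
```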

Another thing I'm curious about now that I've got a bit more context - why do you want to require the clusterBootImage when set to Automatic?

Presumably, if the MCO is able to determine the cluster boot image by itself, shouldn't it just do so and perform the skew handling automatically?

If a user were to explicitly set Automatic, I imagine they are wanting to have MCO handle all of that and that they likely don't have the cluster boot image information on hand. Whereas if they set Manual they are explicitly stating they want to manually manage that information and I would expect them to have it on hand.

Contributor Author

This is a good question. The determination of boot image version isn't straightforward and varies wildly per platform. There currently isn't a single source of truth for the admin or the controllers to use in the cluster. So I thought this would be a good way to represent that information in the API. Perhaps for the Automatic case, clusterBootImage makes more sense as a Status-only field, but for Manual we could have it in Spec and Status?

Contributor Author

Updated with this Spec/Status shape, but Automatic only specifies a version in the status. So I'm envisioning something like the following examples.

On a cluster that defaults into Automatic mode (no admin opinion):

        spec:
        ..
        status:
          bootImageSkewEnforcementStatus:
            mode: Automatic
            automatic:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

On a cluster that defaults into Manual mode (no admin opinion):

        spec:
        ..
        status:
          bootImageSkewEnforcementStatus:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"

On a cluster that an admin explicitly sets to Manual, and performs updates:

        spec:
          bootImageSkewEnforcement:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"
        status:
          bootImageSkewEnforcementStatus:
            mode: Manual
            manual:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

On a cluster where an admin disables this feature:

        spec:
          bootImageSkewEnforcement:
            mode: None
        status:
          bootImageSkewEnforcementStatus:
            mode: None

On a cluster that an admin explicitly sets to Automatic:

        spec:
          bootImageSkewEnforcement:
            mode: Automatic
        status:
          bootImageSkewEnforcementStatus:
            mode: Automatic
            automatic:
              ocpVersion: "4.18.2"
              rhcosVersion: "9.6.20250523-1"

For this last case, I'm not entirely convinced that it needs to be supported. The user having the power to go to "Manual" and "None" via an explicit value makes sense.

Hmm, I guess a workflow to consider for this would be a user going from Manual/None to Automatic mode; would deleting spec.bootImageSkewEnforcement be good UX in that case? If the MCO is able to automatically determine that the cluster can perform skew management in a hands-off fashion, it would default the status to Automatic (if spec is empty). Or would it be better to have an explicit Automatic setting in the spec?

@djoshy djoshy force-pushed the skew-enforcement branch from 0344f1e to 5816e99 Compare July 17, 2025 18:46

openshift-ci bot commented Jul 17, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name                         Commit   Details  Required  Rerun command
ci/prow/minor-e2e-upgrade-minor   5816e99  link     true      /test minor-e2e-upgrade-minor
ci/prow/lint                      5816e99  link     true      /test lint
ci/prow/e2e-aws-serial-2of2       5816e99  link     true      /test e2e-aws-serial-2of2
