MCO-1805: MCO-1806: Add ManagedBootImagesCPMS feature gate & CPMS type to ManagedBootImages API #2396


Open: wants to merge 3 commits into base: master


@djoshy djoshy commented Jul 8, 2025

This PR adds:

  • A new feature gate for MCO-1007, whose goal is to support boot image updates for ControlPlaneMachineSets (CPMS).
  • A new MachineManagerMachineSetsResourceType enum value for CPMS, so that they can be opted in to updates via the MachineConfiguration API object.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jul 8, 2025

openshift-ci-robot commented Jul 8, 2025

@djoshy: This pull request references MCO-1805 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.20.0" version, but no target version was set.



openshift-ci bot commented Jul 8, 2025

Hello @djoshy! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/M Denotes a PR that changes 30-99 lines, ignoring generated files. label Jul 8, 2025
@openshift-ci openshift-ci bot requested review from everettraven and JoelSpeed July 8, 2025 16:11

openshift-ci bot commented Jul 8, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: djoshy
Once this PR has been reviewed and has the lgtm label, please assign deads2k for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

type MachineManagerMachineSetsResourceType string

const (
	// MachineSets represent the MachineSet resource type, which manage a group of machines and belong to the Openshift machine API group.
	MachineSets MachineManagerMachineSetsResourceType = "machinesets"
	// ControlPlaneMachineSets represent the ControlPlaneMachineSets resource type, which manage a group of control-plane machines and belong to the Openshift machine API group.
	ControlPlaneMachineSets MachineManagerMachineSetsResourceType = "controlplanemachinesets"
Contributor Author

Question: Is there a way to allow this enum value only when the feature gate is enabled?

Contributor

Yes there is. You would change the usage of +kubebuilder:validation:Enum to the following:

  • +openshift:validation:FeatureGateAwareEnum:featureGate="",enum="machinesets"
  • +openshift:validation:FeatureGateAwareEnum:featureGate="ManagedBootImagesCPMS",enum="machinesets";"controlplanemachinesets"
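Applied to the enum from the diff, the suggestion might look like the sketch below. The two marker lines are quoted from the comment above; their placement directly on the type declaration is an assumption, and `allowedResources` is a hypothetical helper that only restates in plain Go what the markers express. The Go compiler treats the markers as ordinary comments.

```go
package main

import "fmt"

// MachineManagerMachineSetsResourceType is the enum from the diff under review.
// The marker text below is quoted from the review comment; placing it on the
// type declaration is an assumption made for this sketch.
// +openshift:validation:FeatureGateAwareEnum:featureGate="",enum="machinesets"
// +openshift:validation:FeatureGateAwareEnum:featureGate="ManagedBootImagesCPMS",enum="machinesets";"controlplanemachinesets"
type MachineManagerMachineSetsResourceType string

const (
	MachineSets             MachineManagerMachineSetsResourceType = "machinesets"
	ControlPlaneMachineSets MachineManagerMachineSetsResourceType = "controlplanemachinesets"
)

// allowedResources is a hypothetical helper: it returns the enum values that
// the two markers would permit with the ManagedBootImagesCPMS gate off or on.
func allowedResources(cpmsGateEnabled bool) []MachineManagerMachineSetsResourceType {
	if cpmsGateEnabled {
		return []MachineManagerMachineSetsResourceType{MachineSets, ControlPlaneMachineSets}
	}
	return []MachineManagerMachineSetsResourceType{MachineSets}
}

func main() {
	fmt.Println(allowedResources(false)) // gate disabled
	fmt.Println(allowedResources(true))  // gate enabled
}
```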

Contributor Author

Thanks, I've updated the PR. PTAL when you get a chance (:

@djoshy djoshy force-pushed the add-cpms-boot-image-updates branch from 18ab992 to 41adfe1 Compare July 8, 2025 21:21
@openshift-ci openshift-ci bot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/M Denotes a PR that changes 30-99 lines, ignoring generated files. labels Jul 8, 2025

openshift-ci bot commented Jul 9, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-aws-serial-techpreview-2of2 | 41adfe1 | link | true | /test e2e-aws-serial-techpreview-2of2 |
| ci/prow/e2e-aws-ovn-techpreview | 41adfe1 | link | true | /test e2e-aws-ovn-techpreview |
| ci/prow/integration | 41adfe1 | link | true | /test integration |


	reportProblemsToJiraComponent("MachineConfigOperator").
	contactPerson("djoshy").
	productScope(ocpSpecific).
	enhancementPR("https://github.com/openshift/enhancements/pull/1761").
Contributor

The linked EP explicitly calls out not targeting CPMS. Has there been design discussion of the impacts of enabling boot image updates on CPMS?

Contributor Author

Ah, yes, I can update that; I will open a PR for it so the reference here can be corrected. Service Delivery folks asked us to bump the priority for this. We had initially slated this for TechPreview in 4.21, but some recent developments pushed Azure to 4.21, so I decided to pull this into 4.20. Since CPMS do not use marketplace AMIs/images, this should hopefully just reuse a lot of the existing implementation for GCP/AWS management.

Contributor

I'm sure that from your side it's easy to get the CPMS updated, but there's a big difference between CPMS and MachineSets that needs to be discussed: updating the CPMS could trigger a complete control plane replacement, which may be undesirable depending on when it happens, or in some cases undesirable at all. I think this is worth bringing to an architecture call, perhaps with some opinionated SD folks along.

Contributor Author

Thanks for the context, I agree with your concerns. I will be happy to bring it to the next arch call.

Contributor

I've gone into more detail on the issue you linked, hoping to trigger some discussion with SD. Let's see if they respond.

Contributor Author

ack, thanks!

kind: MachineConfiguration
spec:
  managedBootImages:
    machineManagers: []
Contributor

Why do we allow this?

Contributor Author

I think our original idea was that it would improve discovery: #1672 (comment)

Currently, it is used to explicitly disable updates in 4.18, so an auto opt-in does not take place on upgrade to 4.19.

Contributor

So you are making a distinction now between omitted and the empty list? The API wasn't designed with this in mind and I'm not sure how you'd actually be doing that?

Contributor Author

> The API wasn't designed with this in mind and I'm not sure how you'd actually be doing that?

Yeah, it's not pretty 😓 and it is only meant as a stop-gap for < 4.18, since we have an explicit None option in 4.19+. I check whether the list exists in the spec; if it is omitted, the list object is nil and the MCO treats that as no opinion.
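The check described here can be sketched roughly as follows. `MachineManager` is a trimmed-down, hypothetical stand-in for the real API type, and `interpret` is an illustrative name, not the MCO's actual function; the sketch only shows how a nil slice and an empty slice can be told apart in memory.

```go
package main

import "fmt"

// MachineManager is a hypothetical, simplified stand-in for the API type.
type MachineManager struct {
	Resource string
}

// interpret distinguishes the three in-memory states discussed in the thread:
// nil (field omitted), empty (explicit []), and a populated list.
func interpret(managers []MachineManager) string {
	switch {
	case managers == nil:
		return "no opinion"
	case len(managers) == 0:
		return "updates disabled"
	default:
		return "opted in"
	}
}

func main() {
	fmt.Println(interpret(nil))
	fmt.Println(interpret([]MachineManager{}))
	fmt.Println(interpret([]MachineManager{{Resource: "machinesets"}}))
}
```

Whether the nil and empty states survive serialization is a separate question from this in-memory check.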

Contributor

What you are trying to achieve doesn't work. Go decoding/encoding won't tell the difference between a persisted [] and the field being omitted completely. Take a look at the output of https://go.dev/play/p/xEYwvCwxqB3.

If you wanted to be able to tell the difference between those two states, you'd need the list to be a pointer (*[]T).

As soon as a structured client writes to the object after the user has persisted [], it will be stripped away again.
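A minimal sketch of the round trip being described, using only `encoding/json`. The struct names are hypothetical and the `omitempty` tag is assumed for this illustration; it shows a decode followed by a re-encode dropping an explicit `[]`, and the pointer-to-slice form preserving it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// withOmitEmpty is a hypothetical spec whose list carries omitempty.
type withOmitEmpty struct {
	MachineManagers []string `json:"machineManagers,omitempty"`
}

// withPointer uses a pointer to the slice (*[]T), as suggested above.
type withPointer struct {
	MachineManagers *[]string `json:"machineManagers,omitempty"`
}

// roundTrip decodes the input into v and re-encodes v, as a structured
// client would when it reads and then writes an object.
func roundTrip(in string, v interface{}) string {
	if err := json.Unmarshal([]byte(in), v); err != nil {
		return "error: " + err.Error()
	}
	out, err := json.Marshal(v)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(roundTrip(`{"machineManagers":[]}`, &withOmitEmpty{})) // {}
	fmt.Println(roundTrip(`{}`, &withOmitEmpty{}))                     // {}
	fmt.Println(roundTrip(`{"machineManagers":[]}`, &withPointer{}))   // {"machineManagers":[]}
	fmt.Println(roundTrip(`{}`, &withPointer{}))                       // {}
}
```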

Contributor

@JoelSpeed JoelSpeed Jul 10, 2025

Oh wait, we don't have omitempty... that changes it slightly, but damn that is sketchy and fragile 👀 This is not a behaviour I would be comfortable relying on. Kubernetes doesn't have a concept of pointers, it has present, or not present. Lists have size generally, and we should not assume an empty list round trips.

Sketchy playground
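For comparison, here is a sketch of the same kind of round trip when the JSON tag has no `omitempty`. The struct is hypothetical and this exercises only Go's `encoding/json`, not the API server. The empty list does survive, but an omitted field comes back as an explicit `null`, which hints at why relying on the distinction is fragile.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// noOmitEmpty is a hypothetical spec whose list has no omitempty tag.
type noOmitEmpty struct {
	MachineManagers []string `json:"machineManagers"`
}

// reencode decodes the input into a fresh struct and encodes it again.
func reencode(in string) string {
	var spec noOmitEmpty
	if err := json.Unmarshal([]byte(in), &spec); err != nil {
		return "error: " + err.Error()
	}
	out, err := json.Marshal(spec)
	if err != nil {
		return "error: " + err.Error()
	}
	return string(out)
}

func main() {
	fmt.Println(reencode(`{}`))                     // {"machineManagers":null}
	fmt.Println(reencode(`{"machineManagers":[]}`)) // {"machineManagers":[]}
}
```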

Contributor Author

Understood. Given that the empty-list method is now in use by 4.18 ROSA/Managed clusters, what would you suggest as the path forward here? Our 4.18 docs recommend the empty list for disabling prior to an upgrade, and the 4.19 docs recommend the None option. Should we do some sort of migration?

operatorLogLevel: Normal
managedBootImages:
  machineManagers:
  - resource: controlplanemachinesets
Contributor

CPMS is a singleton within the cluster. Perhaps we should validate that a specific selection (All?) is required when this value is controlplanemachinesets?

Contributor Author

Oh, interesting, I did not know that! Yes, I can update the validation here. It will also simplify the reconciliation loop in the MCO controller.
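The rule under discussion could be sketched as below. The types, the `SelectionMode` field, and the `"All"` value are simplified assumptions for illustration rather than the real API definitions, and whether other modes such as None should also be accepted for CPMS is left open by the thread.

```go
package main

import (
	"errors"
	"fmt"
)

// MachineManager is a hypothetical, simplified stand-in for the API type;
// the field names here are assumptions made for this sketch.
type MachineManager struct {
	Resource      string
	SelectionMode string
}

// validate enforces the rule floated in the review: because CPMS is a
// singleton, a controlplanemachinesets entry must select "All".
func validate(m MachineManager) error {
	if m.Resource == "controlplanemachinesets" && m.SelectionMode != "All" {
		return errors.New(`selection mode must be "All" when resource is controlplanemachinesets`)
	}
	return nil
}

func main() {
	fmt.Println(validate(MachineManager{Resource: "controlplanemachinesets", SelectionMode: "All"}))
	fmt.Println(validate(MachineManager{Resource: "controlplanemachinesets", SelectionMode: "Partial"}))
	fmt.Println(validate(MachineManager{Resource: "machinesets", SelectionMode: "Partial"}))
}
```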

Labels: jira/valid-reference, size/XXL