Skip to content

[release-4.19] MCO-1720: Promote MachineConfigNode feature gate to default #2376

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 1 commit into
base: release-4.19
Choose a base branch
from

Conversation

isabella-janssen
Copy link
Member

This promotes the MachineConfigNode feature gate to Default in the 4.19 brach.

@openshift-ci-robot
Copy link

openshift-ci-robot commented Jun 20, 2025

@isabella-janssen: This pull request references MCO-1720 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the story to target the "4.19.0" version, but no target version was set.

In response to this:

This promotes the MachineConfigNode feature gate to Default in the 4.19 brach.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot added the jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. label Jun 20, 2025
Copy link
Contributor

openshift-ci bot commented Jun 20, 2025

Hello @isabella-janssen! Some important instructions when contributing to openshift/api:
API design plays an important part in the user experience of OpenShift and as such API PRs are subject to a high level of scrutiny to ensure they follow our best practices. If you haven't already done so, please review the OpenShift API Conventions and ensure that your proposed changes are compliant. Following these conventions will help expedite the api review process for your PR.

@openshift-ci openshift-ci bot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jun 20, 2025
@openshift-ci openshift-ci bot requested review from JoelSpeed and sinnykumari June 20, 2025 12:43
Copy link
Contributor

openshift-ci bot commented Jun 20, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: isabella-janssen
Once this PR has been reviewed and has the lgtm label, please assign joelspeed for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@isabella-janssen
Copy link
Member Author

/retest-required

@JoelSpeed
Copy link
Contributor

@isabella-janssen We don't generally backport feature promotions post GA unless there is a very very good reason to do so, what's the motivation here?

@oourfali
Copy link

@JoelSpeed Hi! Due to it finalizing in the last minute, the aggreement was to push it via 4.19 z-stream.
Motivation-wise we have several aspects:

  1. Appliance - we have a customer interested in getting a support exception for the appliance, which leverages the pinned image set functionality. in order to allow the cluster to get upgraded, we'll need this API GA'ed
  2. OVE - It is consuming the appliance as well.
  3. There are additional use-cases being discussed at the moment

Overall, the api does exist for a while, and is a mature one. I hope the above clarifies things, but when we saw that we missed 4.19 the agreed plan was to backport.

@isabella-janssen
Copy link
Member Author

/retest-required

@JoelSpeed
Copy link
Contributor

Do we have the associated SBAR for this feature backport that we can link here?

@isabella-janssen
Copy link
Member Author

/hold

Holding while we create an SBAR for this feature & PinnedImage sets.

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jun 24, 2025
@isabella-janssen
Copy link
Member Author

/retest-required

3 similar comments
@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

/test verify-feature-promotion

@isabella-janssen
Copy link
Member Author

/retest-required

2 similar comments
@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

/test verify-feature-promotion

@isabella-janssen
Copy link
Member Author

/retest-required

@JoelSpeed
Copy link
Contributor

As far as I can tell, some tests have been renamed recetly and those haven't timed out yet.

The only test I can see which doesn't pass enough is [Suite:openshift/machine-config-operator/disruptive][Suite:openshift/conformance/serial][sig-mco][OCPFeatureGate:PinnedImages][OCPFeatureGate:MachineConfigNodes][Serial] Invalid PIS leads to degraded MCN in a standard Pool [apigroup:machineconfiguration.openshift.io] on ha metal amd64 ipv4.

If we can get a few more runs there to see if it comes above 95% I think we are good to promote this

@isabella-janssen
Copy link
Member Author

/retest-required

@isabella-janssen
Copy link
Member Author

As far as I can tell, some tests have been renamed recetly and those haven't timed out yet.

That's correct, they were updated June 26th (openshift/origin#29918).

The only test I can see which doesn't pass enough is [Suite:openshift/machine-config-operator/disruptive][Suite:openshift/conformance/serial][sig-mco][OCPFeatureGate:PinnedImages][OCPFeatureGate:MachineConfigNodes][Serial] Invalid PIS leads to degraded MCN in a standard Pool [apigroup:machineconfiguration.openshift.io] on ha metal amd64 ipv4.

I have some manually triggered runs for that running now, hopefully we can get the passing rate up soon.

Related SBAR can be found here.

Thanks for the continued patience and help throughout this effort @JoelSpeed!

@JoelSpeed
Copy link
Contributor

/test verify-feature-promotion

1 similar comment
@isabella-janssen
Copy link
Member Author

/test verify-feature-promotion

@isabella-janssen
Copy link
Member Author

/test verify-feature-promotion

1 similar comment
@isabella-janssen
Copy link
Member Author

/test verify-feature-promotion

@isabella-janssen
Copy link
Member Author

/unhold

@JoelSpeed It looks like the [Suite:openshift/machine-config-operator/disruptive][Suite:openshift/conformance/serial][sig-mco][OCPFeatureGate:PinnedImages][OCPFeatureGate:MachineConfigNodes][Serial] Invalid PIS leads to degraded MCN in a standard Pool [apigroup:machineconfiguration.openshift.io] test has a good pass rate now.

Two tests, [Suite:openshift/machine-config-operator/disruptive][sig-mco][OCPFeatureGate:MachineConfigNodes] [Serial][Slow]Should properly report MCN conditions on node degrade [apigroup:machineconfiguration.openshift.io] and [Suite:openshift/machine-config-operator/disruptive][sig-mco][OCPFeatureGate:MachineConfigNodes] [Serial][Slow]Should properly create and remove MCN on node creation and deletion [apigroup:machineconfiguration.openshift.io] are new and are not being used to prove component readiness for the MCN feature gate. cc @yuqi-zhang

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jul 9, 2025
@JoelSpeed
Copy link
Contributor

are not being used to prove component readiness for the MCN feature gate

How did we come to this conclusion? I would have expected all tests to contribute, whether they are new or not? Or are these tests specifically marked up in a way that they always pass? We have a method for marking tests as flakes so that they do not impact the initial signal, are you doing that?

@isabella-janssen
Copy link
Member Author

isabella-janssen commented Jul 9, 2025

How did we come to this conclusion? I would have expected all tests to contribute, whether they are new or not?

When originally working to GA MachineConfigNodes & PinnedImageSets in 4.19, we decided to use 11 tests to prove component readiness for the features, 6 specific to MCN & 5 for PIS, and the tests being used for MCN's component readiness were documented in Jira. We used those 11 tests to prove component readiness in lifting the feature gates in 4.20 and planned to use the same tests to support the 4.19 backport. Those tests now seem to have reached the required number of runs and pass rates.

The Should properly report MCN conditions on node degrade and Should properly create and remove MCN on node creation and deletion tests started running recently as part of separate MCO work (MCO-1652) and will contribute to component readiness going forward, but at this time. This decision was discussed with @yuqi-zhang & @craychee.

Or are these tests specifically marked up in a way that they always pass?

No, the tests pass / fail based on whether they perform as expected or not.

We have a method for marking tests as flakes so that they do not impact the initial signal, are you doing that?

We are not currently doing this and I am not aware of any plans to do this.

@yuqi-zhang
Copy link
Contributor

yuqi-zhang commented Jul 9, 2025

How did we come to this conclusion? I would have expected all tests to contribute, whether they are new or not? Or are these tests specifically marked up in a way that they always pass? We have a method for marking tests as flakes so that they do not impact the initial signal, are you doing that?

Perhaps a better way to frame it is that these tests were introduced after the fact as additional checks for the features, since no test suite exercised them until recently, so we would like to not consider them as part of the requirement yet (the original 5-tests-per-FG is still being met, with these tests we would have 8 for MCN FG)

Put another way, if we were to for some reason add a new test for a feature every week as we expand on testing, does that mean that we'd never satisfy the requirements as there would always be a test not reaching enough # of runs to meet criteria?

As for the test pass rate, they also both have 100% pass rate on 4.20, just around 10 runs each (as opposed to a few hundred for all the other tests), and in 4.19 has only 4 and 7 runs, and thus isn't matching graduation criteria by themselves, so we will use them to track component readiness, but not use them to lift the FG if possible.

@isabella-janssen
Copy link
Member Author

@JoelSpeed do you have any followup questions or concerns based on the justification from myself or @yuqi-zhang?

@JoelSpeed
Copy link
Contributor

/test verify-feature-promotion

Copy link
Contributor

openshift-ci bot commented Jul 14, 2025

@isabella-janssen: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/verify-feature-promotion 75435dd link true /test verify-feature-promotion

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@JoelSpeed
Copy link
Contributor

Put another way, if we were to for some reason add a new test for a feature every week as we expand on testing, does that mean that we'd never satisfy the requirements as there would always be a test not reaching enough # of runs to meet criteria?

I'll be honest, this is possibly the first time this has come up. Generally by the time feature promotion happens, the feature is complete, and the testing is complete (or mostly complete) at this time. Given that these features were promoted through the SBAR process, this is a bit of a wrinkle that we haven't crossed before.

Looking at the results as of today, the new tests do have some failures, promotion the feature with tests failing will contribute to red component readiness, which will trigger TRT to act. Have we looked at the failures on the newer tests to see what those are? (Since we are currently at 80%, I think it's at least looking into why those runs failed before we assess an override)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants