OCPBUGS-51009: OSUpdateStarted event should only be emitted on actual OS updates #4864

Open
wants to merge 1 commit into
base: master

@djoshy djoshy commented Feb 18, 2025

- What I did

  • Stopped emitting OSUpdateStarted events for a specific scenario: some extensions are currently in use, but no extension installs/uninstalls are taking place. In such cases, rpm-ostree update is not run and no OS update happens. I suspect this is a special case that we once accounted for but that is no longer in use.
  • Made the OSUpdateStarted event's message a bit more verbose for easier debugging.

- How to verify it

  • Existing units/e2es should pass; this should not break extension functionality.
  • Not sure this needs pre-merge QE testing.

@openshift-ci-robot openshift-ci-robot added jira/severity-moderate Referenced Jira bug's severity is moderate for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Feb 18, 2025
@openshift-ci-robot

@djoshy: This pull request references Jira Issue OCPBUGS-51009, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Feb 18, 2025

@yuqi-zhang yuqi-zhang left a comment

/lgtm

I think the changes make sense. These events really shouldn't be the canonical signal for OS upgrades, but I guess for now they work for catching CI flakes like this, so it somewhat helps.

  // We have at least one customer that removes the pull secret from the cluster to "shrinkwrap" it for distribution and we want
  // to make sure we don't break that use case, but realtime kernel update and extensions update always ran
  // if they were in use, so we also need to preserve that behavior.
  // https://issues.redhat.com/browse/OCPBUGS-4049
  if mcDiff.osUpdate || mcDiff.extensions || mcDiff.kernelType || mcDiff.kargs ||
      canonicalizeKernelType(newConfig.Spec.KernelType) == ctrlcommon.KernelTypeRealtime ||
-     canonicalizeKernelType(newConfig.Spec.KernelType) == ctrlcommon.KernelType64kPages ||
-     len(newConfig.Spec.Extensions) > 0 {
+     canonicalizeKernelType(newConfig.Spec.KernelType) == ctrlcommon.KernelType64kPages {

Hmm, the best I can interpret the intent is that on upgrades, the extension packages should also update automatically with the OS version. That said, I think this logic would still be incorrect, since it should only emit an event if mcDiff.osUpdate && len(newConfig.Spec.Extensions) > 0, which is already covered by the mcDiff.osUpdate check above, so maybe that doesn't make sense.
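
To make the point above concrete, here is a minimal, self-contained sketch of the condition before and after this change. It is an illustration only, not the daemon's code: diffSketch, emitsBefore, and emitsAfter are hypothetical names, and the two kernel type checks are collapsed into a single nonDefaultKernel boolean.

```go
package main

import "fmt"

// diffSketch is a hypothetical stand-in for the mcDiff fields used in the condition.
type diffSketch struct {
	osUpdate, extensions, kernelType, kargs bool
}

// emitsBefore models the old condition: any extension in use triggers the
// event, even when nothing about the OS is changing.
func emitsBefore(d diffSketch, nonDefaultKernel bool, extensionsInUse int) bool {
	return d.osUpdate || d.extensions || d.kernelType || d.kargs ||
		nonDefaultKernel || extensionsInUse > 0
}

// emitsAfter models the new condition: the "extensions in use" clause is gone.
func emitsAfter(d diffSketch, nonDefaultKernel bool) bool {
	return d.osUpdate || d.extensions || d.kernelType || d.kargs || nonDefaultKernel
}

func main() {
	// File-only change on a node that has two extensions installed.
	fileOnly := diffSketch{}
	fmt.Println(emitsBefore(fileOnly, false, 2), emitsAfter(fileOnly, false)) // true false

	// OS update on the same node: osUpdate alone already covers this case,
	// so "osUpdate && extensions in use" would add nothing.
	upgrade := diffSketch{osUpdate: true}
	fmt.Println(emitsBefore(upgrade, false, 2), emitsAfter(upgrade, false)) // true true
}
```

The only case whose result changes is the first one, which matches the scenario described in the PR description.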

// osChangesString() can return empty in cases where the above diffs are false,
// but the node uses a non standard kernel, so let's make it a bit more
// informative in such cases
reason = fmt.Sprintf("Updating to a target config with %s kernel", canonicalizeKernelType(newConfig.Spec.KernelType))

I was thinking of having this also refer to the rendered config that the update is happening for, but I guess that's relatively easy to match, so no need here.
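
For reference, the fallback in the hunk above, plus the optional rendered config suffix discussed here, could look roughly like the sketch below. Everything in it is an assumption for illustration: describeUpdate is a hypothetical helper, the changes slice stands in for whatever osChangesString() reports, and the rendered config name is a placeholder.

```go
package main

import (
	"fmt"
	"strings"
)

// describeUpdate is a hypothetical helper. changes stands in for the pieces
// that osChangesString() would report; it may be empty when no diff is set
// but the node runs a non-standard kernel.
func describeUpdate(changes []string, kernelType, renderedConfig string) string {
	reason := strings.Join(changes, "; ")
	if reason == "" {
		// Same fallback wording as the hunk under review.
		reason = fmt.Sprintf("Updating to a target config with %s kernel", kernelType)
	}
	if renderedConfig != "" {
		// Optional suffix naming the target rendered config (not part of this PR).
		reason = fmt.Sprintf("%s (target: %s)", reason, renderedConfig)
	}
	return reason
}

func main() {
	fmt.Println(describeUpdate(nil, "realtime", ""))
	fmt.Println(describeUpdate([]string{"Installing extensions"}, "default", "rendered-worker-example"))
}
```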

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Feb 19, 2025

openshift-ci bot commented Feb 19, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: djoshy, yuqi-zhang

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@djoshy

djoshy commented Feb 19, 2025

/label acknowledge-critical-fixes-only

/hold

Holding for QE

@openshift-ci openshift-ci bot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. acknowledge-critical-fixes-only Indicates if the issuer of the label is OK with the policy. labels Feb 19, 2025
@sergiordlr

Verified using IPI on AWS

  1. Install all extensions
  2. Check the events
$ oc get events --sort-by metadata.creationTimestamp |grep OSUpdateStarted
11m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
5m5s        Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions

  3. Configure the MachineConfiguration resource so that changes to the test file trigger no action
$ oc edit machineconfiguration
...
  spec:
    logLevel: Normal
    managementState: Managed
    nodeDisruptionPolicy:
      files:
      - actions:
        - type: None
        path: /etc/test-file
    operatorLogLevel: Normal

  4. Create a MC to deploy the test file

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: test-machine-config
spec:
  config:
    ignition:
      version: 3.1.0
    storage:
      files:
      - contents:
          source: data:text/plain;charset=utf-8;base64,dGVzdA==
        filesystem: root
        mode: 420
        path: /etc/test-file

  5. Check that no extra events were triggered
$ oc get events --sort-by metadata.creationTimestamp |grep OSUpdateStarted
16m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
10m         Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions

  6. Remove the extensions

  7. Check that new events were triggered

$ oc get events --sort-by metadata.creationTimestamp |grep OSUpdateStarted
27m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
21m         Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions
7m43s       Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
2m50s       Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions

  8. Deploy realtime kernel
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: worker-rt-kernel
spec:
  kernelType: realtime

  9. Check events
$ oc get events --sort-by metadata.creationTimestamp |grep OSUpdateStarted
47m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
40m         Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions
27m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Installing extensions
22m         Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Installing extensions
15m         Normal    OSUpdateStarted                           node/ip-10-0-20-249.us-east-2.compute.internal                       Changing kernel type
8m19s       Normal    OSUpdateStarted                           node/ip-10-0-35-192.us-east-2.compute.internal                       Changing kernel type
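
The per-step checks above compare event counts by eye. A small helper along these lines could do the counting instead. This is only a sketch: countOSUpdateStarted is a hypothetical name, and it assumes the default column layout shown in the output above (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE).

```go
package main

import (
	"bufio"
	"fmt"
	"sort"
	"strings"
)

// countOSUpdateStarted counts OSUpdateStarted events per object from the
// text output of `oc get events`. It assumes whitespace-separated columns in
// the default order: LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
func countOSUpdateStarted(output string) map[string]int {
	counts := make(map[string]int)
	sc := bufio.NewScanner(strings.NewReader(output))
	for sc.Scan() {
		fields := strings.Fields(sc.Text())
		if len(fields) < 4 || fields[2] != "OSUpdateStarted" {
			continue // header, blank line, or a different event reason
		}
		counts[fields[3]]++
	}
	return counts
}

func main() {
	// Sample taken from the "no extra events" check above.
	sample := `16m  Normal  OSUpdateStarted  node/ip-10-0-20-249.us-east-2.compute.internal  Installing extensions
10m  Normal  OSUpdateStarted  node/ip-10-0-35-192.us-east-2.compute.internal  Installing extensions`

	counts := countOSUpdateStarted(sample)
	nodes := make([]string, 0, len(counts))
	for n := range counts {
		nodes = append(nodes, n)
	}
	sort.Strings(nodes) // stable output order
	for _, n := range nodes {
		fmt.Printf("%s %d\n", n, counts[n])
	}
}
```

Running it before and after applying the test-file MachineConfig and comparing the per-node counts gives the same signal as the manual grep.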

/label qe-approved

@openshift-ci openshift-ci bot added the qe-approved Signifies that QE has signed off on this PR label Feb 20, 2025
@openshift-ci-robot

@djoshy: This pull request references Jira Issue OCPBUGS-51009, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.19.0) matches configured target version for branch (4.19.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sergiordlr


@djoshy

djoshy commented Feb 20, 2025

/unhold

@openshift-ci openshift-ci bot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 20, 2025
@djoshy

djoshy commented Feb 20, 2025

/retest-required

@openshift-ci-robot

/retest-required

Remaining retests: 0 against base HEAD d34ac99 and 2 for PR HEAD c75a003 in total


openshift-ci bot commented Feb 21, 2025

@djoshy: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name | Commit | Details | Required | Rerun command
ci/prow/e2e-gcp-op-techpreview | c75a003 | link | false | /test e2e-gcp-op-techpreview
ci/prow/e2e-aws-ovn-upgrade-out-of-change | c75a003 | link | false | /test e2e-aws-ovn-upgrade-out-of-change
ci/prow/e2e-azure-ovn-upgrade-out-of-change | c75a003 | link | false | /test e2e-azure-ovn-upgrade-out-of-change
ci/prow/e2e-gcp-op-ocl | c75a003 | link | false | /test e2e-gcp-op-ocl
ci/prow/e2e-gcp-op | c75a003 | link | true | /test e2e-gcp-op

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Labels
  • acknowledge-critical-fixes-only: Indicates if the issuer of the label is OK with the policy.
  • approved: Indicates a PR has been approved by an approver from all required OWNERS files.
  • jira/severity-moderate: Referenced Jira bug's severity is moderate for the branch this PR is targeting.
  • jira/valid-bug: Indicates that a referenced Jira bug is valid for the branch this PR is targeting.
  • jira/valid-reference: Indicates that this PR references a valid Jira ticket of any type.
  • lgtm: Indicates that a PR is ready to be merged.
  • qe-approved: Signifies that QE has signed off on this PR
4 participants