Day-2 Cluster Configuration Changes

Overview

This guide provides instructions for Day‑2 configuration change use cases, to be applied after cluster provisioning. For guidance on provisioning a cluster, refer to Cluster Provisioning and Configuration.

Use Cases

Day-2 configuration changes are supported for both hardware configuration updates and policy parameter changes. Configuration attempts that previously timed out or failed can be retried.

Hardware Configuration Timeouts and Retry

When a configuration operation times out or fails, the system supports retry through spec changes.

Retry Mechanism

  • Configuration timeouts/failures: Can be retried by updating the ProvisioningRequest spec
  • Provisioning timeouts/failures: Cannot be retried; the ProvisioningRequest must be deleted and recreated
  • Retry mechanism: Uses ConfigTransactionId (set to ProvisioningRequest generation) to track configuration changes. When the ProvisioningRequest spec changes, the generation increments, creating a new ConfigTransactionId. The system compares this with ObservedConfigTransactionId to detect spec changes and trigger new configuration attempts.
  • Terminal state override: The system allows clearing terminal states (timeout/failed) when the ProvisioningRequest is in pending state due to spec changes, except for hardware provisioning timeouts/failures which require deleting and recreating the ProvisioningRequest.
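
The retry rules above can be read as a small decision function. The following is a minimal sketch, not the O-Cloud Manager's implementation: the `RequestState` type, its field names, and the returned action strings are hypothetical and exist only to illustrate how the generation is compared with the observed transaction ID.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RequestState:
    """Hypothetical, simplified view of a ProvisioningRequest."""
    generation: int            # metadata.generation, used as ConfigTransactionId
    observed_transaction: int  # stands in for ObservedConfigTransactionId
    failed_stage: Optional[str] = None  # "provisioning", "configuration", or None


def next_action(state: RequestState) -> str:
    """Map the retry rules listed above onto a single decision."""
    if state.failed_stage == "provisioning":
        # Provisioning timeouts/failures cannot be retried in place.
        return "delete-and-recreate"
    if state.generation != state.observed_transaction:
        # A spec change produced a new ConfigTransactionId.
        return "start-new-configuration"
    if state.failed_stage == "configuration":
        # The terminal state stays until the spec changes.
        return "update-spec-to-retry"
    return "nothing-to-do"


print(next_action(RequestState(generation=3, observed_transaction=2,
                               failed_stage="configuration")))
```

The provisioning check comes first because, per the list above, a spec change does not clear a hardware provisioning timeout or failure.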

Troubleshooting Configuration Timeouts

To troubleshoot:

  1. Check configuration status:

    oc get provisioningrequest <UUID> -o yaml

    Look for HardwareConfigured condition with reason: TimedOut

  2. Check hardware manager logs:

    oc logs -n oran-o2ims -l app=hardwaremanager-server -f

  3. Retry configuration:

    • Update the ProvisioningRequest spec to trigger a new configuration attempt
    • The system will clear the terminal state and start a new configuration
    • Before retrying, check BareMetalHost (BMH) state:
      • If BMH is in servicing state, wait for it to complete before retrying
      • If BMH is in servicing error state, a retry might not work, especially for persistent power management errors
      • Use oc get bmh -n <namespace> to check BMH status
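
Step 1 can also be scripted when the ProvisioningRequest is exported as JSON rather than YAML. The sketch below assumes only the `status.conditions` layout shown in this guide; the helper name `timed_out_conditions` and the sample document are illustrative, not real cluster output.

```python
import json


def timed_out_conditions(provisioning_request: dict) -> list:
    """Return the status.conditions entries whose reason is TimedOut."""
    conditions = provisioning_request.get("status", {}).get("conditions", [])
    return [c for c in conditions if c.get("reason") == "TimedOut"]


# Illustrative input only; real input would come from
#   oc get provisioningrequest <UUID> -o json
sample = json.loads("""
{
  "status": {
    "conditions": [
      {"type": "ConfigurationApplied", "reason": "Completed", "status": "True"},
      {"type": "HardwareConfigured", "reason": "TimedOut", "status": "False"}
    ]
  }
}
""")

for condition in timed_out_conditions(sample):
    print(condition["type"], condition["reason"])
```

An empty result means no condition currently reports a timeout, so a retry through a spec change is not needed for that reason.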

Updates to the clusterInstanceParameters field under ProvisioningRequest spec.templateParameters

A ProvisioningRequest can be edited to update:

  • the cluster labels and annotations
apiVersion: clcm.openshift.io/v1alpha1
kind: ProvisioningRequest
metadata:
  finalizers:
    - provisioningrequest.clcm.openshift.io/finalizer
  name: 123e4567-e89b-12d3-a456-426614174000
spec:
  description: Provisioning request for basic SNO with sample ACM policies
  name: Dev-main-SNO-Provisioning-sno-ran-du-1
  templateName: sno-ran-du
  templateVersion: v1
  templateParameters:
    clusterInstanceParameters:
      clusterName: sno-ran-du-1
      extraAnnotations:
        ManagedCluster:
          test: test  # <<< added as a Day 2 change
      extraLabels:
        ManagedCluster:
          sno-ran-du-policy: v1
          test: test  # <<< added as a Day 2 change
  • the node labels and annotations
apiVersion: clcm.openshift.io/v1alpha1
kind: ProvisioningRequest
metadata:
  finalizers:
    - provisioningrequest.clcm.openshift.io/finalizer
  name: 123e4567-e89b-12d3-a456-426614174000
spec:
  description: Provisioning request for basic SNO with sample ACM policies
  name: Dev-main-SNO-Provisioning-sno-ran-du-1
  templateName: sno-ran-du
  templateVersion: v1
  templateParameters:
    clusterInstanceParameters:
      clusterName: sno-ran-du-1
      nodes:
      - bmcCredentialsName:
          name: sno-ran-du-1-bmc-secret
        extraAnnotations:
          BareMetalHost:
            test: test  # <<< added as a Day 2 change
        extraLabels:
          BareMetalHost:
            test: test  # <<< added as a Day 2 change

The ProvisioningRequest status.conditions record the success or failure of the update.

The cluster configuration goes to the ManagedCluster CR and the nodes configuration to the corresponding BMHs, as expected.

Note: The ManagedCluster and node extra labels and annotations are the only fields that can be edited after installation. All other fields are immutable, and changes to them are rejected by the O-Cloud Manager. Such changes would be rejected anyway by the webhooks that other operators put in place for cluster installation resources (for example, ClusterDeployment).

Updates to the policyTemplateParameters field under ProvisioningRequest spec.templateParameters

These types of changes can be made under the ProvisioningRequest spec.templateParameters by updating the policyTemplateParameters entry, if it's present.

apiVersion: clcm.openshift.io/v1alpha1
kind: ProvisioningRequest
metadata:
  finalizers:
    - provisioningrequest.clcm.openshift.io/finalizer
  name: 123e4567-e89b-12d3-a456-426614174000
spec:
  description: Provisioning request for basic SNO with sample ACM policies
  name: Dev-main-SNO-Provisioning-sno-ran-du-1
  templateName: sno-ran-du
  templateVersion: v1
  templateParameters:
    nodeClusterName: sno-ran-du-1
    oCloudSiteId: local-west-12345
    policyTemplateParameters:
      sriov-network-vlan-1: "111"
      sriov-network-pfNames-1: '["ens2f0"]'

Note: Only policy configuration values exposed in the policyTemplateParameters property within the spec.templateParameterSchema field of the associated ClusterTemplate can be updated through the ProvisioningRequest.
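
Before editing a ProvisioningRequest, it can help to check locally which keys the ClusterTemplate actually exposes. The sketch below assumes the schema follows the usual JSON Schema nesting, with a `properties` object under `policyTemplateParameters`; the function name and sample values are hypothetical, and the O-Cloud Manager performs its own validation regardless.

```python
def unexposed_keys(template_parameter_schema: dict, policy_params: dict) -> list:
    """List policyTemplateParameters keys that the schema does not declare."""
    exposed = (
        template_parameter_schema.get("properties", {})
        .get("policyTemplateParameters", {})
        .get("properties", {})
    )
    return sorted(key for key in policy_params if key not in exposed)


# Assumed shape of spec.templateParameterSchema in the ClusterTemplate.
schema = {
    "properties": {
        "policyTemplateParameters": {
            "type": "object",
            "properties": {
                "sriov-network-vlan-1": {"type": "string"},
                "sriov-network-pfNames-1": {"type": "string"},
            },
        }
    }
}

requested = {"sriov-network-vlan-1": "111", "cpu-isolated": "0-1,64-65"}
print(unexposed_keys(schema, requested))  # ['cpu-isolated']
```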

Once the update is made, the <cluster-name>-pg ConfigMap in the ztp-<cluster-template-namespace> namespace gets updated with the new value. This ConfigMap is used by the ACM policies in their hub templates.

$  oc get clustertemplate -A
NAMESPACE                 NAME                     AGE
sno-ran-du-v4-Y-Z         sno-ran-du.v4-Y-Z-1      3d23h

$  oc get cm -n ztp-sno-ran-du-v4-Y-Z <cluster-name>-pg -oyaml
apiVersion: v1
data:
  cpu-isolated: 0-1,64-65
  cpu-reserved: 2-10
  hugepages-count: "32"
  hugepages-default: 1G
  hugepages-size: 1G
  install-plan-approval: Automatic
  sriov-network-vlan-1: "111"
  sriov-network-pfNames-1: '["ens2f0"]'
kind: ConfigMap
metadata:
  name: sno-ran-du-1-pg
  namespace: ztp-sno-ran-du-v4-Y-Z

Once a policy matched with a ManagedCluster deployed through a ProvisioningRequest becomes NonCompliant, it's reflected in the ProvisioningRequest status.extensions.policies and the time when it becomes NonCompliant is also recorded. The ConfigurationApplied condition reflects that the configuration is being applied.

status:
  extensions:
    clusterDetails:
      clusterProvisionStartedAt: "2024-10-07T17:59:23Z"
      name: sno-ran-du-1
      nonCompliantAt: "2024-10-07T21:53:29Z"  # <<< non-compliance timestamp recorded here
      ztpStatus: ZTP Done
    policies:
    - compliant: Compliant
      policyName: v1-perf-configuration-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
    - compliant: NonCompliant  # <<< Policy is NonCompliant
      policyName: v1-sriov-configuration-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
    - compliant: Compliant
      policyName: v1-subscriptions-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
  conditions:
  - lastTransitionTime: "2024-10-07T21:53:29Z"
    message: The configuration is still being applied
    reason: InProgress
    status: "False"
    type: ConfigurationApplied

Notes:

  • The format of the nonCompliantAt timestamps might move to another structure in the status, but they will still be recorded. When refactored, the start and end times of the configuration being NonCompliant will be recorded.
  • Some changes happen so fast that the Policy never switches to NonCompliant, so the O-Cloud Manager cannot record the event. The recorded status is still correct, since all the policies remain Compliant.
  • Once an enforce NonCompliant Policy becomes Compliant again, status.extensions.policies is updated, the status.extensions.clusterDetails.nonCompliantAt value is removed, and the ConfigurationApplied condition is updated to show that the configuration is up to date.

Once all the policies become Compliant, the status is updated as follows:

status:
  extensions:
    clusterDetails:
      clusterProvisionStartedAt: "2024-10-07T17:59:23Z"
      name: sno-ran-du-1
      ztpStatus: ZTP Done
      # nonCompliantAt is no longer present
    policies:
    - compliant: Compliant
      policyName: v1-perf-configuration-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
    - compliant: Compliant
      policyName: v1-sriov-configuration-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
    - compliant: Compliant
      policyName: v1-subscriptions-policy
      policyNamespace: ztp-sno-ran-du-v4-Y-Z
      remediationAction: enforce
  conditions:
  - lastTransitionTime: "2024-10-07T22:15:32Z"
    message: The configuration is up to date
    reason: Completed
    status: "True"
    type: ConfigurationApplied
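
The relationship between the policy list and the ConfigurationApplied condition in the two status examples above can be sketched as follows. This is a simplification under stated assumptions: it models only the two outcomes shown in this guide (InProgress and Completed) and considers only policies with remediationAction enforce. The function name is hypothetical, and the O-Cloud Manager may report other reasons that are not covered here.

```python
def configuration_applied(policies: list) -> dict:
    """Derive a ConfigurationApplied-like condition from status.extensions.policies."""
    enforced = [p for p in policies if p.get("remediationAction") == "enforce"]
    if all(p.get("compliant") == "Compliant" for p in enforced):
        return {
            "type": "ConfigurationApplied",
            "status": "True",
            "reason": "Completed",
            "message": "The configuration is up to date",
        }
    return {
        "type": "ConfigurationApplied",
        "status": "False",
        "reason": "InProgress",
        "message": "The configuration is still being applied",
    }


policies = [
    {"policyName": "v1-perf-configuration-policy", "compliant": "Compliant",
     "remediationAction": "enforce"},
    {"policyName": "v1-sriov-configuration-policy", "compliant": "NonCompliant",
     "remediationAction": "enforce"},
]
print(configuration_applied(policies)["reason"])  # InProgress
```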

Updates to the ClusterInstance defaults ConfigMap

We assume a ManagedCluster has been installed through a ProvisioningRequest referencing the sno-ran-du.v4-Y-Z-1 ClusterTemplate CR.

In this example we are adding a new annotation to the ManagedCluster through the clusterinstance-defaults-v1 ConfigMap holding default values for the corresponding ClusterInstance. The following steps need to be taken:

  1. Upversion the cluster template:
    • Create a new version of the clusterinstance-defaults-v1 ConfigMap - clusterinstance-defaults-v2:
      • Update the name to clusterinstance-defaults-v2 (the namespace stays sno-ran-du-v4-Y-Z).
      • Update data.clusterinstance-defaults.extraAnnotations with the desired new annotation.
    • Create a new version of the sno-ran-du.v4-Y-Z-1 ClusterTemplate CR - sno-ran-du.v4-Y-Z-2
      • Update the metadata.name from sno-ran-du.v4-Y-Z-1 to sno-ran-du.v4-Y-Z-2
      • Update spec.version from v4-Y-Z-1 to v4-Y-Z-2
      • Update spec.templateDefaults.clusterInstanceDefaults to clusterinstance-defaults-v2
  2. ArgoCD sync to the hub cluster:
    • Add the newly created files to their corresponding kustomization.yaml.
    • All the resources from above are created on the hub cluster.
  3. The SMO selects the new ClusterTemplate CR for the ProvisioningRequest:
    • spec.templateName remains sno-ran-du, spec.templateVersion is updated from v4-Y-Z-1 to v4-Y-Z-2
  4. The O-Cloud Manager detects the change:
    • It updates the ClusterInstance with the new annotation.
  5. The siteconfig operator detects the change to the ClusterInstance CR:
    • The new annotation is added to the ManagedCluster.
    • Any issues are reported in the ProvisioningRequest, under status.conditions.
    • Note: Some installation manifests cannot be updated after provisioning as the underlying operators have webhooks to prevent such updates.

Updates to an existing ACM PolicyGenerator manifest

For updating a manifest in an existing ACM PolicyGenerator, the following steps need to be taken (we'll take sno-ran-du-pg-v4-Y-Z-v1 as an example):

  1. Upversion the cluster template content:

    • Create a new version of the ACM PG - sno-ran-du-pg-v4-Y-Z-v2:
      • The name is updated to sno-ran-du-pg-v4-Y-Z-v2 (the ztp-sno-ran-du-v4-Y-Z namespace is kept).
      • policyDefaults.placement.labelSelector.sno-ran-du-policy is updated from v1 to v2 such that the policy binding is updated.
      • The annotation clustertemplates.clcm.openshift.io/templates under policyAnnotations is updated to sno-ran-du.v4-Y-Z-3, which is the name of the new ClusterTemplate that will be created in the following step.
      • All policy names are updated from v1 to v2 (example: v1-subscriptions-policy -> v2-subscriptions-policy).
      • The desired manifest section is updated. The current sno-ran-du-pg-v4-Y-Z-v2 sample adds a sysctl section to the TunedPerformancePatch section under the v2-tuned-configuration-policy policy.
    • Create a new version of the clusterinstance-defaults-v2 ConfigMap - clusterinstance-defaults-v3:
      • Update the name to clusterinstance-defaults-v3 (the namespace stays sno-ran-du-v4-Y-Z).
      • Update the sno-ran-du-policy ManagedCluster extraLabel from v1 to v2.
    • Create a new version of the sno-ran-du.v4-Y-Z-2 ClusterTemplate CR - sno-ran-du.v4-Y-Z-3
      • Update the metadata.name from sno-ran-du.v4-Y-Z-2 to sno-ran-du.v4-Y-Z-3.
      • Update spec.version from v4-Y-Z-2 to v4-Y-Z-3.
      • Update spec.templateDefaults.clusterInstanceDefaults to clusterinstance-defaults-v3.
  2. ArgoCD sync to the hub cluster:

    • Add the newly created files to their corresponding kustomization.yaml.
    • All the resources from above are created on the hub cluster, including the v2 policies, and the new ClusterTemplate is validated.
    • The new policies are not yet applied to the cluster because the ManagedCluster still has the old sno-ran-du-policy: "v1" label.
  3. The SMO selects the new ClusterTemplate CR for the ProvisioningRequest:

    • spec.templateName remains sno-ran-du, spec.templateVersion is updated from v4-Y-Z-2 to v4-Y-Z-3
  4. The O-Cloud Manager detects the change:

    • It updates the ClusterInstance with the new sno-ran-du-policy: "v2" ManagedCluster label.
    • The siteconfig operator applies the new label to the ManagedCluster.
  5. The ACM Policy Propagator detects the new binding:

    • The old policies created through the sno-ran-du-pg-v4-Y-Z-v1 Policy Generator are no longer matched to the ManagedCluster.
    • The new policies created through the sno-ran-du-pg-v4-Y-Z-v2 Policy Generator are matched to the ManagedCluster.
    • The ConfigurationApplied condition is updated in the ProvisioningRequest to show that the configuration has changed and is being applied (the policies depend on each other, so some are in a Pending state until ACM confirms their compliance):
    status:
      extensions:
        ...
        policies:
        - compliant: Pending
          policyName: v2-perf-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Pending
          policyName: v2-sriov-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Compliant
          policyName: v2-subscriptions-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Pending
          policyName: v2-tuned-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
      conditions:
        ...
        - lastTransitionTime: "2024-10-11T19:48:06Z"
          message: The configuration is still being applied
          reason: InProgress
          status: "False"
          type: ConfigurationApplied
    • The affected CRs are updated on the ManagedCluster, not deleted and recreated.
  6. The O-Cloud Manager updates the ProvisioningRequest once all the policies are Compliant:

    status:
      extensions:
        ...
        policies:
        - compliant: Compliant
          policyName: v2-tuned-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Compliant
          policyName: v2-perf-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Compliant
          policyName: v2-sriov-configuration-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
        - compliant: Compliant
          policyName: v2-subscriptions-policy
          policyNamespace: ztp-sno-ran-du-v4-Y-Z
          remediationAction: enforce
      conditions:
      ...
      - lastTransitionTime: "2024-10-11T19:48:36Z"
        message: The configuration is up to date
        reason: Completed
        status: "True"
        type: ConfigurationApplied

Adding a new manifest to an existing ACM PolicyGenerator

This use case is identical to the previous one, with the following distinctions:

  • If the new manifest does not have a corresponding source-cr file, the CSP should add a new YAML file to the custom-crs directory.

Directory structure example:

policytemplates
├── version_4.Y.Z
│   ├── sno-ran-du
│   ├── source-crs
│   ├── custom-crs
│   └── kustomization.yaml
└── kustomization.yaml
  • Depending on the dependencies, the new manifest can be added either to an existing policy or to a new policy.

Adding manifests to an existing policy - adding the LCA operator:

policies:
- name: v1-subscriptions-policy
  manifests:
    - path: source-crs/DefaultCatsrc.yaml
      patches:
      - metadata:
          name: redhat-operators
        spec:
          displayName: redhat-operators
          image: registry.redhat.io/redhat/redhat-operator-index:v4.16
    # Everything below would be added for installing the LCA operator:
    - path: source-crs/LcaSubscriptionNS.yaml
    - path: source-crs/LcaSubscriptionOperGroup.yaml
    - path: source-crs/LcaSubscription.yaml
      patches:
      - spec:
          source: redhat-operators
          installPlanApproval:
            '{{hub $configMap:=(lookup "v1" "ConfigMap" "" (printf "%s-pg" .ManagedClusterName)) hub}}{{hub dig "data" "install-plan-approval" "Manual" $configMap hub}}'

Adding manifests to a new policy - adding the LCA operator:

policies:
# Everything below would be added for installing the LCA operator:
- name: v1-lca-operator-policy
  manifests:
    - path: source-crs/LcaSubscriptionNS.yaml
    - path: source-crs/LcaSubscriptionOperGroup.yaml
    - path: source-crs/LcaSubscription.yaml
      patches:
      - spec:
          source: redhat-operators
          installPlanApproval:
            '{{hub $configMap:=(lookup "v1" "ConfigMap" "" (printf "%s-pg" .ManagedClusterName)) hub}}{{hub dig "data" "install-plan-approval" "Manual" $configMap hub}}'

Updating the ClusterTemplate schemas

We assume a ManagedCluster has been installed through a ProvisioningRequest referencing the sno-ran-du.v4-Y-Z-3 ClusterTemplate CR.

In this example we are updating the policy template schema - spec.templateParameterSchema.policyTemplateParameters. This update means that the ACM PG requires extra configuration values. We assume we are starting from the sno-ran-du-pg-v4-Y-Z-v2 ACM PG, but want to add configuration for one more SRIOV network, so 2 extra manifests (SriovNetwork and SriovNetworkNodePolicy) are needed.

The following steps need to be taken:

  1. Upversion the cluster template content:
    • A new ACM PG is created - sno-ran-du-pg-v4-Y-Z-v3:

      • metadata.name is updated from sno-ran-du-pg-v4-Y-Z-v2 to sno-ran-du-pg-v4-Y-Z-v3 (the ztp-sno-ran-du-v4-Y-Z namespace is kept).
      • policyDefaults.placement.labelSelector.sno-ran-du-policy is updated from v2 to v3 such that the policy binding is updated.
      • The annotation clustertemplates.clcm.openshift.io/templates under policyAnnotations is updated to sno-ran-du.v4-Y-Z-4, which is the name of the new ClusterTemplate that will be created in the following step.
      • All policy names are updated from v2 to v3 (example: v2-subscriptions-policy -> v3-subscriptions-policy).
      • The following manifests are added under the v3-subscriptions-policy:
      - path: source-crs/SriovNetwork.yaml
        patches:
        - metadata:
            name: sriov-nw-du-mh
          spec:
            resourceName: du_mh
            vlan: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "sriov-network-vlan-2" | toInt hub}}'
      - path: source-crs/SriovNetworkNodePolicy-SetSelector.yaml
        patches:
        - metadata:
            name: sriov-nnp-du-mh
          spec:
            deviceType: vfio-pci
            isRdma: false
            nicSelector:
              pfNames: '{{hub fromConfigMap "" (printf "%s-pg" .ManagedClusterName) "sriov-network-pfNames-2" | toLiteral hub}}'
            nodeSelector:
              node-role.kubernetes.io/master: ""
            numVfs: 8
            priority: 10
            resourceName: du_mh
    • A new version of the policytemplate-defaults-v1 ConfigMap is created - policytemplate-defaults-v2:

      • metadata.name is updated from policytemplate-defaults-v1 to policytemplate-defaults-v2.
      • Update the defaults to reflect the new schema and the configuration values it requires, in our case sriov-network-vlan-2 and sriov-network-pfNames-2.
    • Create a new version of the clusterinstance-defaults-v3 ConfigMap - clusterinstance-defaults-v4:

      • Update the name to clusterinstance-defaults-v4 (the namespace stays sno-ran-du-v4-Y-Z).
      • Update the sno-ran-du-policy ManagedCluster extraLabel from v2 to v3.
    • Create a new version of the sno-ran-du.v4-Y-Z-3 ClusterTemplate CR - sno-ran-du.v4-Y-Z-4

      • Update the metadata.name from sno-ran-du.v4-Y-Z-3 to sno-ran-du.v4-Y-Z-4.
      • Update spec.version from v4-Y-Z-3 to v4-Y-Z-4.
      • Update spec.templateDefaults.clusterInstanceDefaults to clusterinstance-defaults-v4.
      • Update spec.templateDefaults.policyTemplateDefaults to policytemplate-defaults-v2.
      • Update spec.templateParameterSchema.properties.policyTemplateParameters to include the newly desired configuration options:
      ...
      sriov-network-vlan-2:
        type: string
      sriov-network-pfNames-2:
        type: string
      ...

The remaining steps are similar to those from the Updates to an existing ACM PolicyGenerator manifest section, starting with step 2.

The only distinction is that, for the current use case, the <cluster-name>-pg ConfigMap in the ztp-<cluster-template-namespace> namespace is updated by the O-Cloud Manager to include the new values (sriov-network-vlan-2 and sriov-network-pfNames-2):

$  oc get cm -n ztp-sno-ran-du-v4-Y-Z <cluster-name>-pg -oyaml
apiVersion: v1
data:
  cpu-isolated: 0-1,64-65
  cpu-reserved: 2-10
  hugepages-count: "32"
  hugepages-default: 1G
  hugepages-size: 1G
  install-plan-approval: Automatic
  sriov-network-pfNames-1: '["ens2f0"]'
  sriov-network-pfNames-2: '["ens2f1"]'
  sriov-network-vlan-1: "111"
  sriov-network-vlan-2: "222"
kind: ConfigMap
metadata:
  name: sno-ran-du-1-pg
  namespace: ztp-sno-ran-du-v4-Y-Z
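
One way to picture how the ConfigMap data above comes together is as an overlay of the ProvisioningRequest values on top of the policy template defaults. The sketch below rests on assumptions: that request values take precedence over defaults, and that the new sriov-network-*-2 values come from the policytemplate-defaults-v2 ConfigMap. Neither the function name nor the split between defaults and request values is confirmed by this guide.

```python
def render_pg_data(defaults: dict, request_params: dict) -> dict:
    """Overlay policyTemplateParameters on the defaults (request wins, assumed)."""
    merged = dict(defaults)
    merged.update(request_params)
    return merged


# Assumed content of policytemplate-defaults-v2 (illustrative split).
defaults_v2 = {
    "cpu-isolated": "0-1,64-65",
    "cpu-reserved": "2-10",
    "hugepages-count": "32",
    "hugepages-default": "1G",
    "hugepages-size": "1G",
    "install-plan-approval": "Automatic",
    "sriov-network-pfNames-2": '["ens2f1"]',
    "sriov-network-vlan-2": "222",
}

# Values from spec.templateParameters.policyTemplateParameters.
request_params = {
    "sriov-network-vlan-1": "111",
    "sriov-network-pfNames-1": '["ens2f0"]',
}

for key, value in sorted(render_pg_data(defaults_v2, request_params).items()):
    print(f"{key}: {value}")
```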

Note: The steps are similar for updating the spec.templateParameterSchema.properties.clusterInstanceParameters. Any change to the clusterInstanceParameters must match the ClusterInstance CR of the siteconfig operator.

Switching to a new hardware profile

For a detailed explanation of how firmware updates are processed, including the CR relationship, status conditions, timeouts, and failure handling, see the Firmware Update Workflow guide.

We assume a ManagedCluster has been installed through a ProvisioningRequest referencing the sno-ran-du.v4-Y-Z-4 ClusterTemplate CR with inline hwMgmtDefaults.

In this example we are updating BIOS settings, BIOS firmware, and BMC firmware by changing the HardwareProfile referenced via templateParameters.hwMgmtParameters in the ProvisioningRequest. The hwMgmtDefaults and ClusterTemplate remain unchanged — only the ProvisioningRequest and the HardwareProfile CR need to be updated.

The following steps are required:

  1. Create a new HardwareProfile CR (if one does not already exist for the target firmware versions):

    • Create a new version of the dell-xr8620t-bios-2.3.5-bmc-7.10.70.10 HardwareProfile - dell-xr8620t-bios-2.6.3-bmc-7.20.30.50:
      • Update the name from dell-xr8620t-bios-2.3.5-bmc-7.10.70.10 to dell-xr8620t-bios-2.6.3-bmc-7.20.30.50.
      • Update the spec.bios, spec.biosFirmware and spec.bmcFirmware with desired settings/versions.
      • Add or update spec.nicFirmware if NIC firmware updates are also required.
    • Update the kustomization files to include the new HardwareProfile. ArgoCD will automatically sync it to the hub cluster.
    • No changes to hwMgmtDefaults or ClusterTemplate CRs are needed.
  2. Update the ProvisioningRequest to reference the new HardwareProfile:

    • Set the hwProfile for the controller node group in spec.templateParameters.hwMgmtParameters.nodeGroupData to the new profile name (dell-xr8620t-bios-2.6.3-bmc-7.20.30.50).
    spec:
      templateParameters:
        hwMgmtParameters:
          nodeGroupData:
            - name: controller
              hwProfile: dell-xr8620t-bios-2.6.3-bmc-7.20.30.50
  3. The O-Cloud Manager detects the change:

    • Updates the hardware profile in the NodeAllocationRequest CR for that cluster to the new profile.
    spec:
      ...
        nodeGroup:
        - nodeGroupData:
            hwProfile: dell-xr8620t-bios-2.6.3-bmc-7.20.30.50
            name: master
            resourceSelector:
              server-colour: blue
              server-type: XR8620t
            role: master
      ...
    • Updates the status of the ProvisioningRequest:
        - lastTransitionTime: "2025-10-01T21:36:01Z"
          message: Hardware configuring is in progress
          reason: InProgress
          status: "False"
          type: HardwareConfigured
        ...
        provisioningStatus:
          provisionedResources:
            oCloudNodeClusterId: 95f4a2cf-04dc-42d5-9d1e-f6cbc693d8ea
          provisioningDetails: Hardware configuring is in progress
          provisioningPhase: progressing
  4. The O-Cloud hardware manager detects the updated NodeAllocationRequest CR:

    • It lists the AllocatedNode CRs that reference the NodeAllocationRequest and updates spec.hwProfile in each AllocatedNode CR to the new profile.
    • It computes BIOS/firmware changes from the new HardwareProfile and requests the updates by updating the Metal3 resources—HostFirmwareSettings and HostFirmwareComponents—with the changes.
    • The status conditions of the NodeAllocationRequest and AllocatedNode CRs are also updated to reflect the configuration change.

    NodeAllocationRequest CR status:

    status:
      conditions:
      - lastTransitionTime: "2025-09-17T21:47:53Z"
        message: Created
        reason: Completed
        status: "True"
        type: Provisioned
      - lastTransitionTime: "2025-10-01T21:36:01Z"
        message: 'Configuration update in progress (AllocatedNode sno1-dell-xr8620t-pool-dell-xr8620t-node1)'
        reason: InProgress
        status: "False"
        type: Configured

    AllocatedNode CR status:

    conditions:
    - lastTransitionTime: "2025-09-17T21:47:53Z"
      message: Provisioned
      reason: Completed
      status: "True"
      type: Provisioned
    - lastTransitionTime: "2025-10-01T21:36:01Z"
      message: Update requested
      reason: ConfigurationUpdateRequested
      status: "False"
      type: Configured
  5. The hardware manager waits for the Metal3 Bare Metal Operator (BMO) to detect and validate the changes on the HostFirmwareSettings and HostFirmwareComponents CRs, then triggers a host reboot via the reboot.metal3.io annotation on the BMH. BMO applies the firmware and BIOS updates during the reboot cycle. For multi-node clusters, master nodes are updated serially first, then worker nodes can be updated in parallel based on the MCP maxUnavailable setting. See Day-2 Workflow for details.

  6. The hardware manager validates the result by checking HostFirmwareSettings/HostFirmwareComponents status and confirming the Kubernetes node has rejoined the cluster and reached Ready state.

    • Success scenario:

      • It updates the status of the AllocatedNode CR to reflect the result of the operation.
      status:
        conditions:
        - lastTransitionTime: "2025-09-17T21:47:53Z"
          message: Provisioned
          reason: Completed
          status: "True"
          type: Provisioned
        - lastTransitionTime: "2025-10-01T22:05:01Z"
          message: Configuration has been applied successfully
          reason: ConfigurationApplied
          status: "True"
          type: Configured
        hwProfile: dell-xr8620t-bios-2.6.3-bmc-7.20.30.50
      • The currentVersion values for bios and bmc in HostFirmwareComponents status should match the versions declared in the new HardwareProfile.
      status:
        components:
        - component: bios
          currentVersion: 2.6.3
          initialVersion: 2.3.5
          lastVersionFlashed: 2.6.3
          updatedAt: "2025-10-01T22:01:50Z"
        - component: bmc
          currentVersion: 7.20.30.50
          initialVersion: 7.10.70.10
          lastVersionFlashed: 7.20.30.50
          updatedAt: "2025-10-01T22:01:50Z"
        ...
      • The settings field in the HostFirmwareSettings status shows the applied BIOS attributes as defined in the new HardwareProfile.
      status:
        settings:
          AcPwrRcvryUserDelay: "120"
      • Once all nodes have been updated, the hardware manager updates the status of the NodeAllocationRequest CR to reflect the result of the operation.
      status:
        conditions:
        - lastTransitionTime: "2025-09-17T21:47:53Z"
          message: Created
          reason: Completed
          status: "True"
          type: Provisioned
        - lastTransitionTime: "2025-10-01T22:05:01Z"
          message: Configuration has been applied successfully
          reason: ConfigurationApplied
          status: "True"
          type: Configured
    • Failure scenario:

      • The operation is aborted.
      • The status of the AllocatedNode CR is updated with the failure reason.
      • The O-Cloud Manager does not initiate a rollback of any nodes already updated. This is left to the user to remediate.
  7. The O-Cloud Manager updates the ProvisioningRequest status to reflect the result of the operation, based on the status update of the NodeAllocationRequest CR:

    - lastTransitionTime: "2025-10-01T22:05:01Z"
      message: Configuration has been applied successfully
      reason: Completed
      status: "True"
      type: HardwareConfigured
    ...
    provisioningStatus:
      provisionedResources:
        oCloudNodeClusterId: 95f4a2cf-04dc-42d5-9d1e-f6cbc693d8ea
      provisioningDetails: Provisioning request has completed successfully
      provisioningPhase: fulfilled
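
The node ordering described in step 5 for multi-node clusters (master nodes serially, then worker nodes in parallel up to the MCP maxUnavailable setting) can be sketched as a batching function. This is an illustration only: the function name, the node dictionaries, and the treatment of maxUnavailable as a plain integer are assumptions, and the hardware manager's actual scheduling is described in the Day-2 Workflow guide.

```python
def update_batches(nodes: list, max_unavailable: int = 1) -> list:
    """Group node names into update batches: masters one by one, then workers."""
    masters = [n["name"] for n in nodes if n["role"] == "master"]
    workers = [n["name"] for n in nodes if n["role"] == "worker"]

    # Master nodes are updated serially, so each gets its own batch.
    batches = [[name] for name in masters]

    # Worker nodes are updated in groups of at most max_unavailable.
    step = max(1, max_unavailable)
    for start in range(0, len(workers), step):
        batches.append(workers[start:start + step])
    return batches


# Hypothetical node names for a multi-node cluster.
cluster_nodes = [
    {"name": "master-0", "role": "master"},
    {"name": "master-1", "role": "master"},
    {"name": "master-2", "role": "master"},
    {"name": "worker-0", "role": "worker"},
    {"name": "worker-1", "role": "worker"},
    {"name": "worker-2", "role": "worker"},
]

for batch in update_batches(cluster_nodes, max_unavailable=2):
    print(batch)
```

Because a failure aborts the operation without rolling back nodes that were already updated, the batch order also indicates which nodes may need manual remediation.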