22 changes: 22 additions & 0 deletions operator-nexus/concepts-cluster-upgrade-overview.md
@@ -77,6 +77,28 @@ az networkcloud baremetalmachine list -g $mrg --subscription $sub --query "sort_
--output table

```

## Nexus tenant workload health check during cluster runtime upgrade

During a runtime upgrade, the inventory readiness check is triggered to conduct workload health checks. The inventory readiness check feature is applicable only to the rack-by-rack upgrade strategy. The platform feature "UpgradeInventoryChecks" controls the platform runtime upgrade outcome when the health check fails. When the feature is enabled, the upgrade pauses if there is an inventory readiness check failure after the compute rack upgrade. The upgrade can be continued using CCUVA. When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage. By default, the feature is disabled.

Should we generalize the CCUVA term? Maybe say, "The upgrade can be continued when the customer executes the upgrade API."

Contributor

Do not use term CUVA, as customers do not know this.
State runtime upgrade or similar

Author

Updated the text to remove CUVA reference and used runtime upgrade.

Contributor

we aren't really doing workload health checks. We are looking at the infrastructure of the tenant resources. I want to avoid indicating we check how their workloads are performing

specifically right now we're checking for Nexus VM / Nexus AKS health. would it make sense to explicitly reference those for clarity?

Contributor

maybe something like "...triggered to conduct workload infrastructure availability" or "to conduct availability of the VM and NAKS health"

Author

Have mentioned tenant workload (Nexus Kubernetes Cluster and Virtual Machine). This will clarify we are referring to Nexus Kubernetes Cluster and Virtual Machine as tenant workload during health checks.

Contributor

"When the feature is disabled the inventory readiness failures are logged and upgrade continues to next stage."

Is this information exposed to customers? How can it be viewed?

@seaneagan Sep 23, 2025

I think logging is an internal mechanism that we probably don't want to inform customer about. I think in target-state we probably want to expose the information in the same form as we would if the feature were enabled (e.g. ARM properties, but TBD), but just don't block upgrades on it, and have it be informational only.

Contributor

What does the user experience look like for customers? I'm trying to understand how the customer would know the workloads aren't healthy after the runtime upgrade. I don't think we need to specify why, but we need something to indicate that the workload inventory check failed.

Author

The behavior is different for the compute node-pool and for the other KCP and mgmt-plane servers. It's good to document this explicitly.
I will remove the log section.

Contributor

How is the feature turned on/off?

is there separate documentation produced for AFEC feature flags that we can link to?

Contributor

I would recommend to specify the functionality is feature flag enabled.

Author

I have mentioned functionality is feature flag enabled.
Need to provide a link to this.

inventory readiness check is triggered to conduct workload health checks

would be good to align on one term. i'm guessing "tenant workload health checks" may resonate the best based on the rest of the nexus documentation? i think "inventory readiness check" is the name of the implementation, which the customer shouldn't need to know about.

Author

Change to tenant workload health checks


The Inventory Readiness Check feature performs a workload health check after the control-plane, management-plane, and compute servers are upgraded during a platform runtime upgrade. It operates in snapshot and comparison modes and provides a mechanism to verify workload health state after different stages of the platform runtime upgrade. The feature supports Nexus Kubernetes Cluster and Virtual Machine workloads.

Contributor

There aren't workloads on the control plane. We need to be clear about what is being checked on these servers.

@seaneagan Sep 23, 2025

the checks at each phase (KCP, NMP, Compute Rack 1, ....) aren't actually checking for workloads running on those node pools, they are executing global checks as e.g. upgrading kubernetes could cause issues with workloads running on computes even though those compute machines haven't been upgraded yet. how much of this detail do we want to include in the docs?

Contributor

Since we specify these are workload inventory checks, I would only specify the compute node scope. It may be beneficial to reference the management node if you are checking CSNs.

Author

These are health checks after different stages of the upgrade. I have mentioned that.

Contributor

What actions need to be performed by customer if the checks fail?

Author

I have provided the link for this.


### Workflow of workload health check

1. **Snapshot Initiation** - A snapshot is collected for all registered workloads (Nexus Kubernetes Cluster and Virtual Machine) before the server upgrade starts.
2. **Upgrade Stage Transitions** - After each upgrade stage (control-plane, management-plane, and compute servers) completes, a comparison of the workload inventory is initiated.
3. **Comparison Process** - The current workloads are compared with the snapshot taken at the start of the upgrade, and the comparison status is reported.
4. **Health Check Handling** - On success, the upgrade proceeds to the next stage. On failure, handling depends on whether the inventory readiness check feature is enabled or disabled, as shown in the following table.

| Upgrade Stage            | UpgradeInventoryChecks Enabled      | UpgradeInventoryChecks Disabled |
|--------------------------|-------------------------------------|---------------------------------|
| Initial Snapshot         | Upgrade failure                     | Upgrade continues to next stage |
| Control Plane Upgrade    | Upgrade failure                     | Upgrade continues to next stage |
| Management Plane Upgrade | Upgrade failure                     | Upgrade continues to next stage |
| Compute server Upgrade   | Upgrade paused, continue with CCUVA | Upgrade continues to next stage |

Can we clearly state in the heading of the stage column that this is the failure case? Something along the lines of "Failure at Upgrade Stage". I'm trying to see if we can put something that says what happens if there is a failure in this stage.

Author

The above statement says the table is for upgrade failure.
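
The snapshot-and-compare workflow and the failure handling in this section can be sketched in Python as follows. This is an illustration only, not the platform's implementation or API: the names `Stage`, `Outcome`, `compare_inventory`, and `handle_health_check` are hypothetical, and modeling an inventory as a mapping from workload name to a state string is an assumption.

```python
from enum import Enum


class Stage(Enum):
    # Upgrade stages as listed in the table in this section.
    INITIAL_SNAPSHOT = "Initial Snapshot"
    CONTROL_PLANE = "Control Plane Upgrade"
    MANAGEMENT_PLANE = "Management Plane Upgrade"
    COMPUTE_SERVER = "Compute server Upgrade"


class Outcome(Enum):
    CONTINUE = "Upgrade continues to next stage"
    FAIL = "Upgrade failure"
    PAUSE = "Upgrade paused"


def compare_inventory(snapshot: dict, current: dict) -> list:
    """Return workloads from the snapshot that are missing or changed state.

    Assumption: a workload counts as failing the comparison when its current
    state differs from the state recorded in the initial snapshot.
    """
    return sorted(
        name for name, state in snapshot.items() if current.get(name) != state
    )


def handle_health_check(stage: Stage, failed: list, checks_enabled: bool) -> Outcome:
    """Map a comparison result to an upgrade outcome, following the table."""
    if not failed:
        return Outcome.CONTINUE  # success: proceed to the next stage
    if not checks_enabled:
        return Outcome.CONTINUE  # feature disabled: failure does not block
    if stage is Stage.COMPUTE_SERVER:
        return Outcome.PAUSE     # feature enabled: pause after a compute rack
    return Outcome.FAIL          # feature enabled: earlier stages fail


if __name__ == "__main__":
    # Hypothetical workload names and states, for illustration only.
    snapshot = {"nexus-k8s-cluster-1": "healthy", "vm-1": "healthy"}
    current = {"nexus-k8s-cluster-1": "healthy", "vm-1": "degraded"}
    failed = compare_inventory(snapshot, current)
    print(failed)
    print(handle_health_check(Stage.COMPUTE_SERVER, failed, checks_enabled=True))
```

Treating the two concerns as separate functions keeps the comparison (what changed since the snapshot) independent of the policy (what the feature flag does with a failure), which mirrors how the table varies only by stage and flag state.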


## BareMetalMachine (BMM) keyset operations during cluster runtime upgrade

When a server is upgraded to utilize a new OS, the BMM keysets have to be re-established with the new software. This process starts once the runtime upgrade completes for the instance. Servers yet to undergo a runtime upgrade can still be accessed via the BMM keyset. If access to a machine is needed during the upgrade, the console user is available.