Background
When I initially started working on the health monitor for update, we landed on a design for the inventory API where each sled agent had a designated health-monitor object which would contain different health checks. Each health check is done asynchronously every minute as a background job:
```console
$ curl -H "api-version: 16.0.0" http://[::1]:41364/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.713176608Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          {
            "zpool": "fakepool1",
            "state": "degraded"
          },
          {
            "zpool": "fakepool2",
            "state": "degraded"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.690492960Z"
      }
    }
  }
  <More health checks>
}
```

This approach made sense from a usability standpoint for a few reasons:
- All health check data is contained in one place with a clear purpose, which makes it easy for consumers to ingest.
- Empty health check lists mean all checked components are healthy according to the parameters we have, so it's very easy to deduce whether or not there is a problem.
Unfortunately, there are a few issues with this approach:
- For some components that will be checked for health, we are already collecting data elsewhere, like zpools in the reconciler. This is a problem: the health monitor could end up reporting on different zpools than the reconciler, since the reconciler reports data on demand whereas the health monitor polls every minute.
- Not all health checks run on a "per sled" basis; stale sagas, for example, are per rack. Having a single health monitor object per sled would be confusing and impractical.
- Some health checks report the state of a component as the component itself reports it, like services in maintenance or zpool health, while others are determined by self-imposed thresholds, like stale sagas or dataset usage. Lumping all of the health checks into a single "health-monitor" object would not make these nuances obvious.
Proposal
Instead of having a single health-monitor object per sled, add health status/information to every component, and create a new entry for components that aren't listed yet. This approach also allows new components that have no data recorded in inventory yet to contain more information than just health.
Taking SMF services as an example: with the current approach, we only list services in maintenance. That alone could prove to be insufficient information to gauge a system's health (e.g. #9855 (comment)); reporting full service state would give us more information to go on.
In addition to health information/status, each item should have a timestamp of when the sample was taken. This will help with accuracy and avoid relying on assumptions.
It will be the job of the consumer to decide whether a status meets the criteria of a failed health check or not.
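As a rough sketch of what a per-component entry could look like, using SMF services as the example (all type and function names here are hypothetical, not actual Omicron types): each item carries its full state plus a sample timestamp, and the consumer applies its own failure criteria.

```rust
use std::time::SystemTime;

// Hypothetical sketch of a per-component inventory entry: the full SMF
// service state (not just "in maintenance") plus a sample timestamp.
#[derive(Debug, PartialEq)]
pub enum SmfState {
    Online,
    Offline,
    Degraded,
    Maintenance,
}

#[derive(Debug)]
pub struct SmfServiceStatus {
    pub fmri: String,
    pub zone: String,
    pub state: SmfState,
    // When this sample was taken, so consumers don't have to assume it
    // matches the inventory collection's own timestamp.
    pub time_of_status: SystemTime,
}

// The consumer, not sled agent, decides what counts as a failed check.
pub fn is_failed(status: &SmfServiceStatus) -> bool {
    matches!(status.state, SmfState::Maintenance | SmfState::Degraded)
}
```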
I have also assessed the current manual health check script, and have a list of what health status information we'll want to include. It'd be great to have input from others on this:
| Health check | Scope | Necessary? |
|---|---|---|
| Uptime | per sled | no |
| Kernel low memory scan | per sled | maybe? Would need a threshold to compare the result against, and may need to run the command in a loop for a bit as well. |
| SMF service health | per sled/zone | yes - priority |
| Core file or kernel core file | per sled | no; there could be leftover core files that don't mean anything |
| NVMe firmware version | per sled | possibly necessary, but not urgent |
| Expected physical disk count | per sled | yes |
| Zpool health | per sled | yes - priority |
| Unmounted dataset | per sled | yes. Will need to make sure the datasets are ones that matter (e.g. not support bundles) |
| Expected crucible zone count | per rack | yes |
| Dataset usage over 80% of quota, or root dataset with <150 GiB available | per sled | yes |
| Crucible free space: total capacity minus total used < 5 TiB | per rack | yes |
| DIMM expected memory (1 or 2 TB) | per rack | yes |
| Missing components for MGS-driven updates | per rack | yes |
| Stale sagas (not in the script, but we want to add this check) | per rack | yes - priority |
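Several of the rows above are threshold checks that a consumer would evaluate against reported numbers, rather than states a component reports about itself. A minimal sketch of those comparisons (constants taken from the table; function names are made up for illustration):

```rust
const GIB: u64 = 1 << 30;
const TIB: u64 = 1 << 40;

// Dataset usage over 80% of quota.
fn dataset_over_quota(used_bytes: u64, quota_bytes: u64) -> bool {
    // Multiply before dividing to stay in integer arithmetic.
    used_bytes * 100 > quota_bytes * 80
}

// Root dataset with < 150 GiB available.
fn root_dataset_low(avail_bytes: u64) -> bool {
    avail_bytes < 150 * GIB
}

// Crucible free space: total capacity minus total used < 5 TiB.
fn crucible_low_space(capacity_bytes: u64, used_bytes: u64) -> bool {
    capacity_bytes.saturating_sub(used_bytes) < 5 * TIB
}
```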
Open questions
To avoid triggering a bunch of expensive calls when an inventory collection is created, @jgallagher suggested we use the intermittent polling background tasks in the existing health monitor. This could also have the unintended consequence of data being older than the inventory collection itself. Is this acceptable? How would this impact the reconciler?