
Restructure of update health monitor inventory representation #9876

@karencfv

Description


Background

When I initially started working on the health monitor for update, we landed on a design for the inventory API where each sled agent had a designated health-monitor object which would contain different health checks. Each health check is done asynchronously every minute as a background job:

$ curl -H "api-version: 16.0.0"  http://[::1]:41364/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.713176608Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          {
            "zpool": "fakepool1",
            "state": "degraded"
          },
          {
            "zpool": "fakepool2",
            "state": "degraded"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.690492960Z"
      }
    }
  }
<More health checks>
}

This approach made sense from a usability standpoint for a few reasons:

  • All health check data is contained in one place and with a clear purpose, which makes it easy for consumers to ingest.
  • Empty health check lists mean all components checked are healthy according to our parameters, so it's easy to deduce whether there is a problem or not.

Unfortunately, there are a few issues with this approach:

  • For some components we will check, we already collect data elsewhere; zpools, for example, are reported by the reconciler. This risks inconsistency: the health monitor could end up reporting on different zpools than the reconciler, since the reconciler reports data on demand while the health monitor polls every minute.
  • Not all health checks run on a "per sled" basis (stale sagas, for example, are rack-wide). Having a single health monitor object per sled would be confusing and impractical.
  • Some health checks report the state of a component as the component itself reports it (services in maintenance, zpool health), while others will be determined by self-imposed thresholds (stale sagas, dataset usage, etc.). Lumping all of the checks into one "health-monitor" object would not make these nuances obvious.

Proposal

Instead of having a single health-monitor object per sled, add health status/information to every component, or create a new entry for components that aren't listed yet. This approach also allows components that have no data recorded in inventory yet to carry more information than just health.
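A minimal sketch of what "health attached to each component" could look like in Rust. All names here are illustrative, not the actual omicron inventory types: a generic timestamped wrapper, plus an SMF service entry that reports its full state rather than only whether it is in maintenance.

```rust
use std::time::SystemTime;

/// Hypothetical sketch: each inventory component carries its own
/// health/status fields, each stamped with the time the sample was taken.
#[derive(Debug, Clone)]
pub struct TimestampedStatus<T> {
    pub status: T,
    /// When this sample was collected; may predate the inventory
    /// collection itself if it comes from a polling background task.
    pub time_of_status: SystemTime,
}

/// Illustrative SMF service states, so consumers see more than just
/// "in maintenance or not".
#[derive(Debug, Clone, PartialEq)]
pub enum SmfServiceState {
    Online,
    Offline,
    Maintenance,
    Degraded,
}

/// Example per-component inventory entry (names are hypothetical).
#[derive(Debug, Clone)]
pub struct SmfServiceInventory {
    pub fmri: String,
    pub zone: String,
    pub state: TimestampedStatus<SmfServiceState>,
}
```

The key design point is that the wrapper is generic: zpools, datasets, or any new component can reuse the same timestamped shape without a central "health-monitor" object.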

Taking SMF services as an example: with the current approach, we only list services in maintenance. That alone could prove to be insufficient information to gauge a system's health (e.g. #9855 (comment)); with the proposed approach we would have more information to go on.

In addition to health information/status, each item should have a timestamp of when the sample was taken. This will help with accuracy and avoid relying on assumptions.

It will be the job of the consumer to decide whether a status meets the criteria of a failed health check or not.
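To make "the consumer decides" concrete, here is a sketch of a consumer-side check under one of the self-imposed thresholds from the checklist (dataset usage over 80% of quota). The function name and signature are hypothetical; the inventory would report only the raw numbers and their timestamp, and the consumer applies the threshold:

```rust
/// Illustrative consumer-side evaluation: the inventory reports raw
/// usage and quota, and the consumer decides whether the 80%-of-quota
/// threshold is exceeded. Names are hypothetical, not omicron APIs.
fn dataset_usage_unhealthy(used_bytes: u64, quota_bytes: u64) -> bool {
    // Compare used * 100 against quota * 80 in a wider type to avoid
    // both floating point and overflow.
    (used_bytes as u128) * 100 > (quota_bytes as u128) * 80
}
```

Keeping thresholds out of the inventory means they can be tuned (or vary per deployment) without changing what sled agents report.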

I have also assessed the current manual health check script, and have a list of what health status information we'll want to include. It'd be great to have input from others on this:

| Health check | Scope | Necessary? |
| --- | --- | --- |
| Uptime | per sled | no |
| Kernel low memory scan | per sled | maybe? Would need to have a threshold to compare the result against. May need to run the command on a loop for a bit as well. |
| SMF service health | per sled/zone | yes - priority |
| Core file or kernel core file | per sled | no; there could be leftover core files that don't mean anything |
| NVMe firmware version | per sled | possibly necessary, but not urgent |
| Expected physical disk count | per sled | yes |
| Zpool health | per sled | yes - priority |
| Unmounted dataset | per sled | yes. Will need to make sure the datasets are ones that matter (e.g. not support bundles) |
| Expected crucible zone count | per rack | yes |
| Dataset usage over 80% of quota, or root dataset with <150 GiB avail | per sled | yes |
| Crucible free space when total capacity minus total used < 5 TiB | per rack | yes |
| DIMM expected memory (1 or 2 TB) | per rack | yes |
| Missing components for MGS-driven updates | per rack | yes |
| Stale sagas (not on the script, but we want to add this check) | per rack | yes - priority |

Open questions

To avoid triggering a bunch of expensive calls when an inventory collection is created, @jgallagher suggested we reuse the intermittent polling background tasks in the existing health monitor. This could, however, have the unintended consequence of the data being older than the inventory collection itself. Is this acceptable? How would this impact the reconciler?
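If samples come from once-a-minute polling tasks rather than from the collection itself, one mitigation is for consumers to check sample age against an explicit freshness bound. A sketch, with a purely illustrative helper and threshold:

```rust
use std::time::{Duration, SystemTime};

/// Sketch of a consumer-side guard against stale samples when health
/// data comes from a periodic polling task instead of being gathered
/// with the inventory collection. The max-age bound is illustrative.
fn sample_is_fresh(
    time_of_status: SystemTime,
    now: SystemTime,
    max_age: Duration,
) -> bool {
    match now.duration_since(time_of_status) {
        Ok(age) => age <= max_age,
        // A sample "from the future" (clock skew between sleds) is
        // treated as fresh rather than rejected.
        Err(_) => true,
    }
}
```

This is one reason each item carrying its own `time_of_status` (as proposed above) matters: without it, consumers cannot distinguish a fresh sample from one that predates the collection.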
