Background
When I initially started working on the health monitor for update, we landed on a design for the inventory API where each sled agent had a designated health-monitor object which would contain different health checks. Each health check is done asynchronously every minute as a background job:
```console
$ curl -H "api-version: 16.0.0" http://[::1]:41364/inventory | jq
<...>
  "health_monitor": {
    "smf_services_in_maintenance": {
      "ok": {
        "services": [
          {
            "fmri": "svc:/site/fake-service2:default",
            "zone": "global"
          },
          {
            "fmri": "svc:/site/fake-service:default",
            "zone": "global"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.713176608Z"
      }
    },
    "unhealthy_zpools": {
      "ok": {
        "zpools": [
          {
            "zpool": "fakepool1",
            "state": "degraded"
          },
          {
            "zpool": "fakepool2",
            "state": "degraded"
          }
        ],
        "errors": [],
        "time_of_status": "2026-01-19T06:03:13.690492960Z"
      }
    }
  }
  <More health checks>
}
```

This approach made sense from a usability standpoint for a few reasons:
- All health check data is contained in one place with a clear purpose, which makes it easy for consumers to ingest.
- Empty health check lists mean all checked components are healthy according to the parameters we have, so it's very easy to deduce whether or not there is a problem.
Unfortunately, there are a few issues with this approach:
- For some components that will be checked for health, we are already collecting data elsewhere, like zpools in the reconciler. This is a problem: the health monitor could end up reporting on different zpools than the reconciler, since the reconciler reports data on demand whereas the health monitor polls every minute.
- Not all health checks run on a "per sled" basis; stale sagas, for example, are per rack. Having a single health monitor object per sled would be confusing and impractical.
- Some health checks report the state of a component as the component itself reports it, like services in maintenance or zpool health, while others are determined by self-imposed thresholds, like stale sagas or dataset usage. Lumping all of the health checks into a single "health-monitor" object would not make these nuances obvious.
Proposal
Instead of having a single health-monitor object per sled, add health status/information to every component, and create a new entry for components that aren't listed yet. This approach also allows new components that have no data recorded in inventory yet to contain more information than just health.
Taking SMF services as an example: with the current approach, we only list services in maintenance. That alone could prove to be insufficient information to gauge a system's health (e.g. #9855 (comment)); reporting full service state would give us more information to go on.
In addition to health information/status, each item should have a timestamp of when the sample was taken. This will help with accuracy and avoid relying on assumptions.
It will be the job of the consumer to decide whether a status meets the criteria of a failed health check or not.
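As a rough sketch of what a per-component entry could look like, using SMF services as the example (all type and function names here are hypothetical, not actual Omicron types): each item carries its full state plus a sample timestamp, and the consumer applies its own failure criteria.

```rust
use std::time::SystemTime;

// Hypothetical sketch of a per-component inventory entry: the full SMF
// service state (not just "in maintenance") plus a sample timestamp.
#[derive(Debug, PartialEq)]
pub enum SmfState {
    Online,
    Offline,
    Degraded,
    Maintenance,
}

#[derive(Debug)]
pub struct SmfServiceStatus {
    pub fmri: String,
    pub zone: String,
    pub state: SmfState,
    // When this sample was taken, so consumers don't have to assume it
    // matches the inventory collection's own timestamp.
    pub time_of_status: SystemTime,
}

// The consumer, not sled agent, decides what counts as a failed check.
pub fn is_failed(status: &SmfServiceStatus) -> bool {
    matches!(status.state, SmfState::Maintenance | SmfState::Degraded)
}
```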
I have also assessed the current manual health check script, and have a list of what health status information we'll want to include. It'd be great to have input from others on this:
| Health check | Scope | Necessary? |
|---|---|---|
| Uptime | per sled | no |
| Kernel low memory scan | per sled | maybe? Would need a threshold to compare the result against, and may need to run the command in a loop for a bit as well. |
| SMF service health | per sled/zone | yes - priority |
| Core file or kernel core file | per sled | no; there could be leftover core files that don't mean anything |
| NVMe firmware version | per sled | possibly necessary, but not urgent |
| Expected physical disk count | per sled | yes |
| Zpool health | per sled | yes - priority |
| Unmounted dataset | per sled | yes. Will need to make sure the datasets are ones that matter (e.g. not support bundles) |
| Expected crucible zone count | per rack | yes |
| Dataset usage over 80% of quota, or root dataset with <150 GiB available | per sled | yes |
| Crucible free space: total capacity minus total used < 5 TiB | per rack | yes |
| DIMM expected memory (1 or 2 TB) | per rack | yes |
| Missing components for MGS-driven updates | per rack | yes |
| Stale sagas (not in the script, but we want to add this check) | per rack | yes - priority |
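Several of the rows above are threshold checks that a consumer would evaluate against reported numbers, rather than states a component reports about itself. A minimal sketch of those comparisons (constants taken from the table; function names are made up for illustration):

```rust
const GIB: u64 = 1 << 30;
const TIB: u64 = 1 << 40;

// Dataset usage over 80% of quota.
fn dataset_over_quota(used_bytes: u64, quota_bytes: u64) -> bool {
    // Multiply before dividing to stay in integer arithmetic.
    used_bytes * 100 > quota_bytes * 80
}

// Root dataset with < 150 GiB available.
fn root_dataset_low(avail_bytes: u64) -> bool {
    avail_bytes < 150 * GIB
}

// Crucible free space: total capacity minus total used < 5 TiB.
fn crucible_low_space(capacity_bytes: u64, used_bytes: u64) -> bool {
    capacity_bytes.saturating_sub(used_bytes) < 5 * TIB
}
```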
Open questions
To avoid triggering a bunch of expensive calls when an inventory collection is created, @jgallagher suggested we use the intermittent polling background tasks in the existing health monitor. This could also have the unintended consequence of data being older than the inventory collection itself. Is this acceptable? How would this impact the reconciler?