The current configuration for the systemd unit files monitors the active state like this:

name: systemd-services-monitoring
rules:
  - alert: service-down-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
    labels:
      severity: page
    annotations:
      summary: Pacemaker service not running
This leads to false-positive alerts during maintenance work or other tasks when systemd units are stopped by an admin.
I would suggest changing the monitoring rule from active to failed:
name: systemd-services-monitoring
rules:
  - alert: service-failed-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
    labels:
      severity: page
    annotations:
      summary: Pacemaker service could not start or has crashed.
This would result in fewer calls in situations where a systemd unit is stopped for maintenance.
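To illustrate the difference, here is a small sketch (plain Python, with made-up sample values for node_systemd_unit_state; the real evaluation happens in Prometheus) of how each expression behaves for a running, an admin-stopped, and a crashed unit:

```python
# Made-up sample values of node_systemd_unit_state{name="pacemaker.service", state=...}
# for three scenarios; each state label is a 0/1 gauge.
scenarios = {
    "running":          {"active": 1, "inactive": 0, "failed": 0},
    "stopped by admin": {"active": 0, "inactive": 1, "failed": 0},
    "crashed":          {"active": 0, "inactive": 0, "failed": 1},
}

results = {}
for name, state in scenarios.items():
    results[name] = {
        # old rule: node_systemd_unit_state{..., state="active"} == 0
        "old_rule_fires": state["active"] == 0,
        # new rule: node_systemd_unit_state{..., state="failed"} == 1
        "new_rule_fires": state["failed"] == 1,
    }

for name, r in results.items():
    print(name, r)
```

The admin-stopped scenario is exactly the false positive: the active-based rule fires, while the failed-based rule stays quiet.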
@yeoldegrove sorry, I didn't see your request.
We may have to consider a combination of "service is enabled" and "service not started" as well; this would reflect the original idea better than my suggestion. I'll try to figure out this rule and can create a PR, but I can't say when it will happen.
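A rough sketch of that combined condition (plain Python, not a finished PromQL rule; whether the unit-file enabled state is even available as a metric is still open, so this only illustrates the intended logic):

```python
def should_page(enabled: bool, active: bool) -> bool:
    """Page only when a unit is supposed to run (enabled) but is not running.

    Illustration only: the unit-file "enabled" state would have to come
    from somewhere before this could be turned into an alert expression.
    """
    return enabled and not active

# Enabled but stopped (e.g. forgotten after maintenance): page.
print(should_page(enabled=True, active=False))
# Deliberately disabled and stopped for maintenance: stay quiet.
print(should_page(enabled=False, active=False))
```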
If we go this way, we could also think about shortening the list and using a single catch-all configuration like this:
expr: node_systemd_unit_state{state="failed"} == 1
for: 1m
labels:
  severity: page
annotations:
  description: |-
    systemd service crashed
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Host systemd service crashed (instance {{ $labels.instance }})
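As a sketch of what the catch-all expression would match (plain Python over made-up sample series; the instance and service names are illustrative only, and the real filtering and templating happen in Prometheus):

```python
# Made-up sample series for node_systemd_unit_state on two hosts.
series = [
    {"instance": "node1:9100", "name": "pacemaker.service", "state": "failed", "value": 0},
    {"instance": "node1:9100", "name": "corosync.service", "state": "failed", "value": 1},
    {"instance": "node2:9100", "name": "pacemaker.service", "state": "failed", "value": 1},
]

# expr: node_systemd_unit_state{state="failed"} == 1
firing = [s for s in series if s["state"] == "failed" and s["value"] == 1]

# Render the summary annotation the way the template would per firing series.
summaries = [f'Host systemd service crashed (instance {s["instance"]})' for s in firing]
for line in summaries:
    print(line)
```

One rule then covers every unit on every host, instead of one hand-written rule per service.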