The current configuration for the systemd unit files monitors the active state like this:

name: systemd-services-monitoring
rules:
  - alert: service-down-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="active"} == 0
    labels:
      severity: page
    annotations:
      summary: Pacemaker service not running
This leads to false-positive alerts during maintenance work or other tasks when systemd units are stopped by an admin.
I would suggest changing the monitoring rule from active to failed:
name: systemd-services-monitoring
rules:
  - alert: service-failed-pacemaker
    expr: node_systemd_unit_state{name="pacemaker.service", state="failed"} == 1
    labels:
      severity: page
    annotations:
      summary: Pacemaker service could not start or has crashed.
This would result in fewer calls in situations where a systemd unit is stopped for maintenance.
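To illustrate the difference, here is a small sketch (plain Python, with made-up sample values for node_systemd_unit_state; the real evaluation happens in Prometheus) of how each expression behaves for a running, an admin-stopped, and a crashed unit:

```python
# Made-up sample values of node_systemd_unit_state{name="pacemaker.service", state=...}
# for three scenarios; each state label is a 0/1 gauge.
scenarios = {
    "running":          {"active": 1, "inactive": 0, "failed": 0},
    "stopped by admin": {"active": 0, "inactive": 1, "failed": 0},
    "crashed":          {"active": 0, "inactive": 0, "failed": 1},
}

results = {}
for name, state in scenarios.items():
    results[name] = {
        # old rule: node_systemd_unit_state{..., state="active"} == 0
        "old_rule_fires": state["active"] == 0,
        # new rule: node_systemd_unit_state{..., state="failed"} == 1
        "new_rule_fires": state["failed"] == 1,
    }

for name, r in results.items():
    print(name, r)
```

The admin-stopped scenario is exactly the false positive: the active-based rule fires, while the failed-based rule stays quiet.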
@yeoldegrove sorry, I didn't see your request.
We may have to consider a combination of "service is enabled" and "service not started" as well; this would reflect the original idea better than my suggestion. I'll try to figure out this rule and can create a PR, but I can't say when it will happen.
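A rough sketch of that combined condition (plain Python, not a finished PromQL rule; whether the unit-file enabled state is even available as a metric is still open, so this only illustrates the intended logic):

```python
def should_page(enabled: bool, active: bool) -> bool:
    """Page only when a unit is supposed to run (enabled) but is not running.

    Illustration only: the unit-file "enabled" state would have to come
    from somewhere before this could be turned into an alert expression.
    """
    return enabled and not active

# Enabled but stopped (e.g. forgotten after maintenance): page.
print(should_page(enabled=True, active=False))
# Deliberately disabled and stopped for maintenance: stay quiet.
print(should_page(enabled=False, active=False))
```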
If we go this way, we could also think about shortening the list and using a single catch-all configuration like this:
expr: node_systemd_unit_state{state="failed"} == 1
for: 1m
labels:
  severity: page
annotations:
  description: |-
    systemd service crashed
    VALUE = {{ $value }}
    LABELS = {{ $labels }}
  summary: Host systemd service crashed (instance {{ $labels.instance }})
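As a sketch of what the catch-all expression would match (plain Python over made-up sample series; the instance and service names are illustrative only, and the real filtering and templating happen in Prometheus):

```python
# Made-up sample series for node_systemd_unit_state on two hosts.
series = [
    {"instance": "node1:9100", "name": "pacemaker.service", "state": "failed", "value": 0},
    {"instance": "node1:9100", "name": "corosync.service", "state": "failed", "value": 1},
    {"instance": "node2:9100", "name": "pacemaker.service", "state": "failed", "value": 1},
]

# expr: node_systemd_unit_state{state="failed"} == 1
firing = [s for s in series if s["state"] == "failed" and s["value"] == 1]

# Render the summary annotation the way the template would per firing series.
summaries = [f'Host systemd service crashed (instance {s["instance"]})' for s in firing]
for line in summaries:
    print(line)
```

One rule then covers every unit on every host, instead of one hand-written rule per service.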