Commit 7e3ccc1

anandethe2hill authored and committed
feature: Grafana alert for catching Neutron IPAM Errors while attempt… (#1054)
* feature: Grafana alert for catching Neutron IPAM Errors while attempting to create new VM

  This is an alert when Neutron logs an error while attempting to assign an already assigned IP address to a VM. Log entries on the nova log side look like below:

  ```
  2025-03-20 03:19:13.009 9 ERROR oslo_db.api pymysql.err.IntegrityError: (1062, "Duplicate entry '<ip-address>-uuid-...' for key 'PRIMARY'")
  ```

* feature: Grafana alert for catching Neutron IPAM Errors while attempting to create new VM
* fixed indentation and whitespaces issues in grafana-helm-overrides.yaml rules section
* feature: Grafana alert for catching Neutron IPAM Errors while attempting to create new VM
* Fix: yaml indentation and trailing whitespaces
1 parent 74c7887 commit 7e3ccc1
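
The rule added in this commit filters logs with `{app="fluentbit"} |= `Duplicate entry|ERROR``. In LogQL, `|=` is a plain substring filter, so the `|` is matched literally; the regular-expression filter is `|~`. The sketch below is only an approximation in Python (Loki itself uses Go RE2 syntax, which behaves the same for this simple alternation). It applies both interpretations to the sample log line from the commit message. The helper names `contains_literal` and `matches_regex` are hypothetical and exist only for this illustration.

```python
import re

# Sample log line quoted in the commit message.
LOG_LINE = (
    "2025-03-20 03:19:13.009 9 ERROR oslo_db.api "
    "pymysql.err.IntegrityError: (1062, \"Duplicate entry "
    "'<ip-address>-uuid-...' for key 'PRIMARY'\")"
)

# The filter string used in the new rule's LogQL expression.
PATTERN = "Duplicate entry|ERROR"


def contains_literal(line: str, needle: str) -> bool:
    """Approximates LogQL `|=`: keep the line only if it contains the exact substring."""
    return needle in line


def matches_regex(line: str, pattern: str) -> bool:
    """Approximates LogQL `|~`: keep the line if the regex matches anywhere in it."""
    return re.search(pattern, line) is not None


if __name__ == "__main__":
    print("substring filter keeps the line:", contains_literal(LOG_LINE, PATTERN))
    print("regex filter keeps the line:", matches_regex(LOG_LINE, PATTERN))
```

Under the substring interpretation the sample line is not kept, because it never contains the literal text `Duplicate entry|ERROR`. If the intent is "either term", `|~` would express that; if the intent is "both terms", two chained `|=` filters would. Which of these the authors intended is not stated in the commit.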

base-helm-configs/grafana/grafana-helm-overrides.yaml

Lines changed: 126 additions & 76 deletions
@@ -38,93 +38,143 @@ grafana.ini:
     type: mysql
     host: mariadb-cluster.grafana.svc:3306
     user: $__file{/etc/secrets/grafana-db/username}
-    password: $__file{/etc/secrets/grafana-db/password}
+    password: $__file{/etc/secrets/grafana-db/password}
     name: grafana

 datasources:
   datasources.yaml:
     apiversion: 1
     datasources:
-      - name: Prometheus
-        type: prometheus
-        access: proxy
-        url: http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local:9090
-        isdefault: true
-      - name: Loki
-        type: loki
-        access: proxy
-        url: http://loki-gateway.{{ $.Release.Namespace }}.svc.cluster.local:80
-        editable: false
+      - name: Prometheus
+        type: prometheus
+        access: proxy
+        url: http://kube-prometheus-stack-prometheus.prometheus.svc.cluster.local:9090
+        isdefault: true
+      - name: Loki
+        type: loki
+        access: proxy
+        url: http://loki-gateway.{{ $.Release.Namespace }}.svc.cluster.local:80
+        editable: false

 alerting:
   rules.yaml:
     groups:
-      - orgId: 1
-        name: loki 1 min eval
-        folder: rules
-        interval: 1m
-        rules:
-          - uid: ba943125-33ca-4e4e-85f8-13359a8e4d65
-            title: OVN claim storm
-            condition: B
-            data:
-              - refId: A
+      - orgId: 1
+        name: loki 1 min eval
+        folder: rules
+        interval: 1m
+        rules:
+          - uid: ba943125-33ca-4e4e-85f8-13359a8e4d65
+            title: OVN claim storm
+            condition: B
+            data:
+              - refId: A
+                queryType: instant
+                relativeTimeRange:
+                  from: 60
+                  to: 0
+                datasourceUid: P8E80F9AEF21F6940
+                model:
+                  editorMode: builder
+                  expr: rate({app="ovs"} |= `binding|INFO|cr-lrp` [1m])
+                  intervalMs: 60000
+                  maxDataPoints: 43200
+                  queryType: instant
+                  refId: A
+              - refId: B
+                relativeTimeRange:
+                  from: 60
+                  to: 0
+                datasourceUid: __expr__
+                model:
+                  conditions:
+                    - evaluator:
+                        params:
+                          - 1
+                          - 0
+                        type: gt
+                      operator:
+                        type: and
+                      query:
+                        params: []
+                      reducer:
+                        params: []
+                        type: avg
+                      type: query
+                  datasource:
+                    name: Expression
+                    type: __expr__
+                    uid: __expr__
+                  expression: A
+                  intervalMs: 1000
+                  maxDataPoints: 43200
+                  refId: B
+                  type: threshold
+            noDataState: OK
+            execErrState: Error
+            for: 0s
+            notifications:
+              - uid: prom-alertmanager-notification
+            annotations:
+              description: >-
+                Checks app=ovs (ovs-ovn) pod logs for lines with string
+                'binding|INFO|cr-lrp'
+              summary: >-
+                This alerts on rapid port claims for cr-lrp ports on OVN
+                gateway nodes, which overloads the OVN south database and
+                interferes with the function of the affected ports.
+            labels: {}
+            isPaused: false
+          # Generated UUID using 'uuidgen'
+          - uid: c14dd8fd-54ec-4e15-9813-e02cc3269899
+            title: Neutron IPAM Duplicate Entry Error
+            condition: C
+            data:
+              - refId: A
+                queryType: instant
+                relativeTimeRange:
+                  from: 60
+                  to: 0
+                # Using same loki datasource as rule#ba943125-33ca-4e4e-85f8-13359a8e4d65
+                datasourceUid: P8E80F9AEF21F6940
+                model:
+                  expr: rate({app="fluentbit"} |= `Duplicate entry|ERROR` [1m])
                   queryType: instant
-                relativeTimeRange:
-                  from: 60
-                  to: 0
-                datasourceUid: P8E80F9AEF21F6940
-                model:
-                  editorMode: builder
-                  expr: rate({app="ovs"} |= `binding|INFO|cr-lrp` [1m])
-                  intervalMs: 60000
-                  maxDataPoints: 43200
-                  queryType: instant
-                  refId: A
-              - refId: B
-                relativeTimeRange:
-                  from: 60
-                  to: 0
-                datasourceUid: __expr__
-                model:
-                  conditions:
-                    - evaluator:
-                        params:
-                          - 1
-                          - 0
-                        type: gt
-                      operator:
-                        type: and
-                      query:
-                        params: []
-                      reducer:
-                        params: []
-                        type: avg
-                      type: query
-                  datasource:
-                    name: Expression
-                    type: __expr__
-                    uid: __expr__
-                  expression: A
-                  intervalMs: 1000
-                  maxDataPoints: 43200
-                  refId: B
-                  type: threshold
-            noDataState: OK
-            execErrState: Error
-            for: 0s
-            notifications:
-              - uid: prom-alertmanager-notification
-            annotations:
-              description: >-
-                Checks app=ovs (ovs-ovn) pod logs for lines with string
-                'binding|INFO|cr-lrp'
-              summary: >-
-                This alerts on rapid port claims for cr-lrp ports on OVN
-                gateway nodes, which overloads the OVN south database and
-                interferes with the function of the affected ports.
-            labels: {}
-            isPaused: false
+                  refId: A
+              - refId: B
+                relativeTimeRange:
+                  # Past 60 seconds (can be adjusted further)
+                  from: 60
+                  # 0 denotes till current time
+                  to: 0
+                datasourceUid: __expr__
+                model:
+                  conditions:
+                    - evaluator:
+                        params:
+                          - 1
+                          - 0
+                        type: gt
+                      operator:
+                        type: and
+                      reducer:
+                        type: avg
+                      type: query
+                  datasource:
+                    name: Expression
+                    type: __expr__
+                    uid: __expr__
+                  expression: A
+                  refId: B
+                  type: threshold
+            noDataState: OK
+            execErrState: Error
+            notifications:
+              - uid: prom-alertmanager-notification
+            annotations:
+              summary: >
+                Checks for log lines containing 'Duplicate entry|ERROR' in nova logs.
+            isPaused: false
   contactpoints.yaml:
     secret:
       apiVersion: 1
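
In Grafana's alert-rule provisioning format, `condition` names the refId of the query or expression whose result decides whether the rule fires. The OVN rule sets `condition: B` and defines refIds A and B, while the new Neutron rule sets `condition: C` but also defines only A and B. A minimal sketch of a consistency check follows, assuming the rules have already been loaded into plain dictionaries. The function names `undefined_condition` and `check` are hypothetical and not part of Grafana or this repository, and the `RULES` list is a trimmed copy of the two rules holding only the fields the check reads.

```python
from typing import Optional


def undefined_condition(rule: dict) -> Optional[str]:
    """Return the rule's condition refId if no data entry defines it, else None."""
    defined = {entry["refId"] for entry in rule.get("data", [])}
    condition = rule.get("condition")
    return None if condition in defined else condition


def check(rules: list) -> dict:
    """Map each rule title to its dangling condition refId (None when consistent)."""
    return {rule["title"]: undefined_condition(rule) for rule in rules}


# Trimmed copies of the two rules from the diff (illustration only).
RULES = [
    {
        "title": "OVN claim storm",
        "condition": "B",
        "data": [{"refId": "A"}, {"refId": "B"}],
    },
    {
        "title": "Neutron IPAM Duplicate Entry Error",
        "condition": "C",
        "data": [{"refId": "A"}, {"refId": "B"}],
    },
]

if __name__ == "__main__":
    for title, dangling in check(RULES).items():
        status = "ok" if dangling is None else f"condition {dangling!r} is not defined in data"
        print(f"{title}: {status}")
```

A check like this could run in CI before the Helm overrides are applied. Whether the Neutron rule should point at B or gain a third expression with refId C is a decision for the rule's authors.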
