Commit ca58d6a

pracucci and tacole02 authored
Mixin: remove support for the experimental read-write deployment mode (#11975)
#### What this PR does

The read-write deployment mode has never been promoted to stable (it is still experimental), and we decided to remove it completely in Mimir 3.0. In this PR I propose to remove the support from dashboards and alerts.

#### Which issue(s) this PR fixes or relates to

Part of #11887

#### Checklist

- [ ] Tests updated.
- [ ] Documentation added.
- [x] `CHANGELOG.md` updated - the order of entries should be `[CHANGE]`, `[FEATURE]`, `[ENHANCEMENT]`, `[BUGFIX]`. If changelog entry is not needed, please add the `changelog-not-needed` label to the PR.
- [ ] [`about-versioning.md`](https://github.com/grafana/mimir/blob/main/docs/sources/mimir/configure/about-versioning.md) updated with experimental features.

---------

Signed-off-by: Marco Pracucci <[email protected]>
Co-authored-by: Taylor C <[email protected]>
1 parent 79bb448 commit ca58d6a

File tree

60 files changed

+2024
-2067
lines changed

CHANGELOG.md

Lines changed: 3 additions & 2 deletions
@@ -137,6 +137,9 @@
 
 ### Mixin
 
+* [CHANGE] Alerts: Update the query for `MimirBucketIndexNotUpdated` to use `max_over_time` to prevent alert firing when pods rotate. #11311, #11426
+* [CHANGE] Alerts: Make alerting threshold for `DistributorGcUsesTooMuchCpu` configurable. #11508.
+* [CHANGE] Remove support for the experimental read-write deployment mode. #11975
 * [ENHANCEMENT] Dashboards: Include absolute number of notifications attempted to alertmanager in 'Mimir / Ruler'. #10918
 * [ENHANCEMENT] Alerts: Make `MimirRolloutStuck` a critical alert if it has been firing for 6h. #10890
 * [ENHANCEMENT] Dashboards: Add panels to the `Mimir / Tenants` and `Mimir / Top Tenants` dashboards showing the rate of gateway requests. #10978
@@ -146,8 +149,6 @@
 * [ENHANCEMENT] Dashboards: Add "per-query memory consumption" and "fallback to Prometheus' query engine" panels to the Queries dashboard. #11626
 * [ENHANCEMENT] Alerts: Add `MimirGoThreadsTooHigh` alert. #11836 #11845
 * [ENHANCEMENT] Dashboards: Add autoscaling row for ruler query-frontends to `Mimir / Remote ruler reads` dashboard. #11838
-* [CHANGE] Alerts: Update query for `MimirBucketIndexNotUpdated`. Use `max_over_time` to prevent alert firing when pods rotate. #11311, #11426
-* [CHANGE] Alerts: Make alerting threshold for `DistributorGcUsesTooMuchCpu` configurable. #11508.
 * [BUGFIX] Dashboards: fix "Mimir / Tenants" legends for non-Kubernetes deployments. #10891
 * [BUGFIX] Dashboards: fix Query-scheduler RPS panel legend in "Mimir / Reads". #11515
 * [BUGFIX] Recording rules: fix `cluster_namespace_deployment:actual_replicas:count` recording rule when there's a mix on single-zone and multi-zone deployments. #11287

operations/helm/tests/metamonitoring-values-generated/mimir-distributed/templates/metamonitoring/grafana-dashboards.yaml

Lines changed: 496 additions & 496 deletions
Large diffs are not rendered by default.

operations/helm/tests/metamonitoring-values-generated/mimir-distributed/templates/metamonitoring/mixin-alerts.yaml

Lines changed: 13 additions & 15 deletions
@@ -136,7 +136,7 @@ spec:
 expr: |
 (
 sum by(cluster, namespace, pod) (
-increase(kube_pod_container_status_restarts_total{container=~"(ingester|mimir-write)"}[30m])
+increase(kube_pod_container_status_restarts_total{container=~"(ingester)"}[30m])
 )
 >= 2
 )
@@ -167,7 +167,7 @@ spec:
 message: '{{ $labels.job }}/{{ $labels.pod }} has a number of mmap-ed areas close to the limit.'
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirmemorymapareastoohigh
 expr: |
-process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} > 0.8
+process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|store-gateway.*|cortex|mimir)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|store-gateway.*|cortex|mimir)"} > 0.8
 for: 5m
 labels:
 severity: critical
@@ -269,8 +269,8 @@ spec:
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirringmembersmismatch
 expr: |
 (
-avg by(cluster, namespace) (sum by(cluster, namespace, pod) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"}))
-!= sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"})
+avg by(cluster, namespace) (sum by(cluster, namespace, pod) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir)",job!~".*/(ingester.*-partition)"}))
+!= sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir)",job!~".*/(ingester.*-partition)"})
 )
 and
 (
@@ -490,9 +490,9 @@ spec:
 (
 # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
 # See: https://github.com/grafana/mimir/issues/2466
-container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
+container_memory_rss{container=~"(ingester)"}
 /
-( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
+( container_spec_memory_limit_bytes{container=~"(ingester)"} > 0 )
 )
 # Match only Mimir namespaces.
 * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
@@ -509,9 +509,9 @@ spec:
 (
 # We use RSS instead of working set memory because of the ingester's extensive usage of mmap.
 # See: https://github.com/grafana/mimir/issues/2466
-container_memory_rss{container=~"(ingester|mimir-write|mimir-backend)"}
+container_memory_rss{container=~"(ingester)"}
 /
-( container_spec_memory_limit_bytes{container=~"(ingester|mimir-write|mimir-backend)"} > 0 )
+( container_spec_memory_limit_bytes{container=~"(ingester)"} > 0 )
 )
 # Match only Mimir namespaces.
 * on(cluster, namespace) group_left max by(cluster, namespace) (cortex_build_info)
@@ -615,7 +615,7 @@ spec:
 expr: |
 max by (cluster, namespace) (memberlist_client_cluster_members_count)
 >
-(sum by (cluster, namespace) (up{job=~".*/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir|mimir-write.*|mimir-read.*|mimir-backend.*)"}) + 10)
+(sum by (cluster, namespace) (up{job=~".*/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir)"}) + 10)
 for: 20m
 labels:
 severity: warning
@@ -626,7 +626,7 @@ spec:
 expr: |
 min by (cluster, namespace) (memberlist_client_cluster_members_count)
 <
-(sum by (cluster, namespace) (up{job=~".+/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir|mimir-write.*|mimir-read.*|mimir-backend.*)"}) * 0.5)
+(sum by (cluster, namespace) (up{job=~".+/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir)"}) * 0.5)
 for: 20m
 labels:
 severity: warning
@@ -985,7 +985,7 @@ spec:
 message: Mimir store-gateway in {{ $labels.cluster }}/{{ $labels.namespace }} is querying level 1 blocks, indicating the compactor may not be keeping up with compaction.
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirhighvolumelevel1blocksqueried
 expr: |
-sum by(cluster, namespace) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",out_of_order="false",job=~".*/(store-gateway.*|cortex|mimir|mimir-backend.*)"}[5m])) > 0
+sum by(cluster, namespace) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",out_of_order="false",job=~".*/(store-gateway.*|cortex|mimir)"}[5m])) > 0
 for: 6h
 labels:
 severity: warning
@@ -1055,8 +1055,7 @@ spec:
 and
 (max by(cluster, namespace, pod) (thanos_objstore_bucket_last_successful_upload_time{component="compactor"}) > 0)
 and
-# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do
-# (e.g. there are more replicas than required because running as part of mimir-backend).
+# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do.
 (sum by(cluster, namespace, pod) (rate(cortex_compactor_group_compaction_runs_started_total[24h])) > 0)
 for: 15m
 labels:
@@ -1069,8 +1068,7 @@ spec:
 expr: |
 (max by(cluster, namespace, pod) (thanos_objstore_bucket_last_successful_upload_time{component="compactor"}) == 0)
 and
-# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do
-# (e.g. there are more replicas than required because running as part of mimir-backend).
+# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do.
 (sum by(cluster, namespace, pod) (rate(cortex_compactor_group_compaction_runs_started_total[24h])) > 0)
 for: 24h
 labels:

operations/mimir-mixin-compiled-baremetal/alerts.yaml

Lines changed: 11 additions & 13 deletions
@@ -124,7 +124,7 @@ groups:
 expr: |
 (
 sum by(cluster, namespace, instance) (
-increase(kube_pod_container_status_restarts_total{container=~"(ingester|mimir-write)"}[30m])
+increase(kube_pod_container_status_restarts_total{container=~"(ingester)"}[30m])
 )
 >= 2
 )
@@ -155,7 +155,7 @@ groups:
 message: '{{ $labels.job }}/{{ $labels.instance }} has a number of mmap-ed areas close to the limit.'
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirmemorymapareastoohigh
 expr: |
-process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|mimir-write.*|store-gateway.*|cortex|mimir|mimir-backend.*)"} > 0.8
+process_memory_map_areas{job=~".*/(ingester.*|cortex|mimir|store-gateway.*|cortex|mimir)"} / process_memory_map_areas_limit{job=~".*/(ingester.*|cortex|mimir|store-gateway.*|cortex|mimir)"} > 0.8
 for: 5m
 labels:
 severity: critical
@@ -257,8 +257,8 @@ groups:
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirringmembersmismatch
 expr: |
 (
-avg by(cluster, namespace) (sum by(cluster, namespace, instance) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"}))
-!= sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir|mimir-write.*)",job!~".*/(ingester.*-partition)"})
+avg by(cluster, namespace) (sum by(cluster, namespace, instance) (cortex_ring_members{name="ingester",job=~".*/(ingester.*|cortex|mimir)",job!~".*/(ingester.*-partition)"}))
+!= sum by(cluster, namespace) (up{job=~".*/(ingester.*|cortex|mimir)",job!~".*/(ingester.*-partition)"})
 )
 and
 (
@@ -476,7 +476,7 @@ groups:
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
 expr: |
 (
-process_resident_memory_bytes{job=~".*/(ingester|mimir-write|mimir-backend)"}
+process_resident_memory_bytes{job=~".*/(ingester)"}
 /
 on(instance) node_memory_MemTotal_bytes{}
 ) > 0.65
@@ -490,7 +490,7 @@ groups:
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirallocatingtoomuchmemory
 expr: |
 (
-process_resident_memory_bytes{job=~".*/(ingester|mimir-write|mimir-backend)"}
+process_resident_memory_bytes{job=~".*/(ingester)"}
 /
 on(instance) node_memory_MemTotal_bytes{}
 ) > 0.8
@@ -593,7 +593,7 @@ groups:
 expr: |
 max by (cluster, namespace) (memberlist_client_cluster_members_count)
 >
-(sum by (cluster, namespace) (up{job=~".*/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir|mimir-write.*|mimir-read.*|mimir-backend.*)"}) + 10)
+(sum by (cluster, namespace) (up{job=~".*/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir)"}) + 10)
 for: 20m
 labels:
 severity: warning
@@ -604,7 +604,7 @@ groups:
 expr: |
 min by (cluster, namespace) (memberlist_client_cluster_members_count)
 <
-(sum by (cluster, namespace) (up{job=~".+/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir|mimir-write.*|mimir-read.*|mimir-backend.*)"}) * 0.5)
+(sum by (cluster, namespace) (up{job=~".+/(admin-api|alertmanager|compactor.*|distributor.*|ingester.*|query-frontend.*|querier.*|ruler|ruler-querier.*|store-gateway.*|cortex|mimir)"}) * 0.5)
 for: 20m
 labels:
 severity: warning
@@ -959,7 +959,7 @@ groups:
 message: Mimir store-gateway in {{ $labels.cluster }}/{{ $labels.namespace }} is querying level 1 blocks, indicating the compactor may not be keeping up with compaction.
 runbook_url: https://grafana.com/docs/mimir/latest/operators-guide/mimir-runbooks/#mimirhighvolumelevel1blocksqueried
 expr: |
-sum by(cluster, namespace) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",out_of_order="false",job=~".*/(store-gateway.*|cortex|mimir|mimir-backend.*)"}[5m])) > 0
+sum by(cluster, namespace) (rate(cortex_bucket_store_series_blocks_queried_sum{component="store-gateway",level="1",out_of_order="false",job=~".*/(store-gateway.*|cortex|mimir)"}[5m])) > 0
 for: 6h
 labels:
 severity: warning
@@ -1029,8 +1029,7 @@ groups:
 and
 (max by(cluster, namespace, instance) (thanos_objstore_bucket_last_successful_upload_time{component="compactor"}) > 0)
 and
-# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do
-# (e.g. there are more replicas than required because running as part of mimir-backend).
+# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do.
 (sum by(cluster, namespace, instance) (rate(cortex_compactor_group_compaction_runs_started_total[24h])) > 0)
 for: 15m
 labels:
@@ -1043,8 +1042,7 @@ groups:
 expr: |
 (max by(cluster, namespace, instance) (thanos_objstore_bucket_last_successful_upload_time{component="compactor"}) == 0)
 and
-# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do
-# (e.g. there are more replicas than required because running as part of mimir-backend).
+# Only if some compactions have started. We don't want to fire this alert if the compactor has nothing to do.
 (sum by(cluster, namespace, instance) (rate(cortex_compactor_group_compaction_runs_started_total[24h])) > 0)
 for: 24h
 labels:
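
The practical effect of the narrowed `job=~` matchers can be checked offline. The following is a rough sketch, not part of the commit: it approximates Prometheus' fully anchored regex matching with Python's `re.fullmatch` (Prometheus uses RE2 syntax, which agrees with Python's `re` for patterns this simple). The two patterns are copied from the `MimirRingMembersMismatch` expression in the diff; the `selects` helper and the sample `job` values are hypothetical and chosen only for illustration.

```python
import re

# Job-name patterns copied from the MimirRingMembersMismatch alert in this diff.
OLD_JOB_PATTERN = r".*/(ingester.*|cortex|mimir|mimir-write.*)"
NEW_JOB_PATTERN = r".*/(ingester.*|cortex|mimir)"


def selects(pattern: str, value: str) -> bool:
    """Approximate a Prometheus =~ matcher, which is anchored at both ends."""
    return re.fullmatch(pattern, value) is not None


# Hypothetical job label values (namespace/name), for illustration only.
SAMPLE_JOBS = [
    "prod/ingester-zone-a",
    "prod/mimir",
    "prod/mimir-write",
    "prod/mimir-backend",
]

if __name__ == "__main__":
    for job in SAMPLE_JOBS:
        old = selects(OLD_JOB_PATTERN, job)
        new = selects(NEW_JOB_PATTERN, job)
        print(f"{job}: old={old} new={new}")
```

Under these assumptions, a series labelled `job="prod/mimir-write"` was selected by the old matcher but is not selected by the new one, while the monolithic `mimir` job and the microservices `ingester.*` jobs are selected by both.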
