
add cnpg alerts #558

Merged
3 commits merged into aenix-io:main from feature/add-cnpg-alerts on Jan 13, 2025

Conversation

klinch0 (Contributor) commented on Jan 3, 2025

Summary by CodeRabbit

  • New Features
    • Added comprehensive monitoring and alerting rules for PostgreSQL instances.
    • Introduced alerts for:
      • Long-running transactions
      • Backend waiting times
      • Transaction ID age
      • Replication lag
      • Archiving failures
      • Deadlock conflicts
      • Replication status
    • New resource: PrometheusRule named cnpg-default-alerts (a minimal sketch of its shape follows below).
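
For readers unfamiliar with the resource, a PrometheusRule carrying one of the alerts listed above has roughly the following shape (a minimal sketch assembled from details discussed later in this thread, not the exact merged file; the group name is illustrative):

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-default-alerts
spec:
  groups:
    - name: cnpg-default-alerts
      rules:
        - alert: LongRunningTransaction
          annotations:
            summary: Long-running transaction detected.
            description: Pod {{ $labels.pod }} has a transaction running longer than 5 minutes (300 seconds).
          expr: cnpg_backends_max_tx_duration_seconds > 300
          for: 1m
          labels:
            severity: warning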

klinch0 requested a review from kvaps as a code owner on January 3, 2025, 10:26
dosubot added the size:M label (This PR changes 30-99 lines, ignoring generated files) on Jan 3, 2025
coderabbitai bot (Contributor) commented on Jan 3, 2025

Walkthrough

A new PrometheusRule configuration has been introduced in the PostgreSQL operator templates, defining a set of alerting rules for monitoring PostgreSQL instances managed by the CNPG operator. This configuration includes eight alerts that address various performance and operational metrics, such as long-running transactions, replication lag, transaction ID age, deadlock conflicts, and archiving failures. Each alert contains detailed annotations, triggering conditions, and severity labels to enhance monitoring capabilities.

Changes

File: packages/system/postgres-operator/templates/prometheusrule.yaml
Change Summary: Added new PrometheusRule configuration with 8 monitoring alerts for PostgreSQL instances

Poem

🐰 Postgres alerts hop and bound,
Monitoring metrics now profound!
Transactions, replicas in sight,
Catching issues day and night 🚨
A rabbit's watch, precise and keen,
Keeping databases smoothly clean! 🔍


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 836cbc0 and 5aaf3c0.

📒 Files selected for processing (1)
  • packages/system/postgres-operator/templates/prometheusrule.yaml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • packages/system/postgres-operator/templates/prometheusrule.yaml


dosubot added the enhancement label (New feature or request) on Jan 3, 2025
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🧹 Nitpick comments (3)
packages/system/postgres-operator/templates/prometheusrule.yaml (3)

9-17: Consider enhancing the LongRunningTransaction alert

A few suggestions to improve this alert:

  1. The 300s threshold might need adjustment based on your workload patterns.
  2. Consider adding instance and cluster labels to improve alert routing and filtering.
  3. The description could be more actionable by including the query details if available.
     - alert: LongRunningTransaction
       annotations:
-        description: Pod {{ $labels.pod }} is taking more than 5 minutes (300 seconds) for a query.
+        description: Pod {{ $labels.pod }} in cluster {{ $labels.cluster }} has a transaction running for more than 5 minutes. Query details: {{ $labels.query }}
         summary: A query is taking longer than 5 minutes.
       expr: |-
-        cnpg_backends_max_tx_duration_seconds > 300
+        cnpg_backends_max_tx_duration_seconds{job="cnpg"} > 300
       for: 1m
       labels:
         severity: warning
+        cluster: '{{ $labels.cluster }}'

36-44: Enhance PGReplication alert with more context

Consider adding more context to help with troubleshooting:

  1. Include the actual lag value in the description
  2. Add primary server information
  3. Consider different severity levels based on lag duration
     - alert: PGReplication
       annotations:
-        description: Standby is lagging behind by over 300 seconds (5 minutes)
+        description: Standby {{ $labels.pod }} is lagging behind primary {{ $labels.primary }} by {{ $value }} seconds
         summary: The standby is lagging behind the primary
       expr: |-
         cnpg_pg_replication_lag > 300
       for: 1m
       labels:
         severity: warning
+        primary: '{{ $labels.primary }}'

63-71: Enhance ReplicaFailingReplication alert with more context

The alert logic is correct, but the description could be more helpful during incident response by including more context about the replication state.

     - alert: ReplicaFailingReplication
       annotations:
-        description: Replica {{ $labels.pod }} is failing to replicate
+        description: |
+          Replica {{ $labels.pod }} is failing to replicate.
+          Pod is in recovery ({{ $labels.in_recovery }}) but WAL receiver is down ({{ $labels.wal_receiver_status }}).
+          Check the replica's logs for potential issues with WAL receiver.
         summary: Checks if the replica is failing to replicate
       expr: |-
         cnpg_pg_replication_in_recovery > cnpg_pg_replication_is_wal_receiver_up
       for: 1m
       labels:
         severity: warning
+        replica_host: '{{ $labels.pod }}'
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 26388c7 and 983d5ef.

📒 Files selected for processing (1)
  • packages/system/postgres-operator/templates/prometheusrule.yaml (1 hunks)
🔇 Additional comments (1)
packages/system/postgres-operator/templates/prometheusrule.yaml (1)

1-71: Consider overall monitoring strategy improvements

  1. Validate metric availability (see the scripts executed below).

  2. Consider implementing different severity levels:

    • Critical: Data loss risks (XID age)
    • Warning: Performance issues (long transactions)
    • Info: Early warnings (replication lag starting)
  3. Add runbooks or troubleshooting guides in the alert descriptions (a combined sketch of points 2 and 3 follows this list).
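
As a concrete illustration of points 2 and 3, a tiered pair of XID-age alerts with a runbook link could look roughly like this (a sketch only; the runbook URL is a placeholder, and the thresholds mirror the 150M/300M values discussed elsewhere in this review):

    - alert: PGDatabaseXidAgeWarning
      annotations:
        summary: Transaction ID age is approaching critical levels
        runbook_url: https://example.com/runbooks/pg-xid-age   # placeholder URL
      expr: cnpg_pg_database_xid_age > 150000000
      for: 5m
      labels:
        severity: info
    - alert: PGDatabaseXidAgeCritical
      annotations:
        summary: Transaction ID age is at risk of wraparound (data loss risk)
        runbook_url: https://example.com/runbooks/pg-xid-age   # placeholder URL
      expr: cnpg_pg_database_xid_age > 300000000
      for: 5m
      labels:
        severity: critical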

✅ Verification successful

Let me search for any existing documentation or metric definitions to validate these metrics further.


CloudNativePG metrics are correctly configured and aligned with operator's monitoring

Based on the verification results, I can confirm that:

  1. All metrics used in the PrometheusRule are standard CloudNativePG metrics, defined in the operator's default monitoring queries configuration.
  2. The metrics are properly aligned with the operator's monitoring capabilities:
    • cnpg_backends_max_tx_duration_seconds comes from the backends query
    • cnpg_backends_waiting_total comes from the backends_waiting query
    • cnpg_pg_database_xid_age comes from the pg_database query
    • cnpg_pg_replication_lag comes from the pg_replication query
    • cnpg_pg_stat_archiver_* metrics come from the pg_stat_archiver query
    • cnpg_pg_stat_database_deadlocks comes from the pg_stat_database query
    • cnpg_pg_replication_in_recovery and cnpg_pg_replication_is_wal_receiver_up come from the pg_replication query

The alert thresholds and severity levels are appropriate for their respective scenarios, with warning levels set for conditions that require attention but aren't immediately critical.
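
For reference, CloudNativePG's default queries use the postgres_exporter-style declaration, which is where the cnpg_<query>_<column> metric names come from. A simplified, illustrative sketch (not the exact upstream definition):

pg_replication:
  query: "SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())) AS lag"
  metrics:
    - lag:
        usage: "GAUGE"
        description: "Replication lag behind primary in seconds"
# exported by the operator as cnpg_pg_replication_lag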

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check if these metrics exist in your Prometheus
metrics=(
  "cnpg_backends_max_tx_duration_seconds"
  "cnpg_backends_waiting_total"
  "cnpg_pg_database_xid_age"
  "cnpg_pg_replication_lag"
  "cnpg_pg_stat_archiver_last_failed_time"
  "cnpg_pg_stat_database_deadlocks"
  "cnpg_pg_replication_in_recovery"
  "cnpg_pg_replication_is_wal_receiver_up"
)

for metric in "${metrics[@]}"; do
  echo "Checking $metric..."
  curl -s "http://prometheus:9090/api/v1/query?query=$metric" 2>/dev/null || \
  rg "$metric" -g '*.yaml' -g '*.yml'
done

Length of output: 3480


Script:

#!/bin/bash
# Look for CloudNativePG documentation or metric definitions
fd . -e md -e yaml -e yml | xargs rg -l "cnpg_" | xargs rg "metrics|monitoring|prometheus"

# Also check if there's any operator CRD that might define these metrics
fd . -e yaml -e yml | xargs rg -l "cloudnative-pg.io" | xargs cat

Length of output: 28088

Comment on lines 27 to 35
    - alert: PGDatabaseXidAge
      annotations:
        description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod }}
        summary: Number of transactions from the frozen XID to the current one
      expr: |-
        cnpg_pg_database_xid_age > 300000000
      for: 1m
      labels:
        severity: warning

⚠️ Potential issue

Fix threshold inconsistency in PGDatabaseXidAge alert

The description mentions 150,000,000 transactions but the expression uses 300,000,000. This inconsistency could lead to confusion during incident response.

     - alert: PGDatabaseXidAge
       annotations:
-        description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod  }}
+        description: Over 300,000,000 transactions from frozen xid on pod {{ $labels.pod  }}
         summary: Number of transactions from the frozen XID to the current one
       expr: |-
         cnpg_pg_database_xid_age > 300000000

Also consider adding a second alert with a lower threshold as an early warning:

    - alert: PGDatabaseXidAgeWarning
      annotations:
        description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod }}
        summary: Transaction ID age is approaching critical levels
      expr: |-
        cnpg_pg_database_xid_age > 150000000
      for: 1m
      labels:
        severity: info

Comment on lines 18 to 29
    - alert: BackendsWaiting
      annotations:
        description: Pod {{ $labels.pod }} has been waiting for longer than 5 minutes
        summary: If a backend is waiting for longer than 5 minutes
      expr: |-
        cnpg_backends_waiting_total > 300
      for: 1m
      labels:
        severity: warning

⚠️ Potential issue

Fix critical issues in BackendsWaiting alert

There are several issues with this alert:

  1. The metric cnpg_backends_waiting_total appears to be a counter (based on the _total suffix), but it's being compared directly to a threshold. Counters should typically be evaluated using rate().
  2. The expression doesn't match the description - it's comparing a total count to 300, not a duration.
     - alert: BackendsWaiting
       annotations:
         description: Pod {{ $labels.pod  }} has been waiting for longer than 5 minutes
         summary: If a backend is waiting for longer than 5 minutes
       expr: |-
-        cnpg_backends_waiting_total > 300
+        rate(cnpg_backends_waiting_total[5m]) > 0
       for: 1m
       labels:
         severity: warning
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
     - alert: BackendsWaiting
       annotations:
         description: Pod {{ $labels.pod }} has been waiting for longer than 5 minutes
         summary: If a backend is waiting for longer than 5 minutes
       expr: |-
-        cnpg_backends_waiting_total > 300
+        rate(cnpg_backends_waiting_total[5m]) > 0
       for: 1m
       labels:
         severity: warning

Comment on lines 54 to 73
    - alert: DatabaseDeadlockConflicts
      annotations:
        description: There are over 10 deadlock conflicts in {{ $labels.pod }}
        summary: Checks the number of database conflicts
      expr: |-
        cnpg_pg_stat_database_deadlocks > 10
      for: 1m
      labels:
        severity: warning

⚠️ Potential issue

Fix counter usage in DatabaseDeadlockConflicts alert

The metric appears to be a counter but is being compared directly to a threshold. This should be using rate() to measure the increase over time.

     - alert: DatabaseDeadlockConflicts
       annotations:
-        description: There are over 10 deadlock conflicts in {{ $labels.pod }}
+        description: High rate of deadlock conflicts (>10/5m) detected in {{ $labels.pod }}
         summary: Checks the number of database conflicts
       expr: |-
-        cnpg_pg_stat_database_deadlocks > 10
+        rate(cnpg_pg_stat_database_deadlocks[5m]) * 300 > 10
       for: 1m
       labels:
         severity: warning
📝 Committable suggestion


Suggested change
     - alert: DatabaseDeadlockConflicts
       annotations:
-        description: There are over 10 deadlock conflicts in {{ $labels.pod }}
+        description: High rate of deadlock conflicts (>10/5m) detected in {{ $labels.pod }}
         summary: Checks the number of database conflicts
       expr: |-
-        cnpg_pg_stat_database_deadlocks > 10
+        rate(cnpg_pg_stat_database_deadlocks[5m]) * 300 > 10
       for: 1m
       labels:
         severity: warning

Comment on lines 45 to 62
    - alert: LastFailedArchiveTime
      annotations:
        description: Archiving failed for {{ $labels.pod }}
        summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
      expr: |-
        (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
      for: 1m
      labels:
        severity: warning

⚠️ Potential issue

Revise LastFailedArchiveTime alert logic

The current implementation has several issues:

  1. The summary mentions "Will be < 0 when it has not failed" but the expression checks for "> 1"
  2. Comparing timestamps directly might lead to false positives
  3. The metric names suggest these are Unix timestamps, which should be handled differently
     - alert: LastFailedArchiveTime
       annotations:
         description: Archiving failed for {{ $labels.pod }}
-        summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
+        summary: Archiving has failed recently
       expr: |-
-        (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
+        (time() - cnpg_pg_stat_archiver_last_failed_time) < 300
+        and
+        cnpg_pg_stat_archiver_last_failed_time > cnpg_pg_stat_archiver_last_archived_time
       for: 1m
       labels:
         severity: warning
📝 Committable suggestion


Suggested change
     - alert: LastFailedArchiveTime
       annotations:
         description: Archiving failed for {{ $labels.pod }}
-        summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
+        summary: Archiving has failed recently
       expr: |-
-        (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
+        (time() - cnpg_pg_stat_archiver_last_failed_time) < 300
+        and
+        cnpg_pg_stat_archiver_last_failed_time > cnpg_pg_stat_archiver_last_archived_time
       for: 1m
       labels:
         severity: warning

Comment on lines +1 to +5
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-default-alerts
spec:

💡 Codebase verification

Resource name should use cloudnative-pg.fullname template

The PrometheusRule name should follow the Helm chart's naming convention by using the cloudnative-pg.fullname template helper, which is available in the chart's _helpers.tpl. This ensures consistent naming across all resources and allows for proper name overrides through Helm values.

Change the name from:

  name: cnpg-default-alerts

to:

  name: {{ include "cloudnative-pg.fullname" . }}-alerts

This follows the pattern seen in other charts and matches the naming template defined in the chart's helpers.
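
For context, such a fullname helper usually follows the standard Helm scaffold, roughly like this (a generic sketch; the actual helper in the vendored cloudnative-pg chart may differ in detail):

{{/* templates/_helpers.tpl */}}
{{- define "cloudnative-pg.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name .Chart.Name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}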

🔗 Analysis chain

Verify monitoring resource naming convention

Let's verify if the resource name follows the established pattern for monitoring resources in the cluster.


Resource name should use Helm templating for consistency

The resource name cnpg-default-alerts is hardcoded while other PrometheusRules in the codebase consistently use Helm templating functions for naming. Let's verify if there's a template function available for this chart.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check naming patterns of existing PrometheusRules
kubectl get prometheusrules -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || \
rg -g '*.yaml' -g '*.yml' 'kind:\s*PrometheusRule' -A 2

Length of output: 2480


Script:

#!/bin/bash
# Check if there's a template helper function defined
rg -g '*.yaml' -g '*.tpl' '_helpers' packages/system/postgres-operator/templates/

Length of output: 81


Script:

#!/bin/bash
# Look for _helpers.tpl in the chart directory and its parent directories
fd -g '_helpers.tpl' packages/system/postgres-operator

# Also check Chart.yaml to get the chart name
fd -g 'Chart.yaml' packages/system/postgres-operator -x cat {}

Length of output: 963


Script:

#!/bin/bash
# Check the content of _helpers.tpl to see available naming templates
cat packages/system/postgres-operator/charts/cloudnative-pg/templates/_helpers.tpl

# Also check if there are any other PrometheusRules in the same chart for naming pattern
rg -g '*.yaml' 'kind:\s*PrometheusRule' packages/system/postgres-operator/charts/cloudnative-pg/templates/

Length of output: 2045

klinch0 marked this pull request as draft on January 6, 2025, 12:16
klinch0 force-pushed the feature/add-cnpg-alerts branch from 983d5ef to 2190502 on January 7, 2025, 09:31
klinch0 marked this pull request as ready for review on January 7, 2025, 09:32
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
packages/system/postgres-operator/templates/prometheusrule.yaml (1)

6-8: Consider restructuring alert groups and adding multi-level alerts

Consider the following improvements to the alert structure:

  1. Split alerts into logical groups (e.g., replication, performance, maintenance)
  2. Add warning levels for critical alerts (e.g., replication lag > 150s as warning, > 300s as critical)
  3. Define alert inhibition rules to prevent alert storms (a sketch of points 1 and 3 follows this list)
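
To make points 1 and 3 concrete: grouping lives in the PrometheusRule spec, while inhibition is configured on the Alertmanager side. A rough sketch under those assumptions (group names and label choices are illustrative, not part of this PR):

# PrometheusRule: split the rules into logical groups
spec:
  groups:
    - name: cnpg-replication      # PGReplication, ReplicaFailingReplication
      rules: []
    - name: cnpg-performance      # LongRunningTransaction, BackendsWaiting
      rules: []
    - name: cnpg-maintenance      # PGDatabaseXidAge, LastFailedArchiveTime, DatabaseDeadlockConflicts
      rules: []

# Alertmanager: suppress warning-level alerts while a critical alert fires for the same pod
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: ["pod"]
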
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 983d5ef and 2190502.

📒 Files selected for processing (1)
  • packages/system/postgres-operator/templates/prometheusrule.yaml (1 hunks)
🔇 Additional comments (6)
packages/system/postgres-operator/templates/prometheusrule.yaml (6)

4-4: Resource name should use cloudnative-pg.fullname template

The resource name should follow the Helm chart's naming convention.


20-29: Fix critical issues in BackendsWaiting alert

The metric appears to be a counter but is being compared directly to a threshold.


31-40: Fix threshold inconsistency in PGDatabaseXidAge alert

Description mentions different threshold than expression.


53-62: Revise LastFailedArchiveTime alert logic

The timestamp comparison logic needs revision.


64-73: LGTM! Good use of increase() function

The implementation correctly uses increase() for the counter metric over a 10-minute window.
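
The merged expression is not quoted in this thread, but based on that description it presumably takes a form along these lines (sketch; the threshold is shown only for illustration):

      expr: increase(cnpg_pg_stat_database_deadlocks[10m]) > 10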


75-84: LGTM! Well-structured replica monitoring

The alert correctly combines recovery status and WAL receiver status to detect replication issues. The critical severity is appropriate for this type of failure.

Comment on lines 9 to 18
    - alert: LongRunningTransaction
      annotations:
        description: >-
          Pod {{ $labels.pod }} has a transaction running longer than 5 minutes (300 seconds).
          This could indicate a potential lock issue or unoptimized query execution.
        summary: Long-running transaction detected.
      expr: rate(cnpg_backends_max_tx_duration_seconds[5m]) > 300
      for: 1m
      labels:
        severity: warning

⚠️ Potential issue

Fix metric usage in LongRunningTransaction alert

The current expression uses rate() on what appears to be a gauge metric measuring duration. This is incorrect as rate() should only be used with counter metrics.

Apply this fix:

-      expr: rate(cnpg_backends_max_tx_duration_seconds[5m]) > 300
+      expr: max(cnpg_backends_max_tx_duration_seconds) > 300
📝 Committable suggestion


Suggested change
     - alert: LongRunningTransaction
       annotations:
         description: >-
           Pod {{ $labels.pod }} has a transaction running longer than 5 minutes (300 seconds).
           This could indicate a potential lock issue or unoptimized query execution.
         summary: Long-running transaction detected.
-      expr: rate(cnpg_backends_max_tx_duration_seconds[5m]) > 300
+      expr: max(cnpg_backends_max_tx_duration_seconds) > 300
       for: 1m
       labels:
         severity: warning

Comment on lines 42 to 51
    - alert: PGReplication
      annotations:
        description: >-
          Standby on pod {{ $labels.pod }} is lagging behind the primary by more than 5 minutes (300 seconds).
          This can lead to outdated data on replicas.
        summary: Replication lag detected.
      expr: rate(cnpg_pg_replication_lag[5m]) > 300
      for: 1m
      labels:
        severity: critical

⚠️ Potential issue

Fix metric usage in PGReplication alert

The current expression uses rate() on what appears to be a gauge metric measuring replication lag. This is incorrect as rate() should only be used with counter metrics.

Apply this fix:

-      expr: rate(cnpg_pg_replication_lag[5m]) > 300
+      expr: max(cnpg_pg_replication_lag) > 300
📝 Committable suggestion


Suggested change
     - alert: PGReplication
       annotations:
         description: >-
           Standby on pod {{ $labels.pod }} is lagging behind the primary by more than 5 minutes (300 seconds).
           This can lead to outdated data on replicas.
         summary: Replication lag detected.
-      expr: rate(cnpg_pg_replication_lag[5m]) > 300
+      expr: max(cnpg_pg_replication_lag) > 300
       for: 1m
       labels:
         severity: critical

kvaps (Member) left a comment:

Apart from what Coderabbit said, the rest seems okay to me

dosubot added the lgtm label (This PR has been approved by a maintainer) on Jan 9, 2025
klinch0 marked this pull request as draft on January 9, 2025, 13:02
klinch0 marked this pull request as ready for review on January 10, 2025, 10:01
dosubot added the size:L label (This PR changes 100-499 lines, ignoring generated files) and removed the size:M label on Jan 10, 2025
kvaps merged commit 4547efa into aenix-io:main on Jan 13, 2025 (1 check passed)
kvaps mentioned this pull request on Jan 15, 2025
kvaps added a commit that referenced this pull request on Jan 15, 2025:

fix regression introduced in #558

Summary by CodeRabbit

  • Refactor
    • Updated label formatting in PostgreSQL operator default alerts configuration
    • Enhanced alert template generation to dynamically include multiple alert configurations from separate files

Signed-off-by: Andrei Kvapil <[email protected]>
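
One common Helm pattern for that kind of dynamic inclusion is to glob alert snippets from the chart's files and splice them into the rule group (a hypothetical sketch; the alerts/ path and helper usage are assumptions, and the actual follow-up commit may differ):

spec:
  groups:
    - name: cnpg-default-alerts
      rules:
{{- range $path, $_ := .Files.Glob "alerts/*.yaml" }}
{{ $.Files.Get $path | indent 8 }}
{{- end }}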