add cnpg alerts #558
Conversation
Walkthrough

A new PrometheusRule configuration has been introduced in the PostgreSQL operator templates, defining a set of alerting rules for monitoring PostgreSQL instances managed by the CNPG operator. This configuration includes eight alerts that address various performance and operational metrics, such as long-running transactions, replication lag, transaction ID age, deadlock conflicts, and archiving failures. Each alert contains detailed annotations, triggering conditions, and severity labels to enhance monitoring capabilities.
Actionable comments posted: 5
🧹 Nitpick comments (3)
packages/system/postgres-operator/templates/prometheusrule.yaml (3)
Lines 9-17: Consider enhancing the LongRunningTransaction alert

A few suggestions to improve this alert:
- The 300s threshold might need adjustment based on your workload patterns.
- Consider adding `instance` and `cluster` labels to improve alert routing and filtering.
- The description could be more actionable by including the query details if available.
```diff
   - alert: LongRunningTransaction
     annotations:
-      description: Pod {{ $labels.pod }} is taking more than 5 minutes (300 seconds) for a query.
+      description: Pod {{ $labels.pod }} in cluster {{ $labels.cluster }} has a transaction running for more than 5 minutes. Query details: {{ $labels.query }}
       summary: A query is taking longer than 5 minutes.
     expr: |-
-      cnpg_backends_max_tx_duration_seconds > 300
+      cnpg_backends_max_tx_duration_seconds{job="cnpg"} > 300
     for: 1m
     labels:
       severity: warning
+      cluster: '{{ $labels.cluster }}'
```
Lines 36-44: Enhance PGReplication alert with more context

Consider adding more context to help with troubleshooting:
- Include the actual lag value in the description
- Add primary server information
- Consider different severity levels based on lag duration
```diff
   - alert: PGReplication
     annotations:
-      description: Standby is lagging behind by over 300 seconds (5 minutes)
+      description: Standby {{ $labels.pod }} is lagging behind primary {{ $labels.primary }} by {{ $value }} seconds
       summary: The standby is lagging behind the primary
     expr: |-
       cnpg_pg_replication_lag > 300
     for: 1m
     labels:
       severity: warning
+      primary: '{{ $labels.primary }}'
```
Lines 63-71: Enhance ReplicaFailingReplication alert with more context

The alert logic is correct, but the description could be more helpful during incident response by including more context about the replication state.
```diff
   - alert: ReplicaFailingReplication
     annotations:
-      description: Replica {{ $labels.pod }} is failing to replicate
+      description: |
+        Replica {{ $labels.pod }} is failing to replicate.
+        Pod is in recovery ({{ $labels.in_recovery }}) but WAL receiver is down ({{ $labels.wal_receiver_status }}).
+        Check the replica's logs for potential issues with WAL receiver.
       summary: Checks if the replica is failing to replicate
     expr: |-
       cnpg_pg_replication_in_recovery > cnpg_pg_replication_is_wal_receiver_up
     for: 1m
     labels:
       severity: warning
+      replica_host: '{{ $labels.pod }}'
```
📒 Files selected for processing (1): packages/system/postgres-operator/templates/prometheusrule.yaml (1 hunks)
🔇 Additional comments (1)
packages/system/postgres-operator/templates/prometheusrule.yaml (1)
Lines 1-71: Consider overall monitoring strategy improvements
- Validate metric availability.
- Consider implementing different severity levels:
  - Critical: Data loss risks (XID age)
  - Warning: Performance issues (long transactions)
  - Info: Early warnings (replication lag starting)
- Add runbooks or troubleshooting guides in the alert descriptions (see the sketch below).
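To make the severity-tier and runbook suggestions concrete, here is a minimal sketch; the thresholds, the alert names, and the `runbook_url` value are illustrative assumptions, not part of this PR:

```yaml
# Hypothetical example: one condition expressed at two severity tiers,
# each carrying a runbook link. Thresholds and the URL are placeholders.
- alert: PGReplicationLagWarning
  expr: cnpg_pg_replication_lag > 150
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: Replication lag is building up
    runbook_url: https://example.com/runbooks/pg-replication-lag
- alert: PGReplicationLagCritical
  expr: cnpg_pg_replication_lag > 300
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: Replication lag has exceeded 5 minutes
    runbook_url: https://example.com/runbooks/pg-replication-lag
```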
✅ Verification successful
Let me search for any existing documentation or metric definitions to validate these metrics further.
CloudNativePG metrics are correctly configured and aligned with the operator's monitoring
Based on the verification results, I can confirm that:
- All metrics used in the PrometheusRule are standard CloudNativePG metrics, defined in the operator's default monitoring queries configuration.
- The metrics are properly aligned with the operator's monitoring capabilities:
- `cnpg_backends_max_tx_duration_seconds` comes from the `backends` query
- `cnpg_backends_waiting_total` comes from the `backends_waiting` query
- `cnpg_pg_database_xid_age` comes from the `pg_database` query
- `cnpg_pg_replication_lag` comes from the `pg_replication` query
- `cnpg_pg_stat_archiver_*` metrics come from the `pg_stat_archiver` query
- `cnpg_pg_stat_database_deadlocks` comes from the `pg_stat_database` query
- `cnpg_pg_replication_in_recovery` and `cnpg_pg_replication_is_wal_receiver_up` come from the `pg_replication` query

The alert thresholds and severity levels are appropriate for their respective scenarios, with warning levels set for conditions that require attention but aren't immediately critical.
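For readers wondering where these metric names come from: CloudNativePG builds metric names as `cnpg_<query_name>_<column>` from its monitoring query definitions. The snippet below is only a sketch of that query format, using the structure the operator accepts for user-supplied queries; the ConfigMap name, the SQL, and the description are illustrative, not copied from the operator's built-in defaults.

```yaml
# Illustrative only: a monitoring query in the CloudNativePG custom-queries format.
# A query named "pg_replication" exposing a "lag" column yields cnpg_pg_replication_lag.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cnpg-custom-monitoring   # hypothetical name
data:
  custom-queries: |
    pg_replication:
      query: |
        SELECT CASE
          WHEN NOT pg_is_in_recovery() THEN 0
          ELSE GREATEST(0, EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp())))
        END AS lag
      metrics:
        - lag:
            usage: "GAUGE"
            description: "Replication lag behind primary in seconds"
```

A Cluster would point at such a ConfigMap via `spec.monitoring.customQueriesConfigMap`; the default queries behind the alerts reviewed here ship with the operator, so no extra configuration is needed.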
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check if these metrics exist in your Prometheus
metrics=(
  "cnpg_backends_max_tx_duration_seconds"
  "cnpg_backends_waiting_total"
  "cnpg_pg_database_xid_age"
  "cnpg_pg_replication_lag"
  "cnpg_pg_stat_archiver_last_failed_time"
  "cnpg_pg_stat_database_deadlocks"
  "cnpg_pg_replication_in_recovery"
  "cnpg_pg_replication_is_wal_receiver_up"
)

for metric in "${metrics[@]}"; do
  echo "Checking $metric..."
  curl -s "http://prometheus:9090/api/v1/query?query=$metric" 2>/dev/null || \
    rg "$metric" -g '*.yaml' -g '*.yml'
done
```
Length of output: 3480
Script:
```bash
#!/bin/bash
# Look for CloudNativePG documentation or metric definitions
fd . -e md -e yaml -e yml | xargs rg -l "cnpg_" | xargs rg "metrics|monitoring|prometheus"

# Also check if there's any operator CRD that might define these metrics
fd . -e yaml -e yml | xargs rg -l "cloudnative-pg.io" | xargs cat
```
Length of output: 28088
```yaml
- alert: PGDatabaseXidAge
  annotations:
    description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod }}
    summary: Number of transactions from the frozen XID to the current one
  expr: |-
    cnpg_pg_database_xid_age > 300000000
  for: 1m
  labels:
    severity: warning
```
Fix threshold inconsistency in PGDatabaseXidAge alert
The description mentions 150,000,000 transactions but the expression uses 300,000,000. This inconsistency could lead to confusion during incident response.
```diff
   - alert: PGDatabaseXidAge
     annotations:
-      description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod }}
+      description: Over 300,000,000 transactions from frozen xid on pod {{ $labels.pod }}
       summary: Number of transactions from the frozen XID to the current one
     expr: |-
       cnpg_pg_database_xid_age > 300000000
```
Also consider adding a second alert with a lower threshold as an early warning:
```yaml
- alert: PGDatabaseXidAgeWarning
  annotations:
    description: Over 150,000,000 transactions from frozen xid on pod {{ $labels.pod }}
    summary: Transaction ID age is approaching critical levels
  expr: |-
    cnpg_pg_database_xid_age > 150000000
  for: 1m
  labels:
    severity: info
```
```yaml
- alert: BackendsWaiting
  annotations:
    description: Pod {{ $labels.pod }} has been waiting for longer than 5 minutes
    summary: If a backend is waiting for longer than 5 minutes
  expr: |-
    cnpg_backends_waiting_total > 300
  for: 1m
  labels:
    severity: warning
```
Fix critical issues in BackendsWaiting alert
There are several issues with this alert:
- The metric `cnpg_backends_waiting_total` appears to be a counter (based on the `_total` suffix), but it's being compared directly to a threshold. Counters should typically be evaluated using rate().
- The expression doesn't match the description: it's comparing a total count to 300, not a duration.
```diff
   - alert: BackendsWaiting
     annotations:
       description: Pod {{ $labels.pod }} has been waiting for longer than 5 minutes
       summary: If a backend is waiting for longer than 5 minutes
     expr: |-
-      cnpg_backends_waiting_total > 300
+      rate(cnpg_backends_waiting_total[5m]) > 0
     for: 1m
     labels:
       severity: warning
```
```yaml
- alert: DatabaseDeadlockConflicts
  annotations:
    description: There are over 10 deadlock conflicts in {{ $labels.pod }}
    summary: Checks the number of database conflicts
  expr: |-
    cnpg_pg_stat_database_deadlocks > 10
  for: 1m
  labels:
    severity: warning
```
Fix counter usage in DatabaseDeadlockConflicts alert
The metric appears to be a counter but is being compared directly to a threshold. This should be using rate() to measure the increase over time.
```diff
   - alert: DatabaseDeadlockConflicts
     annotations:
-      description: There are over 10 deadlock conflicts in {{ $labels.pod }}
+      description: High rate of deadlock conflicts (>10/5m) detected in {{ $labels.pod }}
       summary: Checks the number of database conflicts
     expr: |-
-      cnpg_pg_stat_database_deadlocks > 10
+      rate(cnpg_pg_stat_database_deadlocks[5m]) * 300 > 10
     for: 1m
     labels:
       severity: warning
```
```yaml
- alert: LastFailedArchiveTime
  annotations:
    description: Archiving failed for {{ $labels.pod }}
    summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
  expr: |-
    (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
  for: 1m
  labels:
    severity: warning
```
Revise LastFailedArchiveTime alert logic
The current implementation has several issues:
- The summary mentions "Will be < 0 when it has not failed" but the expression checks for "> 1"
- Comparing timestamps directly might lead to false positives
- The metric names suggest these are Unix timestamps, which should be handled differently
```diff
   - alert: LastFailedArchiveTime
     annotations:
       description: Archiving failed for {{ $labels.pod }}
-      summary: Checks the last time archiving failed. Will be < 0 when it has not failed.
+      summary: Archiving has failed recently
     expr: |-
-      (cnpg_pg_stat_archiver_last_failed_time - cnpg_pg_stat_archiver_last_archived_time) > 1
+      (time() - cnpg_pg_stat_archiver_last_failed_time) < 300
+      and
+      cnpg_pg_stat_archiver_last_failed_time > cnpg_pg_stat_archiver_last_archived_time
     for: 1m
     labels:
       severity: warning
```
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-default-alerts
spec:
```
💡 Codebase verification
Resource name should use the `cloudnative-pg.fullname` template

The PrometheusRule name should follow the Helm chart's naming convention by using the `cloudnative-pg.fullname` template helper, which is available in the chart's `_helpers.tpl`. This ensures consistent naming across all resources and allows for proper name overrides through Helm values.
Change the name from:

```yaml
name: cnpg-default-alerts
```

to:

```yaml
name: {{ include "cloudnative-pg.fullname" . }}-alerts
```

This follows the pattern seen in other charts and matches the naming template defined in the chart's helpers.
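For reference, the standard `helm create` scaffold defines a fullname helper roughly as follows; this is a generic sketch, not the actual contents of this chart's `_helpers.tpl`, which may differ in detail:

```yaml
{{/* Generic fullname helper from the standard Helm scaffold, shown for illustration only. */}}
{{- define "cloudnative-pg.fullname" -}}
{{- if .Values.fullnameOverride }}
{{- .Values.fullnameOverride | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- $name := default .Chart.Name .Values.nameOverride }}
{{- if contains $name .Release.Name }}
{{- .Release.Name | trunc 63 | trimSuffix "-" }}
{{- else }}
{{- printf "%s-%s" .Release.Name $name | trunc 63 | trimSuffix "-" }}
{{- end }}
{{- end }}
{{- end }}
```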
🔗 Analysis chain
Verify monitoring resource naming convention
Let's verify if the resource name follows the established pattern for monitoring resources in the cluster.
Resource name should use Helm templating for consistency

The resource name `cnpg-default-alerts` is hardcoded while other PrometheusRules in the codebase consistently use Helm templating functions for naming. Let's verify if there's a template function available for this chart.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Check naming patterns of existing PrometheusRules
kubectl get prometheusrules -A -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' 2>/dev/null || \
  rg -g '*.yaml' -g '*.yml' 'kind:\s*PrometheusRule' -A 2
```
Length of output: 2480
Script:
```bash
#!/bin/bash
# Check if there's a template helper function defined
rg -g '*.yaml' -g '*.tpl' '_helpers' packages/system/postgres-operator/templates/
```
Length of output: 81
Script:
```bash
#!/bin/bash
# Look for _helpers.tpl in the chart directory and its parent directories
fd -g '_helpers.tpl' packages/system/postgres-operator

# Also check Chart.yaml to get the chart name
fd -g 'Chart.yaml' packages/system/postgres-operator -x cat {}
```
Length of output: 963
Script:
```bash
#!/bin/bash
# Check the content of _helpers.tpl to see available naming templates
cat packages/system/postgres-operator/charts/cloudnative-pg/templates/_helpers.tpl

# Also check if there are any other PrometheusRules in the same chart for naming pattern
rg -g '*.yaml' 'kind:\s*PrometheusRule' packages/system/postgres-operator/charts/cloudnative-pg/templates/
```
Length of output: 2045
983d5ef to 2190502 (Compare)
Actionable comments posted: 2
🧹 Nitpick comments (1)
packages/system/postgres-operator/templates/prometheusrule.yaml (1)
Lines 6-8: Consider restructuring alert groups and adding multi-level alerts

Consider the following improvements to the alert structure:
- Split alerts into logical groups (e.g., replication, performance, maintenance)
- Add warning levels for critical alerts (e.g., replication lag > 150s as warning, >300s as critical)
- Define alert inhibition rules to prevent alert storms (see the sketch below)
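To illustrate the inhibition-rule suggestion above, here is a minimal Alertmanager sketch; muting warning alerts while a matching critical alert fires for the same pod is an assumed policy, not configuration that exists in this chart:

```yaml
# Hypothetical Alertmanager configuration fragment.
inhibit_rules:
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal:
      - alertname
      - pod
```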
📒 Files selected for processing (1): packages/system/postgres-operator/templates/prometheusrule.yaml (1 hunks)
🔇 Additional comments (6)
packages/system/postgres-operator/templates/prometheusrule.yaml (6)
- Line 4: Resource name should use the `cloudnative-pg.fullname` template. The resource name should follow the Helm chart's naming convention.
- Lines 20-29: Fix critical issues in BackendsWaiting alert. The metric appears to be a counter but is being compared directly to a threshold.
- Lines 31-40: Fix threshold inconsistency in PGDatabaseXidAge alert. Description mentions a different threshold than the expression.
- Lines 53-62: Revise LastFailedArchiveTime alert logic. The timestamp comparison logic needs revision.
- Lines 64-73: LGTM! Good use of increase() function. The implementation correctly uses `increase()` for the counter metric over a 10-minute window.
- Lines 75-84: LGTM! Well-structured replica monitoring. The alert correctly combines recovery status and WAL receiver status to detect replication issues. The critical severity is appropriate for this type of failure.
```yaml
- alert: LongRunningTransaction
  annotations:
    description: >-
      Pod {{ $labels.pod }} has a transaction running longer than 5 minutes (300 seconds).
      This could indicate a potential lock issue or unoptimized query execution.
    summary: Long-running transaction detected.
  expr: rate(cnpg_backends_max_tx_duration_seconds[5m]) > 300
  for: 1m
  labels:
    severity: warning
```
Fix metric usage in LongRunningTransaction alert

The current expression uses `rate()` on what appears to be a gauge metric measuring duration. This is incorrect as `rate()` should only be used with counter metrics.

Apply this fix:
```diff
-  expr: rate(cnpg_backends_max_tx_duration_seconds[5m]) > 300
+  expr: max(cnpg_backends_max_tx_duration_seconds) > 300
```
```yaml
- alert: PGReplication
  annotations:
    description: >-
      Standby on pod {{ $labels.pod }} is lagging behind the primary by more than 5 minutes (300 seconds).
      This can lead to outdated data on replicas.
    summary: Replication lag detected.
  expr: rate(cnpg_pg_replication_lag[5m]) > 300
  for: 1m
  labels:
    severity: critical
```
Fix metric usage in PGReplication alert

The current expression uses `rate()` on what appears to be a gauge metric measuring replication lag. This is incorrect as `rate()` should only be used with counter metrics.

Apply this fix:
```diff
-  expr: rate(cnpg_pg_replication_lag[5m]) > 300
+  expr: max(cnpg_pg_replication_lag) > 300
```
Apart from what Coderabbit said, the rest seems okay to me
fix regression introduced in #558

Summary by CodeRabbit
- **Refactor**
  - Updated label formatting in PostgreSQL operator default alerts configuration
  - Enhanced alert template generation to dynamically include multiple alert configurations from separate files

Signed-off-by: Andrei Kvapil <[email protected]>
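The "include multiple alert configurations from separate files" change could plausibly be done with Helm's file helpers; the sketch below is only a guess at the general shape (the `alerts/` directory name is an assumption), not the actual template from that follow-up commit:

```yaml
# Hypothetical template fragment: inline every alert snippet found under alerts/ in the chart.
spec:
  groups:
    - name: cnpg-default-alerts
      rules:
        {{- range $path, $_ := .Files.Glob "alerts/*.yaml" }}
        {{- $.Files.Get $path | nindent 8 }}
        {{- end }}
```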
Summary by CodeRabbit
- A `PrometheusRule` named `cnpg-default-alerts`.
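For orientation, the rules discussed in this review sit inside the usual PrometheusRule group structure. A minimal sketch of the overall file shape follows; the group name and the `release` label are assumptions that depend on how the Prometheus instance selects rules, and only one of the eight alerts is shown:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: cnpg-default-alerts
  labels:
    release: prometheus   # assumption: label the Prometheus ruleSelector matches, if any
spec:
  groups:
    - name: cnpg-default-alerts   # illustrative group name
      rules:
        - alert: LongRunningTransaction
          annotations:
            description: Pod {{ $labels.pod }} is taking more than 5 minutes (300 seconds) for a query.
            summary: A query is taking longer than 5 minutes.
          expr: |-
            cnpg_backends_max_tx_duration_seconds > 300
          for: 1m
          labels:
            severity: warning
```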