Add VMNonRecoverableOSPanic and ClusterMultipleVMPanics alerts#4371
Add VMNonRecoverableOSPanic and ClusterMultipleVMPanics alerts#4371avlitman wants to merge 1 commit into
Conversation
|
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here. DetailsNeeds approval from an approver in each of these files:Approvers can indicate their approval by writing |
Coverage Report for CI Build 28588800930Coverage increased (+0.04%) to 81.468%Details
Uncovered ChangesNo uncovered changes found. Coverage RegressionsNo coverage regressions found. Coverage Stats
💛 - Coveralls |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Pro Plus Run ID: 📒 Files selected for processing (3)
👮 Files not reviewed due to content moderation or server errors (3)
📝 WalkthroughWarning Review ran into problems🔥 ProblemsLinked repositories: Your configuration references 13 linked repositories, but your current plan allows 0. Analyzed ``, skipped Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
pkg/monitoring/observability/rules/alerts/cluster_alerts.go (1)
129-132: 📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low valueMinor: literal label strings vs. shared constants.
"severity": "warning"/"operator_health_impact": "none"are duplicated string literals, consistent with the existing style in this file. As per path instructions ("prefer shared constants for alert/annotation/label strings ... instead of duplicating string literals"), consider introducing shared constants file-wide — but since this matches the pre-existing convention throughoutclusterAlerts(), this is optional and best tackled as a broader follow-up rather than in this PR alone.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@pkg/monitoring/observability/rules/alerts/cluster_alerts.go` around lines 129 - 132, This is a file-wide consistency follow-up rather than a PR-blocking bug: the label literals in clusterAlerts() such as the Labels map entries for severity and operator_health_impact are duplicated across the alert definitions. If you choose to address it, extract shared package-level constants for these repeated alert label values and reuse them throughout clusterAlerts() instead of hardcoding the same strings in each rule.Source: Path instructions
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitoring/observability/rules/alerts/cluster_alerts.go`:
- Around line 121-133: The VMNonRecoverableOSPanic alert definition in
cluster_alerts.go needs a valid runbook target before merge. Verify that the
runbook page for VMNonRecoverableOSPanic exists at the expected
kubevirt.io/monitoring/runbooks location, and if it does not, update the alert’s
runbook reference to point to an open PR or a valid existing docs page. Check
the alert entry itself and any helper that derives the runbook URL so the
generated link is not broken.
---
Nitpick comments:
In `@pkg/monitoring/observability/rules/alerts/cluster_alerts.go`:
- Around line 129-132: This is a file-wide consistency follow-up rather than a
PR-blocking bug: the label literals in clusterAlerts() such as the Labels map
entries for severity and operator_health_impact are duplicated across the alert
definitions. If you choose to address it, extract shared package-level constants
for these repeated alert label values and reuse them throughout clusterAlerts()
instead of hardcoding the same strings in each rule.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: c36f528e-4bff-47af-8c99-33ff63394f03
📒 Files selected for processing (2)
hack/prom-rule-ci/observability-prom-rules-tests.yamlpkg/monitoring/observability/rules/alerts/cluster_alerts.go
|
hco-e2e-consecutive-operator-sdk-upgrades-azure lane succeeded. |
|
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-consecutive-operator-sdk-upgrades-aws, ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-aws, ci/prow/hco-e2e-upgrade-prev-operator-sdk-sno-aws DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitoring/observability/rules/alerts/alerts.go`:
- Around line 31-33: Add unit coverage for the label fallback in Register()
within alerts.go by testing the alert-processing path directly, not only
PrometheusRule reconciliation. Create a test that verifies an existing
kubernetes_operator_component label on VMNonRecoverableOSPanic is preserved, and
another that verifies a missing label is defaulted to cnv-observability when
Register() applies the fallback logic.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 0199b167-d384-4885-a441-adc53b760486
📒 Files selected for processing (3)
hack/prom-rule-ci/observability-prom-rules-tests.yamlpkg/monitoring/observability/rules/alerts/alerts.gopkg/monitoring/observability/rules/alerts/cluster_alerts.go
🚧 Files skipped from review as they are similar to previous changes (2)
- pkg/monitoring/observability/rules/alerts/cluster_alerts.go
- hack/prom-rule-ci/observability-prom-rules-tests.yaml
| if _, ok := alert.Labels["kubernetes_operator_component"]; !ok { | ||
| alert.Labels["kubernetes_operator_component"] = "cnv-observability" | ||
| } |
There was a problem hiding this comment.
🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# Check for unit tests covering the Register function's component-label logic
fd -e go alerts_test.go
rg -n 'kubernetes_operator_component' --type=go pkg/monitoring/observability/rules/alertsRepository: kubevirt/hyperconverged-cluster-operator
Length of output: 584
🏁 Script executed:
#!/bin/bash
set -euo pipefail
ast-grep outline pkg/monitoring/observability/rules/alerts/alerts.go --view expanded
printf '\n--- alerts_test.go ---\n'
wc -l controllers/observability/alerts_test.go
sed -n '1,260p' controllers/observability/alerts_test.go
printf '\n--- cluster_alerts.go excerpt ---\n'
sed -n '110,150p' pkg/monitoring/observability/rules/alerts/cluster_alerts.goRepository: kubevirt/hyperconverged-cluster-operator
Length of output: 5116
Add unit coverage for the component-label fallback here. controllers/observability/alerts_test.go only exercises PrometheusRule reconciliation, not this branch in pkg/monitoring/observability/rules/alerts/alerts.go. Add a unit test for Register() that covers both cases: preserve an existing kubernetes_operator_component value (for VMNonRecoverableOSPanic) and default missing labels to cnv-observability.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitoring/observability/rules/alerts/alerts.go` around lines 31 - 33,
Add unit coverage for the label fallback in Register() within alerts.go by
testing the alert-processing path directly, not only PrometheusRule
reconciliation. Create a test that verifies an existing
kubernetes_operator_component label on VMNonRecoverableOSPanic is preserved, and
another that verifies a missing label is defaulted to cnv-observability when
Register() applies the fallback logic.
Source: Path instructions
There was a problem hiding this comment.
The existing testutil.ValidateAlertPartOfAndComponentLabels linter already validates that all alerts have kubernetes_operator_component set. And the promtool tests already verify the exact label values in the rendered output (kubernetes_operator_component: "kubevirt" for our alerts, "cnv-observability" for others).
There was a problem hiding this comment.
Oops, something went wrong! Please try again later. 🐰 💔
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
hack/prom-rule-ci/observability-prom-rules-tests.yaml (1)
487-531: 📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick winAdd a boundary test at exactly 10 affected VMs.
Current cases cover 0 panicking VMs implicitly and 11 VMs (fires). Since the alert threshold is strictly
> 10, a test asserting no alert with exactly 10 affected VMs would tighten coverage of this boundary.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@hack/prom-rule-ci/observability-prom-rules-tests.yaml` around lines 487 - 531, Add a boundary test for ClusterMultipleVMPanics in observability-prom-rules-tests.yaml using the existing alert_rule_test block: keep the current 11-VM firing case, and add a new eval_time case with exactly 10 kubevirt_vmi_guest_os_panic_total series so the alert does not fire. Reference the ClusterMultipleVMPanics rule name and its existing vm-01 through vm-11 input_series pattern, and ensure the new assertion expects no alerts at the strict > 10 threshold.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@pkg/monitoring/observability/rules/alerts/cluster_alerts.go`:
- Around line 129-133: The new Labels map literals in cluster_alerts.go are not
gofmt-aligned, causing formatting checks to fail. Update the Labels blocks in
the affected alert definitions (including the ones near the cluster alert rule
setup) so they match standard gofmt spacing and alignment, then run gofmt on the
file to normalize the formatting.
---
Nitpick comments:
In `@hack/prom-rule-ci/observability-prom-rules-tests.yaml`:
- Around line 487-531: Add a boundary test for ClusterMultipleVMPanics in
observability-prom-rules-tests.yaml using the existing alert_rule_test block:
keep the current 11-VM firing case, and add a new eval_time case with exactly 10
kubevirt_vmi_guest_os_panic_total series so the alert does not fire. Reference
the ClusterMultipleVMPanics rule name and its existing vm-01 through vm-11
input_series pattern, and ensure the new assertion expects no alerts at the
strict > 10 threshold.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 54504c24-9be9-4aad-8583-59ca1be8181b
📒 Files selected for processing (3)
hack/prom-rule-ci/observability-prom-rules-tests.yamlpkg/monitoring/observability/rules/alerts/alerts.gopkg/monitoring/observability/rules/alerts/cluster_alerts.go
🚧 Files skipped from review as they are similar to previous changes (1)
- pkg/monitoring/observability/rules/alerts/alerts.go
| Labels: map[string]string{ | ||
| "severity": "warning", | ||
| "operator_health_impact": "none", | ||
| "kubernetes_operator_component": "kubevirt", | ||
| }, |
There was a problem hiding this comment.
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
gofmt -l pkg/monitoring/observability/rules/alerts/cluster_alerts.goRepository: kubevirt/hyperconverged-cluster-operator
Length of output: 238
🏁 Script executed:
sed -n '120,155p' pkg/monitoring/observability/rules/alerts/cluster_alerts.goRepository: kubevirt/hyperconverged-cluster-operator
Length of output: 1955
Format the new Labels maps with gofmt. The spacing in pkg/monitoring/observability/rules/alerts/cluster_alerts.go:129-133 and :143-147 is not gofmt-aligned, so this file will fail formatting checks until it’s run through gofmt.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@pkg/monitoring/observability/rules/alerts/cluster_alerts.go` around lines 129
- 133, The new Labels map literals in cluster_alerts.go are not gofmt-aligned,
causing formatting checks to fail. Update the Labels blocks in the affected
alert definitions (including the ones near the cluster alert rule setup) so they
match standard gofmt spacing and alignment, then run gofmt on the file to
normalize the formatting.
Signed-off-by: avlitman <alitman@redhat.com>
|
|
hco-e2e-operator-sdk-gcp lane succeeded. |
|
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-operator-sdk-aws, ci/prow/hco-e2e-operator-sdk-azure DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
@avlitman: The following tests failed, say
Full PR test history. Your PR dashboard. DetailsInstructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
|
hco-e2e-upgrade-prev-operator-sdk-azure lane succeeded. |
|
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-upgrade-operator-sdk-aws, ci/prow/hco-e2e-upgrade-operator-sdk-sno-azure, ci/prow/hco-e2e-upgrade-prev-operator-sdk-aws DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
|
hco-e2e-kv-smoke-gcp lane succeeded. |
|
@hco-bot: Overrode contexts on behalf of hco-bot: ci/prow/hco-e2e-kv-smoke-azure DetailsIn response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |



Add a VMNonRecoverableOSPanic warning alert that fires when any non-recoverable guest OS panic is detected (> 0 and <= 5), based on the existing kubevirt_vmi_guest_os_panic_total metric. Also add a ClusterMultipleVMPanics warning alert for admins that fires when more than 10 VMs across the cluster have experienced panics.
This complements the KubeVirt-side VMNonRecoverableOSPanic alert (kubevirt#18004) which fires at critical severity when a VM exceeds 5 panics (crash-looping). Both fire independently from separate PrometheusRule CRs and are visible as distinct alerts in Alertmanager (differentiated by
severityandkubernetes_operator_componentlabels).Jira Ticket:
Release note: