feat(module): add alert KubeNodeAwaitingVirtualMachinesEvictionBeforeShutdown #1268

Open · wants to merge 3 commits into base: main
55 changes: 55 additions & 0 deletions monitoring/prometheus-rules/node.yaml
@@ -0,0 +1,55 @@
- alert: KubeNodeAwaitingVirtualMachinesEvictionBeforeShutdown
expr: |
(
kube_node_status_condition{condition="GracefulShutdownPostpone", status="true"} == 1
and on(node)
sum by (node) (d8_virtualization_virtualmachine_status_phase{phase="Running"}) > 0
)
labels:
severity_level: "6"
tier: cluster
for: 5m
annotations:
plk_protocol_extent_version: "1"
plk_markup_format: "markdown"
plk_create_group_if_not_exists__node_maintenance: "NodeMaintenance,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
plk_grouped_by__node_maintenance: "NodeMaintenance,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
summary: Node is awaiting workload evacuation before safe shutdown.
description: |
The node `{{ $labels.node }}` has activated graceful shutdown protection and **cannot be safely powered off** until workloads (e.g., VirtualMachines) are evicted.

### What Is Happening?
A shutdown request was issued, but the system intercepted it to prevent data loss or VM downtime.
The `GracefulShutdownPostpone` condition is now active — this means:
- The node is **intentionally blocking abrupt power-off**.
- You must **manually evict VirtualMachines** before proceeding.

This is expected behavior for nodes running VMs and ensures safe maintenance.
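
To confirm this on the node itself, you can inspect its status conditions (a hedged check: it assumes the `GracefulShutdownPostpone` condition is surfaced in the node's `status.conditions`, which is where kube-state-metrics derives `kube_node_status_condition` from):
```bash
# Prints "True" while the node is postponing shutdown (assumes the condition
# name matches the one used in the alert expression).
d8 k get node {{ $labels.node }} -o jsonpath='{.status.conditions[?(@.type=="GracefulShutdownPostpone")].status}'
```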

### Required Action
To proceed with node shutdown:
1. **List VMs running on the node and check if they are migratable**:
```bash
d8 k get virtualmachine -A -o jsonpath='{range .items[?(@.status.nodeName=="'{{ $labels.node }}'")]}{.metadata.namespace}/{.metadata.name}{"\t"}Migratable={.status.conditions[?(@.type=="Migratable")].status}{"\n"}{end}'
```
This command shows a list like:
```bash
default/vm-name Migratable=True
prod/vm-beta Migratable=False
```
2. **For each VM**:
**If Migratable=True**, **migrate the VM to another node**:
```bash
d8 v evict <vm-name> -n <namespace>
```
> This migrates the VM to another node without guest OS downtime.

**If Migratable=False**, **restart the VM**:
```bash
d8 v restart <vm-name> -n <namespace>
```
> This restarts the VM.
> Some VMs cannot run on other nodes because they have specific storage or network requirements.
> In such cases, these VMs must be stopped.
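
If you need to stop a VM, the `d8` virtualization CLI is expected to provide a `stop` subcommand alongside `restart` (an assumption; verify with `d8 v --help` on your version):
```bash
# Assumption: `d8 v stop` mirrors `d8 v restart`; check `d8 v --help` first.
d8 v stop <vm-name> -n <namespace>
```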

3. Once all VMs are migrated, restarted, or stopped, the node will automatically continue shutting down.
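
To check that no VMs in the `Running` phase remain on the node, you can reuse the listing approach from step 1 with the phase column (a sketch; it assumes the VirtualMachine resource reports `.status.phase`, the field behind `d8_virtualization_virtualmachine_status_phase`):
```bash
# Lists remaining VMs on the node with their phase; an empty output (or no
# "Running" entries) means the graceful-shutdown postpone should clear.
d8 k get virtualmachine -A -o jsonpath='{range .items[?(@.status.nodeName=="'{{ $labels.node }}'")]}{.metadata.namespace}/{.metadata.name}{"\t"}{.status.phase}{"\n"}{end}'
```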