diff --git a/monitoring/prometheus-rules/node.yaml b/monitoring/prometheus-rules/node.yaml
new file mode 100644
index 0000000000..2860a4f309
--- /dev/null
+++ b/monitoring/prometheus-rules/node.yaml
@@ -0,0 +1,55 @@
+- alert: KubeNodeAwaitingVirtualMachinesEvictionBeforeShutdown
+  expr: |
+    (
+      kube_node_status_condition{condition="GracefulShutdownPostpone", status="true"} == 1
+      and on(node)
+      sum by (node) (d8_virtualization_virtualmachine_status_phase{phase="Running"}) > 0
+    )
+  labels:
+    severity_level: "6"
+    tier: cluster
+  for: 5m
+  annotations:
+    plk_protocol_extent_version: "1"
+    plk_markup_format: "markdown"
+    plk_create_group_if_not_exists__node_maintenance: "NodeMaintenance,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+    plk_grouped_by__node_maintenance: "NodeMaintenance,tier=~tier,prometheus=deckhouse,kubernetes=~kubernetes"
+    summary: Node is awaiting workload evacuation before safe shutdown.
+    description: |
+      The node `{{ $labels.node }}` has activated graceful shutdown protection and **cannot be safely powered off** until workloads (e.g., VirtualMachines) are evicted.
+
+      ### What Is Happening?
+      A shutdown request was issued, but the system intercepted it to prevent data loss or VM downtime.
+      The `GracefulShutdownPostpone` condition is now active, which means:
+      - The node is **intentionally blocking abrupt power-off**.
+      - You must **manually evict VirtualMachines** before proceeding.
+
+      This is expected behavior for nodes running VMs and ensures safe maintenance.
+
+      ### Required Action
+      To proceed with node shutdown:
+      1. **List the VMs running on the node and check whether they are migratable**:
+         ```bash
+         d8 k get virtualmachine -A -o jsonpath='{range .items[?(@.status.nodeName=="'{{ $labels.node }}'")]}{.metadata.namespace}/{.metadata.name}{"\t"}Migratable={.status.conditions[?(@.type=="Migratable")].status}{"\n"}{end}'
+         ```
+         This command shows a list like:
+         ```bash
+         default/vm-name Migratable=True
+         prod/vm-beta Migratable=False
+         ```
+      2. **For each VM**:
+         **If Migratable=True**, **migrate the VM to another node**:
+         ```bash
+         d8 v evict -n
+         ```
+         > This migrates the VM to another node without guest OS downtime.
+
+         **If Migratable=False**, **restart the VM**:
+         ```bash
+         d8 v restart -n
+         ```
+         > This restarts the VM.
+         Some VMs cannot run on other nodes because they have specific storage or network requirements.
+         In such cases, these VMs must be stopped.
+
+      3. Once all VMs are migrated, restarted, or stopped, the node will automatically continue shutting down.
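
The per-VM decision in step 2 of the alert description can be sketched as a small dry-run helper. This is only an illustration, not part of the change: `plan_vm_actions` is a hypothetical name, and the sketch assumes that `d8 v evict` and `d8 v restart` take `-n <namespace> <vm-name>` as arguments. It reads the `namespace/name Migratable=<status>` lines produced by the listing command in step 1 and prints the command the runbook suggests for each VM, without executing anything.

```shell
#!/usr/bin/env bash
# Dry-run helper (hypothetical): reads "namespace/name Migratable=<status>"
# lines on stdin and prints the suggested command for each VM.
# Assumption: `d8 v evict` / `d8 v restart` accept `-n <namespace> <vm-name>`.
plan_vm_actions() {
  local vm status ns name
  # `|| [ -n "$vm" ]` also handles a last line without a trailing newline.
  while read -r vm status || [ -n "$vm" ]; do
    [ -z "$vm" ] && continue   # skip blank lines
    ns="${vm%%/*}"             # text before the first "/"
    name="${vm#*/}"            # text after the first "/"
    case "$status" in
      Migratable=True)  echo "d8 v evict -n $ns $name" ;;
      Migratable=False) echo "d8 v restart -n $ns $name" ;;
      *)                echo "# skip $vm: unrecognized status '$status'" ;;
    esac
  done
}
```

Piping the output of the listing command into `plan_vm_actions` gives a list of commands to review before running any of them. VMs that cannot run on another node still need the manual handling described in the alert text.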