KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090jmsperu wants to merge 1 commit intoapache:mainfrom
Conversation
The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.
For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.
Adds a new agent.properties option:
kvm.heartbeat.fence.action = reboot | graceful-reboot
| restart-agent | log-only
Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:
- graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
a chance to shut down cleanly
- restart-agent: restart cloudstack-agent only, preserving running VMs
- log-only: log + alert, no automatic action
The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.
Refs: apache#13089
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #13090 +/- ##
============================================
+ Coverage 17.92% 18.07% +0.14%
- Complexity 16154 16697 +543
============================================
Files 5939 6036 +97
Lines 533181 542411 +9230
Branches 65237 66424 +1187
============================================
+ Hits 95585 98036 +2451
- Misses 426856 433361 +6505
- Partials 10740 11014 +274
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
DaanHoogland
left a comment
There was a problem hiding this comment.
clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).
|
@jmsperu , would you consider this for older versions as well? |
There was a problem hiding this comment.
Pull request overview
Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).
Changes:
- Introduces new agent property
kvm.heartbeat.fence.action(defaultreboot) to select fencing behavior. - Updates
kvmheartbeat.shandkvmspheartbeat.shto read the property from/etc/cloudstack/agent/agent.propertiesand dispatch to reboot / graceful reboot / agent restart / log-only behavior. - Documents the property in
agent.propertiesand adds aAgentPropertiesentry for discoverability.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| scripts/vm/hypervisor/kvm/kvmheartbeat.sh | Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures. |
| scripts/vm/hypervisor/kvm/kvmspheartbeat.sh | Same fencing configurability for StorPool heartbeat script. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation. |
| agent/conf/agent.properties | Documents kvm.heartbeat.fence.action for operators. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # write fails persistently. Supersedes the legacy binary | ||
| # 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value. | ||
| # |
| # reboot - immediate sysrq-trigger reboot (default; original behavior) | ||
| # graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly | ||
| # restart-agent - restart cloudstack-agent only; running VMs are preserved | ||
| # log-only - log + alert; take no automatic action (admin must investigate) |
| * <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li> | ||
| * <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li> | ||
| * <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li> | ||
| * <li>{@code log-only} — log + alert, no automatic action</li> |
Description
Fixes #13089
The KVM agent's storage heartbeat scripts (
kvmheartbeat.shandkvmspheartbeat.sh) hard-code an immediate kernel-level reboot viaecho b > /proc/sysrq-triggerwhen a heartbeat write to primary storage times out.This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.
We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.
Change
Adds a new agent property
kvm.heartbeat.fence.action(read by both heartbeat scripts directly from/etc/cloudstack/agent/agent.properties):reboot(default)echo b > /proc/sysrq-triggergraceful-rebootsystemctl reboot— allows running VMs to stop cleanlyrestart-agentcloudstack-agentonly; running VMs preservedlog-onlyDefault is
rebootso existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.The existing
reboot.host.and.alert.management.on.heartbeat.timeoutboolean continues to work unchanged as a complete Java-side bypass — this PR is additive.Files changed
scripts/vm/hypervisor/kvm/kvmheartbeat.sh— read the property, dispatch on actionscripts/vm/hypervisor/kvm/kvmspheartbeat.sh— sameagent/conf/agent.properties— document the new propertyagent/src/main/java/com/cloud/agent/properties/AgentProperties.java— add Java-side property entry for tooling/discoverabilityBackward compatibility
reboot, identical to current behaviortail -n 1so duplicate entries take the last valuerebootreboot.host.and.alert.management.on.heartbeat.timeout) continues to work as beforeTesting
reboot(default) — verified produces same output as before viabash -xtrace; sysrq path unchangedlog-only— verified the script exits 0 with logger entry, no reboot/agent-restart attemptedrestart-agent— verifiedsystemctl restart cloudstack-agentis invokedgraceful-reboot— verifiedsystemctl rebootis invoked instead of sysrqIn production we have been running with the fence path neutered (equivalent to
log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.Related