KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only) by jmsperu · Pull Request #13090 · apache/cloudstack

jmsperu · 2026-05-01T00:09:48Z

Description

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

| Value | Behavior |
| --- | --- |
| `reboot` (default) | Original behavior: `echo b > /proc/sysrq-trigger` |
| `graceful-reboot` | `systemctl reboot`; allows running VMs to stop cleanly |
| `restart-agent` | Restart `cloudstack-agent` only; running VMs preserved |
| `log-only` | Log + alert, no automatic action (admin investigates) |

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.
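As a rough illustration of the value-to-action mapping in the table above, the sketch below prints the command each action would run instead of executing it, so the mapping can be inspected on a live host without risk. The function name `fence_command_for` and the exact `logger` tag and message are assumptions made for this example; they are not taken from the patched scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): print the command a given fence
# action maps to, rather than running it.
fence_command_for() {
    case "$1" in
        graceful-reboot)
            echo "systemctl reboot" ;;
        restart-agent)
            echo "systemctl restart cloudstack-agent" ;;
        log-only)
            # Assumed logger tag and message, for illustration only.
            echo "logger -t heartbeat 'heartbeat write failed; no automatic action taken'" ;;
        *)
            # 'reboot' and any unrecognized value map to the original sysrq path.
            echo "echo b > /proc/sysrq-trigger" ;;
    esac
}
```

For example, `fence_command_for graceful-reboot` prints `systemctl reboot`, while an unrecognized value such as `fence_command_for typo` falls through to the sysrq command, mirroring the default described above.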

Files changed

scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
agent/conf/agent.properties — document the new property
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

Default action is reboot, identical to current behavior
Property is read with tail -n 1 so duplicate entries take the last value
If the property file is unreadable or the value is unrecognized, falls back to reboot
No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before
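The lookup rules listed above (last duplicate wins, fallback to `reboot` on an unreadable file or unrecognized value) could be sketched roughly as follows. This is a minimal sketch under stated assumptions: the function name `read_fence_action` and the `sed` pattern are hypothetical and may differ from what the actual scripts do; only the property name, the file path, and the use of `tail -n 1` come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): resolve the fence action from
# agent.properties, defaulting to 'reboot'.
read_fence_action() {
    local props_file="${1:-/etc/cloudstack/agent/agent.properties}"
    local action=""

    if [ -r "$props_file" ]; then
        # Print the value of every uncommented matching line, keep the last
        # one (duplicate entries take the last value), strip whitespace.
        action=$(sed -n 's/^[[:space:]]*kvm\.heartbeat\.fence\.action[[:space:]]*=//p' "$props_file" \
            | tail -n 1 \
            | tr -d '[:space:]')
    fi

    case "$action" in
        reboot|graceful-reboot|restart-agent|log-only)
            printf '%s\n' "$action" ;;
        *)
            # Missing file, missing property, or unrecognized value.
            printf 'reboot\n' ;;
    esac
}
```

Calling `read_fence_action /path/to/agent.properties` prints one of the four known values; a missing file or a misspelled value yields `reboot`, so a configuration mistake cannot silently disable fencing.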

Testing

reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
restart-agent — verified systemctl restart cloudstack-agent is invoked
graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health: the host stays up while DRBD resyncs complete normally in the background, and the previous false-positive cascade has not recurred.

Related

Issue: KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage #13089
Affected versions: 4.22.0.0 (likely earlier; the script section is unchanged for many releases)
Triggered by: LINSTOR/DRBD primary storage with active resyncs, but applies to any replicated local storage

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via 'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage times out. This bypasses all OS-level shutdown protections, drops every running VM on the host instantly, and triggers HA cascades onto surviving hosts.

For NFS shared storage the binary "heartbeat-write-failed = host-is-dead" heuristic is reasonable. For LINSTOR/DRBD or other replicated local storage, the same disk serves application I/O, replication I/O and heartbeat I/O simultaneously - so a transient I/O contention spike can time out the heartbeat write without the host actually being unhealthy. The result is false-positive sysrq fencing.

Adds a new agent.properties option:

kvm.heartbeat.fence.action = reboot | graceful-reboot | restart-agent | log-only

Default value is "reboot" so existing deployments keep their current behavior. Operators on replicated storage backends can choose a less destructive action:

- graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs a chance to shut down cleanly
- restart-agent: restart cloudstack-agent only, preserving running VMs
- log-only: log + alert, no automatic action

The existing 'reboot.host.and.alert.management.on.heartbeat.timeout' boolean continues to function as a complete Java-side bypass.

Refs: apache#13089

codecov · 2026-05-01T02:47:31Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.07%. Comparing base (30e6c22) to head (d603b26).
⚠️ Report is 184 commits behind head on main.

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #13090      +/-   ##
============================================
+ Coverage     17.92%   18.07%   +0.14%     
- Complexity    16154    16697     +543     
============================================
  Files          5939     6036      +97     
  Lines        533181   542411    +9230     
  Branches      65237    66424    +1187     
============================================
+ Hits          95585    98036    +2451     
- Misses       426856   433361    +6505     
- Partials      10740    11014     +274

| Flag | Coverage Δ |
| --- | --- |
| uitests | `3.52% <ø> (-0.15%)` ⬇️ |
| unittests | `19.23% <100.00%> (+0.19%)` ⬆️ |


DaanHoogland

clgtm, thanks for this feature @jmsperu. I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

DaanHoogland · 2026-05-01T09:46:36Z

@jmsperu, would you consider this for older versions as well?

Copilot

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
Documents the property in agent.properties and adds an AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| scripts/vm/hypervisor/kvm/kvmheartbeat.sh | Reads `kvm.heartbeat.fence.action` and selects fence behavior for heartbeat write failures. |
| scripts/vm/hypervisor/kvm/kvmspheartbeat.sh | Same fencing configurability for StorPool heartbeat script. |
| agent/src/main/java/com/cloud/agent/properties/AgentProperties.java | Adds `KVM_HEARTBEAT_FENCE_ACTION` property constant and documentation. |
| agent/conf/agent.properties | Documents `kvm.heartbeat.fence.action` for operators. |

+# write fails persistently. Supersedes the legacy binary
+# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
+#


+#   reboot          - immediate sysrq-trigger reboot (default; original behavior)
+#   graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly
+#   restart-agent   - restart cloudstack-agent only; running VMs are preserved
+#   log-only        - log + alert; take no automatic action (admin must investigate)


+     *   <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li>
+     *   <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li>
+     *   <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li>
+     *   <li>{@code log-only} — log + alert, no automatic action</li>


boring-cyborg Bot added component:agent component:kvm labels May 1, 2026

DaanHoogland approved these changes May 1, 2026

DaanHoogland added the status:needs-testing label May 1, 2026

sureshanaparti requested a review from Copilot May 1, 2026 09:50

sureshanaparti added this to the 4.23.0 milestone May 1, 2026

Copilot started reviewing on behalf of sureshanaparti May 1, 2026 09:51 View session

Copilot AI reviewed May 1, 2026

jmsperu wants to merge 1 commit into apache:main from jmsperu:feature/configurable-heartbeat-fence
