Skip to content

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090

Open
jmsperu wants to merge 1 commit intoapache:mainfrom
jmsperu:feature/configurable-heartbeat-fence
Open

KVM: make storage heartbeat fence action configurable (graceful-reboot / restart-agent / log-only)#13090
jmsperu wants to merge 1 commit intoapache:mainfrom
jmsperu:feature/configurable-heartbeat-fence

Conversation

@jmsperu
Copy link
Copy Markdown
Contributor

@jmsperu jmsperu commented May 1, 2026

Description

Fixes #13089

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via echo b > /proc/sysrq-trigger when a heartbeat write to primary storage times out.

This works fine for NFS-backed primary storage where transient I/O latency is rare, but causes false-positive host fencing on LINSTOR/DRBD (and any replicated local storage), because the same disk simultaneously serves application I/O, replication I/O and heartbeat I/O. A normal DRBD resync I/O burst can transiently delay the heartbeat write enough to trip the fence — and the host is force-rebooted with no real fault.

We hit this in production on 4.22.0.0 multiple times during a single incident; each false-positive sysrq drops every running VM on the host and cascades onto the surviving peer.

Change

Adds a new agent property kvm.heartbeat.fence.action (read by both heartbeat scripts directly from /etc/cloudstack/agent/agent.properties):

Value Behavior
reboot (default) Original behavior: echo b > /proc/sysrq-trigger
graceful-reboot systemctl reboot — allows running VMs to stop cleanly
restart-agent Restart cloudstack-agent only; running VMs preserved
log-only Log + alert, no automatic action (admin investigates)

Default is reboot so existing deployments keep current behavior. Operators on replicated-storage backends can pick a less destructive action.

The existing reboot.host.and.alert.management.on.heartbeat.timeout boolean continues to work unchanged as a complete Java-side bypass — this PR is additive.

Files changed

  • scripts/vm/hypervisor/kvm/kvmheartbeat.sh — read the property, dispatch on action
  • scripts/vm/hypervisor/kvm/kvmspheartbeat.sh — same
  • agent/conf/agent.properties — document the new property
  • agent/src/main/java/com/cloud/agent/properties/AgentProperties.java — add Java-side property entry for tooling/discoverability

Backward compatibility

  • Default action is reboot, identical to current behavior
  • Property is read with tail -n 1 so duplicate entries take the last value
  • If the property file is unreadable or the value is unrecognized, falls back to reboot
  • No Java-side runtime change — the existing boolean (reboot.host.and.alert.management.on.heartbeat.timeout) continues to work as before

Testing

  • reboot (default) — verified produces same output as before via bash -x trace; sysrq path unchanged
  • log-only — verified the script exits 0 with logger entry, no reboot/agent-restart attempted
  • restart-agent — verified systemctl restart cloudstack-agent is invoked
  • graceful-reboot — verified systemctl reboot is invoked instead of sysrq

In production we have been running with the fence path neutered (equivalent to log-only) for several hours since the incident, with no impact on cluster health — the host stays up while DRBD resyncs background-complete normally, and the previous false-positive cascade has not recurred.

Related

The KVM agent's storage heartbeat scripts (kvmheartbeat.sh and
kvmspheartbeat.sh) hard-code an immediate kernel-level reboot via
'echo b > /proc/sysrq-trigger' when a heartbeat write to primary storage
times out. This bypasses all OS-level shutdown protections, drops every
running VM on the host instantly, and triggers HA cascades onto
surviving hosts.

For NFS shared storage the binary "heartbeat-write-failed = host-is-dead"
heuristic is reasonable. For LINSTOR/DRBD or other replicated local
storage, the same disk serves application I/O, replication I/O and
heartbeat I/O simultaneously - so a transient I/O contention spike can
time out the heartbeat write without the host actually being unhealthy.
The result is false-positive sysrq fencing.

Adds a new agent.properties option:

    kvm.heartbeat.fence.action = reboot | graceful-reboot
                               | restart-agent | log-only

Default value is "reboot" so existing deployments keep their current
behavior. Operators on replicated storage backends can choose a less
destructive action:

  - graceful-reboot: 'systemctl reboot' instead of sysrq, allowing VMs
    a chance to shut down cleanly
  - restart-agent: restart cloudstack-agent only, preserving running VMs
  - log-only: log + alert, no automatic action

The existing 'reboot.host.and.alert.management.on.heartbeat.timeout'
boolean continues to function as a complete Java-side bypass.

Refs: apache#13089
@codecov
Copy link
Copy Markdown

codecov Bot commented May 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 18.07%. Comparing base (30e6c22) to head (d603b26).
⚠️ Report is 184 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #13090      +/-   ##
============================================
+ Coverage     17.92%   18.07%   +0.14%     
- Complexity    16154    16697     +543     
============================================
  Files          5939     6036      +97     
  Lines        533181   542411    +9230     
  Branches      65237    66424    +1187     
============================================
+ Hits          95585    98036    +2451     
- Misses       426856   433361    +6505     
- Partials      10740    11014     +274     
Flag Coverage Δ
uitests 3.52% <ø> (-0.15%) ⬇️
unittests 19.23% <100.00%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Copy Markdown
Contributor

@DaanHoogland DaanHoogland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

clgtm, thanks for this feature @jmsperu . I would suggest renaming “reboot” to “fence” or “hard-reboot” (but no -1 on that).

@DaanHoogland
Copy link
Copy Markdown
Contributor

@jmsperu , would you consider this for older versions as well?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds a configurable fencing action for KVM storage-heartbeat timeouts to avoid overly-destructive false-positive host fencing on replicated/local primary storage backends (e.g., LINSTOR/DRBD).

Changes:

  • Introduces new agent property kvm.heartbeat.fence.action (default reboot) to select fencing behavior.
  • Updates kvmheartbeat.sh and kvmspheartbeat.sh to read the property from /etc/cloudstack/agent/agent.properties and dispatch to reboot / graceful reboot / agent restart / log-only behavior.
  • Documents the property in agent.properties and adds a AgentProperties entry for discoverability.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.

File Description
scripts/vm/hypervisor/kvm/kvmheartbeat.sh Reads kvm.heartbeat.fence.action and selects fence behavior for heartbeat write failures.
scripts/vm/hypervisor/kvm/kvmspheartbeat.sh Same fencing configurability for StorPool heartbeat script.
agent/src/main/java/com/cloud/agent/properties/AgentProperties.java Adds KVM_HEARTBEAT_FENCE_ACTION property constant and documentation.
agent/conf/agent.properties Documents kvm.heartbeat.fence.action for operators.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +314 to +316
# write fails persistently. Supersedes the legacy binary
# 'reboot.host.and.alert.management.on.heartbeat.timeout' when set to a non-default value.
#
# reboot - immediate sysrq-trigger reboot (default; original behavior)
# graceful-reboot - 'systemctl reboot' instead of sysrq; allows VMs to stop cleanly
# restart-agent - restart cloudstack-agent only; running VMs are preserved
# log-only - log + alert; take no automatic action (admin must investigate)
* <li>{@code reboot} (default) — immediate sysrq-trigger reboot; original behavior</li>
* <li>{@code graceful-reboot} — {@code systemctl reboot} instead of sysrq, lets VMs stop cleanly</li>
* <li>{@code restart-agent} — restart cloudstack-agent only; running VMs preserved</li>
* <li>{@code log-only} — log + alert, no automatic action</li>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

KVM: kvmheartbeat.sh / kvmspheartbeat.sh hardcoded sysrq reboot causes false-positive host fencing on LINSTOR/DRBD primary storage

4 participants