
Conversation

@mshitrit
Member

@mshitrit mshitrit commented Aug 14, 2025

Why we need this PR

The storm recovery mechanism is intended to prevent additional remediations while the cluster is already in a remediation "storm".
Once the number of healthy nodes drops below minHealthy, further fencing is delayed until the number of unhealthy nodes drops to stormRecoveryThreshold. This prevents premature fencing during transient issues and potential "storm" scenarios.
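
The gating rule described here can be sketched as a standalone predicate. This is a simplified sketch, not the operator's actual code; the function name and signature are assumptions:

```go
package main

import "fmt"

// evaluateStorm sketches the gating described above: storm mode begins once
// the healthy-node count hits the minHealthy limit, and new remediations stay
// blocked until the unhealthy count falls to stormRecoveryThreshold.
func evaluateStorm(healthy, unhealthy, minHealthy, stormRecoveryThreshold int, stormActive bool) bool {
	if !stormActive {
		// Enter storm mode when the minHealthy constraint is reached.
		return healthy <= minHealthy
	}
	// Stay in storm mode until the unhealthy count recovers down to the threshold.
	return unhealthy > stormRecoveryThreshold
}

func main() {
	// 20-node scenario from the docs: minHealthy=11, stormRecoveryThreshold=5.
	fmt.Println(evaluateStorm(11, 9, 11, 5, false)) // storm activates
	fmt.Println(evaluateStorm(15, 5, 11, 5, true))  // unhealthy ≤ threshold: storm ends
}
```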

Changes made

  • Added a new optional field to the spec, StormRecoveryThreshold
  • Updated the remediation creation logic according to its value
  • Added webhook validation for the StormRecoveryThreshold
  • Added test code for the new functionality
  • Added documentation

Which issue(s) this PR fixes

RHWA-16

Test plan

Summary by CodeRabbit

  • New Features

    • Added Storm Recovery enhancements with a configurable stormTerminationDelay to delay exit from storm mode and pause new remediations during recovery.
    • Introduced new status timestamps: stormRecoveryStartTime and stormTerminationStartTime for better visibility into storm lifecycle.
    • Updated CRD and operator UI descriptors to surface the new spec and status fields with validation and display metadata.
  • Documentation

    • Added a comprehensive Storm Recovery guide, configuration steps, and updated status field descriptions, including usage of stormTerminationDelay.

@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@coderabbitai

coderabbitai bot commented Aug 14, 2025

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

Adds storm-recovery support to NodeHealthCheck: new spec delay field, new status timestamps, constants and condition utils, controller state machine to activate/delay/terminate storm mode affecting remediation creation, updated CRD/CSV manifests, regenerated deepcopy code, tests for storm scenarios, and documentation.

Changes

Cohort / File(s) Summary of changes
API types & deepcopy
api/v1alpha1/nodehealthcheck_types.go, api/v1alpha1/zz_generated.deepcopy.go
Introduces StormActive/HealthyNodeThresholdChange constants; adds spec stormTerminationDelay; adds status stormRecoveryStartTime and stormTerminationStartTime; updates deepcopy implementations and type aliases accordingly.
CRD schema (bundle + config)
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml, config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml
Extends OpenAPI schemas: new spec duration field with regex validation; new spec/status date-time fields for storm timestamps.
Operator CSV descriptors (bundle + base)
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml, config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml
Adds spec descriptor for stormTerminationDelay and status descriptors for storm timestamp fields with descriptions.
Controller logic
controllers/nodehealthcheck_controller.go
Integrates storm-recovery state machine: evaluates activation, delay, termination; sets conditions/timestamps; skips remediation during storm; computes requeues; logs and emits events. Refactors minHealthy error handling.
Tests
controllers/nodehealthcheck_controller_test.go
Adds storm-recovery tests, helpers to count remediations, factory for NHC with delay; multi-phase assertions; note: duplicate factory definition present.
Condition utilities
controllers/utils/conditions.go
Adds IsConditionSet; refactors IsConditionTrue to use it.
Documentation
docs/configuration.md
Documents storm recovery feature, new spec/status fields, behavior, configuration, and states.
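
The condition-utility refactor listed above (controllers/utils/conditions.go) can be sketched roughly like this, using a minimal stand-in for metav1.Condition; the field names and helper signatures here are assumptions, not the repository's exact code:

```go
package main

import "fmt"

// Condition is a minimal stand-in for metav1.Condition; the real helpers
// operate on []metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True", "False", "Unknown"
	Reason string
}

// IsConditionSet reports whether a condition with the given type, status,
// and reason is present.
func IsConditionSet(conds []Condition, condType, status, reason string) bool {
	for _, c := range conds {
		if c.Type == condType && c.Status == status && c.Reason == reason {
			return true
		}
	}
	return false
}

// IsConditionTrue becomes a thin wrapper over IsConditionSet, as the
// walkthrough describes.
func IsConditionTrue(conds []Condition, condType, reason string) bool {
	return IsConditionSet(conds, condType, "True", reason)
}

func main() {
	conds := []Condition{{Type: "StormActive", Status: "True", Reason: "StormThresholdChange"}}
	fmt.Println(IsConditionTrue(conds, "StormActive", "StormThresholdChange"))
}
```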

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Reconciler Loop
  participant C as NHC Controller
  participant K as API Server
  participant CR as NodeHealthCheck CR
  participant N as Cluster Nodes

  U->>C: Reconcile(NHC)
  C->>K: Get CR, Nodes, Remediations
  C->>C: getMinHealthy()
  C->>C: evaluateStormRecovery(nhc, minHealthy)
  alt Storm activates
    C->>CR: Set Condition StormActive=true<br/>Set stormRecoveryStartTime
    C->>U: Result{RequeueAfter=t}
    note over C,CR: Remediation creation skipped while active
  else Storm active, health regained
    C->>CR: Set stormTerminationStartTime (delay start)
    C->>U: Result{RequeueAfter=remaining delay}
  else Delay elapsed
    C->>CR: Set Condition StormActive=false<br/>Clear timestamps
    C->>U: Continue normal remediation
  end
  C->>K: Update CR Status/Events

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

Suggested labels

ok-to-test

Suggested reviewers

  • slintes
  • clobrano
  • beekhof

Pre-merge checks (2 passed, 1 warning)

❌ Failed checks (1 warning)
Docstring Coverage: ⚠️ Warning. Docstring coverage is 16.67%, below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Title Check: ✅ Passed. The title concisely summarizes the primary change by indicating the addition of a storm recovery mechanism and is directly related to the main functionality introduced in the pull request.
Description Check: ✅ Passed. The pull request description clearly describes the introduction of a storm recovery mechanism, outlines why it is needed, and summarizes the key changes made to implement this feature.

Poem

A thunder rolls across the kube-white plain,
I twitch my ears and set a cautious rein—
“Hold remediations, wait the storm’s delay;”
Two ticking clocks to mark a safer day.
Then sun returns, I hop and say: “Hooray!” 🐇⏳🌤️


@openshift-ci
Contributor

openshift-ci bot commented Aug 14, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mshitrit

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@mshitrit
Member Author

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Aug 14, 2025

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (16)
api/v1alpha1/zz_generated.deepcopy.go (1)

23-28: Note: Generated file looks consistent with controller-gen output

The import alias switch to v1 and the added deepcopy code align with typical controller-gen patterns. No manual edits needed; keep regenerating via make generate when API types change.

api/v1alpha1/nodehealthcheck_webhook_test.go (1)

433-474: Optional: factor node list mocking into a helper for readability

Not required, but a small local helper to populate NodeList would reduce repetition.

docs/configuration.md (5)

62-62: Tighten wording and fix grammar in stormRecoveryThreshold description

Clarify trigger phrasing and article usage.

-| _stormRecoveryThreshold_ | no                                                  | n/a                                                                                             | The threshold for proceeding with remediation during storm scenarios. When storm mode is triggered by number of remediation reaching minHealthy/maxUnhealthy maximum, NHC waits for the unhealthy count to drop to this threshold before resuming normal remediation. See Storm Recovery section below.                                                         |
+| _stormRecoveryThreshold_ | no                                                  | n/a                                                                                             | The threshold for resuming remediation during storm scenarios. When storm mode is triggered by the number of remediations reaching the minHealthy/maxUnhealthy limit, NHC waits for the unhealthy count to drop to this threshold before resuming normal remediation. See the Storm Recovery section below.                                                  |

262-268: Fix typo: “remeditions” → “remediations”

Minor typo in the Normal minHealthy Behavior pseudo-code.

-  → Wait until current remeditions are below maximum before creating more (as defined by minHealthy/maxUnhealthy)
+  → Wait until current remediations are below maximum before creating more (as defined by minHealthy/maxUnhealthy)

243-257: Add language identifiers to fenced code blocks (markdownlint MD040)

Specify a language (use text for pseudo-code/scenarios).

-```
+```text
 Scenario: 20 nodes, minHealthy=11, stormRecoveryThreshold=5

 Normal operation: 15 healthy, 5 unhealthy → 5 remediations created
 Storm hits: 11 healthy, 9 unhealthy → 9 remediations created (minHealthy limit reached)
 Additional failures: 9 healthy, 11 unhealthy → 9 remediations continue, 2 nodes wait due to minHealthy constraint
-
 Key insight: Those 2 additional failed nodes get NO remediation
 - It's intentional restraint
 - Assumption: At this stage creating more remediations might destabilize the system further
 - Strategy: Wait for system to stabilize (unhealthy count ≤ 5) before creating remediations for the 2 additional failed nodes.
 - Risk: If more nodes keep failing, storm mode continues indefinitely even if current remediations finish successfully
 - Reality: In such cases, the cluster may be beyond automated help anyway
Apply the same “text” language spec to other pseudo-code blocks in this section (Normal minHealthy Behavior, Storm Recovery Enhancement, States, Accepted Risks, Risk vs. Benefit).

424-435: Fix emphasis spacing in table (markdownlint MD037)

Avoid spaces inside emphasis markers.

-| _inFlightRemediations_   | ** DEPRECATED ** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes.                                                                                                                                              |
+| _inFlightRemediations_   | **DEPRECATED** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes.                                                                                                                                                |

260-278: Optional clarity: prefer consistent use of “≤” vs. “reaches”

In multiple places you say “reaches stormRecoveryThreshold”; elsewhere you use “≤ threshold”. Consider consistently using “≤ stormRecoveryThreshold” to remove ambiguity.

bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2)

5-43: Add stormRecoveryThreshold to alm-examples for discoverability

Including the new field in the sample helps users discover and understand configuration from the UI.

           "spec": {
             "minHealthy": "51%",
+            "stormRecoveryThreshold": 3,
             "remediationTemplate": {

167-167: Optional: add x-descriptors for storm recovery status fields

Not required, but adding descriptors (e.g., badges) can improve UI presentation in OLM consoles.

controllers/nodehealthcheck_controller.go (2)

336-345: Good switch to shared GetMinHealthy; tighten logging for observedNodes

  • Using remediationv1alpha1.GetMinHealthy centralizes logic and reduces duplication. Good change.
  • Minor: the error log’s "observedNodes" currently logs a pointer (nhc.Status.ObservedNodes). Prefer the concrete number of selected nodes for clarity.
-	log.Error(err, "failed to calculate min healthy allowed nodes",
-		"minHealthy", nhc.Spec.MinHealthy, "maxUnhealthy", nhc.Spec.MaxUnhealthy, "observedNodes", nhc.Status.ObservedNodes)
+	log.Error(err, "failed to calculate min healthy allowed nodes",
+		"minHealthy", nhc.Spec.MinHealthy, "maxUnhealthy", nhc.Spec.MaxUnhealthy, "observedNodes", len(selectedNodes))

961-977: Storm recovery status transitions and eventing look good

  • Correctly sets boolean and start time on activation, clears on deactivation.
  • Emits Warning on start and Normal on end, which matches the operational significance.

Consider defining typed constants for the new event reasons ("StormRecoveryStarted", "StormRecoveryEnded") alongside existing ones for consistency.

api/v1alpha1/nodehealthcheck_webhook.go (1)

267-309: StormRecoveryThreshold validation is solid; consider relaxing the “no nodes match” hard-fail

  • Positives: Validates selector, fetches actual node count, computes minHealthy via shared helper, enforces stormThreshold < totalNodes - minHealthy with a precise error message. All good.

  • Potential UX improvement: Returning an error when the selector matches 0 nodes can block creating NHCs early in cluster bring-up or when labels are not yet applied. Since the controller re-evaluates dynamically, you could relax this to allow creation and defer reachability enforcement to runtime.

If you agree, replace the hard error for totalNodes == 0 with a no-op (or admission warning if desired):

-	if totalNodes == 0 {
-		return fmt.Errorf("no nodes match the selector, cannot validate storm recovery threshold")
-	}
+	if totalNodes == 0 {
+		// Defer strict threshold feasibility validation until nodes are present.
+		return nil
+	}

Alternatively, return an admission warning (requires plumbing warnings up from validate()).

controllers/nodehealthcheck_controller_test.go (4)

1973-1979: Add an initial assertion for StartTime

At initial state, also assert StormRecoveryStartTime is nil to fully capture the expected lifecycle (nil -> set -> nil).

Apply this diff:

-					Eventually(func(g Gomega) {
+					Eventually(func(g Gomega) {
 						g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 						g.Expect(*underTest.Status.HealthyNodes).To(Equal(5))
 						g.Expect(len(underTest.Status.UnhealthyNodes)).To(Equal(2))
 						g.Expect(underTest.Status.StormRecoveryActive).To(BeNil())
+						g.Expect(underTest.Status.StormRecoveryStartTime).To(BeNil())
 					}, "10s", "1s").Should(Succeed())

1969-2030: Reduce flakiness: drive waits off unhealthyConditionDuration instead of fixed literals

These waits depend on condition-duration expiry and reconcile cycles. Consider deriving timeouts from unhealthyConditionDuration to avoid timing brittleness across environments.

For example:

  • Replace "15s" with unhealthyConditionDuration + a small buffer (e.g., 2s).
  • Use shorter polling intervals (e.g., 300–500ms) for faster feedback without excessive load.

No code change required if stability is acceptable; offered for robustness.


2733-2741: Minor style: use ptr.To for threshold initialization

Prefer k8s ptr helpers for clarity and consistency with other tests.

Apply this diff:

 func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
 	nhc := newNodeHealthCheck()
 	// 7-node cluster: minHealthy=4, stormRecoveryThreshold=1
 	minHealthy := intstr.FromInt(4)
-	stormThreshold := 1
 	nhc.Spec.MinHealthy = &minHealthy
-	nhc.Spec.StormRecoveryThreshold = &stormThreshold
+	nhc.Spec.StormRecoveryThreshold = ptr.To(1)
 	return nhc
 }

2735-2738: Clarify comment or de-couple from cluster size

The function prepares an NHC with storm-recovery settings; the “7-node cluster” comment is specific to the test scenario using setupObjects(2,5,…) rather than inherent to this helper. Consider rephrasing to avoid implying a fixed cluster size.

Suggested wording:

  • “Configure minHealthy=4 and StormRecoveryThreshold=1 for storm-recovery scenarios (used with a 7-node cluster in tests).”
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2d87424 and 61eb63f.

📒 Files selected for processing (11)
  • api/v1alpha1/nodehealthcheck_types.go (2 hunks)
  • api/v1alpha1/nodehealthcheck_webhook.go (4 hunks)
  • api/v1alpha1/nodehealthcheck_webhook_test.go (2 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (3 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
  • controllers/nodehealthcheck_controller.go (3 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • docs/configuration.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-28T07:42:13.767Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Applied to files:

  • docs/configuration.md
🧬 Code Graph Analysis (3)
api/v1alpha1/nodehealthcheck_webhook_test.go (1)
vendor/golang.org/x/net/context/context.go (2)
  • Context (54-54)
  • Background (68-70)
controllers/nodehealthcheck_controller.go (4)
api/v1alpha1/nodehealthcheck_webhook.go (1)
  • GetMinHealthy (311-334)
vendor/github.com/medik8s/common/pkg/events/events.go (2)
  • WarningEvent (37-39)
  • NormalEvent (26-28)
controllers/utils/events.go (1)
  • EventReasonRemediationSkipped (6-6)
api/v1alpha1/nodehealthcheck_types.go (1)
  • NodeHealthCheck (368-374)
controllers/nodehealthcheck_controller_test.go (4)
vendor/sigs.k8s.io/controller-runtime/pkg/client/interfaces.go (2)
  • ObjectKeyFromObject (35-37)
  • ObjectKey (32-32)
vendor/k8s.io/api/core/v1/types.go (2)
  • ConditionFalse (2837-2837)
  • ConditionTrue (2836-2836)
api/v1alpha1/nodehealthcheck_webhook.go (1)
  • GetMinHealthy (311-334)
api/v1alpha1/nodehealthcheck_types.go (1)
  • NodeHealthCheck (368-374)
🪛 LanguageTool
docs/configuration.md

[grammar] ~62-~62: There might be a mistake here.
Context: ...scenarios. When storm mode is triggered by number of remediation reaching minHealt...

(QB_NEW_EN)


[grammar] ~62-~62: There might be a mistake here.
Context: ...old before resuming normal remediation. See Storm Recovery section below. ...

(QB_NEW_EN)


[grammar] ~236-~236: There might be a mistake here.
Context: ...iation restraint**. The Philosophy: - Recognition: When many nodes fail, cre...

(QB_NEW_EN)


[grammar] ~260-~260: There might be a mistake here.
Context: ...provides a controlled exit strategy when maximum number of remediation is reache...

(QB_NEW_EN)


[grammar] ~260-~260: There might be a mistake here.
Context: ...um number of remediation is reached due to minHealthy/maxUnhealthy constraint. **...

(QB_NEW_EN)


[grammar] ~262-~262: Ensure spelling is correct
Context: ...lthy/maxUnhealthy constraint. Normal minHealthy Behavior: ``` If (Maximum number of r...

(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)


[grammar] ~281-~281: There might be a mistake here.
Context: ... Storm Recovery is Appropriate When: - Your infrastructure can be overwhelmed b...

(QB_NEW_EN)


[grammar] ~287-~287: There might be a mistake here.
Context: ...torm Recovery is NOT Appropriate When**: - You need aggressive remediation regardle...

(QB_NEW_EN)


[grammar] ~356-~356: There might be a mistake here.
Context: ... Behavior What Storm Recovery Does: - ✅ Preserves existing remediations (l...

(QB_NEW_EN)


[grammar] ~357-~357: There might be a mistake here.
Context: ...sting remediations** (let them complete) - ❌ Blocks new remediations (even for ...

(QB_NEW_EN)


[grammar] ~358-~358: There might be a mistake here.
Context: ...diations** (even for newly failed nodes) - ⏳ Waits for system stabilization (un...

(QB_NEW_EN)


[grammar] ~359-~359: There might be a mistake here.
Context: ...tabilization** (unhealthy count to drop) - 🔄 Resumes when safe (unhealthy ≤ th...

(QB_NEW_EN)


[grammar] ~360-~360: There might be a mistake here.
Context: ...umes when safe** (unhealthy ≤ threshold) What Storm Recovery Does NOT Do: - ❌ Do...

(QB_NEW_EN)


[grammar] ~362-~362: There might be a mistake here.
Context: ...d) What Storm Recovery Does NOT Do: - ❌ Does NOT guarantee all nodes will be r...

(QB_NEW_EN)


[grammar] ~430-~430: There might be a mistake here.
Context: .../workflow.md) for further information. | | phase | A short hum...

(QB_NEW_EN)

🪛 markdownlint-cli2 (0.17.2)
docs/configuration.md

243-243: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


263-263: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


271-271: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


295-295: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


311-311: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


312-312: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


317-317: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


318-318: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


324-324: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


330-330: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


331-331: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


338-338: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


369-369: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


377-377: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


428-428: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)


428-428: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-and-unit-test
  • GitHub Check: e2e-k8s
🔇 Additional comments (24)
api/v1alpha1/zz_generated.deepcopy.go (2)

125-129: DeepCopy for StormRecoveryThreshold is correct

Pointer allocation and value copy are implemented properly for the new spec field.


200-208: DeepCopy for StormRecoveryActive and StormRecoveryStartTime is correct

  • StormRecoveryActive: proper pointer allocation and value copy.
  • StormRecoveryStartTime: uses DeepCopy(), which is required for metav1.Time.
api/v1alpha1/nodehealthcheck_types.go (2)

110-121: Spec: StormRecoveryThreshold field and validation look good

  • Optional pointer, Minimum=0, and clear doc comment establish intended behavior.
  • The “must be less than (totalNodes - minHealthy)” constraint will necessarily live in the webhook, which is fine.

No changes requested.


281-297: Status: StormRecoveryActive/StormRecoveryStartTime additions are sound

  • Optional fields, appropriate types, and CSV descriptors annotations are consistent.
  • Format/date-time annotations for StartTime match metav1.Time expectations.

No changes requested.

api/v1alpha1/nodehealthcheck_webhook_test.go (2)

5-5: Import fmt is appropriate for dynamic node names

This is a reasonable addition for generating mock node lists.


505-507: Verify webhook error messages to avoid test flakiness

If the webhook’s error text is not exactly as asserted, these tests will fail. Consider the suggested less-brittle assertions above. If you prefer exact messages, verify the error string in nodehealthcheck_webhook.go matches these tests.

Would you like me to scan the webhook code and tests across the repo to ensure error strings line up?

bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (3)

140-149: Spec descriptor for stormRecoveryThreshold looks good

Descriptor text matches API comments and clearly explains the constraint and behavior.


201-212: Status descriptors for storm recovery fields are appropriate

Descriptors explain activation semantics and StartTime purpose. Paths align with other status fields.


1-667: Sanity-check complete — CRD and CSV are in sync for storm recovery fields

  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml — contains stormRecoveryThreshold with "minimum: 0" and integer type; also exposes stormRecoveryActive and stormRecoveryStartTime in status.
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml — bundled CRD includes stormRecoveryThreshold and the status fields.
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml — CSV lists spec and status descriptors/paths for stormRecoveryThreshold, stormRecoveryActive, and stormRecoveryStartTime.

No changes required — the CRD OpenAPI schema and CSV are consistent.

docs/configuration.md (1)

228-418: Fix markdownlint issues by adding languages to all fenced blocks in Storm Recovery section

A number of fenced code blocks lack languages (MD040). I proposed one patch above; similar changes should be applied to all pseudo-code blocks in this section for a clean docs build.

I can generate a patch touching all code fences in this section if you want me to apply it across the board.

bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2)

283-293: Spec: stormRecoveryThreshold schema looks correct

  • Integer with minimum: 0 and clear behavior description is good.
  • Constraint note “must be less than (totalNodes - minHealthy)” is helpful, and enforced in the webhook.

452-465: Status: stormRecoveryActive/startTime additions are well-structured

  • Correctly placed under status with appropriate types (boolean and date-time).
  • Clear UX-facing descriptions.
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2)

281-291: Spec: stormRecoveryThreshold schema LGTM

Mirrors the bundle CRD changes; constraints and documentation are consistent.


450-462: Status: stormRecoveryActive/startTime fields LGTM

Accurately reflect runtime state and align with controller/state logic.

controllers/nodehealthcheck_controller.go (2)

347-359: Skip path during storm recovery is reasonable and communicates well

  • Defers remediation while storm recovery is active and emits a clear warning event.
  • Message uses current unhealthy count, configured threshold, and in-progress remediation count. Good operator UX.

951-959: Helper for remediation counting is clear and efficient

Straightforward int counter over status; fits the purposes of logging and optional UX.

config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2)

100-108: CSV: Spec descriptor for stormRecoveryThreshold is well-described

Accurately communicates behavior and constraint to users; path is correct.


161-172: CSV: Status descriptors for stormRecoveryActive/startTime are apt

Good display names and clear descriptions; should render well in UIs.

api/v1alpha1/nodehealthcheck_webhook.go (4)

30-31: Import alias switch to utilerrors is correct

Switching to utilerrors and updating call sites avoids name collisions and is standard.


108-116: Validation aggregation extended with storm recovery and topology checks

Comprehensive aggregation with utilerrors.NewAggregate is appropriate. Nice to keep validation logic centralized.


156-162: Escalating remediations sub-validations aggregated correctly

Good consistency with the main validator’s pattern.


311-334: GetMinHealthy helper is correct and useful across packages

  • Validates mutual exclusivity and negative values.
  • Correctly handles both absolute and percentage intstr semantics.
controllers/nodehealthcheck_controller_test.go (2)

2457-2457: Good move to use the exported API for minHealthy resolution

Switching to v1alpha1.GetMinHealthy centralizes validation and keeps tests aligned with webhook logic. Looks good.


1981-1996: Conclusion — controller uses <= so the test is correct; update PR text or implementation

I checked the code: evaluateStormRecovery returns healthyCount <= minHealthy && inProgressRemediations > stormThreshold (controllers/nodehealthcheck_controller.go:917), so storm recovery activates when healthyCount is less than or equal to minHealthy. The test that expects activation at healthy == minHealthy (controllers/nodehealthcheck_controller_test.go:1981-1996) matches the implementation. The PR summary wording ("below minHealthy") is inaccurate — either update the PR description to say "less than or equal to minHealthy" or change the controller/operator if you intended strictly <.

Points to review:

  • controllers/nodehealthcheck_controller.go:915-917 — activation logic and surrounding comment (comment refers to "remediation count" and is misleading; consider clarifying).
  • controllers/nodehealthcheck_controller_test.go:1981-1996 — test expecting activation at healthy == minHealthy (consistent with implementation).

Recommended actions (pick one):

  • Update PR summary and the controller comment to reflect <= minHealthy, or
  • If intended behavior is strictly <, change the operator in controllers/nodehealthcheck_controller.go and adjust tests accordingly.

Likely an incorrect or invalid review comment.
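
For reference, the activation check quoted in this comment reduces to a simple predicate. A sketch under the review's reading of the code; the standalone function name here is assumed:

```go
package main

import "fmt"

// stormActivates mirrors the activation check quoted from
// evaluateStormRecovery: storm mode starts when healthyCount is less than
// or EQUAL to minHealthy (not strictly below) and in-progress remediations
// exceed the storm threshold.
func stormActivates(healthyCount, minHealthy, inProgressRemediations, stormThreshold int) bool {
	return healthyCount <= minHealthy && inProgressRemediations > stormThreshold
}

func main() {
	// Boundary case the test exercises: healthy == minHealthy still activates.
	fmt.Println(stormActivates(4, 4, 3, 1)) // activates at the boundary
	fmt.Println(stormActivates(5, 4, 3, 1)) // does not activate above it
}
```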

@mshitrit
Member Author

/test 4.19-openshift-e2e

@mshitrit mshitrit force-pushed the storm-recovery branch 2 times, most recently from 021946b to d46900a on September 8, 2025 at 15:20
@mshitrit
Member Author

mshitrit commented Sep 9, 2025

/test 4.19-openshift-e2e

@mshitrit
Member Author

mshitrit commented Sep 9, 2025

@coderabbitai full review

@coderabbitai

coderabbitai bot commented Sep 9, 2025

✅ Actions performed

Full review triggered.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

♻️ Duplicate comments (1)
controllers/nodehealthcheck_controller_test.go (1)

1996-2005: Add assertion that no remediation is started for the newly unhealthy node while storm is active.

This tightens the core guarantee of storm recovery and avoids relying only on aggregate counts. Mirrors prior review feedback.

Apply:

 					// Verify Storm recovery is activated
 					// 7 total, 3 healthy, 4 unhealthy, 3 remediations (additional remediation not created because of min healthy constraint)
 					Eventually(func(g Gomega) {
 						g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 						g.Expect(*underTest.Status.HealthyNodes).To(Equal(3))
 						g.Expect(len(underTest.Status.UnhealthyNodes)).To(Equal(4))
 						g.Expect(getRemediationsCount(underTest)).To(Equal(3))
 						g.Expect(utils.IsConditionTrue(underTest.Status.Conditions, v1alpha1.ConditionTypeStormActive, v1alpha1.ConditionReasonStormThresholdChange)).To(BeTrue())
 						g.Expect(underTest.Status.StormRecoveryStartTime).ToNot(BeNil())
 					}, "5s", "100ms").Should(Succeed())
+
+					// Ensure the newly unhealthy node is NOT remediated while storm recovery is active
+					Consistently(func(g Gomega) {
+						cr := findRemediationCRForNHC("healthy-worker-node-1", underTest)
+						g.Expect(cr).To(BeNil())
+					}, "5s", "500ms").Should(Succeed())
🧹 Nitpick comments (21)
docs/configuration.md (2)

62-62: Tighten spec text and align with validation units

Clarify the gating trigger and list valid units to match the API validation.

-| _stormTerminationDelay_  | no                                                  | n/a                                                                                             | Optional duration. When storm recovery regains the minHealthy/maxUnhealthy constraint, NHC waits this additional time before exiting storm mode. While waiting, no new remediations are created. Example: 30s, 2m, 1h.                                                                                                                                          |
+| _stormTerminationDelay_  | no                                                  | n/a                                                                                             | Optional duration. When the minHealthy/maxUnhealthy constraint is satisfied, NHC stays in storm recovery for this additional delay before exiting. During the delay, no new remediations are created. Examples: 30s, 2m, 1h. Valid units: ns, us/µs, ms, s, m, h.                                                     |

328-339: Resolve markdownlint nit in table and surface new condition in “Conditions” docs

  • Remove spaces inside emphasis (MD037).
  • Also recommend noting StormActive in the “Known .status.conditions.type” sentence for accuracy.
-| _inFlightRemediations_      | ** DEPRECATED ** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes.                                                                                                                                              |
+| _inFlightRemediations_      | **DEPRECATED** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes.                                                                                                                                                |

Outside this hunk (near the “Conditions” field description), change:

  • Known .status.conditions.type are: "Disabled"
    to:
  • Known .status.conditions.type include: "Disabled", "StormActive"
controllers/utils/conditions.go (2)

80-85: Avoid duplicate lookups; simplify condition check

Reduce one meta.FindStatusCondition call and keep semantics identical.

-	if !IsConditionSet(conditions, conditionType, reason) {
-		return false
-	}
-	condition := meta.FindStatusCondition(conditions, conditionType)
-	return condition.Status == metav1.ConditionTrue
+	condition := meta.FindStatusCondition(conditions, conditionType)
+	if condition == nil || condition.Reason != reason {
+		return false
+	}
+	return condition.Status == metav1.ConditionTrue

Optional nit outside this hunk: adjust the comment to “returns”.

-// IsConditionTrue return true when the conditions contain a condition of given type and reason with status true
+// IsConditionTrue returns true when the conditions contain a condition of the given type and reason with status true

87-97: Consider a helper for type-only checks

IsConditionSet requires a reason match. If callers ever need a type-only presence check, add a small helper to avoid passing sentinel values.

Happy to open a follow-up PR with:

func HasConditionType(conditions []metav1.Condition, conditionType string) bool {
	return meta.FindStatusCondition(conditions, conditionType) != nil
}
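To illustrate how the combined type/reason/status check behaves, here is a self-contained sketch; a simplified `Condition` struct stands in for `metav1.Condition`, and the function names are illustrative rather than the operator's actual API:

```go
package main

import "fmt"

// Condition mirrors the shape of metav1.Condition for illustration only.
type Condition struct {
	Type   string
	Status string // "True" or "False"
	Reason string
}

// findCondition returns the first condition with the given type, or nil,
// mimicking meta.FindStatusCondition.
func findCondition(conditions []Condition, conditionType string) *Condition {
	for i := range conditions {
		if conditions[i].Type == conditionType {
			return &conditions[i]
		}
	}
	return nil
}

// isConditionTrue checks type, reason, and status with a single lookup,
// matching the simplification suggested above.
func isConditionTrue(conditions []Condition, conditionType, reason string) bool {
	c := findCondition(conditions, conditionType)
	if c == nil || c.Reason != reason {
		return false
	}
	return c.Status == "True"
}

func main() {
	conds := []Condition{
		{Type: "StormActive", Status: "True", Reason: "HealthyNodeThresholdChange"},
	}
	fmt.Println(isConditionTrue(conds, "StormActive", "HealthyNodeThresholdChange")) // true
	fmt.Println(isConditionTrue(conds, "StormActive", "OtherReason"))                // false
}
```

Note that the single-lookup version also avoids the nil-dereference risk of calling `FindStatusCondition` twice.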
api/v1alpha1/nodehealthcheck_types.go (2)

38-41: Fix StormActive comment typo

Comment incorrectly says “when NHC will get disabled”.

-	// ConditionTypeStormActive is the condition type used when NHC will get disabled
+	// ConditionTypeStormActive indicates storm-recovery mode is active
 	ConditionTypeStormActive = "StormActive"
-	// ConditionReasonStormThresholdChange is the condition reason for a storm change from active to inactive and vice versa
+	// ConditionReasonStormThresholdChange is the reason used when storm mode toggles (active <-> inactive) due to healthy-node threshold changes
 	ConditionReasonStormThresholdChange = "HealthyNodeThresholdChange"

291-309: Align status-field comments with condition name and behavior

  • Change “StormRecoveryActive” to “StormActive” in comments:
-   // StormRecoveryStartTime records when storm recovery mode was activated.
-   // This field is set when StormRecoveryActive becomes true and helps track
+   // StormRecoveryStartTime records when storm recovery mode was activated.
+   // This field is set when the StormActive condition becomes true and helps track
    // how long the system has been in storm recovery mode.
@@
-   // StormTerminationStartTime records when storm recovery mode regained the minHealthy/maxUnhealthy constraint
-   // and the storm is about to end (after NodeHealthCheckSpec.StormTerminationDelay has passed).
+   // StormTerminationStartTime records when the minHealthy/maxUnhealthy constraint was satisfied
+   // and the storm is about to end after NodeHealthCheckSpec.StormTerminationDelay elapses.
  • Update the Status.Conditions documentation to include the new “StormActive” type.
    CRDs already include stormTerminationDelay, stormRecoveryStartTime and stormTerminationStartTime, so no additional regeneration is needed.
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2)

276-289: StormTerminationDelay schema: looks good; add minor validation note (optional).

Pattern and description align with existing duration fields. If you want to enforce non-negative values explicitly at the schema level (beyond the regex), consider an x-kubernetes-validations rule for clarity, though not strictly necessary.


434-446: Status timestamps added correctly.

stormRecoveryStartTime and stormTerminationStartTime are placed under status with date-time format. LGTM. Consider updating the conditions description elsewhere in the CRD to mention StormActive for completeness.

controllers/nodehealthcheck_controller_test.go (6)

1966-1969: Fix comment mismatch with setupObjects arguments.

The comment says “2 unhealthy, 5 healthy” but setupObjects(3, 4, true) creates 3 unhealthy and 4 healthy.

Apply:

-					setupObjects(3, 4, true) // 2 unhealthy, 5 healthy = 7 total
+					setupObjects(3, 4, true) // 3 unhealthy, 4 healthy = 7 total

2016-2023: Typo in log message.

“Elasped” → “Elapsed”.

Apply:

-					By(fmt.Sprintf("Elasped time for node recovery is: %v seconds", elapsed))
+					By(fmt.Sprintf("Elapsed time for node recovery is: %v seconds", elapsed))

2016-2031: Assert storm termination start timestamp after regaining minHealthy (optional).

When HealthyNodes returns to minHealthy, assert StormTerminationStartTime is set to avoid regressions.

Apply:

 					//wait for node to recover
 					Eventually(func(g Gomega) {
 						g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 						g.Expect(*underTest.Status.HealthyNodes).To(Equal(4))
+						// Optional: termination window should start ticking now
+						g.Expect(underTest.Status.StormTerminationStartTime).ToNot(BeNil())
 					}, "5s", "100ms").Should(Succeed())

2032-2041: Reduce flake risk by widening timing margins (optional).

Sleeping 1.5s then waiting 1.5s against a 2s delay leaves little slack for reconcile/event lags. Consider bumping both to 2s.

Apply:

-					time.Sleep(time.Millisecond * 1500)
+					time.Sleep(2 * time.Second)
 ...
-					}, "1500ms", "100ms").Should(Succeed())
+					}, "2s", "100ms").Should(Succeed())

2660-2681: Remediation count may include CRs from other owners; filter by NHC owner to avoid test flakiness.

Current count sums all CRs of the kind, regardless of owner/template. Restrict to CRs owned by this NHC.

Apply:

 func getRemediationsCountForTemplate(nhc *v1alpha1.NodeHealthCheck, templateRef v1.ObjectReference) int {
 	baseCr := newBaseCR(nhc, templateRef)
 	crList := &unstructured.UnstructuredList{Object: baseCr.Object}
-	if err := k8sClient.List(ctx, crList); err == nil {
-		return len(crList.Items)
-	}
-	return 0
+	if err := k8sClient.List(ctx, crList); err != nil {
+		return 0
+	}
+	count := 0
+	for i := range crList.Items {
+		item := crList.Items[i]
+		for _, o := range item.GetOwnerReferences() {
+			if o.UID == nhc.UID {
+				count++
+				break
+			}
+		}
+	}
+	return count
 }

2768-2775: Outdated comment mentions stormRecoveryThreshold.

There’s no spec field named StormRecoveryThreshold. Update the comment to reflect only MinHealthy and StormTerminationDelay.

Apply:

-func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
-	nhc := newNodeHealthCheck()
-	// 7-node cluster: minHealthy=4, stormRecoveryThreshold=1
+func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
+	nhc := newNodeHealthCheck()
+	// 7-node cluster: minHealthy=4, StormTerminationDelay=2s
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1)

202-211: Status descriptors added appropriately

Consider adding x-descriptors for date-time if you want special rendering in some UIs, otherwise this is fine.

controllers/nodehealthcheck_controller.go (4)

935-944: Rename for clarity and use monotonic test clock

  • Function name reads like “exist” instead of “exit”. Rename improves readability.
  • Prefer currentTime() over time.Now() for testability.

Apply:

-func (r *NodeHealthCheckReconciler) shouldExistStormRecovery(nhc *remediationv1alpha1.NodeHealthCheck, minHealthy int) bool {
+func (r *NodeHealthCheckReconciler) shouldExitStormRecovery(nhc *remediationv1alpha1.NodeHealthCheck, minHealthy int) bool {
 	isActive := utils.IsConditionTrue(nhc.Status.Conditions, remediationv1alpha1.ConditionTypeStormActive, remediationv1alpha1.ConditionReasonStormThresholdChange)
-	isDelayElapsed := false
+	isDelayElapsed := false
 	if nhc.Status.StormTerminationStartTime != nil {
-		isDelayElapsed = time.Now().After(nhc.Status.StormTerminationStartTime.Time.Add(nhc.Spec.StormTerminationDelay.Duration))
+		isDelayElapsed = currentTime().After(nhc.Status.StormTerminationStartTime.Time.Add(nhc.Spec.StormTerminationDelay.Duration))
 	}
-	shouldExist := isActive && isMinHealthyConstraintSatisfied(nhc, minHealthy) && isDelayElapsed
-	return shouldExist
+	shouldExit := isActive && isMinHealthyConstraintSatisfied(nhc, minHealthy) && isDelayElapsed
+	return shouldExit
 }

And update the call site (see separate diff in evaluateStormRecovery).


986-1001: Use currentTime() and avoid dual pointer helpers

  • Use currentTime() for consistency/testability.
  • Use k8s.io/utils/pointer for duration pointers and drop the extra ptr import.

Apply:

-func calculateRequeue(active bool, nhc *remediationv1alpha1.NodeHealthCheck) *time.Duration {
+func calculateRequeue(active bool, nhc *remediationv1alpha1.NodeHealthCheck) *time.Duration {
 	if !active || nhc.Status.StormTerminationStartTime == nil {
 		return nil
 	}
-	elapsedTime := time.Now().Sub(nhc.Status.StormTerminationStartTime.Time)
+	elapsedTime := currentTime().Sub(nhc.Status.StormTerminationStartTime.Time)
 	// time left before storm should finish
 	timeLeft := nhc.Spec.StormTerminationDelay.Duration - elapsedTime
 	//Add some requeue buffer
 	var requeueAfter *time.Duration
 	if timeLeft < 0 {
-		requeueAfter = ptr.To(time.Second)
+		requeueAfter = pointer.Duration(1 * time.Second)
 	} else {
-		requeueAfter = ptr.To(timeLeft + time.Second)
+		requeueAfter = pointer.Duration(timeLeft + 1*time.Second)
 	}
 	return requeueAfter
 }

And drop the extra import:

-	"k8s.io/utils/ptr"

45-46: Avoid two pointer helper imports

After changing calculateRequeue to use pointer.Duration, remove k8s.io/utils/ptr to keep imports lean.

-	"k8s.io/utils/ptr"

1003-1011: Unused helper

getRemediationCount is no longer used; consider removing to reduce noise.

config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2)

100-109: Clarify disable semantics for Storm Termination Delay.

If omitting the field disables the delay (nil Duration), state it explicitly to improve UX.

Proposed diff:

-      - description: "StormTerminationDelay introduces a configurable delay after
+      - description: "StormTerminationDelay introduces a configurable delay after
           storm recovery exit criteria are satisfied (for example, when the number
           of healthy nodes rises above the configured minHealthy constraint). While
           this delay is in effect, NHC remains in storm recovery mode and does not
           create new remediations. Once the delay elapses, storm recovery mode exits
-          and normal remediation resumes. \n Expects a string of decimal numbers each
+          and normal remediation resumes. Omitting this field disables the delay.
+          \n Expects a string of decimal numbers each
           with optional fraction and a unit suffix, e.g. \"300ms\", \"1.5h\" or \"2h45m\".
           Valid time units are \"ns\", \"us\" (or \"µs\"), \"ms\", \"s\", \"m\", \"h\"."
         displayName: Storm Termination Delay
         path: stormTerminationDelay

162-171: Avoid Go-type references in user-facing text; specify timestamp format.

Replace “NodeHealthCheckSpec.StormTerminationDelay” with “spec.stormTerminationDelay”, and note these are RFC 3339 timestamps (mirrors other descriptors).

Proposed diff:

-      - description: StormRecoveryStartTime records when storm recovery mode was activated.
-          This field is set when StormRecoveryActive becomes true and helps track
-          how long the system has been in storm recovery mode.
+      - description: StormRecoveryStartTime records when storm recovery mode was activated.
+          This field is set when StormRecoveryActive becomes true and helps track
+          how long the system has been in storm recovery mode. RFC 3339 timestamp.
         displayName: Storm Recovery Start Time
         path: stormRecoveryStartTime
-      - description: StormTerminationStartTime records when storm recovery mode regained
-          the minHealthy/maxUnhealthy constraint and the storm is about to end (after
-          NodeHealthCheckSpec.StormTerminationDelay has passed).
+      - description: StormTerminationStartTime records when storm recovery mode regained
+          the minHealthy/maxUnhealthy constraint and the storm is about to end (after
+          spec.stormTerminationDelay has passed). RFC 3339 timestamp.
         displayName: Storm Termination Start Time
         path: stormTerminationStartTime
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 4a610d2 and d46900a.

📒 Files selected for processing (10)
  • api/v1alpha1/nodehealthcheck_types.go (3 hunks)
  • api/v1alpha1/zz_generated.deepcopy.go (4 hunks)
  • bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
  • bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
  • config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
  • config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
  • controllers/nodehealthcheck_controller.go (4 hunks)
  • controllers/nodehealthcheck_controller_test.go (3 hunks)
  • controllers/utils/conditions.go (1 hunks)
  • docs/configuration.md (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T07:42:13.767Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).

Applied to files:

  • docs/configuration.md
📚 Learning: 2025-05-28T07:55:11.390Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.

Applied to files:

  • controllers/nodehealthcheck_controller_test.go
🪛 markdownlint-cli2 (0.17.2)
docs/configuration.md

243-243: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


257-257: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


265-265: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


296-296: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


297-297: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


302-302: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


303-303: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


308-308: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


309-309: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


315-315: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


316-316: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


332-332: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)


332-332: Spaces inside emphasis markers

(MD037, no-space-in-emphasis)

🔇 Additional comments (9)
api/v1alpha1/nodehealthcheck_types.go (1)

115-131: Spec: StormTerminationDelay text looks good

Validation pattern and JSON tag align with docs; no issues spotted.

api/v1alpha1/zz_generated.deepcopy.go (3)

24-26: Import aliasing change is correct.

Switching meta/v1 to alias v1 and core types to corev1 keeps ObjectReference and Duration/Time types disambiguated. LGTM.


125-135: Deep-copy updates for spec fields are consistent.

  • StormTerminationDelay pointer deep-copied.
  • RemediationTemplate switched to corev1.ObjectReference.
  • HealthyDelay uses v1.Duration.
    No issues.

Also applies to: 147-148


188-196: Approve DeepCopy changes
DeepCopy methods for StormTerminationDelay (*metav1.Duration), StormRecoveryStartTime and StormTerminationStartTime (*metav1.Time), and RemediationTemplate fields align with their definitions in nodehealthcheck_types.go.

config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2)

274-287: StormTerminationDelay field and validation look good

Pattern, units, and description align with existing duration fields. No concerns.


432-444: Status timestamps added correctly

date-time formatting and placement under status are appropriate.

bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1)

140-149: Descriptor for Storm Termination Delay is clear and consistent

Matches CRD semantics and uses the standard duration wording.

controllers/nodehealthcheck_controller.go (2)

350-361: Good integration: gating new remediations during storm recovery

Hooking storm evaluation before remediation creation is correct and minimally invasive.


927-934: Start condition based solely on minHealthy is correct

Using only the minHealthy constraint to activate storm mode matches the intended behavior and avoids coupling to remediation counts.

Comment on lines +228 to +321
## Storm Recovery

Storm recovery is an **optional advanced feature** that provides system stabilization through remediation restraint during mass failure scenarios.

### Concept & Philosophy

**Core Principle**: Storm recovery is about **system stabilization through remediation restraint**.

**The Philosophy**:
- **Recognition**: When many nodes fail, creating more remediations may be counterproductive
- **Assumption**: The system needs time to stabilize before additional remediation attempts
- **Strategy**: Temporary restraint to prevent making a bad situation worse
- **Acceptance**: Some scenarios may be beyond automated recovery (same as minHealthy)

**Real-World Example**:
```
Scenario: 20 nodes, minHealthy=11, stormTerminationDelay=30s

Normal operation: 15 healthy, 5 unhealthy → 5 remediations created
5 additional nodes go down and storm hits: 10 healthy (<11) → storm recovery activates, new remediations are blocked (existing continue)
2 nodes regain health: 12 healthy (≥11) → start 30s delay timer (3 remediations still in flight, 8 unhealthy nodes)
Exit: after the 30s elapse → storm recovery deactivates, remediation creation resumes for the 5 remaining unremediated nodes
```
### How Storm Recovery Works
**The Key Insight**: Storm recovery provides a **controlled exit strategy** when the maximum number of remediations is reached due to the minHealthy/maxUnhealthy constraint.
**Normal minHealthy Behavior**:
```
If (the maximum number of remediations is reached):
→ Block creation of new remediations
→ Existing remediations continue
→ Wait until current remediations are below maximum before creating more (as defined by minHealthy/maxUnhealthy)
```
**Storm Recovery Enhancement**:
```
If (minHealthy/maxUnhealthy blocks creation of new remediations AND stormTerminationDelay is configured):
→ Enter storm recovery mode
→ Block creation of new remediations while storm is active
→ When constraint is satisfied (healthyNodes ≥ minHealthy or maxUnhealthy satisfied):
- Start stormTerminationDelay timer
- Exit storm recovery after delay
- Resume normal remediation creation
```
### When to Use Storm Recovery
**Storm Recovery is Appropriate When**:
- Your infrastructure can be overwhelmed by too many concurrent remediations
- You want controlled recovery during mass failure events
- You prefer "wait and see" over "aggressive remediation" during storms
- You accept that some failure scenarios are beyond automated recovery
**Storm Recovery is NOT Appropriate When**:
- You need aggressive remediation regardless of system load
- Your infrastructure can handle unlimited concurrent remediations
- You can't accept any remediation delays during mass failures
### Configuring Storm Recovery
- Enable by setting `spec.stormTerminationDelay` on the `NodeHealthCheck` CR.
- Choose a delay that matches your environment's stabilization time. While the delay is counting down, new remediations remain blocked.
- If `stormTerminationDelay` is not set, storm recovery mode is disabled and only the standard `minHealthy`/`maxUnhealthy` gating applies.
### Storm Recovery States
**State 1: Normal Operation**
```
healthyNodes ≥ minHealthy
→ Create remediations for unhealthy nodes (as allowed by minHealthy/maxUnhealthy)
```
**State 2: Storm Recovery Active**
```
healthyNodes < minHealthy
→ Block new remediations
```
**State 3: Regaining Health (Delay)**
```
healthyNodes ≥ minHealthy
→ Start stormTerminationDelay timer
→ Continue blocking new remediations until delay elapses
```
**State 4: Storm Recovery Exit**
```
after stormTerminationDelay
→ Exit storm recovery mode
→ Resume creating remediations
```
💡 Verification agent

🧩 Analysis chain

Fix markdownlint issues: add languages to fences and convert bold “State” lines to headings

Addresses MD040 (missing language) and MD036 (emphasis-as-heading).

Run markdownlint after applying.

@@
-**Real-World Example**:
-```
+**Real-World Example**:
+```text
@@
-**Normal minHealthy Behavior**:
-```
+**Normal minHealthy Behavior**:
+```text
@@
-**Storm Recovery Enhancement**:
-```
+**Storm Recovery Enhancement**:
+```text
@@
-**State 1: Normal Operation**
-```
+#### State 1: Normal Operation
+```text
@@
-**State 2: Storm Recovery Active**
-```
+#### State 2: Storm Recovery Active
+```text
@@
-**State 3: Regaining Health (Delay)**
-```
+#### State 3: Regaining Health (Delay)
+```text
@@
-**State 4: Storm Recovery Exit**
-```
+#### State 4: Storm Recovery Exit
+```text

To verify:


🏁 Script executed:

#!/bin/bash
# Validate markdown formatting on this file
npx --yes markdownlint-cli2 docs/configuration.md || true

Length of output: 8714


Add fenced code block languages and convert emphasis-as-heading to headings in the Storm Recovery section
File: docs/configuration.md (≈lines 230–320)

  • Prefix each triple backtick in the Storm Recovery examples and states with a language (e.g. use ```text).
  • Replace the emphasized “State 1: Normal Operation”, “State 2: Storm Recovery Active”, “State 3: Regaining Health (Delay)”, and “State 4: Storm Recovery Exit” lines with actual headings (e.g. #### State 1: Normal Operation).
    Re-run markdownlint (npx --yes markdownlint-cli2 docs/configuration.md) to ensure MD036 and MD040 are cleared.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)

243-243: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


257-257: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


265-265: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


296-296: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


297-297: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


302-302: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


303-303: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


308-308: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


309-309: Fenced code blocks should have a language specified

(MD040, fenced-code-language)


315-315: Emphasis used instead of a heading

(MD036, no-emphasis-as-heading)


316-316: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
In docs/configuration.md around lines 228–321, the Storm Recovery examples use
triple-backtick fences without languages and the section "State X" labels are
formatted as emphasized text; update every code fence in this block to include a
language (e.g., change ``` to ```text) and convert the emphasized state lines
(e.g., "*State 1: Normal Operation*") into proper Markdown headings (e.g., ####
State 1: Normal Operation); after making these changes run markdownlint (npx
--yes markdownlint-cli2 docs/configuration.md) and fix any remaining MD036/MD040
warnings if reported.

… of new remediation as long as the "storm" is active.

Signed-off-by: Michael Shitrit <[email protected]>
