[WIP] Add a storm recovery mechanism #378
base: main
Conversation
Skipping CI for Draft Pull Request.
**Important:** Review skipped. Draft detected. Please check the settings in the CodeRabbit UI or the configuration file. You can disable this status message in the settings.

**Walkthrough:** Adds storm-recovery support to NodeHealthCheck: a new spec delay field, new status timestamps, constants and condition utils, a controller state machine to activate/delay/terminate storm mode affecting remediation creation, updated CRD/CSV manifests, regenerated deepcopy code, tests for storm scenarios, and documentation.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as Reconciler Loop
    participant C as NHC Controller
    participant K as API Server
    participant CR as NodeHealthCheck CR
    participant N as Cluster Nodes
    U->>C: Reconcile(NHC)
    C->>K: Get CR, Nodes, Remediations
    C->>C: getMinHealthy()
    C->>C: evaluateStormRecovery(nhc, minHealthy)
    alt Storm activates
        C->>CR: Set Condition StormActive=true<br/>Set stormRecoveryStartTime
        C->>U: Result{RequeueAfter=t}
        note over C,CR: Remediation creation skipped while active
    else Storm active, health regained
        C->>CR: Set stormTerminationStartTime (delay start)
        C->>U: Result{RequeueAfter=remaining delay}
    else Delay elapsed
        C->>CR: Set Condition StormActive=false<br/>Clear timestamps
        C->>U: Continue normal remediation
    end
    C->>K: Update CR Status/Events
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs
Suggested labels
Suggested reviewers
Pre-merge checks: 2 passed, 1 warning.
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: mshitrit. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing …
@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 6
🧹 Nitpick comments (16)
api/v1alpha1/zz_generated.deepcopy.go (1)
23-28: Note: Generated file looks consistent with controller-gen output

The import alias switch to v1 and the added deepcopy code align with typical controller-gen patterns. No manual edits needed; keep regenerating via `make generate` when API types change.
api/v1alpha1/nodehealthcheck_webhook_test.go (1)
433-474: Optional: factor node list mocking into a helper for readability

Not required, but a small local helper to populate NodeList would reduce repetition.
docs/configuration.md (5)
62-62: Tighten wording and fix grammar in stormRecoveryThreshold description

Clarify trigger phrasing and article usage.

```diff
-| _stormRecoveryThreshold_ | no | n/a | The threshold for proceeding with remediation during storm scenarios. When storm mode is triggered by number of remediation reaching minHealthy/maxUnhealthy maximum, NHC waits for the unhealthy count to drop to this threshold before resuming normal remediation. See Storm Recovery section below. |
+| _stormRecoveryThreshold_ | no | n/a | The threshold for resuming remediation during storm scenarios. When storm mode is triggered by the number of remediations reaching the minHealthy/maxUnhealthy limit, NHC waits for the unhealthy count to drop to this threshold before resuming normal remediation. See the Storm Recovery section below. |
```
262-268: Fix typo: “remeditions” → “remediations”

Minor typo in the Normal minHealthy Behavior pseudo-code.

```diff
-  → Wait until current remeditions are below maximum before creating more (as defined by minHealthy/maxUnhealthy)
+  → Wait until current remediations are below maximum before creating more (as defined by minHealthy/maxUnhealthy)
```
243-257: Add language identifiers to fenced code blocks (markdownlint MD040)

Specify a language (use `text` for pseudo-code/scenarios).

````diff
-```
+```text
 Scenario: 20 nodes, minHealthy=11, stormRecoveryThreshold=5
 Normal operation: 15 healthy, 5 unhealthy → 5 remediations created
 Storm hits: 11 healthy, 9 unhealthy → 9 remediations created (minHealthy limit reached)
 Additional failures: 9 healthy, 11 unhealthy → 9 remediations continue, 2 nodes wait due to minHealthy constraint
 - Key insight: Those 2 additional failed nodes get NO remediation
 - It's intentional restraint
 - Assumption: At this stage creating more remediations might destabilize the system further
 - Strategy: Wait for system to stabilize (unhealthy count ≤ 5) before creating remediations for the 2 additional failed nodes.
 - Risk: If more nodes keep failing, storm mode continues indefinitely even if current remediations finish successfully
 - Reality: In such cases, the cluster may be beyond automated help anyway
````

Apply the same “text” language spec to other pseudo-code blocks in this section (Normal minHealthy Behavior, Storm Recovery Enhancement, States, Accepted Risks, Risk vs. Benefit).

424-435: Fix emphasis spacing in table (markdownlint MD037)

Avoid spaces inside emphasis markers.

```diff
-| _inFlightRemediations_ | ** DEPRECATED ** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes. |
+| _inFlightRemediations_ | **DEPRECATED** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes. |
```
260-278: Optional clarity: prefer consistent use of “≤” vs. “reaches”

In multiple places you say “reaches stormRecoveryThreshold”; elsewhere you use “≤ threshold”. Consider consistently using “≤ stormRecoveryThreshold” to remove ambiguity.
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2)
5-43: Add stormRecoveryThreshold to alm-examples for discoverability

Including the new field in the sample helps users discover and understand configuration from the UI.

```diff
 "spec": {
   "minHealthy": "51%",
+  "stormRecoveryThreshold": 3,
   "remediationTemplate": {
```
167-167: Optional: add x-descriptors for storm recovery status fields

Not required, but adding descriptors (e.g., badges) can improve UI presentation in OLM consoles.
controllers/nodehealthcheck_controller.go (2)
336-345: Good switch to shared GetMinHealthy; tighten logging for observedNodes
- Using remediationv1alpha1.GetMinHealthy centralizes logic and reduces duplication. Good change.
- Minor: the error log’s "observedNodes" currently logs a pointer (nhc.Status.ObservedNodes). Prefer the concrete number of selected nodes for clarity.
```diff
-    log.Error(err, "failed to calculate min healthy allowed nodes",
-        "minHealthy", nhc.Spec.MinHealthy, "maxUnhealthy", nhc.Spec.MaxUnhealthy, "observedNodes", nhc.Status.ObservedNodes)
+    log.Error(err, "failed to calculate min healthy allowed nodes",
+        "minHealthy", nhc.Spec.MinHealthy, "maxUnhealthy", nhc.Spec.MaxUnhealthy, "observedNodes", len(selectedNodes))
```
961-977: Storm recovery status transitions and eventing look good
- Correctly sets boolean and start time on activation, clears on deactivation.
- Emits Warning on start and Normal on end, which matches the operational significance.
Consider defining typed constants for the new event reasons ("StormRecoveryStarted", "StormRecoveryEnded") alongside existing ones for consistency.
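A sketch of the suggested typed constants. The string values are taken from the event reasons quoted above; the placement next to `EventReasonRemediationSkipped` in controllers/utils/events.go is an assumption, and the snippet is wrapped in a runnable `main` for self-containment.

```go
package main

import "fmt"

// Suggested typed constants for the new event reasons, mirroring the strings
// used by the controller. In the repo these would live alongside
// EventReasonRemediationSkipped (assumed location: controllers/utils/events.go).
const (
	EventReasonStormRecoveryStarted = "StormRecoveryStarted"
	EventReasonStormRecoveryEnded   = "StormRecoveryEnded"
)

func main() {
	fmt.Println(EventReasonStormRecoveryStarted, EventReasonStormRecoveryEnded)
}
```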
api/v1alpha1/nodehealthcheck_webhook.go (1)
267-309: StormRecoveryThreshold validation is solid; consider relaxing the “no nodes match” hard-fail
Positives: Validates selector, fetches actual node count, computes minHealthy via shared helper, enforces stormThreshold < totalNodes - minHealthy with a precise error message. All good.
Potential UX improvement: Returning an error when the selector matches 0 nodes can block creating NHCs early in cluster bring-up or when labels are not yet applied. Since the controller re-evaluates dynamically, you could relax this to allow creation and defer reachability enforcement to runtime.
If you agree, replace the hard error for totalNodes == 0 with a no-op (or admission warning if desired):
```diff
-	if totalNodes == 0 {
-		return fmt.Errorf("no nodes match the selector, cannot validate storm recovery threshold")
-	}
+	if totalNodes == 0 {
+		// Defer strict threshold feasibility validation until nodes are present.
+		return nil
+	}
```

Alternatively, return an admission warning (requires plumbing warnings up from validate()).
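If the warning route is preferred, here is a minimal sketch of the idea. It assumes controller-runtime's `admission.Warnings` type (a `[]string`), which is aliased locally so the snippet is self-contained; `validateStormThreshold` is a hypothetical function shape, not the webhook's actual signature.

```go
package main

import "fmt"

// Warnings mirrors sigs.k8s.io/controller-runtime/pkg/webhook/admission.Warnings ([]string).
type Warnings []string

// validateStormThreshold sketches the relaxed check: zero matching nodes yields
// an admission warning instead of a hard error, deferring enforcement to runtime.
func validateStormThreshold(totalNodes, minHealthy, stormThreshold int) (Warnings, error) {
	if totalNodes == 0 {
		return Warnings{"no nodes match the selector; storm recovery threshold feasibility will be validated at runtime"}, nil
	}
	if stormThreshold >= totalNodes-minHealthy {
		return nil, fmt.Errorf("stormRecoveryThreshold %d must be less than totalNodes - minHealthy (%d)", stormThreshold, totalNodes-minHealthy)
	}
	return nil, nil
}

func main() {
	w, err := validateStormThreshold(0, 4, 1)
	fmt.Println(len(w), err) // warning emitted, no error
}
```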
controllers/nodehealthcheck_controller_test.go (4)
1973-1979: Add an initial assertion for StartTime

At initial state, also assert StormRecoveryStartTime is nil to fully capture the expected lifecycle (nil -> set -> nil).
Apply this diff:
```diff
 Eventually(func(g Gomega) {
 	g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 	g.Expect(*underTest.Status.HealthyNodes).To(Equal(5))
 	g.Expect(len(underTest.Status.UnhealthyNodes)).To(Equal(2))
 	g.Expect(underTest.Status.StormRecoveryActive).To(BeNil())
+	g.Expect(underTest.Status.StormRecoveryStartTime).To(BeNil())
 }, "10s", "1s").Should(Succeed())
```
1969-2030: Reduce flakiness: drive waits off unhealthyConditionDuration instead of fixed literals

These waits depend on condition-duration expiry and reconcile cycles. Consider deriving timeouts from unhealthyConditionDuration to avoid timing brittleness across environments.
For example:
- Replace "15s" with unhealthyConditionDuration + a small buffer (e.g., 2s).
- Use shorter polling intervals (e.g., 300–500ms) for faster feedback without excessive load.
No code change required if stability is acceptable; offered for robustness.
2733-2741: Minor style: use ptr.To for threshold initialization

Prefer k8s ptr helpers for clarity and consistency with other tests.
Apply this diff:
```diff
 func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
 	nhc := newNodeHealthCheck()
 	// 7-node cluster: minHealthy=4, stormRecoveryThreshold=1
 	minHealthy := intstr.FromInt(4)
-	stormThreshold := 1
 	nhc.Spec.MinHealthy = &minHealthy
-	nhc.Spec.StormRecoveryThreshold = &stormThreshold
+	nhc.Spec.StormRecoveryThreshold = ptr.To(1)
 	return nhc
 }
```
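For reference, `ptr.To` from k8s.io/utils/ptr is a one-line generic helper; a local equivalent (named `To` here so the snippet needs no external dependency) shows why it removes the intermediate named variable:

```go
package main

import "fmt"

// To mirrors k8s.io/utils/ptr.To: it returns a pointer to its argument,
// avoiding a throwaway local variable just to take its address.
func To[T any](v T) *T {
	return &v
}

func main() {
	threshold := To(1)
	fmt.Println(*threshold) // 1
}
```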
2735-2738: Clarify comment or de-couple from cluster size

The function prepares an NHC with storm-recovery settings; the “7-node cluster” comment is specific to the test scenario using setupObjects(2,5,…) rather than inherent to this helper. Consider rephrasing to avoid implying a fixed cluster size.
Suggested wording:
- “Configure minHealthy=4 and StormRecoveryThreshold=1 for storm-recovery scenarios (used with a 7-node cluster in tests).”
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (11)
- api/v1alpha1/nodehealthcheck_types.go (2 hunks)
- api/v1alpha1/nodehealthcheck_webhook.go (4 hunks)
- api/v1alpha1/nodehealthcheck_webhook_test.go (2 hunks)
- api/v1alpha1/zz_generated.deepcopy.go (3 hunks)
- bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
- bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
- config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
- config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
- controllers/nodehealthcheck_controller.go (3 hunks)
- controllers/nodehealthcheck_controller_test.go (3 hunks)
- docs/configuration.md (2 hunks)
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-05-28T07:42:13.767Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
Applied to files:
docs/configuration.md
🧬 Code Graph Analysis (3)
api/v1alpha1/nodehealthcheck_webhook_test.go (1)
vendor/golang.org/x/net/context/context.go (2)
- Context (54-54)
- Background (68-70)
controllers/nodehealthcheck_controller.go (4)
api/v1alpha1/nodehealthcheck_webhook.go (1)
- GetMinHealthy (311-334)

vendor/github.com/medik8s/common/pkg/events/events.go (2)
- WarningEvent (37-39)
- NormalEvent (26-28)

controllers/utils/events.go (1)
- EventReasonRemediationSkipped (6-6)

api/v1alpha1/nodehealthcheck_types.go (1)
- NodeHealthCheck (368-374)
controllers/nodehealthcheck_controller_test.go (4)
vendor/sigs.k8s.io/controller-runtime/pkg/client/interfaces.go (2)
- ObjectKeyFromObject (35-37)
- ObjectKey (32-32)

vendor/k8s.io/api/core/v1/types.go (2)
- ConditionFalse (2837-2837)
- ConditionTrue (2836-2836)

api/v1alpha1/nodehealthcheck_webhook.go (1)
- GetMinHealthy (311-334)

api/v1alpha1/nodehealthcheck_types.go (1)
- NodeHealthCheck (368-374)
🪛 LanguageTool
docs/configuration.md
[grammar] ~62-~62: There might be a mistake here.
Context: ...scenarios. When storm mode is triggered by number of remediation reaching minHealt...
(QB_NEW_EN)
[grammar] ~62-~62: There might be a mistake here.
Context: ...old before resuming normal remediation. See Storm Recovery section below. ...
(QB_NEW_EN)
[grammar] ~236-~236: There might be a mistake here.
Context: ...iation restraint**. The Philosophy: - Recognition: When many nodes fail, cre...
(QB_NEW_EN)
[grammar] ~260-~260: There might be a mistake here.
Context: ...provides a controlled exit strategy when maximum number of remediation is reache...
(QB_NEW_EN)
[grammar] ~260-~260: There might be a mistake here.
Context: ...um number of remediation is reached due to minHealthy/maxUnhealthy constraint. **...
(QB_NEW_EN)
[grammar] ~262-~262: Ensure spelling is correct
Context: ...lthy/maxUnhealthy constraint. Normal minHealthy Behavior: ``` If (Maximum number of r...
(QB_NEW_EN_ORTHOGRAPHY_ERROR_IDS_1)
[grammar] ~281-~281: There might be a mistake here.
Context: ... Storm Recovery is Appropriate When: - Your infrastructure can be overwhelmed b...
(QB_NEW_EN)
[grammar] ~287-~287: There might be a mistake here.
Context: ...torm Recovery is NOT Appropriate When**: - You need aggressive remediation regardle...
(QB_NEW_EN)
[grammar] ~356-~356: There might be a mistake here.
Context: ... Behavior What Storm Recovery Does: - ✅ Preserves existing remediations (l...
(QB_NEW_EN)
[grammar] ~357-~357: There might be a mistake here.
Context: ...sting remediations** (let them complete) - ❌ Blocks new remediations (even for ...
(QB_NEW_EN)
[grammar] ~358-~358: There might be a mistake here.
Context: ...diations** (even for newly failed nodes) - ⏳ Waits for system stabilization (un...
(QB_NEW_EN)
[grammar] ~359-~359: There might be a mistake here.
Context: ...tabilization** (unhealthy count to drop) - 🔄 Resumes when safe (unhealthy ≤ th...
(QB_NEW_EN)
[grammar] ~360-~360: There might be a mistake here.
Context: ...umes when safe** (unhealthy ≤ threshold) What Storm Recovery Does NOT Do: - ❌ Do...
(QB_NEW_EN)
[grammar] ~362-~362: There might be a mistake here.
Context: ...d) What Storm Recovery Does NOT Do: - ❌ Does NOT guarantee all nodes will be r...
(QB_NEW_EN)
[grammar] ~430-~430: There might be a mistake here.
Context: .../workflow.md) for further information. | | phase | A short hum...
(QB_NEW_EN)
🪛 markdownlint-cli2 (0.17.2)
docs/configuration.md
243-243: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
263-263: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
271-271: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
295-295: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
311-311: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
312-312: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
317-317: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
318-318: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
324-324: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
330-330: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
331-331: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
338-338: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
369-369: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
377-377: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
428-428: Spaces inside emphasis markers
(MD037, no-space-in-emphasis)
428-428: Spaces inside emphasis markers
(MD037, no-space-in-emphasis)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-and-unit-test
- GitHub Check: e2e-k8s
🔇 Additional comments (24)
api/v1alpha1/zz_generated.deepcopy.go (2)
125-129: DeepCopy for StormRecoveryThreshold is correct

Pointer allocation and value copy are implemented properly for the new spec field.
200-208: DeepCopy for StormRecoveryActive and StormRecoveryStartTime is correct
- StormRecoveryActive: proper pointer allocation and value copy.
- StormRecoveryStartTime: uses DeepCopy(), which is required for metav1.Time.
api/v1alpha1/nodehealthcheck_types.go (2)
110-121: Spec: StormRecoveryThreshold field and validation look good
- Optional pointer, Minimum=0, and clear doc comment establish intended behavior.
- The “must be less than (totalNodes - minHealthy)” constraint will necessarily live in the webhook, which is fine.
No changes requested.
281-297: Status: StormRecoveryActive/StormRecoveryStartTime additions are sound
- Optional fields, appropriate types, and CSV descriptors annotations are consistent.
- Format/date-time annotations for StartTime match metav1.Time expectations.
No changes requested.
api/v1alpha1/nodehealthcheck_webhook_test.go (2)
5-5: Import fmt is appropriate for dynamic node names

This is a reasonable addition for generating mock node lists.
505-507: Verify webhook error messages to avoid test flakiness

If the webhook’s error text is not exactly as asserted, these tests will fail. Consider the suggested less-brittle assertions above. If you prefer exact messages, verify the error string in nodehealthcheck_webhook.go matches these tests.
Would you like me to scan the webhook code and tests across the repo to ensure error strings line up?
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (3)
140-149: Spec descriptor for stormRecoveryThreshold looks good

Descriptor text matches API comments and clearly explains the constraint and behavior.
201-212: Status descriptors for storm recovery fields are appropriate

Descriptors explain activation semantics and StartTime purpose. Paths align with other status fields.
1-667: Sanity-check complete — CRD and CSV are in sync for storm recovery fields
- config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml — contains stormRecoveryThreshold with "minimum: 0" and integer type; also exposes stormRecoveryActive and stormRecoveryStartTime in status.
- bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml — bundled CRD includes stormRecoveryThreshold and the status fields.
- bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml — CSV lists spec and status descriptors/paths for stormRecoveryThreshold, stormRecoveryActive, and stormRecoveryStartTime.
No changes required — the CRD OpenAPI schema and CSV are consistent.
docs/configuration.md (1)
228-418: Fix markdownlint issues by adding languages to all fenced blocks in Storm Recovery section

A number of fenced code blocks lack languages (MD040). I proposed one patch above; similar changes should be applied to all pseudo-code blocks in this section for a clean docs build.
I can generate a patch touching all code fences in this section if you want me to apply it across the board.
bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2)
283-293: Spec: stormRecoveryThreshold schema looks correct
- Integer with minimum: 0 and clear behavior description is good.
- Constraint note “must be less than (totalNodes - minHealthy)” is helpful, and enforced in the webhook.
452-465: Status: stormRecoveryActive/startTime additions are well-structured
- Correctly placed under status with appropriate types (boolean and date-time).
- Clear UX-facing descriptions.
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2)
281-291: Spec: stormRecoveryThreshold schema LGTM

Mirrors the bundle CRD changes; constraints and documentation are consistent.
450-462: Status: stormRecoveryActive/startTime fields LGTM

Accurately reflect runtime state and align with controller/state logic.
controllers/nodehealthcheck_controller.go (2)
347-359: Skip path during storm recovery is reasonable and communicates well
- Defers remediation while storm recovery is active and emits a clear warning event.
- Message uses current unhealthy count, configured threshold, and in-progress remediation count. Good operator UX.
951-959: Helper for remediation counting is clear and efficient

Straightforward int counter over status; fits the purposes of logging and optional UX.
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2)
100-108: CSV: Spec descriptor for stormRecoveryThreshold is well-described

Accurately communicates behavior and constraint to users; path is correct.
161-172: CSV: Status descriptors for stormRecoveryActive/startTime are apt

Good display names and clear descriptions; should render well in UIs.
api/v1alpha1/nodehealthcheck_webhook.go (4)
30-31: Import alias switch to utilerrors is correct

Switching to utilerrors and updating call sites avoids name collisions and is standard.
108-116: Validation aggregation extended with storm recovery and topology checks

Comprehensive aggregation with utilerrors.NewAggregate is appropriate. Nice to keep validation logic centralized.
156-162: Escalating remediations sub-validations aggregated correctly

Good consistency with the main validator’s pattern.
311-334: GetMinHealthy helper is correct and useful across packages
- Validates mutual exclusivity and negative values.
- Correctly handles both absolute and percentage intstr semantics.
controllers/nodehealthcheck_controller_test.go (2)
2457-2457: Good move to use the exported API for minHealthy resolution

Switching to v1alpha1.GetMinHealthy centralizes validation and keeps tests aligned with webhook logic. Looks good.
1981-1996: Conclusion: controller uses `<=` so the test is correct; update PR text or implementation

I checked the code: evaluateStormRecovery returns

```go
healthyCount <= minHealthy && inProgressRemediations > stormThreshold
```

(controllers/nodehealthcheck_controller.go:917), so storm recovery activates when healthyCount is less than or equal to minHealthy. The test that expects activation at healthy == minHealthy (controllers/nodehealthcheck_controller_test.go:1981-1996) matches the implementation. The PR summary wording ("below minHealthy") is inaccurate: either update the PR description to say "less than or equal to minHealthy" or change the controller/operator if you intended strictly `<`.

Points to review:
- controllers/nodehealthcheck_controller.go:915-917 — activation logic and surrounding comment (the comment refers to "remediation count" and is misleading; consider clarifying).
- controllers/nodehealthcheck_controller_test.go:1981-1996 — test expecting activation at healthy == minHealthy (consistent with implementation).

Recommended actions (pick one):
- Update the PR summary and the controller comment to reflect `<= minHealthy`, or
- If intended behavior is strictly `<`, change the operator in controllers/nodehealthcheck_controller.go and adjust tests accordingly.

Likely an incorrect or invalid review comment.
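The boundary difference can be pinned down with a tiny table check. `activatesStorm` below is a stand-in reproducing the predicate quoted from controllers/nodehealthcheck_controller.go:917, not the controller's actual function.

```go
package main

import "fmt"

// activatesStorm reproduces the quoted predicate:
// healthyCount <= minHealthy && inProgressRemediations > stormThreshold.
func activatesStorm(healthyCount, minHealthy, inProgress, stormThreshold int) bool {
	return healthyCount <= minHealthy && inProgress > stormThreshold
}

func main() {
	// At the boundary healthy == minHealthy the storm DOES activate,
	// which is why the test expecting activation at that point passes.
	fmt.Println(activatesStorm(4, 4, 3, 1)) // true
	// Strictly-less-than semantics would require healthy < minHealthy:
	fmt.Println(activatesStorm(5, 4, 3, 1)) // false
}
```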
61eb63f to 410ab75 (Compare)
/test 4.19-openshift-e2e
410ab75 to 33cb7e1 (Compare)
021946b to d46900a (Compare)
/test 4.19-openshift-e2e

@coderabbitai full review

✅ Actions performed: Full review triggered.
Actionable comments posted: 3
♻️ Duplicate comments (1)
controllers/nodehealthcheck_controller_test.go (1)
1996-2005: Add assertion that no remediation is started for the newly unhealthy node while storm is active.

This tightens the core guarantee of storm recovery and avoids relying only on aggregate counts. Mirrors prior review feedback.
Apply:
```diff
 // Verify Storm recovery is activated
 // 7 total, 3 healthy, 4 unhealthy, 3 remediations (additional remediation not created because of min healthy constraint)
 Eventually(func(g Gomega) {
 	g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 	g.Expect(*underTest.Status.HealthyNodes).To(Equal(3))
 	g.Expect(len(underTest.Status.UnhealthyNodes)).To(Equal(4))
 	g.Expect(getRemediationsCount(underTest)).To(Equal(3))
 	g.Expect(utils.IsConditionTrue(underTest.Status.Conditions, v1alpha1.ConditionTypeStormActive, v1alpha1.ConditionReasonStormThresholdChange)).To(BeTrue())
 	g.Expect(underTest.Status.StormRecoveryStartTime).ToNot(BeNil())
 }, "5s", "100ms").Should(Succeed())
+
+// Ensure the newly unhealthy node is NOT remediated while storm recovery is active
+Consistently(func(g Gomega) {
+	cr := findRemediationCRForNHC("healthy-worker-node-1", underTest)
+	g.Expect(cr).To(BeNil())
+}, "5s", "500ms").Should(Succeed())
```
🧹 Nitpick comments (21)
docs/configuration.md (2)
62-62: Tighten spec text and align with validation units

Clarify the gating trigger and list valid units to match the API validation.

```diff
-| _stormTerminationDelay_ | no | n/a | Optional duration. When storm recovery regains the minHealthy/maxUnhealthy constraint, NHC waits this additional time before exiting storm mode. While waiting, no new remediations are created. Example: 30s, 2m, 1h. |
+| _stormTerminationDelay_ | no | n/a | Optional duration. When the minHealthy/maxUnhealthy constraint is satisfied, NHC stays in storm recovery for this additional delay before exiting. During the delay, no new remediations are created. Examples: 30s, 2m, 1h. Valid units: ns, us/µs, ms, s, m, h. |
```
328-339: Resolve markdownlint nit in table and surface new condition in “Conditions” docs
- Remove spaces inside emphasis (MD037).
- Also recommend noting StormActive in the “Known .status.conditions.type” sentence for accuracy.
```diff
-| _inFlightRemediations_ | ** DEPRECATED ** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes. |
+| _inFlightRemediations_ | **DEPRECATED** A list of "timestamp - node name" pairs of ongoing remediations. Replaced by unhealthyNodes. |
```

Outside this hunk (near the “Conditions” field description), change:

- Known .status.conditions.type are: "Disabled"

to:

- Known .status.conditions.type include: "Disabled", "StormActive"
controllers/utils/conditions.go (2)
80-85: Avoid duplicate lookups; simplify condition check

Reduce one meta.FindStatusCondition call and keep semantics identical.

```diff
-	if !IsConditionSet(conditions, conditionType, reason) {
-		return false
-	}
-	condition := meta.FindStatusCondition(conditions, conditionType)
-	return condition.Status == metav1.ConditionTrue
+	condition := meta.FindStatusCondition(conditions, conditionType)
+	if condition == nil || condition.Reason != reason {
+		return false
+	}
+	return condition.Status == metav1.ConditionTrue
```

Optional nit outside this hunk: adjust the comment to “returns”.
```diff
-// IsConditionTrue return true when the conditions contain a condition of given type and reason with status true
+// IsConditionTrue returns true when the conditions contain a condition of the given type and reason with status true
```
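A self-contained version of the simplified helper, runnable outside the repo. `Condition` and `findStatusCondition` are minimal local stand-ins for `metav1.Condition` and `meta.FindStatusCondition`; only the single-lookup structure of `IsConditionTrue` is taken from the diff above.

```go
package main

import "fmt"

// Condition is a minimal stand-in for metav1.Condition.
type Condition struct {
	Type   string
	Status string // "True" / "False", mirroring metav1.ConditionTrue/False
	Reason string
}

// findStatusCondition mirrors meta.FindStatusCondition: first match by type, else nil.
func findStatusCondition(conditions []Condition, conditionType string) *Condition {
	for i := range conditions {
		if conditions[i].Type == conditionType {
			return &conditions[i]
		}
	}
	return nil
}

// IsConditionTrue uses the single-lookup form suggested in the diff above.
func IsConditionTrue(conditions []Condition, conditionType, reason string) bool {
	condition := findStatusCondition(conditions, conditionType)
	if condition == nil || condition.Reason != reason {
		return false
	}
	return condition.Status == "True"
}

func main() {
	conds := []Condition{{Type: "StormActive", Status: "True", Reason: "HealthyNodeThresholdChange"}}
	fmt.Println(IsConditionTrue(conds, "StormActive", "HealthyNodeThresholdChange")) // true
	fmt.Println(IsConditionTrue(conds, "StormActive", "OtherReason"))                // false
}
```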
87-97: Consider a helper for type-only checks

IsConditionSet requires a reason match. If callers ever need a type-only presence check, add a small helper to avoid passing sentinel values.
Happy to open a follow-up PR with:
```go
func HasConditionType(conditions []metav1.Condition, conditionType string) bool {
	return meta.FindStatusCondition(conditions, conditionType) != nil
}
```

api/v1alpha1/nodehealthcheck_types.go (2)
38-41: Fix StormActive comment typo

Comment incorrectly says “when NHC will get disabled”.

```diff
-	// ConditionTypeStormActive is the condition type used when NHC will get disabled
+	// ConditionTypeStormActive indicates storm-recovery mode is active
 	ConditionTypeStormActive = "StormActive"
-	// ConditionReasonStormThresholdChange is the condition reason for a storm change from active to inactive and vice versa
+	// ConditionReasonStormThresholdChange is the reason used when storm mode toggles (active <-> inactive) due to healthy-node threshold changes
 	ConditionReasonStormThresholdChange = "HealthyNodeThresholdChange"
```
291-309: Align status-field comments with condition name and behavior

- Change “StormRecoveryActive” to “StormActive” in comments:

```diff
-	// StormRecoveryStartTime records when storm recovery mode was activated.
-	// This field is set when StormRecoveryActive becomes true and helps track
+	// StormRecoveryStartTime records when storm recovery mode was activated.
+	// This field is set when the StormActive condition becomes true and helps track
 	// how long the system has been in storm recovery mode.
@@
-	// StormTerminationStartTime records when storm recovery mode regained the minHealthy/maxUnhealthy constraint
-	// and the storm is about to end (after NodeHealthCheckSpec.StormTerminationDelay has passed).
+	// StormTerminationStartTime records when the minHealthy/maxUnhealthy constraint was satisfied
+	// and the storm is about to end after NodeHealthCheckSpec.StormTerminationDelay elapses.
```

- Update the Status.Conditions documentation to include the new “StormActive” type.

CRDs already include `stormTerminationDelay`, `stormRecoveryStartTime` and `stormTerminationStartTime`, so no additional regeneration is needed.

bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2)
276-289: StormTerminationDelay schema: looks good; add minor validation note (optional).

Pattern and description align with existing duration fields. If you want to enforce non-negative values explicitly at the schema level (beyond the regex), consider an x-kubernetes-validations rule for clarity, though not strictly necessary.
434-446: Status timestamps added correctly.

stormRecoveryStartTime and stormTerminationStartTime are placed under status with date-time format. LGTM. Consider updating the conditions description elsewhere in the CRD to mention StormActive for completeness.
controllers/nodehealthcheck_controller_test.go (6)
1966-1969: Fix comment mismatch with setupObjects arguments.

The comment says “2 unhealthy, 5 healthy” but setupObjects(3, 4, true) creates 3 unhealthy and 4 healthy.
Apply:
```diff
-	setupObjects(3, 4, true) // 2 unhealthy, 5 healthy = 7 total
+	setupObjects(3, 4, true) // 3 unhealthy, 4 healthy = 7 total
```
2016-2023: Typo in log message.

“Elasped” → “Elapsed”.
Apply:
```diff
-	By(fmt.Sprintf("Elasped time for node recovery is: %v seconds", elapsed))
+	By(fmt.Sprintf("Elapsed time for node recovery is: %v seconds", elapsed))
```
2016-2031: Assert storm termination start timestamp after regaining minHealthy (optional).

When HealthyNodes returns to minHealthy, assert StormTerminationStartTime is set to avoid regressions.
Apply:
 //wait for node to recover
 Eventually(func(g Gomega) {
 	g.Expect(k8sClient.Get(context.Background(), client.ObjectKeyFromObject(underTest), underTest)).To(Succeed())
 	g.Expect(*underTest.Status.HealthyNodes).To(Equal(4))
+	// Optional: termination window should start ticking now
+	g.Expect(underTest.Status.StormTerminationStartTime).ToNot(BeNil())
 }, "5s", "100ms").Should(Succeed())
2032-2041: Reduce flake risk by widening timing margins (optional).
Sleeping 1.5s then waiting 1.5s against a 2s delay leaves little slack for reconcile/event lags. Consider bumping both to 2s.
Apply:
-	time.Sleep(time.Millisecond * 1500)
+	time.Sleep(2 * time.Second)
 ...
-	}, "1500ms", "100ms").Should(Succeed())
+	}, "2s", "100ms").Should(Succeed())
2660-2681: Remediation count may include CRs from other owners; filter by NHC owner to avoid test flakiness.
Current count sums all CRs of the kind, regardless of owner/template. Restrict to CRs owned by this NHC.
Apply:
 func getRemediationsCountForTemplate(nhc *v1alpha1.NodeHealthCheck, templateRef v1.ObjectReference) int {
 	baseCr := newBaseCR(nhc, templateRef)
 	crList := &unstructured.UnstructuredList{Object: baseCr.Object}
-	if err := k8sClient.List(ctx, crList); err == nil {
-		return len(crList.Items)
-	}
-	return 0
+	if err := k8sClient.List(ctx, crList); err != nil {
+		return 0
+	}
+	count := 0
+	for i := range crList.Items {
+		item := crList.Items[i]
+		for _, o := range item.GetOwnerReferences() {
+			if o.UID == nhc.UID {
+				count++
+				break
+			}
+		}
+	}
+	return count
 }
2768-2775: Outdated comment mentions stormRecoveryThreshold.
There’s no spec field named StormRecoveryThreshold. Update the comment to reflect only MinHealthy and StormTerminationDelay.
Apply:
-func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
-	nhc := newNodeHealthCheck()
-	// 7-node cluster: minHealthy=4, stormRecoveryThreshold=1
+func newNodeHealthCheckWithStormRecovery() *v1alpha1.NodeHealthCheck {
+	nhc := newNodeHealthCheck()
+	// 7-node cluster: minHealthy=4, StormTerminationDelay=2s

bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1)
202-211: Status descriptors added appropriately.
Consider adding x-descriptors for date-time if you want special rendering in some UIs, otherwise this is fine.
controllers/nodehealthcheck_controller.go (4)
935-944: Rename for clarity and use monotonic test clock
- Function name reads like “exist” instead of “exit”. Rename improves readability.
- Prefer currentTime() over time.Now() for testability.
Apply:
-func (r *NodeHealthCheckReconciler) shouldExistStormRecovery(nhc *remediationv1alpha1.NodeHealthCheck, minHealthy int) bool {
+func (r *NodeHealthCheckReconciler) shouldExitStormRecovery(nhc *remediationv1alpha1.NodeHealthCheck, minHealthy int) bool {
 	isActive := utils.IsConditionTrue(nhc.Status.Conditions, remediationv1alpha1.ConditionTypeStormActive, remediationv1alpha1.ConditionReasonStormThresholdChange)
-	isDelayElapsed := false
+	isDelayElapsed := false
 	if nhc.Status.StormTerminationStartTime != nil {
-		isDelayElapsed = time.Now().After(nhc.Status.StormTerminationStartTime.Time.Add(nhc.Spec.StormTerminationDelay.Duration))
+		isDelayElapsed = currentTime().After(nhc.Status.StormTerminationStartTime.Time.Add(nhc.Spec.StormTerminationDelay.Duration))
 	}
 	shouldExist := isActive && isMinHealthyConstraintSatisfied(nhc, minHealthy) && isDelayElapsed
 	return shouldExist
 }

And update the call site (see separate diff in evaluateStormRecovery).
986-1001: Use currentTime() and avoid dual pointer helpers
- Use currentTime() for consistency/testability.
- Use k8s.io/utils/pointer for duration pointers and drop the extra ptr import.
Apply:
-func calculateRequeue(active bool, nhc *remediationv1alpha1.NodeHealthCheck) *time.Duration {
+func calculateRequeue(active bool, nhc *remediationv1alpha1.NodeHealthCheck) *time.Duration {
 	if !active || nhc.Status.StormTerminationStartTime == nil {
 		return nil
 	}
-	elapsedTime := time.Now().Sub(nhc.Status.StormTerminationStartTime.Time)
+	elapsedTime := currentTime().Sub(nhc.Status.StormTerminationStartTime.Time)
 	// time left before storm should finish
 	timeLeft := nhc.Spec.StormTerminationDelay.Duration - elapsedTime
 	// Add some requeue buffer
 	var requeueAfter *time.Duration
 	if timeLeft < 0 {
-		requeueAfter = ptr.To(time.Second)
+		requeueAfter = pointer.Duration(1 * time.Second)
 	} else {
-		requeueAfter = ptr.To(timeLeft + time.Second)
+		requeueAfter = pointer.Duration(timeLeft + 1*time.Second)
 	}
 	return requeueAfter
 }

And drop the extra import:
- "k8s.io/utils/ptr"
45-46: Avoid two pointer helper imports.
After changing calculateRequeue to use pointer.Duration, remove k8s.io/utils/ptr to keep imports lean.
- "k8s.io/utils/ptr"
1003-1011: Unused helper.
getRemediationCount is no longer used; consider removing to reduce noise.
config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2)
100-109: Clarify disable semantics for Storm Termination Delay.
If omitting the field disables the delay (nil Duration), state it explicitly to improve UX.
Proposed diff:
-  - description: "StormTerminationDelay introduces a configurable delay after
+  - description: "StormTerminationDelay introduces a configurable delay after
     storm recovery exit criteria are satisfied (for example, when the number
     of healthy nodes rises above the configured minHealthy constraint). While
     this delay is in effect, NHC remains in storm recovery mode and does not
     create new remediations. Once the delay elapses, storm recovery mode exits
-    and normal remediation resumes. \n Expects a string of decimal numbers each
+    and normal remediation resumes. Omitting this field disables the delay.
+    \n Expects a string of decimal numbers each
     with optional fraction and a unit suffix, e.g. \"300ms\", \"1.5h\" or \"2h45m\".
     Valid time units are \"ns\", \"us\" (or \"µs\"), \"ms\", \"s\", \"m\", \"h\"."
    displayName: Storm Termination Delay
    path: stormTerminationDelay
162-171: Avoid Go-type references in user-facing text; specify timestamp format.
Replace “NodeHealthCheckSpec.StormTerminationDelay” with “spec.stormTerminationDelay”, and note these are RFC 3339 timestamps (mirrors other descriptors).
Proposed diff:
-  - description: StormRecoveryStartTime records when storm recovery mode was activated.
-    This field is set when StormRecoveryActive becomes true and helps track
-    how long the system has been in storm recovery mode.
+  - description: StormRecoveryStartTime records when storm recovery mode was activated.
+    This field is set when StormRecoveryActive becomes true and helps track
+    how long the system has been in storm recovery mode. RFC 3339 timestamp.
    displayName: Storm Recovery Start Time
    path: stormRecoveryStartTime
-  - description: StormTerminationStartTime records when storm recovery mode regained
-    the minHealthy/maxUnhealthy constraint and the storm is about to end (after
-    NodeHealthCheckSpec.StormTerminationDelay has passed).
+  - description: StormTerminationStartTime records when storm recovery mode regained
+    the minHealthy/maxUnhealthy constraint and the storm is about to end (after
+    spec.stormTerminationDelay has passed). RFC 3339 timestamp.
    displayName: Storm Termination Start Time
    path: stormTerminationStartTime
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
- api/v1alpha1/nodehealthcheck_types.go (3 hunks)
- api/v1alpha1/zz_generated.deepcopy.go (4 hunks)
- bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
- bundle/manifests/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
- config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2 hunks)
- config/manifests/base/bases/node-healthcheck-operator.clusterserviceversion.yaml (2 hunks)
- controllers/nodehealthcheck_controller.go (4 hunks)
- controllers/nodehealthcheck_controller_test.go (3 hunks)
- controllers/utils/conditions.go (1 hunks)
- docs/configuration.md (2 hunks)
🧰 Additional context used
🧠 Learnings (2)
📚 Learning: 2025-05-28T07:42:13.767Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/status.go:74-81
Timestamp: 2025-05-28T07:42:13.767Z
Learning: In the Node Healthcheck Operator's status management for controllers/resources/status.go, when a node's healthy delay period expires, the entire unhealthy node entry is removed from the status via UpdateStatusNodeHealthy rather than just resetting the HealthyDelayed flag to false. The state transition flow is: unhealthy -> healthy with delay (HealthyDelayed=true) -> completely healthy (removed from unhealthy nodes list).
Applied to files:
docs/configuration.md
📚 Learning: 2025-05-28T07:55:11.390Z
Learnt from: mshitrit
PR: medik8s/node-healthcheck-operator#365
File: controllers/resources/manager.go:319-364
Timestamp: 2025-05-28T07:55:11.390Z
Learning: In the node-healthcheck-operator HandleHealthyNode method, when calcCrDeletionDelay fails with an error, the intended behavior is to log the error and proceed with CR deletion (treating it as "no delay configured") rather than aborting reconciliation. This prevents the system from getting stuck when delay calculations fail due to issues like malformed annotations.
Applied to files:
controllers/nodehealthcheck_controller_test.go
🪛 markdownlint-cli2 (0.17.2)
docs/configuration.md
243-243: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
257-257: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
265-265: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
296-296: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
297-297: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
302-302: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
303-303: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
308-308: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
309-309: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
315-315: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
316-316: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
332-332: Spaces inside emphasis markers
(MD037, no-space-in-emphasis)
332-332: Spaces inside emphasis markers
(MD037, no-space-in-emphasis)
🔇 Additional comments (9)
api/v1alpha1/nodehealthcheck_types.go (1)
115-131: Spec: StormTerminationDelay text looks good.
Validation pattern and JSON tag align with docs; no issues spotted.
api/v1alpha1/zz_generated.deepcopy.go (3)
24-26: Import aliasing change is correct.Switching meta/v1 to alias v1 and core types to corev1 keeps ObjectReference and Duration/Time types disambiguated. LGTM.
125-135: Deep-copy updates for spec fields are consistent.
- StormTerminationDelay pointer deep-copied.
- RemediationTemplate switched to corev1.ObjectReference.
- HealthyDelay uses v1.Duration.
No issues.

Also applies to: 147-148
188-196: Approve DeepCopy changes
DeepCopy methods for StormTerminationDelay (*metav1.Duration), StormRecoveryStartTime and StormTerminationStartTime (*metav1.Time), and RemediationTemplate fields align with their definitions in nodehealthcheck_types.go.
config/crd/bases/remediation.medik8s.io_nodehealthchecks.yaml (2)
274-287: StormTerminationDelay field and validation look good.
Pattern, units, and description align with existing duration fields. No concerns.
432-444: Status timestamps added correctly.
date-time formatting and placement under status are appropriate.
bundle/manifests/node-healthcheck-operator.clusterserviceversion.yaml (1)
140-149: Descriptor for Storm Termination Delay is clear and consistent.
Matches CRD semantics and uses the standard duration wording.
controllers/nodehealthcheck_controller.go (2)
350-361: Good integration: gating new remediations during storm recovery.
Hooking storm evaluation before remediation creation is correct and minimally invasive.
927-934: Start condition based solely on minHealthy is correct.
Using only the minHealthy constraint to activate storm mode matches the intended behavior and avoids coupling to remediation counts.
## Storm Recovery

Storm recovery is an **optional advanced feature** that provides system stabilization through remediation restraint during mass failure scenarios.

### Concept & Philosophy

**Core Principle**: Storm recovery is about **system stabilization through remediation restraint**.

**The Philosophy**:
- **Recognition**: When many nodes fail, creating more remediations may be counterproductive
- **Assumption**: The system needs time to stabilize before additional remediation attempts
- **Strategy**: Temporary restraint to prevent making a bad situation worse
- **Acceptance**: Some scenarios may be beyond automated recovery (same as minHealthy)

**Real-World Example**:
```
Scenario: 20 nodes, minHealthy=11, stormTerminationDelay=30s

Normal operation: 15 healthy, 5 unhealthy → 5 remediations created
5 additional nodes go down and the storm hits: 10 healthy (<11) → storm recovery activates, new remediations are blocked (existing ones continue)
2 nodes regain health: 12 healthy (≥11) → start the 30s delay timer (3 remediations left, 8 unhealthy nodes)
Exit: after the 30s delay elapses → storm recovery deactivates, resume creating remediations for the 5 additional nodes
```
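The walkthrough above can be condensed into a small decision helper. This is an illustrative model of the gating logic, not the operator's actual code: the `stormState` struct and `allowNewRemediations` name are invented here, and the `stormTerminationDelay` timer is reduced to a boolean flag:

```go
package main

import "fmt"

// stormState models the simplified storm-recovery phases from the example.
// Illustrative sketch only; the real controller tracks this via CR status
// conditions and timestamps.
type stormState struct {
	active       bool // storm recovery mode is on
	delayStarted bool // termination delay timer is running
	delayElapsed bool // termination delay has fully passed
}

// allowNewRemediations reports whether NHC may create new remediations
// given the healthy-node count and the current storm state.
func allowNewRemediations(healthy, minHealthy int, s *stormState) bool {
	if healthy < minHealthy {
		// Constraint violated: (re)enter storm recovery and block creation.
		s.active = true
		s.delayStarted = false
		return false
	}
	if s.active {
		if !s.delayStarted {
			s.delayStarted = true // start the stormTerminationDelay timer
		}
		if !s.delayElapsed {
			return false // still inside the termination delay
		}
		s.active = false // delay elapsed: exit storm recovery
	}
	return true
}

func main() {
	s := &stormState{}
	fmt.Println(allowNewRemediations(15, 11, s)) // normal operation
	fmt.Println(allowNewRemediations(10, 11, s)) // storm hits
	fmt.Println(allowNewRemediations(12, 11, s)) // health regained, delay running
	s.delayElapsed = true
	fmt.Println(allowNewRemediations(12, 11, s)) // delay elapsed
}
```

The four calls correspond to the four lines of the example scenario: allow, block on storm entry, block during the delay, allow again after exit.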
### How Storm Recovery Works
**The Key Insight**: Storm recovery provides a **controlled exit strategy** when the maximum number of remediations is reached due to the minHealthy/maxUnhealthy constraint.

**Normal minHealthy Behavior**:
```
If (maximum number of remediations is reached):
    → Block creation of new remediations
    → Existing remediations continue
    → Wait until current remediations are below the maximum before creating more (as defined by minHealthy/maxUnhealthy)
```
**Storm Recovery Enhancement**:
```
If (minHealthy/maxUnhealthy blocks creation of new remediations AND stormTerminationDelay is configured):
    → Enter storm recovery mode
    → Block creation of new remediations while the storm is active
    → When the constraint is satisfied (healthyNodes ≥ minHealthy or maxUnhealthy satisfied):
        - Start the stormTerminationDelay timer
        - Exit storm recovery after the delay
        - Resume normal remediation creation
```
### When to Use Storm Recovery
**Storm Recovery is Appropriate When**:
- Your infrastructure can be overwhelmed by too many concurrent remediations
- You want controlled recovery during mass failure events
- You prefer "wait and see" over "aggressive remediation" during storms
- You accept that some failure scenarios are beyond automated recovery

**Storm Recovery is NOT Appropriate When**:
- You need aggressive remediation regardless of system load
- Your infrastructure can handle unlimited concurrent remediations
- You can't accept any remediation delays during mass failures

### Configuring Storm Recovery
- Enable by setting `spec.stormTerminationDelay` on the `NodeHealthCheck` CR.
- Choose a delay that matches your environment's stabilization time. While the delay is counting down, new remediations remain blocked.
- If `stormTerminationDelay` is not set, storm recovery mode is disabled and only the standard `minHealthy`/`maxUnhealthy` gating applies.
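For illustration, a `NodeHealthCheck` enabling storm recovery might look like this (the remediation template reference below is a placeholder; point it at whichever remediator is deployed in your cluster):

```yaml
apiVersion: remediation.medik8s.io/v1alpha1
kind: NodeHealthCheck
metadata:
  name: nhc-with-storm-recovery
spec:
  minHealthy: 51%
  # Optional: keep blocking new remediations for 30s after health is regained
  stormTerminationDelay: 30s
  remediationTemplate:            # placeholder template reference
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    kind: SelfNodeRemediationTemplate
    namespace: openshift-operators
    name: self-node-remediation-resource-deletion-template
```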
### Storm Recovery States
**State 1: Normal Operation**
```
healthyNodes ≥ minHealthy
→ Create remediations for unhealthy nodes (as allowed by minHealthy/maxUnhealthy)
```
**State 2: Storm Recovery Active**
```
healthyNodes < minHealthy
→ Block new remediations
```
**State 3: Regaining Health (Delay)**
```
healthyNodes ≥ minHealthy
→ Start stormTerminationDelay timer
→ Continue blocking new remediations until delay elapses
```
**State 4: Storm Recovery Exit**
```
after stormTerminationDelay
→ Exit storm recovery mode
→ Resume creating remediations
```
💡 Verification agent
🧩 Analysis chain
Fix markdownlint issues: add languages to fences and convert bold “State” lines to headings
Addresses MD040 (missing language) and MD036 (emphasis-as-heading).
Run markdownlint after applying.
@@
-**Real-World Example**:
-```
+**Real-World Example**:
+```text
@@
-**Normal minHealthy Behavior**:
-```
+**Normal minHealthy Behavior**:
+```text
@@
-**Storm Recovery Enhancement**:
-```
+**Storm Recovery Enhancement**:
+```text
@@
-**State 1: Normal Operation**
-```
+#### State 1: Normal Operation
+```text
@@
-**State 2: Storm Recovery Active**
-```
+#### State 2: Storm Recovery Active
+```text
@@
-**State 3: Regaining Health (Delay)**
-```
+#### State 3: Regaining Health (Delay)
+```text
@@
-**State 4: Storm Recovery Exit**
-```
+#### State 4: Storm Recovery Exit
+```text

To verify:
🏁 Script executed:
#!/bin/bash
# Validate markdown formatting on this file
npx --yes markdownlint-cli2 docs/configuration.md || true

Length of output: 8714
Add fenced code block languages and convert emphasis-as-heading to headings in the Storm Recovery section
File: docs/configuration.md (≈lines 230–320)
- Prefix each triple backtick in the Storm Recovery examples and states with a language (e.g. use ```text).
- Replace the emphasized “State 1: Normal Operation”, “State 2: Storm Recovery Active”, “State 3: Regaining Health (Delay)”, and “State 4: Storm Recovery Exit” lines with actual headings (e.g.
#### State 1: Normal Operation).
Re-run markdownlint (npx --yes markdownlint-cli2 docs/configuration.md) to ensure MD036 and MD040 are cleared.
🧰 Tools
🪛 markdownlint-cli2 (0.17.2)
243-243: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
257-257: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
265-265: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
296-296: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
297-297: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
302-302: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
303-303: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
308-308: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
309-309: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
315-315: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
316-316: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In docs/configuration.md around lines 228–321, the Storm Recovery examples use
triple-backtick fences without languages and the section "State X" labels are
formatted as emphasized text; update every code fence in this block to include a
language (e.g., change ``` to ```text) and convert the emphasized state lines
(e.g., "*State 1: Normal Operation*") into proper Markdown headings (e.g., ####
State 1: Normal Operation); after making these changes run markdownlint (npx
--yes markdownlint-cli2 docs/configuration.md) and fix any remaining MD036/MD040
warnings if reported.
… of new remediation as long as the "storm" is active. Signed-off-by: Michael Shitrit <[email protected]>
d46900a to
c548d9e
Compare
Why we need this PR
The storm recovery mechanism is intended to prevent additional remediations when already in a remediation "storm" mode.
Once the number of healthy nodes drops below minHealthy, fencing should be delayed until the number of unhealthy nodes reaches this stormRecoveryThreshold. This would prevent premature fencing during transient issues and potential "storm" scenarios.
Changes made
StormRecoveryThresholdStormRecoveryThresholdWhich issue(s) this PR fixes
RHWA-16
Test plan
Summary by CodeRabbit
New Features
Documentation