Skip to content

controller: skip MCP use when MachineConfigPool API is absent on HCP#2291

Open
paulczar wants to merge 1 commit into
openshift:develfrom
paulczar:hcp-skip-mcp-watch-when-unavailable
Open

controller: skip MCP use when MachineConfigPool API is absent on HCP#2291
paulczar wants to merge 1 commit into
openshift:develfrom
paulczar:hcp-skip-mcp-watch-when-unavailable

Conversation

@paulczar

@paulczar paulczar commented Jun 5, 2026

Copy link
Copy Markdown

Description of the problem

On ROSA HCP guest clusters, the MachineConfigPool API may be absent while other
MCO types exist. OSC 1.12 unconditionally watches MCP at manager startup and
lists MCPs during reconcile, causing crash loops or repeated ERROR logs before
DaemonSet-based Kata install can complete (KATA-4840).

What I did

  • Probe for MachineConfigPool API availability (same pattern as MachineConfig)
  • Register MCP watch only when the API is present
  • Short-circuit checkConvergedCluster() when MCP is unavailable
  • Unit tests for both paths

How to verify

  1. ROSA HCP cluster without MachineConfigPool API
  2. Set osc-feature-gates deploymentMode: DaemonSet
  3. Install operator and apply KataConfig
  4. Confirm controller-manager stays up (no MCP cache sync failure)
  5. Confirm reconcile does not ERROR-spam on MCP list each loop
  6. Confirm DaemonSet install path progresses (RuntimeClass kata when install completes)

Lab validation (author)

Validated on ROSA HCP with c5n.metal workers: DaemonSet mode, compute pool
auto_repair: false (ROSA ops — not part of this PR), manual reboot after
osc-rpm-install, then RuntimeClass/kata and a pod with runtimeClassName: kata running in a KVM sandbox.
Fixes: rhjira#KATA-4840
Related: KATA-4233 (DaemonSet reboot), KATA-5177 (live-apply), KATA-4597 (DS image)

Changelog

HCP: skip MachineConfigPool integration when MCP API is absent so DaemonSet
deployment mode can run on guest clusters without full MCO.

@coderabbitai

coderabbitai Bot commented Jun 5, 2026

Copy link
Copy Markdown

Warning

Review limit reached

@paulczar, we couldn't start this review because you've reached your PR review rate limit.

More reviews will be available in 6 minutes and 49 seconds. Learn how PR review limits work.

Your organization has run out of usage credits. Purchase more in the billing tab.

⌛ How to resolve this issue?

After more reviews become available, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans include higher PR review limits than trial, open-source, and free plans. In all cases, reviews become available again over time. During sustained high-volume PR review activity, CodeRabbit may temporarily slow when the next review becomes available.

Please see our Fair Usage Limits Policy for further information.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 8615ac0d-5a3d-4af6-8047-8466a7cfecac

📥 Commits

Reviewing files that changed from the base of the PR and between 89d829d and 7c140b2.

📒 Files selected for processing (3)
  • controllers/deployment_mode_handler.go
  • controllers/deployment_mode_handler_test.go
  • controllers/openshift_controller.go
📝 Walkthrough

Walkthrough

This PR adds graceful handling for clusters where the MachineConfigPool API is unavailable. It introduces a new isMachineConfigPoolAvailable() helper that probes the MCP API by attempting a dummy GET request, then uses this helper in two places: SetupWithManager() conditionally registers the MachineConfigPool watch only when the API is available, while always wiring Node and ConfigMap watches; checkConvergedCluster() checks availability upfront and returns non-converged when MCP is unavailable. The changes include unit tests for the detection logic and integrated convergence behavior.

Sequence Diagram

sequenceDiagram
  participant SetupWithManager
  participant isMachineConfigPoolAvailable
  participant KubeAPI as Kubernetes API
  participant ControllerBuilder
  participant checkConvergedCluster
  SetupWithManager->>isMachineConfigPoolAvailable: Probe MCP API availability
  isMachineConfigPoolAvailable->>KubeAPI: GET MachineConfigPool
  alt MCP API Available
    KubeAPI-->>isMachineConfigPoolAvailable: Success or NotFound
    isMachineConfigPoolAvailable-->>SetupWithManager: true
    SetupWithManager->>ControllerBuilder: Register MachineConfigPool watch
  else MCP API Unavailable
    KubeAPI-->>isMachineConfigPoolAvailable: NoKindMatchError
    isMachineConfigPoolAvailable-->>SetupWithManager: false
    SetupWithManager->>ControllerBuilder: Skip MachineConfigPool watch, log info
  end
  ControllerBuilder->>ControllerBuilder: Always register Node and ConfigMap watches
  rect rgba(200, 200, 255, 0.5)
  Note over checkConvergedCluster: During reconciliation
  checkConvergedCluster->>isMachineConfigPoolAvailable: Check MCP availability
  alt MCP Available
    isMachineConfigPoolAvailable-->>checkConvergedCluster: true
    checkConvergedCluster->>checkConvergedCluster: Proceed with MCP convergence check
  else MCP Unavailable
    isMachineConfigPoolAvailable-->>checkConvergedCluster: false
    checkConvergedCluster-->>checkConvergedCluster: Return non-converged
  end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

🚥 Pre-merge checks | ✅ 13 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
Test Structure And Quality ❓ Inconclusive Tests use standard Go testing framework, not Ginkgo BDD-style tests mentioned in the custom check instructions, making the applicability of Ginkgo-specific criteria ambiguous. Clarify whether new test files should follow Ginkgo patterns (matching confidential_handler_test.go) or if standard Go testing is acceptable for unit tests.
✅ Passed checks (13 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: skipping MCP (MachineConfigPool) use when the API is absent on HCP clusters, which directly matches the primary objective of the changeset.
Description check ✅ Passed The description is well-related to the changeset, detailing the problem, solution, verification steps, and lab validation for the MachineConfigPool API availability probing feature.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Stable And Deterministic Test Names ✅ Passed All test names in the new test file are static, descriptive, and deterministic with no dynamic content. Tests use standard Go testing framework, not Ginkgo.
Microshift Test Compatibility ✅ Passed The PR adds only unit tests (standard Go testing library), not Ginkgo e2e tests. The custom check only applies to new Ginkgo e2e tests, so it is not applicable here.
Single Node Openshift (Sno) Test Compatibility ✅ Passed No new Ginkgo e2e tests added. PR only adds standard Go unit tests in deployment_mode_handler_test.go using testing package, not Ginkgo patterns.
Topology-Aware Scheduling Compatibility ✅ Passed PR introduces no scheduling constraints. Changes are conditional watch registration and reconciliation logic for MachineConfigPool API availability, improving compatibility with HyperShift topologies.
Ote Binary Stdout Contract ✅ Passed No stdout violations found. Modified files lack main/init/TestMain functions and use logr.Logger for logging. Test file uses logr.Discard(). All fmt calls are Errorf (not Print/Println).
Ipv6 And Disconnected Network Test Compatibility ✅ Passed PR adds no Ginkgo e2e tests; new tests are standard Go unit tests using testing.T with mock clients, not e2e tests requiring IPv6/disconnected environment compatibility.
No-Weak-Crypto ✅ Passed No weak cryptographic patterns (MD5, SHA1, DES, RC4, 3DES, Blowfish, ECB), custom crypto implementations, or non-constant-time secret comparisons found in modified files.
Container-Privileges ✅ Passed PR modifies only Go controller code, not K8s manifests. No additions of privileged: true, hostPID, hostNetwork, hostIPC, SYS_ADMIN, or allowPrivilegeEscalation.
No-Sensitive-Data-In-Logs ✅ Passed All new logging statements contain only API availability status messages using literal strings with no sensitive data exposed.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci Bot requested review from gkurz and snir911 June 5, 2026 04:20

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@controllers/deployment_mode_handler.go`:
- Around line 127-136: Replace the unbounded context usage in
isMachineConfigPoolAvailable (and similarly in isMachineConfigAvailable) with a
context that has a timeout: create ctx, cancel :=
context.WithTimeout(context.Background(), <reasonableDuration>) and defer
cancel() before calling r.Client.Get, add the context import if missing, and
ensure you handle/return the context deadline exceeded error appropriately
(preserve existing error logic such as k8serrors.IsNotFound).
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 2eb91f3f-33c8-411b-90b1-6ae861869d7e

📥 Commits

Reviewing files that changed from the base of the PR and between 2091b58 and 317cb27.

📒 Files selected for processing (3)
  • controllers/deployment_mode_handler.go
  • controllers/deployment_mode_handler_test.go
  • controllers/openshift_controller.go

Comment thread controllers/deployment_mode_handler.go
… HCP

On ROSA HCP guest clusters, the MachineConfigPool API may be absent while
other MCO types exist. OSC 1.12 unconditionally watches MCP at manager
startup and lists MCPs during reconcile, causing crash loops or repeated
ERROR logs before DaemonSet-based Kata install can complete (KATA-4840).

- Probe for MachineConfigPool API availability (same pattern as MachineConfig)
- Register MCP watch only when the API is present
- Short-circuit checkConvergedCluster() when MCP is unavailable
- Use context with timeout for API probe calls
- Unit tests for both paths

Fixes: rhjira#KATA-4840
Related: KATA-4233, KATA-5177, KATA-4597
@paulczar paulczar force-pushed the hcp-skip-mcp-watch-when-unavailable branch from 89d829d to 7c140b2 Compare June 7, 2026 23:56
@openshift-ci

openshift-ci Bot commented Jun 8, 2026

Copy link
Copy Markdown

@paulczar: all tests passed!

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

@snir911 snir911 left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! looks good to me, pls fix only the commit msg check failure.

IIRC @c3d touched hcp, could you pls have a look

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants