Skip to content

Conversation

@theobarberbany
Copy link
Contributor

@theobarberbany theobarberbany commented Sep 29, 2025

Summary by CodeRabbit

  • New Features

    • Added resolver to map infra resources to machine reconciles.
    • New public terminal errors to indicate invalid infra references.
  • Refactor

    • Terminal-error handling added to avoid futile requeues and emit warnings.
    • Reconciler and finalizer flows adjusted for more predictable sync and deletion handling.
    • Standardized structured logging across controllers for clearer operational output.
  • Tests

    • Hardened ownership wiring, sentinel resources, and switched tests to controller-runtime/Ginkgo logging.
  • Chores

    • Updated watch/filter utilities and logging integration.

@coderabbitai
Copy link

coderabbitai bot commented Sep 29, 2025

Walkthrough

Adds terminal infra-reference errors and handling in reconciliations; introduces ResolveCAPIMachineFromInfraMachine to enqueue CAPI Machines from InfraMachine ownerRefs; replaces klog with controller-runtime/GinkgoLogr logging in utilities and tests; updates reconciler signatures to return explicit requeue flags; adjusts tests for ownerReference wiring.

Changes

Cohort / File(s) Summary
Machine sync controller
pkg/controllers/machinesync/machine_sync_controller.go
Adds terminal errors errInvalidInfraClusterReference, errInvalidInfraMachineReference and constant terminalConfigurationErrorLog; switches InfraMachine watch to util.ResolveCAPIMachineFromInfraMachine; pre-validates infra refs and returns terminal errors; handles terminal errors to avoid requeues and emit warnings; updates reconcileMAPItoCAPIMachineDeletion and ensureSyncFinalizer signatures to return (shouldRequeue bool, err error); logging cleanup.
Watch filter utilities
pkg/util/watch_filters.go
Replaces klog with controller-runtime structured logging; adds ResolveCAPIMachineFromInfraMachine(namespace) to parse ownerRefs (APIVersion/Kind) and enqueue reconcile.Requests for owning CAPI Machine; preserves namespace filtering and adds parsing/logging paths.
Machine sync tests
pkg/controllers/machinesync/machine_sync_controller_test.go
Enables test logging via GinkgoLogr in manager options; rewires test setup to create MAPI/CAPI infra and ownerReferences earlier; replaces some direct infra creations with owner-referenced resources; adjusts assertions and sequencing to match ownership and admission/policy expectations.
Test suite logging
pkg/controllers/machinesync/suite_test.go
Replaces klog textlogger with GinkgoLogr by calling logf.SetLogger(GinkgoLogr) and ctrl.SetLogger(GinkgoLogr); updates imports and BeforeSuite setup.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Infra as InfraMachine (event)
  participant Resolver as ResolveCAPIMachineFromInfraMachine
  participant Reconciler as MachineSyncReconciler
  participant API as Kubernetes API

  Infra->>Resolver: event (create/update/delete)
  Resolver->>Reconciler: enqueue reconcile for owning CAPI Machine

  Reconciler->>API: Get CAPI Machine
  Reconciler->>API: Get MAPI Machine
  Reconciler->>API: Get InfraMachine reference

  alt terminal invalid infra refs
    Reconciler->>Reconciler: detect errInvalidInfraClusterReference / errInvalidInfraMachineReference
    Reconciler-->>Reconciler: log terminal error, emit warning event, do not requeue
  else normal sync flow
    Reconciler->>Reconciler: ensureSyncFinalizer(...) → (shouldRequeue)
    alt deletion
      Reconciler->>Reconciler: reconcileMAPItoCAPIMachineDeletion(...) → (shouldRequeue, err)
    else update/sync
      Reconciler->>API: Patch status / conditions
    end
  end
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

  • Pay attention to terminal-error branching in fetchCAPIInfraResources and how events/logging are emitted.
  • Verify signature changes propagate correctly to callers and tests.
  • Review new ResolveCAPIMachineFromInfraMachine parsing of APIVersion/Kind for edge cases.

Poem

I nibble ownerRefs by moonlit code,
Hop bridges built from infra node to node.
Finalizers snug, no needless spin,
Logs that whisper where errors begin.
A rabbit stamps — reconciliation wins. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title Check ✅ Passed The PR title "OCPBUGS-62325: Updates InfraMachine watch_filters for MachineSync controller" is related to the changeset and accurately describes a real part of the changes. The pull request does include updates to watch_filters, specifically the addition of a new public function ResolveCAPIMachineFromInfraMachine in pkg/util/watch_filters.go and the switch from util.RewriteNamespace to util.ResolveCAPIMachineFromInfraMachine in the SetupWithManager method. However, the title does not fully capture the broader scope of the PR, which also includes significant changes such as terminal configuration error handling with new error variables (errInvalidInfraClusterReference, errInvalidInfraMachineReference), updates to reconciliation logic with terminal-error handling, method signature changes, and test infrastructure updates. The title is specific and clear, referring to a concrete aspect of the change, though it represents a partial view of the overall changeset.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 961a450 and 72dbdfd.

📒 Files selected for processing (2)
  • pkg/controllers/machinesync/machine_sync_controller.go (9 hunks)
  • pkg/controllers/machinesync/machine_sync_controller_test.go (7 hunks)
🔇 Additional comments (10)
pkg/controllers/machinesync/machine_sync_controller_test.go (3)

270-292: Test wiring correctly establishes owner references for watch filter validation.

The restructured test properly creates the CAPI machine first, then sets it as an owner reference on the CAPA machine. This ordering is necessary to test the new ResolveCAPIMachineFromInfraMachine watch filter introduced in the controller changes (line 176 of machine_sync_controller.go).


961-974: Owner reference wiring added to admission policy tests.

The addition of owner references on the CAPA machine in the validation admission policy tests maintains consistency with the test wiring changes throughout the file and ensures the controller's watch filter can correctly resolve relationships.


1030-1043: Sentinel machine setup correctly validates VAP enforcement.

The throwaway sentinel machine with WithGenerateName is properly used to verify that the ValidatingAdmissionPolicy is active before proceeding with the actual test assertions. This is a good testing practice.

pkg/controllers/machinesync/machine_sync_controller.go (7)

126-135: Terminal configuration errors properly defined.

The new sentinel error variables errInvalidInfraClusterReference and errInvalidInfraMachineReference, along with the logging constant terminalConfigurationErrorLog, follow Go best practices and enable consistent handling of unrecoverable configuration errors across the reconciliation paths.


176-176: Core improvement: InfraMachine watch now resolves via owner references.

This change updates the InfraMachine watch filter from simple namespace rewriting to owner-reference-based resolution. The new ResolveCAPIMachineFromInfraMachine function enables the controller to correctly enqueue reconcile requests for MAPI Machines when their corresponding InfraMachine resources change, by following the InfraMachine → CAPI Machine → MAPI Machine relationship chain.


295-324: Terminal configuration error handling prevents infinite requeues.

The code correctly identifies terminal configuration errors (empty cluster name or invalid infrastructure machine reference) and avoids requeuing reconciliation attempts. When no MAPI machine exists, it emits a warning event on the CAPI machine; when a MAPI machine exists, it sets the synchronized condition to False. This dual-path handling is appropriate for the different scenarios.


438-442: Terminal error handling consistent across reconciliation directions.

The terminal configuration error handling in the MAPI→CAPI reconciliation path mirrors the pattern in the CAPI→MAPI path, maintaining consistency. The synchronized condition is already set in the earlier error block (lines 432-436), so this path correctly logs and returns without requeue.


983-993: Pre-validation prevents downstream nil pointer issues.

The validation of required references (cluster name and infrastructure machine reference) before API calls is a defensive programming practice that catches configuration errors early. The wrapped sentinel errors provide context (machine namespace/name) while allowing callers to detect terminal configuration errors via errors.Is().


1014-1015: Function signature corrected and logger properly scoped.

The typo shouldReqeueshouldRequeue has been fixed, and the logger is now properly declared at the function start rather than being scattered throughout. This addresses the issues noted in previous reviews.


1231-1231: Function signature improved for clarity.

The updated signature with shouldRequeue as a named return parameter improves code readability and maintains consistency with reconcileMAPItoCAPIMachineDeletion. The explicit naming clarifies the purpose of the boolean return value.

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.5.0)

Error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions
The command is terminated due to an error: can't load config: unsupported version of the configuration: "" See https://golangci-lint.run/docs/product/migration-guide for migration instructions


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 29, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Sep 29, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@theobarberbany theobarberbany changed the title Updates InfraMachine watch_filters OCPBUGS-62325: Updates InfraMachine watch_filters Sep 29, 2025
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Sep 29, 2025
@openshift-ci-robot
Copy link

@theobarberbany: This pull request references Jira Issue OCPBUGS-62325, which is invalid:

  • expected the bug to target the "4.21.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch 2 times, most recently from 4b03d42 to a3b9582 Compare September 29, 2025 15:14
@theobarberbany theobarberbany changed the title OCPBUGS-62325: Updates InfraMachine watch_filters OCPBUGS-62325: Updates InfraMachine watch_filters for MachineSync controller Sep 29, 2025
@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from a3b9582 to 6c6813f Compare October 1, 2025 16:41
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 1, 2025
@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch 2 times, most recently from d360bd1 to 5e6d8d4 Compare October 2, 2025 21:31
@theobarberbany
Copy link
Contributor Author

@coderabbitai full review

@coderabbitai
Copy link

coderabbitai bot commented Oct 2, 2025

✅ Actions performed

Full review triggered.

@openshift-ci-robot openshift-ci-robot added jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. and removed jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Oct 2, 2025
@openshift-ci-robot
Copy link

@theobarberbany: This pull request references Jira Issue OCPBUGS-62325, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

Summary by CodeRabbit

  • Refactor

  • Improved reconciliation timing with a unified requeue interval for smoother, more predictable sync behavior.

  • Refined event handling to trigger reconciles from infrastructure ownership relationships, enhancing responsiveness and correctness.

  • Streamlined logging for clearer, more consistent operational insights.

  • Tests

  • Expanded and hardened test scenarios around ownership wiring and reconciliation ordering to ensure reliability.

  • Unified test logging configuration for better diagnostic output.

  • Chores

  • Standardized watch/filter utilities and logging prefixes for consistent observability across components.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested a review from sunzhaohua2 October 2, 2025 21:33
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/controllers/machinesync/machine_sync_controller.go (2)

1039-1046: Do not wrap a nil error when removing the MAPI sync finalizer.

This path now always returns an error (fmt.Errorf(... %w, err) even when err is nil, breaking the deletion flow.

Use this fix:

-		_, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
-
-		return false, fmt.Errorf("failed to remove finalizer: %w", err)
+		changed, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
+		if err != nil {
+			return false, fmt.Errorf("failed to remove finalizer: %w", err)
+		}
+		return changed, nil

1161-1167: Likewise, avoid wrapping a nil error when pruning the CAPI sync finalizer.

Here too, fmt.Errorf(... %w, err) returns a non-nil error even when err is nil, so reconciliation always fails instead of continuing.

Patch it like this:

-		_, err := util.RemoveFinalizer(ctx, r.Client, capiMachine, SyncFinalizer)
-
-		return false, fmt.Errorf("failed to remove finalizer: %w", err)
+		changed, err := util.RemoveFinalizer(ctx, r.Client, capiMachine, SyncFinalizer)
+		if err != nil {
+			return false, fmt.Errorf("failed to remove finalizer: %w", err)
+		}
+		return changed, nil
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 5d63584 and 5e6d8d4.

📒 Files selected for processing (4)
  • pkg/controllers/machinesync/machine_sync_controller.go (8 hunks)
  • pkg/controllers/machinesync/machine_sync_controller_test.go (14 hunks)
  • pkg/controllers/machinesync/suite_test.go (2 hunks)
  • pkg/util/watch_filters.go (4 hunks)

@theobarberbany
Copy link
Contributor Author

/test unit

@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from 5e6d8d4 to db0597d Compare October 3, 2025 11:49
@openshift-merge-robot openshift-merge-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 3, 2025
@openshift-merge-robot openshift-merge-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Oct 11, 2025
@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch 3 times, most recently from 856d7b6 to 125151e Compare October 16, 2025 14:05
@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from 31c38ef to 6a39fb7 Compare October 22, 2025 13:39
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (2)
pkg/controllers/machinesync/machine_sync_controller_test.go (1)

193-194: Verify GinkgoLogr reference.

Past review flagged this line as requiring komega.GinkgoLogr instead of GinkgoLogr. The current code shows GinkgoLogr without a package qualifier—confirm this resolves correctly or update to komega.GinkgoLogr.

Run this check to verify the symbol resolves:

#!/bin/bash
# Description: Check if GinkgoLogr is properly defined or if komega.GinkgoLogr is needed

# Search for GinkgoLogr declarations in the test file and komega package usage
rg -n 'GinkgoLogr' pkg/controllers/machinesync/machine_sync_controller_test.go

# Check if there's a local GinkgoLogr variable defined in the test
ast-grep --pattern $'var GinkgoLogr = $$$'

# Verify komega import and GinkgoLogr export
rg -n 'komega\.' pkg/controllers/machinesync/
pkg/controllers/machinesync/machine_sync_controller.go (1)

1032-1034: Critical: Past review issue not addressed - incorrect finalizer removal handling.

The past review correctly identified that lines 1032-1034 have critical bugs:

  1. The boolean return from RemoveFinalizer is ignored with _, so the requeue signal is lost
  2. fmt.Errorf wraps the error even when err is nil, returning (false, nil) instead of the intended behavior
  3. Should return (changed, nil) when removal succeeds

Apply this fix:

-    _, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
-
-    return false, fmt.Errorf("failed to remove finalizer: %w", err)
+    changed, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
+    if err != nil {
+      return false, fmt.Errorf("failed to remove finalizer: %w", err)
+    }
+    return changed, nil
🧹 Nitpick comments (1)
pkg/util/watch_filters.go (1)

84-114: Consider logging successful matches and clarify multi-owner behavior.

The implementation correctly parses ownerReferences and matches CAPI Machines, but consider:

  1. Add logging when a match is found (similar to line 72-73 in ResolveCAPIMachineSetFromInfraMachineTemplate)
  2. The function appends all matching owners—is this intentional? If an InfraMachine has multiple Machine owners, multiple reconcile requests will be enqueued.
  3. Should the function filter for Controller: true owners only?

Example enhancement:

 		if ref.Kind == "Machine" && gv.Group == clusterv1.GroupVersion.Group {
+			logger.V(4).Info("InfraMachine owned by CAPI Machine, enqueueing request",
+				"Machine", ref.Name)
 			requests = append(requests, reconcile.Request{
 				NamespacedName: client.ObjectKey{Namespace: namespace, Name: ref.Name},
 			})
 		}
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 31c38ef and 6a39fb7.

📒 Files selected for processing (4)
  • pkg/controllers/machinesync/machine_sync_controller.go (9 hunks)
  • pkg/controllers/machinesync/machine_sync_controller_test.go (12 hunks)
  • pkg/controllers/machinesync/suite_test.go (2 hunks)
  • pkg/util/watch_filters.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • pkg/controllers/machinesync/suite_test.go
🔇 Additional comments (18)
pkg/util/watch_filters.go (3)

24-27: LGTM! Imports support new functionality.

The new imports correctly support:

  • schema for parsing API versions
  • clusterv1 for CAPI Machine identification
  • logf for structured logging (addressing past review feedback)

39-51: LGTM! Structured logging properly implemented.

Successfully addresses past review feedback by using logf.FromContext with structured key-value pairs instead of klog.

Based on learnings


58-82: LGTM! Consistent structured logging.

The logging follows the same pattern as RewriteNamespace and properly logs the label-based reconciliation trigger.

pkg/controllers/machinesync/machine_sync_controller_test.go (9)

271-292: LGTM! Owner reference correctly wired for reconciliation.

The owner reference from CAPI Machine to CAPA Machine ensures the sync controller watches infra machine changes. The setup correctly precedes the Create call.


307-341: LGTM! Test properly validates infra machine recreation on providerSpec update.

The test correctly verifies that updating the MAPI machine providerSpec triggers recreation of the CAPI infra machine with the new instance type.


356-375: LGTM! Builder pattern properly maintains owner references.

Updating capaMachineBuilder with owner references ensures consistency across subsequent test uses without repetition.


465-490: LGTM! Properly handles naming and owner reference verification.

The test correctly addresses the generateName issue by rebuilding the CAPA machine with the actual MAPI machine name and verifying owner references are set.


564-589: LGTM! Correct creation order for owner references.

The test properly sequences resource creation:

  1. Build CAPA machine template
  2. Create CAPI Machine (gets UID from API server)
  3. Add owner reference with UID to CAPA machine
  4. Create CAPA machine

This ensures valid owner references.


607-607: LGTM! Explicit name instead of generateName.

Clearing generateName and using an explicit name is clearer and avoids having both set simultaneously.


650-661: LGTM! Consistent owner reference setup for MachineSet-owned machines.

Ensures CAPA machine ownership is properly established even when the CAPI Machine has a MachineSet owner.


963-975: LGTM! VAP tests maintain owner reference consistency.

The admission policy tests now include proper owner reference setup, maintaining consistency with the rest of the test suite.


1032-1044: LGTM! Simplified sentinel machine creation.

Using explicit names instead of generateName for sentinel/throwaway machines makes test behavior more predictable while still serving the VAP verification purpose.

Also applies to: 1232-1238

pkg/controllers/machinesync/machine_sync_controller.go (6)

126-130: LGTM! Clear terminal error definitions.

The new error types correctly identify configuration issues that require user intervention rather than retry.


172-172: LGTM! More precise InfraMachine event handling.

Switching from RewriteNamespace to ResolveCAPIMachineFromInfraMachine ensures only InfraMachines with CAPI Machine owners trigger reconciliation, reducing unnecessary reconcile requests.


290-317: LGTM! Terminal error handling prevents infinite retries.

The code correctly identifies and handles terminal configuration errors by logging and not requeuing, which is appropriate for issues requiring user intervention.

Minor observation: Line 316's return ctrl.Result{}, fetchErr is only reached for non-terminal errors (since terminal errors return at line 313), so the logic is correct but the flow could be slightly clearer.


431-437: LGTM! Consistent terminal error handling.

Applies the same terminal error handling pattern as in reconcileCAPIMachinetoMAPIMachine, maintaining consistency across reconciliation paths.


976-986: LGTM! Pre-flight validation prevents downstream nil pointer issues.

The validation checks catch malformed machine references early and return descriptive terminal errors, preventing nil pointer dereferences in subsequent code.


1007-1008: Good: Named returns improve clarity.

The function signature updates use named return values which improves readability. The logger initialization on line 1008 is appropriate.

Also applies to: 1224-1224

@damdo
Copy link
Member

damdo commented Oct 22, 2025

/assign @damdo @mdbooth

@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from 6a39fb7 to 60a792f Compare October 22, 2025 14:54
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (1)
pkg/controllers/machinesync/machine_sync_controller.go (1)

290-316: Proper terminal error handling prevents infinite requeues.

The logic correctly identifies terminal configuration errors (empty cluster name or infrastructure machine reference) and exits without requeuing, while allowing transient errors to propagate for retry. This prevents wasted reconciliation cycles for user configuration mistakes.

Note: Lines 297-300 and 311-314 contain very similar terminal error checks. Consider extracting this into a helper function to reduce duplication.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6a39fb7 and 60a792f.

📒 Files selected for processing (4)
  • pkg/controllers/machinesync/machine_sync_controller.go (9 hunks)
  • pkg/controllers/machinesync/machine_sync_controller_test.go (12 hunks)
  • pkg/controllers/machinesync/suite_test.go (2 hunks)
  • pkg/util/watch_filters.go (4 hunks)
🔇 Additional comments (10)
pkg/util/watch_filters.go (1)

84-114: LGTM! Well-structured ownership-based reconciliation resolver.

The new ResolveCAPIMachineFromInfraMachine function correctly inspects owner references, parses the API group/version, and enqueues reconcile requests for the owning CAPI Machine in the MAPI namespace. The structured logging and error handling are appropriate.

pkg/controllers/machinesync/machine_sync_controller_test.go (4)

193-194: Logger setup is correct.

The use of GinkgoLogr without a package qualifier is valid because github.com/onsi/ginkgo/v2 is dot-imported (line 24), making GinkgoLogr directly accessible. Based on learnings.


271-292: Proper owner reference wiring for testing new reconciliation flow.

The test now correctly establishes the CAPI Machine as an owner of the CAPA machine before creating it. This aligns with the updated controller behavior where ResolveCAPIMachineFromInfraMachine uses owner references to trigger reconciliation, ensuring the watch/event triggering logic is properly tested.


473-490: Good test coverage for owner reference verification.

This test block properly verifies that the CAPA machine is created with the correct owner references pointing to the CAPI machine, which is critical for the new ownership-based reconciliation resolver introduced in pkg/util/watch_filters.go.


1032-1046: Sentinel machine pattern correctly implements VAP test verification.

The use of a throwaway sentinel machine to verify that the ValidatingAdmissionPolicy is active and blocking forbidden operations is appropriate test design. The pattern ensures that subsequent test assertions can rely on the policy being enforced.

pkg/controllers/machinesync/machine_sync_controller.go (4)

126-130: Good addition of terminal error sentinels.

Introducing explicit error variables for invalid infrastructure references allows the reconciler to distinguish terminal configuration errors (which should not requeue) from transient failures. This improves observability and prevents infinite requeue loops for user configuration mistakes.


171-173: Critical fix: InfraMachine watch now uses ownership-based resolution.

Replacing util.RewriteNamespace with util.ResolveCAPIMachineFromInfraMachine ensures that reconciliation is triggered correctly based on the actual owning CAPI Machine (via owner references) rather than just name-based namespace rewriting. This is essential for proper event handling when InfraMachines are updated.


976-986: Early validation prevents nil pointer dereferences.

Adding pre-validation for empty cluster names and infrastructure machine references before attempting API calls is the right approach. The wrapped error messages include the machine namespace/name for debugging, which improves observability when these configuration errors occur.


1007-1008: Function signature improvements enhance clarity.

The updated signature with explicit shouldRequeue bool return value and structured logging via logf.FromContext(ctx) improves code clarity and aligns with controller-runtime best practices. The named return parameters make the intent clear at the call sites.

pkg/controllers/machinesync/suite_test.go (1)

67-68: Test logging correctly configured with controller-runtime.

Setting both logf.SetLogger and ctrl.SetLogger to GinkgoLogr ensures that all controller-runtime logging (including manager and reconciler logs) is properly captured by Ginkgo's test output. This aligns with the structured logging migration across the PR.

@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from 60a792f to 5950f19 Compare October 22, 2025 14:58
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
pkg/controllers/machinesync/machine_sync_controller.go (2)

1028-1035: Critical: finalizer removal return-path bug; lost ‘changed’ and nil-wrapping risk.

Capture the bool from RemoveFinalizer and only wrap non-nil errors; return requeue when changed.

-        _, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
-
-        return false, fmt.Errorf("failed to remove finalizer: %w", err)
+        changed, err := util.RemoveFinalizer(ctx, r.Client, mapiMachine, SyncFinalizer)
+        if err != nil {
+            return false, fmt.Errorf("failed to remove finalizer: %w", err)
+        }
+        return changed, nil

1045-1051: Fix error message on infra Delete.

This path deletes the InfraMachine; message says “failed to remove finalizer”.

-            if err := r.Client.Delete(ctx, infraMachine); err != nil {
-                return false, fmt.Errorf("failed to remove finalizer: %w", err)
+            if err := r.Client.Delete(ctx, infraMachine); err != nil {
+                return false, fmt.Errorf("failed to delete Cluster API infra machine: %w", err)
             }
♻️ Duplicate comments (6)
pkg/controllers/machinesync/machine_sync_controller_test.go (5)

357-375: Same Controller flag concern as above.

Mirror the earlier suggestion: set Controller to true for InfraMachine ownerRef.


565-590: OwnerReference Controller flag — align with production.

Same recommendation: Controller should be true.


651-662: Repeat: InfraMachine ownerRef Controller flag.

Prefer Controller: true.


963-976: Repeat: InfraMachine ownerRef Controller flag.

Prefer Controller: true.


192-195: Fix Logger reference to komega.GinkgoLogr.

Manager Logger must be komega.GinkgoLogr; unqualified GinkgoLogr won’t compile.

-            Logger: GinkgoLogr,
+            Logger: komega.GinkgoLogr,
pkg/controllers/machinesync/machine_sync_controller.go (1)

431-436: Duplicate: terminal error handling mirrors the CAPI->MAPI path.

Consistent behaviour; no requeue on configuration errors.

🧹 Nitpick comments (2)
pkg/controllers/machinesync/machine_sync_controller_test.go (1)

271-292: OwnerReference Controller flag likely should be true for InfraMachine.

CAPI Machine is the controller of the InfraMachine; use Controller: ptr.To(true) for realism and closer parity with production.

-    Controller:         ptr.To(false),
+    Controller:         ptr.To(true),
pkg/controllers/machinesync/machine_sync_controller.go (1)

976-986: Pre-validate references before GETs; minor tidy.

Validation is correct. Optionally move key construction below validation for clarity.

- infraMachineKey := client.ObjectKey{ Namespace: infraMachineRef.Namespace, Name: infraMachineRef.Name }
- // Validate...
+ // Validate...
  if capiMachine.Spec.ClusterName == "" { ... }
  if infraMachineRef.Name == "" || infraMachineRef.Namespace == "" { ... }
+ infraMachineKey := client.ObjectKey{ Namespace: infraMachineRef.Namespace, Name: infraMachineRef.Name }
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to data retention organization setting

📥 Commits

Reviewing files that changed from the base of the PR and between 60a792f and 5950f19.

📒 Files selected for processing (4)
  • pkg/controllers/machinesync/machine_sync_controller.go (9 hunks)
  • pkg/controllers/machinesync/machine_sync_controller_test.go (12 hunks)
  • pkg/controllers/machinesync/suite_test.go (2 hunks)
  • pkg/util/watch_filters.go (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • pkg/controllers/machinesync/suite_test.go
  • pkg/util/watch_filters.go
🔇 Additional comments (12)
pkg/controllers/machinesync/machine_sync_controller_test.go (7)

307-341: LGTM: providerSpec update flow and assertions.

Update -> recreate infra machine and condition expectations look correct.


466-471: LGTM: name alignment before get.

Rebuilding capiMachine with MAPI name avoids 404s.


473-490: LGTM: verifies InfraMachine ownerRef points at the created CAPI Machine.

Good UID-based assertion.


547-551: No action.

Message-only change; fine to keep.


1032-1046: LGTM: VAP sentinel flow.

State transition + VAP denial check looks correct.


1232-1239: LGTM: CAPI sentinel + VAP check.

Covers namespace-scoped policy behaviour.


607-607: Verify whether builder state persists across test contexts.

The review comment flags a valid concern: line 607 clears GenerateName and sets Name on the suite-level mapiMachineBuilder, but line 1032 (in a different test context) calls WithGenerateName("sentinel-machine") on the same builder instance without clearing Name.

Standard Go builders typically accumulate state. If the builder retains Name from line 607, then sentinelMachine at line 1032 would have both Name and GenerateName, which violates Kubernetes object naming rules.

Confirm:

  1. Whether the builder implementation resets state after .Build() or retains it
  2. Whether Ginkgo test context isolation resets the builder between test executions
  3. That sentinelMachine is created with only GenerateName and not an inherited Name
pkg/controllers/machinesync/machine_sync_controller.go (5)

126-131: Good: explicit terminal error types for invalid references.

Clear sentinel errors simplify control flow and user-facing messages.


291-315: Good: do-not-requeue on terminal configuration errors.

The early exit for empty cluster name/infraRef prevents hot loops and surfaces a clear error.


1007-1015: Signature change is good; deletion path readability improved.

Named return shouldRequeue clarifies intent.


1224-1264: LGTM: ensureSyncFinalizer returns shouldRequeue; aggregates errors.

Correctly sets requeue when any finalizer is added; safe nil-object checks.


171-174: Mapping implementation verified as correct and robust.

The ResolveCAPIMachineFromInfraMachine function properly:

  • Resolves CAPI Machines via owner references with explicit Kind and APIVersion validation
  • Handles nil/empty owner references and non-Machine owners safely through iteration and filtering
  • Works generically across all providers (AWS/OpenStack/PowerVS) with no provider-specific logic

The controller correctly injects r.MAPINamespace and uses FilterNamespace(r.CAPINamespace) to ensure InfraMachine watches trigger reconciliation of the corresponding MAPI Machine mirror.

@theobarberbany theobarberbany force-pushed the tb/fix-machinesync-watches branch from 5950f19 to 961a450 Compare October 22, 2025 15:07
Copy link
Contributor

@chrischdi chrischdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm
/approve

@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Oct 22, 2025
@openshift-ci-robot
Copy link

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test e2e-aws-capi-techpreview
/test e2e-aws-ovn
/test e2e-aws-ovn-serial-1of2
/test e2e-aws-ovn-serial-2of2
/test e2e-aws-ovn-techpreview
/test e2e-aws-ovn-techpreview-upgrade
/test e2e-azure-capi-techpreview
/test e2e-azure-ovn-techpreview
/test e2e-azure-ovn-techpreview-upgrade
/test e2e-gcp-capi-techpreview
/test e2e-gcp-ovn-techpreview
/test e2e-metal3-capi-techpreview
/test e2e-openstack-capi-techpreview
/test e2e-openstack-ovn-techpreview
/test e2e-vsphere-capi-techpreview
/test regression-clusterinfra-aws-ipi-techpreview-capi

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 22, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrischdi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Oct 22, 2025
Comment on lines 470 to 471
Eventually(k.Get(capiMachine)).Should(Succeed())
})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is testing anything. We already asserted that Create() succeeded in BeforeEach. Are we wondering if it was deleted since? I would just delete this test.

Eventually(k.Get(
awsv1resourcebuilder.AWSMachine().WithName(mapiMachine.Name).WithNamespace(capiNamespace.Name).Build(),
)).Should(Succeed())
It("should successfully create the CAPA machine with correct owner references", func() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ditto. I would delete this test. All it's doing is testing that we created the test objects in BeforeEach, but BeforeEach already asserted that.

A valid test would be asserting something else related to the machine sync controller, like some status.

Comment on lines +106 to +109
requests = append(requests, reconcile.Request{
NamespacedName: client.ObjectKey{Namespace: namespace, Name: ref.Name},
})
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we expect multiple Machine owner references? You should probably just return at this point.

This also means you don't need to append, you can return a slice you instantiate here, initialised with a single element.

}
}

return requests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Which also means you can return nil here instead of an empty slice.

@sunzhaohua2
Copy link
Contributor

/retest

Copy link
Member

@damdo damdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, left a couple of Qs.
Also, do we need to do this for MachineSets/InfraMachineTemplates in any form?

Comment on lines +67 to +68
logf.SetLogger(GinkgoLogr)
ctrl.SetLogger(GinkgoLogr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I worry that the other controllers suites will drift away from this change if we don't do that in the others too. Would you be able to follow up with a PR and do the same where we do the "old approach"? TY

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's not forget about this :)


if existingMAPIMachine == nil {
// Don't requeue for terminal configuration errors
if errors.Is(err, errInvalidInfraClusterReference) || errors.Is(err, errInvalidInfraMachineReference) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You are repeating this check quite a lot, how about having a small function for this, so the configuration errors set can also be extended in a single point in the future.

e.g. something along the lines of
isTerminalConfigurationError()

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still see these repeated, would it make sense to add this util func?

fetchErr := fmt.Errorf("failed to fetch Cluster API infra resources: %w", err)

if existingMAPIMachine == nil {
// Don't requeue for terminal configuration errors
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to do the same for MachineSets?

Comment on lines 356 to 374
By("Creating the CAPI infra machine")
// we must set the capi machine as an owner of the capa machine
// in order to ensure we reconcile capa changes in our sync controller.

// Updates the capaMachineBuilder with the correct owner ref,
// so when we use it later on, we don't need to repeat ourselves.
capaMachineBuilder = capaMachineBuilder.WithOwnerReferences([]metav1.OwnerReference{
{
Kind: machineKind,
APIVersion: clusterv1.GroupVersion.String(),
Name: capiMachine.Name,
UID: capiMachine.UID,
BlockOwnerDeletion: ptr.To(true),
Controller: ptr.To(false),
},
})

capaMachine = capaMachineBuilder.Build()
Expect(k8sClient.Create(ctx, capaMachine)).To(Succeed(), "capa machine should be able to be created")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Name: capaMachine.GetName(),
Namespace: capaMachine.GetNamespace(),
}).Build()
Expect(k8sClient.Create(ctx, capiMachine)).Should(Succeed())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather use Eventuallys instead of Expect all over the place for these k8s api calls

capiMachine = capiMachineBuilder.Build()
Eventually(k8sClient.Create(ctx, capiMachine)).Should(Succeed(), "capi machine should be able to be created")

By("Updating the CAPA machine adding the CAPI machine as an owner")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we retain the By()s?

Copy link
Contributor

@JoelSpeed JoelSpeed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/approve cancel

There's a bunch of outstanding feedback here

Lets get that resolved

@openshift-ci openshift-ci bot removed the lgtm Indicates that a PR is ready to be merged. label Oct 28, 2025
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 28, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci-robot
Copy link

@theobarberbany: This pull request references Jira Issue OCPBUGS-62325, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @sunzhaohua2

In response to this:

Summary by CodeRabbit

  • New Features

  • Added resolver to map infra resources to machine reconciles.

  • New public terminal errors to indicate invalid infra references.

  • Refactor

  • Terminal-error handling added to avoid futile requeues and emit warnings.

  • Reconciler and finalizer flows adjusted for more predictable sync and deletion handling.

  • Standardized structured logging across controllers for clearer operational output.

  • Tests

  • Hardened ownership wiring, sentinel resources, and switched tests to controller-runtime/Ginkgo logging.

  • Chores

  • Updated watch/filter utilities and logging integration.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Oct 28, 2025

@theobarberbany: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-gcp-ovn-techpreview 961a450 link true /test e2e-gcp-ovn-techpreview
ci/prow/e2e-openstack-ovn-techpreview 961a450 link true /test e2e-openstack-ovn-techpreview
ci/prow/okd-scos-e2e-aws-ovn 72dbdfd link false /test okd-scos-e2e-aws-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants