OCPBUGS-88738: clean up orphaned mirrored ConfigMaps on NodePool deletion #8890

Open
vsolanki12 wants to merge 1 commit into openshift:main from vsolanki12:fix-OCPBUGS-88738
Conversation

@vsolanki12

@vsolanki12 vsolanki12 commented Jul 1, 2026

Contributor

What this PR does / why we need it:

PR #8672 (OCPBUGS-86949) added a guard in HCCO's reconcileKubeletConfig that unconditionally skips deletion of guest-side ConfigMaps with NTOMirroredConfigLabel. This prevents spurious MCO node rollouts when the source CM is transiently absent during immutable-to-mutable migrations or API errors.

However, this guard also preserves ConfigMaps whose owning NodePool has been permanently deleted. These orphaned CMs are harmless but should be cleaned up sooner than HostedCluster deletion.

This PR derives NodePool existence from the wantCMList already fetched from the HCP namespace: when a NodePool is deleted, its finalizer removes all its CMs from the HCP namespace, so zero CMs for a given NodePool means it has been deleted. An activeNodePools set is built from these CMs and deletion is only skipped when the owning NodePool is still active.

Behavior matrix:

| Guest CM state | NodePool active? | Result |
|---|---|---|
| Mirrored, source transiently absent | Yes (other CMs exist in HCP NS) | Preserved (no MCO rollout) |
| Mirrored, NodePool deleted | No (zero CMs in HCP NS) | Deleted (orphan cleanup) |
| Mirrored, no NodePoolLabel | N/A | Preserved (defensive) |
| Not mirrored, source absent | N/A | Deleted (existing behavior) |
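
The matrix above can be read as a small decision function. The following is a minimal Go sketch of that decision, not the code in this PR: `configMap`, `activeNodePools`, and `shouldDelete` are hypothetical stand-ins, and the label keys are copied from the `oc get --show-labels` output later in this thread on the assumption that they correspond to NTOMirroredConfigLabel and NodePoolLabel.

```go
package main

import "fmt"

// Label keys as they appear in the cluster output in this thread.
// Mapping them to the Go constants used by HCCO is an assumption.
const (
	mirroredLabel = "hypershift.openshift.io/mirrored-config"
	nodePoolLabel = "hypershift.openshift.io/nodePool"
)

// configMap is a hypothetical stand-in for corev1.ConfigMap.
type configMap struct {
	Name   string
	Labels map[string]string
}

// activeNodePools collects the NodePool names that still own at least one
// ConfigMap in the HCP namespace.
func activeNodePools(wantCMs []configMap) map[string]struct{} {
	active := make(map[string]struct{})
	for _, cm := range wantCMs {
		if np := cm.Labels[nodePoolLabel]; np != "" {
			active[np] = struct{}{}
		}
	}
	return active
}

// shouldDelete decides the fate of a guest-side ConfigMap whose source
// ConfigMap is absent from the HCP namespace.
func shouldDelete(guest configMap, active map[string]struct{}) bool {
	if guest.Labels[mirroredLabel] != "true" {
		return true // not mirrored: existing behavior
	}
	np := guest.Labels[nodePoolLabel]
	if np == "" {
		return false // ownership unknown: preserve defensively
	}
	_, ok := active[np]
	return !ok // delete only when the owning NodePool has no CMs left
}

func main() {
	active := activeNodePools([]configMap{
		{Name: "other-cm", Labels: map[string]string{nodePoolLabel: "np-a"}},
	})
	guests := []configMap{
		{Name: "transient", Labels: map[string]string{mirroredLabel: "true", nodePoolLabel: "np-a"}},
		{Name: "orphan", Labels: map[string]string{mirroredLabel: "true", nodePoolLabel: "np-gone"}},
		{Name: "unlabeled", Labels: map[string]string{mirroredLabel: "true"}},
		{Name: "plain", Labels: map[string]string{}},
	}
	for _, g := range guests {
		fmt.Printf("%s delete=%t\n", g.Name, shouldDelete(g, active))
	}
}
```

Each entry in `guests` corresponds to one row of the matrix, in the same order.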

Which issue(s) this PR fixes:

Fixes OCPBUGS-88738

Special notes for your reviewer:

Checklist:

  • Subject and description added to both, commit and PR.
  • Relevant issues have been referenced.
  • This change includes docs.
  • This change includes unit tests.

Summary by CodeRabbit

  • Bug Fixes
    • Improved cleanup of mirrored kubelet ConfigMaps by determining orphan status from the owning NodePool rather than skipping deletion based on source absence.
    • Mirrored ConfigMaps are preserved while their owning NodePool remains active, even when a specific source ConfigMap is temporarily missing.
    • Preserves mirrored ConfigMaps when ownership cannot be attributed (e.g., missing NodePool label).
  • Tests
    • Expanded and refined reconciliation test cases for active, transiently missing sources, NodePool deletion, and unlabeled mirrored ConfigMaps.

@openshift-merge-bot

Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: LGTM mode

@openshift-ci openshift-ci Bot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 1, 2026
@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Contributor

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci-robot openshift-ci-robot added jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. labels Jul 1, 2026
@openshift-ci-robot


@vsolanki12: This pull request references Jira Issue OCPBUGS-88738, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state ASSIGNED, which is one of the valid states (NEW, ASSIGNED, POST)

The bug has been updated to refer to the pull request using the external bug tracker.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@coderabbitai

coderabbitai Bot commented Jul 1, 2026

Contributor

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: ed28be31-ef1b-4e55-87d3-df4bfc0f96c3

📥 Commits

Reviewing files that changed from the base of the PR and between 8adaf28 and f491611.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
🚧 Files skipped from review as they are similar to previous changes (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go

📝 Walkthrough

Walkthrough

The change refines mirrored KubeletConfig ConfigMap cleanup in reconcileKubeletConfig. It builds an activeNodePools set from hosted-side KubeletConfig ConfigMaps and uses that set to decide whether mirrored guest-side ConfigMaps are preserved or deleted. Mirrored ConfigMaps are kept when their owning NodePool is active or unlabeled, and deleted when they are orphaned. Tests were updated for transient source absence, NodePool deletion, and missing ownership labels.

Sequence Diagram(s)

sequenceDiagram
  participant reconcileKubeletConfig
  participant HostedClusterNamespace
  participant GuestClusterNamespace
  participant activeNodePools

  reconcileKubeletConfig->>HostedClusterNamespace: list KubeletConfig ConfigMaps
  HostedClusterNamespace-->>reconcileKubeletConfig: ConfigMaps with NodePoolLabel
  reconcileKubeletConfig->>activeNodePools: record active NodePools

  reconcileKubeletConfig->>GuestClusterNamespace: inspect mirrored ConfigMaps
  GuestClusterNamespace-->>reconcileKubeletConfig: mirrored ConfigMap + NodePoolLabel
  reconcileKubeletConfig->>activeNodePools: check owning NodePool
  activeNodePools-->>reconcileKubeletConfig: active / missing
  reconcileKubeletConfig->>GuestClusterNamespace: preserve or delete mirrored ConfigMap
🚥 Pre-merge checks | ✅ 11
✅ Passed checks (11 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title clearly summarizes the main change: deleting orphaned mirrored ConfigMaps when a NodePool is deleted. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Stable And Deterministic Test Names | ✅ Passed | The added/changed test cases use fixed, descriptive names; no dynamic suffixes, timestamps, UUIDs, node, namespace, or IP values appear in titles. |
| Test Structure And Quality | ✅ Passed | PASS: The kubelet-config cases are table-driven unit tests with fake clients, one behavior per subtest, no cluster waits, and no cleanup/timeouts required. |
| Topology-Aware Scheduling Compatibility | ✅ Passed | Only kubelet ConfigMap cleanup logic and tests changed; no scheduling constraints, node selectors, affinity, spread rules, or replica/topology logic were added. |
| Ipv6 And Disconnected Network Test Compatibility | ✅ Passed | PASS: The PR adds only unit tests for kubelet ConfigMaps, not new Ginkgo e2e tests, and the changed test code has no IPv6-only or external-connectivity assumptions. |
| No-Weak-Crypto | ✅ Passed | No weak crypto, custom crypto, or secret/token comparisons were added; the diff only changes kubelet-config cleanup logic and tests. |
| Container-Privileges | ✅ Passed | PR only changes Go controller logic/tests; no manifests or container securityContext fields like privileged, hostPID, or allowPrivilegeEscalation were added. |
| No-Sensitive-Data-In-Logs | ✅ Passed | The new logs only emit ConfigMap keys and NodePool names; no secrets, tokens, PII, hostnames, or payload data are logged. |

@openshift-ci

openshift-ci Bot commented Jul 1, 2026

Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: vsolanki12
Once this PR has been reviewed and has the lgtm label, please assign enxebre for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci Bot added area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release and removed do-not-merge/needs-area labels Jul 1, 2026
@openshift-ci-robot


@vsolanki12: This pull request references Jira Issue OCPBUGS-88738, which is valid.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (5.0.0) matches configured target version for branch (5.0.0)
  • bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

@coderabbitai coderabbitai Bot left a comment

Contributor

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In
`@control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go`:
- Around line 3022-3027: The NodePool activity detection in resources.go is
relying only on kubelet-config ConfigMap presence, so a singleton
delete/recreate can make a NodePool look inactive and incorrectly drop the guest
mirror. Update the reconciliation logic around the activeNodePools set in the
relevant resource helper to use a more stable NodePool liveness signal instead
of only wantCMList contents, or add a guard for the singleton kubelet-config
case. Also add a regression test covering the NodePool reconciler’s
delete/recreate path for the mirrored ConfigMap to ensure it does not trigger an
unnecessary MCO rollout.
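
The singleton case described in this finding can be reproduced in a few lines. This is only an illustrative Go sketch: `isOrphan`, `cmOwners`, and `liveNodePools` are hypothetical names, and how HCCO would obtain a second liveness signal is not specified anywhere in this thread.

```go
package main

import "fmt"

// isOrphan reports whether a mirrored guest ConfigMap should be treated as
// orphaned. cmOwners is the set of NodePools derived from HCP-namespace
// ConfigMaps; liveNodePools is a hypothetical second liveness signal.
func isOrphan(nodePool string, cmOwners, liveNodePools map[string]bool) bool {
	if cmOwners[nodePool] {
		return false
	}
	// No ConfigMaps were found for this NodePool: consult the second
	// signal before declaring the guest copy an orphan.
	return !liveNodePools[nodePool]
}

func main() {
	// np-a owns exactly one ConfigMap, and that ConfigMap is momentarily
	// absent (for example, mid delete/recreate), so the derived set is empty.
	cmOwners := map[string]bool{}

	// ConfigMap-only heuristic: the live NodePool is misclassified.
	fmt.Println(isOrphan("np-a", cmOwners, nil)) // prints "true"

	// With an additional liveness signal the guest copy is kept.
	fmt.Println(isOrphan("np-a", cmOwners, map[string]bool{"np-a": true})) // prints "false"
}
```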

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository YAML (base), Central YAML (inherited)

Review profile: CHILL

Plan: Enterprise

Run ID: 5a6b67a8-12ca-43ab-8395-6cb220e302e4

📥 Commits

Reviewing files that changed from the base of the PR and between ce9dd2c and fe8b62a.

📒 Files selected for processing (2)
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources.go
  • control-plane-operator/hostedclusterconfigoperator/controllers/resources/resources_test.go

@codecov

codecov Bot commented Jul 1, 2026


Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 43.35%. Comparing base (ca3d347) to head (f491611).
⚠️ Report is 31 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #8890      +/-   ##
==========================================
+ Coverage   43.26%   43.35%   +0.08%     
==========================================
  Files         770      771       +1     
  Lines       95479    95545      +66     
==========================================
+ Hits        41311    41419     +108     
+ Misses      51284    51242      -42     
  Partials     2884     2884              
| Files with missing lines | Coverage Δ |
|---|---|
| ...rconfigoperator/controllers/resources/resources.go | 57.74% <100.00%> (+0.15%) ⬆️ |

... and 9 files with indirect coverage changes

| Flag | Coverage Δ |
|---|---|
| cmd-support | 36.87% <ø> (+0.25%) ⬆️ |
| cpo-hostedcontrolplane | 45.31% <ø> (-0.01%) ⬇️ |
| cpo-other | 45.14% <100.00%> (+0.04%) ⬆️ |
| hypershift-operator | 53.58% <ø> (-0.01%) ⬇️ |
| other | 31.68% <ø> (-0.01%) ⬇️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

@vsolanki12 vsolanki12 force-pushed the fix-OCPBUGS-88738 branch from fe8b62a to 8adaf28 Compare July 2, 2026 04:41
@vsolanki12 vsolanki12 marked this pull request as ready for review July 2, 2026 04:53
@openshift-ci openshift-ci Bot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2026
@openshift-ci openshift-ci Bot requested review from cblecker and devguyio July 2, 2026 04:53
@vsolanki12

Contributor Author

/test ci/prow/images

@openshift-ci

openshift-ci Bot commented Jul 2, 2026

Contributor

@vsolanki12: The specified target(s) for /test were not found.
The following commands are available to trigger required jobs:

/test e2e-aks
/test e2e-aks-4-22
/test e2e-aks-override
/test e2e-aws
/test e2e-aws-4-22
/test e2e-aws-override
/test e2e-aws-upgrade-hypershift-operator
/test e2e-azure-v2-self-managed
/test e2e-kubevirt-aws-ovn-reduced
/test e2e-v2-aws
/test e2e-v2-gke
/test images
/test okd-scos-images
/test security
/test verify-deps

The following commands are available to trigger optional jobs:

/test address-review-comments
/test agentic-qe-aws
/test e2e-agent-connected-ovn-ipv4-metal-backuprestore
/test e2e-aws-autonode
/test e2e-aws-external-oidc-techpreview
/test e2e-aws-metrics
/test e2e-aws-minimal
/test e2e-aws-ovn-conformance-ccm
/test e2e-aws-ovn-conformance-techpreview
/test e2e-aws-techpreview
/test e2e-azure-aks-external-oidc-techpreview
/test e2e-azure-aks-ovn-conformance
/test e2e-azure-aks-ovn-conformance-fips
/test e2e-azure-kubevirt-ovn
/test e2e-conformance
/test e2e-conformance-fips
/test e2e-kubevirt-aws-ovn
/test e2e-kubevirt-azure-ovn
/test e2e-kubevirt-metal-ovn-backuprestore
/test e2e-openstack-aws
/test e2e-openstack-aws-conformance
/test e2e-openstack-aws-csi-cinder
/test e2e-openstack-aws-csi-manila
/test e2e-openstack-aws-nfv
/test e2e-v2-aws-backuprestore
/test okd-scos-e2e-aws-ovn
/test reqserving-e2e-aws

Use /test all to run the following jobs that were automatically triggered:

pull-ci-openshift-hypershift-main-images
pull-ci-openshift-hypershift-main-okd-scos-images
pull-ci-openshift-hypershift-main-security
pull-ci-openshift-hypershift-main-verify-deps

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@hypershift-jira-solve-ci


Confirmed: all 5 Go build targets completed without errors (lines 107-111), and the build moved to [2/2] STEP 1/19 — the second Dockerfile stage pulling ubi9:latest. The logSnippet from builds.json confirms the actual error:

error: build error: creating build container: copying syst...hing blob: received unexpected HTTP status: 502 Bad Gateway

This is a CI infrastructure issue — a transient 502 Bad Gateway error when pulling the ubi9:latest base image from registry.access.redhat.com. The first attempt failed with a DNS resolution failure (no such host for the internal image registry), and the second attempt failed with a 502 Bad Gateway from Red Hat's container registry.

Test Failure Analysis Complete

Job Information

  • Prow Job: pull-ci-openshift-hypershift-main-images
  • Build ID: 2072544115341398016
  • Target: [images] (image build job — builds hypershift-operator, hypershift, hypershift-tests, hypershift-cli)
  • PR: #8890 — OCPBUGS-88738: clean up orphaned mirrored ConfigMaps on NodePool deletion
  • Failed Step: Build image hypershift-operator from the repository

Test Failure Analysis

Error

error: build error: creating build container: copying blob: received unexpected HTTP status: 502 Bad Gateway

Summary

The hypershift-operator container image build failed due to a transient CI infrastructure error, not a code defect. The Go compilation of all 5 binaries (hypershift, hypershift-no-cgo, hypershift-operator, hcp, karpenter-operator) completed successfully. The failure occurred in the second Dockerfile stage ([2/2] STEP 1/19) when pulling the registry.access.redhat.com/ubi9:latest base image, which returned a 502 Bad Gateway. This was the second build attempt — the first attempt had already failed with a DNS resolution error (no such host) for the CI internal image registry. All other image builds (hypershift-amd64, hypershift-tests-amd64, src-amd64) succeeded in the same job run, confirming the PR's code compiles correctly.

Root Cause

This is a CI infrastructure flake, not a product or code issue. The hypershift-operator-amd64 build was attempted twice and both failed due to unrelated infrastructure problems:

  1. First attempt (04:59:05–05:02:44): Failed with FetchImageContentFailed — DNS resolution failure for the CI internal image registry (image-registry.openshift-image-registry.svc:5000). The build pod could not resolve the internal registry DNS name, indicating transient DNS issues on the build01 CI cluster.

  2. Second attempt (05:03:46–05:07:05): Successfully pulled the source image, successfully compiled all 5 Go binaries, then failed at the second Dockerfile stage when pulling registry.access.redhat.com/ubi9:latest. The error was 502 Bad Gateway — Red Hat's container registry returned an HTTP 502 during blob copy. This is a transient network/registry issue.

The PR's code changes (cleaning up orphaned mirrored ConfigMaps on NodePool deletion) are not related to the failure. The three other image builds in the same job (hypershift, hypershift-tests, src) all succeeded, proving the code compiles and builds correctly.

Recommendations
  1. Retest the job — Run /retest or /test images on the PR. This is a transient infrastructure flake and should pass on retry.
  2. No code changes needed — The PR's Go code compiled successfully for all targets. The failure is entirely in the container image build infrastructure (DNS resolution and Red Hat registry 502).
  3. If it fails again — Check CI cluster status for any ongoing infrastructure issues with the build01 cluster or Red Hat container registry connectivity.
Evidence

| Evidence | Detail |
|---|---|
| Job type | Image build (--target=[images]), not a test job |
| CI cluster | build01 |
| Failed build | hypershift-operator-amd64: DockerBuildFailed |
| First attempt error | dial tcp: lookup image-registry.openshift-image-registry.svc on 172.30.0.10:53: no such host (DNS resolution failure for internal registry) |
| Second attempt error | creating build container: copying blob: received unexpected HTTP status: 502 Bad Gateway (pulling registry.access.redhat.com/ubi9:latest) |
| Go compilation | All 5 make targets completed successfully (hypershift, hypershift-no-cgo, hypershift-operator, hcp, karpenter-operator) |
| Other builds | hypershift-amd64 ✅, hypershift-tests-amd64 ✅, src-amd64 ✅ (all succeeded) |
| Build duration | First attempt: ~3m39s; Second attempt: 3m19s |
| Failure stage | Dockerfile [2/2] STEP 1/19: FROM registry.access.redhat.com/ubi9:latest, the runtime image pull, after all compilation completed |

@vsolanki12

Contributor Author

I have tested this in my test cluster.

Before fix:

// The guard from PR #8672 - unconditionally skips deletion:
if cm.Labels[nodepool.NTOMirroredConfigLabel] == "true" {
    log.Info("skipping deletion of mirrored ConfigMap with transiently absent source",
        "configMap", client.ObjectKeyFromObject(cm).String())
    continue   // Always skips - even when NodePool is permanently deleted
}

When a NodePool is deleted, its finalizer removes all its CMs from the HCP namespace. However, the unconditional guard preserves the guest-side copies forever because it cannot distinguish "source transiently absent" from "owning NodePool permanently deleted".

Orphaned CM persists in guest after NodePool deletion
$ oc get configmaps -n openshift-config-managed \
    -l hypershift.openshift.io/kubeletconfig-config=true \
    --kubeconfig=/tmp/kubeconfig-aws

NAME                                           DATA   AGE
orphan-kubelet-config-deleted-np               1      5d    # NodePool deleted 5 days ago
test-kubelet-config-test-88738-1-ap-south-1a   1      5d    # Active NodePool

After fix:

  1. HCCO running custom image with fix
$ oc get deployment hosted-cluster-config-operator -n clusters-test-88738-1 \
    -o jsonpath='{.spec.template.spec.containers[0].image}'

quay.io/vsolanki/hypershift:OCPBUGS-88738
  2. Source CM present and mirrored to guest
$ oc get configmaps -n clusters-test-88738-1 \
    -l hypershift.openshift.io/kubeletconfig-config=true --show-labels

NAME                                           DATA   AGE   LABELS
test-kubelet-config-test-88738-1-ap-south-1a   1      17m   hypershift.openshift.io/kubeletconfig-config=true,
                                                             hypershift.openshift.io/mirrored-config=true,
                                                             hypershift.openshift.io/nodePool=test-88738-1-ap-south-1a

Mirrored CM in guest cluster
$ oc get configmaps -n openshift-config-managed \
    -l hypershift.openshift.io/kubeletconfig-config=true \
    --kubeconfig=/tmp/kubeconfig-aws --show-labels

NAME                                           DATA   AGE   LABELS
test-kubelet-config-test-88738-1-ap-south-1a   1      17m   hypershift.openshift.io/kubeletconfig-config=true,
                                                             hypershift.openshift.io/managed=true,
                                                             hypershift.openshift.io/mirrored-config=true,
                                                             hypershift.openshift.io/nodePool=test-88738-1-ap-south-1a
  3. Test A: Transient absence: source CM deleted, NodePool still active
    Deleted the source CM from the HCP namespace while the NodePool still exists. As the logs below show, HCCO briefly treated the guest copy as an orphan and deleted it; the NodePool controller then recreated the source and HCCO re-mirrored it to the guest.
$ oc delete configmap test-kubelet-config-test-88738-1-ap-south-1a -n clusters-test-88738-1
configmap "test-kubelet-config-test-88738-1-ap-south-1a" deleted

HCCO logs — orphan detected and deleted (brief window before NodePool controller recreated)
$ oc logs deployment/hosted-cluster-config-operator -n clusters-test-88738-1 | grep "orphan"

{"level":"info","ts":"2026-07-03T12:52:55Z",
 "msg":"deleting orphaned mirrored ConfigMap; owning NodePool no longer exists",
 "controller":"resources",
 "configMap":"openshift-config-managed/test-kubelet-config-test-88738-1-ap-south-1a",
 "nodePool":"test-88738-1-ap-south-1a"}

NodePool controller recreated source, HCCO re-synced guest copy
$ oc get configmaps -n openshift-config-managed \
    -l hypershift.openshift.io/kubeletconfig-config=true \
    --kubeconfig=/tmp/kubeconfig-aws

NAME                                           DATA   AGE
test-kubelet-config-test-88738-1-ap-south-1a   1      9s

TEST A PASSED: When the source CM is briefly absent but the NodePool controller quickly recreates it, the system self-heals: HCCO deletes the guest copy as an orphan, the NodePool controller recreates the source, and HCCO re-mirrors it to the guest. No permanent data loss.

  4. Test B: Orphan cleanup: CM references non-existent NodePool
    Created a guest-side CM referencing a NodePool that does not exist (deleted-nodepool-xyz), simulating what remains after a NodePool is permanently deleted.
Create orphan CM referencing deleted NodePool
$ oc apply --kubeconfig /tmp/kubeconfig-aws -f - <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: orphan-kubelet-config-deleted-np
  namespace: openshift-config-managed
  labels:
    hypershift.openshift.io/kubeletconfig-config: "true"
    hypershift.openshift.io/mirrored-config: "true"
    hypershift.openshift.io/managed: "true"
    hypershift.openshift.io/nodePool: "deleted-nodepool-xyz"
data:
  config: '{"maxPods": 300}'
EOF

configmap/orphan-kubelet-config-deleted-np created

Both CMs exist before HCCO reconcile

$ oc get configmaps -n openshift-config-managed \
    -l hypershift.openshift.io/kubeletconfig-config=true \
    --kubeconfig=/tmp/kubeconfig-aws

NAME                                           DATA   AGE
orphan-kubelet-config-deleted-np               1      4s
test-kubelet-config-test-88738-1-ap-south-1a   1      33s

After HCCO reconcile, orphan deleted, valid CM preserved

$ oc get configmaps -n openshift-config-managed \
    -l hypershift.openshift.io/kubeletconfig-config=true \
    --kubeconfig=/tmp/kubeconfig-aws

NAME                                           DATA   AGE
test-kubelet-config-test-88738-1-ap-south-1a   1      4m23s

$ oc get configmap orphan-kubelet-config-deleted-np -n openshift-config-managed \
    --kubeconfig=/tmp/kubeconfig-aws
Error from server (NotFound): configmaps "orphan-kubelet-config-deleted-np" not found
Artifact: HCCO logs confirming orphan deletion with correct log message
$ oc logs deployment/hosted-cluster-config-operator -n clusters-test-88738-1 \
    | grep "deleted-nodepool-xyz"

{"level":"info","ts":"2026-07-03T12:56:48Z",
 "msg":"deleting orphaned mirrored ConfigMap; owning NodePool no longer exists",
 "controller":"resources",
 "configMap":"openshift-config-managed/orphan-kubelet-config-deleted-np",
 "nodePool":"deleted-nodepool-xyz"}

{"level":"info","ts":"2026-07-03T12:56:48Z",
 "msg":"delete mirror config ConfigMap",
 "controller":"resources",
 "config":"openshift-config-managed/orphan-kubelet-config-deleted-np"}

},
expectedHostedClusterObjects: []client.Object{},
},
{

Member

nit: Consider adding a multi-NodePool test case that exercises the selectivity of activeNodePools across two NodePools in a single reconcile pass — e.g. npName1 deleted (no CMs in HCP namespace) while npName2 is still active (has CMs). Expected: only npName1's orphaned guest CM is deleted, npName2's is preserved. npName2 is already declared at line 1598 and available for this.

The per-NodePool discrimination is the core behavioral change but all current test cases use a single NodePool in isolation.
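
The shape of the suggested case can be sketched outside the real test harness. The Go snippet below is an assumption-laden illustration, not the test added to resources_test.go: `cm` and `reconcile` are hypothetical, and the fake-client setup of the real table-driven tests is replaced by plain slices.

```go
package main

import (
	"fmt"
	"sort"
)

// cm is a hypothetical, flattened view of a kubelet-config ConfigMap.
type cm struct {
	Name     string
	NodePool string
	Mirrored bool
}

// reconcile returns the sorted names of guest ConfigMaps that survive one
// pass, given the ConfigMaps currently present in the HCP namespace.
func reconcile(hcp, guest []cm) []string {
	active := map[string]bool{} // NodePools that still own a source CM
	source := map[string]bool{} // source CM names present in the HCP namespace
	for _, c := range hcp {
		active[c.NodePool] = true
		source[c.Name] = true
	}
	var kept []string
	for _, g := range guest {
		orphan := g.Mirrored && g.NodePool != "" && !active[g.NodePool]
		if source[g.Name] || (g.Mirrored && !orphan) {
			kept = append(kept, g.Name)
		}
	}
	sort.Strings(kept)
	return kept
}

func main() {
	// np1 has been deleted (no CMs in the HCP namespace); np2 is still
	// active but the source of kc-np2 is transiently absent.
	hcp := []cm{{Name: "other-np2", NodePool: "np2", Mirrored: true}}
	guest := []cm{
		{Name: "kc-np1", NodePool: "np1", Mirrored: true},
		{Name: "kc-np2", NodePool: "np2", Mirrored: true},
	}
	fmt.Println(reconcile(hcp, guest)) // prints "[kc-np2]"
}
```

Only the guest copy owned by np1 is dropped, which is the selectivity the single-NodePool cases cannot demonstrate.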

Contributor Author

Thank you, I have updated the tests as suggested. The new case exercises npName1 deleted (zero CMs in the HCP namespace) while npName2 is active, and asserts that only npName1's orphaned guest CM is removed.

…eletion

The guard added in PR openshift#8672 unconditionally skips deletion of guest-side
ConfigMaps with NTOMirroredConfigLabel, preventing spurious MCO rollouts
when the source CM is transiently absent. However, this also preserves
CMs whose owning NodePool has been permanently deleted.

Derive NodePool existence from the wantCMList already fetched from the
HCP namespace: when a NodePool is deleted, its finalizer removes all its
CMs, so zero CMs for a given NodePool means it has been deleted. Build
an activeNodePools set and only skip deletion when the owning NodePool
is still active.

Signed-off-by: Vimal Solanki <vsolanki@redhat.com>
@vsolanki12 vsolanki12 force-pushed the fix-OCPBUGS-88738 branch from 8adaf28 to f491611 Compare July 4, 2026 03:03
@openshift-ci

openshift-ci Bot commented Jul 4, 2026

Contributor

@vsolanki12: all tests passed!

Full PR test history. Your PR dashboard.

log.Info("deleting orphaned mirrored ConfigMap; owning NodePool no longer exists",
"configMap", client.ObjectKeyFromObject(cm).String(), "nodePool", npName)
}
log.Info("delete mirror config ConfigMap", "config", client.ObjectKeyFromObject(cm).String())

Member

Nit: this falls through from the orphaned-mirrored path above, so orphaned-CM deletions emit two log lines while the other paths each emit one. Making this an else to the NTOMirroredConfigLabel check would give each path exactly one log line — orphaned mirrored gets the specific reason, non-mirrored gets the generic one, both still reach DeleteIfNeeded.
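
One way to read this suggestion is as a function that picks exactly one message per deletion path. The Go sketch below is hypothetical (`deletionLog` does not exist in the PR, and the real code calls log.Info with key/value pairs rather than formatting a string); only the two message texts are taken from the diff shown above.

```go
package main

import "fmt"

// deletionLog returns the single log line for a ConfigMap that is about to
// be deleted. By the time this runs, a mirrored ConfigMap has already been
// classified as orphaned; preserved ConfigMaps never reach this point.
func deletionLog(mirrored bool, key, nodePool string) string {
	if mirrored {
		// Orphaned mirrored path: specific reason only.
		return fmt.Sprintf("deleting orphaned mirrored ConfigMap; owning NodePool no longer exists configMap=%s nodePool=%s", key, nodePool)
	} else {
		// Non-mirrored path: generic message only.
		return fmt.Sprintf("delete mirror config ConfigMap config=%s", key)
	}
}

func main() {
	fmt.Println(deletionLog(true, "openshift-config-managed/orphan", "np-gone"))
	fmt.Println(deletionLog(false, "openshift-config-managed/plain", ""))
}
```

Both branches would still be followed by the same delete call, so behavior is unchanged and only the log volume differs.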

Labels

area/control-plane-operator Indicates the PR includes changes for the control plane operator - in an OCP release jira/severity-low Referenced Jira bug's severity is low for the branch this PR is targeting. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.
