Skip to content

PCP-6608: cluster-api-provider-maas wipes API server FQDN empty IP set persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery#339

Merged
snehala27 merged 5 commits into
spectro-masterfrom
PCP-6608
May 15, 2026
Merged

PCP-6608: cluster-api-provider-maas wipes API server FQDN empty IP set persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery#339
snehala27 merged 5 commits into
spectro-masterfrom
PCP-6608

Conversation

@AmitSahastra

@AmitSahastra AmitSahastra commented May 8, 2026

Copy link
Copy Markdown
Contributor

Problem

When a control-plane MaasMachine was briefly in Deployed (powered off) state, reconcileDNSAttachments() built an empty desired-IP set, PUT it to the MAAS DNSResource (wiping all A records), and persisted SHA-256("") as last-applied-dns-hash. Subsequent reconciles
saw no hash drift and returned early — DNS stayed empty with no self-recovery path. Workers lost connectivity to the API server FQDN, CNI never initialized, and the controller pod itself eventually went Unknown (chicken-and-egg).

Confirmed on vmo-eng-2025 (VMO Eng PCG): DNS wiped at 16:52:22 UTC, made visible by an unrelated power outage at 18:02 UTC. The e3b0c44…b855 annotation (SHA-256 of empty string) was the forensic indicator.

Changes

controllers/maascluster_controller.go

  • Empty-IP guard (primary fix): If runningIpAddresses is empty after filtering, return immediately without touching DNS or updating the hash annotation. The existing MAAS A records are preserved through the flap. When the CP recovers, the annotation hash will
    differ from the new desired set and DNS is corrected on the next reconcile.
  • Improved log messages: Track runningCPCount separately from machines with eligible IPs. Log messages now distinguish two silent-failure modes: "no running CP machines" (power-off flap) vs "CP machines running but no IPs match preferred subnets"
    (misconfiguration).
  • MAAS drift detection: DNS resource is fetched before the hash check. When the annotation hash matches, MAAS state is still verified via IsDriftDetected(). If MAAS has drifted (e.g. records manually wiped while desired set was unchanged), the controller forces a
    re-sync instead of returning early forever.

pkg/maas/dns/dns.go

  • IsDriftDetected(): New method comparing a pre-fetched DNS resource's current IP set against a desired list. Used by the controller's drift check.
  • Defence-in-depth guard in updateResourceIPs(): Returns an error if the desired set is empty, so any future caller that bypasses the controller-level guard cannot accidentally wipe DNS.

Failure modes addressed

Screenshot 2026-05-08 at 11 22 27 AM

Remaining gap (not in this PR)

If preferredSubnets is misconfigured such that all CP IPs are always filtered, DNS is preserved but no IP is ever written. The log message now surfaces this explicitly; a follow-up could add a condition/event on the MaasCluster object.

Tests:

  • Empty-IP guard with existingCPCount / runningCPCount distinction
  • MAAS drift detection and recovery
  • IP deduplication before hashing

Unit Tests:

• TestIsRunning — All machine states including nil state
• TestIsControlPlaneMachine — CP label detection
• TestGetExternalMachineIP — Subnet filtering
• TestReconcileDNSAttachments:
◦ CP powered off → DNS preserved
◦ No CP machines → DNS cleared
◦ CP with DeletionTimestamp → DNS cleared
◦ CP running but no ExternalIP → DNS preserved
◦ CP running, no prior annotation → DNS updated
◦ Hash match + MAAS in sync → skip update
◦ Hash match + MAAS drifted → force re-sync
◦ GetDNSResource error propagated
• IsDriftDetected tests — matching IPs, empty MAAS, different IPs, extra IPs, duplicates, empty strings

…et persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery

@bulwark-spectrocloud bulwark-spectrocloud Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ GoVulnCheck scan found vulnerabilities:

  1. GO-2026-4918
    • Module: golang.org/x/net
    • Found in: v0.38.0
    • Fixed in: v0.53.0
    • Example Traces:
      1. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      2. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      3. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      4. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http2.run

Please review these findings and fix the issues before merging.

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a control-plane DNS wipe scenario in the MAAS provider by preventing reconciles from PUT’ing an empty IP set (and caching the empty hash), and adds drift detection so MAAS-side changes can be corrected even when the last-applied annotation hasn’t changed.

Changes:

  • Add an empty-desired-IP guard in reconcileDNSAttachments() to preserve existing DNS records during transient CP unavailability.
  • Add MAAS drift detection by fetching the DNS resource before hash short-circuiting and forcing a re-sync when MAAS diverges.
  • Add DNS-layer defense-in-depth to refuse updating a MAAS DNS resource with an empty desired IP set.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

File Description
controllers/maascluster_controller.go Avoids empty-IP updates, improves logging, and adds drift-aware early-exit logic for DNS attachment reconciliation.
pkg/maas/dns/dns.go Adds drift comparison helper and prevents updating MAAS DNS resources with an empty desired IP set.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread pkg/maas/dns/dns.go Outdated
Comment thread pkg/maas/dns/dns.go
Comment thread controllers/maascluster_controller.go
- Remove empty-IP guard from updateResourceIPs helper; guard belongs in
  reconcileDNSAttachments where intent is known, not in a shared helper
  that must support intentional clearing (deprovisioning)
- Deduplicate runningIpAddresses before hashing so the annotation always
  represents the applied IP set, consistent with updateResourceIPs set
  semantics — prevents spurious re-syncs if duplicate IPs appear
- Add IsDriftDetected edge-case tests: duplicate desired IPs and empty
  strings in desired are correctly handled without false drift signals
@AmitSahastra AmitSahastra changed the title PCP-6608: cluster-api-provider-maas wipes API server FQDN: empty IP set persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery PCP-6608: cluster-api-provider-maas wipes API server FQDN empty IP set persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery May 8, 2026

@bulwark-spectrocloud bulwark-spectrocloud Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ GoVulnCheck scan found vulnerabilities:

  1. GO-2026-4918
    • Module: golang.org/x/net
    • Found in: v0.38.0
    • Fixed in: v0.53.0
    • Example Traces:
      1. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      2. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      3. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http2.run
      4. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip

Please review these findings and fix the issues before merging.

Fix #2 - last CP deletion leaves stale DNS:
  Track existingCPCount (CPs without DeletionTimestamp) separately.
  Preserve DNS only when existingCPCount > 0 (transient power-off flap).
  When existingCPCount == 0 (all CPs absent or pending deletion), fall
  through to clear DNS so stale records don't persist after a rolling
  replacement or scale-down.

Tests (#4, #5, #6):
  - CP with DeletionTimestamp: excluded from existingCPCount, DNS cleared
  - CP running but no ExternalIP: existingCPCount>0, DNS preserved
    (covers preferred-subnet mismatch code path)
  - GetDNSResource error: error propagated to caller
  - Updated "no CP machines" assertion to reflect new DNS-clear behaviour

@bulwark-spectrocloud bulwark-spectrocloud Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ GoVulnCheck scan found vulnerabilities:

  1. GO-2026-4918
    • Module: golang.org/x/net
    • Found in: v0.38.0
    • Fixed in: v0.53.0
    • Example Traces:
      1. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      2. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      3. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
      4. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http2.run

Please review these findings and fix the issues before merging.

@bulwark-spectrocloud bulwark-spectrocloud Bot dismissed stale reviews from themself May 8, 2026 08:49

Changes have been made to address the security findings.

@AmitSahastra AmitSahastra requested review from Kun483 and snehala27 May 8, 2026 11:06
@snehala27 snehala27 merged commit fff939b into spectro-master May 15, 2026
4 checks passed
@snehala27 snehala27 deleted the PCP-6608 branch May 15, 2026 06:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants