PCP-6608: cluster-api-provider-maas wipes API server FQDN: empty IP set persisted to MAAS DNS when CP machine is transiently powered off, hash-cached so no self-recovery #339
- GO-2026-4918
- Module: golang.org/x/net
- Found in: v0.38.0
- Fixed in: v0.53.0
- Example Traces:
1. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
2. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
3. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http.roundTrip
4. pkg/maas/dns/dns.go:52:10: dns.ReconcileDNS calls maasclient.Create, which eventually calls http2.run
Please review these findings and fix the issues before merging.
Pull request overview
Fixes a control-plane DNS wipe scenario in the MAAS provider by preventing reconciles from PUT’ing an empty IP set (and caching the empty hash), and adds drift detection so MAAS-side changes can be corrected even when the last-applied annotation hasn’t changed.
Changes:
- Add an empty-desired-IP guard in reconcileDNSAttachments() to preserve existing DNS records during transient CP unavailability.
- Add MAAS drift detection by fetching the DNS resource before hash short-circuiting and forcing a re-sync when MAAS diverges.
- Add DNS-layer defense-in-depth to refuse updating a MAAS DNS resource with an empty desired IP set.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| controllers/maascluster_controller.go | Avoids empty-IP updates, improves logging, and adds drift-aware early-exit logic for DNS attachment reconciliation. |
| pkg/maas/dns/dns.go | Adds drift comparison helper and prevents updating MAAS DNS resources with an empty desired IP set. |
- Remove empty-IP guard from updateResourceIPs helper; the guard belongs in reconcileDNSAttachments, where intent is known, not in a shared helper that must support intentional clearing (deprovisioning).
- Deduplicate runningIpAddresses before hashing so the annotation always represents the applied IP set, consistent with updateResourceIPs set semantics; this prevents spurious re-syncs if duplicate IPs appear.
- Add IsDriftDetected edge-case tests: duplicate desired IPs and empty strings in desired are correctly handled without false drift signals.
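The deduplicate-before-hashing step described in this commit can be sketched roughly as follows. This is a minimal illustration, not the provider's code: `dedupeIPs` and `hashAppliedIPs` are hypothetical names, and the sorted, comma-joined serialization is an assumption made only so the hash is order-independent.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// dedupeIPs returns the unique IPs in sorted order.
// Hypothetical helper; the real controller may normalise differently.
func dedupeIPs(ips []string) []string {
	seen := make(map[string]struct{}, len(ips))
	out := make([]string, 0, len(ips))
	for _, ip := range ips {
		if _, ok := seen[ip]; ok {
			continue // already recorded, skip the duplicate
		}
		seen[ip] = struct{}{}
		out = append(out, ip)
	}
	sort.Strings(out)
	return out
}

// hashAppliedIPs hashes the deduplicated set, so the stored annotation
// describes the set of IPs rather than the raw input list.
// The comma-joined encoding is an assumption for this sketch.
func hashAppliedIPs(ips []string) string {
	sum := sha256.Sum256([]byte(strings.Join(dedupeIPs(ips), ",")))
	return hex.EncodeToString(sum[:])
}

func main() {
	withDup := []string{"10.0.0.5", "10.0.0.6", "10.0.0.5"}
	unique := []string{"10.0.0.6", "10.0.0.5"}
	// Both inputs describe the same set, so the hashes match.
	fmt.Println(hashAppliedIPs(withDup) == hashAppliedIPs(unique)) // prints true
}
```

Hashing the set rather than the raw list means a duplicate entry appearing in one reconcile does not change the annotation and therefore does not trigger a re-sync on its own.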
Fix #2 - last CP deletion leaves stale DNS:
Track existingCPCount (CPs without DeletionTimestamp) separately. Preserve DNS only when existingCPCount > 0 (transient power-off flap). When existingCPCount == 0 (all CPs absent or pending deletion), fall through to clear DNS so stale records don't persist after a rolling replacement or scale-down.
Tests (#4, #5, #6):
- CP with DeletionTimestamp: excluded from existingCPCount, DNS cleared
- CP running but no ExternalIP: existingCPCount > 0, DNS preserved (covers preferred-subnet mismatch code path)
- GetDNSResource error: error propagated to caller
- Updated "no CP machines" assertion to reflect new DNS-clear behaviour
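The preserve-versus-clear decision in this commit can be sketched as a small pure function. This is a simplified illustration under stated assumptions: `cpMachine`, `dnsAction`, and `decideDNSAction` are hypothetical names, the Kubernetes DeletionTimestamp is reduced to a boolean, and "not running" or "filtered by preferred subnet" are both modelled as an empty ExternalIP.

```go
package main

import "fmt"

// dnsAction is a hypothetical enum for the three outcomes described above.
type dnsAction int

const (
	actionUpdate   dnsAction = iota // write the desired IPs to MAAS
	actionPreserve                  // leave the existing DNS records untouched
	actionClear                     // remove stale records
)

// cpMachine is a simplified stand-in for a control-plane MaasMachine.
type cpMachine struct {
	Deleting   bool   // true when DeletionTimestamp is set
	ExternalIP string // empty when powered off or filtered out by subnet
}

// decideDNSAction counts CPs that are not pending deletion (existingCPCount)
// and collects their usable IPs, then picks one of the three outcomes.
func decideDNSAction(cps []cpMachine) (dnsAction, []string) {
	existingCPCount := 0
	var ips []string
	for _, m := range cps {
		if m.Deleting {
			continue // excluded from existingCPCount
		}
		existingCPCount++
		if m.ExternalIP != "" {
			ips = append(ips, m.ExternalIP)
		}
	}
	switch {
	case len(ips) > 0:
		return actionUpdate, ips
	case existingCPCount > 0:
		return actionPreserve, nil // transient power-off flap
	default:
		return actionClear, nil // all CPs absent or pending deletion
	}
}

func main() {
	action, _ := decideDNSAction([]cpMachine{{ExternalIP: ""}})
	fmt.Println(action == actionPreserve) // prints true
	action, _ = decideDNSAction([]cpMachine{{Deleting: true, ExternalIP: "10.0.0.5"}})
	fmt.Println(action == actionClear) // prints true
}
```

Keeping the decision separate from the MAAS client call makes each of the listed test cases a one-line table entry.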
Changes have been made to address the security findings.
Problem
When a control-plane MaasMachine was briefly in Deployed (powered off) state, reconcileDNSAttachments() built an empty desired-IP set, PUT it to the MAAS DNSResource (wiping all A records), and persisted SHA-256("") as last-applied-dns-hash. Subsequent reconciles saw no hash drift and returned early, so DNS stayed empty with no self-recovery path. Workers lost connectivity to the API server FQDN, CNI never initialized, and the controller pod itself eventually went Unknown (chicken-and-egg).
Confirmed on vmo-eng-2025 (VMO Eng PCG): DNS wiped at 16:52:22 UTC, made visible by an unrelated power outage at 18:02 UTC. The e3b0c44…b855 annotation (SHA-256 of empty string) was the forensic indicator.
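For anyone checking other clusters for the same indicator, a short sketch of how the empty-set hash arises. The SHA-256 of the empty string is a fixed, well-known value; `hashIPs` and `isEmptySetHash` are hypothetical helpers, and the sorted, comma-joined serialization is an assumption (the source only states that SHA-256("") was persisted for the empty set).

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
	"strings"
)

// emptySetHash is SHA-256 of the empty string, i.e. the e3b0c44...b855
// value seen in the last-applied-dns-hash annotation.
const emptySetHash = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"

// hashIPs is a hypothetical stand-in for the controller's hashing step.
// With no IPs the joined string is empty, which yields emptySetHash.
func hashIPs(ips []string) string {
	sorted := append([]string(nil), ips...) // copy so the caller's slice is not reordered
	sort.Strings(sorted)
	sum := sha256.Sum256([]byte(strings.Join(sorted, ",")))
	return hex.EncodeToString(sum[:])
}

// isEmptySetHash reports whether an annotation value records an empty IP set.
func isEmptySetHash(annotation string) bool {
	return annotation == emptySetHash
}

func main() {
	fmt.Println(hashIPs(nil) == emptySetHash)                  // prints true
	fmt.Println(isEmptySetHash(hashIPs([]string{"10.0.0.5"}))) // prints false
}
```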
Changes
controllers/maascluster_controller.go
differ from the new desired set and DNS is corrected on the next reconcile.
(misconfiguration).
re-sync instead of returning early forever.
pkg/maas/dns/dns.go
Failure modes addressed
Remaining gap (not in this PR)
If preferredSubnets is misconfigured such that all CP IPs are always filtered, DNS is preserved but no IP is ever written. The log message now surfaces this explicitly; a follow-up could add a condition/event on the MaasCluster object.
Tests:
Unit Tests:
• TestIsRunning — All machine states including nil state
• TestIsControlPlaneMachine — CP label detection
• TestGetExternalMachineIP — Subnet filtering
• TestReconcileDNSAttachments:
◦ CP powered off → DNS preserved
◦ No CP machines → DNS cleared
◦ CP with DeletionTimestamp → DNS cleared
◦ CP running but no ExternalIP → DNS preserved
◦ CP running, no prior annotation → DNS updated
◦ Hash match + MAAS in sync → skip update
◦ Hash match + MAAS drifted → force re-sync
◦ GetDNSResource error propagated
• IsDriftDetected tests — matching IPs, empty MAAS, different IPs, extra IPs, duplicates, empty strings
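The drift cases listed above can be illustrated with a set comparison. This is a sketch under assumptions, not the actual IsDriftDetected implementation: its real signature is not shown in this PR text, so `driftDetected`, `toSet`, and `needsSync` are hypothetical names, and ignoring empty strings follows the edge-case tests mentioned in the commit message.

```go
package main

import "fmt"

// toSet builds a set of IPs, dropping empty strings and duplicates.
func toSet(ips []string) map[string]struct{} {
	set := make(map[string]struct{}, len(ips))
	for _, ip := range ips {
		if ip == "" {
			continue // empty entries carry no address
		}
		set[ip] = struct{}{}
	}
	return set
}

// driftDetected reports whether the IPs currently in MAAS differ, as a set,
// from the desired IPs. Order and duplicates do not count as drift.
func driftDetected(maasIPs, desiredIPs []string) bool {
	have, want := toSet(maasIPs), toSet(desiredIPs)
	if len(have) != len(want) {
		return true
	}
	for ip := range want {
		if _, ok := have[ip]; !ok {
			return true
		}
	}
	return false
}

// needsSync combines the hash check with the drift check: a matching hash
// only allows an early exit when MAAS still agrees with the desired set.
func needsSync(lastAppliedHash, desiredHash string, maasIPs, desiredIPs []string) bool {
	return lastAppliedHash != desiredHash || driftDetected(maasIPs, desiredIPs)
}

func main() {
	desired := []string{"10.0.0.5", "10.0.0.6"}
	fmt.Println(driftDetected([]string{"10.0.0.6", "10.0.0.5"}, desired)) // prints false
	fmt.Println(driftDetected(nil, desired))                              // prints true
	fmt.Println(needsSync("abc", "abc", nil, desired))                    // prints true
}
```

The last call corresponds to the incident in this PR: the hash matches, but MAAS holds no records, so a re-sync is forced instead of an early return.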