Skip to content

[WIP] PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses#323

Open
AmitSahastra wants to merge 1 commit intospectro-masterfrom
PCP-5556
Open

[WIP] PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses#323
AmitSahastra wants to merge 1 commit intospectro-masterfrom
PCP-5556

Conversation

@AmitSahastra
Copy link
Copy Markdown
Contributor

No description provided.

@AmitSahastra AmitSahastra changed the title PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses [WIP] PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses Mar 6, 2026
@AmitSahastra AmitSahastra requested a review from Copilot March 6, 2026 10:33
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR addresses MAAS “User reserved” IP exhaustion by adding best-effort IP release during machine cleanup paths and by retrying MAAS machine release operations to reduce stale allocations.

Changes:

  • Add best-effort IP release helpers and call them from multiple cleanup/release paths (including VM-host diagnostics and VM-host rejection).
  • Enhance ReleaseMachine (and deploy-failure cleanup) to explicitly release IPs first and retry MAAS Release() on failures.
  • On delete reconcile when no MAAS machine is found, release the spec static IP to avoid it remaining “User reserved”.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

File Description
pkg/maas/machine/machine.go Adds best-effort IP release + integrates it into machine release / failure cleanup with retries.
pkg/maas/machine/machine_test.go Updates mocks for new ReleaseMachine flow (Get + IPAddresses).
controllers/maasmachine_controller.go Releases spec static IP when MAAS machine can’t be located during delete reconciliation.
go.mod Updates Go toolchain version.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@@ -103,9 +103,11 @@ func TestMachine(t *testing.T) {
}

mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReleaseMachine calls maasClient.Machines() at least twice now (once for Get, once for Releaser().Release(), and more if retries happen), but this test only expects Machines() to be called once. Update the gomock expectation (e.g. Times(2) for the no-retry happy path, or AnyTimes/MinTimes if you want the test to tolerate retries).

Suggested change
mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
mockClientSetInterface.EXPECT().Machines().Return(mockMachines).Times(2) // Get, then Releaser

Copilot uses AI. Check for mistakes.
Comment on lines +152 to +153
// Machine already gone; release scope static IP so it does not stay "User reserved".
s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReleaseMachine treats any Machine().Get() error as "machine already gone" and then releases the scope static IP. If Get fails for a transient/non-404 reason (e.g. MAAS API outage), this can unintentionally release an IP while the machine still exists. Consider only taking the "already gone" path when the error is a confirmed not-found (e.g. the same 404/no machine matches checks used in GetMachine), and otherwise log/return the error (or skip IP release).

Suggested change
// Machine already gone; release scope static IP so it does not stay "User reserved".
s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
// Only treat confirmed/notional "not found" errors as "machine already gone".
errMsg := getErr.Error()
notFound := strings.Contains(errMsg, "404") ||
strings.Contains(errMsg, "not found") ||
strings.Contains(errMsg, "No matching machine")
if notFound {
s.scope.V(1).Info("Machine not found when fetching before release; releasing scope static IP", "systemID", systemID, "error", getErr)
// Machine already gone; release scope static IP so it does not stay "User reserved".
s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
} else {
// Transient or unexpected error; skip scope static IP release to avoid releasing an IP
// that may still be in use, but still attempt machine release below.
s.scope.V(1).Info("Failed to get machine before release; skipping scope static IP release", "systemID", systemID, "error", getErr)
}

Copilot uses AI. Check for mistakes.
Comment on lines +157 to +170
const maxAttempts = 3
const backoff = 2 * time.Second
var lastErr error
for attempt := 1; attempt <= maxAttempts; attempt++ {
_, lastErr = s.maasClient.Machines().Machine(systemID).Releaser().Release(ctx)
if lastErr == nil {
return nil
}
s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
if attempt < maxAttempts {
time.Sleep(backoff)
}
}
return errors.Wrapf(lastErr, "Unable to release machine after %d attempts", maxAttempts)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Release() retry loop retries on every error, including likely non-retriable ones (e.g. 404 not found if the machine was already released by another actor). This can add unnecessary delay and may cause reconciliation to fail even though the desired end state (machine gone) is already reached. Consider detecting "not found" errors and treating them as success, and only retrying errors that are known transient (or at least stopping retries on 4xx).

Copilot uses AI. Check for mistakes.
}
s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
if attempt < maxAttempts {
time.Sleep(backoff)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReleaseMachine uses time.Sleep for backoff while using context.TODO(), so this blocking wait can't be cancelled on shutdown or reconcile timeout. In controller-runtime codepaths this can reduce throughput and delay other reconciliations. Prefer a context-aware backoff (e.g. timer + select on ctx.Done()) and consider threading the caller's ctx into ReleaseMachine instead of creating a TODO context.

Suggested change
time.Sleep(backoff)
timer := time.NewTimer(backoff)
select {
case <-ctx.Done():
if !timer.Stop() {
<-timer.C
}
return ctx.Err()
case <-timer.C:
}

Copilot uses AI. Check for mistakes.
Comment on lines +297 to +304
for attempt := 1; attempt <= maxAttempts; attempt++ {
_, releaseErr = m.Releaser().Release(ctx)
if releaseErr == nil {
break
}
log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
if attempt < maxAttempts {
time.Sleep(backoff)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deploy-failure cleanup path also performs blocking retry sleeps using the same long-lived ctx variable; this can stall reconcile for multiple seconds and doesn't respect cancellation. Consider using a context-aware backoff (or pushing retries up to the reconciler via requeue) so shutdown and timeouts can interrupt the wait.

Suggested change
for attempt := 1; attempt <= maxAttempts; attempt++ {
_, releaseErr = m.Releaser().Release(ctx)
if releaseErr == nil {
break
}
log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
if attempt < maxAttempts {
time.Sleep(backoff)
retryRelease:
for attempt := 1; attempt <= maxAttempts; attempt++ {
// Stop retrying if the context has been cancelled.
if err := ctx.Err(); err != nil {
log.Error(err, "Context cancelled while retrying machine release on deploy failure")
break retryRelease
}
_, releaseErr = m.Releaser().Release(ctx)
if releaseErr == nil {
break
}
log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
if attempt < maxAttempts {
select {
case <-ctx.Done():
// Context cancelled during backoff; stop retrying.
log.Error(ctx.Err(), "Context cancelled during backoff while retrying machine release on deploy failure")
break retryRelease
case <-time.After(backoff):
}

Copilot uses AI. Check for mistakes.
Comment on lines +131 to +132
func (s *Service) ReleaseIP(ip string) {
ctx := context.TODO()
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReleaseIP creates a context.TODO() internally, which makes it hard for callers (controllers) to apply timeouts/cancellation and can block shutdown if MAAS calls hang. Consider changing ReleaseIP to accept a context.Context parameter and passing the reconcile ctx from the controller.

Suggested change
func (s *Service) ReleaseIP(ip string) {
ctx := context.TODO()
// For new code, prefer ReleaseIPWithContext to allow cancellation/timeouts.
func (s *Service) ReleaseIP(ip string) {
s.ReleaseIPWithContext(context.Background(), ip)
}
// ReleaseIPWithContext releases a single IP address in MAAS using the provided context.
// This allows callers (e.g. controllers) to apply timeouts and cancellation.
func (s *Service) ReleaseIPWithContext(ctx context.Context, ip string) {

Copilot uses AI. Check for mistakes.
}
// Release static IP from spec if set, so it does not remain "User reserved" in MAAS (e.g. force-deleted machine).
if maasMachine.Spec.StaticIP != nil && maasMachine.Spec.StaticIP.IP != "" {
machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reconcileDelete receives a context but discards it (_ context.Context) and then calls machineSvc.ReleaseIP(), which currently uses context.TODO(). This prevents request-scoped timeouts/cancellation for MAAS calls during delete. Consider keeping the ctx parameter and threading it through to ReleaseIP(ctx, ip).

Suggested change
machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
machineSvc.ReleaseIP(ctx, maasMachine.Spec.StaticIP.IP)

Copilot uses AI. Check for mistakes.
Comment on lines +145 to +149
if getErr == nil {
for _, ip := range m.IPAddresses() {
if ip != nil {
s.releaseIPBestEffort(ctx, ip.String())
}
Copy link

Copilot AI Mar 6, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ReleaseMachine now explicitly releases each machine IP (via releaseIPBestEffort/IPAddresses().Release()) before calling Releaser().Release(), but the current unit tests only cover an empty IP list. Please add/extend tests to assert non-empty IPAddresses() triggers the expected IP release calls (and ideally cover the retry path too).

Copilot uses AI. Check for mistakes.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants