[WIP] PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses #323
AmitSahastra wants to merge 1 commit into spectro-master from
Conversation
Pull request overview
This PR addresses MAAS “User reserved” IP exhaustion by adding best-effort IP release during machine cleanup paths and by retrying MAAS machine release operations to reduce stale allocations.
Changes:
- Add best-effort IP release helpers and call them from multiple cleanup/release paths (including VM-host diagnostics and VM-host rejection).
- Enhance `ReleaseMachine` (and deploy-failure cleanup) to explicitly release IPs first and retry MAAS `Release()` on failures.
- On delete reconcile, when no MAAS machine is found, release the spec static IP to avoid it remaining "User reserved".
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/maas/machine/machine.go | Adds best-effort IP release + integrates it into machine release / failure cleanup with retries. |
| pkg/maas/machine/machine_test.go | Updates mocks for new ReleaseMachine flow (Get + IPAddresses). |
| controllers/maasmachine_controller.go | Releases spec static IP when MAAS machine can’t be located during delete reconciliation. |
| go.mod | Updates Go toolchain version. |
```diff
@@ -103,9 +103,11 @@ func TestMachine(t *testing.T) {
 	}

 	mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
```
ReleaseMachine calls maasClient.Machines() at least twice now (once for Get, once for Releaser().Release(), and more if retries happen), but this test only expects Machines() to be called once. Update the gomock expectation (e.g. Times(2) for the no-retry happy path, or AnyTimes/MinTimes if you want the test to tolerate retries).
Suggested change:

```diff
-mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
+mockClientSetInterface.EXPECT().Machines().Return(mockMachines).Times(2) // Get, then Releaser
```
```go
// Machine already gone; release scope static IP so it does not stay "User reserved".
s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
```
ReleaseMachine treats any Machine().Get() error as "machine already gone" and then releases the scope static IP. If Get fails for a transient/non-404 reason (e.g. MAAS API outage), this can unintentionally release an IP while the machine still exists. Consider only taking the "already gone" path when the error is a confirmed not-found (e.g. the same 404/no machine matches checks used in GetMachine), and otherwise log/return the error (or skip IP release).
Suggested change:

```diff
-// Machine already gone; release scope static IP so it does not stay "User reserved".
-s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
+// Only treat confirmed "not found" errors as "machine already gone".
+errMsg := getErr.Error()
+notFound := strings.Contains(errMsg, "404") ||
+	strings.Contains(errMsg, "not found") ||
+	strings.Contains(errMsg, "No matching machine")
+if notFound {
+	s.scope.V(1).Info("Machine not found when fetching before release; releasing scope static IP", "systemID", systemID, "error", getErr)
+	// Machine already gone; release scope static IP so it does not stay "User reserved".
+	s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
+} else {
+	// Transient or unexpected error; skip scope static IP release to avoid releasing an IP
+	// that may still be in use, but still attempt machine release below.
+	s.scope.V(1).Info("Failed to get machine before release; skipping scope static IP release", "systemID", systemID, "error", getErr)
+}
```
```go
const maxAttempts = 3
const backoff = 2 * time.Second
var lastErr error
for attempt := 1; attempt <= maxAttempts; attempt++ {
	_, lastErr = s.maasClient.Machines().Machine(systemID).Releaser().Release(ctx)
	if lastErr == nil {
		return nil
	}
	s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
	if attempt < maxAttempts {
		time.Sleep(backoff)
	}
}
return errors.Wrapf(lastErr, "Unable to release machine after %d attempts", maxAttempts)
```
The Release() retry loop retries on every error, including likely non-retriable ones (e.g. 404 not found if the machine was already released by another actor). This can add unnecessary delay and may cause reconciliation to fail even though the desired end state (machine gone) is already reached. Consider detecting "not found" errors and treating them as success, and only retrying errors that are known transient (or at least stopping retries on 4xx).
```go
	}
	s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
	if attempt < maxAttempts {
		time.Sleep(backoff)
```
ReleaseMachine uses time.Sleep for backoff while using context.TODO(), so this blocking wait can't be cancelled on shutdown or reconcile timeout. In controller-runtime codepaths this can reduce throughput and delay other reconciliations. Prefer a context-aware backoff (e.g. timer + select on ctx.Done()) and consider threading the caller's ctx into ReleaseMachine instead of creating a TODO context.
Suggested change:

```diff
-time.Sleep(backoff)
+timer := time.NewTimer(backoff)
+select {
+case <-ctx.Done():
+	if !timer.Stop() {
+		<-timer.C
+	}
+	return ctx.Err()
+case <-timer.C:
+}
```
```go
for attempt := 1; attempt <= maxAttempts; attempt++ {
	_, releaseErr = m.Releaser().Release(ctx)
	if releaseErr == nil {
		break
	}
	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
	if attempt < maxAttempts {
		time.Sleep(backoff)
```
The deploy-failure cleanup path also performs blocking retry sleeps using the same long-lived ctx variable; this can stall reconcile for multiple seconds and doesn't respect cancellation. Consider using a context-aware backoff (or pushing retries up to the reconciler via requeue) so shutdown and timeouts can interrupt the wait.
Suggested change:

```diff
-for attempt := 1; attempt <= maxAttempts; attempt++ {
-	_, releaseErr = m.Releaser().Release(ctx)
-	if releaseErr == nil {
-		break
-	}
-	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
-	if attempt < maxAttempts {
-		time.Sleep(backoff)
+retryRelease:
+for attempt := 1; attempt <= maxAttempts; attempt++ {
+	// Stop retrying if the context has been cancelled.
+	if err := ctx.Err(); err != nil {
+		log.Error(err, "Context cancelled while retrying machine release on deploy failure")
+		break retryRelease
+	}
+	_, releaseErr = m.Releaser().Release(ctx)
+	if releaseErr == nil {
+		break
+	}
+	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
+	if attempt < maxAttempts {
+		select {
+		case <-ctx.Done():
+			// Context cancelled during backoff; stop retrying.
+			log.Error(ctx.Err(), "Context cancelled during backoff while retrying machine release on deploy failure")
+			break retryRelease
+		case <-time.After(backoff):
+		}
```
```go
func (s *Service) ReleaseIP(ip string) {
	ctx := context.TODO()
```
ReleaseIP creates a context.TODO() internally, which makes it hard for callers (controllers) to apply timeouts/cancellation and can block shutdown if MAAS calls hang. Consider changing ReleaseIP to accept a context.Context parameter and passing the reconcile ctx from the controller.
Suggested change:

```diff
-func (s *Service) ReleaseIP(ip string) {
-	ctx := context.TODO()
+// For new code, prefer ReleaseIPWithContext to allow cancellation/timeouts.
+func (s *Service) ReleaseIP(ip string) {
+	s.ReleaseIPWithContext(context.Background(), ip)
+}
+
+// ReleaseIPWithContext releases a single IP address in MAAS using the provided context.
+// This allows callers (e.g. controllers) to apply timeouts and cancellation.
+func (s *Service) ReleaseIPWithContext(ctx context.Context, ip string) {
```
```go
}
// Release static IP from spec if set, so it does not remain "User reserved" in MAAS (e.g. force-deleted machine).
if maasMachine.Spec.StaticIP != nil && maasMachine.Spec.StaticIP.IP != "" {
	machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
```
reconcileDelete receives a context but discards it (_ context.Context) and then calls machineSvc.ReleaseIP(), which currently uses context.TODO(). This prevents request-scoped timeouts/cancellation for MAAS calls during delete. Consider keeping the ctx parameter and threading it through to ReleaseIP(ctx, ip).
Suggested change:

```diff
-machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
+machineSvc.ReleaseIP(ctx, maasMachine.Spec.StaticIP.IP)
```
```go
if getErr == nil {
	for _, ip := range m.IPAddresses() {
		if ip != nil {
			s.releaseIPBestEffort(ctx, ip.String())
		}
```
ReleaseMachine now explicitly releases each machine IP (via releaseIPBestEffort/IPAddresses().Release()) before calling Releaser().Release(), but the current unit tests only cover an empty IP list. Please add/extend tests to assert non-empty IPAddresses() triggers the expected IP release calls (and ideally cover the retry path too).
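To illustrate the shape of the missing assertion without the project's gomock setup, here is a self-contained sketch using a hand-rolled fake instead of generated mocks; `fakeIPReleaser` and `releaseAllIPs` are hypothetical stand-ins for the mocked client and the IP-release loop in `ReleaseMachine`:

```go
package main

import "fmt"

// fakeIPReleaser records every IP it is asked to release,
// playing the role the gomock expectations would play.
type fakeIPReleaser struct{ released []string }

func (f *fakeIPReleaser) Release(ip string) {
	f.released = append(f.released, ip)
}

// releaseAllIPs mirrors the loop under review: release each
// non-empty machine IP before releasing the machine itself.
func releaseAllIPs(r *fakeIPReleaser, ips []string) {
	for _, ip := range ips {
		if ip != "" {
			r.Release(ip)
		}
	}
}

func main() {
	r := &fakeIPReleaser{}
	// A non-empty IP list must trigger one release call per valid IP;
	// empty entries (nil IPs in the real code) are skipped.
	releaseAllIPs(r, []string{"10.0.0.5", "", "10.0.0.6"})
	fmt.Println(r.released) // [10.0.0.5 10.0.0.6]
}
```

The equivalent gomock test would set `IPAddresses()` to return a non-empty slice and expect a matching number of release calls, rather than the empty list the current tests use.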
No description provided.