[WIP] PCP-5556: MAAS CAPI Provider is failing to clean up Stale IP addresses #323
AmitSahastra wants to merge 1 commit into spectro-master from
Conversation
Pull request overview
This PR addresses MAAS “User reserved” IP exhaustion by adding best-effort IP release during machine cleanup paths and by retrying MAAS machine release operations to reduce stale allocations.
Changes:
- Add best-effort IP release helpers and call them from multiple cleanup/release paths (including VM-host diagnostics and VM-host rejection).
- Enhance `ReleaseMachine` (and deploy-failure cleanup) to explicitly release IPs first and retry MAAS `Release()` on failures.
- On delete reconcile, when no MAAS machine is found, release the spec static IP to avoid it remaining "User reserved".
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| pkg/maas/machine/machine.go | Adds best-effort IP release + integrates it into machine release / failure cleanup with retries. |
| pkg/maas/machine/machine_test.go | Updates mocks for new ReleaseMachine flow (Get + IPAddresses). |
| controllers/maasmachine_controller.go | Releases spec static IP when MAAS machine can’t be located during delete reconciliation. |
| go.mod | Updates Go toolchain version. |
```diff
@@ -103,9 +103,11 @@ func TestMachine(t *testing.T) {
 	}

 	mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
```
ReleaseMachine calls maasClient.Machines() at least twice now (once for Get, once for Releaser().Release(), and more if retries happen), but this test only expects Machines() to be called once. Update the gomock expectation (e.g. Times(2) for the no-retry happy path, or AnyTimes/MinTimes if you want the test to tolerate retries).
Suggested change:

```diff
-mockClientSetInterface.EXPECT().Machines().Return(mockMachines)
+mockClientSetInterface.EXPECT().Machines().Return(mockMachines).Times(2) // Get, then Releaser
```
```go
// Machine already gone; release scope static IP so it does not stay "User reserved".
s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
```
ReleaseMachine treats any Machine().Get() error as "machine already gone" and then releases the scope static IP. If Get fails for a transient/non-404 reason (e.g. MAAS API outage), this can unintentionally release an IP while the machine still exists. Consider only taking the "already gone" path when the error is a confirmed not-found (e.g. the same 404/no machine matches checks used in GetMachine), and otherwise log/return the error (or skip IP release).
Suggested change:

```diff
-// Machine already gone; release scope static IP so it does not stay "User reserved".
-s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
+// Only treat confirmed "not found" errors as "machine already gone".
+errMsg := getErr.Error()
+notFound := strings.Contains(errMsg, "404") ||
+	strings.Contains(errMsg, "not found") ||
+	strings.Contains(errMsg, "No matching machine")
+if notFound {
+	s.scope.V(1).Info("Machine not found when fetching before release; releasing scope static IP", "systemID", systemID, "error", getErr)
+	// Machine already gone; release scope static IP so it does not stay "User reserved".
+	s.releaseIPBestEffort(ctx, s.scope.GetStaticIP())
+} else {
+	// Transient or unexpected error; skip scope static IP release to avoid releasing an IP
+	// that may still be in use, but still attempt machine release below.
+	s.scope.V(1).Info("Failed to get machine before release; skipping scope static IP release", "systemID", systemID, "error", getErr)
+}
```
```go
const maxAttempts = 3
const backoff = 2 * time.Second
var lastErr error
for attempt := 1; attempt <= maxAttempts; attempt++ {
	_, lastErr = s.maasClient.Machines().Machine(systemID).Releaser().Release(ctx)
	if lastErr == nil {
		return nil
	}
	s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
	if attempt < maxAttempts {
		time.Sleep(backoff)
	}
}
return errors.Wrapf(lastErr, "Unable to release machine after %d attempts", maxAttempts)
```
The Release() retry loop retries on every error, including likely non-retriable ones (e.g. 404 not found if the machine was already released by another actor). This can add unnecessary delay and may cause reconciliation to fail even though the desired end state (machine gone) is already reached. Consider detecting "not found" errors and treating them as success, and only retrying errors that are known transient (or at least stopping retries on 4xx).
```go
	}
	s.scope.V(1).Info("Machine release attempt failed, will retry", "attempt", attempt, "systemID", systemID, "error", lastErr)
	if attempt < maxAttempts {
		time.Sleep(backoff)
```
ReleaseMachine uses time.Sleep for backoff while using context.TODO(), so this blocking wait can't be cancelled on shutdown or reconcile timeout. In controller-runtime codepaths this can reduce throughput and delay other reconciliations. Prefer a context-aware backoff (e.g. timer + select on ctx.Done()) and consider threading the caller's ctx into ReleaseMachine instead of creating a TODO context.
Suggested change:

```diff
-time.Sleep(backoff)
+timer := time.NewTimer(backoff)
+select {
+case <-ctx.Done():
+	if !timer.Stop() {
+		<-timer.C
+	}
+	return ctx.Err()
+case <-timer.C:
+}
```
```go
for attempt := 1; attempt <= maxAttempts; attempt++ {
	_, releaseErr = m.Releaser().Release(ctx)
	if releaseErr == nil {
		break
	}
	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
	if attempt < maxAttempts {
		time.Sleep(backoff)
```
The deploy-failure cleanup path also performs blocking retry sleeps using the same long-lived ctx variable; this can stall reconcile for multiple seconds and doesn't respect cancellation. Consider using a context-aware backoff (or pushing retries up to the reconciler via requeue) so shutdown and timeouts can interrupt the wait.
Suggested change:

```diff
-for attempt := 1; attempt <= maxAttempts; attempt++ {
-	_, releaseErr = m.Releaser().Release(ctx)
-	if releaseErr == nil {
-		break
-	}
-	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
-	if attempt < maxAttempts {
-		time.Sleep(backoff)
+retryRelease:
+for attempt := 1; attempt <= maxAttempts; attempt++ {
+	// Stop retrying if the context has been cancelled.
+	if err := ctx.Err(); err != nil {
+		log.Error(err, "Context cancelled while retrying machine release on deploy failure")
+		break retryRelease
+	}
+	_, releaseErr = m.Releaser().Release(ctx)
+	if releaseErr == nil {
+		break
+	}
+	log.Error(releaseErr, "Unable to release machine on deploy failure", "attempt", attempt)
+	if attempt < maxAttempts {
+		select {
+		case <-ctx.Done():
+			// Context cancelled during backoff; stop retrying.
+			log.Error(ctx.Err(), "Context cancelled during backoff while retrying machine release on deploy failure")
+			break retryRelease
+		case <-time.After(backoff):
+		}
```
```go
func (s *Service) ReleaseIP(ip string) {
	ctx := context.TODO()
```
ReleaseIP creates a context.TODO() internally, which makes it hard for callers (controllers) to apply timeouts/cancellation and can block shutdown if MAAS calls hang. Consider changing ReleaseIP to accept a context.Context parameter and passing the reconcile ctx from the controller.
Suggested change:

```diff
-func (s *Service) ReleaseIP(ip string) {
-	ctx := context.TODO()
+// For new code, prefer ReleaseIPWithContext to allow cancellation/timeouts.
+func (s *Service) ReleaseIP(ip string) {
+	s.ReleaseIPWithContext(context.Background(), ip)
+}
+
+// ReleaseIPWithContext releases a single IP address in MAAS using the provided context.
+// This allows callers (e.g. controllers) to apply timeouts and cancellation.
+func (s *Service) ReleaseIPWithContext(ctx context.Context, ip string) {
```
```go
}
// Release static IP from spec if set, so it does not remain "User reserved" in MAAS (e.g. force-deleted machine).
if maasMachine.Spec.StaticIP != nil && maasMachine.Spec.StaticIP.IP != "" {
	machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
```
reconcileDelete receives a context but discards it (_ context.Context) and then calls machineSvc.ReleaseIP(), which currently uses context.TODO(). This prevents request-scoped timeouts/cancellation for MAAS calls during delete. Consider keeping the ctx parameter and threading it through to ReleaseIP(ctx, ip).
Suggested change:

```diff
-machineSvc.ReleaseIP(maasMachine.Spec.StaticIP.IP)
+machineSvc.ReleaseIP(ctx, maasMachine.Spec.StaticIP.IP)
```
```go
if getErr == nil {
	for _, ip := range m.IPAddresses() {
		if ip != nil {
			s.releaseIPBestEffort(ctx, ip.String())
		}
```
ReleaseMachine now explicitly releases each machine IP (via releaseIPBestEffort/IPAddresses().Release()) before calling Releaser().Release(), but the current unit tests only cover an empty IP list. Please add/extend tests to assert non-empty IPAddresses() triggers the expected IP release calls (and ideally cover the retry path too).
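To illustrate the shape of the missing assertion without the project's gomock setup, here is a self-contained sketch using a hand-rolled fake instead of generated mocks; `fakeIPReleaser` and `releaseAllIPs` are hypothetical stand-ins for the mocked client and the IP-release loop in `ReleaseMachine`:

```go
package main

import "fmt"

// fakeIPReleaser records every IP it is asked to release,
// playing the role the gomock expectations would play.
type fakeIPReleaser struct{ released []string }

func (f *fakeIPReleaser) Release(ip string) {
	f.released = append(f.released, ip)
}

// releaseAllIPs mirrors the loop under review: release each
// non-empty machine IP before releasing the machine itself.
func releaseAllIPs(r *fakeIPReleaser, ips []string) {
	for _, ip := range ips {
		if ip != "" {
			r.Release(ip)
		}
	}
}

func main() {
	r := &fakeIPReleaser{}
	// A non-empty IP list must trigger one release call per valid IP;
	// empty entries (nil IPs in the real code) are skipped.
	releaseAllIPs(r, []string{"10.0.0.5", "", "10.0.0.6"})
	fmt.Println(r.released) // [10.0.0.5 10.0.0.6]
}
```

The equivalent gomock test would set `IPAddresses()` to return a non-empty slice and expect a matching number of release calls, rather than the empty list the current tests use.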
No description provided.