Clean up supervisor images in the database#2535

Open
pipex wants to merge 1 commit into master from remove-sv-image-fix

Conversation

@pipex pipex commented Jul 2, 2026

Contributor

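On supervisor v18.2.0, we renamed the supervisor service to `core` in the [supervisor composition](https://github.com/balena-os/balena-supervisor/blob/v18.2.0/docker-compose.yml). That change also updated the [isSupervisor](https://github.com/balena-os/balena-supervisor/blob/v18.2.0/src/lib/supervisor-metadata.ts#L34) filter in the supervisor-metadata module.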

The problem is that when updating to v18.2.0, the following race condition may happen:

  • The device is on v17.8.5, and the user updates the supervisor on the dashboard to v18.2.0. The target state changes.
  • Before the update-balena-supervisor script can run, supervisor v17.8.5 sees the new core service, which is not part of its filter list, and creates the service and image in the database.
  • Eventually supervisor v18.2.0 runs. It knows to filter core and tries to remove the image, but it is running from that image itself, so it gets stuck in a loop trying to remove the image and failing.
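The mismatch between the two filter lists can be illustrated with a small sketch. Everything here is hypothetical: `OLD_FILTER`, `NEW_FILTER`, `isFiltered`, and the placeholder name `legacy-supervisor` are made up for illustration and are not the supervisor's actual code or service names.

```typescript
// Hypothetical filter lists; the real check lives in the supervisor-metadata module.
// 'legacy-supervisor' is a placeholder for whatever name was used before the rename.
const OLD_FILTER: string[] = ['legacy-supervisor'];
const NEW_FILTER: string[] = ['legacy-supervisor', 'core'];

// Returns true when the given supervisor version would ignore this service.
function isFiltered(serviceName: string, filter: string[]): boolean {
  return filter.indexOf(serviceName) !== -1;
}

// Service list in the target state after the dashboard update to v18.2.0.
const targetServices: string[] = ['core'];

// The old supervisor does not filter `core`, so it installs it like a user service.
const installedByOld = targetServices.filter((s) => !isFiltered(s, OLD_FILTER));

// The new supervisor filters `core`, so it wants to remove what the old one installed,
// even though it is itself running from that image.
const toRemoveByNew = installedByOld.filter((s) => isFiltered(s, NEW_FILTER));
```

In this model `core` ends up in both lists, which is the state that leads to the removal loop described above.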

This adds an isSupervisor check to the initial application manager image cleanup to avoid the issue altogether.
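A minimal sketch of what such a cleanup step could look like, assuming it only drops the stale records from the database and leaves the engine image alone (the PR text does not spell this out). `ImageRecord`, `isSupervisorService`, and `splitForCleanup` are hypothetical names, not the supervisor's real API.

```typescript
// Hypothetical shape of an image row in the supervisor database.
interface ImageRecord {
  id: number;
  serviceName: string;
}

// Assumed list of service names that identify the supervisor itself.
const SUPERVISOR_SERVICE_NAMES: string[] = ['core'];

function isSupervisorService(serviceName: string): boolean {
  return SUPERVISOR_SERVICE_NAMES.indexOf(serviceName) !== -1;
}

// Splits database records into those to keep and those to drop on startup.
// Dropped records are only forgotten by the database in this sketch, so the
// supervisor never attempts to delete an image that is currently in use.
function splitForCleanup(records: ImageRecord[]): {
  keep: ImageRecord[];
  dropFromDb: ImageRecord[];
} {
  const keep: ImageRecord[] = [];
  const dropFromDb: ImageRecord[] = [];
  for (const record of records) {
    if (isSupervisorService(record.serviceName)) {
      dropFromDb.push(record);
    } else {
      keep.push(record);
    }
  }
  return { keep, dropFromDb };
}

const cleanupResult = splitForCleanup([
  { id: 1, serviceName: 'core' },
  { id: 2, serviceName: 'user-app' },
]);
```

Running the filter once at initialization, before any state is applied, is what would keep the delete loop from starting in this model.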

If a customer is stuck in the delete loop, updating to this supervisor should fix it.

Change-type: patch

Release notes

Fixes an issue introduced by renaming the supervisor service to core: the old supervisor could install the core container and image, and the new supervisor could not remove the image because it is in use by the current supervisor container. This change adds a cleanup step that avoids the image delete loop and restores normal operation.

@flowzone-app flowzone-app Bot enabled auto-merge July 3, 2026 15:11
pipex commented Jul 3, 2026

Contributor Author

How to test

  • Start from a device running a supervisor version below v18.2.0.
  • Disable openvpn on the device.
  • Trigger an update to v18.2.0 or above.
  • Restart the supervisor to trigger a new poll, then wait for the core_xxx container to show up on the engine.
  • Run update-balena-supervisor. You should see the core container get removed, followed by a loop trying to remove the core image.
  • Update to this supervisor. You should see "Device state apply success".

pipex commented Jul 3, 2026

Contributor Author

This works. I tested two scenarios:

  • A device on 18.2.0 stuck in the image removal loop reaches a healthy state when updated to this version.
  • A device on 17.8.5, updated to this version following the instructions above, removes the core container and gets a target state applied.

@pipex pipex requested a review from cywang117 July 3, 2026 18:51