Autopilot applyingUpdate reconciler is non-idempotent — workers wedge forever after client.Update conflict #7703

@banjoh

Description

Before creating an issue, make sure you've checked the following:

  • You are running the latest released version of k0s
  • Make sure you've searched for existing issues, both open and closed
  • Make sure you've searched for PRs too, a fix might've been merged already
  • You're looking at docs for the released version, "main" branch docs are usually ahead of released versions.

Platform

Linux (observed on workers running k0s in HA)

Version

v1.35.1+k0s.1


What happened?

The applyingUpdate reconciler in pkg/autopilot/controller/signal/k0s/apply.go is not idempotent. If the worker successfully renames /usr/local/bin/k0s.tmp to /usr/local/bin/k0s but the subsequent client.Update(ctx, node) fails (e.g. with a resourceVersion conflict: "the object has been modified"), controller-runtime retries the reconciler. On retry, os.Stat("/usr/local/bin/k0s.tmp") fails because the file was already moved, and the reconciler enters an infinite error loop:

Applying update
Reconciler error  error="unable to find update file 'k0s.tmp': stat /usr/local/bin/k0s.tmp: no such file or directory"

The Node's autopilot signal annotation stays stuck at ApplyingUpdate forever, even though the target binary is already correctly in place at /usr/local/bin/k0s. Restarting the worker (or the k0s service) is not sufficient — on restart, the reconciler reads the same stuck annotation and re-enters the same failing loop. The only recovery is operator intervention: delete the autopilot Plan and remove the k0sproject.io/autopilot-signal-data annotation from each affected worker Node before a new Plan can make progress.

Reference source on main:
https://github.com/k0sproject/k0s/blob/main/pkg/autopilot/controller/signal/k0s/apply.go

The reconciler does:

  1. client.Get(ctx, node) — read Node
  2. os.Stat("/usr/local/bin/k0s.tmp") — check downloaded binary
  3. os.Rename("/usr/local/bin/k0s.tmp", "/usr/local/bin/k0s") — replace binary
  4. client.Update(ctx, node) — write Restart status to Node annotation

If step 4 fails after step 3 has run, the controller-runtime retry starts again from step 1, fails at step 2 because k0s.tmp no longer exists, and repeats that failure on every later retry.

Note: PR #6994 ("Requeue Autopilot signal node updates on conflict", merged 2026-01-27) fixes an analogous conflict-on-update bug in pkg/autopilot/controller/plans/cmdprovider/{k0supdate,airgapupdate}/schedulable.go, but does not touch signal/k0s/apply.go. The worker-side applyingUpdate reconciler remains non-idempotent.

Steps to reproduce

  1. Create a 2-node cluster (1 controller + 1 worker) running k0s v1.35.1+k0s.1 with autopilot enabled.
    On the controller node:

    curl --proto '=https' --tlsv1.2 -sSf https://get.k0s.sh | sudo K0S_VERSION=v1.35.1+k0s.1 sh
    sudo k0s install controller --enable-worker
    sudo k0s start

    On the controller, generate a worker join token:

    sudo k0s token create --role=worker > /tmp/worker-token

    Copy /tmp/worker-token over to the worker node.

    On the worker node:

    curl --proto '=https' --tlsv1.2 -sSf https://get.k0s.sh | sudo K0S_VERSION=v1.35.1+k0s.1 sh
    sudo k0s install worker --token-file /tmp/worker-token
    sudo k0s start

    Wait for the cluster to be ready (run on the controller):

    sudo k0s kubectl wait --for=condition=Ready node --all --timeout=120s
  2. On the worker, start a tight loop that moves k0s.tmp into place as soon as it appears:

    while true; do
      { [ -f /usr/local/bin/k0s.tmp ] && mv /usr/local/bin/k0s.tmp /usr/local/bin/k0s; } 2>/dev/null || true
      sleep 0.05 2>/dev/null || true
    done
  3. From the controller, create an autopilot Plan that upgrades to a newer version (e.g. v1.35.4+k0s.0) with selector: {} discovery so all nodes are targeted.

  4. Observe the worker's autopilot logs:

    journalctl -u k0sworker -f | grep k0s.tmp
    

    The applyingUpdate reconciler loops forever with:

    unable to find update file 'k0s.tmp': stat /usr/local/bin/k0s.tmp: no such file or directory
    
  5. Verify the binary was moved into place and is the correct target version:

    /usr/local/bin/k0s version
    # v1.35.4+k0s.0

    The Node annotation k0sproject.io/autopilot-signal-data stays stuck at status.status: ApplyingUpdate indefinitely.

Note: This reproducer does not simulate the original client.Update API-conflict trigger. It instead deterministically produces the same on-disk state the race leaves behind — k0s.tmp already moved into place before the reconciler's os.Stat runs. The reconciler then enters the same infinite error loop, which proves the underlying bug is the reconciler's non-idempotency, independent of what triggers the second pass.

Expected behavior

The applyingUpdate reconciler should be idempotent: if k0s.tmp is missing, it should check whether /usr/local/bin/k0s is already the requested target version and, if so, proceed to write the Restart status to the Node annotation instead of erroring out. This would make the reconciler resilient to:

  1. The client.Update conflict / retry path.
  2. Any other scenario in which the binary is already in place from a prior partial run.
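
One possible shape for the idempotent check, as a hedged Go sketch rather than a proposed patch: installUpdate, fileVersion, and demo are hypothetical names, and installedVersion is an assumed helper for determining the version of the binary already on disk (how k0s would actually do that is left open here). In the demo the file's contents simply stand in for the version string.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// installUpdate is an idempotent variant of the stat-and-rename step.
// installedVersion is a hypothetical helper that reports the version of the
// binary at a path; it is an assumption of this sketch, not a k0s API.
func installUpdate(tmpPath, binPath, target string, installedVersion func(string) (string, error)) error {
	_, err := os.Stat(tmpPath)
	switch {
	case err == nil:
		// Normal path: the downloaded binary is waiting to be moved into place.
		return os.Rename(tmpPath, binPath)
	case errors.Is(err, fs.ErrNotExist):
		// The update file is gone: accept this only if the installed binary
		// is already the requested target version.
		got, verr := installedVersion(binPath)
		if verr != nil {
			return fmt.Errorf("update file missing and installed version unknown: %w", verr)
		}
		if got != target {
			return fmt.Errorf("update file missing and installed version is %q, want %q", got, target)
		}
		return nil // a previous pass already installed the target
	default:
		return fmt.Errorf("unable to stat update file: %w", err)
	}
}

// fileVersion is a stand-in for asking the binary for its version: in this
// sketch the file's contents are the version string.
func fileVersion(path string) (string, error) {
	b, err := os.ReadFile(path)
	return string(b), err
}

// demo stages an update file containing v1.35.4+k0s.0 and runs the install
// step twice, as a retry after a failed node update would.
func demo(target string) (first, second error) {
	dir, err := os.MkdirTemp("", "idempotent-sketch")
	if err != nil {
		return err, err
	}
	defer os.RemoveAll(dir)

	tmpPath := filepath.Join(dir, "k0s.tmp")
	binPath := filepath.Join(dir, "k0s")
	if err := os.WriteFile(tmpPath, []byte("v1.35.4+k0s.0"), 0o755); err != nil {
		return err, err
	}

	first = installUpdate(tmpPath, binPath, target, fileVersion)
	second = installUpdate(tmpPath, binPath, target, fileVersion)
	return first, second
}

func main() {
	first, second := demo("v1.35.4+k0s.0")
	fmt.Println("first pass: ", first)
	fmt.Println("second pass:", second)
}
```

With this structure the second pass succeeds when the installed binary matches the target, so the caller can go on to write the Restart status. If the update file is missing and the installed version does not match, the step still returns an error, which keeps a genuinely missing download visible.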

Actual behavior

The reconciler enters an infinite error loop. The Node annotation k0sproject.io/autopilot-signal-data is permanently stuck at status.status: ApplyingUpdate, even though the upgrade binary is correctly installed and k0s version reports the new version. The autopilot Plan never progresses to Restart on the affected worker.

Screenshots and logs

# journalctl -u k0sworker -f | grep autopilot
...
applying-update  Applying update
applying-update  Reconciler error  error="unable to find update file 'k0s.tmp': stat /usr/local/bin/k0s.tmp: no such file or directory"
applying-update  Applying update
applying-update  Reconciler error  error="unable to find update file 'k0s.tmp': stat /usr/local/bin/k0s.tmp: no such file or directory"
... (repeats every few seconds indefinitely)

When the trigger is the client.Update conflict, the preceding log line is:

failed to update signal node to status 'Restart':
Operation cannot be fulfilled on nodes "<node>": the object has been modified;
please apply your changes to the latest version and try again


Metadata


Labels

bug (Something isn't working)
