
fix(teams): prevent lock timeout crash from stale locks and thundering herd#43

Open
awkay wants to merge 1 commit into tmustier:main from awkay:main

Conversation


awkay commented Apr 30, 2026

Crash Reproduced and Fixed

While running a 3-worker team session (testing-expert, architect, rad-expert) using Pi with the GLM model, team_done crashed the entire leader process with:

Error: Timeout acquiring lock: …/config.json.lock
    at withLock (…/fs-lock.ts:59:15)
    at async setMemberStatus (…/team-config.ts:308:10)

I had Pi (running GLM) diagnose the crash from the stack trace and stale lock file on disk. It identified two bugs.

Root Cause

When team_done fires, stopAllTeammates() sends SIGTERM to workers. Their exit handlers call setMemberStatus, and the team_done loop also calls setMemberStatus for each worker — all concurrently. One worker died mid-lock, leaving a stale .lock file. The leader then hit a 10-second timeout.

Fixes (3)

1. PID-based staleness detection — the old code only checked file age (staleMs), so a lock left behind by a dead process could not be reclaimed until it aged out; even then the steal itself raced (see #2). Now we also check process.kill(pid, 0): if the PID recorded in the lock file is dead, the lock is treated as stale immediately, regardless of age.

2. Atomic lock stealing via renameSync — the old stale detection used unlinkSync + openSync("wx"), which has a TOCTOU race. When multiple processes all detect a stale lock simultaneously (thundering herd during team_done), they would all unlinkSync the same file and then race to recreate it, causing lock contention timeouts. Now uses renameSync to atomically steal the stale lock to a temp path — only one contender wins the rename; others see ENOENT and retry cleanly.

3. Better error messages — timeout errors now include the lock-holder PID, alive status, and manual cleanup instructions (rm -f <path>).
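The two lock mechanisms above can be sketched as follows. This is a minimal illustration, not the actual fs-lock.ts code — the function names (`isPidAlive`, `isLockStale`, `tryStealStaleLock`) and the temp-path scheme are assumptions for the example.

```typescript
// Hypothetical sketch of fixes #1 and #2; names and paths are
// assumptions, not the real fs-lock.ts implementation.
import { readFileSync, renameSync, unlinkSync, statSync } from "node:fs";

// Signal 0 delivers nothing; it only tests whether the PID exists.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err: any) {
    // EPERM: the process exists but belongs to another user -> alive.
    // ESRCH: no such process -> dead.
    return err.code === "EPERM";
  }
}

// Fix 1: a lock is stale the moment its holder dies, regardless of age.
function isLockStale(lockPath: string, staleMs: number): boolean {
  const pid = Number(readFileSync(lockPath, "utf8").trim());
  if (Number.isFinite(pid) && !isPidAlive(pid)) return true;
  // Fall back to the original age check for unreadable lock contents.
  return Date.now() - statSync(lockPath).mtimeMs > staleMs;
}

// Fix 2: steal atomically. renameSync succeeds for exactly one
// contender; everyone else gets ENOENT and returns to the acquire loop.
function tryStealStaleLock(lockPath: string): boolean {
  const graveyard = `${lockPath}.stale.${process.pid}.${Date.now()}`;
  try {
    renameSync(lockPath, graveyard); // atomic within one filesystem
    unlinkSync(graveyard);           // winner disposes of the stolen lock
    return true; // caller may now create a fresh lock (e.g. openSync "wx")
  } catch (err: any) {
    if (err.code === "ENOENT") return false; // lost the race; retry cleanly
    throw err;
  }
}
```

The key property is that rename(2) is atomic on a single filesystem, so there is no window between "detect stale" and "remove" for a second contender to slip into — which is exactly the TOCTOU window the old unlinkSync + openSync("wx") sequence had.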

Testing

Ran a happy-path test after the fix: spawned a 2-teammate team, delegated a task, confirmed completion, called team_done — all clean, no lock issues.

Team done. 1 task(s): 1 completed, 0 pending. Widget hidden.

fix(teams): prevent lock timeout crash from stale locks and thundering herd

Two bugs fixed in fs-lock.ts, plus improved error reporting:

1. PID-based staleness detection: When a process dies while holding a
   lock, the stale lock is now detected immediately (not just by age).
   Uses process.kill(pid, 0) to check if the lock owner is alive.

2. Atomic lock stealing via renameSync: The old code used unlinkSync +
   openSync('wx') which had a TOCTOU race. When multiple processes
   detected a stale lock simultaneously (thundering herd), they would
   all unlink it and race to recreate it, causing lock contention
   timeouts. Now uses renameSync to atomically steal the stale lock
   to a temp path -- only one contender wins the rename.

3. Better error messages: Timeout errors now include the PID of the
   lock holder and manual cleanup instructions.
