fix(teams): prevent lock timeout crash from stale locks and thundering herd by awkay · Pull Request #43 · tmustier/pi-agent-teams

awkay · 2026-04-30T14:25:23Z

Crash Reproduced and Fixed

While running a 3-worker team session (testing-expert, architect, rad-expert) using Pi with the GLM model, team_done crashed the entire leader process with:

Error: Timeout acquiring lock: …/config.json.lock
    at withLock (…/fs-lock.ts:59:15)
    at async setMemberStatus (…/team-config.ts:308:10)

I had Pi (running GLM) diagnose the crash from the stack trace and the stale lock file on disk. It identified two bugs; the fix also improves the timeout error message.

Root Cause

When team_done fires, stopAllTeammates() sends SIGTERM to workers. Their exit handlers call setMemberStatus, and the team_done loop also calls setMemberStatus for each worker — all concurrently. One worker died mid-lock, leaving a stale .lock file. The leader then hit a 10-second timeout.
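To make the failure mode concrete, here is a minimal sketch of a lock built on exclusive file creation with a timeout. It is not the code in fs-lock.ts: `withNaiveLock`, `sleepSync`, the synchronous style, and the plain-text PID in the lock file are all assumptions for illustration, and the sketch deliberately has no staleness handling at all. It shows that a lock file left behind by a dead holder is never released, so every later caller waits out the full timeout.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** Busy-waits for roughly `ms` milliseconds (hypothetical sketch-only helper). */
function sleepSync(ms: number): void {
  const end = Date.now() + ms;
  while (Date.now() < end) {
    // spin; acceptable for a short illustrative sketch
  }
}

/** Hypothetical naive lock: exclusive create, retry until the timeout elapses. */
function withNaiveLock<T>(lockPath: string, timeoutMs: number, fn: () => T): T {
  const deadline = Date.now() + timeoutMs;
  for (;;) {
    try {
      // "wx" fails with EEXIST if the lock file is already present.
      const fd = fs.openSync(lockPath, "wx");
      fs.writeSync(fd, String(process.pid));
      fs.closeSync(fd);
      break;
    } catch (err) {
      if ((err as NodeJS.ErrnoException).code !== "EEXIST") throw err;
      if (Date.now() >= deadline) {
        throw new Error(`Timeout acquiring lock: ${lockPath}`);
      }
      sleepSync(10);
    }
  }
  try {
    return fn();
  } finally {
    // Release the lock. A holder that is killed never reaches this line.
    fs.rmSync(lockPath, { force: true });
  }
}

// Simulate a holder that died mid-lock: the file exists, nobody will remove it.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "naive-lock-"));
const staleLock = path.join(dir, "config.json.lock");
fs.writeFileSync(staleLock, "999999999"); // assumed to be an unused PID

let timedOut = false;
try {
  withNaiveLock(staleLock, 100, () => "unreachable");
} catch (err) {
  timedOut = (err as Error).message.startsWith("Timeout acquiring lock");
}

// An uncontended lock path still works normally.
const freeResult = withNaiveLock(path.join(dir, "free.lock"), 100, () => 42);
```

In the crash above the same pattern plays out with a 10-second timeout and several processes calling setMemberStatus at once.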

Bugs Fixed (3)

1. PID-based staleness detection: the old code judged staleness only by file age (staleMs). That check alone should eventually have cleared a lock left by a dead process, but the stealing race described in item 2 got in the way. The code now also calls process.kill(pid, 0) on the PID recorded in the lock file; if that process is dead, the lock is treated as stale immediately, regardless of age.

2. Atomic lock stealing via renameSync: the old stale-lock handling used unlinkSync followed by openSync("wx"), which has a TOCTOU race. When multiple processes detect a stale lock at the same time (the thundering herd during team_done), they all unlinkSync the same file and then race to recreate it, causing lock contention timeouts. The code now uses renameSync to atomically move the stale lock to a temp path: only one contender wins the rename; the others see ENOENT and retry cleanly.

3. Better error messages: timeout errors now include the lock-holder PID, whether that process is alive, and manual cleanup instructions (rm -f <path>).
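The first two items can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the helper names (`isPidAlive`, `readLockPid`, `tryStealStaleLock`), the temp-path naming, and the assumption that the lock file holds the owner PID as plain decimal text are all hypothetical. The age-based fallback (staleMs) is omitted for brevity.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

/** True if a process with this PID exists. Signal 0 only probes; nothing is delivered. */
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM: the process exists but we may not signal it. ESRCH: no such process.
    return (err as NodeJS.ErrnoException).code === "EPERM";
  }
}

/** Reads the owner PID from a lock file; null if the file is missing or unparsable. */
function readLockPid(lockPath: string): number | null {
  try {
    const pid = Number.parseInt(fs.readFileSync(lockPath, "utf8").trim(), 10);
    return Number.isInteger(pid) && pid > 0 ? pid : null;
  } catch {
    return null;
  }
}

/**
 * Tries to steal a lock whose recorded owner is dead by renaming it aside.
 * A rename within one filesystem is atomic, so only one contender succeeds;
 * the others get ENOENT and should go back to their normal acquire loop.
 */
function tryStealStaleLock(lockPath: string): boolean {
  const pid = readLockPid(lockPath);
  if (pid === null || isPidAlive(pid)) return false;
  const tmpPath = `${lockPath}.stale.${process.pid}.${Date.now()}`;
  try {
    fs.renameSync(lockPath, tmpPath);
  } catch (err) {
    if ((err as NodeJS.ErrnoException).code === "ENOENT") return false; // lost the race
    throw err;
  }
  fs.rmSync(tmpPath, { force: true }); // discard the stolen stale lock
  return true;
}

// Usage: a lock whose recorded owner is dead can be stolen exactly once.
const dir = fs.mkdtempSync(path.join(os.tmpdir(), "lock-demo-"));
const demoLock = path.join(dir, "config.json.lock");
fs.writeFileSync(demoLock, "999999999"); // assumed to be an unused PID
const firstSteal = tryStealStaleLock(demoLock);
const secondSteal = tryStealStaleLock(demoLock);

// A lock owned by a live process (here, this one) is left alone.
fs.writeFileSync(demoLock, String(process.pid));
const stoleLiveLock = tryStealStaleLock(demoLock);
```

After a successful steal the winner still has to create the lock through the usual exclusive open; the rename only guarantees that a single process clears the stale file. The sketch does not re-check ownership between reading the PID and renaming, so a production version would need to consider that window.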

Testing

Ran a happy-path test after the fix: spawned a 2-teammate team, delegated a task, confirmed completion, called team_done — all clean, no lock issues.

Team done. 1 task(s): 1 completed, 0 pending. Widget hidden.

…g herd

Two bugs fixed in fs-lock.ts:

1. PID-based staleness detection: When a process dies while holding a lock, the stale lock is now detected immediately (not just by age). Uses process.kill(pid, 0) to check if the lock owner is alive.

2. Atomic lock stealing via renameSync: The old code used unlinkSync + openSync('wx') which had a TOCTOU race. When multiple processes detected a stale lock simultaneously (thundering herd), they would all unlink it and race to recreate it, causing lock contention timeouts. Now uses renameSync to atomically steal the stale lock to a temp path; only one contender wins the rename.

3. Better error messages: Timeout errors now include the PID of the lock holder and manual cleanup instructions.

awkay wants to merge 1 commit into tmustier:main from awkay:main
