fix(teams): prevent lock timeout crash from stale locks and thundering herd#43
Open
awkay wants to merge 1 commit intotmustier:mainfrom
Open
fix(teams): prevent lock timeout crash from stale locks and thundering herd#43awkay wants to merge 1 commit intotmustier:mainfrom
awkay wants to merge 1 commit intotmustier:mainfrom
Conversation
…g herd
Two bugs fixed in fs-lock.ts:
1. PID-based staleness detection: When a process dies while holding a
lock, the stale lock is now detected immediately (not just by age).
Uses process.kill(pid, 0) to check if the lock owner is alive.
2. Atomic lock stealing via renameSync: The old code used unlinkSync +
openSync('wx') which had a TOCTOU race. When multiple processes
detected a stale lock simultaneously (thundering herd), they would
all unlink it and race to recreate it, causing lock contention
timeouts. Now uses renameSync to atomically steal the stale lock
to a temp path -- only one contender wins the rename.
3. Better error messages: Timeout errors now include the PID of the
lock holder and manual cleanup instructions.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Crash Reproduced and Fixed
While running a 3-worker team session (testing-expert, architect, rad-expert) using Pi with the GLM model,
team_donecrashed the entire leader process with:I had Pi (running GLM) diagnose the crash from the stack trace and stale lock file on disk. It identified two bugs.
Root Cause
When
team_donefires,stopAllTeammates()sends SIGTERM to workers. Theirexithandlers callsetMemberStatus, and theteam_doneloop also callssetMemberStatusfor each worker — all concurrently. One worker died mid-lock, leaving a stale.lockfile. The leader then hit a 10-second timeout.Bugs Fixed (3)
1. PID-based staleness detection — the old code only checked file age (
staleMs). If a process died holding the lock and you came back later, the age check should have worked, but there was a second problem (see #2). Now we also checkprocess.kill(pid, 0)— if the PID in the lock file is dead, the lock is immediately stale regardless of age.2. Atomic lock stealing via
renameSync— the old stale detection usedunlinkSync+openSync("wx"), which has a TOCTOU race. When multiple processes all detect a stale lock simultaneously (thundering herd duringteam_done), they would allunlinkSyncthe same file and then race to recreate it, causing lock contention timeouts. Now usesrenameSyncto atomically steal the stale lock to a temp path — only one contender wins the rename; others seeENOENTand retry cleanly.3. Better error messages — timeout errors now include the lock-holder PID, alive status, and manual cleanup instructions (
rm -f <path>).Testing
Ran a happy-path test after the fix: spawned a 2-teammate team, delegated a task, confirmed completion, called
team_done— all clean, no lock issues.