Three concurrency races in the companion runtime cause data loss and orphan resources when two `/codex:*` invocations are launched in parallel from the same cwd. Adversarial review (`/codex:adversarial-review`) is the canonical reproduction because it spawns multiple parallel review streams from one workspace.

Plugin version: `openai-codex/codex/1.0.4` (cache install via Claude Code marketplace).
## Race 1 — `state.mjs` `updateState`/`upsertJob` is non-atomic

`scripts/lib/state.mjs:118-146`:

```js
export function updateState(cwd, mutate) {
  const state = loadState(cwd);
  mutate(state);
  return saveState(cwd, state); // writeFileSync, no lock, no temp+rename
}

export function upsertJob(cwd, jobPatch) {
  return updateState(cwd, (state) => {
    // append or update by id
  });
}
```
Two parallel `upsertJob` calls from the same cwd both load the same baseline, both mutate, both `writeFileSync`. Last writer wins; the other job is silently dropped from the aggregate `jobs.json` index.
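To see the lost update in isolation, here is a hypothetical standalone repro (not plugin code) that imitates the load → mutate → `writeFileSync` cycle across two processes:

```js
// repro.mjs — hypothetical standalone repro; imitates updateState's
// load → mutate → writeFileSync cycle in two concurrent processes.
import { execFile } from 'node:child_process';
import { readFileSync, writeFileSync } from 'node:fs';
import { promisify } from 'node:util';

const run = promisify(execFile);
const FILE = 'jobs.json';
writeFileSync(FILE, JSON.stringify({ jobs: [] }));

// Each child loads the shared baseline immediately, then writes 200 ms
// later — after the other child has also loaded the same stale baseline.
const child = (id) => `
  const fs = require('node:fs');
  const state = JSON.parse(fs.readFileSync('${FILE}', 'utf8'));
  state.jobs.push('${id}');
  setTimeout(() => fs.writeFileSync('${FILE}', JSON.stringify(state)), 200);
`;

await Promise.all([
  run('node', ['-e', child('job-a')]),
  run('node', ['-e', child('job-b')]),
]);
console.log(readFileSync(FILE, 'utf8')); // only one of the two jobs survives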
## Race 2 — `pruneJobs` deletes per-job files for race-1 losers
`state.mjs:92-116` — `saveState` re-loads the "previous" state, computes a retain set from `state.jobs` (the next state), and deletes per-job `<jobId>.json` and `<jobId>.log` for any id present in old but not in next.

Combined with race 1: process B's `loadState` runs before A persists A's job, so B's retain set lacks A's id. B's `saveState` then `unlinkSync`s A's per-job `.json` and `.log` while A's task-worker is still running.

Result: a job that the in-memory queue still thinks is `running`, with no on-disk record. Reproduced today: `task-momcu911-gkhshj` had `status="running"` in `state.json` but no `.json` and no `.log` on disk.
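For clarity, the deletion pattern described above, paraphrased (not the actual plugin source; `jobsDir` and the state shapes are stand-ins):

```js
import { unlinkSync } from 'node:fs';
import { join } from 'node:path';

// Paraphrase of the prune-on-save pattern: the retain set is built from the
// *incoming* state, so a job persisted by another process after this one's
// loadState looks removable — and its files are deleted mid-run.
function pruneJobs(previousState, nextState, jobsDir) {
  const retain = new Set(nextState.jobs.map((job) => job.id));
  for (const job of previousState.jobs) {
    if (!retain.has(job.id)) {
      unlinkSync(join(jobsDir, `${job.id}.json`));
      unlinkSync(join(jobsDir, `${job.id}.log`));
    }
  }
}
```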
## Race 3 — `broker-lifecycle.mjs` `ensureBrokerSession` races on `broker.json`

`broker-lifecycle.mjs:113-171` does check-load-spawn-save with no lock:

```js
export async function ensureBrokerSession(cwd, options = {}) {
  const existing = loadBrokerSession(cwd);
  if (existing && (await isBrokerEndpointReady(existing.endpoint))) {
    return existing;
  }
  // ...
  const sessionDir = createBrokerSessionDir(); // mkdtempSync — unique
  // spawn broker child, wait for endpoint
  saveBrokerSession(cwd, session); // writeFileSync, no lock
  return session;
}
```
Two concurrent callers both probe `isBrokerEndpointReady(existing.endpoint)` (false, or no `existing`), both `mkdtempSync` (unique dirs), both spawn brokers, both `saveBrokerSession`. The loser's broker is orphaned in `/tmp/cxc-*/`.

On-disk evidence yesterday: 4 dead orphan brokers (`cxc-SCOvLB`, `cxc-WQhSPN`, `cxc-eW3wP7`, `cxc-vSD3zY`) + 1 alive (`cxc-UouOR2`). Textbook last-writer-wins.
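One way to close this race is double-checked creation under a cooperative lock (see fix 2 below). A sketch, using the `proper-lockfile` package; `brokerJsonPath` and `spawnBroker` are hypothetical stand-ins, while `loadBrokerSession`, `isBrokerEndpointReady`, and `saveBrokerSession` are the plugin's own helpers from the snippet above:

```js
import lockfile from 'proper-lockfile';

// Double-checked creation: re-probe *after* acquiring the lock so the loser
// of the race adopts the winner's broker instead of spawning its own.
async function ensureBrokerSessionLocked(cwd) {
  const release = await lockfile.lock(brokerJsonPath(cwd), {
    realpath: false, // allow locking before broker.json exists
    retries: { retries: 10, minTimeout: 100 },
  });
  try {
    const existing = loadBrokerSession(cwd);
    if (existing && (await isBrokerEndpointReady(existing.endpoint))) {
      return existing; // another caller won the race; reuse its broker
    }
    const session = await spawnBroker(cwd); // hypothetical: spawn + wait for endpoint
    saveBrokerSession(cwd, session);
    return session;
  } finally {
    await release();
  }
}
```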
## Suggested fixes

1. Atomic write everywhere — `writeFileSync(tmp)` + `renameSync(tmp, dest)` for `jobs.json`, `broker.json`, and per-job files.
2. Cooperative file lock around `loadState → saveState` and `loadBrokerSession → saveBrokerSession`. `proper-lockfile` is a pure-Node option with retries.
3. Decouple `pruneJobs` from on-disk file deletion — a separate sweeper that only removes `<jobId>.json`/`.log` when the on-disk record is `cancelled`/`completed`/`failed` AND older than N minutes. Never delete based on absence-from-aggregate alone.
4. `/codex:status` PID liveness check — verify the recorded PID is alive (or the broker socket reachable for that session) before showing `running`. Today the status reads `running` for jobs whose codex process exited hours ago.

Minimal sketches of each fix follow.
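Fix 1, sketched — `writeFileAtomic` is a hypothetical helper name; the temp file must live beside the destination so the rename stays on one filesystem:

```js
import { writeFileSync, renameSync } from 'node:fs';
import { randomBytes } from 'node:crypto';

// Write to a unique temp file beside the destination, then rename over it.
// rename(2) is atomic on POSIX within one filesystem, so concurrent readers
// see either the old contents or the new — never a torn write.
function writeFileAtomic(dest, data) {
  const tmp = `${dest}.${process.pid}.${randomBytes(4).toString('hex')}.tmp`;
  writeFileSync(tmp, data);
  renameSync(tmp, dest);
}
```

Note this only eliminates torn/partial writes; the lost update in race 1 still needs the lock below.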
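Fix 2, sketched with `proper-lockfile` — `statePath` and `saveState` stand in for the plugin's own helpers:

```js
import lockfile from 'proper-lockfile';
import { readFileSync } from 'node:fs';

// Hold the lock across the whole load → mutate → save cycle so concurrent
// upsertJob calls serialize instead of clobbering each other.
async function updateStateLocked(cwd, mutate) {
  const file = statePath(cwd); // hypothetical: resolves cwd → aggregate jobs.json
  const release = await lockfile.lock(file, {
    retries: { retries: 10, minTimeout: 50 }, // back off rather than fail fast
  });
  try {
    const state = JSON.parse(readFileSync(file, 'utf8'));
    mutate(state);
    saveState(cwd, state); // saveState should itself use the atomic write above
  } finally {
    await release(); // release even if mutate or the save throws
  }
}
```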
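Fix 3, sketched — a standalone sweeper that trusts only each job's own on-disk record plus an age threshold (`jobsDir` and the per-job layout are assumptions):

```js
import { readdirSync, readFileSync, rmSync, statSync } from 'node:fs';
import { join } from 'node:path';

const TERMINAL = new Set(['cancelled', 'completed', 'failed']);

// Remove <jobId>.json/.log only when the job's own record says it is done
// AND the record has been untouched for maxAgeMs. Absence from the aggregate
// index is never a reason to delete.
function sweepJobs(jobsDir, maxAgeMs = 15 * 60 * 1000) {
  for (const name of readdirSync(jobsDir)) {
    if (!name.endsWith('.json') || name === 'jobs.json') continue; // skip the index
    const jobFile = join(jobsDir, name);
    const job = JSON.parse(readFileSync(jobFile, 'utf8'));
    const idleMs = Date.now() - statSync(jobFile).mtimeMs;
    if (TERMINAL.has(job.status) && idleMs > maxAgeMs) {
      rmSync(jobFile, { force: true });
      rmSync(jobFile.replace(/\.json$/, '.log'), { force: true });
    }
  }
}
```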
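Fix 4's liveness probe is a few lines of standard Node — signal 0 performs the existence/permission check without delivering anything:

```js
// Returns true if the recorded PID refers to a live process.
function isPidAlive(pid) {
  try {
    process.kill(pid, 0); // signal 0: check only, no signal delivered
    return true;
  } catch (err) {
    // ESRCH → no such process; EPERM → it exists but belongs to another user
    return err.code === 'EPERM';
  }
}
```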
## Related open PRs
A few open PRs already cover fixes (3) and (4):

- fix(broker-lifecycle): reject crashed brokers via PID-alive probe
- Reconcile stale running jobs when pid exits
- fix: handle zombie jobs when process crashes but status remains "running"
- Fix #202: zombie job blocks subsequent task calls when process crashes

These don't address the underlying non-atomic write + race-2 file deletion, so even with PID liveness checks the on-disk state will still get corrupted under same-cwd parallelism.
## Workaround
Avoid same-cwd concurrent `/codex:*` invocations. Use raw `codex exec` directly for any review/diagnosis work that wants parallelism, since the companion runtime is the part that races.