
Same-cwd parallel /codex:* races on jobs.json + broker.json: data loss + orphan brokers #286

@yaronhuang

Description


Three concurrency races in the companion runtime cause data loss and orphan resources when two /codex:* invocations are launched in parallel from the same cwd. Adversarial review (/codex:adversarial-review) is the canonical reproduction because it spawns multiple parallel review streams from one workspace.

Plugin version: openai-codex/codex/1.0.4 (cache install via Claude Code marketplace).

Race 1 — state.mjs updateState / upsertJob is non-atomic

scripts/lib/state.mjs:118-146:

```js
export function updateState(cwd, mutate) {
  const state = loadState(cwd);
  mutate(state);
  return saveState(cwd, state);   // writeFileSync, no lock, no temp+rename
}

export function upsertJob(cwd, jobPatch) {
  return updateState(cwd, (state) => {
    // append or update by id
  });
}
```

Two parallel upsertJob calls from the same cwd both load the same baseline, both mutate, both writeFileSync. Last writer wins; the other job is silently dropped from the aggregate jobs.json index.
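The lost update can be reproduced with a minimal in-process reduction of the load-mutate-save pattern (hypothetical helper names; the real code operates on jobs.json via `loadState`/`saveState`):

```javascript
// Both "processes" load the same baseline before either saves,
// so the last writeFileSync silently discards the other's job.
import { writeFileSync, readFileSync, mkdtempSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

const dir = mkdtempSync(join(tmpdir(), "race-demo-"));
const file = join(dir, "jobs.json");
writeFileSync(file, JSON.stringify({ jobs: [] }));

const loadState = () => JSON.parse(readFileSync(file, "utf8"));
const saveState = (state) => writeFileSync(file, JSON.stringify(state));

// Interleaving seen under parallel /codex:* invocations:
const a = loadState();            // A loads baseline { jobs: [] }
const b = loadState();            // B loads the SAME baseline
a.jobs.push({ id: "job-A" });
saveState(a);                     // A persists [job-A]
b.jobs.push({ id: "job-B" });
saveState(b);                     // B overwrites; job-A is gone

console.log(loadState().jobs.map((j) => j.id));  // → [ 'job-B' ]
```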

Race 2 — pruneJobs deletes per-job files for race-1 losers

state.mjs:92-116: saveState re-loads the "previous" state, computes a retain set from state.jobs (the next state), and deletes the per-job <jobId>.json and <jobId>.log files for any id present in the old state but not the next.

Combined with race 1: process B's loadState runs before A persists A's job, so B's retain set lacks A's id. B's saveState then unlinkSyncs A's per-job .json and .log while A's task-worker is still running.

Result: a job that the in-memory queue still thinks is running, with no on-disk record. Reproduced today: task-momcu911-gkhshj had status="running" in state.json but no .json and no .log on disk.
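The deletion step reduces to a retain-set diff, sketched below with assumed names (the real prune runs inside saveState against files in the state directory):

```javascript
// B's retain set is computed from a state loaded before A registered,
// so A's on-disk record looks stale and gets swept while A still runs.
import { writeFileSync, unlinkSync, existsSync, mkdtempSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

const dir = mkdtempSync(join(tmpdir(), "prune-demo-"));
const jobFile = (id) => join(dir, `${id}.json`);

// A's job has been persisted: on disk and in the previous aggregate.
writeFileSync(jobFile("job-A"), "{}");
const previous = { jobs: [{ id: "job-A" }] };

// B loaded state *before* A persisted, so B's next state lacks job-A.
const next = { jobs: [{ id: "job-B" }] };

// The prune pass as described: delete ids in old-but-not-next.
const retain = new Set(next.jobs.map((j) => j.id));
for (const job of previous.jobs) {
  if (!retain.has(job.id)) unlinkSync(jobFile(job.id));  // A's record deleted
}

console.log(existsSync(jobFile("job-A")));  // → false: running job orphaned
```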

Race 3 — broker-lifecycle.mjs ensureBrokerSession races on broker.json

broker-lifecycle.mjs:113-171 does check-load-spawn-save with no lock:

```js
export async function ensureBrokerSession(cwd, options = {}) {
  const existing = loadBrokerSession(cwd);
  if (existing && (await isBrokerEndpointReady(existing.endpoint))) {
    return existing;
  }
  // ...
  const sessionDir = createBrokerSessionDir();   // mkdtempSync — unique
  // spawn broker child, wait for endpoint
  saveBrokerSession(cwd, session);   // writeFileSync, no lock
  return session;
}
```

Two concurrent callers both find no ready session (isBrokerEndpointReady(existing.endpoint) returns false, or there is no existing session), both mkdtempSync unique dirs, both spawn brokers, and both saveBrokerSession. The loser's broker is orphaned in /tmp/cxc-*/.

On-disk evidence yesterday: 4 dead orphan brokers (cxc-SCOvLB, cxc-WQhSPN, cxc-eW3wP7, cxc-vSD3zY) + 1 alive (cxc-UouOR2). Textbook last-writer-wins.
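The check-then-act window reduces to the sketch below (names taken from the report; the spawn is replaced by a counter, and broker.json by a variable):

```javascript
// Both callers check broker.json before either saves, so both spawn
// and the second save overwrites the first: one broker is orphaned.
let brokerJson = null;   // stands in for broker.json on disk
let spawned = 0;

const loadBrokerSession = () => brokerJson;
const saveBrokerSession = (session) => { brokerJson = session; };
const spawnBroker = (dir) => { spawned += 1; return { dir }; };

// Interleaving under two parallel /codex:* invocations:
const seenByA = loadBrokerSession();   // null — no session yet
const seenByB = loadBrokerSession();   // null — B checks before A saves
if (!seenByA) saveBrokerSession(spawnBroker("cxc-A"));
if (!seenByB) saveBrokerSession(spawnBroker("cxc-B"));  // overwrites A

console.log(spawned, brokerJson.dir);  // → 2 "cxc-B"; cxc-A is orphaned
```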

Suggested fixes

  1. Atomic writes everywhere: writeFileSync(tmp) + renameSync(tmp, dest) for jobs.json, broker.json, and per-job files.
  2. Cooperative file lock around loadState → saveState and loadBrokerSession → saveBrokerSession. proper-lockfile is a pure-Node option with retries.
  3. Decouple pruneJobs from on-disk file deletion — separate sweeper that only removes <jobId>.json / .log when the on-disk record is cancelled / completed / failed AND older than N minutes. Never delete based on absence-from-aggregate alone.
  4. /codex:status PID liveness check — verify the recorded PID is alive (or the broker socket reachable for that session) before showing running. Today the status reads running for jobs whose codex process exited hours ago.
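Fixes (1) and (2) together might look like the stdlib-only sketch below (function names are assumptions; proper-lockfile provides the same locking with retry and stale-lock handling out of the box):

```javascript
// Atomic write (temp + rename) plus a cooperative lock around
// load → mutate → save, so concurrent updates serialize instead of racing.
import {
  writeFileSync, renameSync, openSync, closeSync,
  unlinkSync, mkdtempSync, readFileSync,
} from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// (1) Readers see either the old file or the new one, never a torn
// half-write, because rename within a directory is atomic on POSIX.
function atomicWriteJson(dest, value) {
  const tmp = `${dest}.tmp-${process.pid}-${Date.now()}`;
  writeFileSync(tmp, JSON.stringify(value, null, 2));
  renameSync(tmp, dest);
}

// (2) O_EXCL creation ("wx") is atomic, so only one process at a time
// can hold <file>.lock.
function withLock(file, fn, retries = 50) {
  const lock = `${file}.lock`;
  for (let i = 0; i < retries; i += 1) {
    try {
      const fd = openSync(lock, "wx");   // fails with EEXIST if held
      try { return fn(); }
      finally { closeSync(fd); unlinkSync(lock); }
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      // busy retry for the sketch; real code should back off and
      // break stale locks (crashed holders)
    }
  }
  throw new Error(`could not acquire ${lock}`);
}

// Usage: a lost-update-free updateState.
const dir = mkdtempSync(join(tmpdir(), "fix-demo-"));
const file = join(dir, "jobs.json");
atomicWriteJson(file, { jobs: [] });

const updateState = (mutate) => withLock(file, () => {
  const state = JSON.parse(readFileSync(file, "utf8"));
  mutate(state);
  atomicWriteJson(file, state);
});

updateState((s) => s.jobs.push({ id: "job-A" }));
updateState((s) => s.jobs.push({ id: "job-B" }));
console.log(JSON.parse(readFileSync(file, "utf8")).jobs.length);  // → 2
```

Within one process these calls trivially serialize; the lock earns its keep across processes, which is exactly the same-cwd parallel case.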

Related open PRs

A few open PRs already cover fixes (3) and (4).

These don't address the underlying non-atomic write + race-2 file deletion, so even with PID liveness checks the on-disk state will still get corrupted under same-cwd parallelism.

Workaround

Avoid same-cwd concurrent /codex:* invocations. Use raw codex exec directly for any review/diagnosis work that wants parallelism, since the companion runtime is the part that races.
