
Same-cwd parallel /codex:* races on jobs.json + broker.json: data loss + orphan brokers #286

@yaronhuang

Description


Three concurrency races in the companion runtime cause data loss and orphan resources when two /codex:* invocations are launched in parallel from the same cwd. Adversarial review (/codex:adversarial-review) is the canonical reproduction because it spawns multiple parallel review streams from one workspace.

Plugin version: openai-codex/codex/1.0.4 (cache install via Claude Code marketplace).

Race 1 — state.mjs updateState / upsertJob is non-atomic

scripts/lib/state.mjs:118-146:

```js
export function updateState(cwd, mutate) {
  const state = loadState(cwd);
  mutate(state);
  return saveState(cwd, state);   // writeFileSync, no lock, no temp+rename
}

export function upsertJob(cwd, jobPatch) {
  return updateState(cwd, (state) => {
    // append or update by id
  });
}
```

Two parallel upsertJob calls from the same cwd both load the same baseline, both mutate, both writeFileSync. Last writer wins; the other job is silently dropped from the aggregate jobs.json index.
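The lost update can be reproduced with a minimal in-process reduction of the load-mutate-save pattern (hypothetical helper names; the real code operates on jobs.json via `loadState`/`saveState`):

```javascript
// Both "processes" load the same baseline before either saves,
// so the last writeFileSync silently discards the other's job.
import { writeFileSync, readFileSync, mkdtempSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

const dir = mkdtempSync(join(tmpdir(), "race-demo-"));
const file = join(dir, "jobs.json");
writeFileSync(file, JSON.stringify({ jobs: [] }));

const loadState = () => JSON.parse(readFileSync(file, "utf8"));
const saveState = (state) => writeFileSync(file, JSON.stringify(state));

// Interleaving seen under parallel /codex:* invocations:
const a = loadState();            // A loads baseline { jobs: [] }
const b = loadState();            // B loads the SAME baseline
a.jobs.push({ id: "job-A" });
saveState(a);                     // A persists [job-A]
b.jobs.push({ id: "job-B" });
saveState(b);                     // B overwrites; job-A is gone

console.log(loadState().jobs.map((j) => j.id));  // → [ 'job-B' ]
```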

Race 2 — pruneJobs deletes per-job files for race-1 losers

state.mjs:92-116: saveState re-loads the "previous" state, computes a retain set from state.jobs (the next state), and deletes the per-job <jobId>.json and <jobId>.log files for any id present in the old state but not the next.

Combined with race 1: process B's loadState runs before A persists A's job, so B's retain set lacks A's id. B's saveState then unlinkSyncs A's per-job .json and .log while A's task-worker is still running.

Result: a job that the in-memory queue still thinks is running, with no on-disk record. Reproduced today: task-momcu911-gkhshj had status="running" in state.json but no .json and no .log on disk.
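The deletion step reduces to a retain-set diff, sketched below with assumed names (the real prune runs inside saveState against files in the state directory):

```javascript
// B's retain set is computed from a state loaded before A registered,
// so A's on-disk record looks stale and gets swept while A still runs.
import { writeFileSync, unlinkSync, existsSync, mkdtempSync } from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

const dir = mkdtempSync(join(tmpdir(), "prune-demo-"));
const jobFile = (id) => join(dir, `${id}.json`);

// A's job has been persisted: on disk and in the previous aggregate.
writeFileSync(jobFile("job-A"), "{}");
const previous = { jobs: [{ id: "job-A" }] };

// B loaded state *before* A persisted, so B's next state lacks job-A.
const next = { jobs: [{ id: "job-B" }] };

// The prune pass as described: delete ids in old-but-not-next.
const retain = new Set(next.jobs.map((j) => j.id));
for (const job of previous.jobs) {
  if (!retain.has(job.id)) unlinkSync(jobFile(job.id));  // A's record deleted
}

console.log(existsSync(jobFile("job-A")));  // → false: running job orphaned
```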

Race 3 — broker-lifecycle.mjs ensureBrokerSession races on broker.json

broker-lifecycle.mjs:113-171 does check-load-spawn-save with no lock:

```js
export async function ensureBrokerSession(cwd, options = {}) {
  const existing = loadBrokerSession(cwd);
  if (existing && (await isBrokerEndpointReady(existing.endpoint))) {
    return existing;
  }
  // ...
  const sessionDir = createBrokerSessionDir();   // mkdtempSync — unique
  // spawn broker child, wait for endpoint
  saveBrokerSession(cwd, session);   // writeFileSync, no lock
  return session;
}
```

Two concurrent callers both find no ready session (isBrokerEndpointReady(existing.endpoint) returns false, or there is no existing session), both mkdtempSync unique dirs, both spawn brokers, and both saveBrokerSession. The loser's broker is orphaned in /tmp/cxc-*/.

On-disk evidence yesterday: 4 dead orphan brokers (cxc-SCOvLB, cxc-WQhSPN, cxc-eW3wP7, cxc-vSD3zY) + 1 alive (cxc-UouOR2). Textbook last-writer-wins.
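The check-then-act window reduces to the sketch below (names taken from the report; the spawn is replaced by a counter, and broker.json by a variable):

```javascript
// Both callers check broker.json before either saves, so both spawn
// and the second save overwrites the first: one broker is orphaned.
let brokerJson = null;   // stands in for broker.json on disk
let spawned = 0;

const loadBrokerSession = () => brokerJson;
const saveBrokerSession = (session) => { brokerJson = session; };
const spawnBroker = (dir) => { spawned += 1; return { dir }; };

// Interleaving under two parallel /codex:* invocations:
const seenByA = loadBrokerSession();   // null — no session yet
const seenByB = loadBrokerSession();   // null — B checks before A saves
if (!seenByA) saveBrokerSession(spawnBroker("cxc-A"));
if (!seenByB) saveBrokerSession(spawnBroker("cxc-B"));  // overwrites A

console.log(spawned, brokerJson.dir);  // → 2 "cxc-B"; cxc-A is orphaned
```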

Suggested fixes

  1. Atomic writes everywhere: writeFileSync(tmp) + renameSync(tmp, dest) for jobs.json, broker.json, and per-job files.
  2. Cooperative file lock around loadState → saveState and loadBrokerSession → saveBrokerSession. proper-lockfile is a pure-Node option with retries.
  3. Decouple pruneJobs from on-disk file deletion — separate sweeper that only removes <jobId>.json / .log when the on-disk record is cancelled / completed / failed AND older than N minutes. Never delete based on absence-from-aggregate alone.
  4. /codex:status PID liveness check — verify the recorded PID is alive (or the broker socket reachable for that session) before showing running. Today the status reads running for jobs whose codex process exited hours ago.
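Fixes (1) and (2) together might look like the stdlib-only sketch below (function names are assumptions; proper-lockfile provides the same locking with retry and stale-lock handling out of the box):

```javascript
// Atomic write (temp + rename) plus a cooperative lock around
// load → mutate → save, so concurrent updates serialize instead of racing.
import {
  writeFileSync, renameSync, openSync, closeSync,
  unlinkSync, mkdtempSync, readFileSync,
} from "node:fs";
import { join } from "node:path";
import { tmpdir } from "node:os";

// (1) Readers see either the old file or the new one, never a torn
// half-write, because rename within a directory is atomic on POSIX.
function atomicWriteJson(dest, value) {
  const tmp = `${dest}.tmp-${process.pid}-${Date.now()}`;
  writeFileSync(tmp, JSON.stringify(value, null, 2));
  renameSync(tmp, dest);
}

// (2) O_EXCL creation ("wx") is atomic, so only one process at a time
// can hold <file>.lock.
function withLock(file, fn, retries = 50) {
  const lock = `${file}.lock`;
  for (let i = 0; i < retries; i += 1) {
    try {
      const fd = openSync(lock, "wx");   // fails with EEXIST if held
      try { return fn(); }
      finally { closeSync(fd); unlinkSync(lock); }
    } catch (err) {
      if (err.code !== "EEXIST") throw err;
      // busy retry for the sketch; real code should back off and
      // break stale locks (crashed holders)
    }
  }
  throw new Error(`could not acquire ${lock}`);
}

// Usage: a lost-update-free updateState.
const dir = mkdtempSync(join(tmpdir(), "fix-demo-"));
const file = join(dir, "jobs.json");
atomicWriteJson(file, { jobs: [] });

const updateState = (mutate) => withLock(file, () => {
  const state = JSON.parse(readFileSync(file, "utf8"));
  mutate(state);
  atomicWriteJson(file, state);
});

updateState((s) => s.jobs.push({ id: "job-A" }));
updateState((s) => s.jobs.push({ id: "job-B" }));
console.log(JSON.parse(readFileSync(file, "utf8")).jobs.length);  // → 2
```

Within one process these calls trivially serialize; the lock earns its keep across processes, which is exactly the same-cwd parallel case.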

Related open PRs

A few open PRs already cover fixes (3) and (4).

These don't address the underlying non-atomic write + race-2 file deletion, so even with PID liveness checks the on-disk state will still get corrupted under same-cwd parallelism.

Workaround

Avoid same-cwd concurrent /codex:* invocations. Use raw codex exec directly for any review/diagnosis work that wants parallelism, since the companion runtime is the part that races.
