Parallel multi-session is blocked by design — land the #1474 self-healing stack + make the opt-in broker path reachable (SSOT #1359 / audit #1457 alignment) #1480

@shaun0927

Investigation context: openchrome-mcp 1.12.7 (npm latest), Node v20.19.6, macOS (Darwin 25.3.0, arm64), MCP host = Claude Code with the global registration openchrome serve --auto-launch (stdio, default port 9222, default profile ~/.openchrome/profile). This issue consolidates a root-cause analysis with a direction/SSOT alignment review of the fixes already in flight, and catalogs additional wiring gaps found during the investigation. It is meant to sit under the SSOT (#1359) / audit (#1457) umbrella and to act as the tracking issue for the #1474 stack + follow-ups.

TL;DR


1. Symptom & deterministic repro

With ≥2 concurrent host sessions sharing one global registration (openchrome serve --auto-launch, same port+profile):

  • Every session except the first reports, repeatedly and deterministically: Failed to reconnect to openchrome: -32000.
  • It does not self-recover. If the winning owner becomes a half-zombie (MCP alive, Chrome/CDP dead), all sessions are starved indefinitely.

Captured on the investigation machine (one healthy owner holding the lock):

~/.openchrome/locks/port-9222-Users_<user>_.openchrome_profile.json
{ "pid": 24035, "version": "1.12.7", "port": 9222,
  "userDataDir": "/Users/<user>/.openchrome/profile",
  "lifecycleMode": "auto", "transportMode": "stdio", ... }

A second serve --auto-launch against the same key prints a rich remediation to stderr and exit(2)s before the MCP handshake — so the host discards it and the user only ever sees -32000.


2. Root cause (two-part)

  1. Default serve --auto-launch is single-owner-per-(port, userDataDir) with no broker auto-attach fallback. acquireControllerLock() (src/utils/controller-lock.ts) creates ~/.openchrome/locks/<key>.json with openSync(..., 'wx'); the second process gets EEXIST → DuplicateControllerError → refuse to start. With N identical registrations, N−1 are structurally guaranteed to fail.
  2. The lock treats "owner PID alive" as "owner healthy." No CDP-reachability probe, no heartbeat/lease/TTL on the default owner. A half-zombie owner holds the lock forever; the orphaned managed Chrome (and Tier-3 headed-fallback children on basePort+100, e.g. 9322, plus headless 9666) can linger un-reaped.
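The two-part mechanism above can be illustrated with a minimal sketch. It is not the project's acquireControllerLock(); the names acquireLock, DuplicateLockError, and isPidAlive are hypothetical stand-ins, and only the openSync(..., 'wx') / EEXIST behavior and the "PID alive" check come from the description above.

```typescript
import * as fs from "node:fs";
import * as os from "node:os";
import * as path from "node:path";

// Hypothetical stand-in for the project's DuplicateControllerError.
class DuplicateLockError extends Error {
  lockPath: string;
  ownerPid: number | null;
  constructor(lockPath: string, ownerPid: number | null) {
    super(`lock already held: ${lockPath}`);
    this.lockPath = lockPath;
    this.ownerPid = ownerPid;
  }
}

// Read the owner PID recorded in an existing lock file, if it parses.
function readOwnerPid(lockPath: string): number | null {
  try {
    const parsed = JSON.parse(fs.readFileSync(lockPath, "utf8"));
    return typeof parsed.pid === "number" ? parsed.pid : null;
  } catch {
    return null;
  }
}

// Exclusive create: the 'wx' flag makes openSync fail with EEXIST when the
// file already exists, so exactly one process can win for a given key.
function acquireLock(lockPath: string, pid: number): void {
  try {
    const fd = fs.openSync(lockPath, "wx");
    try {
      fs.writeSync(fd, JSON.stringify({ pid }));
    } finally {
      fs.closeSync(fd);
    }
  } catch (err) {
    if ((err as { code?: string }).code === "EEXIST") {
      throw new DuplicateLockError(lockPath, readOwnerPid(lockPath));
    }
    throw err;
  }
}

// Liveness-only check: signal 0 tests whether the PID exists without
// delivering a signal. It says nothing about whether Chrome/CDP is reachable.
function isPidAlive(pid: number): boolean {
  try {
    process.kill(pid, 0);
    return true;
  } catch (err) {
    // EPERM means the process exists but belongs to another user.
    return (err as { code?: string }).code === "EPERM";
  }
}
```

In this sketch a second acquireLock() call on the same path always throws, regardless of whether the first holder can still reach its browser, which is the half-zombie gap described in item 2.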

Code anchors: src/utils/controller-lock.ts, src/utils/duplicate-controller-diagnostics.ts, src/chrome/launcher.ts (SingletonLock), src/chrome/process-watchdog.ts, src/chrome/headed-fallback.ts, src/chrome/auto-connect.ts (note: explicitly refuses to attach to the managed ~/.openchrome/profile, so the managed profile always takes launch mode).


3. Why this is by-design (not a regression)

Implication for the fix direction: the SSOT-aligned move is not "make broker the default." It is "make the default single-owner path self-healing + observable, and make the opt-in broker path reachable, portable, and documented."


4. The fix already in flight — #1474 stack

Issue #1474 ("Parallel sessions deadlock on controller lock; host sees only -32000", OPEN) is the canonical bug. Three stacked PRs (all OPEN, all MERGEABLE, base develop):

| PR | Role | Base | What it does | Self-sufficiency |
| --- | --- | --- | --- | --- |
| #1477 | 1/3 (reactive) | develop | acquireControllerLockWithHealthCheck(): on live-owner collision, probe owner CDP /json/version; if unreachable past a boot-grace window, atomically take over the stale lock. Healthy owner never evicted. src/index.ts awaits it on --auto-launch. | "This PR alone closes the reported deadlock." |
| #1478 | 2/3 (proactive) | fix/1474-controller-lock-health-aware (#1477) | owner-self-release.ts: on terminal watchdog-exhausted, release the lock and exit 70 so the host respawns a fresh owner. Anti-flap: only the terminal event surrenders ownership; chrome-died/single relaunch-failed do not. | Complements #1477 |
| #1479 | 3/3 (observability) | fix/1474-owner-self-release (#1478) | duplicate-controller-error-server.ts: a degraded stdio responder that completes initialize then surfaces remediation via portable MCP surfaces (notifications/message, a diagnostic tool, and a structured JSON-RPC error with data: port, profile, owner pid, lock path, ordered remediations) instead of a bare -32000. | Independent value |

Stacking/merge order: #1477 → #1478 → #1479.
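As a rough illustration of the reactive rule described for #1477, the takeover decision on a live-owner collision can be written as a pure function. This is a sketch under stated assumptions, not the PR's code: decideTakeover, OwnerProbe, and the field names are hypothetical, and the probe results are assumed to come from repeated GET requests to the owner's CDP /json/version endpoint.

```typescript
// Hypothetical inputs gathered when a second process collides with a live owner.
interface OwnerProbe {
  cdpProbes: boolean[]; // one entry per /json/version attempt; true = reachable
  lockAgeMs: number;    // how long ago the owner created the lock
}

type Decision = "take-over" | "refuse";

// Decide whether a colliding process may take over the owner's lock.
function decideTakeover(probe: OwnerProbe, bootGraceMs: number): Decision {
  // Inside the boot-grace window the owner may still be launching Chrome,
  // so an unreachable CDP endpoint is not yet evidence of a dead owner.
  if (probe.lockAgeMs < bootGraceMs) return "refuse";

  // Without any probe evidence, stay conservative and keep the owner.
  if (probe.cdpProbes.length === 0) return "refuse";

  // A healthy owner is never evicted: one successful probe is enough.
  if (probe.cdpProbes.some((ok) => ok)) return "refuse";

  // Every probe failed after the grace window: treat the lock as stale.
  return "take-over";
}
```

Requiring every probe to fail before takeover is one way to keep the single-owner invariant; the actual probe count and grace duration used by the PR are not stated here.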


5. Alignment analysis vs SSOT #1359 / audit #1457 / roadmap #1463

SSOT #1359 = "host-neutral MCP browser harness for real Chrome; the MCP protocol is the product boundary; no hidden host-specific behavior." Audit #1457 = "direction adherence high; achievement gated by primitives built, wiring/enforcement missing; + develop ↔ main divergence (Pillar B stack on main, absent on develop)."

| PR / item | Aligned? | Rationale against direction & SSOT |
| --- | --- | --- |
| #1477 health-aware takeover | ✅ Strong | Turns Pillar-B safety primitive (controller-lock) from advisory/deadlock-prone into enforcing + self-healing. Exactly the audit's "wire the primitive to the real call path" remedy. Preserves single-owner invariant (no split-brain via boot-grace + multi-probe). Does not change the opt-in default → respects D3. |
| #1478 owner self-release | ✅ Strong | Same Pillar-B reliability axis; makes lifecycle facts truthful (a dead owner stops claiming ownership). Anti-flap guard keeps it conservative. No new host-specific behavior. |
| #1479 MCP-visible remediation | ✅ Strong + directly SSOT-core | Moves the failure story from discarded stderr onto portable MCP surfaces (notification + tool + structured error data). This is the literal product boundary of #1359: "a feature belongs in OpenChrome only if exposed through portable MCP surfaces." Best alignment of the three. |
| Keeping broker opt-in (no default change) | ✅ Required | D3 froze broker as opt-in; default session exempt from TTL. Auto-electing broker in the default path would contradict the recorded decision and the "no hidden behavior" non-identity. |
| Targeting develop | ✅ Correct, but verify divergence | #1457 M0 flagged the Pillar-B stack as main-only / absent on develop; back-merge #1455 has since merged (2026-05-29). Confirm src/utils/controller-lock.ts et al. now exist on develop so the stack rebases cleanly and doesn't reintroduce divergence. |

Verdict: the #1474 stack is on-direction and SSOT-consistent. It fixes (a) the deadlock and (b) the observability gap without disturbing the opt-in broker policy. It is the "enforcement/wiring" the audit asked for, applied to Pillar B. Recommendation: merge #1477 → #1478 → #1479 as-is.
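To make the observability half of the verdict concrete, here is a hedged sketch of what a structured JSON-RPC error carrying the fields listed for #1479 (port, profile, owner pid, lock path, ordered remediations) might look like. The type names, property names, message text, and remediation strings are illustrative assumptions, not the PR's actual payload; only the -32000 code and the list of data fields come from this issue.

```typescript
// Hypothetical payload shape; field names are assumptions for illustration.
interface DuplicateControllerData {
  port: number;
  profile: string;
  ownerPid: number;
  lockPath: string;
  remediations: string[]; // ordered, most preferred first
}

interface JsonRpcErrorResponse {
  jsonrpc: "2.0";
  id: number | string | null;
  error: { code: number; message: string; data: DuplicateControllerData };
}

// Build an error response that keeps the -32000 code but attaches
// machine-readable context instead of leaving it on stderr.
function buildDuplicateControllerError(
  id: number | string | null,
  data: DuplicateControllerData,
): JsonRpcErrorResponse {
  return {
    jsonrpc: "2.0",
    id,
    error: {
      code: -32000, // implementation-defined server error range in JSON-RPC 2.0
      message: `Another controller (pid ${data.ownerPid}) owns port ${data.port}`,
      data,
    },
  };
}

// Example values mirror the lock captured in section 1; remediation text is made up.
const example = buildDuplicateControllerError(1, {
  port: 9222,
  profile: "/Users/<user>/.openchrome/profile",
  ownerPid: 24035,
  lockPath: "~/.openchrome/locks/port-9222-Users_<user>_.openchrome_profile.json",
  remediations: [
    "Reuse the session that already owns this port and profile",
    "Register this session with serve --connect-broker against a running broker",
  ],
});
```

A host that understands the data field can render the remediations directly, while a host that ignores it still sees the same error code as before.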


6. Remaining gaps NOT covered by the #1474 stack (newly found)

The #1474 stack makes the default path safe and legible, and auto-recovers a dead owner — but it still leaves (N−1) concurrent sessions non-functional (they get a clear error instead of a cryptic one). For hosts that genuinely want concurrent sessions, the opt-in broker path must be reachable, portable, and documented. These are "wiring/enforcement missing" items in the #1457 sense:

  • G1: Broker flags are hidden in the user-facing CLI (discoverability/Pillar A). node dist/index.js serve --help lists --broker/--connect-broker, but the bin wrapper openchrome serve --help does not: flags such as --pilot and --hybrid are shown, while the broker flags are omitted. The one documented escape hatch is invisible at the surface most operators read. → Reconcile the two help surfaces.
  • G2: Orphan Chrome leak on owner death (Pillar B isolation/cleanup). When an owner dies or becomes a half-zombie, Tier-3 headed-fallback children (basePort+100, e.g. 9322) and headless instances (e.g. 9666) can survive un-reaped; reap-orphans does not collect them. → Extend reaper + watchdog teardown to cover fallback/headless descendants, and verify that owner self-release (#1478) triggers full child cleanup.
  • G3: No portable host-registration recipe for concurrent sharing (Pillar A). Today an operator must hand-author the topology: start one serve --auto-launch --broker owner first, then switch every other session's registration to serve --connect-broker. --connect-broker exit(2)s if no broker is published (no auto-start), and neither the chicken-and-egg ordering problem nor the broker as a single point of failure (SPOF) is documented. → Have mcp-client-config emit a broker-topology registration (owner + client) and document the recipe in README; surface the "no broker found, start one with …" hint over MCP (per #1479's pattern), not just stderr.
  • G4: (Optional) explicit opt-in convenience auto-elect, which must NOT change the default. A first-session-becomes-owner / others-auto-connect-as-clients mode would remove the manual ordering for concurrent users, but per D3 it can only ship as an explicit opt-in (e.g. --auto-broker), never as the default --auto-launch behavior. Gate it behind a flag + trust config; lease/TTL already exists (#1460). Treat it as a separate proposal, not part of the #1474 stack.
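As a sketch of the G3 recipe, the snippet below generates the two registrations an operator currently has to hand-author. It assumes the common mcpServers JSON shape (command plus args) used by MCP hosts such as Claude Code; brokerTopology and the entry type are hypothetical, and this is not the output of the existing mcp-client-config command. The flags themselves are the ones named in G3.

```typescript
// Assumed host registration shape: { mcpServers: { <name>: { command, args } } }.
interface ServerEntry {
  command: string;
  args: string[];
}

interface HostConfig {
  mcpServers: Record<string, ServerEntry>;
}

// Produce one owner registration and one client registration.
// The owner must be started first: --connect-broker exits with code 2
// when no broker has been published.
function brokerTopology(name = "openchrome"): { owner: HostConfig; client: HostConfig } {
  const owner: HostConfig = {
    mcpServers: {
      [name]: { command: "openchrome", args: ["serve", "--auto-launch", "--broker"] },
    },
  };
  const client: HostConfig = {
    mcpServers: {
      [name]: { command: "openchrome", args: ["serve", "--connect-broker"] },
    },
  };
  return { owner, client };
}

// Serialize for pasting into a host's MCP configuration file.
const topology = brokerTopology();
const ownerJson = JSON.stringify(topology.owner, null, 2);
const clientJson = JSON.stringify(topology.client, null, 2);
```

Exactly one session would use the owner registration and every other session the client one; the ordering and SPOF caveats from G3 still apply to this layout.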

7. Proposed action plan (phased)


8. Acceptance criteria


9. Evidence appendix (investigation machine)

  • Live owner lock: ~/.openchrome/locks/port-9222-...profile.json → pid 24035, v1.12.7, stdio/auto.
  • Managed Chrome: pid 95492 --remote-debugging-port=9222 --user-data-dir=~/.openchrome/profile (+ watchdog pid 95493 that SIGKILLs Chrome on owner death).
  • Orphans observed: pid 805 (headed-fallback, port 9322), pid 32230 (--headless=new, port 9666) — un-reaped (relates to G2).
  • auto-connect.ts boundary confirmed: refuses to attach to the managed profile → managed profile always launch-mode (relevant to why no implicit sharing exists today).

Refs: SSOT #1359 · audit #1457 · back-merge #1455 · controller-lock #1376 · diagnostics #1377 · broker foundation #1379 · lease TTL #1460 · roadmap/D3 #1463 · deadlock bug #1474 · fixes #1477/#1478/#1479.
