Summary
Codex bot's heartbeat ack fails because it runs c4-control.js with the system Node.js (v18) instead of the nvm-managed Node.js (v22). Since better-sqlite3 in comm-bridge was compiled with Node v22, this causes ERR_DLOPEN_FAILED, making ack permanently fail and trapping the bot in a recovery restart loop.
Fault Chain (observed on production VM)
- 08:32 SGT: Codex session context reaches 75%, triggering normal session rotation
- 08:51 SGT: New session starts, heartbeat is not acked → health transitions to RECOVERING
- Recovery mechanism sends heartbeat every 5 minutes (with 300s available_in delay), but recovery_timeout also fires within ~5 minutes → restart loop
- User messages reset cooldown, further delaying heartbeat delivery
Root Cause
The VM's default Node.js is v18 (system-installed), but comm-bridge's better-sqlite3 native addon was compiled against Node v22 (nvm-managed). When the Codex bot runs c4-control.js ack from tmux, it resolves to system node (v18), and loading the addon fails with ERR_DLOPEN_FAILED.
The ack never succeeds → health never transitions from recovering to ok → activity-monitor keeps restarting the bot.
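A version guard could surface this mismatch as a readable message instead of a raw addon load failure. The following is a minimal sketch, not part of zylos-core: the names REQUIRED_MAJOR, parseMajor, and checkNodeVersion are hypothetical, and the required major version (22) is taken from this report.

```javascript
// Hypothetical guard: report when the running Node.js major version differs
// from the one the native addon was built for (v22 according to this report).
const REQUIRED_MAJOR = 22;

// Extract the major version number from a string such as "v18.19.1".
function parseMajor(version) {
  const match = /^v?(\d+)\./.exec(version);
  if (!match) {
    throw new Error(`Unrecognized Node.js version string: ${version}`);
  }
  return Number(match[1]);
}

// Return null when the version is acceptable, otherwise an explanatory message.
function checkNodeVersion(version, requiredMajor = REQUIRED_MAJOR) {
  const major = parseMajor(version);
  if (major === requiredMajor) {
    return null;
  }
  return `Node.js ${version} is running, but native addons were built for v${requiredMajor}.x`;
}

// Intended placement: near the top of an entry script, before the addon is required.
const problem = checkNodeVersion(process.version);
if (problem) {
  // execPath shows which node binary the shell actually resolved.
  console.error(`${problem} (execPath: ${process.execPath})`);
}
```

Printing process.execPath alongside the version makes it easy to see from inside tmux whether the system binary or the nvm-managed one was picked up.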
Workaround
Manually run ack with the correct Node version:
~/.nvm/versions/node/v22.22.0/bin/node c4-control.js ack --id <id>
Expected Fix
Ensure that all zylos-core scripts (especially those invoked by the Codex bot in tmux, such as c4-control.js) use the nvm-managed Node.js rather than falling back to the system Node.js. Possible approaches:
- Shebang or wrapper: Add explicit nvm node path resolution in scripts or use a wrapper that sources nvm before execution
- PATH enforcement: Ensure the bot's tmux session inherits the correct PATH with nvm node first
- Codex runtime init: When starting the Codex runtime, explicitly set NODE / PATH in the tmux environment
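The PATH enforcement and runtime init approaches above could be sketched as follows. This is an illustration under assumptions, not the zylos-core implementation: the helper names (nvmBinDir, pathWithNodeFirst, buildRuntimeEnv) are hypothetical, the versions/node/<version>/bin layout is taken from the workaround path in this report, and NODE is assumed to be a plain environment variable holding the absolute path to the node binary.

```javascript
const path = require('path');
const os = require('os');

// Build the nvm-managed bin directory for a given Node.js version.
// The layout mirrors the workaround path: ~/.nvm/versions/node/<version>/bin
function nvmBinDir(version, nvmDir = path.join(os.homedir(), '.nvm')) {
  return path.join(nvmDir, 'versions', 'node', version, 'bin');
}

// Return a PATH string with binDir first and any other occurrence of it removed.
function pathWithNodeFirst(currentPath, binDir) {
  const rest = (currentPath || '')
    .split(path.delimiter)
    .filter((entry) => entry && entry !== binDir);
  return [binDir, ...rest].join(path.delimiter);
}

// Build the environment a launcher could hand to the tmux session or a child process.
function buildRuntimeEnv(baseEnv, version) {
  const binDir = nvmBinDir(version);
  return {
    ...baseEnv,
    PATH: pathWithNodeFirst(baseEnv.PATH, binDir),
    NODE: path.join(binDir, 'node'), // assumed variable name, see lead-in
  };
}
```

A launcher could pass the result as the env option of child_process.spawn, or apply the same values to the session with tmux set-environment, so that a bare node invocation inside tmux resolves to the nvm-managed binary.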
Environment
- VM default node: v18 (system)
- nvm node: v22.22.0
- Affected module: comm-bridge/better-sqlite3 (native addon)
- Runtime: Codex (tmux-based)