Skip to content

feat(adversarial-review): make self-collect path multi-turn so it can actually work#328

Open
kingdoooo wants to merge 19 commits into
openai:mainfrom
kingdoooo:feat/codex-self-collect-multiturn
Open

feat(adversarial-review): make self-collect path multi-turn so it can actually work#328
kingdoooo wants to merge 19 commits into
openai:mainfrom
kingdoooo:feat/codex-self-collect-multiturn

Conversation

@kingdoooo
Copy link
Copy Markdown

Summary

The self-collect path for /codex:adversarial-review has been broken since
it was introduced: a single turn pinned outputSchema, which forced every
agent emission to validate against the final-output schema, blocking the
shell investigation the prompt asked Codex to perform. Restructure the
path as a two-phase flow on the same Codex thread — a free-form
investigation phase with no schema, followed by a schema-enforced
finalization turn — so the contract is internally consistent and
large-diff reviews actually produce structured output. Codex's read-only
sandbox already permits the shell commands the investigation needs; no
new tool registration is required.

What changed

  • runAppServerInvestigation in lib/codex.mjs orchestrates a recon loop (no schema, free-form tool use) until convergence (finalAnswerSeen && commandCount === 0) or the turn cap (default 10), then runs a single finalize turn with the review schema set.
  • executeReviewRun branches on context.inputMode === "self-collect" to call the new orchestrator. Inline-diff payloads stay byte-identical.
  • New investigate/finalize prompt templates and matching builders.
  • New CLI flag --max-investigation-turns N with positive-integer validation.
  • Renderer prints Investigation truncated at N turns; findings may be shallow. Use --max-investigation-turns to raise the cap. on all three render paths (parse-error, validation-error, success) when the cap was hit.
  • Status normalization: runAppServerInvestigation always returns numeric status so payload.codex.status shape stays consistent with runAppServerTurn.
  • Queue-driven fake-codex harness and 22 new tests covering convergence, truncation, soft/hard error abort on phase 1, finalize-turn transport error, outputSchema discipline, banner placement, CLI flag propagation, end-to-end self-collect, inline regression guard.

Test plan

  • npm test — 109/109 passing
  • All new test cases run hermetically (each fixture creates and cleans up its own temp git repo)
  • Inline-diff path payload remains byte-identical (regression test added)
  • Self-collect e2e subprocess test exercises companion CLI → orchestrator → fake-codex round trip
  • Manual smoke test on a real branch with >2 changed files (covered by spec; reviewer's discretion)
  • Manual smoke test of inline path (1-2 file diff) to confirm no regression

🤖 Generated with Claude Code

kingdoooo and others added 14 commits May 17, 2026 19:36
Address review findings on the queue-driven harness:
- close() now splices its own binDir out of PATH so handles can be closed in any order
- Collapse redundant saveState calls in the queue-driven branch
- Document the requests getter's per-access I/O cost

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Address review finding: the soft-error test only checked truthiness of
result.error. The fixture queues a deterministic message, so match
against it explicitly to catch regressions where the error gets swallowed
or replaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…rtion

The argument-hint assertion in commands.test.mjs hard-coded a contiguous
match of "[--scope ...] [focus ...]". Splitting that into separate matches
lets the new flag sit between the two without breaking the test.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The two transport-error returns in runAppServerInvestigation were emitting
{ code: 1, kind: "error" } as `status`, while the happy path and
runAppServerTurn always return a number. Downstream payload.codex.status
inherited this object on error paths but a number elsewhere. Normalizing
the error returns to `status: 1` makes the public shape consistent and
removes the need for the defensive typeof check at the executeReviewRun
seam.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adds a unit test that runs both recon turns to convergence, then
queues a transport error on the finalize turn. Verifies that the
investigation metadata (turnCount=2, truncated=false) is preserved
even when finalize fails, and that the numeric status contract holds.
Also tightens the existing phase-1 hard-error assertion with the
status check.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Extend the truncation banner to tell the user how to raise the cap. The
banner only fires when investigation hit the turn cap, which is exactly
when the flag is actionable.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
@kingdoooo kingdoooo requested a review from a team May 18, 2026 01:22
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6992596928

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +732 to +733
const parsedTurns = Number.parseInt(rawMaxTurns, 10);
if (!Number.isFinite(parsedTurns) || parsedTurns <= 0) {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reject non-integer max investigation turn values

The new --max-investigation-turns validation uses Number.parseInt, which silently accepts malformed values like 1.5 (becomes 1) or 10abc (becomes 10) even though the flag contract says this must be a positive integer. This can hide user typos and run significantly fewer or different investigation turns than requested, degrading review depth without any error signal. Please validate the full token (for example with a strict integer regex or Number(...) plus an integer check) before accepting it.

Useful? React with 👍 / 👎.

kingdoooo and others added 3 commits May 18, 2026 09:58
Number.parseInt silently salvages "1.5" as 1, "10abc" as 10, "  5" as 5,
and "1e2" via Number() as 100 — all of which violate the flag contract
of "positive integer literal" and would silently change the depth of
investigation without any error signal. Replace the parseInt-based check
with a strict regex /^[1-9][0-9]*$/ so typos throw instead of degrading
review depth. Reported by bot review on PR openai#328.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… message

The legacy convergence check required `phase: "final_answer"` on the
agent message, but recon turns run with outputSchema=null and the
app-server does not always tag messages with that phase. Production
runs hit the cap with the model emitting "My investigation is complete"
on every turn — visible evidence the loop was ignoring the contract the
investigate prompt teaches the model. Treat any 0-command turn that
emits an agent message as convergence; this matches the prompt and
unblocks the runaway-loop scenario.

Test fixture: make finalAnswer.phase overridable so we can simulate the
real-world case where recon agent messages lack the final_answer tag.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Production observation: even with a clean recon convergence, the model
sometimes emits a tool-call stub (e.g. {"cmd": "wc -l README.md"})
during the finalize turn instead of the structured review JSON. The
companion's parser then surfaces "Validation error: Missing string
verdict" with the raw stub as the final message — a confusing failure
for the user when the investigation produced real findings.

Fix in two layers:
1. Tighten the finalize prompt: explicitly forbid running commands or
   emitting anything other than the schema-conforming JSON.
2. Detect the contract violation in the orchestrator: if the finalize
   turn's commandExecutions is non-empty, retry once with a STRICT
   FINALIZE preamble that reiterates the no-tool-call rule. After two
   attempts, surface whatever the model produced so the parser can
   still flag malformed output instead of looping forever.

Test fixes:
- Adapt `respects maxInvestigationTurns` so the 4th queued entry is the
  pure-JSON finalize turn (previous version queued 4 cmd-only entries
  before finalize, accidentally relying on the old leniency).
- Add coverage for the new retry path: success-on-second-attempt and
  give-up-after-two-attempts.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a7a90c7d73

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +1225 to +1227
for (const cmd of finalizeState.commandExecutions) {
aggregatedCommandExecutions.push(cmd);
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Prevent duplicate finalize command aggregation

When both finalize attempts run commands (the retry-exhausted path), the second attempt is appended once inside the retry loop and then appended again unconditionally after the loop, so commandExecutions/fileChanges contain duplicates of the last finalize turn. This skews downstream telemetry/audit consumers (e.g., command counts and changed-file traces) exactly in the contract-violation scenario this code is trying to report, so the final append should avoid re-adding entries already recorded in-loop.

Useful? React with 👍 / 👎.

Production observation: a 2-file untracked review of two large markdown
docs (~100KB total) hit the inline-diff path, embedded both files into
a single-turn schema-pinned prompt, and the model responded with a
tool-call stub `{"cmd": "wc -l ..."}` instead of the review JSON. The
companion's parser then surfaced "Validation error: Missing string
verdict" — the bug PR openai#328 thought it had fixed, except inline-diff is
a separate code path from self-collect and was never touched.

Two layers:

1. Lower the inline cap from 2 files to 1. The inline path embeds full
   file contents into a single turn with outputSchema pinned and has no
   recovery if the model wants to investigate before producing a verdict.
   Anything larger than 1 file falls through to the two-phase
   self-collect path which tolerates exploratory recon turns.

2. Tighten the inline prompt: explicitly forbid running shell commands
   or emitting tool-call stubs, with the `{"cmd": "..."}` anti-pattern
   named so the model cannot rationalize emitting one.

Tests:
- New git.test.mjs case locks 2 files → self-collect.
- New investigation.test.mjs cases lock the prompt-contract for both
  the inline and finalize prompts (no shell commands, no tool-call
  stubs).

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 324836e24b

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".


if (turnState.error) {
return {
status: buildResultStatus(turnState),
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Return non-zero status when investigation turn reports error

When an investigation turn emits a soft failure (turnState.error is set), this branch still derives status from buildResultStatus(turnState), which returns 0 whenever the turn ended as completed. In practice that makes /codex:adversarial-review exit successfully even though no valid review JSON was produced (only parseError), so CI/automation can treat a failed review run as success. This error path should force a non-zero status whenever turnState.error is present.

Useful? React with 👍 / 👎.

When captureTurn records an error notification but the app-server still
emits turn/completed with status="completed", buildResultStatus was
returning 0. That made /codex:adversarial-review exit 0 even though no
valid review JSON was produced (only parseError), so CI/automation could
treat a failed run as success.

Honor turnState.error in buildResultStatus so all four call sites
(review, turn, investigation soft-error branch, finalize) report a
non-zero status whenever an error was observed. Strengthen the existing
soft-error test to assert result.status === 1.

Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant