feat(adversarial-review): make self-collect path multi-turn so it can actually work#328
feat(adversarial-review): make self-collect path multi-turn so it can actually work#328kingdoooo wants to merge 19 commits into
Conversation
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Address review findings on the queue-driven harness: - close() now splices its own binDir out of PATH so handles can be closed in any order - Collapse redundant saveState calls in the queue-driven branch - Document the requests getter's per-access I/O cost Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Address review finding: the soft-error test only checked truthiness of result.error. The fixture queues a deterministic message, so match against it explicitly to catch regressions where the error gets swallowed or replaced. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…gation Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
…rtion The argument-hint assertion in commands.test.mjs hard-coded a contiguous match of "[--scope ...] [focus ...]". Splitting that into separate matches lets the new flag sit between the two without breaking the test. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
The two transport-error returns in runAppServerInvestigation were emitting
{ code: 1, kind: "error" } as `status`, while the happy path and
runAppServerTurn always return a number. Downstream payload.codex.status
inherited this object on error paths but a number elsewhere. Normalizing
the error returns to `status: 1` makes the public shape consistent and
removes the need for the defensive typeof check at the executeReviewRun
seam.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Adds a unit test that runs both recon turns to convergence, then queues a transport error on the finalize turn. Verifies that the investigation metadata (turnCount=2, truncated=false) is preserved even when finalize fails, and that the numeric status contract holds. Also tightens the existing phase-1 hard-error assertion with the status check. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Extend the truncation banner to tell the user how to raise the cap. The banner only fires when investigation hit the turn cap, which is exactly when the flag is actionable. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6992596928
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| const parsedTurns = Number.parseInt(rawMaxTurns, 10); | ||
| if (!Number.isFinite(parsedTurns) || parsedTurns <= 0) { |
There was a problem hiding this comment.
Reject non-integer max investigation turn values
The new --max-investigation-turns validation uses Number.parseInt, which silently accepts malformed values like 1.5 (becomes 1) or 10abc (becomes 10) even though the flag contract says this must be a positive integer. This can hide user typos and run significantly fewer or different investigation turns than requested, degrading review depth without any error signal. Please validate the full token (for example with a strict integer regex or Number(...) plus an integer check) before accepting it.
Useful? React with 👍 / 👎.
Number.parseInt silently salvages "1.5" as 1, "10abc" as 10, " 5" as 5, and "1e2" via Number() as 100 — all of which violate the flag contract of "positive integer literal" and would silently change the depth of investigation without any error signal. Replace the parseInt-based check with a strict regex /^[1-9][0-9]*$/ so typos throw instead of degrading review depth. Reported by bot review on PR openai#328. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
… message The legacy convergence check required `phase: "final_answer"` on the agent message, but recon turns run with outputSchema=null and the app-server does not always tag messages with that phase. Production runs hit the cap with the model emitting "My investigation is complete" on every turn — visible evidence the loop was ignoring the contract the investigate prompt teaches the model. Treat any 0-command turn that emits an agent message as convergence; this matches the prompt and unblocks the runaway-loop scenario. Test fixture: make finalAnswer.phase overridable so we can simulate the real-world case where recon agent messages lack the final_answer tag. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Production observation: even with a clean recon convergence, the model
sometimes emits a tool-call stub (e.g. {"cmd": "wc -l README.md"})
during the finalize turn instead of the structured review JSON. The
companion's parser then surfaces "Validation error: Missing string
verdict" with the raw stub as the final message — a confusing failure
for the user when the investigation produced real findings.
Fix in two layers:
1. Tighten the finalize prompt: explicitly forbid running commands or
emitting anything other than the schema-conforming JSON.
2. Detect the contract violation in the orchestrator: if the finalize
turn's commandExecutions is non-empty, retry once with a STRICT
FINALIZE preamble that reiterates the no-tool-call rule. After two
attempts, surface whatever the model produced so the parser can
still flag malformed output instead of looping forever.
Test fixes:
- Adapt `respects maxInvestigationTurns` so the 4th queued entry is the
pure-JSON finalize turn (previous version queued 4 cmd-only entries
before finalize, accidentally relying on the old leniency).
- Add coverage for the new retry path: success-on-second-attempt and
give-up-after-two-attempts.
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a7a90c7d73
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| for (const cmd of finalizeState.commandExecutions) { | ||
| aggregatedCommandExecutions.push(cmd); | ||
| } |
There was a problem hiding this comment.
Prevent duplicate finalize command aggregation
When both finalize attempts run commands (the retry-exhausted path), the second attempt is appended once inside the retry loop and then appended again unconditionally after the loop, so commandExecutions/fileChanges contain duplicates of the last finalize turn. This skews downstream telemetry/audit consumers (e.g., command counts and changed-file traces) exactly in the contract-violation scenario this code is trying to report, so the final append should avoid re-adding entries already recorded in-loop.
Useful? React with 👍 / 👎.
Production observation: a 2-file untracked review of two large markdown
docs (~100KB total) hit the inline-diff path, embedded both files into
a single-turn schema-pinned prompt, and the model responded with a
tool-call stub `{"cmd": "wc -l ..."}` instead of the review JSON. The
companion's parser then surfaced "Validation error: Missing string
verdict" — the bug PR openai#328 thought it had fixed, except inline-diff is
a separate code path from self-collect and was never touched.
Two layers:
1. Lower the inline cap from 2 files to 1. The inline path embeds full
file contents into a single turn with outputSchema pinned and has no
recovery if the model wants to investigate before producing a verdict.
Anything larger than 1 file falls through to the two-phase
self-collect path which tolerates exploratory recon turns.
2. Tighten the inline prompt: explicitly forbid running shell commands
or emitting tool-call stubs, with the `{"cmd": "..."}` anti-pattern
named so the model cannot rationalize emitting one.
Tests:
- New git.test.mjs case locks 2 files → self-collect.
- New investigation.test.mjs cases lock the prompt-contract for both
the inline and finalize prompts (no shell commands, no tool-call
stubs).
Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 324836e24b
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
|
|
||
| if (turnState.error) { | ||
| return { | ||
| status: buildResultStatus(turnState), |
There was a problem hiding this comment.
Return non-zero status when investigation turn reports error
When an investigation turn emits a soft failure (turnState.error is set), this branch still derives status from buildResultStatus(turnState), which returns 0 whenever the turn ended as completed. In practice that makes /codex:adversarial-review exit successfully even though no valid review JSON was produced (only parseError), so CI/automation can treat a failed review run as success. This error path should force a non-zero status whenever turnState.error is present.
Useful? React with 👍 / 👎.
When captureTurn records an error notification but the app-server still emits turn/completed with status="completed", buildResultStatus was returning 0. That made /codex:adversarial-review exit 0 even though no valid review JSON was produced (only parseError), so CI/automation could treat a failed run as success. Honor turnState.error in buildResultStatus so all four call sites (review, turn, investigation soft-error branch, finalize) report a non-zero status whenever an error was observed. Strengthen the existing soft-error test to assert result.status === 1. Co-Authored-By: Claude Opus 4.7 (1M context) <[email protected]>
Summary
The self-collect path for
/codex:adversarial-reviewhas been broken sinceit was introduced: a single turn pinned
outputSchema, which forced everyagent emission to validate against the final-output schema, blocking the
shell investigation the prompt asked Codex to perform. Restructure the
path as a two-phase flow on the same Codex thread — a free-form
investigation phase with no schema, followed by a schema-enforced
finalization turn — so the contract is internally consistent and
large-diff reviews actually produce structured output. Codex's read-only
sandbox already permits the shell commands the investigation needs; no
new tool registration is required.
What changed
runAppServerInvestigationinlib/codex.mjsorchestrates a recon loop (no schema, free-form tool use) until convergence (finalAnswerSeen && commandCount === 0) or the turn cap (default 10), then runs a single finalize turn with the review schema set.executeReviewRunbranches oncontext.inputMode === "self-collect"to call the new orchestrator. Inline-diff payloads stay byte-identical.--max-investigation-turns Nwith positive-integer validation.Investigation truncated at N turns; findings may be shallow. Use --max-investigation-turns to raise the cap.on all three render paths (parse-error, validation-error, success) when the cap was hit.runAppServerInvestigationalways returns numericstatussopayload.codex.statusshape stays consistent withrunAppServerTurn.Test plan
npm test— 109/109 passing🤖 Generated with Claude Code