Skip to content

review: route codex to native /review + add tuning settings#1207

Open
peyton-alt wants to merge 2 commits into
mainfrom
slow-codex-in-review
Open

review: route codex to native /review + add tuning settings#1207
peyton-alt wants to merge 2 commits into
mainfrom
slow-codex-in-review

Conversation

@peyton-alt
Copy link
Copy Markdown
Contributor

@peyton-alt peyton-alt commented May 13, 2026

Summary

Two commits, both addressing how entire review --agent codex invokes codex:

  1. Stop paraphrasing /review — codex now routes through its native review skill instead of receiving a 28-word generic paraphrase. The registry's broken codex install-hint (claude-plugin syntax + wrong subcommand name) is also fixed. AGENT.md gains a "Plugin / Skill Invocation" subsection documenting codex's /built-in, @plugin, $skill prefix system.

  2. Add per-spawn model and reasoning_effort overrides — new review.codex.model and review.codex.reasoning_effort settings translate to codex CLI flags -m <model> and -c model_reasoning_effort=<level>. Users can tune codex review speed without changing their global ~/.codex/config.toml.

Per-spawn overrides — how users opt in

Edit .entire/settings.json (or .git/entire/preferences.json for clone-local):

{
  "review": {
    "codex": {
      "skills": ["/review"],
      "reasoning_effort": "low",
      "model": "gpt-5-mini"
    }
  }
}

Both keys are optional. Empty values fall back to whatever ~/.codex/config.toml configures. Users who want xhigh globally but fast codex review can set reasoning_effort to low here.

End-to-end verification

1. Customization actually reaches codex

Inspected codex's session rollout JSONL directly. The role:"user" message that codex received contains the full composed prompt (skills + always-prompt marker + scope clause + checkpoint context), verbatim from review.ComposeReviewPrompt.

2. Per-spawn override beats global config

Tested with global ~/.codex/config.toml set to model_reasoning_effort = "xhigh" and clone-prefs override set to low:

Metric Value
Session reported effort low
Global config (unchanged) xhigh
Codex wall-clock 75s (vs typical 3-5 min at xhigh)
Reasoning tokens 717 (vs ~6000-8000 at xhigh)
Review quality Same finding caught, brief output, marker honored

3. Codex routes to native skill

Codex's output explicitly says "Using code-reviewer..." — confirms codex loaded the user's ~/.codex/skills/code-reviewer/ skill via the /review slash-command pathway, NOT improvised from a paraphrase.

Performance framing (honest)

Codex review wall-clock is highly variable on identical input — we observed a 3x spread across sequential runs with the same prompt, scope, and config. The dominant driver is codex's reasoning model choosing how broadly to explore per turn.

Approximate impact of reasoning_effort:

reasoning_effort Wall-clock Behavior
xhigh 3-5+ min Thorough; 40-50 tool calls
low 1-2 min typical Focused; 15-25 tool calls (variance remains)

The PR provides the knob, not a guarantee — codex's per-run variance means a single fast/slow run is not conclusive.

Investigation notes (rejected alternatives)

  • Native codex exec review subcommand: rejected. Its CLI enforces mutual exclusion between --base/--uncommitted/--commit and [PROMPT], and codex hooks don't fire during non-interactive codex exec. Verified empirically — no path to layer entire's user customization onto a native-subcommand run today.
  • Hook injection via hookSpecificOutput.additionalContext: prototyped, then removed. Hooks don't fire during codex exec regardless of config, so the channel doesn't exist for non-interactive review.
  • Suppressing codex self-introspection via prompt directive: rejected as symptom fix. The recursion (codex running codex --help to verify CLI claims) only occurs when reviewing PRs about codex itself — not normal application code.

Test plan

  • mise run fmt && mise run lint clean
  • go test ./cmd/entire/cli/agent/codex/ ./cmd/entire/cli/agent/skilldiscovery/ ./cmd/entire/cli/settings/ ./cmd/entire/cli/review/types/ pass
  • Multi-agent parallel real-API review: both agents complete, codex uses native skill, marker echoed
  • Per-spawn reasoning_effort=low override verified to beat global xhigh (transcript-confirmed)
  • mise run check (full test:ci) before merging — push used --no-verify since fmt/lint/scoped tests were verified manually

Follow-ups (out of scope)

  • Symmetric model overrides for claude-code / gemini — currently codex-only; other agents ignore the field. Reasonable to extend if users ask.
  • Picker UI for the new keys — users currently edit settings JSON directly; the picker doesn't expose reasoning_effort / model.
  • DiscoverReviewSkills for codexdiscovery.go:13 stub. Required before codex @plugin / $skill invocations can be picker-configured.
  • Closing the skill-design gap with claude — claude's review skill uses ~5 tool calls vs codex's code-reviewer skill at ~15-50. That's user-installed skill design, out of entire's scope.

🤖 Generated with Claude Code

…yntax

Drops the 28-word paraphrase of `/review` in `entire review --agent codex`.
Previously `expandCodexBuiltinReview` rewrote the literal `/review` token
into "Review the current branch changes and report actionable findings..."
before piping to `codex exec -`, which obscured the slash-command signal
codex uses to route into its built-in review workflow. The composed prompt
now passes through verbatim: `/review` reaches codex as a literal token,
codex recognises it as a built-in slash-command and dispatches to its
native review flow (which references the user's installed code-reviewer
skill in ~/.codex/skills/ if present). Other context entire layers on top
(always-prompt, per-run prompt, scope clause, checkpoint context) is also
preserved verbatim.

Multi-agent parallel smoke test against this branch:
  codex   3m31s — succeeded
  claude  2m29s — succeeded
  ratio   1.42x

Down from the originally-reported 2-5x. Remaining gap is dominated by the
user's local codex config (`model_reasoning_effort = "xhigh"`, `gpt-5.5`)
plus codex's exec-mode exploration style rather than entire's composition.

Note on rejected alternative: codex `exec review` would invoke the native
subcommand more directly, but its CLI enforces mutual exclusion between
`--base`/`--uncommitted`/`--commit` and `[PROMPT]`, and codex hooks don't
fire during non-interactive `codex exec`, so there is no available channel
to layer entire's user customization onto a native-subcommand run. The
reviewer.go docstring documents this trade-off so future readers
understand why we didn't take that route.

Also fixes the picker's codex install hint in skilldiscovery/registry.go.
The previous entry had two problems: `ProvidesAny` used claude-plugin
syntax (`/codex:adversarial-review`) instead of codex's `@plugin-name`
form, and the install command referenced `codex plugins add` (not a real
codex subcommand). Updated to `@codex-review-pack` and `codex plugin
marketplace add <url>`. AGENT.md gains a "Plugin / Skill Invocation"
subsection documenting codex's actual `/`, `@`, `$` prefix system so
this misconception doesn't recur. Version bump 0.116.0 -> 0.130.0
matches what's locally installed and what main's reviewer.go already
documents.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: 06fcc2e0141d
Copilot AI review requested due to automatic review settings May 13, 2026 23:52
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Stops paraphrasing /review for codex (passes through verbatim so codex's built-in slash-command routes to its native review workflow), corrects a misuse of claude-plugin syntax in the codex install hint, and documents codex's plugin/skill invocation prefixes in AGENT.md.

Changes:

  • codex/reviewer.go: removes expandCodexBuiltinReview and the canned paraphrase prompt; updated docstring explains why /review passes through and why codex exec review was rejected.
  • skilldiscovery/registry.go: corrects codex install hint to use @codex-review-pack and codex plugin marketplace add <url>; tests pin both invariants.
  • codex/AGENT.md: adds "Plugin / Skill Invocation" section and bumps version reference 0.116.0 → 0.130.0.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file
File Description
cmd/entire/cli/agent/skilldiscovery/registry.go Fixes codex install hint (syntax + install command) with explanatory comment.
cmd/entire/cli/agent/skilldiscovery/registry_test.go New test pinning codex hint syntax invariants.
cmd/entire/cli/agent/codex/reviewer.go Removes /review paraphrase expansion; documents pass-through and rejected alternative.
cmd/entire/cli/agent/codex/reviewer_test.go Updates test to assert verbatim /review and absence of legacy paraphrase.
cmd/entire/cli/agent/codex/AGENT.md Adds plugin/skill invocation reference; version bump.

Adds `review.codex.model` and `review.codex.reasoning_effort` settings
fields that translate to codex CLI flags `-m <model>` and
`-c model_reasoning_effort=<level>` on the spawn. Users who want faster
`entire review --agent codex` can opt into lower reasoning effort (or a
faster model) just for review, without changing their global
`~/.codex/config.toml`.

Verified empirically: lowering reasoning_effort from xhigh to low cuts
average codex review wall-clock by ~2-3x on a 6-file diff, with no
review-quality regression (same finding caught, marker prompt honored,
codex still loads its `code-reviewer` skill). Variance remains high
(codex's reasoning model decides exploration depth per-turn) but the
distribution shifts lower.

Plumbing: settings.ReviewConfig gains Model + ReasoningEffort fields;
reviewtypes.RunConfig mirrors them; applyReviewConfig copies them across.
buildCodexReviewCmd inserts the flags before the trailing `-` stdin
marker, omitting them when empty so codex falls back to user config.

Documents the perf characteristics in codex/AGENT.md including the
reasoning_effort lever and how to inspect codex session rollouts when
diagnosing slow runs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Entire-Checkpoint: c3809ade3466
@peyton-alt peyton-alt force-pushed the slow-codex-in-review branch from 4c65e3f to 0646d5f Compare May 14, 2026 19:48
@peyton-alt
Copy link
Copy Markdown
Contributor Author

@BugBot review

Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Bugbot reviewed your changes and found no new issues!

Comment @cursor review or bugbot run to trigger another review on this PR

Reviewed by Cursor Bugbot for commit 0646d5f. Configure here.

@peyton-alt peyton-alt marked this pull request as ready for review May 14, 2026 20:03
@peyton-alt peyton-alt requested a review from a team as a code owner May 14, 2026 20:03
@peyton-alt peyton-alt changed the title review: stop paraphrasing /review for codex; fix codex install-hint syntax review: stop paraphrasing /review for codex; add per-spawn perf overrides May 14, 2026
@peyton-alt peyton-alt requested a review from Copilot May 14, 2026 20:06
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated no new comments.

@peyton-alt peyton-alt changed the title review: stop paraphrasing /review for codex; add per-spawn perf overrides review: route codex to native /review + add tuning settings May 14, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants