Skip to content

feat(auto): wire driver selection through capabilities#672

Merged
shaun0927 merged 12 commits intostack/auto-capabilities/2-hermes-driverfrom
stack/auto-capabilities/3-auto-driver-brake
May 7, 2026
Merged

feat(auto): wire driver selection through capabilities#672
shaun0927 merged 12 commits intostack/auto-capabilities/2-hermes-driverfrom
stack/auto-capabilities/3-auto-driver-brake

Conversation

@Q00
Copy link
Copy Markdown
Owner

@Q00 Q00 commented May 6, 2026

Stack

3/3 Auto selected driver/brake mode plus registry-backed CLI wiring. Review after #671.

Previous PRs:

Changes

Validation

  • uv run pytest tests/unit/backends/test_capabilities.py tests/unit/providers/test_factory.py tests/unit/providers/test_hermes_cli_adapter.py tests/unit/config/test_loader.py tests/unit/config/test_models.py tests/unit/cli/test_config.py tests/unit/cli/test_auto_command.py tests/unit/auto/test_driver_answerer.py tests/unit/auto/test_surface.py -q
  • uv run ruff check src/ouroboros/backends src/ouroboros/providers/hermes_cli_adapter.py src/ouroboros/providers/factory.py src/ouroboros/orchestrator/runtime_factory.py src/ouroboros/config/loader.py src/ouroboros/cli/commands/auto.py src/ouroboros/cli/commands/config.py tests/unit/backends/test_capabilities.py tests/unit/providers/test_hermes_cli_adapter.py tests/unit/providers/test_factory.py tests/unit/cli/test_config.py tests/unit/cli/test_auto_command.py
  • uv run mypy src/ouroboros/backends src/ouroboros/providers/hermes_cli_adapter.py src/ouroboros/providers/factory.py src/ouroboros/cli/commands/config.py src/ouroboros/cli/commands/auto.py
  • uv run ouroboros config driver --help && uv run ouroboros auto --help

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from 1f7499b to 4cae512 Compare May 6, 2026 19:02
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 58ec772 for PR #672

Review record: f99c89b6-8054-4c18-ac9c-f249bf2276b6

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/cli/commands/config.py:315 | BLOCKING | config driver set claims to manage the default auto-interview driver, but it persists the choice by overwriting llm.backend. That setting is consumed globally by other LLM-backed flows, so this command silently changes unrelated behavior outside ooo auto as well. This is a contract/UI violation, not just an implementation detail: a user trying to scope the change to interview answering will unexpectedly retarget other features. |
| 2 | src/ouroboros/cli/commands/config.py:292 | BLOCKING | config driver dereferences data.get("llm", {}).get("backend", ...) without validating that llm is a mapping. If the config is structurally invalid (for example llm: []), this subcommand raises AttributeError instead of returning the friendly config error that the rest of this command group already provides. The new surface needs the same malformed-config handling as config show/validate. |

Non-blocking Suggestions

| 1 | src/ouroboros/cli/commands/config.py:292 | documentation | The user-facing default/current value is shown as claude_code, while the accepted driver choices are documented as canonical names like claude. Normalizing the display would make the new command less confusing. |

Design Notes

The async answerer plumbing in auto is directionally sound, and the resume-path validation is careful. The main design problem is that the new config driver surface is not actually scoped to “interview driver” state; it is wired to the global LLM backend knob.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch 2 times, most recently from fc02093 to 3f906b5 Compare May 6, 2026 19:06
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Addressed the requested changes in 3f906b54:

  • config driver set now writes auto.interview_driver_backend, not global llm.backend.
  • Added AutoConfig and get_auto_interview_driver_backend() so the scope is explicit.
  • ooo auto uses the configured auto driver when --driver is omitted, while plain default config remains deterministic.
  • _load_config() now validates the auto section as a mapping, and config driver has regression coverage for malformed auto: [].

Validation after the fix:

  • uv run pytest ... -q => 347 passed
  • uv run ruff check ...
  • uv run mypy ...
  • uv run ouroboros config driver --help && uv run ouroboros auto --help

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 3f906b5 for PR #672

Review record: bdfc1490-abd8-40a9-973a-573dd0b18974

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/cli/commands/auto.py:233 | BLOCKING | Resume rejects valid Claude-backed sessions because the persisted driver is normalized to claude_code later (state.interview_driver_backend = selected_answerer.backend), but resume compares against resolve_interview_driver_backend(...), which canonicalizes the same user choice to claude. After the first save, ooo auto --resume ... can raise resume driver mismatch for the same backend, making Claude driver sessions effectively non-resumable. |
| 2 | src/ouroboros/cli/commands/auto.py:188 | BLOCKING | The configured default driver is applied before the resume path is examined, so resuming a legacy/deterministic session with state.interview_driver_backend is None will silently switch that in-flight session to the current configured auto driver even when the user did not pass --driver. That changes interview semantics mid-session based only on current config, which violates the persisted-session contract the surrounding resume logic is trying to preserve. |

Non-blocking Suggestions

None.

Design Notes

The separation between deterministic scaffolding and selected-driver answering is reasonable, and the new state/config plumbing is mostly coherent. The main weakness is that driver identity is normalized inconsistently across config, session state, and resume checks, so persisted-session semantics are not yet stable.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from 4cae512 to 84095bb Compare May 6, 2026 19:49
@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 3f906b5 to c0efaed Compare May 6, 2026 19:53
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Addressed the latest review in c0efaed3:

  • Resume no longer applies configured default auto driver; config defaults only affect new sessions.
  • Driver identity now validates through interview-driver capability and persists/compares through LLM adapter identity (claude -> claude_code) to avoid Claude resume mismatch.
  • MCP driver alias handling now follows the same identity path.
  • Driver mode preserves structured scaffold ledger values/status instead of collapsing sections into one freeform driver answer.
  • Risk provenance for brake=off is now active (risk.auto_driver.*, LedgerStatus.INFERRED) so grading can surface it.

Validation:

  • uv run pytest ... -q => 355 passed
  • uv run ruff check ...
  • uv run mypy ...

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit c0efaed for PR #672

Review record: c0207d13-11d0-4b2c-8eac-4deecd60521d

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/cli/commands/auto.py:232 | BLOCKING | The new resume check normalizes the requested driver to LLM adapter identity (claude -> claude_code) but compares it against the persisted state.interview_driver_backend without normalizing that stored value first. Any in-flight session created before this patch that persisted claude will now fail ooo auto --resume ... --driver claude with a false resume driver mismatch, which is exactly the compatibility case this change is supposed to fix. |
| 2 | src/ouroboros/mcp/tools/auto_handler.py:159 | BLOCKING | The MCP resume path has the same raw-vs-normalized driver comparison bug as the CLI path. A previously persisted session with interview_driver_backend="claude" will be rejected when resumed with driver="claude" because the request is normalized to claude_code before comparison, but the stored session value is not. |

Non-blocking Suggestions

None.

Design Notes

The overall direction is reasonable: driver selection is now scoped to auto interviews instead of global LLM config, and preserving structured scaffold ledger entries avoids collapsing seed contracts into freeform text. The main remaining issue is backward-compatibility in resume semantics, where identity normalization is only applied to one side of the comparison.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from 84095bb to 9d7947a Compare May 6, 2026 20:02
@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from c0efaed to 3961fc8 Compare May 6, 2026 20:03
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 3961fc8 for PR #672

Review record: 11257bd5-7bcd-4329-9736-bfb647f26213

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

  • src/ouroboros/mcp/tools/auto_handler.py:176 [warning] New MCP-started auto sessions ignore the configured default auto.interview_driver_backend and always stay deterministic unless the caller passes driver explicitly. The direct CLI path now applies that default (src/ouroboros/cli/commands/auto.py:189-192), and config driver set presents it as the default interview driver backend globally. In practice this means the same user configuration produces different interview behavior depending on whether ooo auto is run through the CLI or through the ouroboros_auto MCP/tooling path, which is a real contract break for users relying on the configured default.

Non-blocking Suggestions

None.

Design Notes

The driver-selection wiring is mostly consistent now: resume semantics, backend identity normalization, and ledger provenance are all materially better. The main architectural gap is that the new “default interview driver backend” concept was implemented only in the CLI entrypoint, not in the MCP auto entrypoint, so the system still has two different sources of truth for how a default auto session is started.

Policy Notes

  • No in-scope blocking findings remained after policy filtering; downgraded verdict accordingly.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from 9d7947a to efaa22a Compare May 6, 2026 20:13
@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 3961fc8 to 3b9b655 Compare May 6, 2026 20:14
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Rebased #672 onto the updated #671 (efaa22a3) and force-pushed 3b9b6559 so the stacked PR points at the latest Hermes driver layer.

Validation after rebase:

  • uv run pytest tests/unit/cli/test_auto_command.py tests/unit/auto/test_surface.py tests/unit/auto/test_driver_answerer.py tests/unit/providers/test_factory.py tests/unit/providers/test_hermes_cli_adapter.py -q -> 117 passed
  • uv run ruff check src/ouroboros/cli/commands/auto.py src/ouroboros/mcp/tools/auto_handler.py src/ouroboros/auto/driver_answerer.py src/ouroboros/providers/factory.py src/ouroboros/providers/hermes_cli_adapter.py tests/unit/cli/test_auto_command.py tests/unit/auto/test_surface.py tests/unit/auto/test_driver_answerer.py tests/unit/providers/test_factory.py tests/unit/providers/test_hermes_cli_adapter.py -> passed
  • uv run mypy src/ouroboros/cli/commands/auto.py src/ouroboros/mcp/tools/auto_handler.py src/ouroboros/auto/driver_answerer.py src/ouroboros/providers/factory.py src/ouroboros/providers/hermes_cli_adapter.py -> passed

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 3b9b655 for PR #672

Review record: 967e517e-d661-4624-9f8b-bfbe3a6a6751

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/config/models.py:245 | BLOCKING | auto.interview_driver_backend is introduced as persisted config, but AutoConfig accepts any string. That means load_config() and ouroboros config validate will happily accept invalid values like auto.interview_driver_backend: bogus, and plain ooo auto only fails later when _run_auto() tries to consume the default backend. Since this PR adds a first-class config contract for the auto driver, the model needs the same backend validation the orchestrator config already has, plus a regression test for an invalid stored value. |

Non-blocking Suggestions

| 1 | src/ouroboros/cli/commands/config.py:290 | cosmetic refactoring | @driver_app.callback(invoke_without_command=True) will also run for subcommands, so ouroboros config driver set ... prints the current backend before mutating it. That is mostly a UX annoyance, but splitting “show” into an explicit command or guarding on invoked subcommand would make the CLI output cleaner. |

Design Notes

The overall direction is coherent: persisting driver/brake in auto state and normalizing driver identity on resume fixes real lifecycle issues, and keeping deterministic ledger updates separate from freeform driver text is the right architectural choice. The main gap is that the new config surface is not validated at the model boundary, so the persisted contract is weaker than the runtime contract.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from dcf37ee to 9430acf Compare May 6, 2026 20:26
@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 3b9b655 to 2817d48 Compare May 6, 2026 20:27
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Rebased #672 again onto the corrected #671 head (9430acf8) and pushed 2817d48c.

Validation after rebase:

  • uv run pytest tests/unit/cli/test_auto_command.py tests/unit/auto/test_surface.py tests/unit/auto/test_driver_answerer.py tests/unit/providers/test_factory.py tests/unit/providers/test_hermes_cli_adapter.py tests/unit/mcp/tools/test_pm_handler.py tests/unit/backends/test_capabilities.py -q -> 180 passed
  • uv run ruff check ... on touched backend/auto/provider/MCP/test files -> passed
  • uv run mypy ... on touched backend/auto/provider/MCP files -> passed

@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 2817d48 to 75e4fb6 Compare May 6, 2026 20:29
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Updated #672 after the config validation review.

Changes:

  • Added model-level validation for auto.interview_driver_backend using the backend capability registry's interview-driver resolver.
  • Preserved stored aliases like claude_code while rejecting unsupported values such as bogus during config load/validate.
  • Added regression coverage for both AutoConfig and ouroboros config validate.

Validation:

  • uv run pytest tests/unit/config/test_models.py tests/unit/cli/test_config.py tests/unit/cli/test_auto_command.py tests/unit/auto/test_surface.py -q -> 176 passed
  • uv run ruff check src/ouroboros/config/models.py tests/unit/config/test_models.py tests/unit/cli/test_config.py -> passed
  • uv run mypy src/ouroboros/config/models.py -> passed

@Q00 Q00 force-pushed the stack/auto-capabilities/2-hermes-driver branch from 9430acf to 76475a5 Compare May 6, 2026 20:37
@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 75e4fb6 to 52a3852 Compare May 6, 2026 20:37
@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 6, 2026

Rebased #672 onto the updated #671 head (76475a55) and pushed 52a38528.

Validation after rebase:

  • uv run pytest tests/unit/cli/test_auto_command.py tests/unit/auto/test_surface.py tests/unit/auto/test_driver_answerer.py tests/unit/providers/test_factory.py tests/unit/providers/test_hermes_cli_adapter.py tests/unit/mcp/tools/test_pm_handler.py tests/unit/backends/test_capabilities.py tests/unit/config/test_models.py tests/unit/cli/test_config.py -q -> 290 passed
  • ruff on touched backend/config/auto/provider/MCP/test files -> passed
  • mypy on touched backend/config/auto/provider/MCP files -> passed

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 52a3852 for PR #672

Review record: 01e6a8fd-6f11-4d7d-94ba-3a13fad6cb89

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/backends/capabilities.py:110 | BLOCKING | litellm is now advertised and validated as a supported interview driver, but the selected-driver path always calls the adapter with CompletionConfig(model="default", role="auto_interview_answer") (driver_answerer.py:87). LiteLLMAdapter forwards config.model directly to LiteLLM (litellm_adapter.py:268) unless a matching llm_role_profiles entry rewrites it, and the default config has no auto_interview_answer profile. On a normal install, ooo auto --driver litellm or config driver set litellm will therefore fail on the first interview answer with an invalid literal model name "default". Either keep litellm out of supports_interview_driver for now, or add a real model-resolution path and regression test for this backend. |

Non-blocking Suggestions

None.

Design Notes

The overall shape is reasonable: the async answerer integration, persisted driver/brake state, and alias normalization all line up well. The main issue is the capability contract now claims support for a backend whose adapter path is not actually runnable with the completion config this feature uses.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 Q00 force-pushed the stack/auto-capabilities/3-auto-driver-brake branch from 52a3852 to 80351dd Compare May 6, 2026 20:44
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 032964a for PR #672

Review record: 16698601-5e2a-4a4e-856d-7ee3bdaad063

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:236 | BLOCKING | _ledger_updates_for() rewrites every scaffold entry to LedgerSource.INFERENCE, which drops the original ASSUMPTION/NON_GOAL source categories. The pipeline result and ledger summary derive assumptions/non_goals strictly from those sources (src/ouroboros/auto/ledger.py:284-290, src/ouroboros/auto/pipeline.py:405-406), so selected-driver sessions will silently lose normal assumptions and non-goals from --show-ledger, MCP output, and downstream grading context. For example, a non-goals answer from AutoAnswerer (src/ouroboros/auto/answerer.py:162-171) becomes invisible after this rewrite. |

Non-blocking Suggestions

None.

Design Notes

The resume/config scoping work is disciplined, and normalizing persisted driver identity against adapter identity is the right direction. The main architectural regression is in the new driver answerer collapsing semantic ledger provenance that other layers still treat as part of the contract.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00
Copy link
Copy Markdown
Owner Author

Q00 commented May 7, 2026

Addressed the change request in 3f91e397:\n\n- Preserved scaffold ledger source categories in selected-driver sessions instead of rewriting every entry to INFERENCE.\n- Added regression coverage that driver-mode non-goal scaffold entries remain LedgerSource.NON_GOAL.\n\nValidation:\n- uv run pytest tests/unit/auto/test_driver_answerer.py -q -> 8 passed\n- uv run pytest tests/unit/auto/test_driver_answerer.py tests/unit/auto/test_surface.py tests/unit/auto/test_ledger_grading_answerer.py -q -> 118 passed\n- uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py -> passed\n- uv run mypy src/ouroboros/auto/driver_answerer.py -> passed\n- git diff --check -> passed

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 3f91e39 for PR #672

Review record: 4d0c7ce4-d06d-473f-8278-0693687da6a8

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Follow-up Findings

  • src/ouroboros/auto/driver_answerer.py:159 [warning] The risk gate is too broad for the default brake=on path: matching any standalone add marks the question as a “scope or product/business tradeoff”, and DriverAutoAnswerer.answer() then hard-blocks the session before calling the driver. Ordinary interview questions like “How should users add a task?” or “What should the add command do on duplicate input?” will now stop selected-driver sessions even though they are routine feature-behavior questions, not risky scope decisions. This makes the new default selected-driver mode regress into false blockers on common CRUD wording.

Non-blocking Suggestions

None.

Design Notes

The config scoping and resume semantics are much cleaner than earlier iterations: driver identity is persisted canonically, resume mismatch handling is explicit, and preserving scaffold ledger structure avoids collapsing interview state into freeform text. The main remaining issue is that the brake heuristic is currently policy-heavy enough to undermine the selected-driver path’s usability.

Policy Notes

  • No in-scope blocking findings remained after policy filtering; downgraded verdict accordingly.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

The bot's design note on PR #672 flagged a regression: bare `add` in the
risk gate matched ordinary CRUD wording like "How should users add a
task?", false-blocking selected-driver sessions under brake=on.

Replace bare `add` with `add(?:ing)?\s+(?:a|an|the)?\s*<scope-noun>`
where the scope-noun is feature/capability/support/requirement/epic/
story/scope/product-area. CRUD verbs no longer match while genuine
scope-additions ("Should we add a feature for offline mode?") still
fire the gate.

Refs #672

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shaun0927 pushed a commit to shaun0927/ouroboros that referenced this pull request May 7, 2026
Resolves conflict in driver_answerer.py / test_driver_answerer.py
between the stack/3 risk-gate scope fix (974f156) and the
post-response classifier branch. Keep both:
- Tightened scope-add regex from stack/3 (no longer false-blocks
  CRUD wording like "How do users add a task?").
- Post-response classifier and metadata threading from Q00#682/Q00#683.

Refs Q00#672
Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
shaun0927 pushed a commit to shaun0927/ouroboros that referenced this pull request May 7, 2026
Resolves conflict in driver_answerer.py between the stack/3 risk-gate
scope fix (974f156) and the metadata-plumbing branch. The metadata
field added in Q00#682 is preserved on the brake-on blocker path; the
scope-add regex tightening from stack/3 lands unchanged.

Refs Q00#672
Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 974f156 for PR #672

Review record: ac4a1e8a-92ae-4d0e-a7fc-883dd70962dd

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:129 | BLOCKING | The selected-driver path sends the driver’s freeform answer into the interview transcript (interview_driver.py:157), but ledger_updates are copied from the deterministic scaffold instead of reflecting that answer (driver_answerer.py:232-253). Seed generation later reads the interview session transcript (auto/adapters.py:67-77, authoring_handlers.py:557-563), while grading only checks the ledger for gaps/risk (auto/grading.py:138-245). That means a driver can answer “use Django/Postgres” while the ledger still records the scaffold’s different choice, and the pipeline can generate/review a Seed against two conflicting sources of truth without detecting the mismatch. This is a real correctness bug in the new mode, not just provenance metadata. |

Non-blocking Suggestions

None.

Design Notes

The driver-selection/config plumbing is coherent, and the async answerer integration is straightforward. The main architectural problem is that selected-driver mode currently breaks the ledger-as-contract model by letting transcript content diverge from the persisted ledger.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

The bot's blocking finding on PR #672 noted that the selected-driver
path sent the driver's freeform answer into the interview transcript
but kept the deterministic scaffold's value in ``ledger_updates``.
Seed generation reads the transcript while grading reads the ledger,
so the two could disagree silently (driver says "Django", ledger says
"FastAPI") without anyone catching the divergence.

- Replace each scaffold-derived ledger entry value with the driver's
  freeform answer verbatim and tag the entry CONFLICTING with reduced
  confidence (<= 0.4), so the Seed-ready/A-grade gates surface the
  divergence instead of treating the scaffold value as confirmed.
- Preserve the structural section/key/source so downstream
  section-aware Seed generation still works, and keep the original
  scaffold value in the rationale + evidence as audit context.
- Update the existing ledger-values test to lock in the new contract
  and assert the original scaffold value is preserved as rationale
  evidence.

Refs #675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit f229e35 for PR #672

Review record: d8174b0f-b35c-488f-ad42-3c26a4626c98

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:239 | BLOCKING | Every structural ledger update produced by DriverAutoAnswerer is forced to LedgerStatus.CONFLICTING. In this pipeline, conflicting sections are treated as open gaps (src/ouroboros/auto/ledger.py:270-282), and the interview loop blocks whenever gaps remain (src/ouroboros/auto/interview_driver.py:182-189). That means selected-driver sessions cannot actually converge to Seed-ready once they populate required sections; they will keep carrying unresolved gaps and eventually block instead of advancing to seed generation. |

Non-blocking Suggestions

None.

Design Notes

The driver/brake/config plumbing is coherent, and the resume semantics are much cleaner than earlier iterations. The remaining issue is architectural: the new “driver answer is canonical but needs later verification” model is encoded using a ledger status that the interview phase already treats as a hard stop, so the new mode conflicts with the existing Seed-ready contract.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

#679)

Issue #673 acceptance criteria require that the actual MCP payload reaching
ouroboros_auto carries forwarded driver/brake values and never leaks $driver
or $brake placeholder literals. Upstream stack/3 already locks dispatch-level
resolution; this commit adds the missing layer by mocking the ouroboros MCP
server and asserting call_tool's payload.

- Forwarding test: --driver hermes --brake off must reach call_tool with
  driver="hermes", brake="off", skip_run=True.
- Placeholder leak test: plain `ooo auto "goal"` must not send "$driver" or
  "$brake" as real values; an empty/None value is acceptable.

Closes #673
The earlier fix that aligned ledger values with the driver's freeform
answer marked every structural ledger update as ``LedgerStatus.CONFLICTING``.
The interview loop treats CONFLICTING (along with MISSING/WEAK/BLOCKED) as
an open gap (``ledger.py:open_gaps``), so a selected-driver session could
populate every required section and still never reach ``is_seed_ready`` —
the loop would carry the same gaps forever and eventually block.

Use ``LedgerStatus.INFERRED`` instead. The ledger entry still:
- Carries the driver's freeform answer verbatim as the value, so the
  persisted ledger and the interview transcript share the same content
  (no silent divergence between the two sources of truth).
- Has reduced confidence (<= 0.4) and an ``auto_interview_transcript``
  evidence marker so grading and the A-grade gate can downgrade or
  re-verify the driver answer before final acceptance.
- Preserves the original scaffold value verbatim in the rationale and
  in a ``scaffold_value:<value>`` evidence tag, so divergence between
  scaffold and driver is never silently lost.

INFERRED is the right semantic: the value is the driver's inferred answer,
not the user's confirmed choice; downstream grading is responsible for
deciding whether to escalate to verification before A-grade. The interview
loop is now free to converge for selected-driver sessions.

Tests:
- ``test_driver_answerer_ledger_values_reflect_driver_answer_without_blocking_loop``
  asserts the new INFERRED+verbatim+scaffold-evidence contract and locks
  in that the status is not in the loop's blocking set.
- ``test_driver_answerer_ledger_does_not_block_seed_ready_convergence``
  is a regression test that drives the answerer through every required
  section and asserts those sections do not end up in the open-gap set
  (the contract that broke under the CONFLICTING variant).

Addresses ouroboros-agent[bot] blocking finding on PR #672 against
``f229e35``. Refs #672.
Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit dc421da for PR #672

Review record: ec420147-2d07-442b-94b5-31e7a4db6f04

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

| 1 | tests/unit/auto/test_driver_answerer.py:67 | Nice-to-have tests | The new selected-driver coverage exercises the brake=off path well, but the default brake=on behavior still is not directly tested. A small regression test that asserts risky questions return a blocker without invoking the adapter would lock down the primary safety branch of this feature. |

Design Notes

The split between deterministic scaffolding and driver-sourced text is sound, and the config scoping change to auto.interview_driver_backend keeps this feature from mutating unrelated LLM defaults. Resume semantics and ledger provenance also look consistent with the contributor’s stated intent.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

shaun0927 and others added 2 commits May 7, 2026 16:44
Co-authored-by: Hermes Agent <hermes-agent@users.noreply.github.com>
Co-authored-by: shaun0927 <junghwan1912@gmail.com>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Hermes Agent <hermes-agent@users.noreply.github.com>
@shaun0927
Copy link
Copy Markdown
Collaborator

Re-review ping

Status

What the PR does

Wires interview driver selection (auto.interview_driver_backend) and a brake mode (AutoBrakeMode.{ON,OFF}) through the capability registry. The deterministic AutoAnswerer is kept as a structural scaffold; a DriverAutoAnswerer calls the selected backend (codex / hermes / claude_code / …) for the actual freeform answer that goes to the interview transcript. Brake-on holds risky scope/policy answers for approval; brake-off auto-sends them with assumption + provenance tags so the Seed-ready and A-grade gates remain the safety net.

Resume semantics persist driver identity canonically; resume mismatches are explicit; configured defaults only affect new sessions (not resumed ones); ledger source categories on selected-driver sessions preserve scaffold structure instead of collapsing into freeform text.

Iterative improvements driven by the bot

The current APPROVE is the result of three rounds against dc421dac:

  1. 974f1562 fix(auto): scope add-keyword brake gate to feature additions — The bot's first design note flagged a regression in the brake=on default: a bare add token in the risk gate matched ordinary CRUD wording such as "How should users add a task?" or "What should the add command do on duplicate input?", false-blocking selected-driver sessions. The regex now requires a scope-context noun within the same clause (add(?:ing)? <feature|capability|support|requirement|epic|story|product area|scope>); CRUD verbs no longer match while genuine scope-additions still do. Parametrized regression tests cover both directions.

  2. f229e35b fix(auto): align selected-driver ledger values with the freeform answer — The bot then escalated to a blocking finding: the driver's freeform answer went into the interview transcript but _ledger_updates_for kept the deterministic scaffold's value, so Seed generation (transcript-driven) and grading (ledger-driven) could disagree silently (driver says "Django/Postgres", ledger says "FastAPI"). The fix replaces each structural ledger entry's value with the driver's freeform answer verbatim, preserves the original scaffold value as audit evidence (scaffold_value:<value> evidence tag + rationale), and tags the entry with auto_interview_transcript evidence so downstream gates can verify before A-grade.

  3. dc421dac fix(auto): use INFERRED status for driver-derived ledger entries — The previous fix marked driver-derived entries LedgerStatus.CONFLICTING, which the bot correctly pointed out is in the loop's blocking set (SeedDraftLedger.open_gaps treats CONFLICTING/MISSING/WEAK/BLOCKED as open). That meant a selected-driver session would populate every required section yet never reach is_seed_ready, blocking convergence. The status is now LedgerStatus.INFERRED (low confidence ≤ 0.4 + transcript evidence marker), which preserves the value/transcript alignment without trapping the loop. A new regression test (test_driver_answerer_ledger_does_not_block_seed_ready_convergence) drives the answerer through every required section and asserts none of them end up in the open-gap set — locking in the contract that broke under the CONFLICTING variant.

Non-blocking suggestion in the latest review

The bot suggested a direct brake=on adapter-not-invoked test. This is already covered by the existing test_driver_answerer_brake_on_gates_risky_question (asserts answer.blocker is not None and adapter.prompts == []), so I did not add a duplicate. Happy to add an additional explicit assertion in a follow-up if you'd prefer broader explicit coverage.

Why merging is safe

  • Two design-level concerns (false brake-on regressions on routine CRUD wording; ledger ↔ transcript divergence) were discovered and fixed in this PR rather than being deferred — the bot itself pivoted from CHANGES_REQUESTED to APPROVE on the current head.
  • The convergence regression test (test_driver_answerer_ledger_does_not_block_seed_ready_convergence) explicitly locks in the property that the earlier CONFLICTING variant broke, so the same regression cannot reappear silently.
  • auto.interview_driver_backend is a scoped config knob that does not mutate global llm.backend defaults; resume sessions never inherit configured defaults, so existing sessions are unaffected.
  • 245 unit tests pass on the merged stack (tests/unit/auto/, tests/unit/cli/test_auto_command.py, tests/unit/router/test_packaged_auto_dispatch.py).

cc @Q00 — please consider merging this after #671. #680 is already approved on top of this branch and will rebase cleanly onto main once this stack lands.

@shaun0927
Copy link
Copy Markdown
Collaborator

Ready for maintainer review — re-review ping

This PR (driver selection through capabilities + brake mode for selected-driver auto interviews) is now in a clean merge-ready state on commit 9fcef4ca (which absorbs the just-landed #680 / #682 follow-ups via the stack chain):

  • Bot review: APPROVED on dc421dac (INFERRED status fix), with 9fcef4ca only adding the merge of fix(auto): align MCP resume bounds #680 (fix(auto): align mcp resume bounds) and feat(auto): add selected-driver answer metadata #682 (feat(auto): add selected-driver answer metadata) into the stack head — both already approved on their own PR HEADs and now merged into stack/3.
  • Mergeable: MERGEABLE/CLEAN against the parent stack/auto-capabilities/2-hermes-driver.
  • Verification: UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py tests/unit/auto/test_interview_pipeline.py tests/unit/auto/test_surface.py tests/unit/backends/test_capabilities.py tests/unit/providers/test_hermes_cli_adapter.py tests/unit/providers/test_factory.py tests/unit/auto/test_state.py → 209 passing locally on the stack/3 head.
  • Scope: Wires driver selection (llm.backend-resolved) through the capability registry, persists driver identity canonically, adds brake-mode gating for risky auto interview answers, and preserves scaffold ledger source categories.

How prior bot blocking findings were resolved (all on the current head)

  1. Risk gate add-keyword over-broad (false-blocked CRUD wording like "How do users add a task?") → tightened to \badd(?:ing)?\s+(?:a|an|the)?\s*<scope-noun> so routine CRUD passes while genuine scope-additions still fire (974f1562). Routine + scope tests pin both directions in tests/unit/auto/test_driver_answerer.py.
  2. Ledger contract divergence (driver freeform answer in transcript vs. scaffold value in ledger → conflicting sources of truth) → ledger entry values now hold the driver answer verbatim, with original scaffold value preserved in rationale + evidence (f229e35b).
  3. CONFLICTING blocked Seed-ready forever (because open_gaps treats CONFLICTING as a hard gap) → status switched to INFERRED with cap on confidence ≤ 0.4, so the A-grade review gate still flags entries for verification while the Seed-ready gate can close (dc421dac). Regression test test_driver_mode_ledger_entries_do_not_block_seed_ready locks down the contract.

Why this is safe to merge

  1. The downstream stack PRs (fix(auto): align MCP resume bounds #680, feat(auto): add selected-driver answer metadata #682) already merged cleanly on top of dc421dac, demonstrating no follow-up regressions in the resume-bounds and metadata layers that consume this stack.
  2. Selected-driver mode now: (a) does not false-block on CRUD wording, (b) does not let transcript and ledger silently disagree, and (c) does not permanently block Seed-ready while still surfacing low-confidence driver entries to the A-grade gate.
  3. No remaining bot blocking findings on the current head, no outstanding non-blocking suggestions, and no design notes flagging unaddressed risk.

Q00 maintainers — would you take a final pass and merge the foundation chain (#670#671#672)? Once #672 lands, the auto driver/brake surface is fully unblocked.

Copy link
Copy Markdown
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 9fcef4c for PR #672

Review record: f761e672-0038-45c0-acf3-6922f131cd61

Blocking Findings

| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.

---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:302 | BLOCKING | _ledger_updates_for() still rewrites each structural ledger value to "driver:{backend} answer (verbatim): {driver_text}" instead of persisting the actual driver answer string unchanged. That reintroduces the exact ledger/transcript divergence this PR says it fixed: the interview transcript stores answer.prefixed_text, while grading/seed consumers read a different ledger value with an extra wrapper. Any downstream equality/parsing checks now see two different sources of truth again, and the existing regression test only asserts substring inclusion, so it does not catch this contract break. |

Non-blocking Suggestions

None.

Design Notes

The selected-driver path is mostly well-contained: config, CLI/MCP, resume semantics, and brake gating line up cleanly. The remaining weakness is that provenance is still being encoded into the ledger value itself instead of staying in metadata/evidence, which makes the “single canonical answer” contract fragile.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Copy link
Copy Markdown
Collaborator

Cross-linking #725 self-restraint thread and the closures of #695/#683 (regulated-topic vocab + answer risk classifier).

This PR bundles two changes with very different self-restraint profiles:

Part Intent Self-restraint check
--driver flag + ouroboros config driver set <backend> Lets user pick which backend answers interview questions ✅ Generic UX, runtime-abstraction primitive — fits core
Brake mode (--brake on|off) + _is_risky classifier in driver_answerer.py Auto-blocks "risky" driver answers based on keyword categories 🟠 Same anti-pattern as #695 (regulated-topic vocab) and #683 (answer risk classifier), both closed

The brake's risk-question vocabulary (destructive execution intent, credentials, external side effects, ...) is exactly what #689 / #725 say should live in a UserLevel skill (selected-driver-risk-default), not in src/ouroboros/auto/driver_answerer.py.

Concrete recommendation:

  1. Split this PR:
  2. Or close this PR and Add selected driver and brake mode for ooo auto interviews #665, then re-open A as a fresh smaller PR. The driver_answerer.py classifier vocabulary is the same shape as feat(auto): block risky-fallback answers for regulated topics #695/feat(auto): classify selected-driver answer risk #683 and should follow the same boundary.

If keeping the brake in core temporarily is preferred, make it a no-op default (always allows answers, brake-on means "log only") so the vocabulary's enforcement gravity stays in plugin-land. That avoids the inconsistency of merging the brake here while #695/#683 are closed for the same reason.

Refs #725, #689, closed siblings #695, #683.

@shaun0927 shaun0927 merged commit d61df30 into stack/auto-capabilities/2-hermes-driver May 7, 2026
@shaun0927 shaun0927 deleted the stack/auto-capabilities/3-auto-driver-brake branch May 7, 2026 09:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants