feat(auto): classify selected-driver answer risk#683
feat(auto): classify selected-driver answer risk#683shaun0927 wants to merge 6 commits intoQ00:stack/auto-capabilities/3-auto-driver-brakefrom
Conversation
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
490f287for PR #683
Review record:
39967c8f-b67e-4698-9c41-a49d5d47245d
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:239 | BLOCKING | The destructive-action classifier will block many benign answers because production_target includes customer and the action list includes generic verbs like remove. A response such as “remove the customer-facing banner” or “delete the customer help text” now gets labeled actual answer recommends destructive production action, which turns ordinary content/UI edits into approval blockers under brake=on. |
| 2 | src/ouroboros/auto/driver_answerer.py:230 | BLOCKING | The new secret/credential gate has a common false-negative: it only catches a handful of provider-specific prefixes or phrases like access token, but not plain token: / Bearer ... / JWT-style values. That means a driver answer containing a live bearer token can still be auto-sent with brake=off, and even with brake=on it will bypass the new answer-text approval gate entirely. |
Non-blocking Suggestions
| 1 | tests/unit/auto/test_driver_answerer.py:41 | nice-to-have tests | Add regression coverage for both sides of the new classifier: a benign “customer-facing” edit should stay unflagged, and a plain bearer/JWT token should be flagged. The current tests only exercise the happy-path patterns the regex already handles. |
Design Notes
The structured AutoAnswerMetadata addition is reasonable and the intent to preserve deterministic scaffold ledger entries is coherent. The weak point is the new answer-text risk classifier: it is doing policy enforcement with broad regexes, so precision matters, and the current patterns are both over-broad for benign “customer-facing” edits and under-broad for common secret formats.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
490f287 to
ccedac3
Compare
|
Addressed the bot review findings on the selected-driver answer-text classifier:
Validation after the fix:
|
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
ccedac3for PR #683
Review record:
ccaef606-3d97-49c2-83fe-c1c4e9f58662
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new secret classifier suppresses all generic credential checks whenever the answer contains any placeholder-like word (placeholder, example, dummy, etc.). That means a response such as Use this placeholder password: supersecretprod123 or Example token: eyJ... will not be flagged, so brake=on can still auto-send an actual secret. Because this branch is intended to be the last safety gate for risky driver text, that bypass is a production correctness/security bug and needs coverage. |
Non-blocking Suggestions
None.
Design Notes
The change is well-scoped and the metadata/provenance plumbing is clean, but the text-risk classifier is carrying too much policy in a few regex branches. The safety-critical detection rules need tighter ordering and regression tests around mixed benign+risky phrasing, because small heuristic changes can silently defeat the approval gate.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Addressed the latest bot blocking finding on commit Changes:
Validation:
|
ccedac3 to
a34d1bc
Compare
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
a34d1bcfor PR #683
Review record:
af2912ba-5b42-4350-830a-66a98cc9cd2b
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new secret classifier only matches bare labels like token: / password= plus a handful of provider-specific prefixes, so common env-var or DSN style leaks such as OPENAI_API_KEY=..., AWS_SECRET_ACCESS_KEY=..., or DATABASE_URL=postgres://user:pass@... are not blocked. Under brake=on those answers would still be auto-sent even though this code is meant to approval-gate credential leakage from the selected driver. |
Non-blocking Suggestions
| 1 | tests/unit/auto/test_driver_answerer.py:55 | tests | Add regressions for env-var and connection-string credential leaks. The current tests cover bearer/JWT/plain token: cases, but not the much more common FOO_API_KEY=... or DATABASE_URL=... output shape that the classifier currently misses. |
Design Notes
The patch is directionally good: it adds structured provenance and a second risk gate on actual selected-driver output instead of relying only on question-level heuristics. The main weakness is that the text classifier is still too shape-specific for secrets, so the safety boundary is not yet reliable enough for brake=on.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
a34d1bc to
cafb38e
Compare
|
Addressed the latest bot blocking finding on env-var and DSN-shaped credential leaks. Changes:
Validation:
|
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
cafb38efor PR #683
Review record:
31671fc1-f45d-4286-82d5-35799b24a44f
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new credential classifier will false-positive on benign explanatory sentences because the generic pattern treats any 12+ character token after `password/token/secret ... is |
Non-blocking Suggestions
None.
Design Notes
The overall direction is sound: preserving the deterministic scaffold while adding structured provenance and post-generation risk gating is a reasonable way to keep the selected-driver path auditable. The main weakness is that the new text-risk classifier is still too regex-broad for natural-language explanations, so it can interfere with the approval flow rather than just catching genuine leaks.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
cafb38e to
d73de88
Compare
|
Addressed the latest bot blocking finding on natural-language false positives. Changes:
Validation:
|
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
d73de88for PR #683
Review record:
beaea29a-1fdd-4a0e-b309-0d48b5259ac1
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
The change is directionally sound: it keeps the deterministic scaffold as the ledger source of truth while adding explicit provenance and a second-stage classifier for risky selected-driver output. That separation preserves existing contracts and makes downstream audit surfaces easier to build.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Catch any rm -rf <path> not just bare /, and add regression tests for known provider tokens (AKIA/ghp_/github_pat/xoxb-/sk-), DROP/ TRUNCATE TABLE, rm -rf with non-root paths, and benign credential phrasing to keep the brake=on gate precise without silent bypass. Refs Q00#675 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
7dc5a55for PR #683
Review record:
e65707ad-1499-4c8d-9315-db426681e41c
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:257 | BLOCKING | The new answer-text classifier treats any rm -rf command, DROP TABLE, or TRUNCATE TABLE as a destructive production action even when the answer is clearly about local build cleanup or a development database. In brake=on this now blocks ordinary answers like rm -rf ./build && rebuild, and in brake=off it unnecessarily downgrades confidence and records risk provenance. That is a behavior regression from the stated “high-impact/risky drafts” policy, and the added tests explicitly lock the false positive in. |
Non-blocking Suggestions
None.
Design Notes
The metadata/provenance addition is straightforward and useful, but the selected-driver risk classifier is still too heuristic-heavy. The new credential checks are directionally good; the destructive-action branch needs tighter scoping so only production- or data-destruction advice is gated.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
Revert rm -rf detection back to bare-root and drop the redundant DROP/TRUNCATE TABLE branch. The verb+production_target check already gates production destruction (DROP DATABASE, delete prod DB), while local dev cleanup like rm -rf ./build, rm -rf node_modules, and DROP TABLE users in dev no longer false-positive under brake=on. Tests inverted to lock the corrected policy. Refs Q00#675 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: APPROVE
Reviewing commit
422ea25for PR #683
Review record:
b1688285-9850-43b7-8d8d-fc9bf6c4eb6b
Blocking Findings
No in-scope blocking findings remained after policy filtering.
Non-blocking Suggestions
None.
Design Notes
The patch keeps the existing deterministic scaffold as the source of ledger state while adding explicit provenance and post-generation risk screening for selected-driver text. That separation is sound, and the new tests cover both the classifier rules and the brake-on/brake-off integration paths for risky driver output.
Recovery Notes
First recoverable review artifact generated from codex analysis log.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Status update on this PR after the regression-and-fix cycle on Bot review trajectory
The bot's design note on What this PR is (closes #675)Post-generation risk screening for selected-driver
Why the current head is ready from the contributor side
Verification on
|
Resolves conflict in driver_answerer.py / test_driver_answerer.py between the stack/3 risk-gate scope fix (974f156) and the post-response classifier branch. Keep both: - Tightened scope-add regex from stack/3 (no longer false-blocks CRUD wording like "How do users add a task?"). - Post-response classifier and metadata threading from Q00#682/Q00#683. Refs Q00#672 Refs Q00#675 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
b896e22for PR #683
Review record:
4c0111b1-b538-4c75-a63a-2901c8df2451
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:234 | BLOCKING | The new selected-driver secret classifier still lets short but perfectly plausible credentials through because the generic assignment rule requires 12+ characters and the DSN rule requires an 8+ character password. password: hunter2 and postgres://demo:hunter2@... both bypass classify_driver_answer_text_risk, so with brake=on they would be auto-sent instead of blocked even though the policy here is “actual answer contains secret or credential”. The added tests only cover long samples, so this gap would ship unnoticed. |
Non-blocking Suggestions
None.
Design Notes
The overall shape is sound: deterministic scaffold first, selected-driver text layered on top, and additive provenance metadata instead of mutating the ledger contract. The main weakness is that the new post-generation classifier is still relying on length heuristics that are too aggressive for credential blocking, which undermines the safety guarantee this PR is trying to add.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
The bot's blocking finding on PR Q00#683 noted that short but plausible credentials (`password: hunter2`, DSN `postgres://demo:hunter2@...`) slipped past the classifier because the generic keyword:value rule required 12+ characters and the DSN rule required an 8+ character password. Under brake=on this could auto-send credential leaks. Drop the length thresholds: any explicit ``keyword: value`` or DSN credential pattern with a non-whitespace value is flagged as ``actual answer contains secret or credential``. Add regression coverage for short literal passwords and short-DSN cases. Refs Q00#675 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Q00
left a comment
There was a problem hiding this comment.
Reviewed across OS/UserLevel/Program boundaries, auto scope, and UX complexity. Approving: post-response risk classification is local, deterministic safety enforcement for selected-driver answers and does not broaden the user-facing auto workflow.
|
@ouroboros-agent please review the current head |
There was a problem hiding this comment.
Review — ouroboros-agent[bot]
Verdict: REQUEST_CHANGES
Reviewing commit
5fb5b3afor PR #683
Review record:
b92c4710-116f-4c64-b11d-13085ae98acf
Blocking Findings
| # | File:Line | Severity | Finding |
|### Recovery Notes
First recoverable review artifact generated from codex analysis log.
---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:260 | BLOCKING | The destructive-action classifier now blocks on any destructive verb plus any production_target token appearing anywhere in the answer, without checking that the destructive action is actually aimed at that target. A safe answer like Delete the local test database and re-seed it from production still matches because delete and production/database both appear, so brake=on will hard-block benign remediation guidance. The current tests cover explicit prod-destruction phrases and dev-only SQL, but not this mixed-context false-positive path. |
| 2 | src/ouroboros/auto/driver_answerer.py:188 | BLOCKING | AutoAnswerMetadata is populated in answer(), but apply() immediately delegates to AutoAnswerer.apply(), which only persists prefixed text, blocker entries, and ledger_updates. In the real interview pipeline the AutoAnswer object is discarded after apply(), so the new structured provenance/confidence never survives to later Seed-ready or A-grade gates. As written, the PR adds an audit contract that production code cannot actually consume. |
Non-blocking Suggestions
None.
Design Notes
The overall direction is sound: deterministic scaffold plus post-generation screening is a reasonable way to constrain selected-driver answers. The two issues above are both contract problems, though: one heuristic is still too context-insensitive for a brake gate, and the new metadata object is not wired into any persisted state yet.
Reviewed by ouroboros-agent[bot] via Codex deep analysis
|
Following up on #725 ("Design a UserLevel plugin manager for operational workflows") and the #689 closure: recommend HOLD on this PR (and the sibling #683) until #725 v0 contract lands. What's safe to land independently: the metadata schema part of #675 ( What needs to wait: the classifier itself. The vocabulary categories — "destructive execution intent / credential handling / external side effects / unsupported claims / answers contradicting known repo/user constraints" — are exactly the kind of domain knowledge that #689 was rejected for putting in core, and that #725 wants to move into UserLevel skills via a stable protocol. Concrete suggested redesign (after #725 v0):
Suggest splitting now into:
That way the metadata round-trip tests can land and unblock downstream work, while the classifier doesn't pre-commit core to a specific vocabulary. |
|
Closing this PR per the boundary established in #689 / #725 v0 discussion. The classifier vocabulary in this PR — "destructive execution intent / credential handling / external side effects / unsupported claims / answers contradicting known repo/user constraints" — is the kind of domain knowledge that should live in a UserLevel skill, not in core auto. The intent returns after #725 v0 lands as:
The metadata layer this PR depends on (#682) is core primitive and remains on track to land independently (HOLD lifted there). |
Address the bot's two blocking findings on PR Q00#683: 1. The destructive-action classifier matched any verb plus any production_target token anywhere in the answer, so a benign mixed clause like "Delete the local test database and re-seed it from production" was hard-blocked. Replace the flat search with a clause-scoped match that also suppresses when a dev qualifier (local/test/staging/dev/sandbox/scratch/etc.) appears in the same clause as the verb+target pair. 2. AutoAnswerMetadata was populated in answer() but discarded when apply() delegated to the baseline AutoAnswerer.apply(), so the structured risk/confidence/provenance never reached downstream Seed-ready or A-grade gates. Persist the metadata as a constraint ledger entry inside DriverAutoAnswerer.apply() so the audit contract is actually consumable. Add regression tests covering both: mixed-context clause guidance is not flagged, while production-only destructive guidance still is, and apply() round-trips the metadata into ledger.constraints. Refs Q00#675 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
@ouroboros-agent please review the new head |
|
Withdrawing the bot review-ping above — I missed that this PR was already closed at 06:43:49Z per the #689 / #725 boundary and was still polling for the bot's response on the new head when the closure landed. The architectural pivot makes sense: the classifier vocabulary ("destructive execution intent / credential handling / external side effects / answers contradicting known repo/user constraints") is policy that should live in a UserLevel |
Summary
brake=on, block high-risk actual driver responses before they are sent back to the interview backend.brake=off, continue autopilot while recording combined risk metadata and ledger provenance.Discussion / implementation notes
Validation
UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q→ 10 passedUV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py→ passedUV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py→ passedgit diff --check→ passedCloses #675
Depends on #682
Depends on #672