feat(auto): classify selected-driver answer risk #683

Closed
shaun0927 wants to merge 6 commits into Q00:stack/auto-capabilities/3-auto-driver-brake from shaun0927:feat/675-auto-answer-post-risk

Conversation

@shaun0927
Collaborator

Summary

  • Add post-response selected-driver answer risk classification for risky generated answer text.
  • With brake=on, block high-risk actual driver responses before they are sent back to the interview backend.
  • With brake=off, continue autopilot while recording combined risk metadata and ledger provenance.
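
For readers skimming the thread, the brake behaviour in the summary can be sketched roughly as below. This is an illustrative sketch only, not the code in `driver_answerer.py`: `GateDecision`, `gate_driver_answer`, and their fields are hypothetical names, and the classifier is assumed to have already produced either `None` or a reason string.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GateDecision:
    """Hypothetical result of gating one selected-driver answer."""

    send: bool  # forward the answer to the interview backend?
    risk_reasons: list[str] = field(default_factory=list)


def gate_driver_answer(risk_label: str | None, *, brake_on: bool) -> GateDecision:
    """Apply the brake policy to an already-classified answer.

    risk_label is None for benign text, otherwise a human-readable reason.
    """
    if risk_label is None:
        return GateDecision(send=True)
    if brake_on:
        # brake=on: hold the risky answer instead of sending it.
        return GateDecision(send=False, risk_reasons=[risk_label])
    # brake=off: keep autopilot going, but record why the answer was risky.
    return GateDecision(send=True, risk_reasons=[risk_label])
```

The point of the split is that the classifier only labels text, while the brake alone decides whether a label blocks or is merely recorded.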

Discussion / implementation notes

Validation

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q → 10 passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • git diff --check → passed

Closes #675
Depends on #682
Depends on #672

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 490f287 for PR #683

Review record: 39967c8f-b67e-4698-9c41-a49d5d47245d

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:239 | BLOCKING | The destructive-action classifier will block many benign answers because production_target includes customer and the action list includes generic verbs like remove. A response such as “remove the customer-facing banner” or “delete the customer help text” now gets labeled actual answer recommends destructive production action, which turns ordinary content/UI edits into approval blockers under brake=on. |
| 2 | src/ouroboros/auto/driver_answerer.py:230 | BLOCKING | The new secret/credential gate has a common false-negative: it only catches a handful of provider-specific prefixes or phrases like access token, but not plain token: / Bearer ... / JWT-style values. That means a driver answer containing a live bearer token can still be auto-sent with brake=off, and even with brake=on it will bypass the new answer-text approval gate entirely. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

| 1 | tests/unit/auto/test_driver_answerer.py:41 | nice-to-have tests | Add regression coverage for both sides of the new classifier: a benign “customer-facing” edit should stay unflagged, and a plain bearer/JWT token should be flagged. The current tests only exercise the happy-path patterns the regex already handles. |

Design Notes

The structured AutoAnswerMetadata addition is reasonable and the intent to preserve deterministic scaffold ledger entries is coherent. The weak point is the new answer-text risk classifier: it is doing policy enforcement with broad regexes, so precision matters, and the current patterns are both over-broad for benign “customer-facing” edits and under-broad for common secret formats.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927 force-pushed the feat/675-auto-answer-post-risk branch from 490f287 to ccedac3 on May 7, 2026 01:50
@shaun0927
Collaborator Author

Addressed the bot review findings on the selected-driver answer-text classifier:

  • Reduced destructive-production false positives by no longer treating generic customer wording and remove as sufficient destructive production evidence.
  • Added detection for plain token: ..., Bearer ..., and JWT-shaped credential values.
  • Added regression coverage for benign customer-facing copy edits and bearer/JWT credential leakage.
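
A rough sketch of what detection for these three shapes could look like is shown below. The patterns and the `classify_credentials` name are assumptions for illustration; the actual regexes in `driver_answerer.py` are not shown in this thread, and only the label string is taken from the review.

```python
from __future__ import annotations

import re

SECRET_LABEL = "actual answer contains secret or credential"

# Hypothetical patterns, one per credential shape named in the bullets above.
_PLAIN_TOKEN = re.compile(r"\btoken\s*[:=]\s*\S+", re.IGNORECASE)
_BEARER = re.compile(r"\bBearer\s+[A-Za-z0-9._~+/-]{16,}=*")
_JWT = re.compile(r"\beyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+")


def classify_credentials(text: str) -> str | None:
    """Return the secret label if any credential shape matches, else None."""
    for pattern in (_PLAIN_TOKEN, _BEARER, _JWT):
        if pattern.search(text):
            return SECRET_LABEL
    return None
```

The Bearer pattern requires a long opaque value after the keyword, so a prose mention such as "use the Bearer scheme" is not flagged in this sketch.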

Validation after the fix:

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q → 12 passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • git diff --check → passed

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit ccedac3 for PR #683

Review record: ccaef606-3d97-49c2-83fe-c1c4e9f58662

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new secret classifier suppresses all generic credential checks whenever the answer contains any placeholder-like word (placeholder, example, dummy, etc.). That means a response such as Use this placeholder password: supersecretprod123 or Example token: eyJ... will not be flagged, so brake=on can still auto-send an actual secret. Because this branch is intended to be the last safety gate for risky driver text, that bypass is a production correctness/security bug and needs coverage. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The change is well-scoped and the metadata/provenance plumbing is clean, but the text-risk classifier is carrying too much policy in a few regex branches. The safety-critical detection rules need tighter ordering and regression tests around mixed benign+risky phrasing, because small heuristic changes can silently defeat the approval gate.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Collaborator Author

Addressed the latest bot blocking finding on commit ccedac3.

Changes:

  • Removed the placeholder/example/dummy short-circuit from the selected-driver answer text credential classifier.
  • Kept benign placeholder-only references non-blocking, but credential assignment patterns, bearer tokens, and JWT-shaped tokens now remain blocking even when the answer also says “placeholder” or “example”.
  • Added regressions for placeholder password and example token bypass cases.
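
The ordering problem behind this fix can be illustrated with a small before/after sketch. Both functions and both patterns are hypothetical stand-ins, not the code from the PR; they only show why a placeholder-word short-circuit defeats the credential check.

```python
from __future__ import annotations

import re

_PLACEHOLDER_WORDS = re.compile(r"\b(?:placeholder|example|dummy)\b", re.IGNORECASE)
_ASSIGNMENT = re.compile(r"\b(?:password|token|secret)\s*[:=]\s*\S+", re.IGNORECASE)


def classify_with_short_circuit(text: str) -> str | None:
    """Buggy ordering: any placeholder word suppresses the credential check."""
    if _PLACEHOLDER_WORDS.search(text):
        return None
    return "secret" if _ASSIGNMENT.search(text) else None


def classify_without_short_circuit(text: str) -> str | None:
    """Fixed ordering: credential shapes are checked unconditionally."""
    return "secret" if _ASSIGNMENT.search(text) else None
```

Text that merely mentions a placeholder stays unflagged in both versions because it contains no assignment-shaped credential, so removing the short-circuit does not by itself add false positives.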

Validation:

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q → 13 passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • git diff --check → passed

@shaun0927 force-pushed the feat/675-auto-answer-post-risk branch from ccedac3 to a34d1bc on May 7, 2026 02:08
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit a34d1bc for PR #683

Review record: af2912ba-5b42-4350-830a-66a98cc9cd2b

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new secret classifier only matches bare labels like token: / password= plus a handful of provider-specific prefixes, so common env-var or DSN style leaks such as OPENAI_API_KEY=..., AWS_SECRET_ACCESS_KEY=..., or DATABASE_URL=postgres://user:pass@... are not blocked. Under brake=on those answers would still be auto-sent even though this code is meant to approval-gate credential leakage from the selected driver. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

| 1 | tests/unit/auto/test_driver_answerer.py:55 | tests | Add regressions for env-var and connection-string credential leaks. The current tests cover bearer/JWT/plain token: cases, but not the much more common FOO_API_KEY=... or DATABASE_URL=... output shape that the classifier currently misses. |

Design Notes

The patch is directionally good: it adds structured provenance and a second risk gate on actual selected-driver output instead of relying only on question-level heuristics. The main weakness is that the text classifier is still too shape-specific for secrets, so the safety boundary is not yet reliable enough for brake=on.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927 force-pushed the feat/675-auto-answer-post-risk branch from a34d1bc to cafb38e on May 7, 2026 02:26
@shaun0927
Collaborator Author

Addressed the latest bot blocking finding on env-var and DSN-shaped credential leaks.

Changes:

  • Added selected-driver answer text detection for common uppercase env-var secret shapes such as API key, access key, secret, token, password, and credential names.
  • Added detection for connection-string/DSN shapes containing an inline password component.
  • Added regression coverage for env-var and DSN leak forms alongside the existing bearer/JWT/plain-token cases.
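
As a sketch of the two new shapes, the patterns below match uppercase env-var assignments with secret-like names and URLs with an inline password. The regexes and the `has_env_or_dsn_secret` helper are assumptions made for illustration, not the patterns that landed in the commit.

```python
import re

# Uppercase env-var names that end in a secret-like suffix, followed by "=value".
_ENV_SECRET = re.compile(
    r"\b(?:[A-Z][A-Z0-9_]*_)?"
    r"(?:API_KEY|ACCESS_KEY|SECRET|TOKEN|PASSWORD|CREDENTIALS?)\s*=\s*\S+"
)
# scheme://user:password@host, i.e. a URL with an inline password component.
_DSN_WITH_PASSWORD = re.compile(r"\b[a-z][a-z0-9+.-]*://[^\s:/@]+:[^\s@/]+@")


def has_env_or_dsn_secret(text: str) -> bool:
    """True when the text contains an env-var secret or a DSN with a password."""
    return bool(_ENV_SECRET.search(text) or _DSN_WITH_PASSWORD.search(text))
```

The DSN pattern requires an `@` after the `user:password` pair, so an ordinary `host:port` URL is not treated as a credential in this sketch.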

Validation:

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • git diff --check → passed

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit cafb38e for PR #683

Review record: 31671fc1-f45d-4286-82d5-35799b24a44f

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:229 | BLOCKING | The new credential classifier will false-positive on benign explanatory sentences because the generic pattern treats any 12+ character token after `password/token/secret ... is |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The overall direction is sound: preserving the deterministic scaffold while adding structured provenance and post-generation risk gating is a reasonable way to keep the selected-driver path auditable. The main weakness is that the new text-risk classifier is still too regex-broad for natural-language explanations, so it can interfere with the approval flow rather than just catching genuine leaks.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927 force-pushed the feat/675-auto-answer-post-risk branch from cafb38e to d73de88 on May 7, 2026 02:33
@shaun0927
Collaborator Author

Addressed the latest bot blocking finding on natural-language false positives.

Changes:

  • Narrowed the generic credential rule to assignment-shaped values (= / :) instead of natural-language is / are phrasing.
  • Kept env-var, DSN, bearer, JWT, and explicit label assignment detections intact.
  • Added a regression that benign explanatory text like “The password is generated during setup” is not classified as a credential leak.
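
The narrowing can be illustrated with two hypothetical patterns. Neither is the rule from the PR (the finding mentions a length threshold that this sketch ignores); they only contrast natural-language phrasing with assignment-shaped values.

```python
import re

# Broad rule: also fires on "password is <word>" style prose.
_BROAD = re.compile(
    r"\b(?:password|token|secret)\s*(?:[:=]|\bis\b|\bare\b)\s*\S+", re.IGNORECASE
)
# Narrow rule: requires an explicit "=" or ":" between label and value.
_NARROW = re.compile(r"\b(?:password|token|secret)\s*[:=]\s*\S+", re.IGNORECASE)


def broad_match(text: str) -> bool:
    return bool(_BROAD.search(text))


def narrow_match(text: str) -> bool:
    return bool(_NARROW.search(text))
```

With the narrow rule, explanatory prose passes while `password=...` and `password: ...` assignments are still caught.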

Validation:

  • UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py -q → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py → passed
  • git diff --check → passed

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit d73de88 for PR #683

Review record: beaea29a-1fdd-4a0e-b309-0d48b5259ac1

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

The change is directionally sound: it keeps the deterministic scaffold as the ledger source of truth while adding explicit provenance and a second-stage classifier for risky selected-driver output. That separation preserves existing contracts and makes downstream audit surfaces easier to build.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Catch any rm -rf <path> not just bare /, and add regression tests
for known provider tokens (AKIA/ghp_/github_pat/xoxb-/sk-), DROP/
TRUNCATE TABLE, rm -rf with non-root paths, and benign credential
phrasing to keep the brake=on gate precise without silent bypass.

Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 7dc5a55 for PR #683

Review record: e65707ad-1499-4c8d-9315-db426681e41c

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:257 | BLOCKING | The new answer-text classifier treats any rm -rf command, DROP TABLE, or TRUNCATE TABLE as a destructive production action even when the answer is clearly about local build cleanup or a development database. In brake=on this now blocks ordinary answers like rm -rf ./build && rebuild, and in brake=off it unnecessarily downgrades confidence and records risk provenance. That is a behavior regression from the stated “high-impact/risky drafts” policy, and the added tests explicitly lock the false positive in. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The metadata/provenance addition is straightforward and useful, but the selected-driver risk classifier is still too heuristic-heavy. The new credential checks are directionally good; the destructive-action branch needs tighter scoping so only production- or data-destruction advice is gated.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

Revert rm -rf detection back to bare-root and drop the redundant
DROP/TRUNCATE TABLE branch. The verb+production_target check
already gates production destruction (DROP DATABASE, delete prod
DB), while local dev cleanup like rm -rf ./build, rm -rf
node_modules, and DROP TABLE users in dev no longer false-positive
under brake=on. Tests inverted to lock the corrected policy.

Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner

@Q00 Q00 left a comment

Reviewed. This selected-driver answer-risk classifier targets a stack branch and already has requested changes. Please resolve the stack-level feedback before approval.

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: APPROVE

Reviewing commit 422ea25 for PR #683

Review record: b1688285-9850-43b7-8d8d-fc9bf6c4eb6b

Blocking Findings

No in-scope blocking findings remained after policy filtering.

Non-blocking Suggestions

None.

Design Notes

The patch keeps the existing deterministic scaffold as the source of ledger state while adding explicit provenance and post-generation risk screening for selected-driver text. That separation is sound, and the new tests cover both the classifier rules and the brake-on/brake-off integration paths for risky driver output.

Recovery Notes

First recoverable review artifact generated from codex analysis log.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@Q00 added the OS (Core engine, state machine, internal pipeline, and system-level behavior) and Safety (Risk, guardrail, policy, and regulated-topic behavior) labels on May 7, 2026
@shaun0927
Collaborator Author

Status update on this PR after the regression-and-fix cycle on 7dc5a55d → 422ea251.

Bot review trajectory

| Commit | Verdict | Reason |
|--------|---------|--------|
| d73de88a | APPROVE | initial classifier landing |
| 7dc5a55d | REQUEST_CHANGES | I had broadened rm -rf detection to any path and kept the `(drop\|truncate)\s+(database\|table)` branch; that flagged local dev cleanup like rm -rf ./build and DROP TABLE users as production destruction |
| 422ea251 | APPROVE | reverted rm -rf to bare-root only, dropped the redundant SQL branch (production destruction is already covered by the existing destructive_verb + production_target check), inverted the relevant tests to lock the corrected policy |
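
A minimal sketch of the "bare-root only" rule is shown below, assuming a single regex for illustration. The pattern and helper name are hypothetical, and variants such as `rm -fr` or `rm -rf /*` are deliberately out of scope here.

```python
import re

# Matches "rm -rf /" only when the path is exactly the filesystem root:
# the slash must be followed by whitespace or the end of the text.
_RM_RF_ROOT = re.compile(r"\brm\s+-rf\s+/(?=\s|$)")


def is_bare_root_rm(text: str) -> bool:
    """True only for rm -rf aimed at "/" itself, not at a sub-path."""
    return bool(_RM_RF_ROOT.search(text))
```

Local cleanup commands fall through because their path does not start with a lone slash.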

The bot's design note on 422ea251: "the new tests cover both the classifier rules and the brake-on/brake-off integration paths for risky driver output."

What this PR is (closes #675)

Post-generation risk screening for selected-driver ooo auto interview answers. The deterministic scaffold remains the ledger source of truth; on top of that, the classifier inspects the actual driver text and the brake decides:

  • brake=on: block answers that contain real secrets/credentials or production-destruction intent before they reach the interview backend.
  • brake=off: auto-send, but persist combined risk + provenance metadata so the later Seed-ready / A-grade gates remain the safety net.

Why the current head is ready from the contributor side

  • Policy now matches the bot's guidance verbatim: only production- or data-destruction advice is gated. rm -rf node_modules, rm -rf ./build, DROP TABLE users CASCADE, TRUNCATE TABLE accounts all return None. Bare rm -rf /, DROP DATABASE app, delete the production database, and other production-context destruction remain flagged via the unchanged destructive_verb + production_target check (database and db are in production_target).
  • Credential-side hardening retained: provider tokens (AKIA / ghp_ / github_pat_ / xoxb- / sk-, written as split literals so GitHub secret scanning does not false-positive on the test fixtures), env-var (OPENAI_API_KEY=…, AWS_SECRET_ACCESS_KEY=…), DSN (DATABASE_URL=postgres://user:pass@…), Bearer / JWT, and placeholder-bypass attempts all return the secret label. Benign phrasing (Bearer-style auth pattern, past-tense password mention, OAuth Bearer general mention) returns None.
  • No persisted schema growth beyond the additive AutoAnswerMetadata already approved in feat(auto): add selected-driver answer metadata #682. The deterministic-scaffold ledger contract is preserved.
  • The classifier is fully deterministic and local — no external calls.
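
The "split literals" trick mentioned for the test fixtures can be sketched as follows. The fixture values, the pattern, and the helper name are made up for illustration; the prefixes and lengths assume the usual AWS access key ID and classic GitHub token shapes.

```python
import re

# Build provider-shaped fake tokens from fragments so the source file never
# contains one contiguous string that a secret scanner would recognise.
FAKE_AWS_KEY = "AK" + "IA" + "EXAMPLEEXAMPLE00"
FAKE_GITHUB_TOKEN = "gh" + "p_" + "x" * 36

# Hypothetical provider-prefix pattern covering the two fixtures above.
_PROVIDER_TOKEN = re.compile(r"\b(?:AKIA[0-9A-Z]{16}|ghp_[A-Za-z0-9]{36})\b")


def looks_like_provider_token(text: str) -> bool:
    """True when the text contains an AWS-key-shaped or ghp_-shaped value."""
    return bool(_PROVIDER_TOKEN.search(text))
```

At runtime the fragments are joined into the full token shape, so the classifier is exercised on realistic input even though the file on disk only holds pieces.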

Verification on 422ea251

UV_CACHE_DIR=/tmp/uv-cache uv run pytest tests/unit/auto/test_driver_answerer.py tests/unit/auto/test_interview_pipeline.py -q
  → 89 passed

UV_CACHE_DIR=/tmp/uv-cache uv run ruff check src tests
  → All checks passed

UV_CACHE_DIR=/tmp/uv-cache uv run ruff format --check src/ouroboros/auto/driver_answerer.py tests/unit/auto/test_driver_answerer.py
  → 2 files already formatted

On the stack situation

I read your earlier note about resolving the upstream stack first. That is squarely a maintainer call — the prerequisite #665 / #672 chain isn't something I can drive from a fork. This comment is just to flag that the contributor-side surface is settled and the bot is back to APPROVE on the latest commit, so whenever the stack chain is ready (or if you'd prefer me to rebase #682 + #683 directly onto main once the selected-driver answerer is on main), the code itself is in a re-reviewable state. Re-requesting your review here when convenient.

Resolves conflict in driver_answerer.py / test_driver_answerer.py
between the stack/3 risk-gate scope fix (974f156) and the
post-response classifier branch. Keep both:
- Tightened scope-add regex from stack/3 (no longer false-blocks
  CRUD wording like "How do users add a task?").
- Post-response classifier and metadata threading from Q00#682/Q00#683.

Refs Q00#672
Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit b896e22 for PR #683

Review record: 4c0111b1-b538-4c75-a63a-2901c8df2451

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:234 | BLOCKING | The new selected-driver secret classifier still lets short but perfectly plausible credentials through because the generic assignment rule requires 12+ characters and the DSN rule requires an 8+ character password. password: hunter2 and postgres://demo:hunter2@... both bypass classify_driver_answer_text_risk, so with brake=on they would be auto-sent instead of blocked even though the policy here is “actual answer contains secret or credential”. The added tests only cover long samples, so this gap would ship unnoticed. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The overall shape is sound: deterministic scaffold first, selected-driver text layered on top, and additive provenance metadata instead of mutating the ledger contract. The main weakness is that the new post-generation classifier is still relying on length heuristics that are too aggressive for credential blocking, which undermines the safety guarantee this PR is trying to add.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

The bot's blocking finding on PR Q00#683 noted that short but plausible
credentials (`password: hunter2`, DSN `postgres://demo:hunter2@...`)
slipped past the classifier because the generic keyword:value rule
required 12+ characters and the DSN rule required an 8+ character
password. Under brake=on this could auto-send credential leaks.

Drop the length thresholds: any explicit ``keyword: value`` or DSN
credential pattern with a non-whitespace value is flagged as
``actual answer contains secret or credential``. Add regression
coverage for short literal passwords and short-DSN cases.

Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Owner

@Q00 Q00 left a comment

Reviewed across OS/UserLevel/Program boundaries, auto scope, and UX complexity. Approving: post-response risk classification is local, deterministic safety enforcement for selected-driver answers and does not broaden the user-facing auto workflow.

@shaun0927
Collaborator Author

@ouroboros-agent please review the current head 5fb5b3ac, which addresses the previous blocking finding by dropping length thresholds ({12,} for keyword:value / env-var, {8,} for DSN password) so short literal credentials and short DSN passwords are no longer let through.
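
The effect of dropping the `{12,}` and `{8,}` quantifiers can be sketched with simplified patterns. These are hypothetical reductions of the rules described above, not the regexes from the commit.

```python
import re

# Before: minimum value lengths let short credentials slip through.
_PASSWORD_MIN_12 = re.compile(r"\bpassword\s*[:=]\s*\S{12,}", re.IGNORECASE)
_DSN_MIN_8 = re.compile(r"://[^\s:/@]+:[^\s@/]{8,}@")

# After: any non-whitespace value counts.
_PASSWORD_ANY = re.compile(r"\bpassword\s*[:=]\s*\S+", re.IGNORECASE)
_DSN_ANY = re.compile(r"://[^\s:/@]+:[^\s@/]+@")


def flagged_before(text: str) -> bool:
    return bool(_PASSWORD_MIN_12.search(text) or _DSN_MIN_8.search(text))


def flagged_after(text: str) -> bool:
    return bool(_PASSWORD_ANY.search(text) or _DSN_ANY.search(text))
```

The two examples from the finding, `password: hunter2` and `postgres://demo:hunter2@...`, are missed by the thresholded rules and caught once the thresholds are removed.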

Contributor

@ouroboros-agent ouroboros-agent Bot left a comment

Review — ouroboros-agent[bot]

Verdict: REQUEST_CHANGES

Reviewing commit 5fb5b3a for PR #683

Review record: b92c4710-116f-4c64-b11d-13085ae98acf

Blocking Findings

| # | File:Line | Severity | Finding |
|---|-----------|----------|---------|
| 1 | src/ouroboros/auto/driver_answerer.py:260 | BLOCKING | The destructive-action classifier now blocks on any destructive verb plus any production_target token appearing anywhere in the answer, without checking that the destructive action is actually aimed at that target. A safe answer like Delete the local test database and re-seed it from production still matches because delete and production/database both appear, so brake=on will hard-block benign remediation guidance. The current tests cover explicit prod-destruction phrases and dev-only SQL, but not this mixed-context false-positive path. |
| 2 | src/ouroboros/auto/driver_answerer.py:188 | BLOCKING | AutoAnswerMetadata is populated in answer(), but apply() immediately delegates to AutoAnswerer.apply(), which only persists prefixed text, blocker entries, and ledger_updates. In the real interview pipeline the AutoAnswer object is discarded after apply(), so the new structured provenance/confidence never survives to later Seed-ready or A-grade gates. As written, the PR adds an audit contract that production code cannot actually consume. |

Recovery Notes

First recoverable review artifact generated from codex analysis log.

Non-blocking Suggestions

None.

Design Notes

The overall direction is sound: deterministic scaffold plus post-generation screening is a reasonable way to constrain selected-driver answers. The two issues above are both contract problems, though: one heuristic is still too context-insensitive for a brake gate, and the new metadata object is not wired into any persisted state yet.


Reviewed by ouroboros-agent[bot] via Codex deep analysis

@shaun0927
Collaborator Author

Following up on #725 ("Design a UserLevel plugin manager for operational workflows") and the #689 closure: recommend HOLD on this PR (and the sibling #683) until #725 v0 contract lands.

What's safe to land independently: the metadata schema part of #675 (risk_level, confidence, risk_reasons, provenance_sources fields on the answer record). That's a core primitive — durable state and provenance — and it does not commit to any particular classifier.
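
To make the metadata-only part concrete, here is a rough sketch of such a record and a JSON round-trip. The four field names come from the paragraph above; the class name, field types, defaults, and helper functions are assumptions and may not match `AutoAnswerMetadata` in #682.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AnswerRiskMetadata:
    """Hypothetical stand-in for the structured answer metadata."""

    risk_level: str = "low"
    confidence: float = 1.0
    risk_reasons: tuple[str, ...] = ()
    provenance_sources: tuple[str, ...] = ()


def to_json(meta: AnswerRiskMetadata) -> str:
    """Serialise the record so it can be stored next to the answer."""
    return json.dumps(asdict(meta), sort_keys=True)


def from_json(payload: str) -> AnswerRiskMetadata:
    """Rebuild the record; JSON lists are converted back to tuples."""
    raw = json.loads(payload)
    return AnswerRiskMetadata(
        risk_level=raw["risk_level"],
        confidence=float(raw["confidence"]),
        risk_reasons=tuple(raw["risk_reasons"]),
        provenance_sources=tuple(raw["provenance_sources"]),
    )
```

Nothing in this shape depends on how the risk was computed, which is why it can be separated from the classifier.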

What needs to wait: the classifier itself. The vocabulary categories — "destructive execution intent / credential handling / external side effects / unsupported claims / answers contradicting known repo/user constraints" — are exactly the kind of domain knowledge that #689 was rejected for putting in core, and that #725 wants to move into UserLevel skills via a stable protocol.

Concrete suggested redesign (after #725 v0):

  • core: RiskAssessor protocol (likely the same protocol feat(auto): block risky-fallback answers for regulated topics #695 will use), default no-op implementation.
  • core: brake-on/brake-off behavior unchanged — it just consults whatever RiskAssessor is registered.
  • skill: selected-driver-risk-default ships with the keyword categories from this PR. Users can install richer assessors without patching core.
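
One way the proposed split could look is sketched below. The `RiskAssessor` name comes from the list above, but the method signature, the `NoOpRiskAssessor` and `KeywordRiskAssessor` classes, and `should_block` are assumptions; the #725 v0 contract does not exist yet in this thread.

```python
from __future__ import annotations

from typing import Optional, Protocol


class RiskAssessor(Protocol):
    """Hypothetical core protocol: return a risk reason, or None if benign."""

    def assess(self, answer_text: str) -> Optional[str]: ...


class NoOpRiskAssessor:
    """Core default: never flags anything."""

    def assess(self, answer_text: str) -> Optional[str]:
        return None


class KeywordRiskAssessor:
    """Stand-in for a skill-provided assessor carrying its own vocabulary."""

    def __init__(self, phrases: tuple[str, ...]) -> None:
        self._phrases = phrases

    def assess(self, answer_text: str) -> Optional[str]:
        lowered = answer_text.lower()
        for phrase in self._phrases:
            if phrase in lowered:
                return f"matched risky phrase: {phrase}"
        return None


def should_block(answer_text: str, assessor: RiskAssessor, *, brake_on: bool) -> bool:
    """Core brake logic: consult whichever assessor is registered."""
    return brake_on and assessor.assess(answer_text) is not None
```

With this shape, core only knows the protocol and the brake, while the vocabulary lives entirely in the assessor implementation.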

Suggest splitting now into:

  • this PR's metadata-only delta (lands), and
  • the classifier delta (held until v0).

That way the metadata round-trip tests can land and unblock downstream work, while the classifier doesn't pre-commit core to a specific vocabulary.

@shaun0927
Collaborator Author

Closing this PR per the boundary established in #689 / #725 v0 discussion.

The classifier vocabulary in this PR — "destructive execution intent / credential handling / external side effects / unsupported claims / answers contradicting known repo/user constraints" — is the kind of domain knowledge that should live in a UserLevel skill, not in core auto.

The intent returns after #725 v0 lands as:

  1. core RiskAssessor protocol (likely the same protocol feat(auto): block risky-fallback answers for regulated topics #695's redesign will use), default no-op,
  2. a default-shipping selected-driver-risk-default skill carrying the current keyword categories,
  3. the brake-on/brake-off contract from Add post-response risk and confidence tagging for selected-driver auto answers #675 stays unchanged — it just consults whatever assessor is registered.

The metadata layer this PR depends on (#682) is core primitive and remains on track to land independently (HOLD lifted there).

Refs #725, #675.

@shaun0927 shaun0927 closed this May 7, 2026
shaun0927 added a commit to shaun0927/ouroboros that referenced this pull request May 7, 2026
Address the bot's two blocking findings on PR Q00#683:

1. The destructive-action classifier matched any verb plus any
   production_target token anywhere in the answer, so a benign mixed
   clause like "Delete the local test database and re-seed it from
   production" was hard-blocked. Replace the flat search with a
   clause-scoped match that also suppresses when a dev qualifier
   (local/test/staging/dev/sandbox/scratch/etc.) appears in the same
   clause as the verb+target pair.

2. AutoAnswerMetadata was populated in answer() but discarded when
   apply() delegated to the baseline AutoAnswerer.apply(), so the
   structured risk/confidence/provenance never reached downstream
   Seed-ready or A-grade gates. Persist the metadata as a constraint
   ledger entry inside DriverAutoAnswerer.apply() so the audit
   contract is actually consumable.

Add regression tests covering both: mixed-context clause guidance is
not flagged, while production-only destructive guidance still is, and
apply() round-trips the metadata into ledger.constraints.

Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shaun0927
Collaborator Author

@ouroboros-agent please review the new head f2653956 which addresses the two blocking findings on 5fb5b3ac: (1) destructive-action gate is now clause-scoped with dev-qualifier suppression so mixed answers like "Delete the local test database and re-seed it from production" are no longer hard-blocked, and (2) DriverAutoAnswerer.apply() now persists AutoAnswerMetadata (risk / confidence / provenance) into ledger.constraints so downstream Seed-ready and A-grade gates can consume the audit contract. New regression tests cover both paths.
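
The clause-scoped matching with dev-qualifier suppression can be sketched roughly as below. The word lists and clause boundaries are assumptions chosen for illustration (the qualifier list borrows from the commit message), not the vocabulary in `driver_answerer.py`.

```python
import re

# Hypothetical vocabularies for the sketch.
_VERB = re.compile(r"\b(?:delete|drop|destroy|wipe)\b", re.IGNORECASE)
_TARGET = re.compile(r"\b(?:production|prod|database|db)\b", re.IGNORECASE)
_DEV = re.compile(r"\b(?:local|test|staging|dev|sandbox|scratch)\b", re.IGNORECASE)
_CLAUSE_BOUNDARY = re.compile(r"[.;,!?\n]|\b(?:and|then|but)\b", re.IGNORECASE)


def flat_match(text: str) -> bool:
    """Old behaviour: verb and target anywhere in the answer."""
    return bool(_VERB.search(text) and _TARGET.search(text))


def clause_scoped_match(text: str) -> bool:
    """New behaviour: verb and target in one clause, with no dev qualifier."""
    for clause in _CLAUSE_BOUNDARY.split(text):
        if _VERB.search(clause) and _TARGET.search(clause) and not _DEV.search(clause):
            return True
    return False
```

For the sentence quoted above, the flat check fires because "Delete" and "production" both appear, while the clause-scoped check sees "local test" in the same clause as the verb and stays quiet.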

shaun0927 added a commit to shaun0927/ouroboros that referenced this pull request May 7, 2026
Empty commit to force GitHub to re-evaluate the PR head; the previous
push of f265395 advanced the branch on shaun0927/ouroboros but the
PR Q00#683 head_sha did not refresh.

Refs Q00#675

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@shaun0927
Collaborator Author

Withdrawing the bot review-ping above — I missed that this PR was already closed at 06:43:49Z per the #689 / #725 boundary and was still polling for the bot's response on the new head when the closure landed. The architectural pivot makes sense: the classifier vocabulary ("destructive execution intent / credential handling / external side effects / answers contradicting known repo/user constraints") is policy that should live in a UserLevel selected-driver-risk-default skill consulted via the future core RiskAssessor protocol, not as keyword regexes inside core auto. The brake-on/brake-off contract from #675 stays unchanged on the auto side, and #682's metadata primitive lands independently — both on track. No further action needed here.

Labels

OS: Core engine, state machine, internal pipeline, and system-level behavior
Safety: Risk, guardrail, policy, and regulated-topic behavior

3 participants