feat(search): EXP-14 — retrieval-side abstention gate #8
Draft
moralespanitz wants to merge 3 commits into main
The extraction LLM was truncating JSON output at ~14 KB during BEAM Sprint 2 CR mini-slice runs on dense 10-turn chunks. The server log showed "[extractFacts] JSON parse failed (Unterminated string in JSON at position 14152 ...); attempting repair" on 6 chunks of one ingest, which caused iter 7 (first attempt) to crash on conv-3.

The Anthropic max_tokens budget defaults to 4096 in extraction.ts. Raising it to 8192 doubles the headroom for JSON output without changing any other behavior. Cost impact is marginal: Anthropic bills only for tokens actually generated, and extraction rarely uses the full 8192.

Validation: the server is running with this change locally; iter 7 v3 N=3 full-ingest reruns succeed without truncation.

A companion harness mitigation lowered chunk size from 10 to 5 turn-pairs (in atomicmemory-benchmarks PR #8) to reduce the chance of hitting the limit at all. This server-side bump is defense-in-depth.
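The truncation described above can also be detected before the JSON parser fails. The Anthropic Messages API reports `stop_reason: "max_tokens"` when a reply is cut off at the token budget. The sketch below assumes that value has already been copied into a plain object; `ExtractionReply`, `ParseOutcome`, and `parseExtractionReply` are hypothetical names for illustration and are not taken from extraction.ts.

```typescript
// Hypothetical reply shape: stopReason is assumed to carry the API's stop_reason.
interface ExtractionReply {
  stopReason: string | null;
  text: string;
}

type ParseOutcome =
  | { kind: "ok"; value: unknown }
  | { kind: "truncated" }
  | { kind: "invalid"; message: string };

// Classify a reply before attempting any JSON repair.
function parseExtractionReply(reply: ExtractionReply): ParseOutcome {
  // A reply cut off at the token budget is reported as truncated without parsing.
  if (reply.stopReason === "max_tokens") {
    return { kind: "truncated" };
  }
  try {
    return { kind: "ok", value: JSON.parse(reply.text) };
  } catch (err) {
    const message = err instanceof Error ? err.message : String(err);
    return { kind: "invalid", message };
  }
}
```

With a check like this, a caller could retry a truncated chunk at a smaller size instead of sending cut-off output through the repair path.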
Subagent that was supposed to write a plan-only doc also produced preliminary code on this branch. Preserving for later — autoresearch loop will treat as a future iteration candidate. NOT verified, NOT ready for review.
Adds a post-rerank confidence computation that signals when retrieval results are poorly separated and/or absolutely weak. Targets BEAM abstention (ABS) ability where Honcho scores below baseline.
- New retrieval-confidence-gate.ts (~70 LOC) with computeRetrievalConfidence
- Gates on similarity (stable, scale-invariant) not score (rewritten by RRF)
- Four new RuntimeConfig fields, all default-off, allowlisted in INTERNAL_POLICY_CONFIG_FIELDS for config_override A/B testing
- Threaded through search-pipeline → memory-search → routes → response
- Emits retrieval_confidence JSON in search responses when enabled
- Trace event 'low-confidence-gate' fires when low confidence detected
- 10 unit tests covering: disabled, empty, strong separation, narrow+weak, strong-margin override, normalizer/floor overrides
Plan: experiments/exp-14-implementation-plan-2026-04-29.md
EXP-14 — Retrieval-side abstention gate
Target ability: ABS (Abstention)
Status: Implementation complete, tests pass, ready for experimentation
Plan: experiments/exp-14-implementation-plan-2026-04-29.md
What
Adds a post-rerank confidence computation that signals when retrieval results are poorly separated and/or absolutely weak. The BEAM adapter can read this signal and append an abstention instruction block to the answer prompt.
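As a rough illustration of such a confidence computation, the sketch below combines the top-1 margin with the absolute top similarity. The four config field names and their defaults come from this PR; the formula (taking the larger of the two normalized components and comparing it to the floor), the result shape, and the name `sketchRetrievalConfidence` are assumptions, not the contents of retrieval-confidence-gate.ts.

```typescript
// Field names mirror the PR description; everything else here is hypothetical.
interface GateConfig {
  retrievalConfidenceGateEnabled: boolean;
  retrievalConfidenceMarginNormalizer: number;     // PR default: 0.05
  retrievalConfidenceSimilarityNormalizer: number; // PR default: 0.5
  retrievalConfidenceFloor: number;                // PR default: 0.3
}

// Assumed result shape, for illustration only.
interface ConfidenceSketch {
  confidence: number;
  lowConfidence: boolean;
  topSimilarity: number;
  margin: number;
}

const clamp01 = (x: number): number => Math.min(1, Math.max(0, x));

function sketchRetrievalConfidence(
  similarities: number[],
  config: GateConfig,
): ConfidenceSketch | undefined {
  // Gate disabled: emit nothing, so downstream code sees undefined.
  if (!config.retrievalConfidenceGateEnabled) return undefined;
  if (similarities.length === 0) {
    return { confidence: 0, lowConfidence: true, topSimilarity: 0, margin: 0 };
  }
  const sorted = [...similarities].sort((a, b) => b - a);
  const top = sorted[0] ?? 0;
  const second = sorted[1] ?? 0; // assumption: a missing runner-up counts as 0
  const margin = top - second;
  const separation = clamp01(margin / config.retrievalConfidenceMarginNormalizer);
  const strength = clamp01(top / config.retrievalConfidenceSimilarityNormalizer);
  // Assumption: a strong margin can override a weak absolute similarity.
  const confidence = Math.max(separation, strength);
  return {
    confidence,
    lowConfidence: confidence < config.retrievalConfidenceFloor,
    topSimilarity: top,
    margin,
  };
}
```

Under these assumptions, a result set is flagged only when it is both narrowly separated and weak in absolute terms; a product of the two components would be the stricter alternative.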
Why
ABS scored 0/2 in the Stage 7 dry-run. Honcho's ABS (36.3) is below the no-memory baseline (60.0) — the rare BEAM ability where memory retrieval actively hurts. EXP-14 is the only Phase-2 category where success lands above both Honcho and baseline.
How
- Gate on similarity, not score: score is rewritten by RRF, cross-encoder, MMR, and boosts mid-pipeline; similarity is the only stable, scale-invariant signal.
- Runs between applyRankingProtectionStages and selectAndExpandCandidates in search-pipeline.ts: after cross-encoder (top is final), before MMR (which would artificially lower top-1).
- Four new RuntimeConfig fields, all default-off:
  - retrievalConfidenceGateEnabled
  - retrievalConfidenceMarginNormalizer (default 0.05)
  - retrievalConfidenceSimilarityNormalizer (default 0.5)
  - retrievalConfidenceFloor (default 0.3)
- Allowlisted in INTERNAL_POLICY_CONFIG_FIELDS so the BEAM adapter A/Bs per-ability via config_override.
Files changed
- src/services/retrieval-confidence-gate.ts: new file (~70 LOC) with computeRetrievalConfidence
- src/services/search-pipeline.ts: SearchPipelineRuntimeConfig, invoke gate, change return type, emit trace event
- src/services/memory-search.ts: thread retrievalConfidence through to RetrievalResult
- src/services/memory-service-types.ts: add retrievalConfidence?: RetrievalConfidence to RetrievalResult
- src/routes/memories.ts: emit retrieval_confidence in formatSearchResponse
- src/config.ts: RuntimeConfig, env loaders, INTERNAL_POLICY_CONFIG_FIELDS
- src/app/runtime-container.ts: CoreRuntimeConfig
- src/services/__tests__/retrieval-confidence-gate.test.ts: 10 unit tests
Division of labor
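- Server: computes retrieval confidence and emits retrieval_confidence in search responses when enabled
- BEAM adapter: reads retrieval_confidence.low_confidence and appends abstention prompt block
Tests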
- npx tsc --noEmit clean
Rollback
Set retrievalConfidenceGateEnabled: false (default). Server emits no retrieval_confidence; adapter sees undefined; no prompt block appended. Bit-identical to pre-EXP-14.
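The adapter-side behavior, including the rollback case, might look like the sketch below. Only the retrieval_confidence.low_confidence field comes from this PR; `SearchResponseLike`, `ABSTENTION_BLOCK`, `buildAnswerPrompt`, and the wording of the abstention text are hypothetical, and the real BEAM adapter may be written in a different language.

```typescript
// Hypothetical slice of the search response; only the field read here is modeled.
interface SearchResponseLike {
  retrieval_confidence?: { low_confidence: boolean };
}

// Placeholder wording: the actual abstention instruction is not given in the PR.
const ABSTENTION_BLOCK =
  "If the retrieved memories do not contain the answer, say you do not know instead of guessing.";

function buildAnswerPrompt(basePrompt: string, response: SearchResponseLike): string {
  // Gate disabled (field undefined) or confidence not low: prompt stays unchanged.
  if (response.retrieval_confidence?.low_confidence !== true) {
    return basePrompt;
  }
  return `${basePrompt}\n\n${ABSTENTION_BLOCK}`;
}
```

Because an absent field and low_confidence: false take the same branch, disabling the gate leaves the prompt identical to what it was before the field existed.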