feat(rag-server): extend agentic eval evidence expectations by HendrikReh · Pull Request #43 · HendrikReh/apex

HendrikReh · 2026-04-03T21:30:19Z

Summary

Extend the offline agentic eval benchmark schema to support richer evidence expectations via optional required_evidence, supporting_evidence, and min_expected_recall fields.
Update eval gating logic so required evidence is enforced strictly while expected-pool recall is thresholded, enabling multi-document cases without forcing all-or-nothing recall.
Preserve backward compatibility: legacy expected_evidence-only cases remain strict by default unless an explicit min_expected_recall override is provided.
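
As a rough illustration of the gating described in this summary, the sketch below enforces required evidence strictly and applies a threshold to recall over the expected pool. The struct `EvidenceExpectation`, the function `passes_gate`, and the helper `set` are hypothetical names chosen for this example; they are not taken from agentic_eval.rs, and the sketch assumes the expected pool is non-empty.

```rust
use std::collections::BTreeSet;

/// Hypothetical normalized expectation; field names follow the PR text,
/// the struct itself is an assumption made for this sketch.
struct EvidenceExpectation {
    required_documents: BTreeSet<String>,
    expected_documents: BTreeSet<String>,
    min_expected_recall: f32,
}

/// Returns true when every required document was retrieved and recall over
/// the expected pool meets the threshold. Assumes a non-empty expected pool.
fn passes_gate(expected: &EvidenceExpectation, retrieved: &BTreeSet<String>) -> bool {
    // Gate 1: required evidence is all-or-nothing.
    let required_hits = expected.required_documents.intersection(retrieved).count();
    if required_hits != expected.required_documents.len() {
        return false;
    }
    // Gate 2: recall over the whole expected pool is thresholded.
    let expected_hits = expected.expected_documents.intersection(retrieved).count();
    let recall = expected_hits as f32 / expected.expected_documents.len() as f32;
    recall >= expected.min_expected_recall
}

/// Small helper for building document-id sets in examples.
fn set(ids: &[&str]) -> BTreeSet<String> {
    ids.iter().map(|s| s.to_string()).collect()
}
```

With one required document out of a pool of four and a threshold of 0.5, retrieving the required document plus one other passes, while retrieving three documents without the required one fails.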

Related Issue

Beads: apex-e0i

Scope

Changed files

crates/rag-server/tests/agentic_eval.rs - add extended benchmark schema parsing, evidence expectation normalization, threshold-based recall checks, and focused unit tests.
docs/howto/testing.md - document legacy vs extended benchmark expectation fields for just agentic-eval.

Out of scope

Changing route selection heuristics or retrieval profile behavior.
Modifying benchmark corpus content in data/evals/agentic_search_v1/corpus.
Changing /chat or /agents/* API contracts.

Validation

Check	Command	Result
Format	`cargo fmt`	pass
Eval tests	`cargo test -p rag-server --test agentic_eval`	pass (3 passed, 1 ignored benchmark)
Compilation	`cargo check -p rag-server`	pass
Full suite	`just test`	not run -- user decision
Integration suite	`just integration-tests`	not run -- user decision

Risk and Rollback

Risk: medium -- benchmark acceptance semantics changed, so misconfigured thresholds could mask retrieval regressions if cases are authored carelessly.
Rollback: revert commit 61abaa3 (or revert crates/rag-server/tests/agentic_eval.rs and docs/howto/testing.md) to restore strict legacy recall behavior.

Reviewer Focus

Verify default behavior remains strict for existing benchmark entries that only define expected_evidence.
Verify the new required_evidence/min_expected_recall interaction is mathematically and operationally correct for multi-document benchmark expansion.

Generated with Codex

HendrikReh · 2026-04-03T21:31:33Z

@claude

claude · 2026-04-03T21:31:48Z

Claude finished @HendrikReh's task in 1m 17s

PR Review

Summary

The design is sound and backward compatibility is correctly preserved for legacy expected_evidence-only cases. The two-pass assertion (strict required-doc check + threshold recall check) is a clean approach. A few issues are worth addressing before merge.

Issues

1. supporting_evidence-only case silently defaults to 1.0 recall (bug / footgun)

agentic_eval.rs:69-74 — when has_explicit_buckets = true but required_evidence is empty (e.g. you only provide supporting_evidence), required_documents is empty and default_min_expected_recall is 1.0:

let default_min_expected_recall = if has_explicit_buckets {
    if required_documents.is_empty() {
        1.0  // ← ALL supporting docs must be retrieved
    } else { ... }

A benchmark case like { supporting_evidence: ["doc-a", "doc-b"] } silently requires 100% recall of those supporting docs, which contradicts the purpose of "supporting" (optional/secondary evidence). If this is intentional, it should be documented and tested. If not, the default for the supporting-only case should be 0.0 (or some fraction).
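
To make the difference concrete, here is a minimal sketch that contrasts the branch quoted above with a plain required/total ratio. Both function names are hypothetical, only the explicit-bucket path is modeled, and set sizes are passed as plain integers rather than the real document sets.

```rust
/// Default threshold following the branch quoted in this review:
/// an empty required set falls back to 1.0.
fn default_recall_as_reviewed(required: usize, total: usize) -> f32 {
    if required == 0 {
        1.0 // every supporting document becomes mandatory
    } else {
        required as f32 / total as f32
    }
}

/// Plain required/total ratio with no special case.
/// Assumes total > 0.
fn default_recall_ratio(required: usize, total: usize) -> f32 {
    required as f32 / total as f32
}
```

For a supporting-only case with two documents, `default_recall_as_reviewed(0, 2)` returns 1.0 while `default_recall_ratio(0, 2)` returns 0.0. The two functions agree whenever `required` is non-zero.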

2. Mixed-schema footgun: expected_evidence silently expands the pool

agentic_eval.rs:50-58 — in extended mode, expected_documents is the union of expected_evidence + required_evidence + supporting_evidence. If a benchmark author adds required_evidence to an existing case without clearing expected_evidence, the extra docs dilute the threshold without being tracked as either required or supporting. The default_min_expected_recall becomes required_count / enlarged_total, which will be lower than intended.

This could lead to benchmark cases silently relaxing over time without a reviewer noticing. Consider either:

Emitting an assert! or warning when both expected_evidence and explicit buckets are non-empty, or
Treating mixed use as a configuration error
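
The dilution effect can be shown with a small worked example. This is a sketch under the assumption that the pool is the union of all three fields, as described above; `default_threshold` is a hypothetical helper, not a function from the PR.

```rust
use std::collections::BTreeSet;

/// Computes required/total where the pool is the union of all three fields.
/// Assumes the union is non-empty.
fn default_threshold(required: &[&str], supporting: &[&str], legacy_expected: &[&str]) -> f32 {
    // Union of legacy and explicit buckets, deduplicated by the set.
    let pool: BTreeSet<&str> = legacy_expected
        .iter()
        .chain(required)
        .chain(supporting)
        .copied()
        .collect();
    let required_set: BTreeSet<&str> = required.iter().copied().collect();
    required_set.len() as f32 / pool.len() as f32
}
```

With two required and two supporting documents the default is 2/4 = 0.5. If four leftover `expected_evidence` entries remain on the same case, the pool grows to eight and the default drops to 2/8 = 0.25, even though the author never asked for a looser threshold.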

3. Missing unit test for supporting_evidence-only case

There's no test covering { required_evidence: [], supporting_evidence: ["a", "b"], min_expected_recall: None }. Given the non-obvious 1.0 default behavior noted above, this is the case most likely to surprise future benchmark authors. One focused unit test would document and pin the behavior.

Minor / Nits

4. count_expected_hits signature change (&[String] → &BTreeSet<String>) — clean improvement, no concerns.

5. Recall assertion error message (agentic_eval.rs:271-276) is good. Consider also logging which specific documents were missing to make CI failures easier to triage, e.g. "missing: {:?}", expected.required_documents.difference(&routed_docs).collect::<Vec<_>>().

6. docs/howto/testing.md update is accurate and sufficient. The note about the default threshold being required/total matches the implementation.
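
For nit 5, one possible shape of the missing-document report is sketched below. `missing_required` and `missing_message` are hypothetical helpers; the only library call relied on is `BTreeSet::difference`, which yields the elements of the first set that are absent from the second, in ascending order.

```rust
use std::collections::BTreeSet;

/// Required documents that were not retrieved, in sorted order.
fn missing_required(required: &BTreeSet<String>, routed: &BTreeSet<String>) -> Vec<String> {
    required.difference(routed).cloned().collect()
}

/// Formats the missing documents for an assertion message.
fn missing_message(required: &BTreeSet<String>, routed: &BTreeSet<String>) -> String {
    format!("missing required evidence: {:?}", missing_required(required, routed))
}
```

Passing the result of `missing_message` as the message argument of the existing assertion would surface the exact document ids in CI output.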

Reviewer Focus Items (from PR description)

Default behavior for legacy entries — confirmed strict: required_documents == expected_documents, min_expected_recall == 1.0.
required_evidence/min_expected_recall interaction — mathematically correct. required_hits == required_documents.len() is a strict equality check (all-or-nothing for required). routed_recall >= min_expected_recall is the threshold gate on the full pool. The two checks are independent and compose cleanly.

Verdict: Approve pending resolution of issue #1 (supporting-only default) and #3 (missing unit test). Issue #2 (mixed schema) is lower priority but worth a comment in the code or a config assertion.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61abaa3ae6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T21:32:37Z

+            if required_documents.is_empty() {
+                1.0


Use required/total default when no required evidence exists

When explicit evidence buckets are used and required_evidence is empty, this branch sets the default min_expected_recall to 1.0, which effectively makes all supporting evidence mandatory. That contradicts the documented/default formula (required/total) and changes the semantics for cases that intentionally have only supporting_evidence (where the default should evaluate to 0.0). As written, such benchmarks will fail unless every supporting document is retrieved or authors remember to override min_expected_recall manually.

chatgpt-codex-connector · 2026-04-03T21:32:45Z

+            if required_documents.is_empty() {
+                1.0


Compute default recall threshold as required/total

When required_evidence/supporting_evidence are present but required_evidence is empty, this branch sets default_min_expected_recall to 1.0, which makes every supporting document effectively mandatory. That contradicts the documented extended-schema default (required/total in docs/howto/testing.md) and will cause benchmark cases that intentionally use only supporting evidence to fail unless authors remember to add an explicit override. Using the same required_documents.len() / expected_documents.len() rule here preserves the stated semantics and avoids surprising strictness.

HendrikReh · 2026-04-03T21:33:48Z

@claude review this PR

HendrikReh · 2026-04-03T21:33:48Z

@claude review

claude · 2026-04-03T21:34:04Z

Claude finished @HendrikReh's task in 1m 21s

PR Review

Summary

The two-pass assertion (strict required-doc check + threshold recall check) is a clean design and backward compatibility for legacy expected_evidence-only cases is correctly preserved. There are a few issues worth addressing, the most important being a semantic inconsistency in the supporting_evidence-only default that also contradicts the documentation.

Issues

1. supporting_evidence-only default is 1.0, but docs say required/total (bug)

agentic_eval.rs:69-71 — when has_explicit_buckets = true and required_documents is empty (e.g. only supporting_evidence is set), the default falls into the 1.0 branch:

let default_min_expected_recall = if has_explicit_buckets {
    if required_documents.is_empty() {
        1.0  // ← forces 100% recall of supporting docs
    } else {
        required_documents.len() as f32 / expected_documents.len() as f32
    }

For { supporting_evidence: ["doc-a", "doc-b"] } the documented formula gives required/total = 0/2 = 0.0, but the code returns 1.0. The documented default in testing.md:145 reads "the default threshold is required/total", so the code contradicts both its own documentation and the "supporting = optional" semantics. If strict recall of supporting docs is intentional for this case, it must be documented and tested explicitly.

2. testing.md documents required/total default but implementation diverges

docs/howto/testing.md:145: "If required_evidence/supporting_evidence are provided and min_expected_recall is omitted, the default threshold is required/total."

This is accurate only when required_evidence is non-empty. The supporting_evidence-only case produces 0/N = 0.0 by the documented formula but 1.0 in practice. The docs should either reflect the special case or the code should be fixed to match.

3. Missing unit test for supporting_evidence-only case

There is no test covering { required_evidence: [], supporting_evidence: ["a", "b"], min_expected_recall: None }. Given the non-obvious behavior noted above, this is the case most likely to surprise future benchmark authors. A focused unit test would pin and document the actual behavior.

4. Mixed-schema footgun: expected_evidence silently expands the pool (low priority)

agentic_eval.rs:50-58 — in extended mode, expected_documents is the union of all three fields. If an author adds required_evidence to an existing case that has expected_evidence, the extra docs expand the pool and dilute the recall threshold without any warning. Consider an assert! or a warning when both expected_evidence and explicit buckets are non-empty simultaneously.

What Works Well

Two-pass assertion (lines 264-276): strict required-doc equality check then threshold recall check compose cleanly and independently — correct design.
Legacy backward compatibility (lines 42-48, 75-77): expected_evidence-only cases retain required_documents == expected_documents and min_expected_recall == 1.0 — verified by the legacy_expected_evidence_defaults_to_strict_recall unit test.
count_expected_hits signature (&BTreeSet<String>) — cleaner than &[String], no issues.
Three existing unit tests cover the main documented scenarios well.

Reviewer Focus Items (from PR description)

Default behavior for legacy entries: confirmed strict (required_documents == expected_documents, min_expected_recall == 1.0).
required_evidence/min_expected_recall interaction: mathematically correct when required_evidence is non-empty. The 0/N edge case (supporting-only) needs resolution per issue #1.

Verdict: Needs changes on issues #1 and #3 before merge. Issue #2 is a documentation fix that should accompany whichever resolution you pick for #1. Issue #4 is optional but worth a code comment.

HendrikReh · 2026-04-03T21:36:34Z

@codex review

chatgpt-codex-connector · 2026-04-03T21:38:50Z

+            if required_documents.is_empty() {
+                1.0


Derive default recall from required/total ratio

When required_evidence/supporting_evidence are used without min_expected_recall, this branch forces min_expected_recall to 1.0 whenever required_evidence is empty, which makes supporting-only cases require full recall. That contradicts the documented required/total default and prevents authors from expressing “all evidence is optional unless thresholded” cases unless they remember to set an explicit override. In practice, any benchmark entry with only supporting_evidence will fail unless every supporting document is returned.

HendrikReh · 2026-04-03T21:46:22Z

Review Comments Resolved

Codex

#	Finding	Action
1	[r3034464555] supporting-only explicit buckets defaulted to `1.0` recall instead of `required/total`	Fixed — explicit-bucket default now always computes `required_documents.len() / expected_documents.len()` so supporting-only defaults to `0.0`.
2	[r3034465270] same recall-default issue (`required/total` mismatch)	Fixed — same code-path change as above in `BenchmarkCase::evidence_expectation`.
3	[r3034496695] same supporting-only strictness regression	Fixed — same code-path change as above; behavior now matches docs.

Claude

#	Finding	Action
1	[c4185280100] supporting-only default should not be strict 1.0	Fixed — default threshold now derived from `required/total`; added `supporting_only_defaults_to_zero_recall_threshold` test.
2	[c4185280100] mixed schema (`expected_evidence` + explicit buckets) can silently dilute thresholds	Fixed — added assertion rejecting mixed legacy+explicit schema with a dedicated panic test.
3	[c4185280100] missing supporting-only unit test	Fixed — added test coverage for supporting-only default semantics.
4	[c4185288612] docs/implementation mismatch for default threshold	Fixed — code now matches documented `required/total` default and docs now explicitly state not to mix legacy and explicit buckets.

Validation run:

cargo test -p rag-server --test agentic_eval (5 passed, 1 ignored)
cargo check -p rag-server (pass)
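
The resolved behavior listed in the tables above can be sketched as follows. This is an illustration under stated assumptions, not the code merged in the PR: the struct layouts, the definition of `has_explicit_buckets`, the panic messages, and the helper `ids` are all hypothetical. Only the field names and the method name `evidence_expectation` come from the discussion.

```rust
use std::collections::BTreeSet;

/// Benchmark case fields as named in the PR; the struct layout is assumed.
#[derive(Default)]
struct BenchmarkCase {
    expected_evidence: Vec<String>,
    required_evidence: Vec<String>,
    supporting_evidence: Vec<String>,
    min_expected_recall: Option<f32>,
}

/// Normalized expectation (hypothetical struct name).
struct EvidenceExpectation {
    required_documents: BTreeSet<String>,
    expected_documents: BTreeSet<String>,
    min_expected_recall: f32,
}

impl BenchmarkCase {
    fn evidence_expectation(&self) -> EvidenceExpectation {
        // Assumption: a case is "explicit" when either bucket is non-empty.
        let has_explicit_buckets =
            !self.required_evidence.is_empty() || !self.supporting_evidence.is_empty();
        // Mixed legacy + explicit schema is rejected outright.
        assert!(
            !has_explicit_buckets || self.expected_evidence.is_empty(),
            "do not mix expected_evidence with required_evidence/supporting_evidence"
        );

        let required_documents: BTreeSet<String> = if has_explicit_buckets {
            self.required_evidence.iter().cloned().collect()
        } else {
            // Legacy cases treat every expected document as required.
            self.expected_evidence.iter().cloned().collect()
        };
        let expected_documents: BTreeSet<String> = if has_explicit_buckets {
            self.required_evidence
                .iter()
                .chain(self.supporting_evidence.iter())
                .cloned()
                .collect()
        } else {
            self.expected_evidence.iter().cloned().collect()
        };
        // Guard the division below against an empty pool.
        assert!(!expected_documents.is_empty(), "benchmark case lists no evidence");

        let default_min_expected_recall = if has_explicit_buckets {
            // required/total, so supporting-only cases default to 0.0.
            required_documents.len() as f32 / expected_documents.len() as f32
        } else {
            1.0
        };
        EvidenceExpectation {
            required_documents,
            expected_documents,
            min_expected_recall: self
                .min_expected_recall
                .unwrap_or(default_min_expected_recall),
        }
    }
}

/// Helper for building id lists in examples.
fn ids(v: &[&str]) -> Vec<String> {
    v.iter().map(|s| s.to_string()).collect()
}
```

Under these assumptions a supporting-only case normalizes to a threshold of 0.0, a legacy case stays at 1.0 unless `min_expected_recall` overrides it, and a case that sets both `expected_evidence` and an explicit bucket panics.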

HendrikReh · 2026-04-03T21:46:30Z

@claude review this PR

HendrikReh · 2026-04-03T21:46:31Z

@codex review

claude · 2026-04-03T21:46:42Z

Claude finished @HendrikReh's task in 1m 32s

PR Review

Read changed files (current state post-fixes)
Analyze evidence normalization logic
Verify previously-reported issues are resolved
Check for remaining issues
Post findings

Summary

All issues from the previous review rounds are resolved. The implementation is correct and well-tested. Two minor observations remain.

Previously Reported Issues — Verified Fixed

Finding	Status
supporting-only default was `1.0` instead of `required/total`	`agentic_eval.rs:74-75` now always computes `required_documents.len() / expected_documents.len()` → `0/2 = 0.0` for supporting-only. ✓
Mixed legacy + explicit schema silently dilutes thresholds	`agentic_eval.rs:41-45` now hard-panics when `has_explicit_buckets && !expected_evidence.is_empty()`. ✓
Missing unit test for supporting-only case	`supporting_only_defaults_to_zero_recall_threshold` test added at line 382. ✓
Docs/implementation mismatch	`testing.md:145` now reads "the default threshold is `required/total`" — matches code. ✓

Remaining Observations

1. Dead code in the has_explicit_buckets branch of expected_documents (cosmetic)

agentic_eval.rs:56-57 — the self.expected_evidence.iter() chain is dead: the assertion at lines 41-45 guarantees expected_evidence is empty whenever has_explicit_buckets is true. The code is harmless but slightly misleading. Could be simplified to:

let expected_documents: BTreeSet<String> = if has_explicit_buckets {
    // expected_evidence is guaranteed empty here (assertion above)
    self.required_evidence.iter().chain(self.supporting_evidence.iter()).cloned().collect()
} else {
    self.expected_evidence.iter().cloned().collect()
};

Not blocking — the behavior is identical.

2. Supporting-only case with no threshold is a no-op evidence gate (by design, but worth a comment)

agentic_eval.rs:265-276 — when required_documents is empty and min_expected_recall == 0.0 (the supporting-only default), both assertions trivially pass regardless of what is retrieved. This is the intentional behavior, and the unit test documents it. However, a future benchmark author adding supporting_evidence without setting min_expected_recall won't get any enforcement, which may be surprising.

Consider adding a brief code comment above the default_min_expected_recall computation:

// supporting-only cases (required empty) default to 0.0 — no threshold enforced
// unless the author sets min_expected_recall explicitly.

What's Correct

Division-by-zero safety: assert!(!expected_documents.is_empty()) at line 66 fires before the threshold division at line 75. ✓
Two-pass assertion (lines 265-276): required-doc strict equality check + threshold recall check remain independent and compose cleanly. ✓
Legacy backward compatibility (legacy_expected_evidence_defaults_to_strict_recall test): required_documents == expected_documents, min_expected_recall == 1.0. ✓
F32 precision: integer sizes typical in benchmark files are well within f32 range, no precision issues. ✓
Five unit tests cover all key branches: legacy strict, explicit mixed, relaxed legacy, supporting-only, and mixed-schema rejection.
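
A short sketch of why the emptiness guard matters for the ratio. In Rust, dividing `0.0_f32` by `0.0_f32` does not panic; it yields NaN, and every ordered comparison against NaN is false, so an unguarded empty pool would fail the recall check with a confusing message rather than a clear one. The function names below are hypothetical.

```rust
/// Recall ratio without any guard; 0/0 yields NaN in IEEE 754 arithmetic.
fn unguarded_ratio(hits: usize, total: usize) -> f32 {
    hits as f32 / total as f32
}

/// Recall ratio that rejects an empty pool before dividing.
fn guarded_ratio(hits: usize, total: usize) -> f32 {
    assert!(total > 0, "expected document pool must not be empty");
    hits as f32 / total as f32
}

/// Largest count up to which every integer converts to f32 exactly (2^24).
const MAX_EXACT_F32_INT: usize = 1 << 24;
```

Benchmark document counts are far below `MAX_EXACT_F32_INT`, so the `usize` to `f32` conversions in the ratio are exact for any realistic case.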

Verdict: Approve. Observation #1 is cosmetic. Observation #2 is worth a one-line comment but not blocking given the unit test already pins the behavior.

chatgpt-codex-connector · 2026-04-03T21:48:52Z

Codex Review: Didn't find any major issues. Bravo.

HendrikReh · 2026-04-03T21:56:43Z

@codex review

HendrikReh · 2026-04-03T21:56:43Z

@claude review this PR

claude · 2026-04-03T21:56:58Z

Claude finished @HendrikReh's task in 57s

PR Review

Read changed files (current state post-fixes)
Analyze evidence normalization logic
Verify all previously-reported issues are resolved
Check for remaining issues
Post findings

Summary

All previously-reported issues are fully resolved. The implementation is correct, well-tested, and consistent with the documentation. This is ready to merge.

Previously-Reported Issues — All Verified Fixed

Finding	Status
`supporting_evidence`-only default was `1.0` instead of `required/total`	Fixed — `agentic_eval.rs:74-75` now always computes `required_documents.len() / expected_documents.len()` → `0/2 = 0.0` for supporting-only ✓
Mixed legacy + explicit schema silently dilutes thresholds	Fixed — `agentic_eval.rs:41-45` panics when `has_explicit_buckets && !expected_evidence.is_empty()` ✓
Missing unit test for supporting-only case	Fixed — `supporting_only_defaults_to_zero_recall_threshold` added at line 382 ✓
Docs/implementation mismatch (`required/total`)	Fixed — `testing.md:145` matches code; explicit note added to not mix schemas ✓

Code Review: Remaining Observations

1. Dead chain in expected_documents union (cosmetic, non-blocking)

agentic_eval.rs:55-64 — in the has_explicit_buckets branch, self.expected_evidence.iter() contributes nothing because the assertion at lines 41-45 guarantees expected_evidence is empty here. The union is correct but slightly misleading:

// current (self.expected_evidence is always empty in this branch)
let expected_documents: BTreeSet<String> = if has_explicit_buckets {
    self.expected_evidence.iter()     // ← dead, always empty
        .chain(self.required_evidence.iter())
        .chain(self.supporting_evidence.iter())
        .cloned()
        .collect()

Not a bug — the behavior is identical. Could simplify to required_evidence.iter().chain(supporting_evidence.iter()) in a follow-up.

2. Supporting-only is a no-op gate by default (by design, documented by test)

agentic_eval.rs:265-276 — when required_documents is empty and min_expected_recall == 0.0, both assertions trivially pass regardless of retrieval output. This is the correct intended behavior for the supporting-only case (purely optional evidence), and it's now pinned by supporting_only_defaults_to_zero_recall_threshold. No action required.

What's Correct

Backward compatibility: legacy_expected_evidence_defaults_to_strict_recall confirms required_documents == expected_documents and min_expected_recall == 1.0 for legacy-schema cases ✓
Two-pass assertion (lines 265-276): strict required-doc equality check is independent of threshold recall check — correct composition ✓
Division-by-zero safety: assert!(!expected_documents.is_empty()) at line 66 fires before the threshold division at line 75 ✓
Mixed-schema rejection: hard panic with clear message, covered by #[should_panic] test ✓
Five unit tests cover all key branches: legacy strict, explicit mixed, relaxed legacy, supporting-only, and mixed-schema rejection ✓
Docs accuracy (testing.md:141-147): all four schema variants documented correctly ✓

Verdict: Approve. Observation #1 is cosmetic and non-blocking. Observation #2 is intentional and already tested.

chatgpt-codex-connector · 2026-04-03T21:59:06Z

Codex Review: Didn't find any major issues. Another round soon, please!

feat(rag-server): support richer agentic eval evidence expectations

61abaa3

HendrikReh added the priority:medium, enhancement, testing, and rag-server labels Apr 3, 2026

chatgpt-codex-connector Bot reviewed Apr 3, 2026

fix(rag-server): align agentic eval supporting-evidence defaults

359a866

fix(rag-server): satisfy clippy in eval schema guard

b980dff

HendrikReh merged commit d40e694 into main Apr 3, 2026
1 check passed

HendrikReh deleted the feat/apex-e0i-eval-evidence-schema branch April 3, 2026 21:58
