feat(rag-server): extend agentic eval evidence expectations#43

Merged
HendrikReh merged 3 commits into main from feat/apex-e0i-eval-evidence-schema on Apr 3, 2026

@HendrikReh

Summary

  • Extend the offline agentic eval benchmark schema to support richer evidence expectations via optional required_evidence, supporting_evidence, and min_expected_recall fields.
  • Update eval gating logic so required evidence is enforced strictly while expected-pool recall is thresholded, enabling multi-document cases without forcing all-or-nothing recall.
  • Preserve backward compatibility: legacy expected_evidence-only cases remain strict by default unless an explicit min_expected_recall override is provided.
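
As a rough illustration of the three optional fields named above, the sketch below models a benchmark case in plain Rust. The field names come from this summary; the field types (`Vec<String>`, `Option<f32>`), the `docs` helper, and the body of `has_explicit_buckets` are assumptions made for this example. The actual `BenchmarkCase` definition and its parsing are not shown in this thread.

```rust
/// Hypothetical mirror of a benchmark case. Field names are taken from the
/// PR summary; the types are assumed for illustration only.
#[derive(Debug, Default)]
struct BenchmarkCase {
    expected_evidence: Vec<String>,
    required_evidence: Vec<String>,
    supporting_evidence: Vec<String>,
    min_expected_recall: Option<f32>,
}

impl BenchmarkCase {
    /// Assumed rule: a case counts as "extended" as soon as either explicit
    /// bucket is non-empty.
    fn has_explicit_buckets(&self) -> bool {
        !self.required_evidence.is_empty() || !self.supporting_evidence.is_empty()
    }
}

/// Small helper (hypothetical) to build owned document ids.
fn docs(ids: &[&str]) -> Vec<String> {
    ids.iter().map(|id| id.to_string()).collect()
}

fn main() {
    // Legacy shape: only expected_evidence is set.
    let legacy = BenchmarkCase {
        expected_evidence: docs(&["doc-a"]),
        ..Default::default()
    };
    // Extended shape: explicit buckets plus an explicit threshold.
    let extended = BenchmarkCase {
        required_evidence: docs(&["doc-a"]),
        supporting_evidence: docs(&["doc-b", "doc-c"]),
        min_expected_recall: Some(0.5),
        ..Default::default()
    };
    println!("legacy: {:?} explicit={}", legacy.expected_evidence, legacy.has_explicit_buckets());
    println!(
        "extended: threshold={:?} explicit={}",
        extended.min_expected_recall,
        extended.has_explicit_buckets()
    );
}
```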

Related Issue

  • Beads: apex-e0i

Scope

Changed files

  • crates/rag-server/tests/agentic_eval.rs - add extended benchmark schema parsing, evidence expectation normalization, threshold-based recall checks, and focused unit tests.
  • docs/howto/testing.md - document legacy vs extended benchmark expectation fields for just agentic-eval.

Out of scope

  • Changing route selection heuristics or retrieval profile behavior.
  • Modifying benchmark corpus content in data/evals/agentic_search_v1/corpus.
  • Changing /chat or /agents/* API contracts.

Validation

| Check | Command | Result |
| --- | --- | --- |
| Format | cargo fmt | pass |
| Eval tests | cargo test -p rag-server --test agentic_eval | pass (3 passed, 1 ignored benchmark) |
| Compilation | cargo check -p rag-server | pass |
| Full suite | just test | not run -- user decision |
| Integration suite | just integration-tests | not run -- user decision |

Risk and Rollback

  • Risk: medium -- benchmark acceptance semantics changed, so misconfigured thresholds could mask retrieval regressions if cases are authored carelessly.
  • Rollback: revert commit 61abaa3 (or revert crates/rag-server/tests/agentic_eval.rs and docs/howto/testing.md) to restore strict legacy recall behavior.

Reviewer Focus

  1. Verify default behavior remains strict for existing benchmark entries that only define expected_evidence.
  2. Verify the new required_evidence/min_expected_recall interaction is mathematically and operationally correct for multi-document benchmark expansion.

Generated with Codex

@HendrikReh added the priority:medium, enhancement, testing, and rag-server labels on Apr 3, 2026
@HendrikReh

@claude

@claude

claude Bot commented Apr 3, 2026

Claude finished @HendrikReh's task in 1m 17s


PR Review

  • Read changed files
  • Analyze evidence normalization logic
  • Review recall threshold implementation
  • Check backward compatibility
  • Review unit tests
  • Post findings

Summary

The design is sound and backward compatibility is correctly preserved for legacy expected_evidence-only cases. The two-pass assertion (strict required-doc check + threshold recall check) is a clean approach. A few issues are worth addressing before merge.


Issues

1. supporting_evidence-only case silently defaults to 1.0 recall (bug / footgun)

agentic_eval.rs:69-74 — when has_explicit_buckets = true but required_evidence is empty (e.g. you only provide supporting_evidence), required_documents is empty and default_min_expected_recall is 1.0:

let default_min_expected_recall = if has_explicit_buckets {
    if required_documents.is_empty() {
        1.0  // ← ALL supporting docs must be retrieved
    } else { ... }

A benchmark case like { supporting_evidence: ["doc-a", "doc-b"] } silently requires 100% recall of those supporting docs, contradicting the purpose of "supporting" (optional/secondary evidence). If this is intentional, it should be documented and tested. If not, the default for the supporting-only case should be 0.0 (or some fraction).
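
To make the difference concrete, here is a minimal sketch that compares the default described in this comment with a plain required/total ratio. Both function names are hypothetical, and the functions take counts instead of document sets to keep the arithmetic visible. They only model the explicit-bucket branch quoted above and assume `total` is non-zero.

```rust
/// Default as described in this review comment: the explicit-bucket branch
/// returns 1.0 when there are no required documents.
fn default_recall_as_reviewed(required: usize, total: usize) -> f32 {
    if required == 0 {
        1.0
    } else {
        required as f32 / total as f32
    }
}

/// Alternative rule: always required/total, so a supporting-only case
/// evaluates to 0.0.
fn default_recall_required_over_total(required: usize, total: usize) -> f32 {
    required as f32 / total as f32
}

fn main() {
    // { supporting_evidence: ["doc-a", "doc-b"] } means required = 0, total = 2.
    println!("as reviewed:    {}", default_recall_as_reviewed(0, 2));
    println!("required/total: {}", default_recall_required_over_total(0, 2));
    // With one required and one supporting document the two rules agree.
    println!("both rules:     {}", default_recall_as_reviewed(1, 2));
}
```

The two rules differ only when the required set is empty, which is exactly the supporting-only case flagged here.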

2. Mixed-schema footgun: expected_evidence silently expands the pool

agentic_eval.rs:50-58 — in extended mode, expected_documents is the union of expected_evidence + required_evidence + supporting_evidence. If a benchmark author adds required_evidence to an existing case without clearing expected_evidence, the extra docs dilute the threshold without being tracked as either required or supporting. The default_min_expected_recall becomes required_count / enlarged_total, which will be lower than intended.

This could lead to benchmark cases silently relaxing over time without a reviewer noticing. Consider either:

  • Emitting an assert! or warning when both expected_evidence and explicit buckets are non-empty, or
  • Treating mixed use as a configuration error
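
The dilution can be sketched with small sets. The example below assumes the union behavior described in this comment; `expected_pool`, `default_threshold`, and `mixes_legacy_and_explicit` are hypothetical helpers written for illustration, not functions from agentic_eval.rs.

```rust
use std::collections::BTreeSet;

/// Union of all three fields, as this comment describes for extended mode.
fn expected_pool(expected: &[&str], required: &[&str], supporting: &[&str]) -> BTreeSet<String> {
    expected
        .iter()
        .chain(required)
        .chain(supporting)
        .map(|id| id.to_string())
        .collect()
}

/// required/total default over the (possibly enlarged) pool.
fn default_threshold(required: &[&str], pool: &BTreeSet<String>) -> f32 {
    let required_count = required.iter().collect::<BTreeSet<_>>().len();
    required_count as f32 / pool.len() as f32
}

/// One possible configuration check: flag cases that mix both schemas.
fn mixes_legacy_and_explicit(expected: &[&str], required: &[&str], supporting: &[&str]) -> bool {
    !expected.is_empty() && (!required.is_empty() || !supporting.is_empty())
}

fn main() {
    // Intended case: one required and one supporting document, default 1/2.
    let clean = expected_pool(&[], &["doc-a"], &["doc-b"]);
    // Mixed case: a leftover legacy list adds two untracked documents, default 1/4.
    let mixed = expected_pool(&["doc-c", "doc-d"], &["doc-a"], &["doc-b"]);

    println!("clean default: {}", default_threshold(&["doc-a"], &clean));
    println!("mixed default: {}", default_threshold(&["doc-a"], &mixed));
    println!(
        "mixed schema detected: {}",
        mixes_legacy_and_explicit(&["doc-c", "doc-d"], &["doc-a"], &["doc-b"])
    );
}
```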

3. Missing unit test for supporting_evidence-only case

There's no test covering { required_evidence: [], supporting_evidence: ["a", "b"], min_expected_recall: None }. Given the non-obvious 1.0 default behavior noted above, this is the case most likely to surprise future benchmark authors. One focused unit test would document and pin the behavior.


Minor / Nits

4. count_expected_hits signature change (&[String] → &BTreeSet<String>): clean improvement, no concerns.

5. Recall assertion error message (agentic_eval.rs:271-276) is good. Consider also logging which specific documents were missing to make CI failures easier to triage, e.g. "missing: {:?}", expected.required_documents.difference(&routed_docs).collect::<Vec<_>>().
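
A minimal sketch of that suggestion, using only the standard library. `BTreeSet::difference` yields the elements of the first set that are absent from the second, in ascending order. The `missing_documents` and `set` helpers are hypothetical names for this example.

```rust
use std::collections::BTreeSet;

/// Required documents that did not appear among the routed documents.
fn missing_documents<'a>(
    required: &'a BTreeSet<String>,
    routed: &'a BTreeSet<String>,
) -> Vec<&'a String> {
    required.difference(routed).collect()
}

/// Helper (hypothetical) to build a set of owned document ids.
fn set(ids: &[&str]) -> BTreeSet<String> {
    ids.iter().map(|id| id.to_string()).collect()
}

fn main() {
    let required = set(&["doc-a", "doc-b"]);
    let routed = set(&["doc-a", "doc-z"]);
    // A message fragment that could be appended to the assertion text.
    println!("missing: {:?}", missing_documents(&required, &routed));
}
```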

6. docs/howto/testing.md update is accurate and sufficient. The note about the default threshold being required/total matches the implementation.


Reviewer Focus Items (from PR description)

  1. Default behavior for legacy entries — confirmed strict: required_documents == expected_documents, min_expected_recall == 1.0.
  2. required_evidence/min_expected_recall interaction — mathematically correct. required_hits == required_documents.len() is a strict equality check (all-or-nothing for required). routed_recall >= min_expected_recall is the threshold gate on the full pool. The two checks are independent and compose cleanly.
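
The two independent checks can be sketched as follows. This is an illustration under assumptions: the real test uses assertions, while this sketch returns a `Result` so the outcomes are easy to inspect. The body of `count_expected_hits` and the `check_evidence` and `set` helpers are hypothetical; only the `&BTreeSet<String>` parameter type is confirmed by the thread. The sketch also assumes `expected` is non-empty.

```rust
use std::collections::BTreeSet;

/// Number of expected documents that were actually routed (assumed body).
fn count_expected_hits(expected: &BTreeSet<String>, routed: &BTreeSet<String>) -> usize {
    expected.intersection(routed).count()
}

/// Pass 1: every required document must be present (all-or-nothing).
/// Pass 2: recall over the full expected pool must reach the threshold.
fn check_evidence(
    required: &BTreeSet<String>,
    expected: &BTreeSet<String>,
    routed: &BTreeSet<String>,
    min_expected_recall: f32,
) -> Result<(), String> {
    let required_hits = count_expected_hits(required, routed);
    if required_hits != required.len() {
        return Err(format!("required hits {}/{}", required_hits, required.len()));
    }
    let routed_recall = count_expected_hits(expected, routed) as f32 / expected.len() as f32;
    if routed_recall >= min_expected_recall {
        Ok(())
    } else {
        Err(format!("recall {} below {}", routed_recall, min_expected_recall))
    }
}

/// Helper (hypothetical) to build a set of owned document ids.
fn set(ids: &[&str]) -> BTreeSet<String> {
    ids.iter().map(|id| id.to_string()).collect()
}

fn main() {
    let required = set(&["doc-a"]);
    let expected = set(&["doc-a", "doc-b", "doc-c"]);
    // Passes: the required doc is present and recall is 2/3.
    println!("{:?}", check_evidence(&required, &expected, &set(&["doc-a", "doc-b"]), 0.5));
    // Fails pass 1 even though recall is also 2/3.
    println!("{:?}", check_evidence(&required, &expected, &set(&["doc-b", "doc-c"]), 0.5));
}
```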

Verdict: Approve pending resolution of issue #1 (supporting-only default) and #3 (missing unit test). Issue #2 (mixed schema) is lower priority but worth a comment in the code or a config assertion.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61abaa3ae6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/rag-server/tests/agentic_eval.rs Outdated
Comment on lines +70 to +71
if required_documents.is_empty() {
1.0

[P2] Use required/total default when no required evidence exists

When explicit evidence buckets are used and required_evidence is empty, this branch sets the default min_expected_recall to 1.0, which effectively makes all supporting evidence mandatory. That contradicts the documented/default formula (required/total) and changes the semantics for cases that intentionally have only supporting_evidence (where the default should evaluate to 0.0). As written, such benchmarks will fail unless every supporting document is retrieved or authors remember to override min_expected_recall manually.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61abaa3ae6

Comment thread crates/rag-server/tests/agentic_eval.rs Outdated
Comment on lines +70 to +71
if required_documents.is_empty() {
1.0

[P2] Compute default recall threshold as required/total

When required_evidence/supporting_evidence are present but required_evidence is empty, this branch sets default_min_expected_recall to 1.0, which makes every supporting document effectively mandatory. That contradicts the documented extended-schema default (required/total in docs/howto/testing.md) and will cause benchmark cases that intentionally use only supporting evidence to fail unless authors remember to add an explicit override. Using the same required_documents.len() / expected_documents.len() rule here preserves the stated semantics and avoids surprising strictness.

@HendrikReh

@claude review this PR

@HendrikReh

@claude review

@claude

claude Bot commented Apr 3, 2026

Claude finished @HendrikReh's task in 1m 21s


PR Review

  • Read changed files
  • Analyze evidence normalization logic
  • Review recall threshold implementation
  • Check backward compatibility
  • Review unit tests
  • Post findings

Summary

The two-pass assertion (strict required-doc check + threshold recall check) is a clean design and backward compatibility for legacy expected_evidence-only cases is correctly preserved. There are a few issues worth addressing, the most important being a semantic inconsistency in the supporting_evidence-only default that also contradicts the documentation.


Issues

1. supporting_evidence-only default is 1.0, but docs say required/total (bug)

agentic_eval.rs:69-71 — when has_explicit_buckets = true and required_documents is empty (e.g. only supporting_evidence is set), the default falls into the 1.0 branch:

let default_min_expected_recall = if has_explicit_buckets {
    if required_documents.is_empty() {
        1.0  // ← forces 100% recall of supporting docs
    } else {
        required_documents.len() as f32 / expected_documents.len() as f32
    }

For { supporting_evidence: ["doc-a", "doc-b"] } the documented formula gives required/total = 0/2 = 0.0, but the code returns 1.0. The documented default in testing.md:145 reads "the default threshold is required/total", which would be 0.0 in this case. The code contradicts its own documentation and the "supporting = optional" semantics. If strict recall of supporting docs is intentional for this case, it must be documented and tested explicitly.

2. testing.md documents required/total default but implementation diverges

docs/howto/testing.md:145: "If required_evidence/supporting_evidence are provided and min_expected_recall is omitted, the default threshold is required/total."

This is accurate only when required_evidence is non-empty. The supporting_evidence-only case produces 0/N = 0.0 by the documented formula but 1.0 in practice. The docs should either reflect the special case or the code should be fixed to match.

3. Missing unit test for supporting_evidence-only case

There is no test covering { required_evidence: [], supporting_evidence: ["a", "b"], min_expected_recall: None }. Given the non-obvious behavior noted above, this is the case most likely to surprise future benchmark authors. A focused unit test would pin and document the actual behavior.

4. Mixed-schema footgun: expected_evidence silently expands the pool (low priority)

agentic_eval.rs:50-58 — in extended mode, expected_documents is the union of all three fields. If an author adds required_evidence to an existing case that has expected_evidence, the extra docs expand the pool and dilute the recall threshold without any warning. Consider an assert! or a warning when both expected_evidence and explicit buckets are non-empty simultaneously.


What Works Well

  • Two-pass assertion (lines 264-276): strict required-doc equality check then threshold recall check compose cleanly and independently — correct design.
  • Legacy backward compatibility (lines 42-48, 75-77): expected_evidence-only cases retain required_documents == expected_documents and min_expected_recall == 1.0 — verified by the legacy_expected_evidence_defaults_to_strict_recall unit test.
  • count_expected_hits signature (&BTreeSet<String>) — cleaner than &[String], no issues.
  • Three existing unit tests cover the main documented scenarios well.

Reviewer Focus Items (from PR description)

  1. Default behavior for legacy entries — confirmed strict: required_documents == expected_documents, min_expected_recall == 1.0.
  2. required_evidence/min_expected_recall interaction: mathematically correct when required_evidence is non-empty. The 0/N edge case (supporting-only) needs resolution per issue #1.

Verdict: Needs changes on issues #1 and #3 before merge. Issue #2 is a documentation fix that should accompany whichever resolution you pick for #1. Issue #4 is optional but worth a code comment.

@HendrikReh

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 61abaa3ae6

Comment thread crates/rag-server/tests/agentic_eval.rs Outdated
Comment on lines +70 to +71
if required_documents.is_empty() {
1.0

[P2] Derive default recall from required/total ratio

When required_evidence/supporting_evidence are used without min_expected_recall, this branch forces min_expected_recall to 1.0 whenever required_evidence is empty, which makes supporting-only cases require full recall. That contradicts the documented required/total default and prevents authors from expressing “all evidence is optional unless thresholded” cases unless they remember to set an explicit override. In practice, any benchmark entry with only supporting_evidence will fail unless every supporting document is returned.

@HendrikReh

Review Comments Resolved

Codex

| # | Finding | Action |
| --- | --- | --- |
| 1 | [r3034464555] supporting-only explicit buckets defaulted to 1.0 recall instead of required/total | Fixed: explicit-bucket default now always computes required_documents.len() / expected_documents.len(), so supporting-only defaults to 0.0. |
| 2 | [r3034465270] same recall-default issue (required/total mismatch) | Fixed: same code-path change as above in BenchmarkCase::evidence_expectation. |
| 3 | [r3034496695] same supporting-only strictness regression | Fixed: same code-path change as above; behavior now matches docs. |

Claude

| # | Finding | Action |
| --- | --- | --- |
| 1 | [c4185280100] supporting-only default should not be strict 1.0 | Fixed: default threshold now derived from required/total; added supporting_only_defaults_to_zero_recall_threshold test. |
| 2 | [c4185280100] mixed schema (expected_evidence + explicit buckets) can silently dilute thresholds | Fixed: added assertion rejecting mixed legacy+explicit schema with a dedicated panic test. |
| 3 | [c4185280100] missing supporting-only unit test | Fixed: added test coverage for supporting-only default semantics. |
| 4 | [c4185288612] docs/implementation mismatch for default threshold | Fixed: code now matches documented required/total default and docs now explicitly state not to mix legacy and explicit buckets. |

Validation run:

  • cargo test -p rag-server --test agentic_eval (5 passed, 1 ignored)
  • cargo check -p rag-server (pass)
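
A minimal sketch of the resolved default behavior as described in the tables above, not the actual BenchmarkCase::evidence_expectation code. The `EvidenceExpectation` struct, the `default_expectation` function, the `to_set` helper, and the panic messages are hypothetical. The explicit min_expected_recall override is left out because the thread does not show how it is applied. In this sketch the legacy branch sets required equal to expected, so the same required/total ratio yields the strict 1.0 default.

```rust
use std::collections::BTreeSet;

/// Hypothetical normalized form of a case's evidence expectations.
struct EvidenceExpectation {
    required_documents: BTreeSet<String>,
    expected_documents: BTreeSet<String>,
    default_min_expected_recall: f32,
}

fn to_set(ids: &[&str]) -> BTreeSet<String> {
    ids.iter().map(|id| id.to_string()).collect()
}

fn default_expectation(
    expected_evidence: &[&str],
    required_evidence: &[&str],
    supporting_evidence: &[&str],
) -> EvidenceExpectation {
    let has_explicit_buckets = !required_evidence.is_empty() || !supporting_evidence.is_empty();
    // Mixed legacy + explicit schema is rejected as a configuration error.
    assert!(
        !(has_explicit_buckets && !expected_evidence.is_empty()),
        "do not mix expected_evidence with required_evidence/supporting_evidence"
    );
    let (required_documents, expected_documents) = if has_explicit_buckets {
        let required = to_set(required_evidence);
        let mut expected = required.clone();
        expected.extend(to_set(supporting_evidence));
        (required, expected)
    } else {
        // Legacy: every expected document is also required.
        let expected = to_set(expected_evidence);
        (expected.clone(), expected)
    };
    // Guard the division below against an empty pool.
    assert!(!expected_documents.is_empty(), "case has no expected documents");
    let default_min_expected_recall =
        required_documents.len() as f32 / expected_documents.len() as f32;
    EvidenceExpectation { required_documents, expected_documents, default_min_expected_recall }
}

fn main() {
    let supporting_only = default_expectation(&[], &[], &["doc-a", "doc-b"]);
    println!(
        "required={} expected={} default={}",
        supporting_only.required_documents.len(),
        supporting_only.expected_documents.len(),
        supporting_only.default_min_expected_recall
    );
}
```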

@HendrikReh

@claude review this PR

@HendrikReh

@codex review

@claude

claude Bot commented Apr 3, 2026

Claude finished @HendrikReh's task in 1m 32s


PR Review

  • Read changed files (current state post-fixes)
  • Analyze evidence normalization logic
  • Verify previously-reported issues are resolved
  • Check for remaining issues
  • Post findings

Summary

All issues from the previous review rounds are resolved. The implementation is correct and well-tested. Two minor observations remain.


Previously Reported Issues — Verified Fixed

| # | Finding | Status |
| --- | --- | --- |
| 1 | supporting-only default was 1.0 instead of required/total | agentic_eval.rs:74-75 now always computes required_documents.len() / expected_documents.len() → 0/2 = 0.0 for supporting-only. ✓ |
| 2 | Mixed legacy + explicit schema silently dilutes thresholds | agentic_eval.rs:41-45 now hard-panics when has_explicit_buckets && !expected_evidence.is_empty(). ✓ |
| 3 | Missing unit test for supporting-only case | supporting_only_defaults_to_zero_recall_threshold test added at line 382. ✓ |
| 4 | Docs/implementation mismatch | testing.md:145 now reads "the default threshold is required/total", which matches the code. ✓ |

Remaining Observations

1. Dead code in the has_explicit_buckets branch of expected_documents (cosmetic)

agentic_eval.rs:56-57 — the self.expected_evidence.iter() chain is dead: the assertion at lines 41-45 guarantees expected_evidence is empty whenever has_explicit_buckets is true. The code is harmless but slightly misleading. Could be simplified to:

let expected_documents: BTreeSet<String> = if has_explicit_buckets {
    // expected_evidence is guaranteed empty here (assertion above)
    self.required_evidence.iter().chain(self.supporting_evidence.iter()).cloned().collect()
} else {
    self.expected_evidence.iter().cloned().collect()
};

Not blocking — the behavior is identical.

2. Supporting-only case with no threshold is a no-op evidence gate (by design, but worth a comment)

agentic_eval.rs:265-276 — when required_documents is empty and min_expected_recall == 0.0 (the supporting-only default), both assertions trivially pass regardless of what is retrieved. This is the intentional behavior, and the unit test documents it. However, a future benchmark author adding supporting_evidence without setting min_expected_recall won't get any enforcement, which may be surprising.

Consider adding a brief code comment above the default_min_expected_recall computation:

// supporting-only cases (required empty) default to 0.0 — no threshold enforced
// unless the author sets min_expected_recall explicitly.
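
The no-op behavior can be checked with plain counts. This is a sketch with a hypothetical `gate_passes` helper that reduces both assertions to arithmetic; it is not code from agentic_eval.rs.

```rust
/// Both gates reduced to counts: strict equality on required documents,
/// then a threshold on recall over the expected pool.
fn gate_passes(
    required_hits: usize,
    required_total: usize,
    expected_hits: usize,
    expected_total: usize,
    min_expected_recall: f32,
) -> bool {
    let required_ok = required_hits == required_total;
    let routed_recall = expected_hits as f32 / expected_total as f32;
    required_ok && routed_recall >= min_expected_recall
}

fn main() {
    // Supporting-only case with two documents and nothing retrieved.
    // Default threshold 0.0: the gate passes anyway.
    println!("default 0.0:  {}", gate_passes(0, 0, 0, 2, 0.0));
    // An explicit threshold restores enforcement.
    println!("explicit 0.5: {}", gate_passes(0, 0, 0, 2, 0.5));
}
```

Setting min_expected_recall explicitly is the way to get enforcement for a supporting-only case.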

What's Correct

  • Division-by-zero safety: assert!(!expected_documents.is_empty()) at line 66 fires before the threshold division at line 75. ✓
  • Two-pass assertion (lines 265-276): required-doc strict equality check + threshold recall check remain independent and compose cleanly. ✓
  • Legacy backward compatibility (legacy_expected_evidence_defaults_to_strict_recall test): required_documents == expected_documents, min_expected_recall == 1.0. ✓
  • F32 precision: integer sizes typical in benchmark files are well within f32 range, no precision issues. ✓
  • Five unit tests cover all key branches: legacy strict, explicit mixed, relaxed legacy, supporting-only, and mixed-schema rejection.
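
The division-by-zero and precision points rest on standard floating-point behavior, sketched below with a hypothetical `ratio` helper. With an empty pool the ratio would be 0.0 / 0.0, which is NaN, and any `>=` comparison involving NaN is false. Separately, `f32` represents integers exactly only up to 2^24, far above any realistic benchmark document count.

```rust
/// Same shape as the required/total computation discussed in the review.
fn ratio(numerator: usize, denominator: usize) -> f32 {
    numerator as f32 / denominator as f32
}

fn main() {
    // Without the non-empty assertion, an empty pool would produce NaN,
    // and every threshold comparison against it would evaluate to false.
    let empty = ratio(0, 0);
    println!("empty pool ratio is NaN: {}", empty.is_nan());
    println!("NaN >= 0.0: {}", empty >= 0.0);

    // Integers up to 2^24 survive a round trip through f32 unchanged.
    let exact: usize = 1 << 24;
    println!("2^24 round trip exact:   {}", exact as f32 as usize == exact);
    // 2^24 + 1 is the first integer that does not.
    println!("2^24+1 round trip exact: {}", (exact + 1) as f32 as usize == exact + 1);
}
```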

Verdict: Approve. Observation #1 is cosmetic. Observation #2 is worth a one-line comment but not blocking given the unit test already pins the behavior.

@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Bravo.

@HendrikReh

@codex review

@HendrikReh

@claude review this PR

@claude

claude Bot commented Apr 3, 2026

Claude finished @HendrikReh's task in 57s


PR Review

  • Read changed files (current state post-fixes)
  • Analyze evidence normalization logic
  • Verify all previously-reported issues are resolved
  • Check for remaining issues
  • Post findings

Summary

All previously-reported issues are fully resolved. The implementation is correct, well-tested, and consistent with the documentation. This is ready to merge.


Previously-Reported Issues — All Verified Fixed

| Finding | Status |
| --- | --- |
| supporting_evidence-only default was 1.0 instead of required/total | Fixed: agentic_eval.rs:74-75 now always computes required_documents.len() / expected_documents.len() → 0/2 = 0.0 for supporting-only ✓ |
| Mixed legacy + explicit schema silently dilutes thresholds | Fixed: agentic_eval.rs:41-45 panics when has_explicit_buckets && !expected_evidence.is_empty() ✓ |
| Missing unit test for supporting-only case | Fixed: supporting_only_defaults_to_zero_recall_threshold added at line 382 ✓ |
| Docs/implementation mismatch (required/total) | Fixed: testing.md:145 matches code; explicit note added to not mix schemas ✓ |

Code Review: Remaining Observations

1. Dead chain in expected_documents union (cosmetic, non-blocking)

agentic_eval.rs:55-64 — in the has_explicit_buckets branch, self.expected_evidence.iter() contributes nothing because the assertion at lines 41-45 guarantees expected_evidence is empty here. The union is correct but slightly misleading:

// current (self.expected_evidence is always empty in this branch)
let expected_documents: BTreeSet<String> = if has_explicit_buckets {
    self.expected_evidence.iter()     // ← dead, always empty
        .chain(self.required_evidence.iter())
        .chain(self.supporting_evidence.iter())
        .cloned()
        .collect()

Not a bug — the behavior is identical. Could simplify to required_evidence.iter().chain(supporting_evidence.iter()) in a follow-up.

2. Supporting-only is a no-op gate by default (by design, documented by test)

agentic_eval.rs:265-276 — when required_documents is empty and min_expected_recall == 0.0, both assertions trivially pass regardless of retrieval output. This is the correct intended behavior for the supporting-only case (purely optional evidence), and it's now pinned by supporting_only_defaults_to_zero_recall_threshold. No action required.


What's Correct

  • Backward compatibility: legacy_expected_evidence_defaults_to_strict_recall confirms required_documents == expected_documents and min_expected_recall == 1.0 for legacy-schema cases ✓
  • Two-pass assertion (lines 265-276): strict required-doc equality check is independent of threshold recall check — correct composition ✓
  • Division-by-zero safety: assert!(!expected_documents.is_empty()) at line 66 fires before the threshold division at line 75 ✓
  • Mixed-schema rejection: hard panic with clear message, covered by #[should_panic] test ✓
  • Five unit tests cover all key branches: legacy strict, explicit mixed, relaxed legacy, supporting-only, and mixed-schema rejection ✓
  • Docs accuracy (testing.md:141-147): all four schema variants documented correctly ✓

Verdict: Approve. Observation #1 is cosmetic and non-blocking. Observation #2 is intentional and already tested.

@HendrikReh HendrikReh merged commit d40e694 into main Apr 3, 2026
1 check passed
@HendrikReh HendrikReh deleted the feat/apex-e0i-eval-evidence-schema branch April 3, 2026 21:58
@chatgpt-codex-connector

Codex Review: Didn't find any major issues. Another round soon, please!

Labels

  • enhancement (New feature or improvement)
  • priority:medium (Moderate urgency; plan for near-term)
  • rag-server (rag-server crate (Axum REST API))
  • testing (Test coverage, frameworks, and infrastructure)

1 participant