feat(rag-server): extend agentic eval evidence expectations#43
Conversation
|
Claude finished @HendrikReh's task in 1m 17s —— View job PR Review
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61abaa3ae6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if required_documents.is_empty() { | ||
| 1.0 |
There was a problem hiding this comment.
Use required/total default when no required evidence exists
When explicit evidence buckets are used and required_evidence is empty, this branch sets the default min_expected_recall to 1.0, which effectively makes all supporting evidence mandatory. That contradicts the documented/default formula (required/total) and changes the semantics for cases that intentionally have only supporting_evidence (where the default should evaluate to 0.0). As written, such benchmarks will fail unless every supporting document is retrieved or authors remember to override min_expected_recall manually.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61abaa3ae6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if required_documents.is_empty() { | ||
| 1.0 |
There was a problem hiding this comment.
Compute default recall threshold as required/total
When required_evidence/supporting_evidence are present but required_evidence is empty, this branch sets default_min_expected_recall to 1.0, which makes every supporting document effectively mandatory. That contradicts the documented extended-schema default (required/total in docs/howto/testing.md) and will cause benchmark cases that intentionally use only supporting evidence to fail unless authors remember to add an explicit override. Using the same required_documents.len() / expected_documents.len() rule here preserves the stated semantics and avoids surprising strictness.
Useful? React with 👍 / 👎.
|
@claude review this PR |
|
@claude review |
|
Claude finished @HendrikReh's task in 1m 21s —— View job PR Review
SummaryThe two-pass assertion (strict required-doc check + threshold recall check) is a clean design and backward compatibility for legacy Issues1.
let default_min_expected_recall = if has_explicit_buckets {
if required_documents.is_empty() {
1.0 // ← forces 100% recall of supporting docs
} else {
required_documents.len() as f32 / expected_documents.len() as f32
}For 2.
This is accurate only when 3. Missing unit test for There is no test covering 4. Mixed-schema footgun:
What Works Well
Reviewer Focus Items (from PR description)
Verdict: Needs changes on issues #1 and #3 before merge. Issue #2 is a documentation fix that should accompany whichever resolution you pick for #1. Issue #4 is optional but worth a code comment. |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 61abaa3ae6
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if required_documents.is_empty() { | ||
| 1.0 |
There was a problem hiding this comment.
Derive default recall from required/total ratio
When required_evidence/supporting_evidence are used without min_expected_recall, this branch forces min_expected_recall to 1.0 whenever required_evidence is empty, which makes supporting-only cases require full recall. That contradicts the documented required/total default and prevents authors from expressing “all evidence is optional unless thresholded” cases unless they remember to set an explicit override. In practice, any benchmark entry with only supporting_evidence will fail unless every supporting document is returned.
Useful? React with 👍 / 👎.
Review Comments ResolvedCodex
Claude
Validation run:
|
|
@claude review this PR |
|
@codex review |
|
Claude finished @HendrikReh's task in 1m 32s —— View job PR Review
SummaryAll issues from the previous review rounds are resolved. The implementation is correct and well-tested. Two minor observations remain. Previously Reported Issues — Verified Fixed
Remaining Observations1. Dead code in the
let expected_documents: BTreeSet<String> = if has_explicit_buckets {
// expected_evidence is guaranteed empty here (assertion above)
self.required_evidence.iter().chain(self.supporting_evidence.iter()).cloned().collect()
} else {
self.expected_evidence.iter().cloned().collect()
};Not blocking — the behavior is identical. 2. Supporting-only case with no threshold is a no-op evidence gate (by design, but worth a comment)
Consider adding a brief code comment above the // supporting-only cases (required empty) default to 0.0 — no threshold enforced
// unless the author sets min_expected_recall explicitly.What's Correct
Verdict: Approve. Observation #1 is cosmetic. Observation #2 is worth a one-line comment but not blocking given the unit test already pins the behavior. |
|
Codex Review: Didn't find any major issues. Bravo. ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
|
@claude review this PR |
|
Claude finished @HendrikReh's task in 57s —— View job PR Review
SummaryAll previously-reported issues are fully resolved. The implementation is correct, well-tested, and consistent with the documentation. This is ready to merge. Previously-Reported Issues — All Verified Fixed
Code Review: Remaining Observations1. Dead chain in
// current (self.expected_evidence is always empty in this branch)
let expected_documents: BTreeSet<String> = if has_explicit_buckets {
self.expected_evidence.iter() // ← dead, always empty
.chain(self.required_evidence.iter())
.chain(self.supporting_evidence.iter())
.cloned()
.collect()Not a bug — the behavior is identical. Could simplify to 2. Supporting-only is a no-op gate by default (by design, documented by test)
What's Correct
Verdict: Approve. Observation #1 is cosmetic and non-blocking. Observation #2 is intentional and already tested. |
|
Codex Review: Didn't find any major issues. Another round soon, please! ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |

Summary
required_evidence,supporting_evidence, andmin_expected_recallfields.expected_evidence-only cases remain strict by default unless an explicitmin_expected_recalloverride is provided.Related Issue
apex-e0iScope
Changed files
crates/rag-server/tests/agentic_eval.rs- add extended benchmark schema parsing, evidence expectation normalization, threshold-based recall checks, and focused unit tests.docs/howto/testing.md- document legacy vs extended benchmark expectation fields forjust agentic-eval.Out of scope
data/evals/agentic_search_v1/corpus./chator/agents/*API contracts.Validation
cargo fmtcargo test -p rag-server --test agentic_evalcargo check -p rag-serverjust testjust integration-testsRisk and Rollback
61abaa3(or revertcrates/rag-server/tests/agentic_eval.rsanddocs/howto/testing.md) to restore strict legacy recall behavior.Reviewer Focus
expected_evidence.required_evidence/min_expected_recallinteraction is mathematically and operationally correct for multi-document benchmark expansion.Generated with Codex