Commit 5cfef01

pragnyanramtha authored and haranrk committed
fix(eval): handle unevaluated final response v2 results
Merge #5728

## Summary

Fixes a small aggregation edge case in `FinalResponseMatchV2Evaluator`: when every per-invocation result is skipped or not evaluated, the evaluator currently divides by zero while computing the overall score.

## Root Cause

`aggregate_invocation_results()` filters out results whose `score` is `None` or whose `eval_status` is `NOT_EVALUATED`, but it unconditionally computes:

```python
overall_score = num_valid / num_evaluated
```

If all judge samples fail to produce a usable score, `num_evaluated` remains `0` and evaluation crashes instead of returning a not-evaluated aggregate result. Other ADK evaluators handle this condition by returning `overall_score=None` and `overall_eval_status=NOT_EVALUATED`.

## Change

- Return an `EvaluationResult` with `overall_score=None` and `overall_eval_status=NOT_EVALUATED` when no FinalResponseMatchV2 invocation results are evaluable.
- Add a focused regression test for all-skipped/all-not-evaluated invocation results.

## Validation

```bash
uv sync --extra test
uv run pytest tests/unittests/evaluation/test_final_response_match_v2.py
```

Result: `18 passed, 20 warnings`.

Full unit suite was not run; this patch is limited to FinalResponseMatchV2 aggregation and its targeted unit test file.

Co-authored-by: Haran Rajkumar <haranrk@google.com>
COPYBARA_INTEGRATE_REVIEW=#5728 from pragnyanramtha:pragnyan/final-response-v2-no-eval-guard 3d5ab73
PiperOrigin-RevId: 933818272
1 parent a546bcf commit 5cfef01
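The failure mode named in the commit message can be reproduced outside ADK with a few lines of plain Python. This is only a minimal sketch: `unguarded_overall_score` is a hypothetical stand-in, not the ADK method, and it models a skipped result simply as a `None` score.

```python
from typing import Optional


def unguarded_overall_score(scores: list[Optional[float]]) -> float:
  """Hypothetical stand-in: averages usable scores with no empty-set check."""
  num_valid = 0.0
  num_evaluated = 0
  for score in scores:
    if score is None:  # Skipped or not-evaluated result.
      continue
    num_evaluated += 1
    num_valid += score
  # Raises ZeroDivisionError when every score was filtered out above.
  return num_valid / num_evaluated


# Two usable scores, one of them valid.
print(unguarded_overall_score([1.0, None, 0.0]))

# No usable scores: the division fails instead of reporting "not evaluated".
try:
  unguarded_overall_score([None, None])
except ZeroDivisionError as exc:
  print(f"aggregation crashed: {exc!r}")
```

The second call is the case the patch targets: nothing survives the filter, so the denominator is still zero when the division runs.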

2 files changed

Lines changed: 39 additions & 0 deletions

src/google/adk/evaluation/final_response_match_v2.py

Lines changed: 8 additions & 0 deletions
@@ -237,6 +237,14 @@ def aggregate_invocation_results(
         continue
       num_evaluated += 1
       num_valid += result.score
+
+    if num_evaluated == 0:
+      return EvaluationResult(
+          overall_score=None,
+          overall_eval_status=EvalStatus.NOT_EVALUATED,
+          per_invocation_results=per_invocation_results,
+      )
+
     overall_score = num_valid / num_evaluated
     return EvaluationResult(
         overall_score=overall_score,
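The guard in this hunk can be illustrated in isolation. The sketch below is not the ADK implementation: `Status`, `Invocation`, `Aggregate`, and `aggregate` are hypothetical stand-ins for `EvalStatus`, `PerInvocationResult`, `EvaluationResult`, and `aggregate_invocation_results()`, and the pass rule (`overall >= threshold`) is an assumption chosen to agree with the existing test, where a score of 0.5 passes at threshold 0.5.

```python
import enum
from dataclasses import dataclass
from typing import Optional


class Status(enum.Enum):  # Hypothetical stand-in for EvalStatus.
  PASSED = 1
  FAILED = 2
  NOT_EVALUATED = 3


@dataclass
class Invocation:  # Hypothetical stand-in for PerInvocationResult.
  score: Optional[float]
  status: Status


@dataclass
class Aggregate:  # Hypothetical stand-in for EvaluationResult.
  overall_score: Optional[float]
  overall_status: Status


def aggregate(results: list[Invocation], threshold: float) -> Aggregate:
  """Averages usable scores, returning NOT_EVALUATED when none are usable."""
  # A result is usable only if it has a score AND was actually evaluated.
  usable = [
      r.score
      for r in results
      if r.score is not None and r.status != Status.NOT_EVALUATED
  ]
  if not usable:
    # Guard: nothing to average, so report "not evaluated" without dividing.
    return Aggregate(None, Status.NOT_EVALUATED)
  overall = sum(usable) / len(usable)
  # Assumed pass rule: at or above the threshold counts as PASSED.
  status = Status.PASSED if overall >= threshold else Status.FAILED
  return Aggregate(overall, status)


# Same shape as the regression test: one result has no score, the other has a
# score but is still marked NOT_EVALUATED, so neither is usable.
print(aggregate(
    [
        Invocation(None, Status.NOT_EVALUATED),
        Invocation(1.0, Status.NOT_EVALUATED),
    ],
    threshold=0.5,
))
```

Checking for the empty case before dividing keeps the normal path unchanged and lets callers tell "no usable judge output" apart from a genuine score of 0.0.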

tests/unittests/evaluation/test_final_response_match_v2.py

Lines changed: 31 additions & 0 deletions
@@ -561,3 +561,34 @@ def test_aggregate_invocation_results():
   # Only 4 / 8 invocations are evaluated, and 2 / 4 are valid.
   assert aggregated_result.overall_score == 0.5
   assert aggregated_result.overall_eval_status == EvalStatus.PASSED
+
+
+def test_aggregate_invocation_results_none_evaluated():
+  evaluator = _create_test_evaluator_gemini(threshold=0.5)
+
+  actual_invocation, expected_invocation = _create_test_invocations(
+      "candidate text", "reference text"
+  )
+
+  per_invocation_results = [
+      PerInvocationResult(
+          actual_invocation=actual_invocation,
+          expected_invocation=expected_invocation,
+          score=None,
+          eval_status=EvalStatus.NOT_EVALUATED,
+      ),
+      PerInvocationResult(
+          actual_invocation=actual_invocation,
+          expected_invocation=expected_invocation,
+          score=1.0,
+          eval_status=EvalStatus.NOT_EVALUATED,
+      ),
+  ]
+
+  aggregated_result = evaluator.aggregate_invocation_results(
+      per_invocation_results
+  )
+
+  assert aggregated_result.overall_score is None
+  assert aggregated_result.overall_eval_status == EvalStatus.NOT_EVALUATED
+  assert aggregated_result.per_invocation_results == per_invocation_results
