fix: filter empty documents in OpenAI rerank client#1345
fix: filter empty documents in OpenAI rerank client#1345yc111233 wants to merge 1 commit intovolcengine:mainfrom
Conversation
Rerank providers like DashScope (qwen3-rerank) return HTTP 400 when any document in the batch is an empty string. This can happen when vector records have empty abstract fields. Fix: filter out empty/whitespace-only documents before sending to the rerank API, and map scores back to original indices (empty documents receive a score of 0.0). This acts as a safety net so that rerank degrades gracefully instead of failing entirely. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
PR Reviewer Guide 🔍Here are some key observations to aid the review process:
|
PR Code Suggestions ✨No code suggestions found for the PR. |
qin-ctx
left a comment
There was a problem hiding this comment.
Thanks for the defense-in-depth fix. I found one blocking correctness issue and one non-blocking test gap. The main concern is the all-empty batch path, which currently returns zero scores and suppresses the retriever's existing fallback to vector scores.
| # empty strings with HTTP 400. | ||
| valid_indices = [i for i, d in enumerate(documents) if d and d.strip()] | ||
| if not valid_indices: | ||
| return [0.0] * len(documents) |
There was a problem hiding this comment.
[Bug] (blocking) When every input document is empty or whitespace, this returns an all-zero score list instead of signaling rerank failure. HierarchicalRetriever._rerank_scores() treats any numeric list with the expected length as a successful rerank, so this path bypasses fallback to vector scores and can filter out otherwise retrievable results at the rerank threshold. Mixed batches should keep the current behavior, but an all-empty batch should return None or otherwise trigger the existing fallback path.
|
|
||
| # Filter out empty documents — rerank providers (e.g. DashScope) reject | ||
| # empty strings with HTTP 400. | ||
| valid_indices = [i for i, d in enumerate(documents) if d and d.strip()] |
There was a problem hiding this comment.
[Suggestion] (non-blocking) Please add regression tests for the new filtering behavior. The current suite does not cover either a mixed batch like ['doc', '', ' '] with index remapping or the all-empty batch path, so the rerank/fallback semantics introduced here are not locked down.
Summary
abstractfields (e.g. due to fix: backfill abstract from file content in vectorize_file #1343)Fix
Filter out empty/whitespace-only documents before sending to the rerank API, and map scores back to original indices. Empty documents receive a score of 0.0.
This acts as a defensive safety net — rerank degrades gracefully (empty docs get score 0.0) instead of failing the entire batch.
Related
Test plan
rerank_batchwith a mix of valid and empty documents🤖 Generated with Claude Code