fix: filter empty documents in OpenAI rerank client #1345

Open
yc111233 wants to merge 1 commit into volcengine:main from yc111233:fix/rerank-filter-empty-documents

@yc111233 yc111233 commented Apr 9, 2026

Summary

  • Rerank providers like DashScope (qwen3-rerank) return HTTP 400 when any document in the batch is an empty string
  • This can happen when vector records have empty abstract fields (e.g. the case addressed by "fix: backfill abstract from file content in vectorize_file" #1343)
  • Currently the entire rerank call fails, causing fallback to raw vector scores for all results

Fix

Filter out empty/whitespace-only documents before sending to the rerank API, and map scores back to original indices. Empty documents receive a score of 0.0.

valid_indices = [i for i, d in enumerate(documents) if d and d.strip()]
if not valid_indices:
    return [0.0] * len(documents)
# ... call API with filtered_docs ...
# ... map scores back to original positions ...

This acts as a defensive safety net — rerank degrades gracefully (empty docs get score 0.0) instead of failing the entire batch.
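The two elided steps above (calling the API with the filtered documents and mapping scores back) could look roughly like the sketch below. It is not the code from this PR: `rerank_skipping_empty` and the `score_fn` callable are hypothetical names, with `score_fn` standing in for the actual provider request.

```python
from typing import Callable, List, Sequence

# Hypothetical stand-in for the provider call: takes the query and the
# non-empty documents, returns one score per document in the same order.
ScoreFn = Callable[[str, List[str]], List[float]]


def rerank_skipping_empty(
    query: str, documents: Sequence[str], score_fn: ScoreFn
) -> List[float]:
    # Keep only positions whose text is non-empty after stripping whitespace.
    valid_indices = [i for i, d in enumerate(documents) if d and d.strip()]
    if not valid_indices:
        # Nothing to send; mirrors the behavior described in this PR.
        return [0.0] * len(documents)

    filtered_docs = [documents[i] for i in valid_indices]
    filtered_scores = score_fn(query, filtered_docs)
    if len(filtered_scores) != len(filtered_docs):
        raise ValueError("score_fn must return one score per document")

    # Empty documents keep the default 0.0; valid ones get their provider score
    # written back at their original position.
    scores = [0.0] * len(documents)
    for original_index, score in zip(valid_indices, filtered_scores):
        scores[original_index] = score
    return scores
```

Because the provider only ever sees `filtered_docs`, the returned list still has one entry per input document, so callers that index scores by document position do not need to change.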

Related

Test plan

  • Call rerank_batch with a mix of valid and empty documents
  • Verify valid documents get proper rerank scores
  • Verify empty documents get score 0.0
  • Verify no HTTP 400 errors from DashScope

🤖 Generated with Claude Code

Rerank providers like DashScope (qwen3-rerank) return HTTP 400 when
any document in the batch is an empty string. This can happen when
vector records have empty abstract fields.

Fix: filter out empty/whitespace-only documents before sending to the
rerank API, and map scores back to original indices (empty documents
receive a score of 0.0). This acts as a safety net so that rerank
degrades gracefully instead of failing entirely.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Apr 9, 2026

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

#1343 - Partially compliant

Compliant requirements:

  • Provides defense-in-depth to prevent rerank 400 errors by filtering empty documents

Non-compliant requirements:

  • Does not address the root cause (backfilling abstract in vectorize_file)

Requires further human verification:

  • Verify that empty documents receive a score of 0.0
  • Verify that valid documents get proper rerank scores
  • Verify no HTTP 400 errors from rerank providers
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 92
🧪 No relevant tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

github-actions bot commented Apr 9, 2026

PR Code Suggestions ✨

No code suggestions found for the PR.

@qin-ctx qin-ctx left a comment

Thanks for the defense-in-depth fix. I found one blocking correctness issue and one non-blocking test gap. The main concern is the all-empty batch path, which currently returns zero scores and suppresses the retriever's existing fallback to vector scores.

# empty strings with HTTP 400.
valid_indices = [i for i, d in enumerate(documents) if d and d.strip()]
if not valid_indices:
    return [0.0] * len(documents)

[Bug] (blocking) When every input document is empty or whitespace, this returns an all-zero score list instead of signaling rerank failure. HierarchicalRetriever._rerank_scores() treats any numeric list with the expected length as a successful rerank, so this path bypasses fallback to vector scores and can filter out otherwise retrievable results at the rerank threshold. Mixed batches should keep the current behavior, but an all-empty batch should return None or otherwise trigger the existing fallback path.
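One possible shape for the requested change is sketched below, assuming the caller treats `None` as "rerank failed". The names `rerank_or_none`, `choose_scores`, and `score_fn` are hypothetical; `choose_scores` is only a simplified stand-in for the fallback decision and is not the actual `HierarchicalRetriever._rerank_scores()` implementation.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical stand-in for the provider request.
ScoreFn = Callable[[str, List[str]], List[float]]


def rerank_or_none(
    query: str, documents: Sequence[str], score_fn: ScoreFn
) -> Optional[List[float]]:
    valid_indices = [i for i, d in enumerate(documents) if d and d.strip()]
    if not valid_indices:
        # All-empty batch: signal failure instead of returning all zeros.
        return None

    filtered_scores = score_fn(query, [documents[i] for i in valid_indices])
    # Mixed batch: unchanged behavior, empty documents score 0.0.
    scores = [0.0] * len(documents)
    for original_index, score in zip(valid_indices, filtered_scores):
        scores[original_index] = score
    return scores


def choose_scores(
    rerank_result: Optional[List[float]], vector_scores: Sequence[float]
) -> List[float]:
    # Simplified fallback rule (an assumption): use rerank scores only when a
    # list of the expected length came back, otherwise keep vector scores.
    if rerank_result is None or len(rerank_result) != len(vector_scores):
        return list(vector_scores)
    return list(rerank_result)
```

With this shape, an all-empty batch keeps its vector scores and is not pushed below the rerank threshold by artificial zeros.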


# Filter out empty documents — rerank providers (e.g. DashScope) reject
# empty strings with HTTP 400.
valid_indices = [i for i, d in enumerate(documents) if d and d.strip()]

[Suggestion] (non-blocking) Please add regression tests for the new filtering behavior. The current suite does not cover either a mixed batch like ['doc', '', ' '] with index remapping or the all-empty batch path, so the rerank/fallback semantics introduced here are not locked down.
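A rough sketch of what such regression checks might assert, written against a hypothetical `rerank(query, documents, score_fn)` callable, not the real client API. `RecordingScorer` is a fake provider introduced here for illustration, and the all-empty check assumes the blocking issue above is resolved by returning `None`.

```python
from typing import Callable, List, Optional, Sequence


class RecordingScorer:
    """Fake provider call that records the documents it was asked to score."""

    def __init__(self) -> None:
        self.calls: List[List[str]] = []

    def __call__(self, query: str, documents: List[str]) -> List[float]:
        self.calls.append(list(documents))
        # Distinct scores (1.0, 2.0, ...) make remapping mistakes visible.
        return [float(i + 1) for i in range(len(documents))]


# Assumed signature of the function under test.
RerankFn = Callable[[str, Sequence[str], RecordingScorer], Optional[List[float]]]


def check_mixed_batch(rerank: RerankFn) -> None:
    scorer = RecordingScorer()
    scores = rerank("query", ["doc", "", " "], scorer)
    assert scorer.calls == [["doc"]]  # empty entries never reach the provider
    assert scores == [1.0, 0.0, 0.0]


def check_interleaved_batch(rerank: RerankFn) -> None:
    scorer = RecordingScorer()
    scores = rerank("query", ["", "a", " ", "b"], scorer)
    assert scorer.calls == [["a", "b"]]
    assert scores == [0.0, 1.0, 0.0, 2.0]  # scores land on original indices


def check_all_empty_batch(rerank: RerankFn) -> None:
    scorer = RecordingScorer()
    scores = rerank("query", ["", " "], scorer)
    assert scorer.calls == []  # no request is made for an all-empty batch
    assert scores is None  # caller is expected to fall back to vector scores
```

In the real suite these would presumably be test functions that patch the HTTP layer of the rerank client; passing the function under test as a parameter here just keeps the sketch independent of the client's actual constructor.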
