
fix: preserve existing abstract during reindex #1419

Open
yc111233 wants to merge 1 commit into volcengine:main from yc111233:fix/preserve-abstract-on-reindex

Conversation

@yc111233
Contributor

Problem

When index_resource() rebuilds the vector index (e.g. after switching embedding models via reindex with regenerate=False), it passes vectorize_file a summary_dict that has no "summary" key:

await vectorize_file(
    file_path=file_uri,
    summary_dict={"name": file_name},  # no "summary" key
    ...
)

This causes Context(abstract=""), which overwrites the existing VLM-generated abstract in the vector index with an empty string.
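The overwrite can be reproduced in isolation. The sketch below is a minimal model, not the project's code: `Context`, `build_context`, `upsert`, and the `docs/a.md` key are hypothetical stand-ins, and it assumes a missing "summary" key falls back to an empty string, as described above.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical stand-in for the record written to the vector index."""
    uri: str
    abstract: str


def build_context(uri: str, summary_dict: dict) -> Context:
    # Assumption: a missing "summary" key falls back to an empty string.
    return Context(uri=uri, abstract=summary_dict.get("summary", ""))


def upsert(index: dict, ctx: Context) -> None:
    # An upsert replaces the stored record as a whole, abstract included.
    index[ctx.uri] = ctx


index = {}
uri = "docs/a.md"  # made-up key for illustration

# Normal write path: the summary is present and stored as the abstract.
upsert(index, build_context(uri, {"name": "a.md", "summary": "VLM abstract"}))

# Reindex path: no "summary" key, so the stored abstract is replaced by "".
upsert(index, build_context(uri, {"name": "a.md"}))
print(repr(index[uri].abstract))
```
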

Impact: Rerank relies on the abstract field to differentiate documents. With empty abstracts, all documents sent to the rerank API are identical ("[empty]" after the DashScope safety filter), resulting in uniform scores and no meaningful ranking.
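A toy illustration of why identical inputs cannot be ranked. Everything here is an assumption made for the example: `toy_score` is a token-overlap stand-in for a reranker, and the "[empty]" substitution only mirrors the behavior described above; neither reflects the actual DashScope API.

```python
def rerank_documents(abstracts, placeholder="[empty]"):
    # Assumption: blank text is replaced by a placeholder before scoring.
    return [a.strip() or placeholder for a in abstracts]


def toy_score(query, doc):
    # Toy relevance: fraction of query tokens found in the document.
    q = set(query.lower().split())
    d = set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0


query = "vector index rebuild"
before = ["how to rebuild a vector index", "release notes for the cli"]
after = ["", ""]  # abstracts wiped by the reindex

scores_before = [toy_score(query, d) for d in rerank_documents(before)]
scores_after = [toy_score(query, d) for d in rerank_documents(after)]

# Distinct abstracts yield distinct scores; identical placeholders do not.
print(len(set(scores_before)), len(set(scores_after)))
```
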

Root Cause

index_resource and the normal write path populate the abstract field through different code paths:

  • Normal write: semantic processor (VLM) generates summary → passed as summary_dict["summary"] → stored as abstract in vector index
  • Reindex: no summary available → summary_dict has no "summary" key → abstract="" → overwrites old value via upsert

Reindex and VLM summary generation are independent concerns, but the current implementation couples them by discarding the old abstract during reindex.

Fix

Before calling vectorize_file, query the existing vector index entry via fetch_by_uri (which already includes abstract in its output fields) and carry forward the old value:

existing_abstract = ""
if vector_store:
    try:
        existing = await vector_store.fetch_by_uri(file_uri, ctx=ctx)
        if existing:
            existing_abstract = existing.get("abstract", "") or ""
    except Exception:
        pass

await vectorize_file(
    file_path=file_uri,
    summary_dict={"name": file_name, "summary": existing_abstract},
    ...
)
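Since the PR ships without automated tests, the carry-forward behavior could be exercised against an in-memory fake along these lines. This is a sketch under assumptions: `FakeVectorStore`, its `upsert` method, and `reindex_file` are hypothetical, and only the `fetch_by_uri` name is taken from the snippet above.

```python
import asyncio


class FakeVectorStore:
    """In-memory stand-in; only the fetch_by_uri name comes from the PR."""

    def __init__(self):
        self.records = {}

    async def fetch_by_uri(self, uri, ctx=None):
        return self.records.get(uri)

    async def upsert(self, uri, record):
        self.records[uri] = record


async def reindex_file(store, uri, name, new_vector):
    # Read the old entry first so its abstract survives the upsert.
    existing = await store.fetch_by_uri(uri)
    abstract = (existing or {}).get("abstract", "") or ""
    await store.upsert(
        uri, {"name": name, "abstract": abstract, "vector": new_vector}
    )


async def demo():
    store = FakeVectorStore()
    await store.upsert(
        "docs/a.md",
        {"name": "a.md", "abstract": "VLM abstract", "vector": [0.1]},
    )
    # Existing entry: new vector, old abstract.
    await reindex_file(store, "docs/a.md", "a.md", [0.9])
    # No prior entry: the abstract falls back to an empty string.
    await reindex_file(store, "docs/new.md", "new.md", [0.5])
    return store.records


records = asyncio.run(demo())
```

The fallback for a missing entry keeps the previous behavior for files that were never summarized, so the change only affects entries that already have an abstract.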

Test Plan

  • Verified on OpenViking 0.3.5 with 8600+ vectors
  • Reindex with new embedding model (BGE-M3 → text-embedding-v4) preserves all VLM abstracts
  • Rerank scores show proper differentiation after reindex (10 unique scores vs all-identical before fix)
  • No impact on regenerate=True path (which uses summarize() instead of build_index())

When `index_resource` rebuilds the vector index (e.g. after switching
embedding models), it passes a `summary_dict` with no `"summary"` key to
`vectorize_file`. This overwrites the VLM-generated `abstract` field with
an empty string, so rerank loses document differentiation: all documents
receive identical scores because the rerank API gets empty text to compare.

Fix: before calling `vectorize_file`, query the existing vector index
entry via `fetch_by_uri` and carry forward the old `abstract` value.
This decouples reindex (embedding model change) from VLM summary
generation, which are independent concerns.

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 90
🧪 No relevant tests
🔒 No security concerns identified
📝 TODO sections

🔀 No multiple PR themes
⚡ Recommended focus areas for review

Error Handling

The broad except Exception: pass swallows every error without logging, which can hide real failures when fetching existing abstracts. Abstracts could then silently fail to be preserved, with nothing in the logs to show it.

except Exception:
    pass
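One way to address this while keeping the lookup best-effort is to log the failure before falling back. The helper below is a hedged sketch, not code from the PR: `fetch_existing_abstract`, the logger name, and `BrokenStore` are hypothetical, and only `fetch_by_uri` and its `ctx` keyword come from the snippet in the description.

```python
import asyncio
import logging

logger = logging.getLogger("reindex_sketch")  # hypothetical logger name


async def fetch_existing_abstract(vector_store, file_uri, ctx=None):
    """Return the stored abstract, or "" if it is missing or the lookup fails."""
    if vector_store is None:
        return ""
    try:
        existing = await vector_store.fetch_by_uri(file_uri, ctx=ctx)
    except Exception:
        # Still best-effort, but the failure is now visible with a traceback.
        logger.warning(
            "Could not fetch existing abstract for %s", file_uri, exc_info=True
        )
        return ""
    if not existing:
        return ""
    return existing.get("abstract", "") or ""


class BrokenStore:
    """Fake store whose lookup always fails, to exercise the fallback."""

    async def fetch_by_uri(self, uri, ctx=None):
        raise RuntimeError("index unavailable")


result = asyncio.run(fetch_existing_abstract(BrokenStore(), "docs/a.md"))
```

Keeping the `try` block limited to the fetch call means errors in later processing are not caught by accident.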

@github-actions

PR Code Suggestions ✨

No code suggestions found for the PR.

