
[WIP] feat: add local llama-cpp embedding support #1388

Open

Mijamind719 wants to merge 3 commits into volcengine:main from Mijamind719:embedding_local

Conversation

@Mijamind719
Collaborator

@Mijamind719 Mijamind719 commented Apr 12, 2026

Co-authored-by: GPT-5.4

Description

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

Co-authored-by: GPT-5.4 <noreply@openai.com>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Suggestion: Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance[1-10]: 6

Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low
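For reference, the extraction helpers in the PR assume llama-cpp-python's OpenAI-style response shape, {"data": [{"embedding": [...]}, ...]}. The same parsing logic can be exercised in isolation against a mocked payload (the mock values here are illustrative, and no llama-cpp install is needed):

```python
from typing import Any, List


def extract_embeddings(payload: Any) -> List[List[float]]:
    """Same shape checks as the PR's _extract_embeddings helper."""
    if isinstance(payload, dict):
        data = payload.get("data")
        if isinstance(data, list):
            vectors: List[List[float]] = []
            for item in data:
                if not isinstance(item, dict) or "embedding" not in item:
                    raise RuntimeError(
                        "Unexpected llama-cpp-python batch embedding response format"
                    )
                vectors.append(list(item["embedding"]))
            return vectors
    raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")


# Mocked OpenAI-style payload, matching what create_embedding returns.
mock_payload = {"data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}]}
```

Keeping the parser this strict means any upstream change to the llama-cpp-python response shape surfaces as an immediate RuntimeError rather than silently corrupted vectors.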

Mijamind719 and others added 2 commits April 13, 2026 08:39
Co-authored-by: GPT-5.4 <noreply@openai.com>
Legacy issue: investigate true llama-cpp native multi-sequence batch support for local embedding models such as bge-small-zh-v1.5-f16 (current runtime reports n_seq_max=1, so embed_batch uses sequential mode).

Co-authored-by: GPT-5.4 <noreply@openai.com>

Labels

None yet

Projects

Status: Backlog


2 participants