
[WIP] feat: add local llama-cpp embedding support #1388

Open

Mijamind719 wants to merge 3 commits into volcengine:main from Mijamind719:embedding_local

Conversation

@Mijamind719
Collaborator

@Mijamind719 Mijamind719 commented Apr 12, 2026

Co-authored-by: GPT-5.4

Description

Related Issue

Type of Change

  • Bug fix (non-breaking change that fixes an issue)
  • New feature (non-breaking change that adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation update
  • Refactoring (no functional changes)
  • Performance improvement
  • Test update

Changes Made

Testing

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • I have tested this on the following platforms:
    • Linux
    • macOS
    • Windows

Checklist

  • My code follows the project's coding style
  • I have performed a self-review of my code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • Any dependent changes have been merged and published

Screenshots (if applicable)

Additional Notes

Co-authored-by: GPT-5.4 <noreply@openai.com>
@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🏅 Score: 70
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add embedding metadata validation to collection initialization

Relevant files:

  • openviking/storage/collection_schemas.py
  • openviking/storage/errors.py
  • openviking/storage/viking_vector_index_backend.py
  • openviking/storage/vikingdb_manager.py
  • tests/storage/test_collection_schemas.py

Sub-PR theme: Add local llama-cpp embedding support

Relevant files:

  • openviking/models/embedder/__init__.py
  • openviking/models/embedder/base.py
  • openviking/models/embedder/local_embedders.py
  • openviking_cli/doctor.py
  • openviking_cli/utils/config/embedding_config.py
  • tests/cli/test_doctor.py
  • tests/misc/test_config_validation.py
  • tests/unit/test_local_embedder.py
  • pyproject.toml

⚡ Recommended focus areas for review

Backward Compatibility Break

The init_context_collection function now raises EmbeddingConfigurationError when the storage backend does not implement get_collection_meta, breaking existing deployments that use backends without this method. Previously, the function simply returned False when the collection already existed.

existing_meta = None
if hasattr(storage, "get_collection_meta"):
    existing_meta = await storage.get_collection_meta()

if not existing_meta:
    raise EmbeddingConfigurationError(
        "Existing collection metadata is unavailable; cannot validate embedding compatibility"
    )
Blocking Async Operations

The LocalDenseEmbedder does not override the base class async methods (embed_async, embed_batch_async). The base class default implementation may not properly offload the blocking llama-cpp operations to a thread pool, potentially starving the async event loop.

class LocalDenseEmbedder(DenseEmbedderBase):
    """Dense embedder backed by a local GGUF model via llama-cpp-python."""

    def __init__(
        self,
        model_name: str = DEFAULT_LOCAL_DENSE_MODEL,
        model_path: Optional[str] = None,
        cache_dir: Optional[str] = None,
        dimension: Optional[int] = None,
        query_instruction: Optional[str] = None,
        config: Optional[Dict[str, Any]] = None,
    ):
        runtime_config = dict(config or {})
        runtime_config.setdefault("provider", "local")
        super().__init__(model_name, runtime_config)

        self.model_spec = get_local_model_spec(model_name)
        self.model_path = model_path
        self.cache_dir = cache_dir or DEFAULT_LOCAL_MODEL_CACHE_DIR
        self.query_instruction = (
            query_instruction
            if query_instruction is not None
            else self.model_spec.query_instruction
        )
        self._dimension = dimension or self.model_spec.dimension
        if self._dimension != self.model_spec.dimension:
            raise ValueError(
                f"Local model '{model_name}' has fixed dimension {self.model_spec.dimension}, "
                f"but got dimension={self._dimension}"
            )

        self._resolved_model_path = self._resolve_model_path()
        self._llama = self._load_model()

    def _import_llama(self):
        try:
            module = importlib.import_module("llama_cpp")
        except ImportError as exc:
            raise EmbeddingConfigurationError(
                "Local embedding is enabled but 'llama-cpp-python' is not installed. "
                'Install it with: pip install "openviking[local-embed]". '
                "If you prefer a remote provider, set embedding.dense.provider explicitly in ov.conf."
            ) from exc

        llama_cls = getattr(module, "Llama", None)
        if llama_cls is None:
            raise EmbeddingConfigurationError(
                "llama_cpp.Llama is unavailable in the installed llama-cpp-python package."
            )
        return llama_cls

    def _resolve_model_path(self) -> Path:
        if self.model_path:
            resolved = Path(self.model_path).expanduser().resolve()
            if not resolved.exists():
                raise EmbeddingConfigurationError(
                    f"Local embedding model file not found: {resolved}"
                )
            return resolved

        cache_root = Path(self.cache_dir).expanduser().resolve()
        cache_root.mkdir(parents=True, exist_ok=True)
        target = get_local_model_cache_path(self.model_name, self.cache_dir)
        if target.exists():
            return target

        self._download_model(self.model_spec.download_url, target)
        return target

    def _download_model(self, url: str, target: Path) -> None:
        logger.info("Downloading local embedding model %s to %s", self.model_name, target)
        tmp_target = target.with_suffix(target.suffix + ".part")
        try:
            with requests.get(url, stream=True, timeout=(10, 300)) as response:
                response.raise_for_status()
                with tmp_target.open("wb") as fh:
                    for chunk in response.iter_content(chunk_size=1024 * 1024):
                        if chunk:
                            fh.write(chunk)
            os.replace(tmp_target, target)
        except Exception as exc:
            tmp_target.unlink(missing_ok=True)
            raise EmbeddingConfigurationError(
                f"Failed to download local embedding model '{self.model_name}' from {url} "
                f"to {target}: {exc}"
            ) from exc

    def _load_model(self):
        llama_cls = self._import_llama()
        try:
            return llama_cls(
                model_path=str(self._resolved_model_path),
                embedding=True,
                verbose=False,
            )
        except Exception as exc:
            raise EmbeddingConfigurationError(
                f"Failed to load GGUF embedding model from {self._resolved_model_path}: {exc}"
            ) from exc

    def _format_text(self, text: str, *, is_query: bool) -> str:
        if is_query and self.query_instruction:
            return f"{self.query_instruction}{text}"
        return text

    @staticmethod
    def _extract_embedding(payload: Any) -> List[float]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list) and data:
                item = data[0]
                if isinstance(item, dict) and "embedding" in item:
                    return list(item["embedding"])
            if "embedding" in payload:
                return list(payload["embedding"])
        raise RuntimeError("Unexpected llama-cpp-python embedding response format")

    @staticmethod
    def _extract_embeddings(payload: Any) -> List[List[float]]:
        if isinstance(payload, dict):
            data = payload.get("data")
            if isinstance(data, list):
                vectors: List[List[float]] = []
                for item in data:
                    if not isinstance(item, dict) or "embedding" not in item:
                        raise RuntimeError(
                            "Unexpected llama-cpp-python batch embedding response format"
                        )
                    vectors.append(list(item["embedding"]))
                return vectors
        raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")

    def embed(self, text: str, is_query: bool = False) -> EmbedResult:
        formatted = self._format_text(text, is_query=is_query)

        def _call() -> EmbedResult:
            payload = self._llama.create_embedding(formatted)
            return EmbedResult(dense_vector=self._extract_embedding(payload))

        try:
            result = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local embedding failed: {exc}") from exc

        estimated_tokens = self._estimate_tokens(formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return result

    def embed_batch(self, texts: List[str], is_query: bool = False) -> List[EmbedResult]:
        if not texts:
            return []

        formatted = [self._format_text(text, is_query=is_query) for text in texts]

        def _call() -> List[EmbedResult]:
            payload = self._llama.create_embedding(formatted)
            return [
                EmbedResult(dense_vector=vector) for vector in self._extract_embeddings(payload)
            ]

        try:
            results = self._run_with_retry(
                _call,
                logger=logger,
                operation_name="local batch embedding",
            )
        except Exception as exc:
            raise RuntimeError(f"Local batch embedding failed: {exc}") from exc

        estimated_tokens = sum(self._estimate_tokens(text) for text in formatted)
        self.update_token_usage(
            model_name=self.model_name,
            provider="local",
            prompt_tokens=estimated_tokens,
            completion_tokens=0,
        )
        return results

    def get_dimension(self) -> int:
        return self._dimension

    def close(self):
        close_fn = getattr(self._llama, "close", None)
        if callable(close_fn):
            close_fn()

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General

Suggestion: Add retries for model downloads

Add retry logic for model downloads using the existing _run_with_retry helper to
improve resilience against transient network errors.

openviking/models/embedder/local_embedders.py [145-161]

 def _download_model(self, url: str, target: Path) -> None:
     logger.info("Downloading local embedding model %s to %s", self.model_name, target)
     tmp_target = target.with_suffix(target.suffix + ".part")
-    try:
+
+    def _download():
         with requests.get(url, stream=True, timeout=(10, 300)) as response:
             response.raise_for_status()
             with tmp_target.open("wb") as fh:
                 for chunk in response.iter_content(chunk_size=1024 * 1024):
                     if chunk:
                         fh.write(chunk)
         os.replace(tmp_target, target)
+
+    try:
+        self._run_with_retry(
+            _download,
+            logger=logger,
+            operation_name="local model download",
+        )
     except Exception as exc:
         tmp_target.unlink(missing_ok=True)
         raise EmbeddingConfigurationError(
             f"Failed to download local embedding model '{self.model_name}' from {url} "
             f"to {target}: {exc}"
         ) from exc
Suggestion importance[1-10]: 6

Why: This improves resilience against transient network errors during model downloads by reusing the existing _run_with_retry helper, making the local embedder more robust.

Impact: Low
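For reference, the extraction helpers in the PR assume llama-cpp-python's OpenAI-style response shape, {"data": [{"embedding": [...]}, ...]}. The same parsing logic can be exercised in isolation against a mocked payload (the mock values here are illustrative, and no llama-cpp install is needed):

```python
from typing import Any, List


def extract_embeddings(payload: Any) -> List[List[float]]:
    """Same shape checks as the PR's _extract_embeddings helper."""
    if isinstance(payload, dict):
        data = payload.get("data")
        if isinstance(data, list):
            vectors: List[List[float]] = []
            for item in data:
                if not isinstance(item, dict) or "embedding" not in item:
                    raise RuntimeError(
                        "Unexpected llama-cpp-python batch embedding response format"
                    )
                vectors.append(list(item["embedding"]))
            return vectors
    raise RuntimeError("Unexpected llama-cpp-python batch embedding response format")


# Mocked OpenAI-style payload, matching what create_embedding returns.
mock_payload = {"data": [{"embedding": [0.1, 0.2]}, {"embedding": [0.3, 0.4]}]}
```

Keeping the parser this strict means any upstream change to the llama-cpp-python response shape surfaces as an immediate RuntimeError rather than silently corrupted vectors.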

Mijamind719 and others added 2 commits April 13, 2026 08:39
Co-authored-by: GPT-5.4 <noreply@openai.com>
Legacy issue: investigate true llama-cpp native multi-sequence batch support for local embedding models such as bge-small-zh-v1.5-f16 (current runtime reports n_seq_max=1, so embed_batch uses sequential mode).

Co-authored-by: GPT-5.4 <noreply@openai.com>

Labels

None yet

Projects

Status: Backlog


2 participants