
Conversation

KavinKrishnan (Contributor) commented Oct 2, 2025

Overview:

Moved ModelExpress (MX) downloads to before vLLM worker initialization to prevent duplicate model pulls.

Details:

Dynamo ModelExpress Integration Summary

The changes integrate ModelExpress as a model caching layer into the vLLM worker startup process, enabling centralized model caching on a shared PVC. This yields significant speedups for subsequent workers (34 minutes → 3.4 minutes in testing) while maintaining robust fallback behavior.

New File: lib/bindings/python/rust/llm/hub.rs

Created a Rust-to-Python binding that exposes the from_hf() function to Python. The function downloads models from HuggingFace using the ModelExpress client library, supports an ignore_weights parameter for partial downloads, and returns the local filesystem path of the cached model. It serves as the bridge between the Python vLLM code and the Rust ModelExpress client.
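A minimal sketch of what such a binding could look like, assuming a pyo3-async-runtimes Tokio runtime and a core dynamo_llm::hub::from_hf function as suggested by the review's code graph; everything beyond the from_hf and add_to_module names is an assumption, not the actual implementation:

```rust
// Hypothetical sketch of lib/bindings/python/rust/llm/hub.rs; the exact
// signatures of dynamo_llm::hub::from_hf and to_pyerr are assumptions.
use pyo3::prelude::*;

fn to_pyerr<E: std::fmt::Display>(err: E) -> PyErr {
    pyo3::exceptions::PyException::new_err(err.to_string())
}

/// Download `model_name` via ModelExpress and return the local cache path.
#[pyfunction]
#[pyo3(signature = (model_name, ignore_weights = false))]
fn from_hf<'py>(py: Python<'py>, model_name: String, ignore_weights: bool) -> PyResult<Bound<'py, PyAny>> {
    // Return an awaitable to Python so callers can wrap it in asyncio.wait_for.
    pyo3_async_runtimes::tokio::future_into_py(py, async move {
        let path = dynamo_llm::hub::from_hf(&model_name, ignore_weights)
            .await
            .map_err(to_pyerr)?;
        Ok(path.display().to_string())
    })
}

/// Register the hub functions on the parent Python module.
pub fn add_to_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(from_hf, m)?)?;
    Ok(())
}
```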

Modified: components/src/dynamo/vllm/main.py

Added a new async pull_model() function that pre-downloads models via ModelExpress before vLLM engine initialization. The function uses a 30-minute timeout for large model downloads and, on success, updates the vLLM config to point at the resolved local cached path. On timeout or failure, it keeps the original HuggingFace model name, allowing vLLM to leverage the shared HF_HUB_CACHE directory that ModelExpress populates. The function is called in the worker startup flow immediately after port configuration.
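A minimal sketch of the described flow, assuming the binding is importable as dynamo._core.from_hf (matching the import shown in the review) and that Config exposes model and engine_args.model fields; the logger and timeout constant names are illustrative only:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

MODEL_PULL_TIMEOUT_S = 30 * 60  # 30-minute budget for large model downloads


async def pull_model(config):
    """Pre-pull the model via ModelExpress; fall back to the HF name on failure."""
    # Imported lazily, mirroring the PR's in-function import.
    from dynamo._core import from_hf

    try:
        local_path = await asyncio.wait_for(
            from_hf(config.model), timeout=MODEL_PULL_TIMEOUT_S
        )
        # Point both the worker config and the engine args at the cached copy.
        config.model = local_path
        config.engine_args.model = local_path
    except asyncio.TimeoutError:
        logger.warning(
            "ModelExpress pull timed out; vLLM will resolve %s itself", config.model
        )
    except Exception as exc:
        logger.warning(
            "ModelExpress pull failed (%s); vLLM will resolve %s itself",
            exc,
            config.model,
        )
```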

Modified: lib/bindings/python/rust/lib.rs

Added the module registration llm::hub::add_to_module(m)? to expose the new hub functionality to Python, making the from_hf() function available for import in Python code.
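For context, the registration would sit inside the existing module-init function in lib.rs alongside the other entries; the _core function name here is an assumption based on the dynamo._core import path shown in the review:

```rust
#[pymodule]
fn _core(m: &Bound<'_, PyModule>) -> PyResult<()> {
    // ... existing class and function registrations ...
    llm::hub::add_to_module(m)?;
    Ok(())
}
```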

Modified: lib/bindings/python/rust/llm.rs

Added the module declaration pub mod hub; to enable the Rust compiler to include the new hub module in the build process.
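For completeness, the declaration is a single line alongside the other submodule declarations:

```rust
// lib/bindings/python/rust/llm.rs
pub mod hub;
```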

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Models are now pre-resolved and cached before vLLM starts, reducing cold-start time and using local paths when available.
    • Automatic fallback to standard downloads on timeouts or failures, with clear warnings to aid troubleshooting.
    • Added a Python-accessible helper to download Hugging Face models via an optimized hub service, returning the local model path.
    • Works transparently with existing configurations; no changes required.

@KavinKrishnan KavinKrishnan requested review from a team as code owners October 2, 2025 17:07

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@KavinKrishnan KavinKrishnan marked this pull request as draft October 2, 2025 17:09
coderabbitai bot (Contributor) commented Oct 2, 2025

Walkthrough

Adds async model pre-pull in vLLM runtime to resolve/cache model path before argument overwrite. Introduces a Rust PyO3 “hub” binding exposing from_hf and wires it into the Python module initialization, enabling server-first ModelExpress downloads with fallback to direct download.

Changes

Cohort / File(s): Summary

vLLM runtime pre-pull (components/src/dynamo/vllm/main.py):
Adds async pull_model(config) to resolve/cache the model via ModelExpress; updates config.model and engine_args.model to the local path; invoked early in worker(runtime); handles asyncio.TimeoutError and general exceptions with warnings and fallback to the original model.

Python bindings: LLM hub (lib/bindings/python/rust/lib.rs, lib/bindings/python/rust/llm.rs, lib/bindings/python/rust/llm/hub.rs):
Exposes the hub module to Python: registers llm::hub in the _core init; adds pub mod hub; implements from_hf(model_name, ignore_weights=false) using the async runtime to download and return the local path; adds add_to_module for registration; maps errors to Python exceptions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Worker(runtime)
  participant V as vLLM Main
  participant H as Hub Binding (Rust/PyO3)
  participant ME as ModelExpress/Server
  participant HF as HF Hub (fallback)

  U->>V: start()
  Note over V: Configure ports
  V->>H: from_hf(model_name, ignore_weights=false)
  alt Server-available
    H->>ME: request download/resolve(model_name)
    ME-->>H: local_path
  else Server-unavailable or error
    H->>HF: direct download(model_name)
    HF-->>H: local_path or error
  end
  alt Success
    H-->>V: local_path
    V->>V: set config.model and engine_args.model
  else Timeout/Error
    H--x V: raise error
    V->>V: warn and keep original model
  end
  V->>V: overwrite_args(config)
  V-->>U: continue startup
sequenceDiagram
  autonumber
  participant Py as Python Caller
  participant Hub as llm.hub.from_hf
  participant RT as pyo3 async runtime
  participant Core as dynamo_llm::hub::from_hf

  Py->>Hub: from_hf(model_name, ignore_weights?)
  Hub->>RT: spawn/await async
  RT->>Core: from_hf(model_name, ignore_weights)
  Core-->>RT: Result(local_path | Error)
  alt Ok
    RT-->>Hub: local_path
    Hub-->>Py: str(local_path)
  else Err
    RT--x Hub: Error
    Hub-->>Py: Python exception
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I tug a thread from the model moon,
Cache burrows warm, I hum a tune.
Rust whispers “hub,” and Python grins—
Paths turn local, loading begins.
If storms delay the carrot cart,
I fallback swift—ears up, smart.
Hop, fetch, serve: a latency art. 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)

Description Check: ⚠️ Warning
The pull request description includes the required template headings, but the “Where should the reviewer start?” section is left empty and the Related Issues section still uses a placeholder issue number. Without specifying the files or areas to review and replacing the placeholder with an actual issue reference, the description does not fully meet the template requirements; it is therefore incomplete.
Resolution: Please populate the “Where should the reviewer start?” section with specific file paths or code areas for review, and update the Related Issues section to reference the actual GitHub issue number instead of the placeholder.

✅ Passed checks (2 passed)

Title Check: ✅ Passed
The title accurately reflects the primary goal of enabling vLLM to call the ModelExpress (MX) service within Dynamo and is concise and relevant to the changeset, though the “Draft:” prefix is unnecessary.

Docstring Coverage: ✅ Passed
Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)

44-44: Remove the duplicate asyncio import.

The asyncio module is already imported at Line 4, making this import redundant.

Apply this diff:

```diff
-    import asyncio
     from dynamo._core import from_hf
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2aedf and 107dfec.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/main.py (2 hunks)
  • lib/bindings/python/rust/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm.rs (1 hunks)
  • lib/bindings/python/rust/llm/hub.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/bindings/python/rust/lib.rs (1)
lib/bindings/python/rust/llm/hub.rs (1)
  • add_to_module (34-37)
lib/bindings/python/rust/llm/hub.rs (1)
lib/bindings/python/rust/lib.rs (17)
  • to_pyerr (197-202)
  • m (138-153): sixteen consecutive per-line registrations
components/src/dynamo/vllm/main.py (2)
components/src/dynamo/vllm/args.py (1)
  • Config (39-64)
lib/bindings/python/rust/llm/hub.rs (1)
  • from_hf (20-32)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3374/merge) by KavinKrishnan.
lib/bindings/python/rust/llm/hub.rs

[error] 1-1: trailing-whitespace: files were modified by this hook during pre-commit (pre-commit trailing-whitespace).

components/src/dynamo/vllm/main.py

[error] 1-1: isort: files were modified by this hook during pre-commit (pre-commit isort).


[error] 1-1: black: reformatted components/src/dynamo/vllm/main.py during pre-commit.

🪛 Ruff (0.13.2)
components/src/dynamo/vllm/main.py

69-69: Do not catch blind exception: Exception

(BLE001)
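When a broad catch is intentional, as the review below argues it is here, a common Ruff idiom is an inline suppression with a short justification; this is general guidance, not code from the PR (risky_pull and fall_back are placeholder names):

```python
try:
    risky_pull()
except Exception:  # noqa: BLE001 - intentional: any failure falls back to the HF name
    fall_back()
```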

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (launch/dynamo-run)
🔇 Additional comments (6)
components/src/dynamo/vllm/main.py (2)

35-74: LGTM with one observation.

The implementation correctly handles the async model download with appropriate timeouts and fallback behavior. The broad exception catch at Line 69 is appropriate here to ensure vLLM can still operate with the original model name if ModelExpress fails.

However, note the pipeline failures indicate formatting issues (isort, black) that need to be addressed before merge.


109-112: LGTM!

The placement of pull_model is correct—after port configuration and before overwrite_args, ensuring the model path is resolved before VLLM engine initialization.

lib/bindings/python/rust/lib.rs (1)

189-189: LGTM!

The hub module initialization follows the established pattern and is correctly placed in the module setup sequence.

lib/bindings/python/rust/llm.rs (1)

32-32: LGTM!

The hub module export is correctly placed and follows the established pattern for other submodules.

lib/bindings/python/rust/llm/hub.rs (2)

7-32: LGTM with pipeline failure note.

The implementation correctly:

  • Wraps the async Rust function for Python
  • Handles path-to-string conversion appropriately
  • Maps errors to Python exceptions

However, the pipeline failure indicates trailing whitespace that needs to be removed before merge.


34-37: LGTM!

The module registration helper follows the standard PyO3 pattern and correctly exposes the from_hf function to Python.

ayushag-nv (Contributor) commented

@KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.

KavinKrishnan (Contributor, Author) commented

> @KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.

Will do @ayushag-nv. I am waiting on some guidance from @grahamking, as I am not sure whether injecting MX (from_hf()) directly into the Dynamo backend implementations (in this case vLLM) is OK.

grahamking (Contributor) commented

> @KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.
>
> Will do @ayushag-nv. I am waiting on some guidance from @grahamking, as I am not sure whether injecting MX (from_hf()) directly into the Dynamo backend implementations (in this case vLLM) is OK.

Apologies, I didn't realise you were waiting for me.

grahamking (Contributor) commented

The model is downloaded in register_llm. That should be transparent to the engine; it can assume the model exists on disk when it starts.

Is that not working correctly?

I'd rather fix the init order if that's the issue, rather than expose model downloading to the Python layer. The model may come from somewhere else; it might be silently moved to this machine from somewhere else in the data center (via NIXL etc.). We don't want to make clients handle all that.
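A rough illustration of the ordering described above, where registration resolves the model to a local path before the engine starts; register_llm is the only name taken from the comment, the rest is hypothetical:

```python
async def start_worker(config):
    # register_llm downloads or otherwise materializes the model (ModelExpress,
    # NIXL transfer, direct pull) and returns a local path; the engine never
    # learns where the bytes came from.
    local_path = await register_llm(config.model)

    # By the time the engine starts, the model is guaranteed to be on disk.
    return await start_engine(model_path=local_path)
```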

grahamking (Contributor) left a comment

Blocking because we shouldn't be exposing a new Python binding, especially not such a specific one.
