
Conversation

KavinKrishnan (Contributor) commented Oct 2, 2025

Overview:

Moved ModelExpress (MX) downloads to before vLLM worker initialization to prevent duplicate model pulls.

Details:

Dynamo ModelExpress Integration Summary

The changes integrate ModelExpress as a model caching layer into the vLLM worker startup process, enabling centralized model caching on a shared PVC. This yields significant speedups for subsequent workers (34 minutes → 3.4 minutes in testing) while maintaining robust fallback behavior.

New File: lib/bindings/python/rust/llm/hub.rs

Created a Rust-to-Python binding that exposes the from_hf() function to Python. The function downloads models from HuggingFace using the ModelExpress client library, supports an ignore_weights parameter for partial downloads, and returns the local filesystem path of the cached model. It serves as the bridge between the Python vLLM code and the Rust ModelExpress client.
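A minimal sketch of what such a binding could look like, assuming a pyo3-async-runtimes Tokio runtime and a core dynamo_llm::hub::from_hf function as suggested by the review's code graph; everything beyond the from_hf and add_to_module names is an assumption, not the actual implementation:

```rust
// Hypothetical sketch of lib/bindings/python/rust/llm/hub.rs; the exact
// signatures of dynamo_llm::hub::from_hf and to_pyerr are assumptions.
use pyo3::prelude::*;

fn to_pyerr<E: std::fmt::Display>(err: E) -> PyErr {
    pyo3::exceptions::PyException::new_err(err.to_string())
}

/// Download `model_name` via ModelExpress and return the local cache path.
#[pyfunction]
#[pyo3(signature = (model_name, ignore_weights = false))]
fn from_hf<'py>(py: Python<'py>, model_name: String, ignore_weights: bool) -> PyResult<Bound<'py, PyAny>> {
    // Return an awaitable to Python so callers can wrap it in asyncio.wait_for.
    pyo3_async_runtimes::tokio::future_into_py(py, async move {
        let path = dynamo_llm::hub::from_hf(&model_name, ignore_weights)
            .await
            .map_err(to_pyerr)?;
        Ok(path.display().to_string())
    })
}

/// Register the hub functions on the parent Python module.
pub fn add_to_module(m: &Bound<'_, PyModule>) -> PyResult<()> {
    m.add_function(wrap_pyfunction!(from_hf, m)?)?;
    Ok(())
}
```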

Modified: components/src/dynamo/vllm/main.py

Added a new async pull_model() function that pre-downloads models via ModelExpress before vLLM engine initialization. The function uses a 30-minute timeout for large model downloads and, on success, updates the vLLM config to point at the resolved local cached path. On timeout or failure, it keeps the original HuggingFace model name, allowing vLLM to leverage the shared HF_HUB_CACHE directory that ModelExpress populates. The function is called in the worker startup flow immediately after port configuration.
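A minimal sketch of the described flow, assuming the binding is importable as dynamo._core.from_hf (matching the import shown in the review) and that Config exposes model and engine_args.model fields; the logger and timeout constant names are illustrative only:

```python
import asyncio
import logging

logger = logging.getLogger(__name__)

MODEL_PULL_TIMEOUT_S = 30 * 60  # 30-minute budget for large model downloads


async def pull_model(config):
    """Pre-pull the model via ModelExpress; fall back to the HF name on failure."""
    # Imported lazily, mirroring the PR's in-function import.
    from dynamo._core import from_hf

    try:
        local_path = await asyncio.wait_for(
            from_hf(config.model), timeout=MODEL_PULL_TIMEOUT_S
        )
        # Point both the worker config and the engine args at the cached copy.
        config.model = local_path
        config.engine_args.model = local_path
    except asyncio.TimeoutError:
        logger.warning(
            "ModelExpress pull timed out; vLLM will resolve %s itself", config.model
        )
    except Exception as exc:
        logger.warning(
            "ModelExpress pull failed (%s); vLLM will resolve %s itself",
            exc,
            config.model,
        )
```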

Modified: lib/bindings/python/rust/lib.rs

Added the module registration llm::hub::add_to_module(m)? to expose the new hub functionality to Python, making the from_hf() function available for import in Python code.
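For context, the registration would sit inside the existing module-init function in lib.rs alongside the other entries; the _core function name here is an assumption based on the dynamo._core import path shown in the review:

```rust
#[pymodule]
fn _core(m: &Bound<'_, PyModule>) -> PyResult<()> {
    // ... existing class and function registrations ...
    llm::hub::add_to_module(m)?;
    Ok(())
}
```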

Modified: lib/bindings/python/rust/llm.rs

Added the module declaration pub mod hub; to enable the Rust compiler to include the new hub module in the build process.
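For completeness, the declaration is a single line alongside the other submodule declarations:

```rust
// lib/bindings/python/rust/llm.rs
pub mod hub;
```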

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

Summary by CodeRabbit

  • New Features
    • Models are now pre-resolved and cached before vLLM starts, reducing cold-start time and using local paths when available.
    • Automatic fallback to standard downloads on timeouts or failures, with clear warnings to aid troubleshooting.
    • Added a Python-accessible helper to download Hugging Face models via an optimized hub service, returning the local model path.
    • Works transparently with existing configurations; no changes required.

@KavinKrishnan KavinKrishnan requested review from a team as code owners October 2, 2025 17:07

copy-pr-bot bot commented Oct 2, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@KavinKrishnan KavinKrishnan marked this pull request as draft October 2, 2025 17:09
coderabbitai bot (Contributor) commented Oct 2, 2025

Walkthrough

Adds async model pre-pull in vLLM runtime to resolve/cache model path before argument overwrite. Introduces a Rust PyO3 “hub” binding exposing from_hf and wires it into the Python module initialization, enabling server-first ModelExpress downloads with fallback to direct download.

Changes

Cohort / File(s): Summary

vLLM runtime pre-pull (components/src/dynamo/vllm/main.py):
Adds async pull_model(config) to resolve/cache the model via ModelExpress; updates config.model and engine_args.model to the local path; invoked early in worker(runtime); handles asyncio.TimeoutError and general exceptions with warnings and fallback to the original model.

Python bindings: LLM hub (lib/bindings/python/rust/lib.rs, lib/bindings/python/rust/llm.rs, lib/bindings/python/rust/llm/hub.rs):
Exposes the hub module to Python: registers llm::hub in the _core init; adds pub mod hub; implements from_hf(model_name, ignore_weights=false) using the async runtime to download and return the local path; adds add_to_module for registration; maps errors to Python exceptions.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant U as Worker(runtime)
  participant V as vLLM Main
  participant H as Hub Binding (Rust/PyO3)
  participant ME as ModelExpress/Server
  participant HF as HF Hub (fallback)

  U->>V: start()
  Note over V: Configure ports
  V->>H: from_hf(model_name, ignore_weights=false)
  alt Server-available
    H->>ME: request download/resolve(model_name)
    ME-->>H: local_path
  else Server-unavailable or error
    H->>HF: direct download(model_name)
    HF-->>H: local_path or error
  end
  alt Success
    H-->>V: local_path
    V->>V: set config.model and engine_args.model
  else Timeout/Error
    H--x V: raise error
    V->>V: warn and keep original model
  end
  V->>V: overwrite_args(config)
  V-->>U: continue startup
sequenceDiagram
  autonumber
  participant Py as Python Caller
  participant Hub as llm.hub.from_hf
  participant RT as pyo3 async runtime
  participant Core as dynamo_llm::hub::from_hf

  Py->>Hub: from_hf(model_name, ignore_weights?)
  Hub->>RT: spawn/await async
  RT->>Core: from_hf(model_name, ignore_weights)
  Core-->>RT: Result(local_path | Error)
  alt Ok
    RT-->>Hub: local_path
    Hub-->>Py: str(local_path)
  else Err
    RT--x Hub: Error
    Hub-->>Py: Python exception
  end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I tug a thread from the model moon,
Cache burrows warm, I hum a tune.
Rust whispers “hub,” and Python grins—
Paths turn local, loading begins.
If storms delay the carrot cart,
I fallback swift—ears up, smart.
Hop, fetch, serve: a latency art. 🥕🐇

Pre-merge checks

❌ Failed checks (1 warning)

Description Check: ⚠️ Warning
The pull request description includes the required template headings, but the “Where should the reviewer start?” section is left empty and the Related Issues section still uses a placeholder issue number. Without specifying the files or areas to review and replacing the placeholder with an actual issue reference, the description does not fully meet the template requirements; it is therefore incomplete.
Resolution: Please populate the “Where should the reviewer start?” section with specific file paths or code areas for review, and update the Related Issues section to reference the actual GitHub issue number instead of the placeholder.

✅ Passed checks (2 passed)

Title Check: ✅ Passed
The title accurately reflects the primary goal of enabling vLLM to call the ModelExpress (MX) service within Dynamo and is concise and relevant to the changeset, though the “Draft:” prefix is unnecessary.

Docstring Coverage: ✅ Passed
Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%.


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)

44-44: Remove the duplicate asyncio import.

The asyncio module is already imported at Line 4, making this import redundant.

Apply this diff:

```diff
-    import asyncio
     from dynamo._core import from_hf
```
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6f2aedf and 107dfec.

📒 Files selected for processing (4)
  • components/src/dynamo/vllm/main.py (2 hunks)
  • lib/bindings/python/rust/lib.rs (1 hunks)
  • lib/bindings/python/rust/llm.rs (1 hunks)
  • lib/bindings/python/rust/llm/hub.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
lib/bindings/python/rust/lib.rs (1)
lib/bindings/python/rust/llm/hub.rs (1)
  • add_to_module (34-37)
lib/bindings/python/rust/llm/hub.rs (1)
lib/bindings/python/rust/lib.rs (17)
  • to_pyerr (197-202)
  • m (138-153): sixteen consecutive per-line registrations
components/src/dynamo/vllm/main.py (2)
components/src/dynamo/vllm/args.py (1)
  • Config (39-64)
lib/bindings/python/rust/llm/hub.rs (1)
  • from_hf (20-32)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3374/merge) by KavinKrishnan.
lib/bindings/python/rust/llm/hub.rs

[error] 1-1: trailing-whitespace: files were modified by this hook during pre-commit (pre-commit trailing-whitespace).

components/src/dynamo/vllm/main.py

[error] 1-1: isort: files were modified by this hook during pre-commit (pre-commit isort).


[error] 1-1: black: reformatted components/src/dynamo/vllm/main.py during pre-commit.

🪛 Ruff (0.13.2)
components/src/dynamo/vllm/main.py

69-69: Do not catch blind exception: Exception

(BLE001)
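When a broad catch is intentional, as the review below argues it is here, a common Ruff idiom is an inline suppression with a short justification; this is general guidance, not code from the PR (risky_pull and fall_back are placeholder names):

```python
try:
    risky_pull()
except Exception:  # noqa: BLE001 - intentional: any failure falls back to the HF name
    fall_back()
```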

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: Build and Test - dynamo
  • GitHub Check: clippy (lib/bindings/python)
  • GitHub Check: clippy (.)
  • GitHub Check: clippy (launch/dynamo-run)
🔇 Additional comments (6)
components/src/dynamo/vllm/main.py (2)

35-74: LGTM with one observation.

The implementation correctly handles the async model download with appropriate timeouts and fallback behavior. The broad exception catch at Line 69 is appropriate here to ensure vLLM can still operate with the original model name if ModelExpress fails.

However, note the pipeline failures indicate formatting issues (isort, black) that need to be addressed before merge.


109-112: LGTM!

The placement of pull_model is correct—after port configuration and before overwrite_args, ensuring the model path is resolved before VLLM engine initialization.

lib/bindings/python/rust/lib.rs (1)

189-189: LGTM!

The hub module initialization follows the established pattern and is correctly placed in the module setup sequence.

lib/bindings/python/rust/llm.rs (1)

32-32: LGTM!

The hub module export is correctly placed and follows the established pattern for other submodules.

lib/bindings/python/rust/llm/hub.rs (2)

7-32: LGTM with pipeline failure note.

The implementation correctly:

  • Wraps the async Rust function for Python
  • Handles path-to-string conversion appropriately
  • Maps errors to Python exceptions

However, the pipeline failure indicates trailing whitespace that needs to be removed before merge.


34-37: LGTM!

The module registration helper follows the standard PyO3 pattern and correctly exposes the from_hf function to Python.

ayushag-nv (Contributor) commented

@KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.

KavinKrishnan (Contributor, Author) commented

> @KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.

Will do @ayushag-nv. I am waiting on some guidance from @grahamking, as I am not sure whether injecting MX (from_hf()) directly into the Dynamo backend implementations (in this case vLLM) is OK.

grahamking (Contributor) commented

> @KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.
>
> Will do @ayushag-nv. I am waiting on some guidance from @grahamking, as I am not sure whether injecting MX (from_hf()) directly into the Dynamo backend implementations (in this case vLLM) is OK.

Apologies, I didn't realise you were waiting for me.

grahamking (Contributor) commented

The model is downloaded in register_llm. That should be transparent to the engine; it can assume the model exists on disk when it starts.

Is that not working correctly?

I'd rather fix the init order if that's the issue, rather than expose model downloading to the Python layer. The model may come from somewhere else; it might be silently moved to this machine from somewhere else in the data center (via NIXL etc.). We don't want to make clients handle all that.
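A rough illustration of the ordering described above, where registration resolves the model to a local path before the engine starts; register_llm is the only name taken from the comment, the rest is hypothetical:

```python
async def start_worker(config):
    # register_llm downloads or otherwise materializes the model (ModelExpress,
    # NIXL transfer, direct pull) and returns a local path; the engine never
    # learns where the bytes came from.
    local_path = await register_llm(config.model)

    # By the time the engine starts, the model is guaranteed to be on disk.
    return await start_engine(model_path=local_path)
```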

grahamking (Contributor) left a comment

Blocking because we shouldn't be exposing a new Python binding, especially not such a specific one.
