Draft: Enable vLLM calls to MX in Dynamo #3374
Walkthrough

Adds an async model pre-pull in the vLLM runtime to resolve and cache the model path before argument overwrite. Introduces a Rust PyO3 “hub” binding exposing from_hf and wires it into the Python module initialization, enabling server-first ModelExpress downloads with fallback to direct download.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant U as Worker(runtime)
    participant V as vLLM Main
    participant H as Hub Binding (Rust/PyO3)
    participant ME as ModelExpress/Server
    participant HF as HF Hub (fallback)
    U->>V: start()
    Note over V: Configure ports
    V->>H: from_hf(model_name, ignore_weights=false)
    alt Server-available
        H->>ME: request download/resolve(model_name)
        ME-->>H: local_path
    else Server-unavailable or error
        H->>HF: direct download(model_name)
        HF-->>H: local_path or error
    end
    alt Success
        H-->>V: local_path
        V->>V: set config.model and engine_args.model
    else Timeout/Error
        H--x V: raise error
        V->>V: warn and keep original model
    end
    V->>V: overwrite_args(config)
    V-->>U: continue startup
```
```mermaid
sequenceDiagram
    autonumber
    participant Py as Python Caller
    participant Hub as llm.hub.from_hf
    participant RT as pyo3 async runtime
    participant Core as dynamo_llm::hub::from_hf
    Py->>Hub: from_hf(model_name, ignore_weights?)
    Hub->>RT: spawn/await async
    RT->>Core: from_hf(model_name, ignore_weights)
    Core-->>RT: Result(local_path | Error)
    alt Ok
        RT-->>Hub: local_path
        Hub-->>Py: str(local_path)
    else Err
        RT--x Hub: Error
        Hub-->>Py: Python exception
    end
```
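The binding flow in the second diagram can be sketched in plain Python. This is a minimal analogue, not the real API: `_core_from_hf` is a hypothetical stand-in for the Rust `dynamo_llm::hub::from_hf`, and the wrapper mirrors how the PyO3 layer drives the async core and converts errors into Python exceptions.

```python
import asyncio


async def _core_from_hf(model_name: str, ignore_weights: bool = False) -> str:
    """Stand-in for the Rust core resolver; returns a local cache path."""
    if not model_name:
        raise ValueError("empty model name")
    # A real implementation would contact the ModelExpress server here,
    # falling back to a direct HF Hub download on error.
    return "/cache/models/" + model_name.replace("/", "--")


def from_hf(model_name: str, ignore_weights: bool = False) -> str:
    """Synchronous Python-facing wrapper, mirroring the binding's shape:
    drive the async core to completion and surface failures as ordinary
    Python exceptions."""
    try:
        return asyncio.run(_core_from_hf(model_name, ignore_weights))
    except ValueError as e:
        raise RuntimeError(f"from_hf failed: {e}") from e
```

The key design point the diagram captures is that the Python caller sees a plain blocking function returning a string path; all async plumbing and error translation stay inside the binding.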
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 0
🧹 Nitpick comments (1)
components/src/dynamo/vllm/main.py (1)
Lines 44-44: Remove the duplicate asyncio import.

The asyncio module is already imported at Line 4, making this import redundant.

Apply this diff:

```diff
- import asyncio
  from dynamo._core import from_hf
```
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)

- components/src/dynamo/vllm/main.py (2 hunks)
- lib/bindings/python/rust/lib.rs (1 hunks)
- lib/bindings/python/rust/llm.rs (1 hunks)
- lib/bindings/python/rust/llm/hub.rs (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)

lib/bindings/python/rust/lib.rs (1)
- lib/bindings/python/rust/llm/hub.rs: add_to_module (34-37)

lib/bindings/python/rust/llm/hub.rs (17)
- lib/bindings/python/rust/lib.rs: to_pyerr (197-202), m (138-153)

components/src/dynamo/vllm/main.py (2)
- components/src/dynamo/vllm/args.py: Config (39-64)
- lib/bindings/python/rust/llm/hub.rs: from_hf (20-32)
🪛 GitHub Actions: Pre Merge Validation of (ai-dynamo/dynamo/refs/pull/3374/merge) by KavinKrishnan.
lib/bindings/python/rust/llm/hub.rs
[error] 1-1: trailing-whitespace: files were modified by this hook during pre-commit (pre-commit trailing-whitespace).
components/src/dynamo/vllm/main.py
[error] 1-1: isort: files were modified by this hook during pre-commit (pre-commit isort).
[error] 1-1: black: reformatted components/src/dynamo/vllm/main.py during pre-commit.
🪛 Ruff (0.13.2)
components/src/dynamo/vllm/main.py
69-69: Do not catch blind exception: Exception
(BLE001)
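For context on the BLE001 finding: the review below argues the broad catch is intentional, since any failure should fall back to the original model name. One common way to keep that behavior while making the intent explicit to the linter is to suppress with a noqa and log the failure. This is an illustrative sketch; `download` is a hypothetical stand-in for the ModelExpress call, not the project's actual function.

```python
import logging

logger = logging.getLogger(__name__)


def download(name: str) -> str:
    # Hypothetical stand-in for the ModelExpress download; always fails here
    # so the fallback path is exercised.
    raise ConnectionError("ModelExpress unreachable")


def resolve_model(name: str) -> str:
    """Return a resolved local path, or the original name on any failure."""
    try:
        return download(name)
    except Exception:  # noqa: BLE001 -- deliberate: any failure means fallback
        logger.warning("ModelExpress download failed; using %r directly", name)
        return name
```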
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Build and Test - dynamo
- GitHub Check: clippy (lib/bindings/python)
- GitHub Check: clippy (.)
- GitHub Check: clippy (launch/dynamo-run)
🔇 Additional comments (6)
components/src/dynamo/vllm/main.py (2)
Lines 35-74: LGTM with one observation.

The implementation correctly handles the async model download with appropriate timeouts and fallback behavior. The broad exception catch at Line 69 is appropriate here to ensure vLLM can still operate with the original model name if ModelExpress fails.

However, note the pipeline failures indicate formatting issues (isort, black) that need to be addressed before merge.
Lines 109-112: LGTM!

The placement of pull_model is correct: after port configuration and before overwrite_args, ensuring the model path is resolved before vLLM engine initialization.

lib/bindings/python/rust/lib.rs (1)
Lines 189-189: LGTM!

The hub module initialization follows the established pattern and is correctly placed in the module setup sequence.
lib/bindings/python/rust/llm.rs (1)
Lines 32-32: LGTM!

The hub module export is correctly placed and follows the established pattern for other submodules.
lib/bindings/python/rust/llm/hub.rs (2)
Lines 7-32: LGTM with pipeline failure note.

The implementation correctly:
- Wraps the async Rust function for Python
- Handles path-to-string conversion appropriately
- Maps errors to Python exceptions

However, the pipeline failure indicates trailing whitespace that needs to be removed before merge.
Lines 34-37: LGTM!

The module registration helper follows the standard PyO3 pattern and correctly exposes the from_hf function to Python.
@KavinKrishnan is this ready for review? I see it's still a draft. Let me know when you want us to review.
Will do @ayushag-nv. I am waiting on some guidance from @grahamking, as I am not sure whether injecting MX (from_hf()) directly into the Dynamo backend implementations (in this case vLLM) is OK.
Apologies, I didn't realise you were waiting for me.
The model is downloaded in register_llm. That should be transparent to the engine; it can assume the model exists on disk when it starts. Is that not working correctly? I'd rather fix the init order, if that's the issue, than expose model downloading to the Python layer. The model may come from somewhere else; it might be silently moved to this machine from elsewhere in the data center (via NIXL etc.). We don't want to make clients handle all that.
Blocking because we shouldn't be exposing a new Python binding, especially not such a specific one.
Overview:
Moved MX downloads to before vLLM worker init occurs to prevent double model pulls.
Details:
Dynamo ModelExpress Integration Summary
The changes integrate ModelExpress as a model caching layer into the VLLM worker startup process, enabling centralized model caching on a shared PVC. This results in significant speedups for subsequent workers (34 minutes → 3.4 minutes in testing) while maintaining robust fallback behavior.
New File: lib/bindings/python/rust/llm/hub.rs
Created a Rust-to-Python binding that exposes the from_hf() function to Python code. This function downloads models from HuggingFace using the ModelExpress client library, supports an ignore_weights parameter for partial downloads, and returns the local filesystem path to the cached model. It serves as the bridge between Python VLLM code and the Rust ModelExpress client.
Modified: components/src/dynamo/vllm/main.py
Added a new async pull_model() function that pre-downloads models via ModelExpress before VLLM engine initialization. The function uses a 30-minute timeout for large model downloads and updates the VLLM config to use the resolved local cached path if successful. On timeout or failure, it keeps the original HuggingFace model name, allowing VLLM to leverage the shared HF_HUB_CACHE directory that ModelExpress populates. The function is called in the worker startup flow immediately after port configuration.
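The timeout-and-fallback behavior described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's actual code: `fake_from_hf` stands in for the real from_hf binding, and error details are simplified.

```python
import asyncio


async def fake_from_hf(model_name: str) -> str:
    # Stand-in for the real binding; pretend the download resolves instantly.
    await asyncio.sleep(0)
    return "/cache/" + model_name


async def pull_model(model_name: str, timeout_s: float = 30 * 60) -> str:
    """Resolve a local cached path via ModelExpress; on timeout or any
    failure, keep the original HuggingFace model name so vLLM can fall
    back to the shared HF_HUB_CACHE directory."""
    try:
        return await asyncio.wait_for(fake_from_hf(model_name), timeout=timeout_s)
    except Exception:
        # asyncio.TimeoutError is also an Exception, so both timeout and
        # download errors land here and fall back to the original name.
        return model_name
```

In the success case the config would then be updated to the returned local path; in the failure case the original name passes through unchanged.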
Modified: lib/bindings/python/rust/lib.rs
Added the module registration llm::hub::add_to_module(m)? to expose the new hub functionality to Python, making the from_hf() function available for import in Python code.
Modified: lib/bindings/python/rust/llm.rs
Added the module declaration pub mod hub; to enable the Rust compiler to include the new hub module in the build process.
Where should the reviewer start?
Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)