Conversation
Pull request overview
This PR adds comprehensive support for vLLM as a first-class provider for local GPU inference, enabling high-throughput LLM serving on NVIDIA GPUs. The implementation includes a new VLLMProvider class, Docker Compose sidecar service, CLI integration, smoke tests, extensive documentation, and 18 unit tests.
Changes:
- New `VLLMProvider` thin client wrapping vLLM's OpenAI-compatible `/v1` API
- Docker Compose `vllm-server` sidecar with GPU passthrough, health checks, and tool-calling support
- CLI integration for deployment via `archi create --services vllm-server --gpu-ids all`
- Smoke test infrastructure for vLLM validation
- Full documentation page (`docs/vllm.md`) with architecture, configuration, and troubleshooting
- MONIT OpenSearch tools and skill loading utilities for the CMS CompOps agent
- Removal of `--sources` CLI flag (breaking change)
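To make the thin-client idea above concrete, here is a rough sketch of how such a provider could wrap the `/v1` API. The class layout and method names here are simplified assumptions, not the PR's exact implementation:

```python
import json
from urllib import request as _request


class VLLMProvider:
    """Thin client for a vLLM server's OpenAI-compatible /v1 API (sketch only).

    The PR's real provider also builds chat models; this sketch only shows
    the endpoint wiring and connection validation.
    """

    def __init__(self, base_url: str, api_key: str = "EMPTY"):
        # vLLM accepts any API key by default; "EMPTY" is a conventional dummy
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def endpoint(self, path: str) -> str:
        """Join the base URL with an API path, normalizing slashes."""
        return f"{self.base_url}/{path.lstrip('/')}"

    def list_models(self) -> list[str]:
        """GET /v1/models and return the served model ids."""
        req = _request.Request(
            self.endpoint("models"),
            headers={"Authorization": f"Bearer {self.api_key}"},
        )
        with _request.urlopen(req, timeout=10) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]

    def validate_connection(self) -> bool:
        """Connection is valid if the server reports at least one model."""
        try:
            return bool(self.list_models())
        except OSError:
            return False
```

With a server running locally, `VLLMProvider("http://localhost:8000/v1").list_models()` corresponds to the `curl localhost:8000/v1/models` check mentioned later in the test plan.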
Reviewed changes
Copilot reviewed 45 out of 46 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| `src/archi/providers/vllm_provider.py` | New VLLMProvider implementation with OpenAI-compatible API client |
| `src/archi/providers/base.py` | Added `ProviderType.VLLM` enum and vLLM to provider registry |
| `src/archi/providers/__init__.py` | Registered VLLMProvider and updated provider name mapping |
| `src/cli/service_registry.py` | Registered vllm-server as a compute service |
| `src/cli/templates/base-compose.yaml` | Added vLLM server service definition with GPU config |
| `src/cli/managers/templates_manager.py` | Pass vLLM config to Compose template |
| `src/cli/utils/service_builder.py` | Added vllm-server to service state and compute services |
| `src/cli/cli_main.py` | Removed `--sources` CLI flag from create/evaluate commands |
| `src/cli/utils/helpers.py` | Removed `parse_sources_option` function |
| `tests/unit/test_vllm_provider.py` | 18 unit tests for VLLMProvider |
| `tests/smoke/vllm_smoke.py` | End-to-end smoke test for vLLM deployments |
| `tests/smoke/preflight.py` | Added vLLM-specific preflight checks |
| `tests/smoke/combined_smoke.sh` | Integrated vLLM smoke test path |
| `scripts/dev/run_smoke_preview.sh` | Added vLLM provider support to smoke test script |
| `docs/docs/vllm.md` | Comprehensive vLLM documentation |
| `docs/docs/user_guide.md` | Restructured to link to dedicated topic pages |
| `docs/mkdocs.yml` | Added vLLM page and restructured navigation |
| `src/archi/pipelines/agents/tools/monit_opensearch.py` | New MONIT OpenSearch client and tool factories |
| `src/archi/pipelines/agents/utils/skill_utils.py` | Skill loading utility for agent tools |
| `src/archi/pipelines/agents/cms_comp_ops_agent.py` | Integrated MONIT tools and skill loading |
| `examples/deployments/basic-vllm/` | Example vLLM deployment configuration |
I went through Copilot's review suggestions and resolved them or explored the reasoning behind them.
Force-pushed 2f0e900 to 7e722ba
Hi @swinney, before I review this in depth, can you rebase the commits so that this PR only contains the relevant changes for the vLLM provider? On that note, you shouldn't push the openspec spec and proposal files.

Regarding the documentation, vLLM should not be in a separate page, but rather in a provider description [1]. For the WisDQM instance, I've been using the same local provider with vLLM. Have you tried this and still feel that adding a specific vLLM provider is necessary? I suppose the plus side would be having multiple local providers in the same instance (e.g. one Ollama and another vLLM), but other than that, is there something else?

From what I'm understanding, this also creates a vLLM server directly from Archi. Is this what you want to do? Or rather create a vLLM server and then just use that as a provider? I'm not sure that we want Archi to also control the vLLM deployment itself. Just a few things to think about.

[1] https://archi-physics.github.io/archi/models_providers/#quick-start-by-provider

Agreed to:
Force-pushed f2f9f2b to 993f510
Introduce a new vLLM provider enabling self-hosted LLM inference as a first-class deployment option in archi. This includes:

- vLLM provider implementation with OpenAI-compatible API interface
- Docker service template for vllm-server with configurable engine args
- CLI integration: service registry, template manager, and compose generation
- Example deployment config (basic-vllm) and GPU config example
- Documentation for vLLM setup and usage
- Unit tests and smoke tests for the vLLM provider
- Include source URL in retriever tool output

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
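For illustration, a hypothetical sketch of what such a `vllm-server` Compose sidecar might look like. The image tag, port, and variable names are assumptions; only the tool-calling flags, GPU passthrough, and health check are features described in this PR:

```yaml
# Hypothetical sketch, not the PR's actual template.
services:
  vllm-server:
    image: vllm/vllm-openai:latest
    command: >
      --model ${VLLM_MODEL}
      --enable-auto-tool-choice
      --tool-call-parser ${VLLM_TOOL_PARSER}
    deploy:
      resources:
        reservations:
          devices:
            # NVIDIA GPU passthrough via the Compose device-reservation syntax
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      # vLLM's OpenAI-compatible server exposes a /health endpoint
      test: ["CMD-SHELL", "curl -sf http://localhost:8000/health || exit 1"]
      interval: 30s
    ports:
      - "8000:8000"
```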
Force-pushed 993f510 to 164ad6a
lucalavezzo
left a comment
Hi @swinney, thanks a lot for this, and sorry for the slow turnaround in reviewing it. Per the discussion with @juanpablosalas in the meeting yesterday, I thought that we would abandon launching vLLM through archi and just connect to a running server (as we do with Ollama now). Can you confirm that is the plan, and fix the PR accordingly?
On `<<<<<<< HEAD`:

can you clean this up?
On `### Grader Interface`:

I don't think we want this in the docs; it likely doesn't work.
On the `REDMINE_USER` / `REDMINE_PW` (Redmine source) secrets row, "See [Data Sources](data_sources.md) and [Services](services.md) for service-specific secrets.", and the `=======` conflict marker:

All interfaces already have their own doc page. Why was this added?
On the new `# vLLM Provider` doc page (`@@ -0,0 +1,320 @@`):

Why does vLLM get its own provider page? It should be the same as the other providers.

This also looks outdated (it references the sidecar idea).
On the navigation list:

- Agents & Tools: agents_tools.md
- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md

this is already in this list
On:

- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md
- vLLM Provider: vllm.md

as mentioned above, please remove
This looks outdated; does it still work with the new agents set up in the CLI?

Why are there both basic-vllm and basic-gpu deployments? Do we need both?
```python
# Pass vLLM model name from provider config to compose template
vllm_cfg = context.config_manager.config.get("services", {}).get("chat_app", {}).get("providers", {}).get("vllm", {})
if vllm_cfg.get("default_model"):
    template_vars["vllm_model"] = vllm_cfg["default_model"]
if vllm_cfg.get("tool_call_parser"):
    template_vars["vllm_tool_parser"] = vllm_cfg["tool_call_parser"]

# Pass vLLM server configuration keys to compose template
if vllm_cfg.get("gpu_memory_utilization"):
    template_vars["vllm_gpu_memory_utilization"] = vllm_cfg["gpu_memory_utilization"]
if vllm_cfg.get("max_model_len"):
    template_vars["vllm_max_model_len"] = vllm_cfg["max_model_len"]
if vllm_cfg.get("tensor_parallel_size"):
    template_vars["vllm_tensor_parallel_size"] = vllm_cfg["tensor_parallel_size"]
if vllm_cfg.get("dtype"):
    template_vars["vllm_dtype"] = vllm_cfg["dtype"]
if vllm_cfg.get("quantization"):
    template_vars["vllm_quantization"] = vllm_cfg["quantization"]
if "enforce_eager" in vllm_cfg:
    template_vars["vllm_enforce_eager"] = vllm_cfg["enforce_eager"]
if vllm_cfg.get("max_num_seqs"):
    template_vars["vllm_max_num_seqs"] = vllm_cfg["max_num_seqs"]
if "enable_prefix_caching" in vllm_cfg:
    template_vars["vllm_enable_prefix_caching"] = vllm_cfg["enable_prefix_caching"]
template_vars["vllm_engine_args"] = vllm_cfg.get("engine_args", {})
```
Why is this needed here? I don't like that we have provider-specific hooks in a generic templates manager.

I thought we abandoned the idea of a vLLM server; is this not the case anymore?
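One way to address the provider-specific-hook concern, sketched under assumptions (the helper and constant names are hypothetical, not archi code): drive the pass-through from an allow-list of keys rather than one `if` per key, so the templates manager stays generic.

```python
# Keys mirrored from the provider config into template variables.
VLLM_PASSTHROUGH_KEYS = [
    "default_model", "tool_call_parser", "gpu_memory_utilization",
    "max_model_len", "tensor_parallel_size", "dtype", "quantization",
    "enforce_eager", "max_num_seqs", "enable_prefix_caching",
]

# Keys whose falsy values (False, 0) are still meaningful and must be passed.
ALWAYS_PASS_IF_PRESENT = {"enforce_eager", "enable_prefix_caching"}


def collect_provider_vars(cfg: dict, prefix: str = "vllm") -> dict:
    """Map provider config keys to template variables generically (sketch)."""
    out = {}
    for key in VLLM_PASSTHROUGH_KEYS:
        if key in ALWAYS_PASS_IF_PRESENT:
            if key in cfg:
                out[f"{prefix}_{key}"] = cfg[key]
        elif cfg.get(key):
            out[f"{prefix}_{key}"] = cfg[key]
    # Rename the two keys whose template names differ from the config names.
    if f"{prefix}_default_model" in out:
        out[f"{prefix}_model"] = out.pop(f"{prefix}_default_model")
    if f"{prefix}_tool_call_parser" in out:
        out[f"{prefix}_tool_parser"] = out.pop(f"{prefix}_tool_call_parser")
    out[f"{prefix}_engine_args"] = cfg.get("engine_args", {})
    return out
```

The same loop would work for any provider that just mirrors config keys into template variables, which is the property the review comment is asking for.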
vLLM infrastructure (Docker Compose service, GPU config, engine args, smoke tests) is now the operator's responsibility. archi connects to any vLLM instance via `base_url`, whether Docker, bare metal, Slurm, or k8s. Removes vllm-server from the compose template, service registry, templates manager, and service builder. Reverts unrelated changes (retriever URL fix, .gitignore, chat_app whitespace). Rewrites the vLLM docs as "how to connect" instead of "how to deploy".
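Since archi now only connects, the provider mainly needs a well-formed endpoint. A small hypothetical helper (its name and normalization rules are assumptions, not the PR's code) illustrates the `base_url` handling:

```python
def normalize_vllm_base_url(raw: str) -> str:
    """Normalize a user-supplied vLLM endpoint to the OpenAI-compatible /v1 root.

    Accepts "host:8000", "http://host:8000/", or "http://host:8000/v1" and
    returns "http://host:8000/v1". The scheme defaults to http, which suits
    local servers; operators behind TLS would pass https explicitly.
    """
    url = raw.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```

This keeps the "Docker, bare metal, Slurm, or k8s" flexibility: any reachable host and port works the same way.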
I think you can remove this, as you removed vLLM as an Archi service.

You can remove this change, just to keep history clean.
Summary

- `VLLMProvider`: a thin client that wraps a locally hosted vLLM server via its OpenAI-compatible `/v1` API. Supports model listing, connection validation, and chat model creation through LangChain's `ChatOpenAI`.
- `vllm-server` Docker Compose service with GPU passthrough, health checks, tool-calling flags (`--enable-auto-tool-choice`, `--tool-call-parser`), and conditional host networking. Deployed via `archi create --services chatbot,vllm-server --gpu-ids all`.
- Configuration keys (`services.vllm.model`, `services.vllm.tool_parser`); `VLLM_BASE_URL` is auto-injected into the chatbot container.
- `SMOKE_PROVIDER=vllm` path in `run_smoke_preview.sh`, vLLM-specific preflight checks, and `vllm_smoke.py` for end-to-end chat completion testing.
- `docs/docs/vllm.md` page (architecture, config reference, tool calling, troubleshooting), a vLLM subsection in the user guide, and an example deployment in `examples/deployments/basic-vllm/`.

Breaking Changes
- Removed the `--sources`/`-src` flag from `archi create` and `archi evaluate`. Data sources are now configured exclusively via the YAML config file under `data_manager.sources`. The `links` source remains enabled by default.

Changes
- Provider: `src/archi/providers/vllm_provider.py`, `__init__.py`, `base.py`
- CLI: `service_registry.py`, `service_builder.py`, `templates_manager.py`, `base-compose.yaml`, `cli_main.py`
- Smoke tests: `run_smoke_preview.sh`, `combined_smoke.sh`, `preflight.py`, `vllm_smoke.py`
- Unit tests: `tests/unit/test_vllm_provider.py`
- Docs: `docs/docs/vllm.md`, `docs/docs/user_guide.md`, `docs/mkdocs.yml`
- Example: `examples/deployments/basic-vllm/`

Test plan
- Unit tests (`pytest tests/unit/test_vllm_provider.py`)
- `archi create --services chatbot,vllm-server --gpu-ids all` on a GPU host
- `curl localhost:8000/v1/models` returns the served model