
Add vLLM provider for local GPU inference #442

Open
swinney wants to merge 2 commits into dev from add-vllm-provider

Conversation


@swinney swinney commented Feb 13, 2026

Summary

  • New VLLMProvider — thin client that wraps a locally hosted vLLM server via its OpenAI-compatible /v1 API. Supports model listing, connection validation, and chat model creation through LangChain's ChatOpenAI.
  • vLLM sidecar service — vllm-server Docker Compose service with GPU passthrough, health checks, tool-calling flags (--enable-auto-tool-choice, --tool-call-parser), and conditional host networking. Deployed via archi create --services chatbot,vllm-server --gpu-ids all.
  • CLI integration — registered in service registry, wired into deployment plan, Compose template renders vLLM config from YAML (services.vllm.model, services.vllm.tool_parser), and VLLM_BASE_URL is auto-injected into the chatbot container.
  • Smoke test support — SMOKE_PROVIDER=vllm path in run_smoke_preview.sh, vLLM-specific preflight checks, and vllm_smoke.py for end-to-end chat completion testing.
  • Documentation — full docs/docs/vllm.md page (architecture, config reference, tool calling, troubleshooting), vLLM subsection in user guide, example deployment in examples/deployments/basic-vllm/.
  • Unit tests — 18 tests covering init, env overrides, URL normalization, model operations, and provider registration.
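The unit tests above mention URL normalization. A minimal sketch of what such a helper might do — the name `normalize_base_url` and the exact rules are assumptions for illustration, not the PR's actual code:

```python
def normalize_base_url(url: str) -> str:
    """Normalize a vLLM server URL so it ends in a single /v1 suffix.

    Hypothetical helper: the real VLLMProvider's normalization rules
    may differ; this only illustrates the behavior the tests cover.
    """
    url = url.rstrip("/")
    if not url.endswith("/v1"):
        url += "/v1"
    return url


# Different user-supplied spellings collapse to one canonical base URL.
for raw in ("http://localhost:8000", "http://localhost:8000/", "http://localhost:8000/v1/"):
    print(normalize_base_url(raw))
```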

Breaking Changes

  • Removed --sources / -src flag from archi create and archi evaluate. Data sources are now configured exclusively via the YAML config file under data_manager.sources. The links source remains enabled by default.
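For migration, source configuration moves into the YAML config. A hedged sketch — the key path data_manager.sources comes from the bullet above, but the exact schema (including the enabled key) is an assumption:

```yaml
# Hypothetical sketch -- exact keys may differ from the real schema.
data_manager:
  sources:
    links:            # remains enabled by default
      enabled: true
    # additional sources are configured here instead of via --sources
```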

Changes

| Area | Files |
| --- | --- |
| Provider | src/archi/providers/vllm_provider.py, __init__.py, base.py |
| CLI/Compose | service_registry.py, service_builder.py, templates_manager.py, base-compose.yaml, cli_main.py |
| Smoke tests | run_smoke_preview.sh, combined_smoke.sh, preflight.py, vllm_smoke.py |
| Unit tests | tests/unit/test_vllm_provider.py |
| Docs | docs/docs/vllm.md, docs/docs/user_guide.md, docs/mkdocs.yml |
| Example | examples/deployments/basic-vllm/ |

Test plan

  • All 18 vLLM unit tests pass (pytest tests/unit/test_vllm_provider.py)
  • Full unit suite shows no regressions (73/81 pass; 8 failures are pre-existing and unrelated)
  • Deploy with archi create --services chatbot,vllm-server --gpu-ids all on a GPU host
  • Verify curl localhost:8000/v1/models returns the served model
  • End-to-end chat through the chatbot UI using a vLLM-backed model
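The /v1/models check in the plan returns the standard OpenAI-compatible model list. A small sketch of extracting the served model id from such a response — the payload below is illustrative, not captured from a real deployment:

```python
import json

# Illustrative /v1/models payload in the OpenAI-compatible list format
# that vLLM serves; the model id here is a placeholder.
sample = """
{"object": "list",
 "data": [{"id": "my-served-model", "object": "model", "owned_by": "vllm"}]}
"""

body = json.loads(sample)
model_ids = [m["id"] for m in body["data"]]
print(model_ids)
```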

@swinney swinney changed the base branch from main to dev February 15, 2026 19:52

Copilot AI left a comment


Pull request overview

This PR adds comprehensive support for vLLM as a first-class provider for local GPU inference, enabling high-throughput LLM serving on NVIDIA GPUs. The implementation includes a new VLLMProvider class, Docker Compose sidecar service, CLI integration, smoke tests, extensive documentation, and 18 unit tests.

Changes:

  • New VLLMProvider thin client wrapping vLLM's OpenAI-compatible /v1 API
  • Docker Compose vllm-server sidecar with GPU passthrough, health checks, and tool-calling support
  • CLI integration for deployment via archi create --services vllm-server --gpu-ids all
  • Smoke test infrastructure for vLLM validation
  • Full documentation page (docs/vllm.md) with architecture, configuration, and troubleshooting
  • MONIT OpenSearch tools and skill loading utilities for the CMS CompOps agent
  • Removal of --sources CLI flag (breaking change)

Reviewed changes

Copilot reviewed 45 out of 46 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/archi/providers/vllm_provider.py | New VLLMProvider implementation with OpenAI-compatible API client |
| src/archi/providers/base.py | Added ProviderType.VLLM enum and vLLM to provider registry |
| src/archi/providers/__init__.py | Registered VLLMProvider and updated provider name mapping |
| src/cli/service_registry.py | Registered vllm-server as a compute service |
| src/cli/templates/base-compose.yaml | Added vLLM server service definition with GPU config |
| src/cli/managers/templates_manager.py | Pass vLLM config to Compose template |
| src/cli/utils/service_builder.py | Added vllm-server to service state and compute services |
| src/cli/cli_main.py | Removed --sources CLI flag from create/evaluate commands |
| src/cli/utils/helpers.py | Removed parse_sources_option function |
| tests/unit/test_vllm_provider.py | 18 unit tests for VLLMProvider |
| tests/smoke/vllm_smoke.py | End-to-end smoke test for vLLM deployments |
| tests/smoke/preflight.py | Added vLLM-specific preflight checks |
| tests/smoke/combined_smoke.sh | Integrated vLLM smoke test path |
| scripts/dev/run_smoke_preview.sh | Added vLLM provider support to smoke test script |
| docs/docs/vllm.md | Comprehensive vLLM documentation |
| docs/docs/user_guide.md | Restructured to link to dedicated topic pages |
| docs/mkdocs.yml | Added vLLM page and restructured navigation |
| src/archi/pipelines/agents/tools/monit_opensearch.py | New MONIT OpenSearch client and tool factories |
| src/archi/pipelines/agents/utils/skill_utils.py | Skill loading utility for agent tools |
| src/archi/pipelines/agents/cms_comp_ops_agent.py | Integrated MONIT tools and skill loading |
| examples/deployments/basic-vllm/ | Example vLLM deployment configuration |



swinney commented Feb 18, 2026

I went through Copilot's review suggestions and either resolved them or explored the reasoning behind them.

@pmlugato pmlugato added the enhancement New feature or request label Feb 23, 2026
@juanpablosalas juanpablosalas self-requested a review March 18, 2026 15:12
@juanpablosalas

Hi @swinney,

Before I review this in depth, can you rebase the commits so that this PR only contains the relevant changes for the vLLM provider?

On that note, you shouldn't push the openspec spec and proposal files. Regarding the documentation, vLLM should not be on a separate page, but rather described alongside the other providers [1]. For the WisDQM instance, I've been using the same local provider with vLLM - have you tried this, and do you still feel a dedicated vLLM provider is necessary? I suppose the plus side would be having multiple local providers in the same instance (e.g. one Ollama and another vLLM), but is there something else beyond that?

From what I understand, this also creates a vLLM server directly from Archi - is that the intent, or would you rather create a vLLM server separately and just use it as a provider? I'm not sure we want Archi to also control the vLLM deployment itself. Just a few things to think about.

[1] https://archi-physics.github.io/archi/models_providers/#quick-start-by-provider


swinney commented Mar 18, 2026

Agreed to:

  • Not include openspec-related files. (I thought they might be useful to leave in as context for others.)
  • Squash commits outside of the main set of changes.
  • Pull the vllm docker instance out of this pull request so that it is managed externally.
  • Update documentation.

Introduce a new vLLM provider enabling self-hosted LLM inference as a
first-class deployment option in archi. This includes:

- vLLM provider implementation with OpenAI-compatible API interface
- Docker service template for vllm-server with configurable engine args
- CLI integration: service registry, template manager, and compose generation
- Example deployment config (basic-vllm) and GPU config example
- Documentation for vLLM setup and usage
- Unit tests and smoke tests for the vLLM provider
- Include source URL in retriever tool output

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
@swinney swinney force-pushed the add-vllm-provider branch from 993f510 to 164ad6a Compare March 26, 2026 00:21

@lucalavezzo lucalavezzo left a comment


Hi @swinney thanks a lot for this, and sorry for the slow turnaround in reviewing it -- per the discussion with @juanpablosalas in the meeting yesterday, I understood that we would abandon launching vLLM through archi and just connect to a running server (as we do with Ollama now). Can you confirm that is the plan, and fix the PR accordingly?


---

<<<<<<< HEAD

can you clean this up?


---

### Grader Interface

I don't think we want this in the docs, it likely doesn't work

| `REDMINE_USER` / `REDMINE_PW` | Redmine source |

See [Data Sources](data_sources.md) and [Services](services.md) for service-specific secrets.
=======

All interfaces already have their own doc page. why was this added?

@@ -0,0 +1,320 @@
# vLLM Provider

why does vllm get its own provider page? should be the same as the other providers


this also looks outdated (references the side-car idea)

- Agents & Tools: agents_tools.md
- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md

this is already in this list

- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md
- vLLM Provider: vllm.md

as mentioned above, please remove


this looks outdated, does it still work with the new agents set up in the cli?


why is there both basic-vllm and basic-gpu deployments? do we need both?

Comment on lines +484 to +509
# Pass vLLM model name from provider config to compose template
vllm_cfg = context.config_manager.config.get("services", {}).get("chat_app", {}).get("providers", {}).get("vllm", {})
if vllm_cfg.get("default_model"):
    template_vars["vllm_model"] = vllm_cfg["default_model"]
if vllm_cfg.get("tool_call_parser"):
    template_vars["vllm_tool_parser"] = vllm_cfg["tool_call_parser"]

# Pass vLLM server configuration keys to compose template
if vllm_cfg.get("gpu_memory_utilization"):
    template_vars["vllm_gpu_memory_utilization"] = vllm_cfg["gpu_memory_utilization"]
if vllm_cfg.get("max_model_len"):
    template_vars["vllm_max_model_len"] = vllm_cfg["max_model_len"]
if vllm_cfg.get("tensor_parallel_size"):
    template_vars["vllm_tensor_parallel_size"] = vllm_cfg["tensor_parallel_size"]
if vllm_cfg.get("dtype"):
    template_vars["vllm_dtype"] = vllm_cfg["dtype"]
if vllm_cfg.get("quantization"):
    template_vars["vllm_quantization"] = vllm_cfg["quantization"]
if "enforce_eager" in vllm_cfg:
    template_vars["vllm_enforce_eager"] = vllm_cfg["enforce_eager"]
if vllm_cfg.get("max_num_seqs"):
    template_vars["vllm_max_num_seqs"] = vllm_cfg["max_num_seqs"]
if "enable_prefix_caching" in vllm_cfg:
    template_vars["vllm_enable_prefix_caching"] = vllm_cfg["enable_prefix_caching"]
template_vars["vllm_engine_args"] = vllm_cfg.get("engine_args", {})


why is this needed here? I don't like that we have provider-specific hooks in a generic templates manager.


I thought we abandoned the idea of a vllm server, is this not the case anymore?

@lucalavezzo lucalavezzo self-assigned this Mar 26, 2026
vLLM infrastructure (Docker compose service, GPU config, engine args,
smoke tests) is now the operator's responsibility. archi connects to
any vLLM instance via base_url — Docker, bare metal, Slurm, or k8s.

removes: vllm-server from compose template, service registry,
templates manager, service builder. reverts unrelated changes
(retriever URL fix, .gitignore, chat_app whitespace). rewrites
vllm docs as "how to connect" instead of "how to deploy".
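The "connect, don't deploy" model in this commit message could be configured along these lines. A hedged sketch: the key names mirror the services.chat_app.providers.vllm path seen in the templates-manager hunk, but the final schema, host, and model id below are placeholders:

```yaml
# Hypothetical sketch: point archi at an externally managed vLLM server.
# Exact schema may differ; base_url is the only contract archi relies on.
services:
  chat_app:
    providers:
      vllm:
        base_url: http://vllm-host:8000/v1   # Docker, bare metal, Slurm, or k8s
        default_model: my-served-model       # placeholder id
```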

I think you can remove this as you removed vLLM as an Archi service


you can remove this change, just to keep history clean

Labels

enhancement New feature or request


5 participants