Conversation
Pull request overview
This PR adds comprehensive support for vLLM as a first-class provider for local GPU inference, enabling high-throughput LLM serving on NVIDIA GPUs. The implementation includes a new VLLMProvider class, Docker Compose sidecar service, CLI integration, smoke tests, extensive documentation, and 18 unit tests.
Changes:
- New `VLLMProvider` thin client wrapping vLLM's OpenAI-compatible `/v1` API
- Docker Compose `vllm-server` sidecar with GPU passthrough, health checks, and tool-calling support
- CLI integration for deployment via `archi create --services vllm-server --gpu-ids all`
- Smoke test infrastructure for vLLM validation
- Full documentation page (`docs/vllm.md`) with architecture, configuration, and troubleshooting
- MONIT OpenSearch tools and skill loading utilities for the CMS CompOps agent
- Removal of `--sources` CLI flag (breaking change)
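To make the thin-client idea above concrete, here is a rough sketch of how such a provider could wrap the `/v1` API. The class layout and method names here are simplified assumptions, not the PR's exact implementation:

```python
import json
from urllib import request as _request


class VLLMProvider:
    """Thin client for a vLLM server's OpenAI-compatible /v1 API (sketch only).

    The PR's real provider also builds chat models; this sketch only shows
    the endpoint wiring and connection validation.
    """

    def __init__(self, base_url: str, api_key: str = "EMPTY"):
        # vLLM accepts any API key by default; "EMPTY" is a conventional dummy
        self.base_url = base_url.rstrip("/")
        self.api_key = api_key

    def endpoint(self, path: str) -> str:
        """Join the base URL with an API path, normalizing slashes."""
        return f"{self.base_url}/{path.lstrip('/')}"

    def list_models(self) -> list[str]:
        """GET /v1/models and return the served model ids."""
        req = _request.Request(
            self.endpoint("models"),
            headers={"Authorization": f"Bearer {self.api_key}"},
        )
        with _request.urlopen(req, timeout=10) as resp:
            payload = json.load(resp)
        return [m["id"] for m in payload.get("data", [])]

    def validate_connection(self) -> bool:
        """Connection is valid if the server reports at least one model."""
        try:
            return bool(self.list_models())
        except OSError:
            return False
```

With a server running locally, `VLLMProvider("http://localhost:8000/v1").list_models()` corresponds to the `curl localhost:8000/v1/models` check mentioned later in the test plan.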
Reviewed changes
Copilot reviewed 45 out of 46 changed files in this pull request and generated 7 comments.
Show a summary per file
| File | Description |
|---|---|
| `src/archi/providers/vllm_provider.py` | New VLLMProvider implementation with OpenAI-compatible API client |
| `src/archi/providers/base.py` | Added `ProviderType.VLLM` enum and vLLM to provider registry |
| `src/archi/providers/__init__.py` | Registered VLLMProvider and updated provider name mapping |
| `src/cli/service_registry.py` | Registered vllm-server as a compute service |
| `src/cli/templates/base-compose.yaml` | Added vLLM server service definition with GPU config |
| `src/cli/managers/templates_manager.py` | Pass vLLM config to Compose template |
| `src/cli/utils/service_builder.py` | Added vllm-server to service state and compute services |
| `src/cli/cli_main.py` | Removed `--sources` CLI flag from create/evaluate commands |
| `src/cli/utils/helpers.py` | Removed `parse_sources_option` function |
| `tests/unit/test_vllm_provider.py` | 18 unit tests for VLLMProvider |
| `tests/smoke/vllm_smoke.py` | End-to-end smoke test for vLLM deployments |
| `tests/smoke/preflight.py` | Added vLLM-specific preflight checks |
| `tests/smoke/combined_smoke.sh` | Integrated vLLM smoke test path |
| `scripts/dev/run_smoke_preview.sh` | Added vLLM provider support to smoke test script |
| `docs/docs/vllm.md` | Comprehensive vLLM documentation |
| `docs/docs/user_guide.md` | Restructured to link to dedicated topic pages |
| `docs/mkdocs.yml` | Added vLLM page and restructured navigation |
| `src/archi/pipelines/agents/tools/monit_opensearch.py` | New MONIT OpenSearch client and tool factories |
| `src/archi/pipelines/agents/utils/skill_utils.py` | Skill loading utility for agent tools |
| `src/archi/pipelines/agents/cms_comp_ops_agent.py` | Integrated MONIT tools and skill loading |
| `examples/deployments/basic-vllm/` | Example vLLM deployment configuration |
I went through Copilot's review suggestions and resolved them or explored the reasoning behind them.
Force-pushed 2f0e900 to 7e722ba
Hi @swinney, before I review this in depth, can you rebase the commits so that this PR only contains the relevant changes for the vLLM provider? On that note, you shouldn't push the openspec spec and proposal files.

Regarding the documentation, vLLM should not be in a separate page, but rather in a provider description [1]. For the WisDQM instance, I've been using the same local provider with vLLM. Have you tried this and still feel that adding a specific vLLM provider is necessary? I suppose the plus side would be having multiple local providers in the same instance (e.g. one Ollama and another vLLM), but other than that, is there something else?

From what I'm understanding, this also creates a vLLM server directly from Archi. Is this what you want to do? Or rather create a vLLM server and then just use that as a provider? I'm not sure that we want Archi to also control the vLLM deployment itself. Just a few things to think about.

[1] https://archi-physics.github.io/archi/models_providers/#quick-start-by-provider

Agreed to:
Force-pushed f2f9f2b to 993f510
Introduce a new vLLM provider enabling self-hosted LLM inference as a first-class deployment option in archi. This includes:

- vLLM provider implementation with OpenAI-compatible API interface
- Docker service template for vllm-server with configurable engine args
- CLI integration: service registry, template manager, and compose generation
- Example deployment config (basic-vllm) and GPU config example
- Documentation for vLLM setup and usage
- Unit tests and smoke tests for the vLLM provider
- Include source URL in retriever tool output

Co-Authored-By: Claude Opus 4.6 (1M context) <[email protected]>
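For illustration, a hypothetical sketch of what such a `vllm-server` Compose sidecar might look like. The image tag, port, and variable names are assumptions; only the tool-calling flags, GPU passthrough, and health check are features described in this PR:

```yaml
# Hypothetical sketch, not the PR's actual template.
services:
  vllm-server:
    image: vllm/vllm-openai:latest
    command: >
      --model ${VLLM_MODEL}
      --enable-auto-tool-choice
      --tool-call-parser ${VLLM_TOOL_PARSER}
    deploy:
      resources:
        reservations:
          devices:
            # NVIDIA GPU passthrough via the Compose device-reservation syntax
            - driver: nvidia
              count: all
              capabilities: [gpu]
    healthcheck:
      # vLLM's OpenAI-compatible server exposes a /health endpoint
      test: ["CMD-SHELL", "curl -sf http://localhost:8000/health || exit 1"]
      interval: 30s
    ports:
      - "8000:8000"
```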
Force-pushed 993f510 to 164ad6a
lucalavezzo
left a comment
Hi @swinney, thanks a lot for this, and sorry for the slow turnaround in reviewing it. Per the discussion with @juanpablosalas in the meeting yesterday, I thought that we would abandon launching vLLM through archi and just connect to a running server (as we do with Ollama now). Can you confirm that is the plan, and fix the PR accordingly?
On `<<<<<<< HEAD`:

can you clean this up?
On `### Grader Interface`:

I don't think we want this in the docs; it likely doesn't work.
On the `REDMINE_USER` / `REDMINE_PW` (Redmine source) secrets row, "See [Data Sources](data_sources.md) and [Services](services.md) for service-specific secrets.", and the `=======` conflict marker:

All interfaces already have their own doc page. Why was this added?
On the new `# vLLM Provider` doc page (`@@ -0,0 +1,320 @@`):

Why does vLLM get its own provider page? It should be the same as the other providers.

This also looks outdated (it references the sidecar idea).
On the navigation list:

- Agents & Tools: agents_tools.md
- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md

this is already in this list
On:

- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md
- vLLM Provider: vllm.md

as mentioned above, please remove
This looks outdated; does it still work with the new agents set up in the CLI?

Why are there both basic-vllm and basic-gpu deployments? Do we need both?
```python
# Pass vLLM model name from provider config to compose template
vllm_cfg = context.config_manager.config.get("services", {}).get("chat_app", {}).get("providers", {}).get("vllm", {})
if vllm_cfg.get("default_model"):
    template_vars["vllm_model"] = vllm_cfg["default_model"]
if vllm_cfg.get("tool_call_parser"):
    template_vars["vllm_tool_parser"] = vllm_cfg["tool_call_parser"]

# Pass vLLM server configuration keys to compose template
if vllm_cfg.get("gpu_memory_utilization"):
    template_vars["vllm_gpu_memory_utilization"] = vllm_cfg["gpu_memory_utilization"]
if vllm_cfg.get("max_model_len"):
    template_vars["vllm_max_model_len"] = vllm_cfg["max_model_len"]
if vllm_cfg.get("tensor_parallel_size"):
    template_vars["vllm_tensor_parallel_size"] = vllm_cfg["tensor_parallel_size"]
if vllm_cfg.get("dtype"):
    template_vars["vllm_dtype"] = vllm_cfg["dtype"]
if vllm_cfg.get("quantization"):
    template_vars["vllm_quantization"] = vllm_cfg["quantization"]
if "enforce_eager" in vllm_cfg:
    template_vars["vllm_enforce_eager"] = vllm_cfg["enforce_eager"]
if vllm_cfg.get("max_num_seqs"):
    template_vars["vllm_max_num_seqs"] = vllm_cfg["max_num_seqs"]
if "enable_prefix_caching" in vllm_cfg:
    template_vars["vllm_enable_prefix_caching"] = vllm_cfg["enable_prefix_caching"]
template_vars["vllm_engine_args"] = vllm_cfg.get("engine_args", {})
```
Why is this needed here? I don't like that we have provider-specific hooks in a generic templates manager.

I thought we abandoned the idea of a vLLM server; is this not the case anymore?
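One way to address the provider-specific-hook concern, sketched under assumptions (the helper and constant names are hypothetical, not archi code): drive the pass-through from an allow-list of keys rather than one `if` per key, so the templates manager stays generic.

```python
# Keys mirrored from the provider config into template variables.
VLLM_PASSTHROUGH_KEYS = [
    "default_model", "tool_call_parser", "gpu_memory_utilization",
    "max_model_len", "tensor_parallel_size", "dtype", "quantization",
    "enforce_eager", "max_num_seqs", "enable_prefix_caching",
]

# Keys whose falsy values (False, 0) are still meaningful and must be passed.
ALWAYS_PASS_IF_PRESENT = {"enforce_eager", "enable_prefix_caching"}


def collect_provider_vars(cfg: dict, prefix: str = "vllm") -> dict:
    """Map provider config keys to template variables generically (sketch)."""
    out = {}
    for key in VLLM_PASSTHROUGH_KEYS:
        if key in ALWAYS_PASS_IF_PRESENT:
            if key in cfg:
                out[f"{prefix}_{key}"] = cfg[key]
        elif cfg.get(key):
            out[f"{prefix}_{key}"] = cfg[key]
    # Rename the two keys whose template names differ from the config names.
    if f"{prefix}_default_model" in out:
        out[f"{prefix}_model"] = out.pop(f"{prefix}_default_model")
    if f"{prefix}_tool_call_parser" in out:
        out[f"{prefix}_tool_parser"] = out.pop(f"{prefix}_tool_call_parser")
    out[f"{prefix}_engine_args"] = cfg.get("engine_args", {})
    return out
```

The same loop would work for any provider that just mirrors config keys into template variables, which is the property the review comment is asking for.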
vLLM infrastructure (Docker Compose service, GPU config, engine args, smoke tests) is now the operator's responsibility. archi connects to any vLLM instance via `base_url`, whether Docker, bare metal, Slurm, or k8s. Removes vllm-server from the compose template, service registry, templates manager, and service builder. Reverts unrelated changes (retriever URL fix, .gitignore, chat_app whitespace). Rewrites the vLLM docs as "how to connect" instead of "how to deploy".
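Since archi now only connects, the provider mainly needs a well-formed endpoint. A small hypothetical helper (its name and normalization rules are assumptions, not the PR's code) illustrates the `base_url` handling:

```python
def normalize_vllm_base_url(raw: str) -> str:
    """Normalize a user-supplied vLLM endpoint to the OpenAI-compatible /v1 root.

    Accepts "host:8000", "http://host:8000/", or "http://host:8000/v1" and
    returns "http://host:8000/v1". The scheme defaults to http, which suits
    local servers; operators behind TLS would pass https explicitly.
    """
    url = raw.strip().rstrip("/")
    if not url.startswith(("http://", "https://")):
        url = "http://" + url
    if not url.endswith("/v1"):
        url += "/v1"
    return url
```

This keeps the "Docker, bare metal, Slurm, or k8s" flexibility: any reachable host and port works the same way.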
I think you can remove this, as you removed vLLM as an Archi service.

You can remove this change, just to keep history clean.
Summary

- `VLLMProvider`: a thin client that wraps a locally hosted vLLM server via its OpenAI-compatible `/v1` API. Supports model listing, connection validation, and chat model creation through LangChain's `ChatOpenAI`.
- `vllm-server` Docker Compose service with GPU passthrough, health checks, tool-calling flags (`--enable-auto-tool-choice`, `--tool-call-parser`), and conditional host networking. Deployed via `archi create --services chatbot,vllm-server --gpu-ids all`.
- Configuration keys (`services.vllm.model`, `services.vllm.tool_parser`); `VLLM_BASE_URL` is auto-injected into the chatbot container.
- `SMOKE_PROVIDER=vllm` path in `run_smoke_preview.sh`, vLLM-specific preflight checks, and `vllm_smoke.py` for end-to-end chat completion testing.
- `docs/docs/vllm.md` page (architecture, config reference, tool calling, troubleshooting), a vLLM subsection in the user guide, and an example deployment in `examples/deployments/basic-vllm/`.

Breaking Changes
- Removed the `--sources`/`-src` flag from `archi create` and `archi evaluate`. Data sources are now configured exclusively via the YAML config file under `data_manager.sources`. The `links` source remains enabled by default.

Changes
- Provider: `src/archi/providers/vllm_provider.py`, `__init__.py`, `base.py`
- CLI: `service_registry.py`, `service_builder.py`, `templates_manager.py`, `base-compose.yaml`, `cli_main.py`
- Smoke tests: `run_smoke_preview.sh`, `combined_smoke.sh`, `preflight.py`, `vllm_smoke.py`
- Unit tests: `tests/unit/test_vllm_provider.py`
- Docs: `docs/docs/vllm.md`, `docs/docs/user_guide.md`, `docs/mkdocs.yml`
- Example: `examples/deployments/basic-vllm/`

Test plan
- Unit tests (`pytest tests/unit/test_vllm_provider.py`)
- `archi create --services chatbot,vllm-server --gpu-ids all` on a GPU host
- `curl localhost:8000/v1/models` returns the served model