568 changes: 568 additions & 0 deletions docs/docs/user_guide.md

Large diffs are not rendered by default.

259 changes: 259 additions & 0 deletions docs/docs/vllm.md
@@ -0,0 +1,259 @@
# vLLM Provider
> **Collaborator:** why does vllm get its own provider page? should be the same as the other providers

> **Collaborator:** this also looks outdated (references the side-car idea)


Run open-weight models on your own GPUs using [vLLM](https://docs.vllm.ai/) as an inference backend. Archi connects to any vLLM server via its OpenAI-compatible API — you deploy and manage vLLM independently.

## Why vLLM?

| | vLLM | Ollama | API providers |
|---|---|---|---|
| **Throughput** | High (PagedAttention, continuous batching) | Moderate | N/A (cloud) |
| **Multi-GPU** | Tensor-parallel across GPUs | Single GPU | N/A |
| **Tool calling** | Supported (with parser flag) | Model-dependent | Supported |
| **Cost** | Hardware only | Hardware only | Per-token |
| **Privacy** | Data stays on-premises | Data stays on-premises | Data leaves your network |

vLLM is the best fit when you need high-throughput local inference, multi-GPU support, or full data privacy with tool-calling capabilities.

## Architecture

```
┌──────────────────────┐ ┌──────────────────────┐
│ archi deployment │ │ vLLM (external) │
│ │ │ │
│ ┌────────────────┐ │ HTTP │ Docker container │
│ │ VLLMProvider │──│────────>│ OR bare metal │
│ │ (Python client)│ │ :8000 │ OR Slurm job │
│ └────────────────┘ │ /v1/* │ OR Kubernetes pod │
│ │ │ │
└──────────────────────┘ └──────────────────────┘
```

Archi's `VLLMProvider` is a thin client that talks to vLLM's `/v1` API using the same `ChatOpenAI` LangChain class it would use for the OpenAI API. From the pipeline's perspective, vLLM looks identical to a remote OpenAI endpoint.

**Archi does not manage the vLLM server.** You deploy, configure, and maintain vLLM independently — whether as a Docker container, a bare metal process, a Slurm job, or a Kubernetes pod. Archi only needs a `base_url` to connect.
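Because the provider is just an OpenAI-compatible client, you can sanity-check any vLLM endpoint without archi at all. The sketch below, using only the Python standard library, builds and sends a single chat completion; the helper names are illustrative, not part of archi's API.

```python
import json
import urllib.request

def build_chat_request(base_url: str, model: str, prompt: str):
    """Build the URL and JSON payload for an OpenAI-compatible chat completion."""
    url = f"{base_url.rstrip('/')}/chat/completions"
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return url, payload

def chat_once(base_url: str, model: str, prompt: str) -> str:
    """POST the request and return the assistant's reply text."""
    url, payload = build_chat_request(base_url, model, prompt)
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

For example, `chat_once("http://localhost:8000/v1", "Qwen/Qwen3-8B", "hello")` exercises exactly the path archi uses: same endpoint, same request shape as the OpenAI API.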

## Quick Start

### 1. Start a vLLM server

See [Running vLLM](#running-vllm) below for Docker, bare metal, and Slurm examples.

### 2. Configure archi

In your config YAML, set up the vLLM provider with the URL of your server:

```yaml
services:
chat_app:
default_provider: vllm
default_model: "vllm:Qwen/Qwen3-8B"
providers:
vllm:
enabled: true
base_url: http://localhost:8000/v1 # URL of your vLLM server
default_model: "Qwen/Qwen3-8B"
models:
- "vllm:Qwen/Qwen3-8B"
```

### 3. Deploy archi

```bash
archi create -n my-deployment \
-c config.yaml \
-e .env \
--services chatbot
```

### 4. Verify

```bash
# Check vLLM is serving
curl http://localhost:8000/v1/models

# Check archi can reach it
curl http://localhost:7861/api/health
```

## Configuration Reference

### Provider settings

The vLLM provider is configured under `services.chat_app.providers.vllm`:

```yaml
services:
chat_app:
providers:
vllm:
enabled: true
base_url: http://localhost:8000/v1
default_model: "Qwen/Qwen3-8B"
models:
- "vllm:Qwen/Qwen3-8B"
```

| Setting | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable the vLLM provider |
| `base_url` | string | `http://localhost:8000/v1` | vLLM server OpenAI-compatible endpoint |
| `default_model` | string | — | HuggingFace model ID to use for inference |
| `models` | list | — | Available model IDs for the UI model selector |

### Model references

Anywhere a model is referenced in `pipeline_map`, use the `vllm/` prefix:

```yaml
archi:
pipeline_map:
CMSCompOpsAgent:
models:
required:
agent_model: vllm/Qwen/Qwen3-8B
```

The part after `vllm/` must match the HuggingFace model ID that vLLM is serving.

> **Model naming**: vLLM uses HuggingFace model IDs (e.g. `Qwen/Qwen3-8B`), not Ollama-style names (e.g. `Qwen/Qwen3:8B`).
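The prefix convention above can be captured in a few lines. This is a hedged sketch of the mapping — the helper is hypothetical, not archi's actual parsing code — but it shows the two rules that matter: strip `vllm/`, and reject Ollama-style colon names.

```python
def served_model_id(model_ref: str) -> str:
    """Map a pipeline_map reference like 'vllm/Qwen/Qwen3-8B' to the
    HuggingFace model ID that the vLLM server actually serves."""
    prefix = "vllm/"
    if not model_ref.startswith(prefix):
        raise ValueError(f"not a vLLM model reference: {model_ref!r}")
    model_id = model_ref[len(prefix):]
    if ":" in model_id:
        # 'Qwen/Qwen3:8B' is an Ollama-style name; vLLM will return 404 for it
        raise ValueError(f"{model_id!r} is not a HuggingFace model ID")
    return model_id
```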

## Running vLLM

Archi does not manage the vLLM server. Below are examples for common deployment scenarios.

### Docker

```bash
docker run -d \
--name vllm-server \
--gpus all \
--ipc=host \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
-p 8000:8000 \
-e NCCL_P2P_DISABLE=1 \
vllm/vllm-openai:latest \
--model Qwen/Qwen3-8B \
--enable-auto-tool-choice \
--tool-call-parser hermes
```

Key flags:
- `--gpus all` — GPU passthrough
- `--ipc=host` — required for NCCL multi-GPU communication (Docker's default 64MB shm causes crashes)
- `--ulimit memlock=-1` — prevents OS from swapping out VRAM-mapped buffers
- `NCCL_P2P_DISABLE=1` — required for V100s and older GPU topologies

### Bare metal

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-8B \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host 0.0.0.0 \
--port 8000
```

### Slurm

```bash
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=7-00:00:00

module load cuda
source activate vllm

python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen3-8B \
--tensor-parallel-size 4 \
--enable-auto-tool-choice \
--tool-call-parser hermes \
--host 0.0.0.0 \
--port 8000
```

Then set `base_url` in your archi config to the Slurm node's address.

### Common vLLM server flags

These are configured on the vLLM server itself, not in archi:

| Flag | Description |
|---|---|
| `--gpu-memory-utilization 0.9` | Fraction of GPU VRAM to use (0.0-1.0) |
| `--max-model-len 8192` | Cap context window to reduce memory |
| `--tensor-parallel-size 4` | Shard model across N GPUs |
| `--dtype bfloat16` | Force weight precision |
| `--quantization awq` | Run quantized weights (awq, gptq, fp8) |
| `--enforce-eager` | Disable CUDA graphs to save memory |
| `--max-num-seqs 256` | Limit concurrent sequences |
| `--enable-auto-tool-choice` | Enable tool calling pathway |
| `--tool-call-parser hermes` | Parser for structured tool calls |

See [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) for the full reference.

## Tool Calling

vLLM supports function/tool calling for ReAct agents, but requires explicit server flags:

- `--enable-auto-tool-choice` — enables the tool calling pathway
- `--tool-call-parser <parser>` — selects the parser for the model family

| Model family | Parser |
|---|---|
| Qwen (Qwen2.5, Qwen3) | `hermes` |
| Mistral / Mixtral | `mistral` |
| Llama 3 | `llama3_json` |

These flags must be set when starting the vLLM server, not in archi's config.
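What those server flags unlock on the client side is the standard OpenAI `tools` / `tool_choice` request fields. The payload below is a minimal sketch; the `list_jobs` tool schema is purely illustrative, not an archi tool.

```python
# Request body an agent would send once tool calling is enabled server-side.
tool_call_payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "What jobs are pending?"}],
    "tools": [
        {
            "type": "function",
            "function": {
                "name": "list_jobs",  # hypothetical tool for illustration
                "description": "List batch jobs by state.",
                "parameters": {
                    "type": "object",
                    "properties": {"state": {"type": "string"}},
                    "required": ["state"],
                },
            },
        }
    ],
    # "auto" is what triggers the 400 error without --enable-auto-tool-choice
    "tool_choice": "auto",
}
```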

## Troubleshooting

### Archi can't reach vLLM

**Symptom**: `ConnectionError: Connection refused` or timeout.

- Verify vLLM is running: `curl http://<vllm-host>:8000/v1/models`
- If vLLM is on a different host, ensure network connectivity and firewall rules allow port 8000
- If running in Docker, ensure the archi container can reach the vLLM host (use `--network=host` or configure Docker networking)
- Check that `base_url` in your archi config matches the actual vLLM server address
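The checks above can be wrapped in a quick stdlib-only probe that doubles as a model-name check (it returns exactly the IDs you should use in your config). The function name is illustrative.

```python
import json
import urllib.error
import urllib.request

def probe_vllm(base_url: str, timeout: float = 5.0):
    """Return the list of model IDs served at base_url, or None if unreachable."""
    url = f"{base_url.rstrip('/')}/models"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            data = json.load(resp)
    except (urllib.error.URLError, OSError):
        return None
    return [m["id"] for m in data.get("data", [])]
```

If `probe_vllm("http://localhost:8000/v1")` returns `None`, the problem is networking or the server; if it returns a list that doesn't contain your configured model, the problem is the model name.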

### Model not found (404)

**Symptom**: `Error: model 'Qwen/Qwen3:8B' does not exist`.

vLLM uses HuggingFace model IDs, not Ollama-style names. Check:

- Config uses the exact model ID from `curl <vllm-host>:8000/v1/models`
- Use dashes, not colons: `Qwen/Qwen3-8B` (not `Qwen/Qwen3:8B`)

### Tool calling returns 400

**Symptom**: `400 Bad Request: "auto" tool choice requires --enable-auto-tool-choice`.

The vLLM server wasn't started with tool calling flags. Add to your vLLM launch command:

```bash
--enable-auto-tool-choice --tool-call-parser hermes
```

### Slow first response

The first request after startup may be slow (30-60s) while vLLM compiles CUDA kernels. Subsequent requests will be significantly faster. If this is a problem, start vLLM with `--enforce-eager` to skip CUDA graph compilation (at the cost of lower throughput).

### Insufficient VRAM

If vLLM crashes or the model doesn't fit in GPU memory:

- Lower `--gpu-memory-utilization` (e.g. `0.7`)
- Set `--max-model-len` to a smaller value (e.g. `4096`)
- Add `--quantization awq` or `--quantization gptq` if quantized weights are available
- Set `--enforce-eager` to disable CUDA graphs
- Increase `--tensor-parallel-size` and use more GPUs
- Try a smaller model
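To judge which of these knobs to reach for, a back-of-envelope estimate of the weight footprint helps: parameters times bytes per parameter. This sketch deliberately ignores KV cache and activations, which vLLM also allocates (that is what `--gpu-memory-utilization` budgets for), so treat it as a lower bound.

```python
def weight_vram_gib(n_params_billions: float, bytes_per_param: float) -> float:
    """Rough GiB of VRAM needed for model weights alone.

    bytes_per_param: 2 for fp16/bf16, 1 for fp8, ~0.5 for 4-bit AWQ/GPTQ.
    """
    return n_params_billions * 1e9 * bytes_per_param / 2**30

# An 8B model in bfloat16 needs ~15 GiB for weights alone, so it will not
# fit on a 16 GiB GPU once the KV cache is added; 4-bit quantization
# (~0.5 bytes/param) brings the weights down to roughly 4 GiB.
```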
2 changes: 2 additions & 0 deletions docs/mkdocs.yml
@@ -11,6 +11,8 @@ nav:
- Agents & Tools: agents_tools.md
- Configuration: configuration.md
- CLI Reference: cli_reference.md
- Advanced Setup and Deployment: advanced_setup_deploy.md
> **Collaborator:** this is already in this list

- vLLM Provider: vllm.md
> **Collaborator:** as mentioned above, please remove

- API Reference: api_reference.md
- Benchmarking: benchmarking.md
- Advanced Setup: advanced_setup_deploy.md
12 changes: 12 additions & 0 deletions examples/deployments/basic-vllm/condense.prompt
@@ -0,0 +1,12 @@
# Prompt used to condense a chat history and a follow up question into a stand alone question.
# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
#
# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
# {history}
# {question}
#
Given the following conversation between you (the AI named archi), a human user who needs help, and an expert, and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History: {history}
Follow Up Input: {question}
Standalone question:
41 changes: 41 additions & 0 deletions examples/deployments/basic-vllm/config.yaml
> **Collaborator:** why is there both basic-vllm and basic-gpu deployments? do we need both?

@@ -0,0 +1,41 @@
# Basic configuration file for an Archi deployment
# using a vLLM server for LLM inference.
#
# The vLLM server must be running and accessible at the base_url below.
# Archi does not manage the vLLM server — see docs/docs/vllm.md for setup guidance.
#
# run with:
# archi create --name my-archi-vllm --config examples/deployments/basic-vllm/config.yaml --services chatbot

name: my_archi

services:
chat_app:
agent_class: CMSCompOpsAgent
agents_dir: examples/agents
default_provider: vllm
default_model: "vllm:Qwen/Qwen3-8B"
providers:
vllm:
enabled: true
base_url: http://localhost:8000/v1 # URL of your vLLM server
default_model: "Qwen/Qwen3-8B"
models:
- "vllm:Qwen/Qwen3-8B"
trained_on: "FASRC DOCS"
port: 7861
external_port: 7861
vectorstore:
backend: postgres
data_manager:
port: 7889
external_port: 7889
auth:
enabled: false

data_manager:
sources:
links:
input_lists:
- config/sources.list
embedding_name: HuggingFaceEmbeddings
39 changes: 39 additions & 0 deletions examples/deployments/basic-vllm/miscellanea.list
@@ -0,0 +1,39 @@
# PPC
https://ppc.mit.edu/blog/2016/05/08/hello-world/
https://ppc.mit.edu/
https://ppc.mit.edu/news/
https://ppc.mit.edu/publications/
https://ppc.mit.edu/blog/2025/02/08/detailed-schedule-for-the-european-strategy/
https://ppc.mit.edu/blog/2025/02/14/first-cms-week-in-2025/
https://ppc.mit.edu/blog/2025/02/18/exploring-the-higgs-boson-in-our-latest-result/
https://ppc.mit.edu/blog/2025/02/04/news-from-the-chamonix-meeting/
https://ppc.mit.edu/blog/2025/02/11/cms-data-archival-at-mit/
https://ppc.mit.edu/blog/2025/03/28/cern-gets-support-from-canada/
https://ppc.mit.edu/blog/2025/04/08/breakthrough-prize-in-physics-2025/
https://ppc.mit.edu/blog/2025/04/04/the-fcc-at-cern-a-feasibly-circular-collider/
https://ppc.mit.edu/blog/2025/04/08/cleo-reached-magic-issue-number-5000/
https://ppc.mit.edu/blog/2025/04/14/maximizing-cms-competitive-advantage/
https://ppc.mit.edu/blog/2025/04/25/sueps-at-aps-march-april-meeting/
https://ppc.mit.edu/blog/2025/04/18/round-three/
https://ppc.mit.edu/blog/2025/04/14/first-beams-with-a-splash-in-2025/
https://ppc.mit.edu/blog/2025/05/27/fcc-weak-in-vienna-building-our-future/
https://ppc.mit.edu/blog/2025/06/04/new-paper-on-arxiv-submit-a-physics-analysis-facility-at-mit/
https://ppc.mit.edu/blog/2025/06/16/summer-cms-week-2025/
https://ppc.mit.edu/blog/2025/05/05/cms-records-first-2025-high-energy-collisions/
https://ppc.mit.edu/blog/2025/06/17/long-term-vision-for-particle-physics-from-the-national-academies/
https://ppc.mit.edu/blog/2025/06/20/conclusion-of-junes-cern-council-session-has-major-consequences-for-cms/
https://ppc.mit.edu/blog/2025/06/20/highest-pileup-recorded-at-cms-last-night/
https://ppc.mit.edu/blog/2025/06/25/selfie-station-at-wilson-hall/
https://ppc.mit.edu/mariarosaria-dalfonso/
https://ppc.mit.edu/kenneth-long-2/
https://ppc.mit.edu/blog/2025/06/27/open-symposium-on-the-european-strategy-for-particle-physics/
https://ppc.mit.edu/blog/2025/07/03/bridging-physics-and-computing-throughput-computing-2025/
https://ppc.mit.edu/pietro-lugato-2/
https://ppc.mit.edu/luca-lavezzo/
https://ppc.mit.edu/zhangqier-wang-2/
https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/
# A2
https://ppc.mit.edu/a2/
# Personnel
https://people.csail.mit.edu/kraska
https://physics.mit.edu/faculty/christoph-paus
18 changes: 18 additions & 0 deletions examples/deployments/basic-vllm/qa.prompt
@@ -0,0 +1,18 @@
# Prompt used to query LLM with appropriate context and question.
# This prompt is specific to SubMIT and likely will not perform well for other applications; for those, it is recommended to write your own prompt and point to it in the config
#
# All final prompts must have the following tags in them, which will be filled with the appropriate information:
# Question: {question}
# Context: {retriever_output}
#
You are a conversational chatbot named archi who helps people navigate a computing cluster named SubMIT.
You will be provided context in the form of relevant documents, such as previous communication between sys admins and Guides, a summary of the problem that the user is trying to solve and the important elements of the conversation, and the most recent chat history between you and the user to help you answer their questions.
Using your Linux and computing knowledge, answer the question at the end.
Unless otherwise indicated, assume the users are not well versed in computing.
Please do not assume that SubMIT machines have anything installed on top of native Linux except if the context mentions it.
If you don't know, say "I don't know", if you need to ask a follow up question, please do.

Context: {retriever_output}
Question: {question}
Chat History: {history}
Helpful Answer: