Add vLLM provider for local GPU inference #442

base: dev
@@ -0,0 +1,259 @@

# vLLM Provider

Run open-weight models on your own GPUs using [vLLM](https://docs.vllm.ai/) as an inference backend. Archi connects to any vLLM server via its OpenAI-compatible API — you deploy and manage vLLM independently.

## Why vLLM?

| | vLLM | Ollama | API providers |
|---|---|---|---|
| **Throughput** | High (PagedAttention, continuous batching) | Moderate | N/A (cloud) |
| **Multi-GPU** | Tensor-parallel across GPUs | Single GPU | N/A |
| **Tool calling** | Supported (with parser flag) | Model-dependent | Supported |
| **Cost** | Hardware only | Hardware only | Per-token |
| **Privacy** | Data stays on-premises | Data stays on-premises | Data leaves your network |

vLLM is the best fit when you need high-throughput local inference, multi-GPU support, or full data privacy with tool-calling capabilities.

## Architecture

```
┌──────────────────────┐            ┌──────────────────────┐
│  archi deployment    │            │  vLLM (external)     │
│                      │            │                      │
│  ┌────────────────┐  │   HTTP     │  Docker container    │
│  │ VLLMProvider   │──┼──────────> │  OR bare metal       │
│  │ (Python client)│  │   :8000    │  OR Slurm job        │
│  └────────────────┘  │   /v1/*    │  OR Kubernetes pod   │
│                      │            │                      │
└──────────────────────┘            └──────────────────────┘
```

Archi's `VLLMProvider` is a thin client that talks to vLLM's `/v1` API using the same `ChatOpenAI` LangChain class it would use for the OpenAI API. From the pipeline's perspective, vLLM looks identical to a remote OpenAI endpoint.
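Because the provider is just an OpenAI-compatible client, the request it emits is a standard chat-completions payload that vLLM accepts unchanged. A minimal sketch of what the thin client amounts to (the helper name is illustrative, not archi's actual code):

```python
import json

def build_chat_request(base_url: str, model: str, user_message: str) -> tuple[str, str]:
    """Build the URL and JSON body for an OpenAI-style chat completion.

    vLLM serves this format on its /v1/chat/completions route.
    """
    url = base_url.rstrip("/") + "/chat/completions"
    body = json.dumps({
        "model": model,  # must match the HuggingFace ID vLLM is serving
        "messages": [{"role": "user", "content": user_message}],
    })
    return url, body

url, body = build_chat_request("http://localhost:8000/v1", "Qwen/Qwen3-8B", "Hello")
print(url)  # → http://localhost:8000/v1/chat/completions
```

Swapping `base_url` between an OpenAI endpoint and a vLLM server changes nothing else in the request, which is why the same client class works for both.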
**Archi does not manage the vLLM server.** You deploy, configure, and maintain vLLM independently — whether as a Docker container, a bare metal process, a Slurm job, or a Kubernetes pod. Archi only needs a `base_url` to connect.

## Quick Start

### 1. Start a vLLM server

See [Running vLLM](#running-vllm) below for Docker, bare metal, and Slurm examples.

### 2. Configure archi

In your config YAML, set up the vLLM provider with the URL of your server:

```yaml
services:
  chat_app:
    default_provider: vllm
    default_model: "vllm:Qwen/Qwen3-8B"
    providers:
      vllm:
        enabled: true
        base_url: http://localhost:8000/v1  # URL of your vLLM server
        default_model: "Qwen/Qwen3-8B"
        models:
          - "vllm:Qwen/Qwen3-8B"
```

### 3. Deploy archi

```bash
archi create -n my-deployment \
  -c config.yaml \
  -e .env \
  --services chatbot
```

### 4. Verify

```bash
# Check vLLM is serving
curl http://localhost:8000/v1/models

# Check archi can reach it
curl http://localhost:7861/api/health
```
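For an end-to-end check, you can send a single chat completion straight to vLLM. This assumes the Quick Start setup above; the model name must match what your server reports from `/v1/models`:

```shell
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-8B",
        "messages": [{"role": "user", "content": "Say hello"}]
      }'
```

A JSON response with a `choices` array confirms the full inference path works before involving archi.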
## Configuration Reference

### Provider settings

The vLLM provider is configured under `services.chat_app.providers.vllm`:

```yaml
services:
  chat_app:
    providers:
      vllm:
        enabled: true
        base_url: http://localhost:8000/v1
        default_model: "Qwen/Qwen3-8B"
        models:
          - "vllm:Qwen/Qwen3-8B"
```

| Setting | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable the vLLM provider |
| `base_url` | string | `http://localhost:8000/v1` | vLLM server OpenAI-compatible endpoint |
| `default_model` | string | — | HuggingFace model ID to use for inference |
| `models` | list | — | Available model IDs for the UI model selector |

### Model references

Anywhere a model is referenced in `pipeline_map`, use the `vllm/` prefix:

```yaml
archi:
  pipeline_map:
    CMSCompOpsAgent:
      models:
        required:
          agent_model: vllm/Qwen/Qwen3-8B
```

The part after `vllm/` must match the HuggingFace model ID that vLLM is serving.

> **Model naming**: vLLM uses HuggingFace model IDs (e.g. `Qwen/Qwen3-8B`), not Ollama-style names (e.g. `Qwen/Qwen3:8B`).
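A small helper makes the mapping between archi-style model references and the bare IDs vLLM expects explicit. This is an illustrative sketch, not archi's internal parsing:

```python
def to_vllm_model_id(reference: str) -> str:
    """Strip an archi-style provider prefix ("vllm:" or "vllm/") from a model reference.

    The remainder is the HuggingFace model ID the vLLM server must be serving.
    """
    for prefix in ("vllm:", "vllm/"):
        if reference.startswith(prefix):
            return reference[len(prefix):]
    return reference  # already a bare model ID

print(to_vllm_model_id("vllm/Qwen/Qwen3-8B"))  # → Qwen/Qwen3-8B
print(to_vllm_model_id("vllm:Qwen/Qwen3-8B"))  # → Qwen/Qwen3-8B
```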
## Running vLLM

Archi does not manage the vLLM server. Below are examples for common deployment scenarios.

### Docker

```bash
docker run -d \
  --name vllm-server \
  --gpus all \
  --ipc=host \
  --ulimit memlock=-1 \
  --ulimit stack=67108864 \
  -p 8000:8000 \
  -e NCCL_P2P_DISABLE=1 \
  vllm/vllm-openai:latest \
  --model Qwen/Qwen3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes
```

Key flags:

- `--gpus all` — GPU passthrough
- `--ipc=host` — required for NCCL multi-GPU communication (Docker's default 64MB shm causes crashes)
- `--ulimit memlock=-1` — prevents the OS from swapping out VRAM-mapped buffers
- `NCCL_P2P_DISABLE=1` — required for V100s and older GPU topologies
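If you prefer Compose, the same server might be declared as below. This is a sketch: the image tag, model, and GPU reservation syntax assume a recent Docker Engine with the NVIDIA Container Toolkit installed.

```yaml
services:
  vllm:
    image: vllm/vllm-openai:latest
    ipc: host                      # equivalent of --ipc=host for NCCL
    ports:
      - "8000:8000"
    environment:
      - NCCL_P2P_DISABLE=1
    ulimits:
      memlock: -1
      stack: 67108864
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    command: >
      --model Qwen/Qwen3-8B
      --enable-auto-tool-choice
      --tool-call-parser hermes
```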
### Bare metal

```bash
pip install vllm

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000
```

### Slurm

```bash
#!/bin/bash
#SBATCH --gres=gpu:4
#SBATCH --cpus-per-task=16
#SBATCH --mem=128G
#SBATCH --time=7-00:00:00

module load cuda
source activate vllm

python -m vllm.entrypoints.openai.api_server \
  --model Qwen/Qwen3-8B \
  --tensor-parallel-size 4 \
  --enable-auto-tool-choice \
  --tool-call-parser hermes \
  --host 0.0.0.0 \
  --port 8000
```

Then set `base_url` in your archi config to the Slurm node's address.
### Common vLLM server flags

These are configured on the vLLM server itself, not in archi:

| Flag | Description |
|---|---|
| `--gpu-memory-utilization 0.9` | Fraction of GPU VRAM to use (0.0-1.0) |
| `--max-model-len 8192` | Cap context window to reduce memory |
| `--tensor-parallel-size 4` | Shard model across N GPUs |
| `--dtype bfloat16` | Force weight precision |
| `--quantization awq` | Run quantized weights (awq, gptq, fp8) |
| `--enforce-eager` | Disable CUDA graphs to save memory |
| `--max-num-seqs 256` | Limit concurrent sequences |
| `--enable-auto-tool-choice` | Enable tool calling pathway |
| `--tool-call-parser hermes` | Parser for structured tool calls |

See the [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) for the full reference.

## Tool Calling

vLLM supports function/tool calling for ReAct agents, but requires explicit server flags:

- `--enable-auto-tool-choice` — enables the tool calling pathway
- `--tool-call-parser <parser>` — selects the parser for the model family

| Model family | Parser |
|---|---|
| Qwen (Qwen2.5, Qwen3) | `hermes` |
| Mistral / Mixtral | `mistral` |
| Llama 3 | `llama3_json` |

These flags must be set when starting the vLLM server, not in archi's config.
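With those flags set, tools travel in the standard OpenAI function-calling format. A sketch of the request body an agent would send (the `get_job_status` tool here is hypothetical, for illustration only):

```python
# A hypothetical tool definition in the OpenAI function-calling schema,
# which vLLM's tool-calling pathway consumes as-is.
get_job_status = {
    "type": "function",
    "function": {
        "name": "get_job_status",
        "description": "Look up the status of a batch job by ID.",
        "parameters": {
            "type": "object",
            "properties": {"job_id": {"type": "string"}},
            "required": ["job_id"],
        },
    },
}

payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Is job 1234 done?"}],
    "tools": [get_job_status],
    "tool_choice": "auto",  # this value is what requires --enable-auto-tool-choice
}
print(payload["tool_choice"])  # → auto
```

The `--tool-call-parser` flag tells vLLM how to extract structured calls like this from the model's raw output, which is why it must match the model family.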
## Troubleshooting

### Archi can't reach vLLM

**Symptom**: `ConnectionError: Connection refused` or timeout.

- Verify vLLM is running: `curl http://<vllm-host>:8000/v1/models`
- If vLLM is on a different host, ensure network connectivity and firewall rules allow port 8000
- If running in Docker, ensure the archi container can reach the vLLM host (use `--network=host` or configure Docker networking)
- Check that `base_url` in your archi config matches the actual vLLM server address

### Model not found (404)

**Symptom**: `Error: model 'Qwen/Qwen3:8B' does not exist`.

vLLM uses HuggingFace model IDs, not Ollama-style names. Check:

- Config uses the exact model ID from `curl <vllm-host>:8000/v1/models`
- Use dashes, not colons: `Qwen/Qwen3-8B` (not `Qwen/Qwen3:8B`)

### Tool calling returns 400

**Symptom**: `400 Bad Request: "auto" tool choice requires --enable-auto-tool-choice`.

The vLLM server wasn't started with tool calling flags. Add to your vLLM launch command:

```bash
--enable-auto-tool-choice --tool-call-parser hermes
```

### Slow first response

The first request after startup may be slow (30-60s) while vLLM compiles CUDA kernels. Subsequent requests will be significantly faster. If this is a problem, start vLLM with `--enforce-eager` to skip CUDA graph compilation (at the cost of lower throughput).

### Insufficient VRAM

If vLLM crashes or the model doesn't fit in GPU memory:

- Lower `--gpu-memory-utilization` (e.g. `0.7`)
- Set `--max-model-len` to a smaller value (e.g. `4096`)
- Add `--quantization awq` or `--quantization gptq` if quantized weights are available
- Set `--enforce-eager` to disable CUDA graphs
- Increase `--tensor-parallel-size` and use more GPUs
- Try a smaller model
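Before tuning those flags, a back-of-the-envelope estimate helps: weight memory alone is roughly parameter count times bytes per parameter, with KV cache and activations on top. The numbers below are approximations, not exact vLLM allocations:

```python
def weight_memory_gib(num_params_billions: float, bytes_per_param: float) -> float:
    """Approximate GiB needed just to hold the model weights."""
    return num_params_billions * 1e9 * bytes_per_param / 2**30

# An 8B model in bfloat16 (2 bytes/param) needs ~15 GiB for weights alone,
# before the KV cache -- which is why it is tight on a single 16 GiB GPU.
print(round(weight_memory_gib(8, 2), 1))  # → 14.9
```

If the weights alone exceed a single GPU's VRAM, quantization (`--quantization awq`, ~0.5 bytes/param) or tensor parallelism are the realistic options.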
@@ -11,6 +11,8 @@ nav:

```yaml
  - Agents & Tools: agents_tools.md
  - Configuration: configuration.md
  - CLI Reference: cli_reference.md
  - Advanced Setup and Deployment: advanced_setup_deploy.md
```

> **Collaborator**: this is already in this list

```yaml
  - vLLM Provider: vllm.md
```

> **Collaborator**: as mentioned above, please remove

```yaml
  - API Reference: api_reference.md
  - Benchmarking: benchmarking.md
  - Advanced Setup: advanced_setup_deploy.md
```
@@ -0,0 +1,12 @@

```
# Prompt used to condense a chat history and a follow up question into a stand alone question.
# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
#
# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
# {history}
# {question}
#
Given the following conversation between you (the AI named archi), a human user who needs help, and an expert, and a follow up question, rephrase the follow up question to be a standalone question, in its original language.

Chat History: {history}
Follow Up Input: {question}
Standalone question:
```

> **Collaborator**: why is there both basic-vllm and basic-gpu deployments? do we need both?
@@ -0,0 +1,41 @@

```yaml
# Basic configuration file for an Archi deployment
# using a vLLM server for LLM inference.
#
# The vLLM server must be running and accessible at the base_url below.
# Archi does not manage the vLLM server — see docs/docs/vllm.md for setup guidance.
#
# run with:
#   archi create --name my-archi-vllm --config examples/deployments/basic-vllm/config.yaml --services chatbot

name: my_archi

services:
  chat_app:
    agent_class: CMSCompOpsAgent
    agents_dir: examples/agents
    default_provider: vllm
    default_model: "vllm:Qwen/Qwen3-8B"
    providers:
      vllm:
        enabled: true
        base_url: http://localhost:8000/v1  # URL of your vLLM server
        default_model: "Qwen/Qwen3-8B"
        models:
          - "vllm:Qwen/Qwen3-8B"
    trained_on: "FASRC DOCS"
    port: 7861
    external_port: 7861
  vectorstore:
    backend: postgres
  data_manager:
    port: 7889
    external_port: 7889
  auth:
    enabled: false

data_manager:
  sources:
    links:
      input_lists:
        - config/sources.list
  embedding_name: HuggingFaceEmbeddings
```
@@ -0,0 +1,39 @@

```
# PPC
https://ppc.mit.edu/blog/2016/05/08/hello-world/
https://ppc.mit.edu/
https://ppc.mit.edu/news/
https://ppc.mit.edu/publications/
https://ppc.mit.edu/blog/2025/02/08/detailed-schedule-for-the-european-strategy/
https://ppc.mit.edu/blog/2025/02/14/first-cms-week-in-2025/
https://ppc.mit.edu/blog/2025/02/18/exploring-the-higgs-boson-in-our-latest-result/
https://ppc.mit.edu/blog/2025/02/04/news-from-the-chamonix-meeting/
https://ppc.mit.edu/blog/2025/02/11/cms-data-archival-at-mit/
https://ppc.mit.edu/blog/2025/03/28/cern-gets-support-from-canada/
https://ppc.mit.edu/blog/2025/04/08/breakthrough-prize-in-physics-2025/
https://ppc.mit.edu/blog/2025/04/04/the-fcc-at-cern-a-feasibly-circular-collider/
https://ppc.mit.edu/blog/2025/04/08/cleo-reached-magic-issue-number-5000/
https://ppc.mit.edu/blog/2025/04/14/maximizing-cms-competitive-advantage/
https://ppc.mit.edu/blog/2025/04/25/sueps-at-aps-march-april-meeting/
https://ppc.mit.edu/blog/2025/04/18/round-three/
https://ppc.mit.edu/blog/2025/04/14/first-beams-with-a-splash-in-2025/
https://ppc.mit.edu/blog/2025/05/27/fcc-weak-in-vienna-building-our-future/
https://ppc.mit.edu/blog/2025/06/04/new-paper-on-arxiv-submit-a-physics-analysis-facility-at-mit/
https://ppc.mit.edu/blog/2025/06/16/summer-cms-week-2025/
https://ppc.mit.edu/blog/2025/05/05/cms-records-first-2025-high-energy-collisions/
https://ppc.mit.edu/blog/2025/06/17/long-term-vision-for-particle-physics-from-the-national-academies/
https://ppc.mit.edu/blog/2025/06/20/conclusion-of-junes-cern-council-session-has-major-consequences-for-cms/
https://ppc.mit.edu/blog/2025/06/20/highest-pileup-recorded-at-cms-last-night/
https://ppc.mit.edu/blog/2025/06/25/selfie-station-at-wilson-hall/
https://ppc.mit.edu/mariarosaria-dalfonso/
https://ppc.mit.edu/kenneth-long-2/
https://ppc.mit.edu/blog/2025/06/27/open-symposium-on-the-european-strategy-for-particle-physics/
https://ppc.mit.edu/blog/2025/07/03/bridging-physics-and-computing-throughput-computing-2025/
https://ppc.mit.edu/pietro-lugato-2/
https://ppc.mit.edu/luca-lavezzo/
https://ppc.mit.edu/zhangqier-wang-2/
https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/
# A2
https://ppc.mit.edu/a2/
# Personnel
https://people.csail.mit.edu/kraska
https://physics.mit.edu/faculty/christoph-paus
```
@@ -0,0 +1,18 @@

```
# Prompt used to query LLM with appropriate context and question.
# This prompt is specific to subMIT and likely will not perform well for other applications, where it is recommended to write your own prompt and change it in the config
#
# All final prompts must have the following tags in them, which will be filled with the appropriate information:
# Question: {question}
# Context: {retriever_output}
#
You are a conversational chatbot named archi who helps people navigate a computing cluster named SubMIT.
You will be provided context in the form of relevant documents, such as previous communication between sys admins and Guides, a summary of the problem that the user is trying to solve and the important elements of the conversation, and the most recent chat history between you and the user to help you answer their questions.
Using your Linux and computing knowledge, answer the question at the end.
Unless otherwise indicated, assume the users are not well versed in computing.
Please do not assume that SubMIT machines have anything installed on top of native Linux except if the context mentions it.
If you don't know, say "I don't know", if you need to ask a follow up question, please do.

Context: {retriever_output}
Question: {question}
Chat History: {history}
Helpful Answer:
```

> why does vllm get its own provider page? should be the same as the other providers

> this also looks outdated (references the side-car idea)