diff --git a/docs/docs/user_guide.md b/docs/docs/user_guide.md
index c512520e7..99157bebf 100644
--- a/docs/docs/user_guide.md
+++ b/docs/docs/user_guide.md
@@ -116,6 +116,572 @@ Secrets are stored in a `.env` file passed via `--env-file`. Required secrets de
 | `REDMINE_USER` / `REDMINE_PW` | Redmine source |
 
 See [Data Sources](data_sources.md) and [Services](services.md) for service-specific secrets.
 
+## Interfaces/Services
+
+These are the different apps that Archi supports, which allow you to interact with the AI pipelines.
+
+### Piazza Interface
+
+Set up Archi to read posts from your Piazza forum and post draft responses to a specified Slack channel. This requires a Piazza login (email and password), the network ID of your Piazza class, and a webhook for the Slack channel Archi will post to. See below for a step-by-step description.
+
+1. Go to [https://api.slack.com/apps](https://api.slack.com/apps) and sign in to the workspace where you eventually want Archi to post (note that doing this in a business workspace, like the MIT one, will require approval of the app/bot).
+2. Click 'Create New App', and then 'From scratch'. Name your app and again select the correct workspace. Then hit 'Create App'.
+3. Now you have your app, and there are a few things to configure before you can launch Archi:
+4. Go to Incoming Webhooks under Features, and toggle it on.
+5. Click 'Add New Webhook', and select the channel you want Archi to post to.
+6. Now, copy the 'Webhook URL' and paste it into the secrets file, and handle it like any other secret!
+
+#### Configuration
+
+Beyond the standard required configuration fields, the network ID of the Piazza channel is required (see below for an example config).
+You can get the network ID by navigating to the class homepage and grabbing the sequence that follows 'https://piazza.com/class/'. For example, the 8.01 Fall 2024 homepage is 'https://piazza.com/class/m0g3v0ahsqm2lg', so the network ID is 'm0g3v0ahsqm2lg'.
+
+Example minimal config for the Piazza interface:
+
+```yaml
+name: bare_minimum_configuration #REQUIRED
+
+data_manager:
+  sources:
+    links:
+      input_lists:
+        - class_info.list # class info links
+
+archi:
+  [... archi config ...]
+
+services:
+  piazza:
+    network_id: # REQUIRED
+  chat_app:
+    trained_on: "Your class materials" # REQUIRED
+```
+
+#### Secrets
+
+The necessary secrets for deploying the Piazza service are the following:
+
+```bash
+PIAZZA_EMAIL=...
+PIAZZA_PASSWORD=...
+SLACK_WEBHOOK=...
+```
+
+The Slack webhook secret is described above. The Piazza email and password should be those of one of the class instructors. Remember to store these values under the exact variable names shown above.
+
+#### Running
+
+To run the Piazza service, simply include `piazza` in the `--services` flag. For example:
+
+```bash
+archi create [...] --services=chatbot,piazza
+```
+
+---
+
+### Redmine/Mailbox Interface
+
+Archi will read all new tickets in a Redmine project and draft a response as a comment on each ticket.
+Once a ticket is updated to the "Resolved" status by an admin, Archi will send the response as an email to the user who opened the ticket.
+The admin can modify Archi's response before sending it out.
+
+#### Configuration
+
+```yaml
+services:
+  redmine_mailbox:
+    url: https://redmine.example.com
+    project: my-project
+    redmine_update_time: 10
+    mailbox_update_time: 10
+    answer_tag: "-- Archi -- Resolving email was sent"
+```
+
+#### Secrets
+
+Add the following secrets to your `.env` file:
+```bash
+IMAP_USER=...
+IMAP_PW=...
+REDMINE_USER=...
+REDMINE_PW=...
+SENDER_SERVER=...
+SENDER_PORT=587
+SENDER_REPLYTO=...
+SENDER_USER=...
+SENDER_PW=...
+```
+
+#### Running
+
+```bash
+archi create [...]
--services=chatbot,redmine-mailer
+```
+
+---
+
+### Mattermost Interface
+
+Set up Archi to read posts from your Mattermost forum and post draft responses to a specified Mattermost channel.
+
+#### Configuration
+
+```yaml
+services:
+  mattermost:
+    update_time: 60
+```
+
+#### Secrets
+
+You need to specify a webhook, access token, and channel identifiers:
+```bash
+MATTERMOST_WEBHOOK=...
+MATTERMOST_PAK=...
+MATTERMOST_CHANNEL_ID_READ=...
+MATTERMOST_CHANNEL_ID_WRITE=...
+```
+
+#### Running
+
+To run the Mattermost service, include it when selecting services. For example:
+```bash
+archi create [...] --services=chatbot,mattermost
+```
+
+---
+
+### Grafana Interface
+
+Monitor the performance of your Archi instance with the Grafana interface. This service provides a web-based dashboard to visualize various metrics related to system performance, LLM usage, and more.
+
+> Note: if you are redeploying a version of Archi you have already used (i.e., you have not removed the images/volumes for a given `--name`), the Postgres database will already have been initialized without the Grafana user, and Grafana will not be able to connect. Make sure to deploy a fresh instance.
+
+#### Configuration
+
+```yaml
+services:
+  grafana:
+    external_port: 3000
+```
+
+#### Secrets
+
+Grafana shares the Postgres database with other services, so you need both the database password and a Grafana-specific password:
+```bash
+PG_PASSWORD=
+GRAFANA_PG_PASSWORD=
+```
+
+#### Running
+
+Deploy Grafana alongside your other services:
+```bash
+archi create [...] --services=chatbot,grafana
+```
+and you should see something like this:
+```
+CONTAINER ID  IMAGE                                      COMMAND             CREATED        STATUS                  PORTS                             NAMES
+87f1c7289d29  docker.io/library/postgres:17              postgres            9 minutes ago  Up 9 minutes (healthy)  5432/tcp                          postgres-gtesting2
+40130e8e23de  docker.io/library/grafana-gtesting2:2000                       9 minutes ago  Up 9 minutes            0.0.0.0:3000->3000/tcp, 3000/tcp  grafana-gtesting2
+d6ce8a149439  localhost/chat-gtesting2:2000              python -u archi/...
9 minutes ago  Up 9 minutes  0.0.0.0:7861->7861/tcp  chat-gtesting2
+```
+
+where the Grafana interface is accessible at `your-hostname:3000`. To change the external port from `3000`, set `services.grafana.external_port` in the config. The default login and password are both "admin"; after first logging in, you will be prompted to change them if you wish. Navigate to the Archi dashboard from the home page by going to the menu > Dashboards > Archi > Archi Usage. Note: `your-hostname` here is just the name of the machine. Grafana uses its default configuration (`localhost`), but unlike the chat interface there are no APIs templated with a selected hostname, so container networking handles this nicely.
+
+> Pro tip: once at the web interface, for the "Recent Conversation Messages (Clean Text + Link)" panel, click the three little dots in the top right-hand corner of the panel, click "Edit", and on the right go to, e.g., "Override 4" (it should have Fields with name: clean text; likewise Override 7 for the context column) and override the property "Cell options > Cell value inspect". This lets you expand text boxes whose messages are longer than fit in the cell. Make sure you click Apply to keep the changes.
+
+> Pro tip 2: If you want to download all of the information from any panel as a CSV, go to the same three dots and click "Inspect", and you should see the option.
+
+---
+
+### Grader Interface
+
+This interface launches a website that, for a provided solution and rubric (and a couple of other inputs detailed below), grades scanned images of a handwritten solution for the specified problem(s).
+
+> Nota bene: this is not yet fully generalized and "service"-ready; it is intended for testing grading pipelines and as a base off of which to build a potential grading app.
+
+#### Requirements
+
+To launch the service the following files are required:
+
+- `users.csv`.
This is a `.csv` file containing two columns, "MIT email" and "Unique code", e.g.:
+
+```
+MIT email,Unique code
+username@mit.edu,222
+```
+
+For now, the system requires the emails to be in the MIT domain, namely, contain "@mit.edu". TODO: make this an argument that is passed (e.g., school/email domain)
+
+- `solution_with_rubric_*.txt`. These are .txt files that contain the problem solution followed by the rubric. The file names must follow this pattern exactly, where `*` is the problem number. There should be one of these files for every problem you want the app to be able to grade. The top of the file should be the problem name with a line of dashes ("-") below, e.g.:
+
+```
+Anti-Helmholtz Coils
+---------------------------------------------------
+```
+
+These files should live in a directory which you will pass to the config, and Archi will handle the rest.
+
+- `admin_password.txt`. This file will be passed as a secret and serves as the admin code to log in to the page where you can reset attempts for students.
+
+#### Secrets
+
+The only grading-specific secret is the admin password, which, as shown above, should be provided as follows:
+
+```bash
+ADMIN_PASSWORD=your_password
+```
+
+Then it behaves like any other secret.
+
+#### Configuration
+
+The required fields in the configuration file are different from those of the rest of the Archi services. Below is an example:
+
+```yaml
+name: grading_test # REQUIRED
+
+archi:
+  pipelines:
+    - GradingPipeline
+  pipeline_map:
+    GradingPipeline:
+      prompts:
+        required:
+          final_grade_prompt: final_grade.prompt
+      models:
+        required:
+          final_grade_model: OllamaInterface
+    ImageProcessingPipeline:
+      prompts:
+        required:
+          image_processing_prompt: image_processing.prompt
+      models:
+        required:
+          image_processing_model: OllamaInterface
+
+services:
+  chat_app:
+    trained_on: "rubrics, class info, etc."
# REQUIRED
+  grader_app:
+    num_problems: 1 # REQUIRED
+    local_rubric_dir: ~/grading/my_rubrics # REQUIRED
+    local_users_csv_dir: ~/grading/logins # REQUIRED
+
+data_manager:
+  [...]
+```
+
+1. `name` -- The name of your configuration (required).
+2. `archi.pipelines` -- List of pipelines to use (e.g., `GradingPipeline`, `ImageProcessingPipeline`).
+3. `archi.pipeline_map` -- Mapping of pipelines to their required prompts and models.
+4. `archi.pipeline_map.GradingPipeline.prompts.required.final_grade_prompt` -- Path to the grading prompt file for evaluating student solutions.
+5. `archi.pipeline_map.GradingPipeline.models.required.final_grade_model` -- Model class for grading (e.g., `OllamaInterface`, `HuggingFaceOpenLLM`).
+6. `archi.pipeline_map.ImageProcessingPipeline.prompts.required.image_processing_prompt` -- Path to the prompt file for image processing.
+7. `archi.pipeline_map.ImageProcessingPipeline.models.required.image_processing_model` -- Model class for image processing (e.g., `OllamaInterface`, `HuggingFaceImageLLM`).
+8. `services.chat_app.trained_on` -- A brief description of the data or materials Archi is trained on (required).
+9. `services.grader_app.num_problems` -- Number of problems the grading service should expect (must match the number of rubric files).
+10. `services.grader_app.local_rubric_dir` -- Directory containing the `solution_with_rubric_*.txt` files.
+11. `services.grader_app.local_users_csv_dir` -- Directory containing the `users.csv` file.
+
+For ReAct-style agents (e.g., `CMSCompOpsAgent`), you may optionally set `archi.pipeline_map..recursion_limit` (default `100`) to control the LangGraph recursion cap; when the limit is hit, the agent returns a final wrap-up response using the collected context.
+
+#### Running
+
+```bash
+archi create [...] --services=grader
+```
+
+---
+
+## Models
+
+Models are one of the following:
+
+1. Hosted locally, either via vLLM or HuggingFace transformers.
+2. Accessed via an API, e.g., OpenAI, Anthropic, etc.
+3. Accessed via an Ollama server instance.
+
+### Local Models
+
+To use a local model, specify one of the local model classes in `models.py`:
+
+- `HuggingFaceOpenLLM`
+- `HuggingFaceImageLLM`
+- `VLLM`
+
+### vLLM
+
+For high-throughput GPU inference with tool-calling support, Archi can connect to an external [vLLM](https://docs.vllm.ai/) server. Reference models with the `vllm/` prefix in your config:
+
+```yaml
+services:
+  chat_app:
+    providers:
+      vllm:
+        enabled: true
+        base_url: http://your-vllm-host:8000/v1
+        default_model: "Qwen/Qwen3-8B"
+```
+
+You deploy and manage the vLLM server independently. See the [vLLM Provider](vllm.md) page for setup examples (Docker, bare metal, Slurm) and troubleshooting.
+
+### Models via APIs
+
+We support the following model classes in `models.py` for models accessed via APIs:
+
+- `OpenAILLM`
+- `OpenRouterLLM`
+- `AnthropicLLM`
+
+#### OpenRouter
+
+OpenRouter uses the OpenAI-compatible API. Configure it by setting `OpenRouterLLM` in your config and providing
+`OPENROUTER_API_KEY`. Optional attribution headers can be set via `OPENROUTER_SITE_URL` and `OPENROUTER_APP_NAME`.
+
+```yaml
+archi:
+  model_class_map:
+    OpenRouterLLM:
+      class: OpenRouterLLM
+      kwargs:
+        model_name: openai/gpt-4o-mini
+        temperature: 0.7
+```
+
+### Ollama
+
+To use an Ollama server instance for the chatbot, specify `OllamaInterface` as the model name. To use models hosted on the Ollama server, provide both the URL of the server and the name of a hosted model in the keyword args:
+
+```yaml
+archi:
+  model_class_map:
+    OllamaInterface:
+      kwargs:
+        base_model: "gemma3" # example
+        url: "url-for-server"
+```
+
+In this case, the `gemma3` model is hosted on the Ollama server at `url-for-server`. You can check which models are hosted on your server by going to `url-for-server/models`.
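As a cross-check from Python, the sketch below lists the hosted models via Ollama's standard REST listing endpoint (`/api/tags`); the server URL is a placeholder:

```python
# Sketch: list the models hosted on an Ollama server.
# Assumes the stock Ollama REST listing endpoint (/api/tags);
# "http://url-for-server" is a placeholder for your server URL.
import json
import urllib.request

def models_url(server_url: str) -> str:
    """Build the model-listing endpoint from the base server URL."""
    return server_url.rstrip("/") + "/api/tags"

def list_models(server_url: str) -> list:
    """Return the names of the models hosted on the server."""
    with urllib.request.urlopen(models_url(server_url)) as resp:
        return [m["name"] for m in json.load(resp).get("models", [])]

# Example (requires a reachable server):
# print(list_models("http://url-for-server"))
```

If the `base_model` you configured does not appear in the returned list, pull it on the server first.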
+ +### Bring Your Own Key (BYOK) + +Archi supports Bring Your Own Key (BYOK), allowing users to provide their own API keys for LLM providers at runtime. This enables: + +- **Cost attribution**: Users pay for their own API usage +- **Provider flexibility**: Switch between providers without admin intervention +- **Privacy**: Use personal accounts for sensitive queries + +#### Key Hierarchy + +API keys are resolved in the following order (highest priority first): + +1. **Environment Variables**: Admin-configured keys (e.g., `OPENAI_API_KEY`) +2. **Docker Secrets**: Keys mounted at `/run/secrets/` +3. **Session Storage**: User-provided keys via the Settings UI + +!!! note + Environment variable keys always take precedence. If an admin configures a key via environment variable, users cannot override it with their own key. + +#### Using BYOK in the Chat Interface + +1. Open the **Settings** modal (gear icon) +2. Expand the **API Keys** section +3. For each provider you want to use: + - Enter your API key in the input field + - Click **Save** to store it in your session +4. Select your preferred **Provider** and **Model** from the dropdowns +5. Start chatting! 
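The key-resolution order described above can be sketched in a few lines (function names and path details are illustrative, not Archi's actual internals):

```python
# Illustrative sketch of the BYOK key-resolution order:
# environment variable > Docker secret > session-provided key.
# Names and paths are hypothetical, not archi's real code.
import os
from pathlib import Path

def resolve_api_key(env_var: str, session_keys: dict):
    """Return the first key found, searching highest priority first."""
    # 1. An admin-configured environment variable always wins.
    value = os.environ.get(env_var)
    if value:
        return value
    # 2. A Docker secret mounted under /run/secrets/.
    secret = Path("/run/secrets") / env_var.lower()
    if secret.exists():
        return secret.read_text().strip()
    # 3. A key the user saved in their session via the Settings UI.
    return session_keys.get(env_var)
```

Because step 1 short-circuits, a user-supplied session key can never override an admin-configured environment variable, which is exactly the precedence rule noted above.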
+ +**Status Indicators:** + +| Icon | Meaning | +|------|---------| +| ✓ Env | Key configured via environment variable (cannot be changed) | +| ✓ Session | Key configured via your session | +| ○ | No key configured | + +#### Supported Providers + +| Provider | Environment Variable | API Key Format | +|----------|---------------------|----------------| +| OpenAI | `OPENAI_API_KEY` | `sk-...` | +| Anthropic | `ANTHROPIC_API_KEY` | `sk-ant-...` | +| Google Gemini | `GOOGLE_API_KEY` | `AIza...` | +| OpenRouter | `OPENROUTER_API_KEY` | `sk-or-...` | + +#### Security Considerations + +- **Keys are never logged** - API keys are redacted from all log output +- **Keys are never echoed** - The UI only shows masked placeholders +- **Session-scoped** - Keys are cleared when you log out or your session expires +- **HTTPS recommended** - For production deployments, always use HTTPS to protect keys in transit + +#### API Endpoints + +For programmatic access, the following endpoints are available: + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/api/providers/keys` | GET | Get status of all provider keys | +| `/api/providers/keys/set` | POST | Set a session API key (validates before storing) | +| `/api/providers/keys/clear` | POST | Clear a session API key | + +--- + +## Vector Store + +The vector store is a database that stores document embeddings, enabling semantic and/or lexical search over your knowledge base. Archi uses PostgreSQL with pgvector as the default vector store backend to index and retrieve relevant documents based on similarity to user queries. + +### Backend Selection + +Archi uses PostgreSQL with the pgvector extension as its vector store backend. This provides production-grade vector similarity search integrated with your existing PostgreSQL database. 
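Under the hood, pgvector expresses nearest-neighbour search as SQL ordering operators. A hypothetical query builder (table and column names are illustrative, not Archi's actual schema) shows the shape of such a lookup:

```python
# Illustrative only: pgvector's SQL operators for the three distance
# metrics listed under "Distance Metrics" below.
# Table/column names here are hypothetical.
OPERATORS = {
    "cosine": "<=>",  # cosine distance
    "l2": "<->",      # Euclidean (L2) distance
    "ip": "<#>",      # negative inner product
}

def similarity_query(distance_metric: str, k: int = 5) -> str:
    """Build a top-k nearest-neighbour query for the chosen metric."""
    op = OPERATORS[distance_metric]
    return (
        "SELECT id, content FROM chunks "
        f"ORDER BY embedding {op} %(query_embedding)s "
        f"LIMIT {k}"
    )
```

Retrieval then amounts to embedding the user's query and executing this statement with the vector bound as a parameter.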
+ +Configure vector store settings in your configuration file: + +```yaml +services: + vectorstore: + backend: postgres # PostgreSQL with pgvector (only supported backend) +``` + +### Configuration + +Vector store settings are configured under the `data_manager` section: + +```yaml +data_manager: + collection_name: default_collection + embedding_name: OpenAIEmbeddings + chunk_size: 1000 + chunk_overlap: 0 + reset_collection: true + num_documents_to_retrieve: 5 + distance_metric: cosine +``` + +#### Core Settings + +- **`collection_name`**: Name of the vector store collection. Default: `default_collection` +- **`chunk_size`**: Maximum size of text chunks (in characters) when splitting documents. Default: `1000` +- **`chunk_overlap`**: Number of overlapping characters between consecutive chunks. Default: `0` +- **`reset_collection`**: If `true`, deletes and recreates the collection on startup. Default: `true` +- **`num_documents_to_retrieve`**: Number of relevant document chunks to retrieve for each query. Default: `5` + +#### Distance Metrics + +The `distance_metric` determines how similarity is calculated between embeddings: + +- **`cosine`**: Cosine similarity (default) - measures the angle between vectors +- **`l2`**: Euclidean distance - measures straight-line distance +- **`ip`**: Inner product - measures dot product similarity + +```yaml +data_manager: + distance_metric: cosine # Options: cosine, l2, ip +``` + +### Embedding Models + +Embeddings convert text into numerical vectors. 
Archi supports multiple embedding providers: + +#### OpenAI Embeddings + +```yaml +data_manager: + embedding_name: OpenAIEmbeddings + embedding_class_map: + OpenAIEmbeddings: + class: OpenAIEmbeddings + kwargs: + model: text-embedding-3-small + similarity_score_reference: 10 +``` + +#### HuggingFace Embeddings + +```yaml +data_manager: + embedding_name: HuggingFaceEmbeddings + embedding_class_map: + HuggingFaceEmbeddings: + class: HuggingFaceEmbeddings + kwargs: + model_name: sentence-transformers/all-MiniLM-L6-v2 + model_kwargs: + device: cpu + encode_kwargs: + normalize_embeddings: true + similarity_score_reference: 10 + query_embedding_instructions: null +``` + +### Supported Document Formats + +The vector store can process the following file types: + +- **Text files**: `.txt`, `.C` +- **Markdown**: `.md` +- **Python**: `.py` +- **HTML**: `.html` +- **PDF**: `.pdf` + +Documents are automatically loaded with the appropriate parser based on file extension. + +### Document Synchronization + +Archi automatically synchronizes your data directory with the vector store: + +1. **Adding documents**: New files in the data directory are automatically chunked, embedded, and added to the collection +2. **Removing documents**: Files deleted from the data directory are removed from the collection +3. **Source tracking**: Each ingested artifact is recorded in the Postgres catalog (`resources` table) with its resource hash and relative file path + +### Hybrid Search + +Combine semantic search with keyword-based BM25 search for improved retrieval: + +```yaml +data_manager: + use_hybrid_search: true + bm25_weight: 0.6 + semantic_weight: 0.4 +``` + +- **`use_hybrid_search`**: Enable hybrid search combining BM25 and semantic similarity. Default: `true` +- **`bm25_weight`**: Weight for BM25 keyword scores (base config default: `0.6`). +- **`semantic_weight`**: Weight for semantic similarity scores (base config default: `0.4`). 
+- **BM25 tuning**: Parameters like `k1` and `b` are set when the PostgreSQL BM25 index is created and are no longer configurable via this file.
+
+### Stemming
+
+Enabling the stemming option in your configuration turns on stemming for Archi's documents: documents inserted into the retrieval pipeline, as well as the queries matched against them, are stemmed and simplified for faster and more accurate lookup.
+
+```yaml
+data_manager:
+  stemming:
+    enabled: true
+```
+
+When enabled, both documents and queries are processed using the Porter Stemmer algorithm to reduce words to their root forms (e.g., "running" → "run"), improving matching accuracy.
+
+### PostgreSQL Backend (Default)
+
+Archi uses PostgreSQL with pgvector for vector storage by default. The PostgreSQL service is automatically started when you deploy with the chatbot service.
+
+```yaml
+services:
+  postgres:
+    host: postgres
+    port: 5432
+    database: archi
+  vectorstore:
+    backend: postgres
+```
+
+Required secrets for PostgreSQL:
+```bash
+PG_PASSWORD=your_secure_password
+```
 ---
diff --git a/docs/docs/vllm.md b/docs/docs/vllm.md
new file mode 100644
index 000000000..5d7a030b9
--- /dev/null
+++ b/docs/docs/vllm.md
@@ -0,0 +1,259 @@
+# vLLM Provider
+
+Run open-weight models on your own GPUs using [vLLM](https://docs.vllm.ai/) as an inference backend. Archi connects to any vLLM server via its OpenAI-compatible API — you deploy and manage vLLM independently.
+
+## Why vLLM?
+ +| | vLLM | Ollama | API providers | +|---|---|---|---| +| **Throughput** | High (PagedAttention, continuous batching) | Moderate | N/A (cloud) | +| **Multi-GPU** | Tensor-parallel across GPUs | Single GPU | N/A | +| **Tool calling** | Supported (with parser flag) | Model-dependent | Supported | +| **Cost** | Hardware only | Hardware only | Per-token | +| **Privacy** | Data stays on-premises | Data stays on-premises | Data leaves your network | + +vLLM is the best fit when you need high-throughput local inference, multi-GPU support, or full data privacy with tool-calling capabilities. + +## Architecture + +``` +┌──────────────────────┐ ┌──────────────────────┐ +│ archi deployment │ │ vLLM (external) │ +│ │ │ │ +│ ┌────────────────┐ │ HTTP │ Docker container │ +│ │ VLLMProvider │──│────────>│ OR bare metal │ +│ │ (Python client)│ │ :8000 │ OR Slurm job │ +│ └────────────────┘ │ /v1/* │ OR Kubernetes pod │ +│ │ │ │ +└──────────────────────┘ └──────────────────────┘ +``` + +Archi's `VLLMProvider` is a thin client that talks to vLLM's `/v1` API using the same `ChatOpenAI` LangChain class it would use for the OpenAI API. From the pipeline's perspective, vLLM looks identical to a remote OpenAI endpoint. + +**Archi does not manage the vLLM server.** You deploy, configure, and maintain vLLM independently — whether as a Docker container, a bare metal process, a Slurm job, or a Kubernetes pod. Archi only needs a `base_url` to connect. + +## Quick Start + +### 1. Start a vLLM server + +See [Running vLLM](#running-vllm) below for Docker, bare metal, and Slurm examples. + +### 2. Configure archi + +In your config YAML, set up the vLLM provider with the URL of your server: + +```yaml +services: + chat_app: + default_provider: vllm + default_model: "vllm:Qwen/Qwen3-8B" + providers: + vllm: + enabled: true + base_url: http://localhost:8000/v1 # URL of your vLLM server + default_model: "Qwen/Qwen3-8B" + models: + - "vllm:Qwen/Qwen3-8B" +``` + +### 3. 
Deploy archi + +```bash +archi create -n my-deployment \ + -c config.yaml \ + -e .env \ + --services chatbot +``` + +### 4. Verify + +```bash +# Check vLLM is serving +curl http://localhost:8000/v1/models + +# Check archi can reach it +curl http://localhost:7861/api/health +``` + +## Configuration Reference + +### Provider settings + +The vLLM provider is configured under `services.chat_app.providers.vllm`: + +```yaml +services: + chat_app: + providers: + vllm: + enabled: true + base_url: http://localhost:8000/v1 + default_model: "Qwen/Qwen3-8B" + models: + - "vllm:Qwen/Qwen3-8B" +``` + +| Setting | Type | Default | Description | +|---|---|---|---| +| `enabled` | bool | `false` | Enable the vLLM provider | +| `base_url` | string | `http://localhost:8000/v1` | vLLM server OpenAI-compatible endpoint | +| `default_model` | string | — | HuggingFace model ID to use for inference | +| `models` | list | — | Available model IDs for the UI model selector | + +### Model references + +Anywhere a model is referenced in `pipeline_map`, use the `vllm/` prefix: + +```yaml +archi: + pipeline_map: + CMSCompOpsAgent: + models: + required: + agent_model: vllm/Qwen/Qwen3-8B +``` + +The part after `vllm/` must match the HuggingFace model ID that vLLM is serving. + +> **Model naming**: vLLM uses HuggingFace model IDs (e.g. `Qwen/Qwen3-8B`), not Ollama-style names (e.g. `Qwen/Qwen3:8B`). + +## Running vLLM + +Archi does not manage the vLLM server. Below are examples for common deployment scenarios. 
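Whichever deployment route you pick below, you can sanity-check the server's OpenAI-compatible endpoint from Python before pointing Archi at it (the base URL is a placeholder for your server's address):

```python
# Sketch: check which models a vLLM server reports as available.
# The base URL is a placeholder; adjust it to your deployment.
import json
import urllib.request

def models_endpoint(base_url: str) -> str:
    """Build the /models listing URL from the OpenAI-compatible base URL."""
    return base_url.rstrip("/") + "/models"

def served_models(base_url: str = "http://localhost:8000/v1") -> list:
    """Return the model IDs the server reports."""
    with urllib.request.urlopen(models_endpoint(base_url)) as resp:
        return [m["id"] for m in json.load(resp).get("data", [])]

# Example (requires a running server):
# "Qwen/Qwen3-8B" in served_models()
```

The IDs returned here are exactly what the `vllm/` model references in your config must match.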
+ +### Docker + +```bash +docker run -d \ + --name vllm-server \ + --gpus all \ + --ipc=host \ + --ulimit memlock=-1 \ + --ulimit stack=67108864 \ + -p 8000:8000 \ + -e NCCL_P2P_DISABLE=1 \ + vllm/vllm-openai:latest \ + --model Qwen/Qwen3-8B \ + --enable-auto-tool-choice \ + --tool-call-parser hermes +``` + +Key flags: +- `--gpus all` — GPU passthrough +- `--ipc=host` — required for NCCL multi-GPU communication (Docker's default 64MB shm causes crashes) +- `--ulimit memlock=-1` — prevents OS from swapping out VRAM-mapped buffers +- `NCCL_P2P_DISABLE=1` — required for V100s and older GPU topologies + +### Bare metal + +```bash +pip install vllm + +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen3-8B \ + --enable-auto-tool-choice \ + --tool-call-parser hermes \ + --host 0.0.0.0 \ + --port 8000 +``` + +### Slurm + +```bash +#!/bin/bash +#SBATCH --gres=gpu:4 +#SBATCH --cpus-per-task=16 +#SBATCH --mem=128G +#SBATCH --time=7-00:00:00 + +module load cuda +source activate vllm + +python -m vllm.entrypoints.openai.api_server \ + --model Qwen/Qwen3-8B \ + --tensor-parallel-size 4 \ + --enable-auto-tool-choice \ + --tool-call-parser hermes \ + --host 0.0.0.0 \ + --port 8000 +``` + +Then set `base_url` in your archi config to the Slurm node's address. 
+
+### Common vLLM server flags
+
+These are configured on the vLLM server itself, not in archi:
+
+| Flag | Description |
+|---|---|
+| `--gpu-memory-utilization 0.9` | Fraction of GPU VRAM to use (0.0-1.0) |
+| `--max-model-len 8192` | Cap context window to reduce memory |
+| `--tensor-parallel-size 4` | Shard model across N GPUs |
+| `--dtype bfloat16` | Force weight precision |
+| `--quantization awq` | Run quantized weights (awq, gptq, fp8) |
+| `--enforce-eager` | Disable CUDA graphs to save memory |
+| `--max-num-seqs 256` | Limit concurrent sequences |
+| `--enable-auto-tool-choice` | Enable tool calling pathway |
+| `--tool-call-parser hermes` | Parser for structured tool calls |
+
+See [vLLM documentation](https://docs.vllm.ai/en/latest/serving/openai_compatible_server.html) for the full reference.
+
+## Tool Calling
+
+vLLM supports function/tool calling for ReAct agents, but requires explicit server flags:
+
+- `--enable-auto-tool-choice` — enables the tool calling pathway
+- `--tool-call-parser <parser>` — selects the parser for the model family
+
+| Model family | Parser |
+|---|---|
+| Qwen (Qwen2.5, Qwen3) | `hermes` |
+| Mistral / Mixtral | `mistral` |
+| Llama 3 | `llama3_json` |
+
+These flags must be set when starting the vLLM server, not in archi's config.
+
+## Troubleshooting
+
+### Archi can't reach vLLM
+
+**Symptom**: `ConnectionError: Connection refused` or timeout.
+
+- Verify vLLM is running: `curl http://<host>:8000/v1/models`
+- If vLLM is on a different host, ensure network connectivity and firewall rules allow port 8000
+- If running in Docker, ensure the archi container can reach the vLLM host (use `--network=host` or configure Docker networking)
+- Check that `base_url` in your archi config matches the actual vLLM server address
+
+### Model not found (404)
+
+**Symptom**: `Error: model 'Qwen/Qwen3:8B' does not exist`.
+
+vLLM uses HuggingFace model IDs, not Ollama-style names.
Check:
+
+- Config uses the exact model ID from `curl http://<host>:8000/v1/models`
+- Use dashes, not colons: `Qwen/Qwen3-8B` (not `Qwen/Qwen3:8B`)
+
+### Tool calling returns 400
+
+**Symptom**: `400 Bad Request: "auto" tool choice requires --enable-auto-tool-choice`.
+
+The vLLM server wasn't started with tool calling flags. Add to your vLLM launch command:
+
+```bash
+--enable-auto-tool-choice --tool-call-parser hermes
+```
+
+### Slow first response
+
+The first request after startup may be slow (30-60s) while vLLM compiles CUDA kernels. Subsequent requests will be significantly faster. If this is a problem, start vLLM with `--enforce-eager` to skip CUDA graph compilation (at the cost of lower throughput).
+
+### Insufficient VRAM
+
+If vLLM crashes or the model doesn't fit in GPU memory:
+
+- Lower `--gpu-memory-utilization` (e.g. `0.7`)
+- Set `--max-model-len` to a smaller value (e.g. `4096`)
+- Add `--quantization awq` or `--quantization gptq` if quantized weights are available
+- Set `--enforce-eager` to disable CUDA graphs
+- Increase `--tensor-parallel-size` and use more GPUs
+- Try a smaller model
diff --git a/docs/mkdocs.yml b/docs/mkdocs.yml
index feeeee508..4ae6bc6a4 100644
--- a/docs/mkdocs.yml
+++ b/docs/mkdocs.yml
@@ -11,6 +11,7 @@ nav:
   - Agents & Tools: agents_tools.md
   - Configuration: configuration.md
   - CLI Reference: cli_reference.md
+  - vLLM Provider: vllm.md
   - API Reference: api_reference.md
   - Benchmarking: benchmarking.md
   - Advanced Setup: advanced_setup_deploy.md
diff --git a/examples/deployments/basic-vllm/condense.prompt b/examples/deployments/basic-vllm/condense.prompt
new file mode 100644
index 000000000..cca6c4581
--- /dev/null
+++ b/examples/deployments/basic-vllm/condense.prompt
@@ -0,0 +1,12 @@
+# Prompt used to condense a chat history and a follow-up question into a standalone question.
+# This is a very general prompt for condensing histories, so for base installs it will not need to be modified
+#
+# All condensing prompts must have the following tags in them, which will be filled with the appropriate information:
+# {chat_history}
+# {question}
+#
+Given the following conversation between you (the AI named archi), a human user who needs help, and an expert, and a follow-up question, rephrase the follow-up question to be a standalone question, in its original language.
+
+Chat History: {chat_history}
+Follow Up Input: {question}
+Standalone question:
\ No newline at end of file
diff --git a/examples/deployments/basic-vllm/config.yaml b/examples/deployments/basic-vllm/config.yaml
new file mode 100644
index 000000000..26573052e
--- /dev/null
+++ b/examples/deployments/basic-vllm/config.yaml
@@ -0,0 +1,41 @@
+# Basic configuration file for an Archi deployment
+# using a vLLM server for LLM inference.
+#
+# The vLLM server must be running and accessible at the base_url below.
+# Archi does not manage the vLLM server — see docs/docs/vllm.md for setup guidance.
+# +# run with: +# archi create --name my-archi-vllm --config examples/deployments/basic-vllm/config.yaml --services chatbot + +name: my_archi + +services: + chat_app: + agent_class: CMSCompOpsAgent + agents_dir: examples/agents + default_provider: vllm + default_model: "vllm:Qwen/Qwen3-8B" + providers: + vllm: + enabled: true + base_url: http://localhost:8000/v1 # URL of your vLLM server + default_model: "Qwen/Qwen3-8B" + models: + - "vllm:Qwen/Qwen3-8B" + trained_on: "FASRC DOCS" + port: 7861 + external_port: 7861 + vectorstore: + backend: postgres + data_manager: + port: 7889 + external_port: 7889 + auth: + enabled: false + +data_manager: + sources: + links: + input_lists: + - config/sources.list + embedding_name: HuggingFaceEmbeddings diff --git a/examples/deployments/basic-vllm/miscellanea.list b/examples/deployments/basic-vllm/miscellanea.list new file mode 100644 index 000000000..7e973aba6 --- /dev/null +++ b/examples/deployments/basic-vllm/miscellanea.list @@ -0,0 +1,39 @@ +# PPC +https://ppc.mit.edu/blog/2016/05/08/hello-world/ +https://ppc.mit.edu/ +https://ppc.mit.edu/news/ +https://ppc.mit.edu/publications/ +https://ppc.mit.edu/blog/2025/02/08/detailed-schedule-for-the-european-strategy/ +https://ppc.mit.edu/blog/2025/02/14/first-cms-week-in-2025/ +https://ppc.mit.edu/blog/2025/02/18/exploring-the-higgs-boson-in-our-latest-result/ +https://ppc.mit.edu/blog/2025/02/04/news-from-the-chamonix-meeting/ +https://ppc.mit.edu/blog/2025/02/11/cms-data-archival-at-mit/ +https://ppc.mit.edu/blog/2025/03/28/cern-gets-support-from-canada/ +https://ppc.mit.edu/blog/2025/04/08/breakthrough-prize-in-physics-2025/ +https://ppc.mit.edu/blog/2025/04/04/the-fcc-at-cern-a-feasibly-circular-collider/ +https://ppc.mit.edu/blog/2025/04/08/cleo-reached-magic-issue-number-5000/ +https://ppc.mit.edu/blog/2025/04/14/maximizing-cms-competitive-advantage/ +https://ppc.mit.edu/blog/2025/04/25/sueps-at-aps-march-april-meeting/ +https://ppc.mit.edu/blog/2025/04/18/round-three/ 
+https://ppc.mit.edu/blog/2025/04/14/first-beams-with-a-splash-in-2025/ +https://ppc.mit.edu/blog/2025/05/27/fcc-weak-in-vienna-building-our-future/ +https://ppc.mit.edu/blog/2025/06/04/new-paper-on-arxiv-submit-a-physics-analysis-facility-at-mit/ +https://ppc.mit.edu/blog/2025/06/16/summer-cms-week-2025/ +https://ppc.mit.edu/blog/2025/05/05/cms-records-first-2025-high-energy-collisions/ +https://ppc.mit.edu/blog/2025/06/17/long-term-vision-for-particle-physics-from-the-national-academies/ +https://ppc.mit.edu/blog/2025/06/20/conclusion-of-junes-cern-council-session-has-major-consequences-for-cms/ +https://ppc.mit.edu/blog/2025/06/20/highest-pileup-recorded-at-cms-last-night/ +https://ppc.mit.edu/blog/2025/06/25/selfie-station-at-wilson-hall/ +https://ppc.mit.edu/mariarosaria-dalfonso/ +https://ppc.mit.edu/kenneth-long-2/ +https://ppc.mit.edu/blog/2025/06/27/open-symposium-on-the-european-strategy-for-particle-physics/ +https://ppc.mit.edu/blog/2025/07/03/bridging-physics-and-computing-throughput-computing-2025/ +https://ppc.mit.edu/pietro-lugato-2/ +https://ppc.mit.edu/luca-lavezzo/ +https://ppc.mit.edu/zhangqier-wang-2/ +https://ppc.mit.edu/blog/2025/07/14/welcome-our-first-ever-in-house-masters-student/ +# A2 +https://ppc.mit.edu/a2/ +# Personnel +https://people.csail.mit.edu/kraska +https://physics.mit.edu/faculty/christoph-paus diff --git a/examples/deployments/basic-vllm/qa.prompt b/examples/deployments/basic-vllm/qa.prompt new file mode 100644 index 000000000..8ed5c6420 --- /dev/null +++ b/examples/deployments/basic-vllm/qa.prompt @@ -0,0 +1,18 @@ +# Prompt used to query LLM with appropriate context and question. 
+# This prompt is specific to subMIT and is unlikely to perform well for other applications; for those, it is recommended to write your own prompt and point to it in the config
+#
+# All final prompts must have the following tags in them, which will be filled with the appropriate information:
+# Question: {question}
+# Context: {retriever_output}
+#
+You are a conversational chatbot named archi who helps people navigate a computing cluster named subMIT.
+You will be provided with context in the form of relevant documents, such as previous communication between sys admins and Guides, a summary of the problem the user is trying to solve and the important elements of the conversation, and the most recent chat history between you and the user, to help you answer their questions.
+Using your Linux and computing knowledge, answer the question at the end.
+Unless otherwise indicated, assume the users are not well versed in computing.
+Please do not assume that subMIT machines have anything installed on top of native Linux unless the context mentions it.
+If you don't know, say "I don't know"; if you need to ask a follow-up question, please do.
+ +Context: {retriever_output} +Question: {question} +Chat History: {history} +Helpful Answer: diff --git a/src/archi/providers/__init__.py b/src/archi/providers/__init__.py index cc968f5b7..e041505cd 100644 --- a/src/archi/providers/__init__.py +++ b/src/archi/providers/__init__.py @@ -76,13 +76,15 @@ def _ensure_providers_registered() -> None: from src.archi.providers.gemini_provider import GeminiProvider from src.archi.providers.openrouter_provider import OpenRouterProvider from src.archi.providers.local_provider import LocalProvider + from src.archi.providers.vllm_provider import VLLMProvider from src.archi.providers.cern_litellm_provider import CERNLiteLLMProvider - + register_provider(ProviderType.OPENAI, OpenAIProvider) register_provider(ProviderType.ANTHROPIC, AnthropicProvider) register_provider(ProviderType.GEMINI, GeminiProvider) register_provider(ProviderType.OPENROUTER, OpenRouterProvider) register_provider(ProviderType.LOCAL, LocalProvider) + register_provider(ProviderType.VLLM, VLLMProvider) register_provider(ProviderType.CERN_LITELLM, CERNLiteLLMProvider) @@ -168,7 +170,7 @@ def get_provider_by_name(name: str, **kwargs) -> BaseProvider: "openrouter": ProviderType.OPENROUTER, "local": ProviderType.LOCAL, "ollama": ProviderType.LOCAL, - "vllm": ProviderType.LOCAL, + "vllm": ProviderType.VLLM, "cern_litellm": ProviderType.CERN_LITELLM, } diff --git a/src/archi/providers/base.py b/src/archi/providers/base.py index 8157c70b3..bec087b45 100644 --- a/src/archi/providers/base.py +++ b/src/archi/providers/base.py @@ -25,6 +25,7 @@ class ProviderType(str, Enum): GEMINI = "gemini" OPENROUTER = "openrouter" LOCAL = "local" + VLLM = "vllm" CERN_LITELLM = "cern_litellm" @@ -117,8 +118,8 @@ def set_api_key(self, api_key: str) -> None: @property def is_configured(self) -> bool: """Check if the provider has necessary credentials configured.""" - # Local providers may not need an API key - if self.provider_type == ProviderType.LOCAL: + # Local/vLLM providers may not 
need an API key + if self.provider_type in (ProviderType.LOCAL, ProviderType.VLLM): return bool(self.config.base_url) return bool(self._api_key) diff --git a/src/archi/providers/vllm_provider.py b/src/archi/providers/vllm_provider.py new file mode 100644 index 000000000..c6e710857 --- /dev/null +++ b/src/archi/providers/vllm_provider.py @@ -0,0 +1,172 @@ +"""vLLM provider -- thin client for OpenAI-compatible vLLM servers. + +Wraps a locally hosted vLLM instance whose ``/v1`` API is wire-compatible +with OpenAI. No real API key is required; the placeholder ``"not-needed"`` +is sent instead. +""" + +import json +import os +import urllib.error +import urllib.request +from typing import Any, Dict, List, Optional + +from langchain_core.language_models.chat_models import BaseChatModel + +from src.archi.providers.base import ( + BaseProvider, + ModelInfo, + ProviderConfig, + ProviderType, +) +from src.utils.logging import get_logger + +logger = get_logger(__name__) + + +DEFAULT_VLLM_BASE_URL = "http://localhost:8000/v1" + + +class VLLMProvider(BaseProvider): + """ + Provider for vLLM inference servers. + + Communicates with a vLLM server via its OpenAI-compatible API. + The base URL can be configured via: + 1. VLLM_BASE_URL environment variable (highest priority) + 2. ProviderConfig.base_url + 3. Default: http://localhost:8000/v1 + """ + + provider_type = ProviderType.VLLM + display_name = "vLLM" + + @staticmethod + def _normalize_base_url(url: Optional[str]) -> Optional[str]: + """Ensure the base URL has a scheme so urllib requests succeed.""" + if not url: + return url + if url.startswith(("http://", "https://")): + return url + return f"http://{url}" + + def __init__(self, config: Optional[ProviderConfig] = None): + """Initialize the vLLM provider. + + Resolves the server URL in priority order: ``VLLM_BASE_URL`` env + var > ``config.base_url`` > ``DEFAULT_VLLM_BASE_URL``. Bare + ``host:port`` URLs are normalised with an ``http://`` scheme. 
+ + Args: + config: Optional provider configuration. When *None*, a + default config targeting ``localhost:8000`` is created. + """ + env_base_url = self._normalize_base_url(os.environ.get("VLLM_BASE_URL")) + + if config is None: + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url=env_base_url or DEFAULT_VLLM_BASE_URL, + api_key="not-needed", + enabled=True, + ) + else: + if env_base_url: + config.base_url = env_base_url + elif not config.base_url: + config.base_url = DEFAULT_VLLM_BASE_URL + config.base_url = self._normalize_base_url(config.base_url) + + super().__init__(config) + + def get_chat_model(self, model_name: str, **kwargs) -> BaseChatModel: + """Create a ChatOpenAI instance pointed at the vLLM server. + + Args: + model_name: HuggingFace model ID served by vLLM + (e.g. ``"Qwen/Qwen3-8B"``). + **kwargs: Extra arguments forwarded to ChatOpenAI. + + Returns: + A ChatOpenAI instance configured for the vLLM endpoint. + """ + from langchain_openai import ChatOpenAI + + model_kwargs = { + "model": model_name, + "base_url": self.config.base_url, + "api_key": self._api_key or "not-needed", + "streaming": True, + **self.config.extra_kwargs, + **kwargs, + } + + return ChatOpenAI(**model_kwargs) + + def list_models(self) -> List[ModelInfo]: + """Return available models, querying the server first. + + Falls back to statically configured models if the server is + unreachable. + + Returns: + A list of :class:`ModelInfo` discovered from the server or + from config, or an empty list if neither yields results. + """ + fetched = self._fetch_vllm_models() + if fetched: + return fetched + if self.config.models: + return self.config.models + return [] + + def _fetch_vllm_models(self) -> List[ModelInfo]: + """Fetch models from the vLLM ``/v1/models`` endpoint. + + Returns: + A list of :class:`ModelInfo`, or an empty list if the + server is unreachable or returns an unexpected payload. 
+ """ + try: + url = f"{self.config.base_url}/models" + req = urllib.request.Request(url, method="GET") + with urllib.request.urlopen(req, timeout=10) as response: + if response.status == 200: + data = json.loads(response.read().decode()) + models = [] + for model_data in data.get("data", []): + model_id = model_data.get("id", "") + models.append(ModelInfo( + id=model_id, + name=model_id, + display_name=model_id, + supports_tools=True, + supports_streaming=True, + )) + logger.debug( + "[VLLMProvider] Discovered %d models: %s", + len(models), + [m.id for m in models], + ) + return models + except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError, json.JSONDecodeError) as e: + logger.warning("[VLLMProvider] Failed to fetch models from %s: %s", self.config.base_url, e) + + return [] + + def validate_connection(self) -> bool: + """Check whether the vLLM server is reachable. + + Sends a GET to ``/v1/models`` with a short timeout. + + Returns: + True if the server responds with HTTP 200, False otherwise. 
+ """ + try: + url = f"{self.config.base_url}/models" + req = urllib.request.Request(url, method="GET") + with urllib.request.urlopen(req, timeout=5) as response: + return response.status == 200 + except (urllib.error.URLError, urllib.error.HTTPError, TimeoutError) as e: + logger.warning("[VLLMProvider] Connection failed: %s", e) + return False diff --git a/src/cli/templates/base-compose.yaml b/src/cli/templates/base-compose.yaml index 218ff7f86..387447097 100644 --- a/src/cli/templates/base-compose.yaml +++ b/src/cli/templates/base-compose.yaml @@ -120,10 +120,12 @@ services: container_name: {{ chatbot_container_name }} {% if postgres_enabled -%} depends_on: + {% if postgres_enabled -%} postgres: condition: service_healthy config-seed: condition: service_completed_successfully + {% endif -%} {% endif -%} environment: PGHOST: {{ 'localhost' if host_mode else 'postgres' }} @@ -555,6 +557,7 @@ services: {%- endif %} {%- endif %} + {% if benchmarking_enabled -%} benchmark: image: {{ benchmarking_image }}:{{ benchmarking_tag }} diff --git a/src/cli/utils/service_builder.py b/src/cli/utils/service_builder.py index a6ff1334b..f4e046c5f 100644 --- a/src/cli/utils/service_builder.py +++ b/src/cli/utils/service_builder.py @@ -144,9 +144,11 @@ class ServiceBuilder: def get_available_services() -> Dict[str, str]: available_services = service_registry.get_application_services() integration_services = service_registry.get_integration_services() + compute_services = service_registry.get_services_by_category('compute') return { **{name: svc.description for name, svc in available_services.items()}, **{name: svc.description for name, svc in integration_services.items()}, + **{name: svc.description for name, svc in compute_services.items()}, } @staticmethod diff --git a/tests/smoke/preflight.py b/tests/smoke/preflight.py index 7334e26a7..0814b79d2 100644 --- a/tests/smoke/preflight.py +++ b/tests/smoke/preflight.py @@ -208,6 +208,7 @@ def main() -> None: _check_postgres() # ChromaDB 
removed - PostgreSQL with pgvector is the only supported backend _check_data_manager_catalog() + _check_ollama_model() config_path = os.getenv("ARCHI_CONFIG_PATH") diff --git a/tests/unit/test_vllm_provider.py b/tests/unit/test_vllm_provider.py new file mode 100644 index 000000000..ae2ea9d1a --- /dev/null +++ b/tests/unit/test_vllm_provider.py @@ -0,0 +1,205 @@ +"""Unit tests for VLLMProvider.""" + +import json +import unittest +import urllib.error +from unittest.mock import MagicMock, patch + +from src.archi.providers.base import ModelInfo, ProviderConfig, ProviderType +from src.archi.providers.vllm_provider import VLLMProvider, DEFAULT_VLLM_BASE_URL + + +class TestVLLMProviderInit(unittest.TestCase): + """Test VLLMProvider initialization.""" + + def test_default_config(self): + provider = VLLMProvider() + assert provider.config.base_url == DEFAULT_VLLM_BASE_URL + assert provider.config.provider_type == ProviderType.VLLM + assert provider._api_key == "not-needed" + + def test_custom_base_url(self): + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url="http://gpu-node:9000/v1", + ) + provider = VLLMProvider(config) + assert provider.config.base_url == "http://gpu-node:9000/v1" + + @patch.dict("os.environ", {"VLLM_BASE_URL": "http://env-host:8000/v1"}) + def test_env_overrides_default(self): + provider = VLLMProvider() + assert provider.config.base_url == "http://env-host:8000/v1" + + @patch.dict("os.environ", {"VLLM_BASE_URL": "http://env-host:8000/v1"}) + def test_env_overrides_config(self): + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url="http://config-host:8000/v1", + ) + provider = VLLMProvider(config) + assert provider.config.base_url == "http://env-host:8000/v1" + + def test_api_key_defaults_to_not_needed(self): + # When no config provided, api_key is set to "not-needed" + provider = VLLMProvider() + assert provider._api_key == "not-needed" + + def test_api_key_not_mutated_on_passed_config(self): + # When config is 
provided without api_key, __init__ should not mutate it + config = ProviderConfig(provider_type=ProviderType.VLLM) + VLLMProvider(config) + assert config.api_key is None + + def test_normalizes_base_url_without_scheme(self): + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url="gpu-node:8000/v1", + ) + provider = VLLMProvider(config) + assert provider.config.base_url == "http://gpu-node:8000/v1" + + def test_base_url_defaults_when_config_has_none(self): + config = ProviderConfig(provider_type=ProviderType.VLLM, base_url=None) + provider = VLLMProvider(config) + assert provider.config.base_url == DEFAULT_VLLM_BASE_URL + + +class TestVLLMProviderGetChatModel(unittest.TestCase): + """Test get_chat_model returns ChatOpenAI with correct params.""" + + @patch("langchain_openai.ChatOpenAI", autospec=True) + def test_returns_chat_openai_with_defaults(self, mock_chat_openai): + provider = VLLMProvider() + provider.get_chat_model("my-model") + + mock_chat_openai.assert_called_once() + call_kwargs = mock_chat_openai.call_args[1] + assert call_kwargs["model"] == "my-model" + assert call_kwargs["base_url"] == DEFAULT_VLLM_BASE_URL + assert call_kwargs["api_key"] == "not-needed" + assert call_kwargs["streaming"] is True + + @patch("langchain_openai.ChatOpenAI", autospec=True) + def test_custom_base_url_passed_through(self, mock_chat_openai): + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url="http://custom:8000/v1", + ) + provider = VLLMProvider(config) + provider.get_chat_model("Qwen/Qwen2.5-7B") + + call_kwargs = mock_chat_openai.call_args[1] + assert call_kwargs["base_url"] == "http://custom:8000/v1" + + +class TestVLLMProviderListModels(unittest.TestCase): + """Test list_models with mocked /v1/models endpoint.""" + + def _mock_response(self, data, status=200): + mock_resp = MagicMock() + mock_resp.status = status + mock_resp.read.return_value = json.dumps(data).encode() + mock_resp.__enter__ = MagicMock(return_value=mock_resp) + 
mock_resp.__exit__ = MagicMock(return_value=False) + return mock_resp + + @patch("src.archi.providers.vllm_provider.urllib.request.urlopen") + def test_fetches_models_from_server(self, mock_urlopen): + mock_urlopen.return_value = self._mock_response({ + "data": [ + {"id": "Qwen/Qwen2.5-7B-Instruct-1M"}, + {"id": "meta-llama/Llama-3-8B"}, + ] + }) + + provider = VLLMProvider() + models = provider.list_models() + + assert len(models) == 2 + assert models[0].id == "Qwen/Qwen2.5-7B-Instruct-1M" + assert models[1].id == "meta-llama/Llama-3-8B" + assert all(isinstance(m, ModelInfo) for m in models) + + @patch("src.archi.providers.vllm_provider.urllib.request.urlopen", side_effect=urllib.error.URLError("Connection refused")) + def test_falls_back_to_config_models(self, mock_urlopen): + config = ProviderConfig( + provider_type=ProviderType.VLLM, + base_url=DEFAULT_VLLM_BASE_URL, + models=[ModelInfo(id="fallback-model", name="fallback-model", display_name="Fallback")], + ) + provider = VLLMProvider(config) + models = provider.list_models() + + assert len(models) == 1 + assert models[0].id == "fallback-model" + + @patch("src.archi.providers.vllm_provider.urllib.request.urlopen", side_effect=urllib.error.URLError("Connection refused")) + def test_returns_empty_when_no_config_models(self, mock_urlopen): + provider = VLLMProvider() + models = provider.list_models() + assert models == [] + + +class TestVLLMProviderValidateConnection(unittest.TestCase): + """Test validate_connection.""" + + def _mock_response(self, status=200): + mock_resp = MagicMock() + mock_resp.status = status + mock_resp.__enter__ = MagicMock(return_value=mock_resp) + mock_resp.__exit__ = MagicMock(return_value=False) + return mock_resp + + @patch("src.archi.providers.vllm_provider.urllib.request.urlopen") + def test_returns_true_on_200(self, mock_urlopen): + mock_urlopen.return_value = self._mock_response(200) + provider = VLLMProvider() + assert provider.validate_connection() is True + + 
@patch("src.archi.providers.vllm_provider.urllib.request.urlopen", side_effect=urllib.error.URLError("Connection refused")) + def test_returns_false_on_failure(self, mock_urlopen): + provider = VLLMProvider() + assert provider.validate_connection() is False + + +class TestVLLMProviderRegistration(unittest.TestCase): + """Test that vLLM is properly registered in the provider system.""" + + def test_provider_type_enum_exists(self): + assert ProviderType.VLLM == "vllm" + + def test_get_provider_returns_vllm(self): + from src.archi.providers import ( + _PROVIDER_REGISTRY, _PROVIDER_INSTANCES, + register_provider, get_provider, + ) + # Manually register only VLLMProvider to avoid importing all providers + _PROVIDER_REGISTRY.clear() + _PROVIDER_INSTANCES.clear() + register_provider(ProviderType.VLLM, VLLMProvider) + + provider = get_provider("vllm") + assert isinstance(provider, VLLMProvider) + + _PROVIDER_REGISTRY.clear() + _PROVIDER_INSTANCES.clear() + + def test_get_provider_by_name_returns_vllm(self): + from src.archi.providers import ( + _PROVIDER_REGISTRY, _PROVIDER_INSTANCES, + register_provider, get_provider_by_name, + ) + _PROVIDER_REGISTRY.clear() + _PROVIDER_INSTANCES.clear() + register_provider(ProviderType.VLLM, VLLMProvider) + + provider = get_provider_by_name("vllm") + assert isinstance(provider, VLLMProvider) + + _PROVIDER_REGISTRY.clear() + _PROVIDER_INSTANCES.clear() + + +if __name__ == "__main__": + unittest.main()
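The diff's `VLLMProvider.__init__` resolves the server URL in a fixed priority order (`VLLM_BASE_URL` env var, then `config.base_url`, then the default) and normalizes bare `host:port` values. The following is a minimal standalone sketch of that resolution logic, written here for illustration without importing the provider module itself; the function name `resolve_vllm_base_url` is a hypothetical helper, not part of the diff:

```python
import os
from typing import Optional

# Mirrors DEFAULT_VLLM_BASE_URL in src/archi/providers/vllm_provider.py
DEFAULT_VLLM_BASE_URL = "http://localhost:8000/v1"


def resolve_vllm_base_url(config_url: Optional[str] = None) -> str:
    """Sketch of VLLMProvider's URL resolution: env var > config > default."""

    def normalize(url: Optional[str]) -> Optional[str]:
        # Bare host:port values get an http:// scheme so urllib accepts them.
        if url and not url.startswith(("http://", "https://")):
            return f"http://{url}"
        return url

    env_url = normalize(os.environ.get("VLLM_BASE_URL"))
    return env_url or normalize(config_url) or DEFAULT_VLLM_BASE_URL


# Clear the env var for a deterministic demonstration.
os.environ.pop("VLLM_BASE_URL", None)
print(resolve_vllm_base_url())                    # http://localhost:8000/v1
print(resolve_vllm_base_url("gpu-node:9000/v1"))  # http://gpu-node:9000/v1
```

This matches the behavior exercised by `test_env_overrides_config` and `test_normalizes_base_url_without_scheme` above: an environment override wins even when a config URL is present, and a scheme-less URL is prefixed with `http://`.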