Deploy GPU-accelerated language model inference with optional RAG (Retrieval-Augmented Generation) capabilities on the ACTIVATE platform. Optimized for HPC environments using Apptainer/Singularity.
This workflow deploys an OpenAI-compatible inference server powered by vLLM, with optional RAG capabilities for context-aware responses using your own documents.
┌──────────────────────────────────────────────────────────────┐
│                      ACTIVATE Platform                        │
│  ┌────────────────┐   ┌────────────────┐   ┌──────────────┐  │
│  │  Your Browser  │──▶│   RAG Proxy    │──▶│ vLLM Server  │  │
│  │  (OpenWebUI,   │   │   (context     │   │    (LLM      │  │
│  │  Cline, etc)   │   │   injection)   │   │  inference)  │  │
│  └────────────────┘   └───────┬────────┘   └──────────────┘  │
│                               │                               │
│                       ┌───────▼────────┐                     │
│                       │   RAG Server   │                     │
│                       │   + ChromaDB   │                     │
│                       │   + Indexer    │                     │
│                       └────────────────┘                     │
└──────────────────────────────────────────────────────────────┘
| Component | Purpose |
|---|---|
| vLLM Server | High-performance LLM inference with PagedAttention |
| RAG Proxy | OpenAI-compatible API with automatic context injection |
| RAG Server | Semantic search over indexed documents |
| ChromaDB | Vector database for document embeddings |
| Auto-Indexer | Watches document directory and indexes new files |
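Conceptually, the RAG Proxy sits between your client and vLLM: for each chat request it retrieves the most relevant indexed chunks from ChromaDB, injects them into the prompt, and forwards the augmented request to the vLLM server. A minimal sketch of that flow is shown below; it is illustrative only (not the repo's rag_proxy.py), and the collection name, paths, and port are placeholders.

```python
import requests
import chromadb

# Placeholder paths/ports for illustration; the workflow wires up its own.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")

def index_document(doc_id: str, text: str) -> None:
    # Roughly what the auto-indexer does for each new file: embed and store it.
    collection.add(ids=[doc_id], documents=[text])

def chat_with_context(model: str, question: str, n_results: int = 3) -> str:
    # 1. Semantic search over the indexed documents (the RAG Server's job).
    hits = collection.query(query_texts=[question], n_results=n_results)
    context = "\n\n".join(hits["documents"][0]) if hits["documents"] else ""

    # 2. Context injection: prepend the retrieved chunks as a system message.
    messages = [
        {"role": "system", "content": f"Use this context when answering:\n{context}"},
        {"role": "user", "content": question},
    ]

    # 3. Forward the augmented request to vLLM's OpenAI-compatible endpoint.
    resp = requests.post(VLLM_URL, json={"model": model, "messages": messages}, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```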
To deploy:

- Navigate to the ACTIVATE workflow marketplace
- Select the vLLM + RAG workflow
- Choose your compute cluster and scheduler (SLURM/PBS/SSH)
Choose how to provide model weights:
| Option | When to Use |
|---|---|
| 📁 Local Path | Model weights pre-staged on cluster |
| 🤗 HuggingFace Clone | Clone from HuggingFace using git-lfs (HPC-friendly, caches locally) |
The HuggingFace Clone option uses git clone with git-lfs, which is more widely supported on HPC systems than the HuggingFace API. Models are cloned once to your cache directory and reused for subsequent runs.
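As a rough sketch of that clone-once-and-reuse behaviour (the cache path, repo ID, and helper function are illustrative, not the workflow's actual script):

```python
import subprocess
from pathlib import Path

def ensure_model(repo_id: str, cache_dir: str = "~/.cache/models") -> Path:
    """Clone a HuggingFace repo with git-lfs on first use, then reuse the cached copy."""
    target = Path(cache_dir).expanduser() / repo_id.replace("/", "--")
    if (target / "config.json").exists():
        return target  # cached from a previous run, nothing to download
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "lfs", "install", "--skip-repo"], check=True)
    subprocess.run(["git", "clone", f"https://huggingface.co/{repo_id}", str(target)], check=True)
    return target

# e.g. ensure_model("mistralai/Mistral-7B-Instruct-v0.2")
```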
Common vLLM configurations:
# 4-GPU with bfloat16 (recommended for large models)
--dtype bfloat16 --tensor-parallel-size 4 --gpu-memory-utilization 0.85
# Single GPU with memory constraints
--dtype float16 --max-model-len 4096 --gpu-memory-utilization 0.8

- Submit the workflow
- Click the Open WebUI link in the job output, or
- Connect your IDE (Cline, Continue, etc.) to the provided endpoint
The workflow runs in one of two modes:

| Mode | Description |
|---|---|
| vLLM + RAG | Full stack with document retrieval |
| vLLM Only | Inference server without RAG |
Once running, the service exposes OpenAI-compatible endpoints:
# List models
curl http://localhost:8081/v1/models
# Chat completion
curl -X POST http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello!"}]}'
# Health check
curl http://localhost:8081/health
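Any OpenAI-compatible client works against the same endpoint; for example, with the openai Python package (the model name and port are placeholders, and a real API key is typically not required):

```python
from openai import OpenAI

# Point the client at the RAG proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```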
Repository layout:

activate-rag-vllm/
├── workflow.yaml         # ACTIVATE workflow definition
├── start_service.sh      # Service entrypoint
├── rag_proxy.py          # OpenAI-compatible proxy
├── rag_server.py         # RAG search server
├── indexer.py            # Document indexer
├── run_local.sh          # Local development runner
├── singularity/          # Apptainer/Singularity container configs
├── docker/               # Docker configs (local dev)
├── lib/                  # Shared utilities
├── configs/              # HPC preset configurations
└── docs/                 # Additional documentation
Additional documentation:

| Document | Description |
|---|---|
| Local Development Guide | Running locally for debugging |
| Workflow Configuration | YAML workflow customization |
| Architecture | System design details |
| Implementation Plan | Development roadmap |
Common issues:

| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce --gpu-memory-utilization or --max-model-len |
| Model not found | Verify path exists with config.json; check HF_TOKEN for gated models |
| git-lfs not found | Workflow auto-installs git-lfs locally if missing |
| Apptainer/Singularity not found | Load module: module load apptainer or module load singularity |
| Port in use | Service auto-finds available ports; check for existing instances |
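For the "Model not found" case, a quick hedged check that a local weights directory has the expected HuggingFace-style layout before submitting (the helper and example path below are hypothetical):

```python
from pathlib import Path

def looks_like_model_dir(path: str) -> bool:
    """Heuristic: a usable model directory has config.json plus weight shards."""
    p = Path(path).expanduser()
    has_config = (p / "config.json").exists()
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    return has_config and has_weights

# e.g. looks_like_model_dir("/scratch/models/my-model")
```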
Check the service logs:

tail -f logs/vllm.out   # vLLM server
tail -f logs/rag.out    # RAG services

See LICENSE.md for license details.