Deploy GPU-accelerated language model inference with optional RAG (Retrieval-Augmented Generation) capabilities on the ACTIVATE platform. Optimized for HPC environments using Apptainer/Singularity.
This workflow deploys an OpenAI-compatible inference server powered by vLLM, with optional RAG capabilities for context-aware responses using your own documents.
┌──────────────────────────────────────────────────────────────┐
│                      ACTIVATE Platform                        │
│  ┌────────────────┐   ┌────────────────┐   ┌──────────────┐  │
│  │  Your Browser  │──▶│   RAG Proxy    │──▶│ vLLM Server  │  │
│  │  (OpenWebUI,   │   │   (context     │   │    (LLM      │  │
│  │  Cline, etc)   │   │   injection)   │   │  inference)  │  │
│  └────────────────┘   └───────┬────────┘   └──────────────┘  │
│                               │                               │
│                       ┌───────▼────────┐                     │
│                       │   RAG Server   │                     │
│                       │   + ChromaDB   │                     │
│                       │   + Indexer    │                     │
│                       └────────────────┘                     │
└──────────────────────────────────────────────────────────────┘
| Component | Purpose |
|---|---|
| vLLM Server | High-performance LLM inference with PagedAttention |
| RAG Proxy | OpenAI-compatible API with automatic context injection |
| RAG Server | Semantic search over indexed documents |
| ChromaDB | Vector database for document embeddings |
| Auto-Indexer | Watches document directory and indexes new files |
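Conceptually, the RAG Proxy sits between your client and vLLM: for each chat request it retrieves the most relevant indexed chunks from ChromaDB, injects them into the prompt, and forwards the augmented request to the vLLM server. A minimal sketch of that flow is shown below; it is illustrative only (not the repo's rag_proxy.py), and the collection name, paths, and port are placeholders.

```python
import requests
import chromadb

# Placeholder paths/ports for illustration; the workflow wires up its own.
VLLM_URL = "http://localhost:8000/v1/chat/completions"
client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("documents")

def index_document(doc_id: str, text: str) -> None:
    # Roughly what the auto-indexer does for each new file: embed and store it.
    collection.add(ids=[doc_id], documents=[text])

def chat_with_context(model: str, question: str, n_results: int = 3) -> str:
    # 1. Semantic search over the indexed documents (the RAG Server's job).
    hits = collection.query(query_texts=[question], n_results=n_results)
    context = "\n\n".join(hits["documents"][0]) if hits["documents"] else ""

    # 2. Context injection: prepend the retrieved chunks as a system message.
    messages = [
        {"role": "system", "content": f"Use this context when answering:\n{context}"},
        {"role": "user", "content": question},
    ]

    # 3. Forward the augmented request to vLLM's OpenAI-compatible endpoint.
    resp = requests.post(VLLM_URL, json={"model": model, "messages": messages}, timeout=120)
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```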
To deploy:

- Navigate to the ACTIVATE workflow marketplace
- Select the vLLM + RAG workflow
- Choose your compute cluster and scheduler (SLURM/PBS/SSH)
Choose how to provide model weights:
| Option | When to Use |
|---|---|
| 📁 Local Path | Model weights pre-staged on cluster |
| 🤗 HuggingFace Clone | Clone from HuggingFace using git-lfs (HPC-friendly, caches locally) |
The HuggingFace Clone option uses git clone with git-lfs, which is more widely supported on HPC systems than the HuggingFace API. Models are cloned once to your cache directory and reused for subsequent runs.
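As a rough sketch of that clone-once-and-reuse behaviour (the cache path, repo ID, and helper function are illustrative, not the workflow's actual script):

```python
import subprocess
from pathlib import Path

def ensure_model(repo_id: str, cache_dir: str = "~/.cache/models") -> Path:
    """Clone a HuggingFace repo with git-lfs on first use, then reuse the cached copy."""
    target = Path(cache_dir).expanduser() / repo_id.replace("/", "--")
    if (target / "config.json").exists():
        return target  # cached from a previous run, nothing to download
    target.parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(["git", "lfs", "install", "--skip-repo"], check=True)
    subprocess.run(["git", "clone", f"https://huggingface.co/{repo_id}", str(target)], check=True)
    return target

# e.g. ensure_model("mistralai/Mistral-7B-Instruct-v0.2")
```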
Common vLLM configurations:
# 4-GPU with bfloat16 (recommended for large models)
--dtype bfloat16 --tensor-parallel-size 4 --gpu-memory-utilization 0.85
# Single GPU with memory constraints
--dtype float16 --max-model-len 4096 --gpu-memory-utilization 0.8

- Submit the workflow
- Click the Open WebUI link in the job output, or
- Connect your IDE (Cline, Continue, etc.) to the provided endpoint
The workflow runs in one of two modes:

| Mode | Description |
|---|---|
| vLLM + RAG | Full stack with document retrieval |
| vLLM Only | Inference server without RAG |
Once running, the service exposes OpenAI-compatible endpoints:
# List models
curl http://localhost:8081/v1/models
# Chat completion
curl -X POST http://localhost:8081/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "your-model", "messages": [{"role": "user", "content": "Hello!"}]}'
# Health check
curl http://localhost:8081/health
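Any OpenAI-compatible client works against the same endpoint; for example, with the openai Python package (the model name and port are placeholders, and a real API key is typically not required):

```python
from openai import OpenAI

# Point the client at the RAG proxy instead of api.openai.com.
client = OpenAI(base_url="http://localhost:8081/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
```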
Repository layout:

activate-rag-vllm/
├── workflow.yaml         # ACTIVATE workflow definition
├── start_service.sh      # Service entrypoint
├── rag_proxy.py          # OpenAI-compatible proxy
├── rag_server.py         # RAG search server
├── indexer.py            # Document indexer
├── run_local.sh          # Local development runner
├── singularity/          # Apptainer/Singularity container configs
├── docker/               # Docker configs (local dev)
├── lib/                  # Shared utilities
├── configs/              # HPC preset configurations
└── docs/                 # Additional documentation
Additional documentation:

| Document | Description |
|---|---|
| Local Development Guide | Running locally for debugging |
| Workflow Configuration | YAML workflow customization |
| Architecture | System design details |
| Implementation Plan | Development roadmap |
Common issues:

| Issue | Solution |
|---|---|
| CUDA out of memory | Reduce --gpu-memory-utilization or --max-model-len |
| Model not found | Verify path exists with config.json; check HF_TOKEN for gated models |
| git-lfs not found | Workflow auto-installs git-lfs locally if missing |
| Apptainer/Singularity not found | Load module: module load apptainer or module load singularity |
| Port in use | Service auto-finds available ports; check for existing instances |
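For the "Model not found" case, a quick hedged check that a local weights directory has the expected HuggingFace-style layout before submitting (the helper and example path below are hypothetical):

```python
from pathlib import Path

def looks_like_model_dir(path: str) -> bool:
    """Heuristic: a usable model directory has config.json plus weight shards."""
    p = Path(path).expanduser()
    has_config = (p / "config.json").exists()
    has_weights = any(p.glob("*.safetensors")) or any(p.glob("*.bin"))
    return has_config and has_weights

# e.g. looks_like_model_dir("/scratch/models/my-model")
```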
Check the service logs:

tail -f logs/vllm.out   # vLLM server
tail -f logs/rag.out    # RAG services

See LICENSE.md for license details.