197 changes: 197 additions & 0 deletions ep/bench/README_launch_vllm.md
@@ -0,0 +1,197 @@
# vLLM Multi-Node Expert Parallel Deployment Guide

This guide provides example scripts and instructions for deploying vLLM with Expert Parallelism (EP) across multiple nodes.

## 🎯 Overview

**Expert Parallelism (EP)** places the experts of Mixture-of-Experts (MoE) models on separate GPUs, so each GPU holds only a subset of the experts. This improves weight locality, efficiency, and throughput. EP is typically coupled with Data Parallelism (DP).

## 📦 Prerequisites

Before deploying vLLM with EP, ensure you have:

### Hardware Requirements

- **Multi-GPU nodes** (typically 8 GPUs per node)
- **High-speed interconnect** (InfiniBand, AWS EFA, or high-bandwidth Ethernet)
- **GPU memory** sufficient for model weights + KV cache (a rough estimate follows this list)
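
As a rough back-of-the-envelope estimate (not a hard rule): DeepSeek-V3 has roughly 671B parameters, so its FP8 weights occupy on the order of 671 GB. Spread across 16 GPUs with EP, that is about 42 GB of weights per GPU, leaving the remainder of an 80 GB+ GPU for activations and KV cache.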

### Software Requirements

- **Python 3.8+**
- **PyTorch** with CUDA support
- **vLLM** with EP support
- **Network access** between nodes

## 🚀 Installation

### 1. Install vLLM with EP Support

Follow the official guide:
```bash
# Install vLLM (latest version with EP support)
pip install vllm
```

For detailed EP setup, refer to:
📖 [vLLM Expert Parallel Deployment](https://docs.vllm.ai/en/stable/serving/expert_parallel_deployment.html)

### 2. Install DeepGEMM Library

DeepGEMM provides optimized kernels for MoE operations:

```bash
# Clone and install DeepGEMM
git clone https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
pip install -e .
```

📖 [DeepGEMM Installation Guide](https://github.com/deepseek-ai/DeepGEMM#installation)

### 3. Install EP Kernels

```bash
# Install DeepEP and pplx-kernels
# Follow vLLM's guide for EP kernels setup
```

### 4. (Optional) AWS EFA Setup

For AWS instances with EFA:

```bash
# Install AWS OFI-NCCL plugin
# This is pre-installed on AWS Deep Learning AMIs
sudo apt-get install aws-ofi-nccl
```
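
To confirm the EFA devices are visible before launching, you can query libfabric (assuming the libfabric utilities are installed, which they typically are on EFA-enabled AMIs):

```bash
# Should list one or more "efa" providers if EFA is set up correctly
fi_info -p efa
```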

### 5. (Optional) Disaggregated Serving

For prefill/decode split deployments:

```bash
# Install gdrcopy, ucx, and nixl
pip install nixl

# For optimal performance, install gdrcopy
# See: https://github.com/NVIDIA/gdrcopy
```

## ⚙️ Configuration

### Backend Selection

vLLM provides four EP communication backends (a selection example follows the table):

| Backend | Use Case | Features | Best For |
|---------|----------|----------|----------|
| `pplx` | Single node | Chunked prefill support | Development, intra-node |
| `deepep_high_throughput` | Multi-node prefill | Grouped GEMM | High throughput, prefill-dominated |
| `deepep_low_latency` | Multi-node decode | CUDA graph support | Low latency, decode-dominated |
| `allgather_reducescatter` | Multi-node | NCCL-based | InfiniBand/EFA networks |
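
The backend is chosen purely through environment variables at launch time, so switching backends requires no code changes. As a hypothetical example, the single-node command from the Deployment section could be switched to the NCCL-based backend like this:

```bash
# Same launch as the pplx example below, but with the NCCL-based backend
VLLM_ALL2ALL_BACKEND=allgather_reducescatter VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
    --tensor-parallel-size 1 \
    --data-parallel-size 8 \
    --enable-expert-parallel
```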

### Network Interface Detection

Find your network interface:

```bash
# List all network interfaces
ip addr show

# Common interface names:
# - eth0, eno1, enp0s3 (Ethernet)
# - ib0, ib1 (InfiniBand)
# - enp74s0, ens5 (Custom/AWS EFA)
```

### Environment Setup

Edit the provided scripts (`launch_vllm_node1.sh` and `launch_vllm_node2.sh`) to configure:

1. **PYTHONPATH** - Paths to vLLM, DeepGEMM, and EP kernels
2. **LD_LIBRARY_PATH** - Path to PyTorch libraries
3. **Network interfaces** - Set `GLOO_SOCKET_IFNAME`, `NCCL_SOCKET_IFNAME`
4. **Backend** - Choose the appropriate `VLLM_ALL2ALL_BACKEND` (see the sketch below)
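
A condensed sketch of these settings (all paths are placeholders; adjust them to your installation):

```bash
export PYTHONPATH=/path/to/vllm:/path/to/DeepGEMM:/path/to/DeepEP:/path/to/pplx-kernels:$PYTHONPATH
export LD_LIBRARY_PATH=$(python3 -c "import torch, os; print(os.path.join(torch.__path__[0], 'lib'))"):$LD_LIBRARY_PATH
export GLOO_SOCKET_IFNAME=eth0    # your primary network interface
export NCCL_SOCKET_IFNAME=eth0
export VLLM_ALL2ALL_BACKEND=deepep_low_latency   # or another backend from the table above
```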

## 🚢 Deployment

### Single Node Deployment

For single-node deployment (e.g., 8 GPUs on one node):

```bash
# Using pplx backend (recommended for single node)
VLLM_ALL2ALL_BACKEND=pplx VLLM_USE_DEEP_GEMM=1 \
vllm serve deepseek-ai/DeepSeek-V3-0324 \
--tensor-parallel-size 1 \
--data-parallel-size 8 \
--enable-expert-parallel
```

### Multi-Node Deployment (2+ Nodes)

#### Step 1: Start Node 1 (Primary)

On the **first node** (primary node that handles API requests):

```bash
# Get Node 1's IP address
NODE1_IP=$(hostname -I | awk '{print $1}')

# Launch Node 1
bash launch_vllm_node1.sh $NODE1_IP 13345 deepseek-ai/DeepSeek-V3-0324 16 8 8
```

**Arguments:**
- `NODE1_IP` - IP address of Node 1
- `13345` - RPC port for coordination
- `deepseek-ai/DeepSeek-V3-0324` - Model to serve
- `16` - Total DP size (across all nodes)
- `8` - Local DP size (GPUs on this node)
- `8` - Number of API servers

#### Step 2: Start Node 2+ (Secondary)

On **each additional node** (secondary nodes in headless mode):

```bash
# Use Node 1's IP (not this node's IP!)
NODE1_IP="10.1.59.30" # Replace with actual Node 1 IP

# Launch Node 2 (headless)
bash launch_vllm_node2.sh $NODE1_IP 13345 deepseek-ai/DeepSeek-V3-0324 16 8 8
```

**Arguments:**
- `NODE1_IP` - IP address of **Node 1** (primary)
- `13345` - Same RPC port as Node 1
- `deepseek-ai/DeepSeek-V3-0324` - Same model as Node 1
- `16` - Same total DP size as Node 1
- `8` - Local DP size on this node
- `8` - Starting rank (= sum of previous nodes' local DP sizes; see the sketch below)
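
With uniform nodes, the starting rank for a given node is simply the number of DP ranks that come before it. A minimal sketch, assuming `LOCAL_DP_SIZE` GPUs on every node:

```bash
NODE_INDEX=2       # 1 for the primary node, 2 for the next, ...
LOCAL_DP_SIZE=8
START_RANK=$(( (NODE_INDEX - 1) * LOCAL_DP_SIZE ))   # 8 for node 2, 16 for node 3, ...
echo "$START_RANK"
```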

### Example: 2-Node Deployment

**Configuration:**
- 2 nodes × 8 GPUs = 16 GPUs total
- DP size = 16 (8 per node)
- Model: DeepSeek-V3-0324

**Node 1 (10.1.59.30):**
```bash
bash launch_vllm_node1.sh 10.1.59.30 13345 deepseek-ai/DeepSeek-V3-0324 16 8 8
```

**Node 2 (10.1.60.57):**
```bash
bash launch_vllm_node2.sh 10.1.59.30 13345 deepseek-ai/DeepSeek-V3-0324 16 8 8
```

### Startup Sequence

1. **Start Node 1 first** - Wait for API servers to start
2. **Wait 30-60 seconds** - Allow model loading and initialization
3. **Start Node 2** - It will connect to Node 1 via RPC
4. **Verify connection** - Check logs for "Connected all rings"; a quick API check is sketched below
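
Once both nodes are up, a quick way to confirm the primary node is serving (assuming the default API port 8000 and that the port is reachable) is to query the OpenAI-compatible models endpoint:

```bash
# Run from any machine that can reach Node 1
curl http://$NODE1_IP:8000/v1/models
```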

150 changes: 150 additions & 0 deletions ep/bench/launch_vllm_node1.sh
@@ -0,0 +1,150 @@
#!/bin/bash
# Node 1 (Primary) - Multi-node vLLM with Expert Parallel (EP)
# This node handles incoming requests
#
# Prerequisites:
# 1. Install vLLM with EP support: https://docs.vllm.ai/en/stable/serving/expert_parallel_deployment.html#architecture-overview
# 2. Install DeepGEMM: https://github.com/deepseek-ai/DeepGEMM#installation
# 3. Install EP kernels: Follow vLLM's EP installation guide
# 4. For AWS EFA: Install AWS OFI-NCCL plugin

set -e

echo "🚀 Launching vLLM Node 1 (Primary) with Expert Parallel..."

# Check if IP is provided
if [ -z "$1" ]; then
echo "❌ Error: Node IP address is required!"
echo ""
echo "Usage: $0 <NODE1_IP> [RPC_PORT] [MODEL] [TOTAL_DP_SIZE] [LOCAL_DP_SIZE] [API_SERVERS]"
echo ""
echo "Example:"
echo " $0 10.1.107.86 13345 deepseek-ai/DeepSeek-V3-0324 16 8 8"
echo ""
echo "💡 To find your IP address, run: hostname -I"
exit 1
fi

# ============================================================================
# ENVIRONMENT CONFIGURATION
# ============================================================================
# IMPORTANT: Adjust these paths according to your installation
# ============================================================================

# Python path configuration (adjust to your installation)
# Example paths - modify according to your setup:
export PYTHONPATH=/path/to/vllm:$PYTHONPATH
export PYTHONPATH=/path/to/DeepGEMM:$PYTHONPATH
export PYTHONPATH=/path/to/DeepEP:$PYTHONPATH
export PYTHONPATH=/path/to/pplx-kernels:$PYTHONPATH

# PyTorch library path (required for DeepGEMM).
# Either set it explicitly:
# export LD_LIBRARY_PATH=/path/to/python/site-packages/torch/lib:$LD_LIBRARY_PATH
# ...or derive it from the installed torch package (works for conda/pip installs):
export LD_LIBRARY_PATH=$(python3 -c "import torch; import os; print(os.path.join(torch.__path__[0], 'lib'))"):$LD_LIBRARY_PATH

# ============================================================================
# BACKEND CONFIGURATION
# ============================================================================
# Choose the appropriate backend based on your setup:
# - pplx: Single node deployment
# - deepep_low_latency: Multi-node, low-latency (decode-dominated workloads)
# - deepep_high_throughput: Multi-node, high-throughput (prefill-dominated)
# - allgather_reducescatter: Multi-node with NCCL (works well with InfiniBand/EFA)

export VLLM_ALL2ALL_BACKEND=allgather_reducescatter
export VLLM_USE_DEEP_GEMM=1

# ============================================================================
# NETWORK CONFIGURATION
# ============================================================================

# For InfiniBand/EFA clusters: Prevent initialization hangs
# This ensures torch distributed uses Ethernet for initial setup
# Find your network interface: ip addr show | grep -E 'eth|ib|enp'
#
# Common interfaces:
# - eth0, eno1, enp0s3 (Ethernet)
# - ib0, ib1 (InfiniBand)
# - enp74s0, ens5 (Custom/AWS EFA)

export GLOO_SOCKET_IFNAME=eth0   # Change to your primary network interface
export NCCL_SOCKET_IFNAME=eth0   # Change to your primary network interface
export TP_SOCKET_IFNAME=eth0     # Needed when using tensor parallel across nodes

# ============================================================================
# NCCL CONFIGURATION (Optional - for advanced users)
# ============================================================================

# AWS EFA NCCL plugin (uncomment if using AWS EFA):
# export NCCL_NET_PLUGIN="/opt/amazon/ofi-nccl/lib/x86_64-linux-gnu/libnccl-net.so"

# NCCL performance tuning (optional):
export NCCL_P2P_NET_CHUNKSIZE=524288
export NCCL_BUFFSIZE=8388608

# ============================================================================
# ARGUMENTS PARSING
# ============================================================================

NODE1_IP="$1" # Node 1 IP address (REQUIRED)
RPC_PORT="${2:-13345}" # RPC communication port
MODEL="${3:-deepseek-ai/DeepSeek-V3-0324}" # Model to serve
TOTAL_DP_SIZE="${4:-16}" # Total DP size across all nodes
LOCAL_DP_SIZE="${5:-8}" # Local DP size on this node
API_SERVERS="${6:-8}" # Number of API servers

# Recommendations:
# - TOTAL_DP_SIZE = LOCAL_DP_SIZE * NUMBER_OF_NODES
# - LOCAL_DP_SIZE = Number of GPUs per node (typically 8 for 8xGPU nodes)
# - API_SERVERS = LOCAL_DP_SIZE (one server per local DP process)
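
# Optional sanity check (assumes the same number of GPUs on every node):
# TOTAL_DP_SIZE should be an exact multiple of LOCAL_DP_SIZE.
if (( TOTAL_DP_SIZE % LOCAL_DP_SIZE != 0 )); then
    echo "⚠️  Warning: TOTAL_DP_SIZE (${TOTAL_DP_SIZE}) is not a multiple of LOCAL_DP_SIZE (${LOCAL_DP_SIZE})"
fi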

# ============================================================================
# CONFIGURATION SUMMARY
# ============================================================================

echo ""
echo "╔═══════════════════════════════════════════════════════════════╗"
echo "║ vLLM Expert Parallel Configuration ║"
echo "╚═══════════════════════════════════════════════════════════════╝"
echo ""
echo "Backend Configuration:"
echo " • Backend: ${VLLM_ALL2ALL_BACKEND}"
echo " • DeepGEMM: Enabled"
echo ""
echo "Node Configuration:"
echo " • Role: Primary (handles API requests)"
echo " • Model: ${MODEL}"
echo " • Node IP: ${NODE1_IP}"
echo " • RPC Port: ${RPC_PORT}"
echo ""
echo "Parallelism Configuration:"
echo " • Total Data Parallel Size: ${TOTAL_DP_SIZE} (across all nodes)"
echo " • Local Data Parallel Size: ${LOCAL_DP_SIZE} (this node)"
echo " • API Servers: ${API_SERVERS}"
echo " • Expert Parallel: Enabled (automatically calculated)"
echo ""
echo "═══════════════════════════════════════════════════════════════"
echo ""

# ============================================================================
# LAUNCH vLLM SERVER
# ============================================================================

# TP is usually 1 for EP; data parallelism spans all nodes (TOTAL_DP_SIZE ranks
# in total, LOCAL_DP_SIZE of them on this node). The primary node's IP and RPC
# port coordinate the DP group, and API_SERVERS front-end processes handle requests.
vllm serve "${MODEL}" \
    --tensor-parallel-size 1 \
    --enable-expert-parallel \
    --data-parallel-size "${TOTAL_DP_SIZE}" \
    --data-parallel-size-local "${LOCAL_DP_SIZE}" \
    --data-parallel-address "${NODE1_IP}" \
    --data-parallel-rpc-port "${RPC_PORT}" \
    --api-server-count="${API_SERVERS}" \
    --trust-remote-code

# Additional useful options (uncomment as needed):
# --max-model-len 8192 \ # Max sequence length
# --gpu-memory-utilization 0.9 \ # GPU memory usage (0.0-1.0)
# --dtype auto \ # Data type (auto/float16/bfloat16)
# --enable-chunked-prefill \ # Enable chunked prefill
# --port 8000 \ # API server port