Commit ff2d8ae

[Doc] Add vllm integration v1 doc (#129)
* [Doc] Add vllm integration v1 doc. Signed-off-by: Shangming Cai <[email protected]>
1 parent d3224c5 commit ff2d8ae

4 files changed: +162 −1 lines changed
File renamed without changes.

doc/en/vllm-integration.md → doc/en/vllm-integration-v0.1.md

+1 −1
```diff
@@ -1,7 +1,7 @@
 # vLLM Disaggregated Prefill/Decode Demo
 
 ## Overview
-Currently, we support mooncake-transfer-engine integration with the vLLM project based on [PR 8498](https://github.com/vllm-project/vllm/pull/8498) (vllm version: v0.6.2) to accelerate KVCache transfer for inter-node disaggregated Prefill/Decode scenario ([Benchmark results](vllm-benchmark-results.md)). In the future, we will bypass PR 8498, release a disaggregated KVStore, and fully integrate it with the vLLM Prefix Caching feature to support multi-instance KVCache Sharing.
+Currently, we support mooncake-transfer-engine integration with the vLLM project based on [PR 8498](https://github.com/vllm-project/vllm/pull/8498) (vllm version: v0.6.2) to accelerate KVCache transfer for inter-node disaggregated Prefill/Decode scenario ([Benchmark results](vllm-benchmark-results-v0.1.md)). In the future, we will bypass PR 8498, release a disaggregated KVStore, and fully integrate it with the vLLM Prefix Caching feature to support multi-instance KVCache Sharing.
 
 ![vllm-integration-demo](../../image/vllm-integration-demo.gif)
```
doc/en/vllm-integration-v1.md

+161
@@ -0,0 +1,161 @@
# vLLM Disaggregated Prefill with MooncakeStore

## Overview
This is the latest version of the MooncakeStore integration doc with the vLLM project, based on [PR 10502](https://github.com/vllm-project/vllm/pull/10502) and [PR 12957](https://github.com/vllm-project/vllm/pull/12957), to support KVCache transfer for intra-node and inter-node disaggregated Prefill/Decode scenarios. Benchmark results will be released soon.

Main changes from v0.x to v1:
- XpYd support and orchestration
  - Dynamically changing the population of the prefill group and the decode group
- More stable and more fault-tolerant
  - A sudden crash of a single vllm instance is tolerable
  - Since instance-to-instance connections are removed, each instance works as a vanilla vllm instance; this means it can serve requests that do not come from the proxy and finish them normally

**_Please note that this is still an experimental version and will be modified anytime based on feedback from the vLLM community._**

## Installation
### Prerequisite
Please install MooncakeStore according to the [instructions](build.md) first.

### Install an experimental version of vLLM
#### 1. Clone vLLM from official repo
```bash
git clone git@github.com:kvcache-ai/vllm.git
```
#### 2. Build
##### 2.1 Build from source
```bash
cd vllm
git checkout xpyd_preview
# Install the latest vLLM release first to pull in its dependencies
pip3 install vllm --upgrade
# Then install this checkout in editable mode, reusing the precompiled binaries
VLLM_USE_PRECOMPILED=1 pip3 install -e .
```
- If you encounter any problems that you cannot solve, please refer to the [vLLM official compilation guide](https://docs.vllm.ai/en/latest/getting_started/installation/index.html).
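
To confirm the build is importable before moving on, a quick optional check (`vllm.__version__` is a standard attribute of vLLM releases):

```bash
# Print the installed vLLM version; an import error here means the build failed
python3 -c "import vllm; print(vllm.__version__)"
```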

## Configuration
### Prepare configuration file to Run Example over RDMA

- Prepare a _**mooncake.json**_ file for both Prefill and Decode instances
```json
{
    "local_hostname": "192.168.0.137",
    "metadata_server": "etcd://192.168.0.137:2379",
    "protocol": "rdma",
    "device_name": "erdma_0",
    "master_server_address": "192.168.0.137:50001"
}
```
- "local_hostname": The IP address of the current node used to communicate with the etcd server for metadata.
  - **_All prefill instances and decode instances can share this config file on the same node._**
- "metadata_server": The metadata server of the mooncake transfer engine. For example,
  - Use `etcd` as backend: `"192.168.0.137:2379"`, `"etcd://192.168.0.137:2379"` or `"etcd://192.168.0.137:2379,192.168.0.138:2379"`
  - Use `redis` as backend: `"redis://192.168.0.137:6379"`
  - Use `http` as backend: `"http://192.168.0.137:8080/metadata"`
- "protocol": The protocol to be used for data transmission ("rdma" or "tcp").
- "device_name": The device to be used for data transmission. It is required when "protocol" is set to "rdma". If multiple NIC devices are used, they can be separated by commas, such as "erdma_0,erdma_1". Please note that there are no spaces between them. (See the snippet after this list for how to discover available device names.)
- "master_server_address": The IP address and port of the master daemon process of MooncakeStore.
### Prepare configuration file to Run Example over TCP

- Prepare a _**mooncake.json**_ file for both Prefill and Decode instances
```json
{
    "local_hostname": "192.168.0.137",
    "metadata_server": "etcd://192.168.0.137:2379",
    "protocol": "tcp",
    "device_name": "",
    "master_server_address": "192.168.0.137:50001"
}
```
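
Either way, the configured metadata server must be reachable from every instance. A quick optional reachability check with `etcdctl` (assuming the etcd v3 CLI is installed and the address matches your mooncake.json):

```bash
# Verify the etcd metadata server from mooncake.json is up and answering
etcdctl --endpoints=http://192.168.0.137:2379 endpoint health
```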

## Run Example
- Please change the IP addresses and ports in the following guide according to your env.
```bash
# Begin from `root` of your cloned repo!

# 1. Start the etcd server
etcd --listen-client-urls http://0.0.0.0:2379 --advertise-client-urls http://localhost:2379
# You may need to terminate other etcd processes before running the above command

# 2. Start the mooncake_master server
mooncake_master --port 50001
# If some vllm instances exit unexpectedly, some connection metadata will be corrupted since it is not properly cleaned up. In that case, we recommend restarting the mooncake_master before running another test.

# 3. Run multiple vllm instances
# kv_producer role
MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8100 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

CUDA_VISIBLE_DEVICES=1 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8101 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_producer"}'

# kv_consumer role
CUDA_VISIBLE_DEVICES=2 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8200 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

CUDA_VISIBLE_DEVICES=3 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8201 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_consumer"}'

# kv_both role
CUDA_VISIBLE_DEVICES=4 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8300 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

CUDA_VISIBLE_DEVICES=5 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8301 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

CUDA_VISIBLE_DEVICES=6 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8302 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'

CUDA_VISIBLE_DEVICES=7 MOONCAKE_CONFIG_PATH=./mooncake.json python3 -m vllm.entrypoints.openai.api_server --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --port 8303 --max-model-len 10000 --gpu-memory-utilization 0.8 --kv-transfer-config '{"kv_connector":"MooncakeStoreConnector","kv_role":"kv_both"}'
```

- `MOONCAKE_CONFIG_PATH` is the path to the mooncake.json configuration file.
- `VLLM_USE_MODELSCOPE` is optional; if you have access to Hugging Face, please remove it.
- The `--model` parameter specifies the model to use.
- The `--port` parameter specifies the vllm service port on which to listen.
- The `--max-model-len` parameter specifies the maximum context length of the model.
- Option `--tensor_parallel_size` / `-tp` is supported. Example: append `-tp 2` to the run command to run vllm with multiple GPUs.
  - Note: All instances should have the same tensor_parallel_size.
  - If you want to run the prefill instance and decode instance on the same node, please set up different `CUDA_VISIBLE_DEVICES`. For example, `CUDA_VISIBLE_DEVICES=0,1` for the prefill instance and `CUDA_VISIBLE_DEVICES=2,3` for the decode instance.
- The `--kv-transfer-config` parameter specifies the connector and its config to be used.
  - Please set `kv_connector` to `MooncakeStoreConnector`.
  - `kv_role` is the node's role, either 'kv_producer', 'kv_consumer' or 'kv_both'.
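
Before wiring up the proxy, it can help to confirm each instance is actually serving. This is an optional sanity check using the ports from the example above; `/v1/models` is part of vLLM's OpenAI-compatible API:

```bash
# Each running instance should report its served model on /v1/models
for port in 8100 8101 8200 8201; do
    echo "instance on port ${port}:"
    curl -s "http://localhost:${port}/v1/models" | jq '.data[].id'
done
```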

```bash
# 4. Start the proxy server
cd vllm
python3 examples/online_serving/disagg_examples/disagg_proxy_demo.py --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --prefill localhost:8100 localhost:8101 --decode localhost:8200 localhost:8201 --port 8000
```

- The `--model` parameter specifies the model to use; it also specifies the tokenizer used by the proxy server.
- The `--port` parameter specifies the proxy service port on which to listen.
- The `--prefill` or `-p` option specifies the IP and port of the vllm prefill instances.
- The `--decode` or `-d` option specifies the IP and port of the vllm decode instances.

```bash
# If you want to dynamically adjust the instances of p-nodes and d-nodes during runtime, you need to configure this environment variable.
export ADMIN_API_KEY="xxxxxxxx"
# or add it before the command:
ADMIN_API_KEY="xxxxxxxx" python3 vllm/examples/online_serving/disagg_examples/disagg_demo.py --model Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4 --prefill localhost:8100 localhost:8101 --decode localhost:8200 localhost:8201 --port 8000 --scheduling round_robin

# Then use this command to add instances into the prefill group or decode group
curl -X POST "http://localhost:8000/instances/add" -H "Content-Type: application/json" -H "X-API-Key: $ADMIN_API_KEY" -d '{"type": "prefill", "instance": "localhost:8300"}'

curl -X POST "http://localhost:8000/instances/add" -H "Content-Type: application/json" -H "X-API-Key: $ADMIN_API_KEY" -d '{"type": "decode", "instance": "localhost:8301"}'

curl -X POST "http://localhost:8000/instances/add" -H "Content-Type: application/json" -H "X-API-Key: $ADMIN_API_KEY" -d '{"type": "prefill", "instance": "localhost:8302"}'

curl -X POST "http://localhost:8000/instances/add" -H "Content-Type: application/json" -H "X-API-Key: $ADMIN_API_KEY" -d '{"type": "decode", "instance": "localhost:8303"}'

# Use this command to get the proxy status
curl localhost:8000/status | jq
```

The Mooncake team implements this simple disagg_proxy with round-robin scheduling as a demo. In production, service providers and users can implement their own global proxy strategies according to their needs.

**_Be sure to change the IP addresses in the commands._**

## Test with OpenAI-compatible request
```bash
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
  "prompt": "San Francisco is a",
  "max_tokens": 1000
}'
```
- If you are not testing on the proxy server, please change `localhost` to the IP address of the proxy server.
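
To inspect just the generated text rather than the full JSON response, the output can be piped through `jq` (already used above for the status endpoint); `.choices[0].text` is the standard field in an OpenAI-style completions response:

```bash
# Same request as above, printing only the generated completion text
curl -s http://localhost:8000/v1/completions -H "Content-Type: application/json" -d '{
  "model": "Qwen/Qwen2.5-7B-Instruct-GPTQ-Int4",
  "prompt": "San Francisco is a",
  "max_tokens": 100
}' | jq -r '.choices[0].text'
```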
File renamed without changes.
