
[Docs] Optimal Deployment #2768


Merged
merged 12 commits into PaddlePaddle:develop
Aug 1, 2025

Conversation

@ming1753 (Collaborator) commented Jul 9, 2025

Add ERNIE-4.5-VL-28B-A3B-Paddle Optimal Deployment

paddle-bot bot commented Jul 9, 2025

Thanks for your contribution!

> **gpu-memory-utilization**
- **Parameter:** `--gpu-memory-utilization`
- **Purpose:** Controls the GPU memory available to FastDeploy when initializing the service. The default is 0.9, i.e. 10% of GPU memory is kept in reserve.
- **Recommendation:** 0.9 on A-series GPUs, 0.8~0.9 on H-series GPUs. If the service reports insufficient GPU memory during load testing, try lowering this value.
Collaborator:

A-series GPU -> A100/A800?
H-series GPU -> H100/H800?
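
For illustration, here is a minimal sketch of applying the `--gpu-memory-utilization` recommendation above when load testing reports insufficient memory. The model path and other flags simply mirror the example commands later in this document; 0.8 is only one example of lowering the 0.9 default, not a verified setting.

```shell
# Sketch: reserve extra headroom (0.8 instead of the 0.9 default) if the
# service reports out-of-memory during load testing. The other flags follow
# the single-GPU example shown later in this document.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --quantization wint4 \
  --enable-mm
```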

```shell
--quantization wint4 \
--enable-mm \
```
### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration
Collaborator:

Context Length Configuration -> context length, Wint8 -> ?

@ming1753 changed the title from "Optimal Deployment" to "[Docs] Optimal Deployment" on Jul 9, 2025
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
Collaborator:

Why is bf16 the best here?

| L20 [48G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>2, 4 |
| H20 [144G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| A100 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| H800 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
Collaborator:

For consistency with the other docs, capitalize WINT4 and WINT8.

### 2.1 Basics: Launching the Service
**Example 1:** Single-GPU deployment on a 4090 with a 32K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

The command should be wrapped across multiple lines; otherwise it is too long.
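
As a quick sanity check after launching Example 1, a request against the service could look like the sketch below. This assumes the api_server exposes the standard OpenAI-compatible `/v1/chat/completions` route on the configured `--port`; the request body is only illustrative.

```shell
# Assumption: OpenAI-compatible chat completions endpoint on port 8180.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
        "messages": [
          {"role": "user", "content": "Give a one-sentence summary of what you can do."}
        ]
      }'
```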

**Example 2:** Dual-GPU deployment on H800 with a 128K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

Same as above; also, I suggest following the latest approach of not setting kv-cache-ratio, in line with version 2.1.
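
A sketch of what that suggestion would look like for Example 2: the documented command with `--kv-cache-ratio` dropped and every other flag left unchanged. This is only an illustration of the review comment, not a verified recommended configuration.

```shell
# Example 2 without --kv-cache-ratio, per the review suggestion above;
# all remaining flags are unchanged from the documented command.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```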

⚠️ Note: A longer context significantly increases GPU memory requirements. Make sure your hardware has sufficient resources before configuring a longer context.
> **Maximum number of sequences**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle; values from 1 to 256 are supported.
Collaborator:

Is a batch size above 256 currently unsupported? If so, this needs follow-up investigation.

> **Context length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can handle.
- **Recommendation:** A longer context reduces throughput, so set this according to your actual needs; a context length of up to **128K** (131072) is supported.
Collaborator:

Make it clear that the 128K length limit is specific to this model.


- **Other related settings**:

`--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended value is 384.
Collaborator:

Why is this set as low as 384? This needs an explanation.

- bfloat16 (the default when the `--quantization` parameter is not set)

- **Recommendation:**
- Unless you have extremely strict accuracy requirements, we strongly recommend using wint4 quantization. It significantly reduces memory usage and improves throughput.
Collaborator:

The wording "strongly recommend" may not be appropriate here; consider removing it.
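
For reference, a small sketch of how the three documented precision options map onto the launch flag: only `--quantization` differs, and bfloat16 is selected simply by omitting the flag. The surrounding flags are illustrative, following the earlier examples in this doc.

```shell
# Choose one documented precision option; leave QUANT empty for bfloat16
# (the default when --quantization is not set).
QUANT="--quantization wint4"   # or "--quantization wint8", or "" for bfloat16
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --max-model-len 32768 \
  --enable-mm \
  $QUANT
```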

**Example 1:** 8-GPU deployment on H800 with a 128K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.8 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

Some of the same issues as noted for the document above.

| A100 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| H800 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |

### 1.2 Install fastdeploy
Collaborator:

fastdeploy -> FastDeploy

@ming1753 merged commit fc5f43c into PaddlePaddle:develop on Aug 1, 2025
9 of 13 checks passed