
[Docs] Optimal Deployment #2768


Merged
merged 12 commits into PaddlePaddle:develop
Aug 1, 2025

Conversation

@ming1753 (Collaborator) commented Jul 9, 2025

Add ERNIE-4.5-VL-28B-A3B-Paddle Optimal Deployment

paddle-bot bot commented Jul 9, 2025

Thanks for your contribution!

> **gpu-memory-utilization**
- **Parameter:** `--gpu-memory-utilization`
- **Purpose:** Controls the GPU memory available to FastDeploy when initializing the service. The default is 0.9, i.e. 10% of GPU memory is kept in reserve.
- **Recommendation:** 0.9 on A-series GPUs, 0.8~0.9 on H-series GPUs. If the service reports insufficient GPU memory during load testing, try lowering this value.
Collaborator:

A-series GPU -> A100/A800?
H-series GPU -> H100/H800?
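
For illustration, here is a minimal sketch of applying the `--gpu-memory-utilization` recommendation above when load testing reports insufficient memory. The model path and other flags simply mirror the example commands later in this document; 0.8 is only one example of lowering the 0.9 default, not a verified setting.

```shell
# Sketch: reserve extra headroom (0.8 instead of the 0.9 default) if the
# service reports out-of-memory during load testing. The other flags follow
# the single-GPU example shown later in this document.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.8 \
  --quantization wint4 \
  --enable-mm
```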

```shell
--quantization wint4 \
--enable-mm \
```
### **Example**: Dual-GPU Wint8 with 128K Context Length Configuration
Collaborator:

Context Length Configuration -> context length, Wint8 -> ?

@ming1753 changed the title from "Optimal Deployment" to "[Docs] Optimal Deployment" on Jul 9, 2025
|:----------:|:----------:|:------:|:------:|
| A30 | wint4 | 432.99 | 17396.92 |
| L20 | wint4<br>wint8 | 3311.34<br>2423.36 | 46566.81<br>60790.91 |
| H20 | wint4<br>wint8<br>bfloat16 | 3827.27<br>3578.23<br>4100.83 | 89770.14<br>95434.02<br>84543.00 |
Collaborator:

Why is bf16 the best here?

| L20 [48G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>2, 4 |
| H20 [144G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| A100 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| H800 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
Collaborator:

For consistency with the other docs, capitalize WINT4 and WINT8.

### 2.1 Basics: Launching the Service
**Example 1:** Single-GPU deployment on a 4090 with a 32K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 1 \
  --max-model-len 32768 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

The command should be wrapped across multiple lines; otherwise it is too long.
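
As a quick sanity check after launching Example 1, a request against the service could look like the sketch below. This assumes the api_server exposes the standard OpenAI-compatible `/v1/chat/completions` route on the configured `--port`; the request body is only illustrative.

```shell
# Assumption: OpenAI-compatible chat completions endpoint on port 8180.
curl -s http://localhost:8180/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "baidu/ERNIE-4.5-VL-28B-A3B-Paddle",
        "messages": [
          {"role": "user", "content": "Give a one-sentence summary of what you can do."}
        ]
      }'
```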

**Example 2:** Dual-GPU deployment on H800 with a 128K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

Same as above; also, I suggest following the latest approach of not setting kv-cache-ratio, in line with version 2.1.
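
A sketch of what that suggestion would look like for Example 2: the documented command with `--kv-cache-ratio` dropped and every other flag left unchanged. This is only an illustration of the review comment, not a verified recommended configuration.

```shell
# Example 2 without --kv-cache-ratio, per the review suggestion above;
# all remaining flags are unchanged from the documented command.
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 2 \
  --max-model-len 131072 \
  --max-num-seqs 256 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.9 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```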

⚠️ Note: A longer context significantly increases GPU memory requirements. Make sure your hardware has sufficient resources before configuring a longer context.
> **Maximum number of sequences**
- **Parameter:** `--max-num-seqs`
- **Description:** Controls the maximum number of sequences the service can handle; values from 1 to 256 are supported.
Collaborator:

Is a batch size above 256 currently unsupported? If so, this needs follow-up investigation.

> **Context length**
- **Parameter:** `--max-model-len`
- **Description:** Controls the maximum context length the model can handle.
- **Recommendation:** A longer context reduces throughput, so set this according to your actual needs; a context length of up to **128K** (131072) is supported.
Collaborator:

Make it clear that the 128K length limit is specific to this model.


- **Other related settings**:

`--max-num-batched-tokens`: Limits the maximum number of tokens per chunk; the recommended value is 384.
Collaborator:

Why is this set as low as 384? This needs an explanation.

- bfloat16 (the default when the `--quantization` parameter is not set)

- **Recommendation:**
- Unless you have extremely strict accuracy requirements, we strongly recommend using wint4 quantization. It significantly reduces memory usage and improves throughput.
Collaborator:

The wording "strongly recommend" may not be appropriate here; consider removing it.
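
For reference, a small sketch of how the three documented precision options map onto the launch flag: only `--quantization` differs, and bfloat16 is selected simply by omitting the flag. The surrounding flags are illustrative, following the earlier examples in this doc.

```shell
# Choose one documented precision option; leave QUANT empty for bfloat16
# (the default when --quantization is not set).
QUANT="--quantization wint4"   # or "--quantization wint8", or "" for bfloat16
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-28B-A3B-Paddle \
  --port 8180 \
  --max-model-len 32768 \
  --enable-mm \
  $QUANT
```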

**Example 1:** 8-GPU deployment on H800 with a 128K context
```shell
python -m fastdeploy.entrypoints.openai.api_server \
  --model baidu/ERNIE-4.5-VL-424B-A47B-Paddle \
  --port 8180 \
  --metrics-port 8181 \
  --engine-worker-queue-port 8182 \
  --tensor-parallel-size 8 \
  --max-model-len 131072 \
  --max-num-seqs 16 \
  --limit-mm-per-prompt '{"image": 100, "video": 100}' \
  --reasoning-parser ernie-45-vl \
  --gpu-memory-utilization 0.8 \
  --kv-cache-ratio 0.75 \
  --enable-chunked-prefill \
  --max-num-batched-tokens 384 \
  --quantization wint4 \
  --enable-mm
```
Collaborator:

Some of the same issues as noted for the document above.

| A100 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |
| H800 [80G] | wint4<br>wint8<br>bfloat16 | 1, 2, 4<br>1, 2, 4<br>1, 2, 4 |

### 1.2 Install fastdeploy
Collaborator:

fastdeploy -> FastDeploy

@ming1753 merged commit fc5f43c into PaddlePaddle:develop on Aug 1, 2025
9 of 13 checks passed