
Conversation

@grimoire (Collaborator) commented Sep 24, 2025

Requirements:

Enable different TP sizes for Attention / MLP / MoE.
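
A minimal sketch of the idea, for illustration only (the tp_plan dict, shard_columns helper, and tensor sizes below are made up and are not lmdeploy's actual API): each module family gets its own tensor-parallel degree and its weights are sharded with that degree, so e.g. attention can be split 8-way while MoE experts are split 4-way and each expert shard is shared by pairs of ranks.

import torch

def shard_columns(weight: torch.Tensor, tp_size: int, rank: int) -> torch.Tensor:
    """Column-parallel shard: split the output dimension across tp_size ranks."""
    chunk = (weight.shape[0] + tp_size - 1) // tp_size
    start = rank * chunk
    return weight[start:start + chunk]

# Hypothetical per-module TP plan on an 8-GPU node (values are illustrative).
tp_plan = {'attn': 8, 'mlp': 8, 'moe': 4}

rank = 3
attn_qkv = torch.empty(3 * 4096, 4096)     # fused QKV projection (made-up sizes)
moe_expert_up = torch.empty(1536, 4096)    # one expert's up projection (made-up sizes)

attn_shard = shard_columns(attn_qkv, tp_plan['attn'], rank)                       # 8-way split
moe_shard = shard_columns(moe_expert_up, tp_plan['moe'], rank % tp_plan['moe'])   # 4-way split
print(attn_shard.shape, moe_shard.shape)   # torch.Size([1536, 4096]) torch.Size([384, 4096])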

grimoire marked this pull request as ready for review on September 25, 2025 04:52
grimoire mentioned this pull request on Sep 25, 2025
# Prefill
prefill_request_dict = copy.deepcopy(request_dict)
prefill_request_dict['max_tokens'] = 1
prefill_request_dict['max_completion_tokens'] = 1
Collaborator:
What is max_completion_tokens used for? What is the difference between prefill_request_dict['max_completion_tokens'] and prefill_request_dict['max_tokens']?


@grimoire (Collaborator, Author) commented Nov 2, 2025

Fixed

@lvhan028 (Collaborator) commented Nov 8, 2025

On H800:

lmdeploy serve proxy
export LMDEPLOY_DP_MASTER_ADDR=0.0.0.0
export LMDEPLOY_DP_MASTER_PORT=8888
lmdeploy serve api_server Qwen/Qwen3-235B-A22B --dp 2 --tp 8 --max-batch-size 64 --cache-max-entry-count 0.6 --max-prefill-token-num 4096 --proxy-url http://0.0.0.0:8000

# oc evaluation
opencompass workspace/eval/qwen3_235b_infer.py -m infer -w workspace/eval/qwen3-235b-dp2-tp8 -r latest

Got OOM

(RayWorkerWrapper pid=4133398) Traceback (most recent call last):
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 810, in _on_finish_callback
(RayWorkerWrapper pid=4133398)     task.result()
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 780, in _async_loop_background
(RayWorkerWrapper pid=4133398)     await self._async_step_background(**forward_inputs, )
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 704, in _async_step_background
(RayWorkerWrapper pid=4133398)     output = await self._async_model_forward(
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 540, in _async_model_forward
(RayWorkerWrapper pid=4133398)     ret = await __long_context_single_forward(inputs, max_seqlen)
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/engine/model_agent.py", line 506, in __long_context_single_forward
(RayWorkerWrapper pid=4133398)     inp.build_dp_meta()
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/model_inputs.py", line 307, in build_dp_meta
(RayWorkerWrapper pid=4133398)     self.dp_meta = DPMeta.build(self.input_ids.numel())
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/model_inputs.py", line 44, in build
(RayWorkerWrapper pid=4133398)     tp_sizes = cls._gather_tp_sizes(mlp_tp, seqlen, dist_ctx, layer_type='mlp')
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/model_inputs.py", line 32, in _gather_tp_sizes
(RayWorkerWrapper pid=4133398)     dist.all_gather_object(tp_sizes, seqlen, group=gather_group)
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/lmdeploy/lmdeploy/pytorch/distributed.py", line 414, in all_gather_object
(RayWorkerWrapper pid=4133398)     return dist.all_gather_object(object_list, obj, group=group)
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/miniconda3/envs/lmdeploy-0.10.0/lib/python3.10/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
(RayWorkerWrapper pid=4133398)     return func(*args, **kwargs)
(RayWorkerWrapper pid=4133398)   File "/nvme1/lvhan/miniconda3/envs/lmdeploy-0.10.0/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 3171, in all_gather_object
(RayWorkerWrapper pid=4133398)     input_tensor.resize_(max_object_size)
(RayWorkerWrapper pid=4133398) torch.OutOfMemoryError: CUDA out of memory. Tried to allocate more than 1EB memory.
(RayWorkerWrapper pid=4133404) [Gloo] Rank 4 is connected to 7 peer ranks. Expected number of connected peer ranks is : 7 [repeated 6x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/user-guides/configure-logging.html#log-deduplication for more options.)
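
For context, the failing call is the all_gather_object of per-rank token counts in _gather_tp_sizes. Below is a sketch of that pattern next to a fixed-size tensor gather; it is illustrative only and is not the fix applied in this PR. torch's all_gather_object pickles the object and stages it through a tensor resized to the largest gathered payload (the resize_(max_object_size) in the traceback), so if ranks in the group fall out of step and exchange garbage sizes, that resize can request an absurd amount of memory; gathering the int through a preallocated int64 tensor keeps the allocation at world_size * 8 bytes regardless.

import torch
import torch.distributed as dist

# Assumes an initialized process group; `group` is the gather group used for the DP/TP metadata.

def gather_seqlens_object(seqlen: int, group) -> list:
    # Object-based gather, as in the traceback: pickles `seqlen`, then resizes a
    # staging tensor to the largest gathered object before the collective.
    sizes = [None] * dist.get_world_size(group)
    dist.all_gather_object(sizes, seqlen, group=group)
    return sizes

def gather_seqlens_tensor(seqlen: int, group) -> list:
    # Tensor-based gather: fixed-size int64 buffer, no pickling, no data-dependent resize.
    local = torch.tensor([seqlen], dtype=torch.int64, device='cuda')
    out = torch.empty(dist.get_world_size(group), dtype=torch.int64, device='cuda')
    dist.all_gather_into_tensor(out, local, group=group)
    return out.tolist()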

@lvhan028 (Collaborator)

When serving the Qwen/Qwen3-235B-A22B-FP8 model, an unbalanced CUDA memory occupation is observed. In contrast, this issue is not present with the Qwen/Qwen3-235B-A22B model.

lmdeploy serve api_server Qwen/Qwen3-235B-A22B-FP8 --dp 2 --tp 8 --backend pytorch --proxy-url http://0.0.0.0:8000
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 570.133.20             Driver Version: 570.133.20     CUDA Version: 12.8     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA L20Y                    On  |   00000000:19:00.0 Off |                    0 |
| N/A   32C    P0            132W /  700W |   73311MiB /  81559MiB |      2%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   1  NVIDIA L20Y                    On  |   00000000:3B:00.0 Off |                    0 |
| N/A   29C    P0            130W /  700W |   73309MiB /  81559MiB |      1%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   2  NVIDIA L20Y                    On  |   00000000:4C:00.0 Off |                    0 |
| N/A   29C    P0            132W /  700W |   73311MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   3  NVIDIA L20Y                    On  |   00000000:5D:00.0 Off |                    0 |
| N/A   30C    P0            131W /  700W |   73311MiB /  81559MiB |      1%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   4  NVIDIA L20Y                    On  |   00000000:8B:00.0 Off |                    0 |
| N/A   30C    P0            127W /  700W |   69779MiB /  81559MiB |      1%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   5  NVIDIA L20Y                    On  |   00000000:D6:00.0 Off |                    0 |
| N/A   30C    P0            124W /  700W |   69779MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   6  NVIDIA L20Y                    On  |   00000000:DD:00.0 Off |                    0 |
| N/A   31C    P0            130W /  700W |   69779MiB /  81559MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
|   7  NVIDIA L20Y                    On  |   00000000:E4:00.0 Off |                    0 |
| N/A   29C    P0            135W /  700W |   69779MiB /  81559MiB |      1%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

@grimoire (Collaborator, Author)

> When serving the Qwen/Qwen3-235B-A22B-FP8 model, an unbalanced CUDA memory occupation is observed.

The FFN dim is 1536 = 128 x 12, where 128 is the FP8 block size; 12 blocks cannot be split evenly across 8 ranks.
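
The arithmetic behind the imbalance, as a sketch (the "first ranks take the extra block" placement below is an assumption, but it matches the nvidia-smi output above, where GPUs 0-3 sit higher than GPUs 4-7): with block-wise FP8 quantization the FFN dimension can only be cut on 128-channel block boundaries, so 12 blocks land unevenly on 8 ranks.

def split_blocks(num_blocks: int, tp_size: int) -> list:
    # Distribute blocks as evenly as the block granularity allows:
    # the first (num_blocks % tp_size) ranks each take one extra block.
    base, rem = divmod(num_blocks, tp_size)
    return [base + (1 if r < rem else 0) for r in range(tp_size)]

blocks_per_rank = split_blocks(1536 // 128, 8)
print(blocks_per_rank)                     # [2, 2, 2, 2, 1, 1, 1, 1]
print([b * 128 for b in blocks_per_rank])  # [256, 256, 256, 256, 128, 128, 128, 128]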

@lvhan028 (Collaborator) commented Nov 11, 2025

Evaluation test failed.

lmdeploy serve api_server Qwen/Qwen3-235B-A22B-Thinking-2507 --tp 8
dataset                       version    metric                      mode    qwen3-235b-thinking-2507
----------------------------  ---------  --------------------------  ------  --------------------------
core_average                  -          -                           -       -
                              -          -                           -       -
Instruction Following         -          -                           -       -
IFEval                        -          -                           -       -
                              -          -                           -       -
General Reasoning             -          -                           -       -
hle_llmjudge                  -          -                           -       -
GPQA_diamond_repeat_4         772ea0     accuracy (4 runs average)   gen     80.93
                              -          -                           -       -
Math Calculation              -          -                           -       -
aime2025_repeat_32            5e9f4f     accuracy (32 runs average)  gen     80.52

https://rank.opencompass.org.cn/leaderboard-llm-academic/?m=REALTIME

Leaderboard reference for Qwen3-235B-A22B-Thinking-2507: GPQA 79.8, AIME2025 90.9.
The evaluation result can be found at shared/opencompass/oc_academic/qwen3-235b-thinking-2507/pt-tp8/20251110_223206
