@ash-sigh ash-sigh commented Oct 10, 2025

Motivation

Bug fixes for Ascend NPU.

Modifications

  1. Use torch_npu.npu_scatter_nd_update_ instead of the deprecated torch_npu._npu_reshape_and_cache (see "[Bug] [Ascend] Launching Qwen3-VL-30B-A3B-Instruct got operator error", #11374).
  2. Run image_processor on the CPU, because transformers has some limitations on Ascend.
  3. Fix the NPU graph index_head_dim AttributeError caused by the DS-V3.2 changes (see "Support DeepSeek V3.2 Exp", #11061).
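
For reviewers unfamiliar with the first change, here is a rough sketch of what a scatter-ND style in-place update does to a KV cache: each new token's row is written into the slot named by its index. This is not the torch_npu implementation; plain Python lists stand in for tensors, and the helper name `scatter_nd_update_inplace` and the toy shapes are hypothetical.

```python
def scatter_nd_update_inplace(cache, indices, updates):
    """Write updates[i] into cache[indices[i][0]], modifying cache in place.

    Hypothetical stand-in for a scatter-ND update along the first dimension:
    `cache` is a list of rows, `indices` is a list of 1-element index lists,
    and `updates` holds one row per index.
    """
    if len(indices) != len(updates):
        raise ValueError("indices and updates must have the same length")
    for (slot,), row in zip(indices, updates):
        if not 0 <= slot < len(cache):
            raise IndexError(f"slot {slot} is outside the cache")
        cache[slot] = list(row)  # copy so the cache does not alias the input
    return cache


# Toy cache with 6 slots and a head dimension of 2 (illustrative sizes only).
key_cache = [[0.0, 0.0] for _ in range(6)]
slot_indices = [[4], [1]]               # where each new token should land
new_keys = [[1.0, 2.0], [3.0, 4.0]]     # one key row per new token

scatter_nd_update_inplace(key_cache, slot_indices, new_keys)
print(key_cache)
```

Slots that are not named in `slot_indices` keep their previous contents, which is the property the cache write relies on.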

Accuracy Tests

Server launch script

export HCCL_OP_EXPANSION_MODE="AIV"
#export CPU_AFFINITY_CONF=1,npu0:192-223,npu1:192-223,npu2:128-159,npu3:128-15
export STREAMS_PER_DEVICE=32

python -m sglang.launch_server \
    --model-path /model/Qwen3-VL-30B-A3B-Instruct/ \
    --tp-size 2  --device npu \
    --attention-backend ascend \
    --mm-attention-backend ascend_attn \
    --trust-remote-code

Accuracy test command

python3 -m sglang.test.few_shot_gsm8k --num-questions 200

Accuracy test result

Downloading from https://raw.githubusercontent.com/openai/grade-school-math/master/grade_school_math/data/test.jsonl to /tmp/test.jsonl
/tmp/test.jsonl: 732kB [00:07, 101kB/s]
100%|███████████████████████████████████████████████████████████████████| 200/200 [00:44<00:00, 4.51it/s]
Accuracy: 0.945
Invalid: 0.000
Latency: 44.533 s
Output throughput: 646.760 token/s

Benchmarking and Profiling

python -m sglang.bench_serving --backend sglang --num-prompts 10 --dataset-path ShareGPT_V3_unfiltered_cleaned_split.json

#Input tokens: 1997
#Output tokens: 2798
Starting warmup with 1 sequences...
Warmup completed with 1 sequences. Starting main benchmark run...
100%|███████████████████████████████████████████████| 10/10 [00:22<00:00, 2.20s/it]

============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 10
Benchmark duration (s): 22.04
Total input tokens: 1997
Total input text tokens: 1997
Total input vision tokens: 0
Total generated tokens: 2798
Total generated tokens (retokenized): 2783
Request throughput (req/s): 0.45
Input token throughput (tok/s): 90.60
Output token throughput (tok/s): 126.94
Total token throughput (tok/s): 217.54
Concurrency: 5.93
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 13079.63
Median E2E Latency (ms): 14950.25
---------------Time to First Token----------------
Mean TTFT (ms): 404.65
Median TTFT (ms): 403.89
P99 TTFT (ms): 450.20
---------------Inter-Token Latency----------------
Mean ITL (ms): 45.50
Median ITL (ms): 46.50
P95 ITL (ms): 50.18
P99 ITL (ms): 52.89
Max ITL (ms): 130.14
==================================================
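
As a sanity check on the table above, the throughput figures can be approximately recomputed from the token counts and the benchmark duration. The sketch below assumes throughput is simply count divided by duration and that concurrency is the summed end-to-end latency divided by duration; small differences from the reported values come from the duration being rounded to two decimals in the log.

```python
# Values copied from the Serving Benchmark Result above.
duration_s = 22.04
num_requests = 10
input_tokens = 1997
output_tokens = 2798
mean_e2e_ms = 13079.63

# Assumed definitions: each throughput is a count divided by wall-clock duration.
request_throughput = num_requests / duration_s
input_throughput = input_tokens / duration_s
output_throughput = output_tokens / duration_s
total_throughput = (input_tokens + output_tokens) / duration_s

# Assumed definition: total time spent inside requests divided by duration.
concurrency = (mean_e2e_ms / 1000.0) * num_requests / duration_s

print(f"Request throughput (req/s): {request_throughput:.2f}")
print(f"Input token throughput (tok/s): {input_throughput:.2f}")
print(f"Output token throughput (tok/s): {output_throughput:.2f}")
print(f"Total token throughput (tok/s): {total_throughput:.2f}")
print(f"Concurrency: {concurrency:.2f}")
```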

@ash-sigh ash-sigh changed the title from "Bug fix for Ascend npu" to "bug fix for Ascend npu" Oct 10, 2025
@ash-sigh ash-sigh closed this Oct 27, 2025