@Alexhaoge Alexhaoge commented Sep 29, 2025

Motivation

The first commit adds Ascend NPU support to the sglang.check_env script, allowing users to export their environment when raising GitHub issues or for other diagnostic purposes.

The second commit refactors the script so that vendors can write different version-check procedures without breaking each other's implementations, while avoiding naming confusion.

Modifications

We provide two implementations and request a decision from the community.

Legacy approach

commit 1624b08
Add additional branches to the existing functions to get device information and driver/compiler/toolkit versions. Details are as follows:

  • get_cuda_info:
    • Add a new function to output device names due to torch_npu interface differences.
    • Lazily add torch_npu to PACKAGE_LIST to avoid unnecessary package checks on other hardware.
  • _get_cuda_version_info: Use multiple environment variables to locate the CANN path, falling back to the default installation path if none of them works.
  • _get_nvcc_info:
    • Find the CANN toolkit version in $CANN_HOME/version.cfg. The toolkit versioning rule differs from the Bisheng compiler's (unlike CUDA).
    • Find the Bisheng compiler under the CANN path and output the first line of its version string.
  • _get_cuda_driver_version: Use npu-smi info -t board -i 0 to get the driver version, since a single driver serves all NPUs in a server.
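
The branching logic above can be sketched roughly as follows. Note this is an illustrative sketch, not the actual patch: the environment-variable names, the "Software Version" field match, and the fallback string are assumptions based on the description and the sample output below.

```python
import os
import subprocess

# Hypothetical env vars checked for the CANN installation (illustrative).
_CANN_ENV_VARS = ["ASCEND_HOME_PATH", "ASCEND_TOOLKIT_HOME"]
_CANN_DEFAULT = "/usr/local/Ascend/ascend-toolkit/latest"


def find_cann_home() -> str:
    """Locate CANN via environment variables, else the default install path."""
    for var in _CANN_ENV_VARS:
        path = os.environ.get(var)
        if path and os.path.isdir(path):
            return path
    return _CANN_DEFAULT


def get_ascend_driver_version() -> str:
    """Query the NPU driver version via npu-smi (one driver per server)."""
    try:
        out = subprocess.run(
            ["npu-smi", "info", "-t", "board", "-i", "0"],
            capture_output=True, text=True, check=True,
        ).stdout
    except (OSError, subprocess.CalledProcessError):
        return "Not Available"
    # Scan the board report for the driver's version field (field name assumed).
    for line in out.splitlines():
        if "Software Version" in line:
            return line.split(":", 1)[1].strip()
    return "Not Available"
```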

Class-based refactoring

commit 0acff70
As @Alcanderian pointed out, function names like _get_cuda_version_info become ambiguous once they cover multiple hardware types, so we propose reworking the script as follows:

  • BaseEnv: create a base environment-checker class and move common helper functions into it, such as get_package_versions, get_device_info, and get_hypervisor_vendor.
  • Each hardware type gets a checker subclass. Subclasses should implement get_info (device info) and get_topo (device topology).
  • Dispatch the env checker in __main__ and call BaseEnv.check_env() to print the environment info.
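
A minimal sketch of the proposed class layout (method bodies, the placeholder subclass, and the dispatch condition are illustrative assumptions, not the merged code):

```python
import importlib.util


class BaseEnv:
    """Base environment checker; shared helpers live here."""

    def get_package_versions(self) -> dict:
        # Common helper shared by all vendors (heavily simplified sketch).
        versions = {}
        for pkg in ("torch", "numpy"):
            spec = importlib.util.find_spec(pkg)
            versions[pkg] = "found" if spec else "Module Not Found"
        return versions

    def get_info(self) -> dict:
        raise NotImplementedError  # device / driver / compiler versions

    def get_topo(self) -> str:
        raise NotImplementedError  # device topology

    def check_env(self) -> None:
        for key, value in self.get_package_versions().items():
            print(f"{key}: {value}")
        for key, value in self.get_info().items():
            print(f"{key}: {value}")
        print(self.get_topo())


class AscendEnv(BaseEnv):
    """Ascend NPU checker subclass (bodies are placeholders)."""

    def get_info(self) -> dict:
        return {"NPU available": True}  # would query torch_npu / CANN

    def get_topo(self) -> str:
        return "Ascend Topology: ..."  # would call npu-smi


def dispatch_env_checker() -> BaseEnv:
    # Hypothetical dispatch; the real script would inspect the backend.
    if importlib.util.find_spec("torch_npu"):
        return AscendEnv()
    return BaseEnv()  # or a CudaEnv() subclass, etc.
```

This layout lets each vendor override only get_info and get_topo while package listing and printing stay in one place.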

Accuracy Tests

Script output from an Atlas 800T A2 server with the main-910b docker image:

root@w25:/home# python -m sglang.check_env
/usr/local/python3.11.13/lib/python3.11/site-packages/torch/cuda/init.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
/usr/local/python3.11.13/lib/python3.11/site-packages/torch/cuda/init.py:61: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
Python: 3.11.13 (main, Jul 26 2025, 07:27:32) [GCC 11.4.0]
NPU available: True
NPU 0,1,2,3,4,5,6,7: Ascend910B3
CANN_HOME: /usr/local/Ascend/ascend-toolkit/latest
CANN: 8.2.0.0.201:8.2.RC1
BiSheng: 2025-07-23T11:24:13+08:00 clang version 15.0.5 (clang-5c68a1cb1231 flang-5c68a1cb1231)
Ascend Driver Version: 25.2.0
PyTorch: 2.6.0+cpu
torch_npu: 2.6.0.post1
sgl-kernel-npu: 0.1.0
sglang: 0.5.3.post3
sgl_kernel: Module Not Found
flashinfer_python: Module Not Found
triton: Module Not Found
transformers: 4.57.1
torchao: 0.9.0
numpy: 1.26.4
aiohttp: 3.13.0
fastapi: 0.119.0
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.31.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 6.0.0
pydantic: 2.12.2
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.37.0
uvloop: 0.21.0
vllm: 0.8.5.post1+empty
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.12.0
anthropic: Module Not Found
litellm: Module Not Found
decord: Module Not Found
Ascend Topology:
NPU0 NPU1 NPU2 NPU3 NPU4 NPU5 NPU6 NPU7 CPU Affinity
NPU0 X HCCS HCCS HCCS HCCS HCCS HCCS HCCS 192-223
NPU1 HCCS X HCCS HCCS HCCS HCCS HCCS HCCS 192-223
NPU2 HCCS HCCS X HCCS HCCS HCCS HCCS HCCS 128-159
NPU3 HCCS HCCS HCCS X HCCS HCCS HCCS HCCS 128-159
NPU4 HCCS HCCS HCCS HCCS X HCCS HCCS HCCS 0-31
NPU5 HCCS HCCS HCCS HCCS HCCS X HCCS HCCS 0-31
NPU6 HCCS HCCS HCCS HCCS HCCS HCCS X HCCS 64-95
NPU7 HCCS HCCS HCCS HCCS HCCS HCCS HCCS X 64-95

Legend:

X = Self
SYS = Path traversing PCIe and NUMA nodes. Nodes are connected through SMP, such as QPI, UPI.
PHB = Path traversing PCIe and the PCIe host bridge of a CPU.
PIX = Path traversing a single PCIe switch
PXB = Path traversing multipul PCIe switches
HCCS = Connection traversing HCCS.
NA = Unknown relationship.

ulimit soft: 1073741816

Script output on H20 with the lmsysorg/sglang:v0.5.3.post1-cu129-amd64 docker image:

python -m sglang.check_env
/usr/local/lib/python3.12/dist-packages/torch/cuda/init.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
import pynvml # type: ignore[import]
Python: 3.12.11 (main, Jun 4 2025, 08:56:18) [GCC 11.4.0]
CUDA available: True
GPU 0,1,2,3,4,5,6,7: NVIDIA H20
GPU 0,1,2,3,4,5,6,7 Compute Capability: 9.0
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.9, V12.9.86
CUDA Driver Version: 550.127.08
PyTorch: 2.8.0+cu129
sglang: 0.5.3.post3
sgl_kernel: 0.3.15
flashinfer_python: 0.4.0
triton: 3.4.0
transformers: 4.57.1
torchao: 0.9.0
numpy: 2.3.3
aiohttp: 3.13.0
fastapi: 0.118.2
hf_transfer: 0.1.9
huggingface_hub: 0.35.3
interegular: 0.3.3
modelscope: 1.30.0
orjson: 3.11.3
outlines: 0.1.11
packaging: 25.0
psutil: 7.1.0
pydantic: 2.12.0
python-multipart: 0.0.20
pyzmq: 27.1.0
uvicorn: 0.37.0
uvloop: 0.21.0
vllm: Module Not Found
xgrammar: 0.1.25
openai: 1.99.1
tiktoken: 0.12.0
anthropic: 0.69.0
litellm: Module Not Found
decord: 0.6.0
NVIDIA Topology:
GPU0 GPU1 GPU2 GPU3 GPU4 GPU5 GPU6 GPU7 NIC0 NIC1 NIC2 NIC3 NIC4 NIC5 NIC6 NIC7 NIC8 NIC9 CPU Affinity NUMA Affinity GPU NUMA ID
GPU0 X NV18 NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE NODE PIX SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU1 NV18 X NV18 NV18 NV18 NV18 NV18 NV18 NODE NODE PIX NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU2 NV18 NV18 X NV18 NV18 NV18 NV18 NV18 NODE PIX NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU3 NV18 NV18 NV18 X NV18 NV18 NV18 NV18 PIX NODE NODE NODE SYS SYS SYS SYS SYS SYS 0-55,112-167 0 N/A
GPU4 NV18 NV18 NV18 NV18 X NV18 NV18 NV18 SYS SYS SYS SYS NODE PIX NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU5 NV18 NV18 NV18 NV18 NV18 X NV18 NV18 SYS SYS SYS SYS PIX NODE NODE NODE NODE NODE 56-111,168-223 1 N/A
GPU6 NV18 NV18 NV18 NV18 NV18 NV18 X NV18 SYS SYS SYS SYS NODE NODE NODE NODE NODE PIX 56-111,168-223 1 N/A
GPU7 NV18 NV18 NV18 NV18 NV18 NV18 NV18 X SYS SYS SYS SYS NODE NODE NODE NODE PIX NODE 56-111,168-223 1 N/A
NIC0 NODE NODE NODE PIX SYS SYS SYS SYS X NODE NODE NODE SYS SYS SYS SYS SYS SYS
NIC1 NODE NODE PIX NODE SYS SYS SYS SYS NODE X NODE NODE SYS SYS SYS SYS SYS SYS
NIC2 NODE PIX NODE NODE SYS SYS SYS SYS NODE NODE X NODE SYS SYS SYS SYS SYS SYS
NIC3 PIX NODE NODE NODE SYS SYS SYS SYS NODE NODE NODE X SYS SYS SYS SYS SYS SYS
NIC4 SYS SYS SYS SYS NODE PIX NODE NODE SYS SYS SYS SYS X NODE NODE NODE NODE NODE
NIC5 SYS SYS SYS SYS PIX NODE NODE NODE SYS SYS SYS SYS NODE X NODE NODE NODE NODE
NIC6 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE X PIX NODE NODE
NIC7 SYS SYS SYS SYS NODE NODE NODE NODE SYS SYS SYS SYS NODE NODE PIX X NODE NODE
NIC8 SYS SYS SYS SYS NODE NODE NODE PIX SYS SYS SYS SYS NODE NODE NODE NODE X NODE
NIC9 SYS SYS SYS SYS NODE NODE PIX NODE SYS SYS SYS SYS NODE NODE NODE NODE NODE X

Legend:

X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks

NIC Legend:

NIC0: mlx5_0
NIC1: mlx5_1
NIC2: mlx5_2
NIC3: mlx5_3
NIC4: mlx5_4
NIC5: mlx5_5
NIC6: mlx5_6
NIC7: mlx5_7
NIC8: mlx5_8
NIC9: mlx5_9

ulimit soft: 1048576

Benchmarking and Profiling

Not applicable

@sglang-bot sglang-bot merged commit c550ab9 into sgl-project:main Nov 2, 2025
115 of 136 checks passed