Conversation

Contributor

@zminglei zminglei commented Dec 5, 2025

Motivation

Support piecewise CUDA graph for Olmo models.

Before the change:

python3 -m sglang.launch_server --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ --enable-piecewise-cuda-graph

Capturing num tokens (num_tokens=3968 avail_mem=0.05 GB):   2%|██▎                                                                                                                                 | 1/58 [00:00<00:56,  1.01it/s]Traceback (most recent call last):
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.3", line 11, in forward
    linear_1 = torch._C._nn.linear(add, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, None);  l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = None
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 420.00 MiB. GPU 0 has a total capacity of 139.72 GiB of which 55.94 MiB is free. Process 3869098 has 139.65 GiB memory in use. Of the allocated memory 137.40 GiB is allocated by PyTorch, with 1.78 GiB allocated in private pools (e.g., CUDA Graphs), and 623.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
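As an aside, and orthogonal to this PR's actual fix, the OOM message above suggests an allocator setting that can reduce fragmentation. This is a sketch based only on the PyTorch error text, reusing the launch command from this PR:

```shell
# Suggested by the PyTorch OOM message itself (not part of this PR's fix):
# expandable segments let the CUDA caching allocator grow segments in place,
# reducing fragmentation when reserved-but-unallocated memory is large.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python3 -m sglang.launch_server \
    --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ \
    --enable-piecewise-cuda-graph
```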

After the change:

python3 -m sglang.launch_server --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ --enable-piecewise-cuda-graph

Computation graph saved to /home/jobuser/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1764912424.387415.py
Capturing num tokens (num_tokens=4 avail_mem=10.32 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:20<00:00,  2.78it/s]
[2025-12-05 05:27:27] Capture piecewise CUDA graph end. Time elapsed: 30.80 s. mem usage=2.39 GB. avail mem=10.30 GB.
[2025-12-05 05:20:17] INFO:     127.0.0.1:58900 - "POST /generate HTTP/1.1" 200 OK
[2025-12-05 05:20:17] The server is fired up and ready to roll!

Modifications

Accuracy Tests

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.845
Invalid: 0.005
Latency: 16.445 s
Output throughput: 1371.533 token/s

Accuracy matches the run without piecewise CUDA graph enabled.

Benchmarking and Profiling

python3 -m sglang.bench_serving --backend sglang --dataset-name random-ids --num-prompts 1 --random-input-len 1024 --random-output-len 1

Without piecewise CUDA graph:

---------------Time to First Token----------------
Mean TTFT (ms):                          40.35     
Median TTFT (ms):                        40.35     
P99 TTFT (ms):                           40.35      

With piecewise CUDA graph:

---------------Time to First Token----------------
Mean TTFT (ms):                          30.49     
Median TTFT (ms):                        30.49     
P99 TTFT (ms):                           30.49   
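The PR does not state the relative improvement, so here is a quick derivation from the mean TTFT values reported above:

```python
# TTFT values copied from the bench_serving output above (1 prompt,
# 1024 input tokens, 1 output token); the percentage is derived, not reported.
ttft_baseline_ms = 40.35   # mean TTFT without piecewise CUDA graph
ttft_piecewise_ms = 30.49  # mean TTFT with piecewise CUDA graph

reduction_pct = (ttft_baseline_ms - ttft_piecewise_ms) / ttft_baseline_ms * 100
speedup = ttft_baseline_ms / ttft_piecewise_ms

print(f"TTFT reduction: {reduction_pct:.1f}%")  # → TTFT reduction: 24.4%
print(f"Speedup: {speedup:.2f}x")               # → Speedup: 1.32x
```

So piecewise CUDA graph cuts mean TTFT by roughly a quarter on this single-prompt workload.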

Checklist


@zminglei zminglei marked this pull request as ready for review December 5, 2025 05:33
@hebiao064
Collaborator

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit be4a3ec into sgl-project:main Dec 7, 2025
82 of 87 checks passed