Conversation

Contributor

@zminglei zminglei commented Dec 5, 2025

Motivation

Support piecewise CUDA graph for Olmo models.

Before the change:

python3 -m sglang.launch_server --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ --enable-piecewise-cuda-graph

Capturing num tokens (num_tokens=3968 avail_mem=0.05 GB):   2%|██▎                                                                                                                                 | 1/58 [00:00<00:56,  1.01it/s]Traceback (most recent call last):
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/fx/graph_module.py", line 400, in __call__
    return super(self.cls, obj).__call__(*args, **kwargs)  # type: ignore[misc]
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/jobuser/zminglei/sglang/venv/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
  File "<eval_with_key>.3", line 11, in forward
    linear_1 = torch._C._nn.linear(add, l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_, None);  l_self_modules_layers_modules_0_modules_mlp_modules_gate_up_proj_parameters_weight_ = None
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 420.00 MiB. GPU 0 has a total capacity of 139.72 GiB of which 55.94 MiB is free. Process 3869098 has 139.65 GiB memory in use. Of the allocated memory 137.40 GiB is allocated by PyTorch, with 1.78 GiB allocated in private pools (e.g., CUDA Graphs), and 623.30 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
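As an aside, and orthogonal to this PR's actual fix, the OOM message above suggests an allocator setting that can reduce fragmentation. This is a sketch based only on the PyTorch error text, reusing the launch command from this PR:

```shell
# Suggested by the PyTorch OOM message itself (not part of this PR's fix):
# expandable segments let the CUDA caching allocator grow segments in place,
# reducing fragmentation when reserved-but-unallocated memory is large.
export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
python3 -m sglang.launch_server \
    --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ \
    --enable-piecewise-cuda-graph
```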

After the change:

python3 -m sglang.launch_server --model /shared/public/elr-models/allenai/OLMo-2-0325-32B-Instruct/5942a2f5e0bc38c2a5f5200cec2ea236d5984547/ --enable-piecewise-cuda-graph

Computation graph saved to /home/jobuser/.cache/sglang/torch_compile_cache/rank_0_0/backbone/computation_graph_1764912424.387415.py
Capturing num tokens (num_tokens=4 avail_mem=10.32 GB): 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 58/58 [00:20<00:00,  2.78it/s]
[2025-12-05 05:27:27] Capture piecewise CUDA graph end. Time elapsed: 30.80 s. mem usage=2.39 GB. avail mem=10.30 GB.
[2025-12-05 05:20:17] INFO:     127.0.0.1:58900 - "POST /generate HTTP/1.1" 200 OK
[2025-12-05 05:20:17] The server is fired up and ready to roll!

Modifications

Accuracy Tests

python benchmark/gsm8k/bench_sglang.py --data-path /shared/public/data/gsm8k/test.jsonl
Accuracy: 0.845
Invalid: 0.005
Latency: 16.445 s
Output throughput: 1371.533 token/s

Accuracy matches the run without piecewise CUDA graph enabled.

Benchmarking and Profiling

python3 -m sglang.bench_serving --backend sglang --dataset-name random-ids --num-prompts 1 --random-input-len 1024 --random-output-len 1

Without piecewise CUDA graph:

---------------Time to First Token----------------
Mean TTFT (ms):                          40.35     
Median TTFT (ms):                        40.35     
P99 TTFT (ms):                           40.35      

With piecewise CUDA graph:

---------------Time to First Token----------------
Mean TTFT (ms):                          30.49     
Median TTFT (ms):                        30.49     
P99 TTFT (ms):                           30.49   
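The PR does not state the relative improvement, so here is a quick derivation from the mean TTFT values reported above:

```python
# TTFT values copied from the bench_serving output above (1 prompt,
# 1024 input tokens, 1 output token); the percentage is derived, not reported.
ttft_baseline_ms = 40.35   # mean TTFT without piecewise CUDA graph
ttft_piecewise_ms = 30.49  # mean TTFT with piecewise CUDA graph

reduction_pct = (ttft_baseline_ms - ttft_piecewise_ms) / ttft_baseline_ms * 100
speedup = ttft_baseline_ms / ttft_piecewise_ms

print(f"TTFT reduction: {reduction_pct:.1f}%")  # → TTFT reduction: 24.4%
print(f"Speedup: {speedup:.2f}x")               # → Speedup: 1.32x
```

So piecewise CUDA graph cuts mean TTFT by roughly a quarter on this single-prompt workload.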

Checklist


@zminglei zminglei marked this pull request as ready for review December 5, 2025 05:33
@hebiao064
Collaborator

/tag-and-rerun-ci

@Fridge003 Fridge003 merged commit be4a3ec into sgl-project:main Dec 7, 2025
82 of 87 checks passed