Commit 29aedb6

Merge branch 'main' into sufeng-buaa/sglang-tracing-part2
2 parents e61c8bc + 590bc4b

92 files changed: +8,942 −447 lines


.github/workflows/pr-test-rust.yml (7 additions, 1 deletion)

@@ -86,7 +86,7 @@ jobs:
   pytest-rust:
     if: github.event_name != 'pull_request' || contains(github.event.pull_request.labels.*.name, 'run-ci')
     runs-on: 4-gpu-a10
-    timeout-minutes: 25
+    timeout-minutes: 32
     steps:
       - name: Checkout code
         uses: actions/checkout@v4
@@ -144,6 +144,12 @@ jobs:
           python3 -m pip --no-cache-dir install --upgrade --break-system-packages genai-bench==0.0.2
           pytest -m e2e -s -vv -o log_cli=true --log-cli-level=INFO

+      - name: Run Python E2E gRPC tests
+        run: |
+          bash scripts/killall_sglang.sh "nuk_gpus"
+          cd sgl-router
+          SHOW_ROUTER_LOGS=1 ROUTER_LOCAL_MODEL_PATH="/home/ubuntu/models" pytest py_test/e2e_grpc -s -vv -o log_cli=true --log-cli-level=INFO
+
       - name: Upload benchmark results
         if: success()
         uses: actions/upload-artifact@v4

benchmark/mmmu/data_utils.py (0 additions, 6 deletions)

@@ -75,12 +75,6 @@
 }


-# DATA SAVING
-def save_json(filename, ds):
-    with open(filename, "w") as f:
-        json.dump(ds, f, indent=4)
-
-
 def get_multi_choice_info(options):
     """
     Given the list of options for multiple choice question

docker/Dockerfile (1 addition, 9 deletions)

@@ -9,7 +9,7 @@ ARG DEEPEP_COMMIT=9af0e0d0e74f3577af1979c9b9e1ac2cad0104ee
 ARG FLASHMLA_COMMIT=1408756a88e52a25196b759eaf8db89d2b51b5a1
 ARG FAST_HADAMARD_TRANSFORM_COMMIT=7fd811c2b47f63b0b08d2582619f939e14dad77c
 ARG CMAKE_BUILD_PARALLEL_LEVEL=2
-ARG SGL_KERNEL_VERSION=0.3.15
+ARG SGL_KERNEL_VERSION=0.3.16.post3
 ENV DEBIAN_FRONTEND=noninteractive \
     CUDA_HOME=/usr/local/cuda \
     GDRCOPY_HOME=/usr/src/gdrdrv-2.4.4/ \
@@ -152,14 +152,6 @@ RUN if [ "$TARGETARCH" = "amd64" ]; then \
         pip install -v . ; \
     fi

-# Install fast-hadamard-transform
-RUN if [ "$TARGETARCH" = "amd64" ]; then \
-    git clone https://github.com/Dao-AILab/fast-hadamard-transform && \
-    cd fast-hadamard-transform && \
-    git checkout ${FAST_HADAMARD_TRANSFORM_COMMIT} && \
-    pip install . ; \
-    fi
-
 # Python tools
 RUN python3 -m pip install --no-cache-dir \
     datamodel_code_generator \

docs/advanced_features/lora.ipynb (20 additions, 0 deletions)

@@ -59,6 +59,17 @@
     "### Serving Single Adaptor"
    ]
   },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "**Note:** SGLang supports LoRA adapters through two APIs:\n",
+    "\n",
+    "1. **OpenAI-Compatible API** (`/v1/chat/completions`, `/v1/completions`): Use the `model:adapter-name` syntax. See [OpenAI API with LoRA](../basic_usage/openai_api_completions.ipynb#Using-LoRA-Adapters) for examples.\n",
+    "\n",
+    "2. **Native API** (`/generate`): Pass `lora_path` in the request body (shown below)."
+   ]
+  },
   {
    "cell_type": "code",
    "execution_count": null,
@@ -379,6 +390,15 @@
    "print(f\"Output from lora1 (updated): \\n{response.json()[1]['text']}\\n\")"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "### OpenAI-compatible API usage\n",
+   "\n",
+   "You can use LoRA adapters via the OpenAI-compatible APIs by specifying the adapter in the `model` field using the `base-model:adapter-name` syntax (for example, `qwen/qwen2.5-0.5b-instruct:adapter_a`). For more details and examples, see the “Using LoRA Adapters” section in the OpenAI API documentation: [openai_api_completions.ipynb](../basic_usage/openai_api_completions.ipynb).\n"
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,

docs/advanced_features/server_arguments.md (2 additions, 0 deletions)

@@ -228,6 +228,8 @@ Please consult the documentation below and [server_args.py](https://github.com/s
 | `--sampling-backend` | Choose the kernels for sampling layers. | None |
 | `--grammar-backend` | Choose the backend for grammar-guided decoding. | None |
 | `--mm-attention-backend` | Set multimodal attention backend. | None |
+| `--nsa-prefill-backend` | Prefill attention implementation for the NSA backend. | `flashmla_sparse` |
+| `--nsa-decode-backend` | Decode attention implementation for the NSA backend. | `flashmla_kv` |

 ## Speculative decoding
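These two flags only matter when the NSA (native sparse attention) backend is active. A hypothetical launch line showing them together; selecting NSA via `--attention-backend nsa` and the model placeholder are assumptions, not part of this hunk:

```
python3 -m sglang.launch_server --model <nsa-capable-model> \
    --attention-backend nsa \
    --nsa-prefill-backend flashmla_sparse \
    --nsa-decode-backend flashmla_kv
```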

docs/basic_usage/deepseek.md (38 additions, 0 deletions)

@@ -235,6 +235,44 @@ Important Notes:
 2. To receive more consistent tool call results, it is recommended to use `--chat-template examples/chat_template/tool_chat_template_deepseekv3.jinja`. It provides an improved unified prompt.


+### Thinking Budget for DeepSeek R1
+
+In SGLang, a thinking budget can be implemented with `CustomLogitProcessor`.
+
+Launch a server with the `--enable-custom-logit-processor` flag:
+
+```
+python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --port 30000 --host 0.0.0.0 --mem-fraction-static 0.9 --disable-cuda-graph --reasoning-parser deepseek-r1 --enable-custom-logit-processor
+```
+
+Sample request:
+
+```python
+import openai
+from rich.pretty import pprint
+from sglang.srt.sampling.custom_logit_processor import DeepSeekR1ThinkingBudgetLogitProcessor
+
+
+client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="*")
+response = client.chat.completions.create(
+    model="deepseek-ai/DeepSeek-R1",
+    messages=[
+        {
+            "role": "user",
+            "content": "Question: Is Paris the Capital of France?",
+        }
+    ],
+    max_tokens=1024,
+    extra_body={
+        "custom_logit_processor": DeepSeekR1ThinkingBudgetLogitProcessor().to_str(),
+        "custom_params": {
+            "thinking_budget": 512,
+        },
+    },
+)
+pprint(response)
+```
+
 ## FAQ

 **Q: Model loading is taking too long, and I'm encountering an NCCL timeout. What should I do?**
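For intuition, a thinking-budget processor steers sampling rather than truncating text: once the reasoning segment has used up its budget, it masks the logits so the model must emit the end-of-thinking token and move on to the answer. A minimal sketch of that idea (this is not SGLang's actual `DeepSeekR1ThinkingBudgetLogitProcessor`; the token id is a placeholder assumption):

```python
import torch

END_OF_THINKING_TOKEN_ID = 128798  # placeholder; the real id depends on the tokenizer


def apply_thinking_budget(
    logits: torch.Tensor, tokens_in_thinking: int, budget: int
) -> torch.Tensor:
    """Force the end-of-thinking token once the budget is spent.

    logits: [vocab_size] next-token scores for one request.
    tokens_in_thinking: tokens generated so far inside the <think> block.
    budget: maximum number of thinking tokens allowed.
    """
    if tokens_in_thinking >= budget:
        # Mask everything except the end-of-thinking token, so the model
        # is forced to close its reasoning and start the final answer.
        forced = torch.full_like(logits, float("-inf"))
        forced[END_OF_THINKING_TOKEN_ID] = 0.0
        return forced
    return logits
```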

docs/basic_usage/openai_api_completions.ipynb (44 additions, 0 deletions)

@@ -361,6 +361,50 @@
    "For OpenAI compatible structured outputs API, refer to [Structured Outputs](../advanced_features/structured_outputs.ipynb) for more details.\n"
   ]
  },
+ {
+  "cell_type": "markdown",
+  "metadata": {},
+  "source": [
+   "## Using LoRA Adapters\n",
+   "\n",
+   "SGLang supports LoRA (Low-Rank Adaptation) adapters with OpenAI-compatible APIs. You can specify which adapter to use directly in the `model` parameter using the `base-model:adapter-name` syntax.\n",
+   "\n",
+   "**Server Setup:**\n",
+   "```bash\n",
+   "python -m sglang.launch_server \\\n",
+   "    --model-path qwen/qwen2.5-0.5b-instruct \\\n",
+   "    --enable-lora \\\n",
+   "    --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b\n",
+   "```\n",
+   "\n",
+   "For more details on LoRA serving configuration, see the [LoRA documentation](../advanced_features/lora.ipynb).\n",
+   "\n",
+   "**API Call:**\n",
+   "\n",
+   "(Recommended) Use the `model:adapter` syntax to specify which adapter to use:\n",
+   "```python\n",
+   "response = client.chat.completions.create(\n",
+   "    model=\"qwen/qwen2.5-0.5b-instruct:adapter_a\",  # ← base-model:adapter-name\n",
+   "    messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
+   "    max_tokens=50,\n",
+   ")\n",
+   "```\n",
+   "\n",
+   "**Backward Compatible: Using `extra_body`**\n",
+   "\n",
+   "The old `extra_body` method is still supported for backward compatibility:\n",
+   "```python\n",
+   "# Backward compatible method\n",
+   "response = client.chat.completions.create(\n",
+   "    model=\"qwen/qwen2.5-0.5b-instruct\",\n",
+   "    messages=[{\"role\": \"user\", \"content\": \"Convert to SQL: show all users\"}],\n",
+   "    extra_body={\"lora_path\": \"adapter_a\"},  # ← old method\n",
+   "    max_tokens=50,\n",
+   ")\n",
+   "```\n",
+   "**Note:** When both `model:adapter` and `extra_body[\"lora_path\"]` are specified, the `model:adapter` syntax takes precedence."
+  ]
+ },
  {
   "cell_type": "code",
   "execution_count": null,

docs/basic_usage/sampling_params.md (24 additions, 0 deletions)

@@ -319,3 +319,27 @@ response = requests.post(
 )
 print(response.json())
 ```
+
+Send an OpenAI chat completion request:
+
+```python
+import openai
+from sglang.utils import print_highlight
+
+client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")
+
+response = client.chat.completions.create(
+    model="meta-llama/Meta-Llama-3-8B-Instruct",
+    messages=[
+        {"role": "user", "content": "List 3 countries and their capitals."},
+    ],
+    temperature=0.0,
+    max_tokens=32,
+    extra_body={
+        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
+        "custom_params": {"token_id": 5},
+    },
+)
+
+print_highlight(f"Response: {response}")
+```
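`DeterministicLogitProcessor` is referenced but not defined in this hunk; it is presumably introduced earlier in sampling_params.md. As a hedged sketch of what such a processor looks like (the `CustomLogitProcessor` subclassing interface shown here is an assumption about SGLang's API):

```python
from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor


class DeterministicLogitProcessor(CustomLogitProcessor):
    """Sketch: always sample the token id given in custom_params."""

    def __call__(self, logits, custom_param_list):
        # One custom_params dict per request in the batch.
        assert logits.shape[0] == len(custom_param_list)
        for i, params in enumerate(custom_param_list):
            logits[i, :] = -float("inf")         # mask every token...
            logits[i, params["token_id"]] = 0.0  # ...except the forced one
        return logits
```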

examples/runtime/lora.py (67 additions, 37 deletions)

@@ -1,37 +1,67 @@
-# launch server
-# python -m sglang.launch_server --model mistralai/Mistral-7B-Instruct-v0.3 --lora-paths /home/ying/test_lora lora1=/home/ying/test_lora_1 lora2=/home/ying/test_lora_2 --disable-radix --disable-cuda-graph --max-loras-per-batch 4
-
-# send requests
-# lora_path[i] specifies the LoRA used for text[i], so make sure they have the same length
-# use None to specify base-only prompt, e.g. "lora_path": [None, "/home/ying/test_lora"]
-import json
-
-import requests
-
-url = "http://127.0.0.1:30000"
-json_data = {
-    "text": [
-        "prompt 1",
-        "prompt 2",
-        "prompt 3",
-        "prompt 4",
-        "prompt 5",
-        "prompt 6",
-        "prompt 7",
-    ],
-    "sampling_params": {"max_new_tokens": 32},
-    "lora_path": [
-        "/home/ying/test_lora",
-        "lora1",
-        "lora2",
-        "lora1",
-        "lora2",
-        None,
-        None,
-    ],
-}
-response = requests.post(
-    url + "/generate",
-    json=json_data,
-)
-print(json.dumps(response.json()))
+"""
+OpenAI-compatible LoRA adapter usage with SGLang.
+
+Server Setup:
+    python -m sglang.launch_server \\
+        --model meta-llama/Llama-3.1-8B-Instruct \\
+        --enable-lora \\
+        --lora-paths sql=/path/to/sql python=/path/to/python
+"""
+
+import openai
+
+client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")
+
+
+def main():
+    print("SGLang OpenAI-Compatible LoRA Examples\n")
+
+    # Example 1: NEW - adapter in the model parameter (OpenAI-compatible)
+    print("1. Chat with LoRA adapter in model parameter:")
+    response = client.chat.completions.create(
+        model="meta-llama/Llama-3.1-8B-Instruct:sql",  # ← model:adapter syntax
+        messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
+        max_tokens=50,
+    )
+    print(f"   Response: {response.choices[0].message.content}\n")
+
+    # Example 2: Completions API with adapter
+    print("2. Completion with LoRA adapter:")
+    response = client.completions.create(
+        model="meta-llama/Llama-3.1-8B-Instruct:python",
+        prompt="def fibonacci(n):",
+        max_tokens=50,
+    )
+    print(f"   Response: {response.choices[0].text}\n")

+    # Example 3: OLD - backward compatible with explicit lora_path
+    print("3. Backward compatible (explicit lora_path):")
+    response = client.chat.completions.create(
+        model="meta-llama/Llama-3.1-8B-Instruct",
+        messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
+        extra_body={"lora_path": "sql"},
+        max_tokens=50,
+    )
+    print(f"   Response: {response.choices[0].message.content}\n")
+
+    # Example 4: Base model (no adapter)
+    print("4. Base model without adapter:")
+    response = client.chat.completions.create(
+        model="meta-llama/Llama-3.1-8B-Instruct",
+        messages=[{"role": "user", "content": "Hello!"}],
+        max_tokens=30,
+    )
+    print(f"   Response: {response.choices[0].message.content}\n")
+
+    print("All examples completed!")
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    except Exception as e:
+        print(f"Error: {e}")
+        print(
+            "\nEnsure server is running:\n"
+            "  python -m sglang.launch_server --model ... --enable-lora --lora-paths ..."
+        )
examples/runtime/multimodal/llava_onevision_server.py (2 additions, 2 deletions)

@@ -6,14 +6,14 @@
 python3 llava_onevision_server.py
 """

-import base64
 import io
 import os
 import sys
 import time

 import numpy as np
 import openai
+import pybase64
 import requests
 from decord import VideoReader, cpu
 from PIL import Image
@@ -213,7 +213,7 @@ def prepare_video_messages(video_path):
         pil_img = Image.fromarray(frame)
         buff = io.BytesIO()
         pil_img.save(buff, format="JPEG")
-        base64_str = base64.b64encode(buff.getvalue()).decode("utf-8")
+        base64_str = pybase64.b64encode(buff.getvalue()).decode("utf-8")
         base64_frames.append(base64_str)

     messages = [{"role": "user", "content": []}]
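`pybase64` is an API-compatible reimplementation of the standard `base64` module with SIMD acceleration, so this is a drop-in swap on the frame-encoding hot path. A quick equivalence check:

```python
import base64

import pybase64

data = b"\xff\xd8\xff\xe0 fake JPEG bytes"
# Identical output and the same .decode("utf-8") round-trip as the stdlib.
assert pybase64.b64encode(data) == base64.b64encode(data)
print(pybase64.b64encode(data).decode("utf-8"))
```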
