Conversation

@raayandhar (Contributor) commented Dec 5, 2025

Motivation

Fixes #14439 and fixes #14443.

Modifications

Lowered the defaults as described here: #14443

To check that the page sizes are the same, we have to communicate this between the prefill and decode servers. I did that, and on the prefill side it will throw an error. I'm not entirely happy with this design, though, since the decode server is left hanging and the request will also hang; I think we should throw the error on both sides. Also, the check only runs once a request is sent, not during warmup / server start. I would really appreciate feedback here, as I'm sure there's a more intelligent way of doing this. I will continue working on that part.
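A simplified sketch of the check on the prefill side (helper and variable names here are illustrative, not the exact code in conn.py):

# Sketch only: the decode server sends its page_size during the KV-transfer
# bootstrap handshake, and the prefill side compares it against its own
# configuration, raising on mismatch.
def validate_page_size(prefill_page_size: int, decode_page_size: int) -> None:
    if decode_page_size != prefill_page_size:
        raise ValueError(
            f"Page size mismatch: decode server has page_size={decode_page_size}, "
            f"but prefill server has page_size={prefill_page_size}. "
            f"Both servers must use the same --page-size value."
        )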

Example (on prefill):

Exception in thread Thread-3 (bootstrap_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/home/raayan/projects/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 865, in bootstrap_thread
    raise ValueError(
ValueError: Page size mismatch: decode server has page_size=64, but prefill server has page_size=32. Both servers must use the same --page-size value.

Accuracy Tests

Benchmarking and Profiling

Checklist


@raayandhar (Contributor Author)

Also, I'm not sure what the expected behavior should be: we throw a ValueError when they mismatch, but the server does not stop. Should we stop both servers?

dst_tp_rank: int
dst_attn_tp_size: int
dst_kv_item_len: int
page_size: int
Collaborator

Maybe this would be better?

Suggested change:
-page_size: int
+dst_page_size: int

Contributor Author

Sure, will make the change tomorrow.

dst_tp_rank=int(msg[7].decode("ascii")),
dst_attn_tp_size=int(msg[8].decode("ascii")),
dst_kv_item_len=int(msg[9].decode("ascii")),
page_size=int(msg[10].decode("ascii")),
Collaborator

ditto

@ShangmingCai (Collaborator) left a comment

The motivation has some points, but I don't think we should crash the prefill server. That could cause a chain reaction where the entire cluster crashes because a single misconfigured node was added, which is something we don't want to see. Maybe a warning would be enough.
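For illustration, the suggestion amounts to replacing the raise with a warning, along these lines (a sketch, not the actual patch):

# Sketch of the suggested behavior: warn on mismatch instead of raising,
# so one misconfigured node cannot take down the prefill server.
logger.warning(
    "Page size mismatch: decode server has page_size=%d, but prefill server "
    "has page_size=%d. Both servers must use the same --page-size value.",
    decode_page_size,
    prefill_page_size,
)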

decode_tp_size: int
decode_tp_rank: int
dst_kv_item_len: int
page_size: int
Collaborator

ditto

decode_tp_size=int(msg[8].decode("ascii")),
decode_tp_rank=int(msg[9].decode("ascii")),
dst_kv_item_len=int(msg[10].decode("ascii")),
page_size=int(msg[11].decode("ascii")),
Collaborator

ditto

Comment on lines +1391 to +1393

        f"TensorRT-LLM MLA only supports page_size of 32 or 64, changing page_size from {self.page_size} to 32."
    )
-   self.page_size = 64
+   self.page_size = 32
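Reconstructed from the diff, the surrounding guard presumably looks something like this (a sketch; the actual code may differ):

# Sketch reconstructed from the diff above; the actual surrounding code
# may differ. Unsupported page sizes now fall back to 32 instead of 64.
if self.page_size not in (32, 64):
    logger.warning(
        f"TensorRT-LLM MLA only supports page_size of 32 or 64, "
        f"changing page_size from {self.page_size} to 32."
    )
    self.page_size = 32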
Collaborator

Can you elaborate on this change?

Contributor Author

Sure, this is from issue #14443.

@raayandhar (Contributor Author) commented Dec 5, 2025

> The motivation has some points, but I don't think we should crash the prefill server. That could cause a chain reaction where the entire cluster crashes because a single misconfigured node was added, which is something we don't want to see. Maybe a warning would be enough.

Hmm, ok, makes sense. I'll change it to log a warning instead, so that if a page-size mismatch causes some downstream issue, the logs give a better idea of what happened. That's more in line with the original motivation in the issue. Also, should we log the warning on both prefill and decode? Right now it's on just prefill; I personally think that's fine, since the error relates to both servers anyway, but I defer to your judgement. I'm also on Slack if you want to discuss further.

@ShangmingCai (Collaborator)

@raayandhar I think we don't need to notify the prefill server; we can just crash the decode server at the bootstrapping step. I think this is the best way to help users identify the misconfiguration, and it won't affect performance.

@ShangmingCai (Collaborator)

Sorry, I realize that the prefill page size isn't registered with the bootstrap server, so the decode node doesn't know it.

@ShangmingCai (Collaborator)

Maybe this check covers this scenario:

# Sanity check: The data sub-slice to be sent should fit into the dst buffer.
# This means heads_bytes_per_token_to_send <= (dst_kv_item_len // page_size)
if heads_bytes_per_token_to_send > (dst_kv_item_len // page_size):
    logger.error(
        f"[{mooncake_session_id}] slice size ({heads_bytes_per_token_to_send}) exceeds "
        f"target token slot size ({dst_kv_item_len // page_size})"
    )
    return -1
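Plugging illustrative numbers into that inequality (values made up purely to show when the check fires):

# Illustrative values only; none of these are real defaults.
dst_kv_item_len = 4096               # bytes per destination KV item (assumed)
page_size = 64                       # tokens per page on the sending side (assumed)
heads_bytes_per_token_to_send = 128  # bytes per token to transfer (assumed)

token_slot_size = dst_kv_item_len // page_size          # 4096 // 64 = 64 bytes
assert heads_bytes_per_token_to_send > token_slot_size  # 128 > 64: error path taken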

@raayandhar (Contributor Author)

> Maybe this check covers this scenario: […]

Hmm, I will try it tomorrow to see what happens and understand this better. But I think at least some clearer messaging pointing to the actual issue (mismatched page sizes) could be nice.

@b8zhong (Collaborator) left a comment

When we are in the single-node scenario, we can use different attention backends for prefill and decode (which can have mismatched page sizes when set incorrectly). Can you make sure it also works in the single-node case?
