Conversation

@raayandhar (Contributor) commented Dec 5, 2025

Motivation

Fixes #14439 and fixes #14443.

Modifications

Lowered the defaults as described here: #14443

To check that the page sizes are the same, we have to communicate this between the prefill and decode servers. I did that, and on the prefill side it will throw an error. I'm not entirely happy with this design, though, since the decode server is left hanging and the request will also hang; I think we should throw the error on both sides. Also, the check only runs once a request is sent, not during warmup / server start. I would really appreciate feedback here, as I'm sure there's a more intelligent way of doing this. I will continue working on that part.
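A simplified sketch of the check on the prefill side (helper and variable names here are illustrative, not the exact code in conn.py):

# Sketch only: the decode server sends its page_size during the KV-transfer
# bootstrap handshake, and the prefill side compares it against its own
# configuration, raising on mismatch.
def validate_page_size(prefill_page_size: int, decode_page_size: int) -> None:
    if decode_page_size != prefill_page_size:
        raise ValueError(
            f"Page size mismatch: decode server has page_size={decode_page_size}, "
            f"but prefill server has page_size={prefill_page_size}. "
            f"Both servers must use the same --page-size value."
        )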

Example (on prefill):

Exception in thread Thread-3 (bootstrap_thread):
Traceback (most recent call last):
  File "/usr/lib/python3.12/threading.py", line 1073, in _bootstrap_inner
    self.run()
  File "/usr/lib/python3.12/threading.py", line 1010, in run
    self._target(*self._args, **self._kwargs)
  File "/home/raayan/projects/sglang/python/sglang/srt/disaggregation/mooncake/conn.py", line 865, in bootstrap_thread
    raise ValueError(
ValueError: Page size mismatch: decode server has page_size=64, but prefill server has page_size=32. Both servers must use the same --page-size value.

Accuracy Tests

Benchmarking and Profiling

Checklist


@raayandhar (Contributor Author)

Also, I'm not sure what the expected behavior should be: we throw a ValueError when they mismatch, but the server does not stop. Should we stop both servers?

dst_tp_rank: int
dst_attn_tp_size: int
dst_kv_item_len: int
page_size: int
Collaborator

Maybe this would be better?

Suggested change:
-page_size: int
+dst_page_size: int

Contributor Author

Sure, will make the change tomorrow.

dst_tp_rank=int(msg[7].decode("ascii")),
dst_attn_tp_size=int(msg[8].decode("ascii")),
dst_kv_item_len=int(msg[9].decode("ascii")),
page_size=int(msg[10].decode("ascii")),
Collaborator

ditto

@ShangmingCai (Collaborator) left a comment

The motivation has some points, but I don't think we should crash the prefill server. That could cause a chain reaction where the entire cluster crashes because a single misconfigured node was added, which is something we don't want to see. Maybe a warning would be enough.
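For illustration, the suggestion amounts to replacing the raise with a warning, along these lines (a sketch, not the actual patch):

# Sketch of the suggested behavior: warn on mismatch instead of raising,
# so one misconfigured node cannot take down the prefill server.
logger.warning(
    "Page size mismatch: decode server has page_size=%d, but prefill server "
    "has page_size=%d. Both servers must use the same --page-size value.",
    decode_page_size,
    prefill_page_size,
)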

decode_tp_size: int
decode_tp_rank: int
dst_kv_item_len: int
page_size: int
Collaborator

ditto

decode_tp_size=int(msg[8].decode("ascii")),
decode_tp_rank=int(msg[9].decode("ascii")),
dst_kv_item_len=int(msg[10].decode("ascii")),
page_size=int(msg[11].decode("ascii")),
Collaborator

ditto

Comment on lines +1391 to +1393

        f"TensorRT-LLM MLA only supports page_size of 32 or 64, changing page_size from {self.page_size} to 32."
    )
-   self.page_size = 64
+   self.page_size = 32
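Reconstructed from the diff, the surrounding guard presumably looks something like this (a sketch; the actual code may differ):

# Sketch reconstructed from the diff above; the actual surrounding code
# may differ. Unsupported page sizes now fall back to 32 instead of 64.
if self.page_size not in (32, 64):
    logger.warning(
        f"TensorRT-LLM MLA only supports page_size of 32 or 64, "
        f"changing page_size from {self.page_size} to 32."
    )
    self.page_size = 32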
Collaborator

Can you elaborate on this change?

Contributor Author

Sure, this is from issue #14443.

@raayandhar (Contributor Author) commented Dec 5, 2025

> The motivation has some points, but I don't think we should crash the prefill server. That could cause a chain reaction where the entire cluster crashes because a single misconfigured node was added, which is something we don't want to see. Maybe a warning would be enough.

Hmm, ok, makes sense. I'll change it to log a warning instead, so that if a page-size mismatch causes some downstream issue, the logs give a better idea of what happened. That's more in line with the original motivation in the issue. Also, should we log the warning on both prefill and decode? Right now it's on just prefill; I personally think that's fine, since the error relates to both servers anyway, but I defer to your judgement. I'm also on Slack if you want to discuss further.

@ShangmingCai (Collaborator)

@raayandhar I think we don't need to notify the prefill server; we can just crash the decode server at the bootstrapping step. I think this is the best way to help users identify the misconfiguration, and it won't affect performance.

@ShangmingCai (Collaborator)

Sorry, I realize that the prefill page size isn't registered with the bootstrap server, so the decode node doesn't know it.

@ShangmingCai (Collaborator)

Maybe this check covers this scenario:

# Sanity check: The data sub-slice to be sent should fit into the dst buffer.
# This means heads_bytes_per_token_to_send <= (dst_kv_item_len // page_size)
if heads_bytes_per_token_to_send > (dst_kv_item_len // page_size):
    logger.error(
        f"[{mooncake_session_id}] slice size ({heads_bytes_per_token_to_send}) exceeds "
        f"target token slot size ({dst_kv_item_len // page_size})"
    )
    return -1
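Plugging illustrative numbers into that inequality (values made up purely to show when the check fires):

# Illustrative values only; none of these are real defaults.
dst_kv_item_len = 4096               # bytes per destination KV item (assumed)
page_size = 64                       # tokens per page on the sending side (assumed)
heads_bytes_per_token_to_send = 128  # bytes per token to transfer (assumed)

token_slot_size = dst_kv_item_len // page_size          # 4096 // 64 = 64 bytes
assert heads_bytes_per_token_to_send > token_slot_size  # 128 > 64: error path taken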

@raayandhar (Contributor Author)

> Maybe this check covers this scenario: […]

Hmm, I will try it tomorrow to see what happens and understand this better. But I think at least some clearer messaging pointing to the actual issue (mismatched page sizes) could be nice.

@b8zhong (Collaborator) left a comment

When we are in the single-node scenario, we can use different attention backends for prefill and decode (which can have mismatched page sizes when set incorrectly). Can you make sure it also works in the single-node case?
