-
Notifications
You must be signed in to change notification settings - Fork 3.3k
[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 #10804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Feature] Sglang Tracing: Fine-Grained Tracking for Request Latency - Part 2 #10804
Conversation
Summary of ChangesHello @sufeng-buaa, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request represents the second phase of implementing a comprehensive tracing feature in Sglang, focusing on enhancing observability for distributed request processing. It introduces a more robust four-level OpenTelemetry span structure and explicit trace context propagation, particularly for Prefill/Decode disaggregation and Data Parallel attention. These changes are designed to provide developers with deeper insights into request flow and performance bottlenecks across various Sglang components, laying the groundwork for more advanced debugging and optimization in complex, distributed environments. Highlights
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code Review
This pull request significantly enhances the tracing capabilities by adding support for PD disaggregation and DP attention scenarios. It introduces a more robust four-level span hierarchy to better track request latency across different nodes and threads, which is a great improvement for observability. The changes are well-structured, and the documentation has been updated accordingly. I've found one critical bug that would cause a runtime error and have also suggested a refactoring to reduce code duplication in one of the files. Overall, this is a solid contribution to improving the observability of the system.
98bef23 to
6e2ec1d
Compare
e837462 to
479baef
Compare
|
Hi @ishandhanani , could you please review this PR when you have a moment? Thanks! |
479baef to
f58e846
Compare
|
hello! I have a question, I've looked at both #9962 and this PR but it's unclear to me if there's support for receiving trace/span IDs via the EDIT: looks like not. just made a quick patch for it in #11074 |
|
Please also pay extreme attention to overhead #9962 (comment) |
Is this PR #10808 what you mean? |
OK, thank you for the review. I will fix it as soon as possible. |
5bafa93 to
6f04110
Compare
6f04110 to
d2fe22c
Compare
@merrymercy I have updated the code according to your suggestions. Could you please take a look again? |
|
Hi @sufeng-buaa - curious on if you've looked at https://grafana.com/oss/tempo/. We use grafana for our metrics in SGL already and tempo plays well with the grafana stack. No need to change/update anything just sharing in case it's useful :) |
Another approach is enable sampling, unfortunately, the Python SDK doesn't support this feature, maybe we can do this in SGLang application layer. |
can be considered in next patch. |
29aedb6 to
37357b9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
CC: @slin1237 Do you have time to take another look since there are some modifications in sglang_router as well? |
96f94fb to
9a26634
Compare
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
Signed-off-by: Feng Su <[email protected]>
9a26634 to
ddeddc5
Compare
… Part 2 (#10804) Signed-off-by: Feng Su <[email protected]>
|
@slin1237 this break router image. in router image there is no sglang.srt. got import error
|
Indeed, it will cause an import error if the image not installed sglang. |
Motivation
The PR is response to #8965. For details on the motivation and visual output, please refer to the issue.
Modifications
The PR have split the patch into two parts.
First part is here: #9962, which has been merged.
This is the second part, which include:
Router tracing is currently only implemented for mini_lb, since it is implemented in Python. Support for the Rust-based router will be added in a follow-up commit, after I finish implementing the tracing package in Rust.
Building upon Part 1, to accommodate PD disaggregation, I have upgraded the original three-level span structure to a four-level hierarchy by adding a top-level bootstrap_room_span. While the previous three-level structure could still achieve the original design goals, this change was made with future extensibility in mind. Specifically, we may want to attach certain attribute information to the request root span in the future; however, OpenTelemetry does not allow adding attributes to spans that are propagated from other nodes. Therefore, the updated span hierarchy is as follows:
Checklist