
Conversation

@sufeng-buaa
Contributor

@sufeng-buaa sufeng-buaa commented Sep 23, 2025

Motivation

This PR is in response to #8965. For details on the motivation and visual output, please refer to that issue.

Modifications

This patch has been split into two parts.
The first part is #9962, which has already been merged.

This is the second part, which includes:

  • Request tracing support for PD disaggregation.
  • Request tracing support for DP attention scenarios.
  • A script for converting OpenTelemetry data to the Perfetto format (see the illustrative sketch after this list).
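
For context, Perfetto's UI can open Chrome trace event JSON, so one common route is to flatten exported spans into "complete" events with microsecond timestamps. The sketch below is illustrative only and is not the conversion script shipped in this PR; the input field names are assumptions.

```python
# Illustrative OTel-span -> Perfetto (Chrome trace event JSON) conversion.
# Not the script added by this PR; input field names (name, start_ns, end_ns,
# pid, tid) are assumptions about how the span data might be exported.
import json

def spans_to_perfetto(spans, out_path):
    events = [
        {
            "name": s["name"],
            "ph": "X",                         # "complete" event: begin time + duration
            "ts": s["start_ns"] / 1000.0,      # Chrome trace timestamps are in microseconds
            "dur": (s["end_ns"] - s["start_ns"]) / 1000.0,
            "pid": s["pid"],                   # shown as a process track in Perfetto
            "tid": s["tid"],                   # shown as a thread track in Perfetto
        }
        for s in spans
    ]
    with open(out_path, "w") as f:
        json.dump({"traceEvents": events}, f)
```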

Router tracing is currently only implemented for mini_lb, since mini_lb is implemented in Python. Support for the Rust-based router will be added in a follow-up commit, after I finish implementing the tracing package in Rust.

Building on Part 1, to accommodate PD disaggregation I have upgraded the original three-level span structure to a four-level hierarchy by adding a top-level bootstrap_room_span. The previous three-level structure could still achieve the original design goals, but this change was made with future extensibility in mind: we may want to attach attribute information to the request root span in the future, and OpenTelemetry does not allow adding attributes to spans that are propagated from other nodes. The updated span hierarchy is shown below (a minimal sketch of the cross-node context propagation follows the tree):

bootstrap room span
├── router req root span
│   └── router thread span
│       └── slice span
├── prefill req root span
│   ├── tokenizer thread span
│   │   └── slice span
│   └── scheduler thread span
│       └── slice span
└── decode req root span
    ├── tokenizer thread span
    │   └── slice span
    └── scheduler thread span
        └── slice span
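
Because the request root span cannot receive attributes once it has been propagated from another node, each node opens its own req root span under the shared bootstrap room span, and only the trace context is carried across nodes. The sketch below shows the general shape of that cross-node propagation with the OpenTelemetry Python API; the helper names are illustrative and are not the actual functions added by this PR.

```python
# Illustrative cross-node trace context propagation (helper names are not the
# PR's actual API). The sending side serializes the current span context into a
# W3C traceparent carrier; the receiving side restores it and opens a child span.
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

_propagator = TraceContextTextMapPropagator()

def export_trace_context() -> dict:
    # On the sending node (e.g. mini_lb): capture the active span context.
    carrier = {}
    _propagator.inject(carrier)
    return carrier  # e.g. {"traceparent": "00-<trace_id>-<span_id>-01"}

def start_remote_child_span(carrier: dict, name: str):
    # On the receiving node (prefill/decode): rebuild the remote context and
    # start a span that is parented to it.
    remote_ctx = _propagator.extract(carrier)
    tracer = trace.get_tracer("sglang.request.tracing")
    return tracer.start_span(name, context=remote_ctx)
```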

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @sufeng-buaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents the second phase of implementing a comprehensive tracing feature in Sglang, focusing on enhancing observability for distributed request processing. It introduces a more robust four-level OpenTelemetry span structure and explicit trace context propagation, particularly for Prefill/Decode disaggregation and Data Parallel attention. These changes are designed to provide developers with deeper insights into request flow and performance bottlenecks across various Sglang components, laying the groundwork for more advanced debugging and optimization in complex, distributed environments.

Highlights

  • Expanded Tracing Support: This pull request extends Sglang's fine-grained request tracing capabilities to cover PD (Prefill/Decode) disaggregation and Data Parallel (DP) attention scenarios, providing more comprehensive visibility into request latency.
  • Upgraded Span Hierarchy: The tracing framework's span structure has been upgraded from a three-level to a four-level hierarchy by introducing a bootstrap_room_span. This new top-level span facilitates future extensibility and allows for attaching attributes to the request root span on different nodes, addressing OpenTelemetry's propagation constraints.
  • Cross-Node Trace Context Propagation: New mechanisms have been implemented to explicitly propagate trace context when request execution flows transfer between different nodes, crucial for distributed architectures like PD disaggregation.
  • Mini Load Balancer Tracing: Tracing has been integrated into the Python-based mini_lb (Mini Load Balancer), enabling tracking of requests as they are dispatched to prefill and decode servers. Support for the Rust-based router is noted as a future enhancement.
  • OpenTelemetry Endpoint Correction: A consistent typo in the OpenTelemetry endpoint argument (--oltp-traces-endpoint to --otlp-traces-endpoint) has been corrected across the codebase and documentation.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the tracing capabilities by adding support for PD disaggregation and DP attention scenarios. It introduces a more robust four-level span hierarchy to better track request latency across different nodes and threads, which is a great improvement for observability. The changes are well-structured, and the documentation has been updated accordingly. I've found one critical bug that would cause a runtime error and have also suggested a refactoring to reduce code duplication in one of the files. Overall, this is a solid contribution to improving the observability of the system.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 98bef23 to 6e2ec1d on September 23, 2025 13:03
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch 3 times, most recently from e837462 to 479baef on September 24, 2025 09:50
@sufeng-buaa
Contributor Author

Hi @ishandhanani , could you please review this PR when you have a moment? Thanks!

@lun-4
Contributor

lun-4 commented Sep 29, 2025

Hello! I have a question: I've looked at both #9962 and this PR, but it's unclear to me whether support for receiving trace/span IDs via the traceparent header (through OpenTelemetry context propagation, either manually or automatically via FastAPI instrumentation) has been added or is planned for a future PR. In my deployment, sglang runs behind a service that handles authentication and other roles, and that service already initializes a lot of important span metadata that it would be welcome to link to sglang's own traces.

EDIT: looks like not. I just made a quick patch for it in #11074.
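
For reference, the "automatic" option mentioned above usually looks like the sketch below: FastAPI auto-instrumentation extracts an incoming W3C traceparent header and parents the server-side spans under the caller's trace. This only illustrates the general mechanism, not what #11074 actually implements; the app and endpoint here are placeholders.

```python
# Illustrative only: FastAPI auto-instrumentation consuming an upstream
# `traceparent` header. App/endpoint names are placeholders, not sglang code.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

@app.post("/generate")
async def generate():
    # Spans created while handling this request become children of the caller's
    # trace whenever a valid `traceparent` header is present on the request.
    return {"ok": True}

FastAPIInstrumentor.instrument_app(app)
```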

@merrymercy
Contributor

merrymercy commented Oct 1, 2025

Please also pay extreme attention to overhead #9962 (comment)

@sufeng-buaa
Contributor Author

> It's unclear to me whether support for receiving trace/span IDs via the traceparent header has been added or is planned for a future PR.

Is this PR #10808 what you mean?

@sufeng-buaa
Contributor Author

> Please also pay extreme attention to overhead #9962 (comment)

OK, thank you for the review. I will fix it as soon as possible.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 5bafa93 to 6f04110 on October 10, 2025 02:01
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 6f04110 to d2fe22c on October 10, 2025 03:46
@sufeng-buaa
Contributor Author

> Please also pay extreme attention to overhead #9962 (comment)

@merrymercy I have updated the code according to your suggestions. Could you please take a look again?

@ishandhanani
Collaborator

Hi @sufeng-buaa - curious whether you've looked at https://grafana.com/oss/tempo/. We already use Grafana for our metrics in SGL, and Tempo plays well with the Grafana stack.

No need to change/update anything, just sharing in case it's useful :)

@acelyc111
Collaborator

Instrumentation Overhead Evaluation

The test environment is the same as in #9962.
Compared to #9962, the overhead of single-span generation and trace context propagation across threads remains unchanged. This section adds support for request tracing in the PD Disaggregation scenario, with an overhead of approximately 60 μs for trace context propagation across nodes (from mini_lb to prefill or decode nodes).
Additionally, the span structure has been extended to four levels, increasing the overhead of trace_req_start() to approximately 95 μs.

Hi, did you consider the exporter thread overhead? I'm not sure if it's caused by the GIL, but it seems that the exporter thread blocks the scheduler thread when encoding spans in the OTel SDK, so the overall overhead is not fully overlapped in many scenarios.

Great question. I actually ran into this issue during high-load testing and spent several weeks tracking it down. The main problem was that the exporter's asynchronous export cycle was too long, so each export suddenly generated a large amount of garbage, which in turn triggered prolonged GC pauses that blocked the scheduler thread. I've since addressed this by tuning the schedule_delay_millis and max_export_batch_size parameters to make exports more frequent but smaller (bbd10e7). This helps prevent garbage-collection spikes and significantly reduces the risk of blocking the scheduler. While it may increase CPU usage slightly, CPU resources are typically underutilized in most LLM deployment environments, so the trade-off is well worth the improved latency and stability.
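
For reference, the knobs mentioned above are constructor arguments of the OTel SDK's BatchSpanProcessor; a minimal sketch of "more frequent but smaller" exports looks like this (the values shown are illustrative, not the ones used in bbd10e7):

```python
# Illustrative BatchSpanProcessor tuning: export more often and in smaller
# batches so each export cycle produces less garbage. Values are examples only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="localhost:4317", insecure=True),
        schedule_delay_millis=500,    # default is 5000 ms; flush more frequently
        max_export_batch_size=128,    # default is 512; keep each batch small
    )
)
trace.set_tracer_provider(provider)
```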

Another approach is to enable sampling; unfortunately, the Python SDK doesn't support this feature, so maybe we can do it in the SGLang application layer.
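
A purely illustrative sketch of what application-layer sampling could look like (not part of this PR; the ratio knob is hypothetical): decide per request, before any span is created, whether to trace it at all.

```python
# Hypothetical application-layer sampling: trace only a fraction of requests so
# most requests pay zero instrumentation cost. Not part of this PR.
import random

TRACE_SAMPLE_RATIO = 0.01  # hypothetical knob: trace ~1% of requests

def should_trace_request() -> bool:
    return random.random() < TRACE_SAMPLE_RATIO
```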

@sufeng-buaa
Contributor Author

> Another approach is to enable sampling; unfortunately, the Python SDK doesn't support this feature, so maybe we can do it in the SGLang application layer.

This can be considered in the next patch.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 29aedb6 to 37357b9 on October 22, 2025 06:13

@ShangmingCai ShangmingCai left a comment


LGTM

@ShangmingCai
Collaborator

CC: @slin1237 Do you have time to take another look since there are some modifications in sglang_router as well?

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 96f94fb to 9a26634 on October 24, 2025 07:50
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 9a26634 to ddeddc5 on October 27, 2025 02:56
@sufeng-buaa sufeng-buaa requested a review from key4ng as a code owner October 27, 2025 02:56
@zhyncs
Member

zhyncs commented Oct 28, 2025

@zhyncs zhyncs merged commit ea96106 into sgl-project:main Oct 28, 2025
107 of 119 checks passed
hnyls2002 pushed a commit that referenced this pull request Oct 29, 2025
@jinmingyi1998
Contributor

jinmingyi1998 commented Oct 29, 2025

@slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:

from sglang.srt.tracing.trace import (

@sufeng-buaa
Contributor Author

> @slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:
> from sglang.srt.tracing.trace import (

Indeed, it will cause an import error if sglang is not installed in the image.
Let me figure out a way to fix it as soon as possible.
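
One possible direction (illustrative only; the actual fix was later posted as #12338) is to guard the sglang.srt import in the router so that, when sglang is not installed in the image, tracing degrades to no-ops instead of failing at import time:

```python
# Illustrative sketch, not the actual fix in #12338: fall back to a no-op when
# sglang.srt is unavailable (e.g. in the router-only image).
try:
    from sglang.srt.tracing.trace import trace_req_start  # symbol assumed for illustration
except ImportError:
    def trace_req_start(*args, **kwargs):  # no-op fallback
        return None
```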

@sufeng-buaa
Contributor Author

> @slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:
> from sglang.srt.tracing.trace import (

I pushed a fix patch; the link is #12338.
