
Conversation

@sufeng-buaa
Contributor

@sufeng-buaa sufeng-buaa commented Sep 23, 2025

Motivation

This PR is in response to #8965. For details on the motivation and visual output, please refer to that issue.

Modifications

This patch has been split into two parts.
The first part is #9962, which has already been merged.

This is the second part, which includes:

  • Request tracing support for PD disaggregation.
  • Request tracing support for DP attention scenarios.
  • A script for converting OpenTelemetry data to the Perfetto format (see the illustrative sketch after this list).
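
For context, Perfetto's UI can open Chrome trace event JSON, so one common route is to flatten exported spans into "complete" events with microsecond timestamps. The sketch below is illustrative only and is not the conversion script shipped in this PR; the input field names are assumptions.

```python
# Illustrative OTel-span -> Perfetto (Chrome trace event JSON) conversion.
# Not the script added by this PR; input field names (name, start_ns, end_ns,
# pid, tid) are assumptions about how the span data might be exported.
import json

def spans_to_perfetto(spans, out_path):
    events = [
        {
            "name": s["name"],
            "ph": "X",                         # "complete" event: begin time + duration
            "ts": s["start_ns"] / 1000.0,      # Chrome trace timestamps are in microseconds
            "dur": (s["end_ns"] - s["start_ns"]) / 1000.0,
            "pid": s["pid"],                   # shown as a process track in Perfetto
            "tid": s["tid"],                   # shown as a thread track in Perfetto
        }
        for s in spans
    ]
    with open(out_path, "w") as f:
        json.dump({"traceEvents": events}, f)
```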

Router tracing is currently only implemented for mini_lb, since mini_lb is implemented in Python. Support for the Rust-based router will be added in a follow-up commit, after I finish implementing the tracing package in Rust.

Building on Part 1, to accommodate PD disaggregation I have upgraded the original three-level span structure to a four-level hierarchy by adding a top-level bootstrap_room_span. The previous three-level structure could still achieve the original design goals, but this change was made with future extensibility in mind: we may want to attach attribute information to the request root span in the future, and OpenTelemetry does not allow adding attributes to spans that are propagated from other nodes. The updated span hierarchy is shown below (a minimal sketch of the cross-node context propagation follows the tree):

bootstrap room span
├── router req root span
│   └── router thread span
│       └── slice span
├── prefill req root span
│   ├── tokenizer thread span
│   │   └── slice span
│   └── scheduler thread span
│       └── slice span
└── decode req root span
    ├── tokenizer thread span
    │   └── slice span
    └── scheduler thread span
        └── slice span
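
Because the request root span cannot receive attributes once it has been propagated from another node, each node opens its own req root span under the shared bootstrap room span, and only the trace context is carried across nodes. The sketch below shows the general shape of that cross-node propagation with the OpenTelemetry Python API; the helper names are illustrative and are not the actual functions added by this PR.

```python
# Illustrative cross-node trace context propagation (helper names are not the
# PR's actual API). The sending side serializes the current span context into a
# W3C traceparent carrier; the receiving side restores it and opens a child span.
from opentelemetry import trace
from opentelemetry.trace.propagation.tracecontext import TraceContextTextMapPropagator

_propagator = TraceContextTextMapPropagator()

def export_trace_context() -> dict:
    # On the sending node (e.g. mini_lb): capture the active span context.
    carrier = {}
    _propagator.inject(carrier)
    return carrier  # e.g. {"traceparent": "00-<trace_id>-<span_id>-01"}

def start_remote_child_span(carrier: dict, name: str):
    # On the receiving node (prefill/decode): rebuild the remote context and
    # start a span that is parented to it.
    remote_ctx = _propagator.extract(carrier)
    tracer = trace.get_tracer("sglang.request.tracing")
    return tracer.start_span(name, context=remote_ctx)
```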

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @sufeng-buaa, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents the second phase of implementing a comprehensive tracing feature in Sglang, focusing on enhancing observability for distributed request processing. It introduces a more robust four-level OpenTelemetry span structure and explicit trace context propagation, particularly for Prefill/Decode disaggregation and Data Parallel attention. These changes are designed to provide developers with deeper insights into request flow and performance bottlenecks across various Sglang components, laying the groundwork for more advanced debugging and optimization in complex, distributed environments.

Highlights

  • Expanded Tracing Support: This pull request extends Sglang's fine-grained request tracing capabilities to cover PD (Prefill/Decode) disaggregation and Data Parallel (DP) attention scenarios, providing more comprehensive visibility into request latency.
  • Upgraded Span Hierarchy: The tracing framework's span structure has been upgraded from a three-level to a four-level hierarchy by introducing a bootstrap_room_span. This new top-level span facilitates future extensibility and allows for attaching attributes to the request root span on different nodes, addressing OpenTelemetry's propagation constraints.
  • Cross-Node Trace Context Propagation: New mechanisms have been implemented to explicitly propagate trace context when request execution flows transfer between different nodes, crucial for distributed architectures like PD disaggregation.
  • Mini Load Balancer Tracing: Tracing has been integrated into the Python-based mini_lb (Mini Load Balancer), enabling tracking of requests as they are dispatched to prefill and decode servers. Support for the Rust-based router is noted as a future enhancement.
  • OpenTelemetry Endpoint Correction: A consistent typo in the OpenTelemetry endpoint argument (--oltp-traces-endpoint to --otlp-traces-endpoint) has been corrected across the codebase and documentation.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request significantly enhances the tracing capabilities by adding support for PD disaggregation and DP attention scenarios. It introduces a more robust four-level span hierarchy to better track request latency across different nodes and threads, which is a great improvement for observability. The changes are well-structured, and the documentation has been updated accordingly. I've found one critical bug that would cause a runtime error and have also suggested a refactoring to reduce code duplication in one of the files. Overall, this is a solid contribution to improving the observability of the system.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 98bef23 to 6e2ec1d on September 23, 2025 13:03
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch 3 times, most recently from e837462 to 479baef on September 24, 2025 09:50
@sufeng-buaa
Contributor Author

Hi @ishandhanani , could you please review this PR when you have a moment? Thanks!

@lun-4
Contributor

lun-4 commented Sep 29, 2025

Hello! I have a question: I've looked at both #9962 and this PR, but it's unclear to me whether support for receiving trace/span IDs via the traceparent header (through OpenTelemetry context propagation, either manually or automatically via FastAPI instrumentation) has been added or is planned for a future PR. In my deployment, sglang runs behind a service that handles authentication and other roles, and that service already initializes a lot of important span metadata that it would be welcome to link to sglang's own traces.

EDIT: looks like not. I just made a quick patch for it in #11074.
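
For reference, the "automatic" option mentioned above usually looks like the sketch below: FastAPI auto-instrumentation extracts an incoming W3C traceparent header and parents the server-side spans under the caller's trace. This only illustrates the general mechanism, not what #11074 actually implements; the app and endpoint here are placeholders.

```python
# Illustrative only: FastAPI auto-instrumentation consuming an upstream
# `traceparent` header. App/endpoint names are placeholders, not sglang code.
from fastapi import FastAPI
from opentelemetry.instrumentation.fastapi import FastAPIInstrumentor

app = FastAPI()

@app.post("/generate")
async def generate():
    # Spans created while handling this request become children of the caller's
    # trace whenever a valid `traceparent` header is present on the request.
    return {"ok": True}

FastAPIInstrumentor.instrument_app(app)
```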

@merrymercy
Contributor

merrymercy commented Oct 1, 2025

Please also pay extreme attention to overhead #9962 (comment)

@sufeng-buaa
Contributor Author

> It's unclear to me whether support for receiving trace/span IDs via the traceparent header has been added or is planned for a future PR.

Is this PR #10808 what you mean?

@sufeng-buaa
Contributor Author

> Please also pay extreme attention to overhead #9962 (comment)

OK, thank you for the review. I will fix it as soon as possible.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 5bafa93 to 6f04110 on October 10, 2025 02:01
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 6f04110 to d2fe22c on October 10, 2025 03:46
@sufeng-buaa
Contributor Author

> Please also pay extreme attention to overhead #9962 (comment)

@merrymercy I have updated the code according to your suggestions. Could you please take a look again?

@ishandhanani
Collaborator

Hi @sufeng-buaa - curious whether you've looked at https://grafana.com/oss/tempo/. We already use Grafana for our metrics in SGL, and Tempo plays well with the Grafana stack.

No need to change/update anything, just sharing in case it's useful :)

@acelyc111
Collaborator

Instrumentation Overhead Evaluation

The test environment is the same as in #9962.
Compared to #9962, the overhead of single-span generation and trace context propagation across threads remains unchanged. This section adds support for request tracing in the PD Disaggregation scenario, with an overhead of approximately 60 μs for trace context propagation across nodes (from mini_lb to prefill or decode nodes).
Additionally, the span structure has been extended to four levels, increasing the overhead of trace_req_start() to approximately 95 μs.

Hi, did you consider the exporter thread overhead? I'm not sure if it's caused by the GIL, but it seems that the exporter thread blocks the scheduler thread when encoding spans in the OTel SDK, so the overall overhead is not fully overlapped in many scenarios.

Great question. I actually ran into this issue during high-load testing and spent several weeks tracking it down. The main problem was that the exporter's asynchronous export cycle was too long, so each export suddenly generated a large amount of garbage, which in turn triggered prolonged GC pauses that blocked the scheduler thread. I've since addressed this by tuning the schedule_delay_millis and max_export_batch_size parameters to make exports more frequent but smaller (bbd10e7). This helps prevent garbage-collection spikes and significantly reduces the risk of blocking the scheduler. While it may increase CPU usage slightly, CPU resources are typically underutilized in most LLM deployment environments, so the trade-off is well worth the improved latency and stability.
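
For reference, the knobs mentioned above are constructor arguments of the OTel SDK's BatchSpanProcessor; a minimal sketch of "more frequent but smaller" exports looks like this (the values shown are illustrative, not the ones used in bbd10e7):

```python
# Illustrative BatchSpanProcessor tuning: export more often and in smaller
# batches so each export cycle produces less garbage. Values are examples only.
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

provider = TracerProvider()
provider.add_span_processor(
    BatchSpanProcessor(
        OTLPSpanExporter(endpoint="localhost:4317", insecure=True),
        schedule_delay_millis=500,    # default is 5000 ms; flush more frequently
        max_export_batch_size=128,    # default is 512; keep each batch small
    )
)
trace.set_tracer_provider(provider)
```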

Another approach is to enable sampling; unfortunately, the Python SDK doesn't support this feature, so maybe we can do it in the SGLang application layer.
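
A purely illustrative sketch of what application-layer sampling could look like (not part of this PR; the ratio knob is hypothetical): decide per request, before any span is created, whether to trace it at all.

```python
# Hypothetical application-layer sampling: trace only a fraction of requests so
# most requests pay zero instrumentation cost. Not part of this PR.
import random

TRACE_SAMPLE_RATIO = 0.01  # hypothetical knob: trace ~1% of requests

def should_trace_request() -> bool:
    return random.random() < TRACE_SAMPLE_RATIO
```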

@sufeng-buaa
Contributor Author

> Another approach is to enable sampling; unfortunately, the Python SDK doesn't support this feature, so maybe we can do it in the SGLang application layer.

This can be considered in the next patch.

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 29aedb6 to 37357b9 on October 22, 2025 06:13

@ShangmingCai ShangmingCai left a comment


LGTM

@ShangmingCai
Collaborator

CC: @slin1237 Do you have time to take another look since there are some modifications in sglang_router as well?

@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 96f94fb to 9a26634 on October 24, 2025 07:50
@sufeng-buaa sufeng-buaa force-pushed the sufeng-buaa/sglang-tracing-part2 branch from 9a26634 to ddeddc5 on October 27, 2025 02:56
@sufeng-buaa sufeng-buaa requested a review from key4ng as a code owner October 27, 2025 02:56
@zhyncs
Member

zhyncs commented Oct 28, 2025

@zhyncs zhyncs merged commit ea96106 into sgl-project:main Oct 28, 2025
107 of 119 checks passed
hnyls2002 pushed a commit that referenced this pull request Oct 29, 2025
@jinmingyi1998
Contributor

jinmingyi1998 commented Oct 29, 2025

@slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:

from sglang.srt.tracing.trace import (

@sufeng-buaa
Contributor Author

> @slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:
> from sglang.srt.tracing.trace import (

Indeed, it will cause an import error if sglang is not installed in the image.
Let me figure out a way to fix it as soon as possible.
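
One possible direction (illustrative only; the actual fix was later posted as #12338) is to guard the sglang.srt import in the router so that, when sglang is not installed in the image, tracing degrades to no-ops instead of failing at import time:

```python
# Illustrative sketch, not the actual fix in #12338: fall back to a no-op when
# sglang.srt is unavailable (e.g. in the router-only image).
try:
    from sglang.srt.tracing.trace import trace_req_start  # symbol assumed for illustration
except ImportError:
    def trace_req_start(*args, **kwargs):  # no-op fallback
        return None
```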

@sufeng-buaa
Contributor Author

> @slin1237 this breaks the router image. The router image does not include sglang.srt, so we get an import error:
> from sglang.srt.tracing.trace import (

I pushed a fix patch; the link is #12338.
