Temp fix for bug in scale-out profiling by adding env variable to disable profiler around allreduce #1697

frost-intel · 2025-05-23T20:07:41Z

Currently, a bug in PTI causes an indefinite hang in the all_reduce call when profiling code on multiple nodes. If upstream Kineto enables the toggle option for XPU PTI (pytorch/kineto#1088), this PR would allow users to profile distributed scale-out code.

This PR uses an environment variable to disable profiling around the call to XPU allreduce. An additional change in PyTorch would be necessary to enable toggleCollectionDynamic for XPU.
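
For reviewers who want the shape of the workaround, here is a minimal Python sketch of the pattern, not the code in this PR. It assumes an opt-in environment variable and a profiler with a collection toggle. The variable name `EXAMPLE_DISABLE_PROFILER_AROUND_ALLREDUCE`, the `StubProfiler` class, and the toy `all_reduce` are all hypothetical stand-ins; the real variable name and the Kineto/PTI calls are not shown here.

```python
import os
from contextlib import contextmanager

# Hypothetical name: the PR description does not state the actual variable.
ENV_VAR = "EXAMPLE_DISABLE_PROFILER_AROUND_ALLREDUCE"


class StubProfiler:
    """Stand-in for a profiler with a collection toggle (not a real API)."""

    def __init__(self):
        self.enabled = True
        self.events = []

    def toggle(self, enable):
        # Turn event collection on or off.
        self.enabled = enable

    def record(self, name):
        # Events are only captured while collection is enabled.
        if self.enabled:
            self.events.append(name)


def pause_requested(environ=None):
    """Return True when the opt-in variable is set to "1"."""
    env = os.environ if environ is None else environ
    return env.get(ENV_VAR, "0") == "1"


@contextmanager
def profiler_paused_if_requested(profiler, environ=None):
    """Pause collection for the guarded region, then restore it."""
    pause = pause_requested(environ) and profiler.enabled
    if pause:
        profiler.toggle(False)
    try:
        yield
    finally:
        # Restore collection even if the guarded call raises.
        if pause:
            profiler.toggle(True)


def all_reduce(profiler, values, environ=None):
    """Toy all-reduce (a plain sum) wrapped by the guard."""
    with profiler_paused_if_requested(profiler, environ):
        profiler.record("allreduce")
        return sum(values)
```

With the variable unset, the collective is profiled as usual. With it set to `1`, the collective still runs but leaves no profiler event, and collection resumes afterwards.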

Ideally, a fix can be implemented in a new PTI version (0.12.3 or later) before the 2.8 deadline. If not, customers would require this fix.

@ashokei

Temp fix for bug in PTI by adding env variable to disable profiler

600f7ba
