Disable shape_padding in TorchRec pipelines to prevent cross-PG deadlock on AMD#4128
Open
kaanbaloglu wants to merge 1 commit into meta-pytorch:main
Conversation
Contributor
@kaanbaloglu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101241634.
Summary:
torch.compile's `should_pad_mm` heuristic benchmarks GPU kernels via `benchmark_gpu()`, which calls `torch.cuda.synchronize()`. This device-wide sync blocks on pending NCCL collectives from other process groups, causing a circular deadlock in distributed training with multiple PGs (e.g. mesh_shard + mesh_replicate).

The deadlock has only been observed on AMD MI350X (maz5 datacenter). To minimize blast radius, this diff scopes the workaround to AMD/ROCm builds via `torch.version.hip is not None`. NVIDIA jobs continue to benefit from the shape_padding optimization.

This diff disables `torch._inductor.config.shape_padding` (AMD-only) in TrainPipelinePT2 and TrainPipelineSparseDistCompAutograd, matching the precedent set by Simple FSDP (`set_configs_for_simple_fsdp`). The MM padding optimization is a nice-to-have that is not critical for model performance; Simple FSDP already disables it for all its users without reported issues.
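A minimal sketch of the gating described above (the helper names `should_disable_shape_padding` and `_maybe_disable_shape_padding` are hypothetical; the actual call sites live in the TorchRec pipeline classes and are not shown here). On ROCm builds `torch.version.hip` is a version string, while on CUDA/NVIDIA builds it is `None`:

```python
def should_disable_shape_padding(hip_version):
    """Return True only on AMD/ROCm builds.

    Mirrors the PR's gating condition: torch.version.hip is a version
    string (e.g. "6.2.41133") on ROCm builds and None on CUDA builds.
    """
    return hip_version is not None


def _maybe_disable_shape_padding():
    # Hypothetical helper sketching the workaround: disable inductor's
    # MM shape padding only on ROCm, leaving NVIDIA jobs unchanged so
    # they keep the optimization.
    import torch
    import torch._inductor.config as inductor_config

    if should_disable_shape_padding(torch.version.hip):
        inductor_config.shape_padding = False
```

Keeping the check on `torch.version.hip` (rather than a runtime device query) means the workaround applies based on the build, which matches how the diff scopes the blast radius.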
Differential Revision: D101241634