
Disable shape_padding in TorchRec pipelines to prevent cross-PG deadlock on AMD#4128

Open
kaanbaloglu wants to merge 1 commit into meta-pytorch:main from kaanbaloglu:export-D101241634

Conversation


@kaanbaloglu (Contributor) commented Apr 16, 2026

Summary:
`torch.compile`'s `should_pad_mm` heuristic benchmarks GPU kernels via `benchmark_gpu()`, which calls `torch.cuda.synchronize()`. This device-wide sync blocks on pending NCCL collectives from other process groups, causing a circular deadlock in distributed training with multiple PGs (e.g. `mesh_shard` + `mesh_replicate`).

The deadlock has only been observed on AMD MI350X (maz5 datacenter). To minimize blast radius, this diff scopes the workaround to AMD/ROCm builds via `torch.version.hip is not None`. NVIDIA jobs continue to benefit from the `shape_padding` optimization.

This diff disables `torch._inductor.config.shape_padding` (AMD-only) in `TrainPipelinePT2` and `TrainPipelineSparseDistCompAutograd`, matching the precedent set by Simple FSDP (`set_configs_for_simple_fsdp`). The MM padding optimization is a nice-to-have that is not critical for model performance; Simple FSDP already disables it for all its users without reported issues.
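The circular wait can be illustrated with a toy wait-for graph (a hypothetical model of the failure mode, not torch internals; all node names are illustrative): each rank's device-wide synchronize drains every stream, including a collective on another PG that cannot complete until the peer rank joins it, and the peer is itself stuck in its own synchronize.

```python
# Toy wait-for graph for the deadlock described above (illustrative names,
# not torch internals). Each edge means "X cannot finish until Y does".
waits_for = {
    "rank0:cuda.synchronize": "pg_shard:collective",    # device-wide sync drains all streams
    "pg_shard:collective": "rank1:join_collective",     # a NCCL collective needs every rank
    "rank1:join_collective": "rank1:cuda.synchronize",  # rank1 hit should_pad_mm first
    "rank1:cuda.synchronize": "pg_replicate:collective",
    "pg_replicate:collective": "rank0:join_collective",
    "rank0:join_collective": "rank0:cuda.synchronize",  # closes the cycle -> deadlock
}

def has_cycle(graph: dict[str, str]) -> bool:
    """Follow each node's single outgoing edge; revisiting a node means a cycle."""
    for start in graph:
        seen: set[str] = set()
        node = start
        while node in graph:
            if node in seen:
                return True
            seen.add(node)
            node = graph[node]
    return False
```

`has_cycle(waits_for)` returns `True`: every participant is waiting on another, so no rank ever makes progress.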

Differential Revision: D101241634


meta-codesync Bot commented Apr 16, 2026

@kaanbaloglu has exported this pull request. If you are a Meta employee, you can view the originating Diff in D101241634.

meta-cla Bot added the "CLA Signed" label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Apr 16, 2026
meta-codesync Bot changed the title from "Disable shape_padding in TorchRec pipelines to prevent cross-PG deadlock" to "Disable shape_padding in TorchRec pipelines to prevent cross-PG deadlock on AMD" on Apr 23, 2026

Labels: CLA Signed, fb-exported, meta-exported
