Conversation

cyanguwa (Collaborator) commented Dec 4, 2025

Description

This PR continues the work in #2195, extending support for max_logit (used in MuonClip) to the THD format, for both the non-CP and CP cases (cp_comm_type = {'p2p', 'a2a', 'all_gather', 'a2a_p2p'}).
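A minimal usage sketch of the feature, assuming the `return_max_logit` flag introduced in #2195 and TE's standard THD calling convention; the exact argument names and returned tuple are assumptions, not verified against this PR:

```python
import torch
import transformer_engine.pytorch as te

# Two packed sequences of lengths 5 and 3 -> 8 total tokens in THD layout.
cu_seqlens = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
t, h, d = 8, 16, 64
q, k, v = (torch.randn(t, h, d, dtype=torch.bfloat16, device="cuda")
           for _ in range(3))

attn = te.DotProductAttention(num_attention_heads=h, kv_channels=d,
                              qkv_format="thd")
# return_max_logit=True (from #2195) additionally returns per-head max
# logits, which MuonClip uses to rescale the query/key projections.
out, max_logit = attn(q, k, v,
                      cu_seqlens_q=cu_seqlens, cu_seqlens_kv=cu_seqlens,
                      max_seqlen_q=5, max_seqlen_kv=5,
                      return_max_logit=True)
```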

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Updated cudnn-frontend to enable THD support for max_logit
  • Changed the shape of Stats (and subsequently Max and Sum_Exp) from [max_tokens_q, h, 1] to [num_tokens_q, h, 1] (see the sketch below)
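
To make the shape change concrete, a small sketch of the two sizes for a ragged THD batch; computing num_tokens_q from cu_seqlens_q is an assumption based on standard THD conventions, not code from this PR:

```python
import torch

# Ragged THD batch: 3 sequences of lengths 5, 2, 7.
cu_seqlens_q = torch.tensor([0, 5, 7, 14], dtype=torch.int32)
batch_size, max_seqlen_q, h = 3, 7, 16

max_tokens_q = batch_size * max_seqlen_q  # padded upper bound: 21
num_tokens_q = int(cu_seqlens_q[-1])      # actual packed token count: 14

# Stats (and Max, Sum_Exp) are now sized by the real token count:
stats_shape = (num_tokens_q, h, 1)        # was (max_tokens_q, h, 1)
```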

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

cyanguwa added the 2.11.0 label Dec 5, 2025
greptile-apps bot (Contributor) commented Dec 5, 2025

Greptile Overview

Greptile Summary

This PR extends THD (total tokens × heads × head dim) format support for the max_logit feature used by the MuonClip optimizer. The changes build on #2195 to enable the THD format across all context-parallelism types. The core modifications: the dual Max/Sum_Exp tensors are unified into a single Stats tensor, tensor shapes change from max_tokens_q to num_tokens_q for proper ragged-tensor handling, and the backend restrictions that previously disabled FusedAttention and UnfusedDotProductAttention for THD format with max_logit are removed. The changes span the Python interface layer, backend-selection logic, and CUDA kernel implementations, yielding a simpler, unified path for statistics generation while maintaining backward compatibility.
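
As a hedged illustration of the unified-Stats extraction this summary (and the diagram below) describe; the reduction shown is an assumption inferred from the shapes quoted in this PR, not TE's actual internals:

```python
import torch

num_tokens_q, h = 14, 16
# Unified per-token, per-head statistics produced by the fused kernel,
# shaped [num_tokens_q, h, 1] as described in this PR.
stats = torch.randn(num_tokens_q, h, 1)

# Per-head max_logit for MuonClip: reduce over the packed token
# dimension, [num_tokens_q, h, 1] -> [h].
max_logit = stats.amax(dim=0).squeeze(-1)
assert max_logit.shape == (h,)
```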

Important Files Changed

| Filename | Score | Overview |
|----------|-------|----------|
| transformer_engine/pytorch/cpp_extensions/fused_attn.py | 5/5 | Updates the Python interface to handle the new CUDA-kernel tensor ordering, extracting max_logit from the third output tensor instead of the second |
| transformer_engine/pytorch/attention/dot_product_attention/utils.py | 5/5 | Removes the backend restrictions that disabled FusedAttention and UnfusedDotProductAttention for THD format with max_logit |
| transformer_engine/common/fused_attn/fused_attn_f16_arbitrary_seqlen.cu | 5/5 | Implements the CUDA-kernel changes for THD support, updates tensor shapes from max_tokens_q to num_tokens_q, and simplifies stats generation |

Confidence score: 5/5

  • This PR is safe to merge with minimal risk as it extends existing functionality without breaking changes
  • Score reflects well-structured changes that properly handle tensor format differences and maintain API compatibility
  • No files require special attention as all changes are focused, well-commented, and follow established patterns in the codebase

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant DotProductAttention as "DotProductAttention"
    participant FusedAttention as "FusedAttention Backend"
    participant cuDNN as "cuDNN Frontend"
    participant GPU as "GPU Kernels"

    User->>DotProductAttention: "forward() with THD format, return_max_logit=True"
    DotProductAttention->>DotProductAttention: "Check qkv_format == 'thd' and return_max_logit"
    DotProductAttention->>FusedAttention: "Call fused_attn_fwd() with THD tensors"
    FusedAttention->>FusedAttention: "Detect cuDNN runtime >= 90600"
    FusedAttention->>FusedAttention: "Set Stats/Max shape to [num_tokens_q, h, 1]"
    FusedAttention->>FusedAttention: "Apply ragged offset for THD format"
    FusedAttention->>cuDNN: "Create execution graph with updated tensor layouts"
    cuDNN->>GPU: "Execute attention kernels with THD support"
    GPU-->>cuDNN: "Return attention output and max logits"
    cuDNN-->>FusedAttention: "Return computed tensors"
    FusedAttention->>FusedAttention: "Extract max_logit from Stats tensor [tq, h, 1] -> [h]"
    FusedAttention-->>DotProductAttention: "Return output and max_logit"
    DotProductAttention-->>User: "Return attention output with max_logit"
```
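
A minimal sketch of the version gate the diagram references, assuming TE's convention of encoding cuDNN 9.6.0 as the integer 90600; the helper itself is hypothetical:

```python
def thd_max_logit_supported(cudnn_runtime_version: int) -> bool:
    """Hypothetical helper mirroring the gate in the diagram:
    THD + max_logit requires cuDNN >= 9.6.0 (encoded as 90600)."""
    return cudnn_runtime_version >= 90600

assert thd_max_logit_supported(90600)
assert not thd_max_logit_supported(90500)
```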

greptile-apps bot (Contributor) left a comment

3 files reviewed, no comments

cyanguwa (Collaborator, Author) commented Dec 5, 2025

/te-ci pytorch L1

greptile-apps bot (Contributor) left a comment

3 files reviewed, no comments

