Fix the sm120 compilation with CUDA 12 #2482

ptrendx · 2025-12-05T00:27:46Z

Description

PR #2062 incorrectly used the redux.sync.f32 instruction when compiling for arch 120a; this instruction is only available on sm100f. This is the cause of our PyTorch build GitHub Action failures.

Type of change

Documentation change (change only to the documentation, either a fix or new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

Signed-off-by: Przemek Tredak <[email protected]>

greptile-apps · 2025-12-05T00:30:28Z

Greptile Overview

Greptile Summary

This PR fixes a compilation error that occurred when building for SM120 architecture with CUDA 12. The issue was introduced in PR #2062 where the redux.sync.max.abs.f32 PTX instruction was incorrectly used for SM120 architecture, despite this instruction only being available on SM100 family-specific architectures (sm100f).

Root cause: The original code used __CUDA_ARCH_HAS_FEATURE__ macros to check for SM100_ALL, SM101_ALL, and SM120_ALL, incorrectly assuming the instruction was available on all of these architectures.
Fix: Changed from preprocessor conditionals to if constexpr with NVTE_CUDA_ARCH_MATCHES(ptx::FamilySpecific<100>) which correctly restricts the optimized instruction to SM100 family only
Fallback behavior: SM110 and SM120 architectures now correctly use the fallback implementation (abs.f32 + redux.sync.max.u32)
No functional impact: The fallback path produces equivalent results, just without the optimized single-instruction path
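The equivalence claimed for the fallback path can be illustrated on the host. The following is a minimal sketch, not the code in `ptx.cuh`: it assumes NaN-free inputs, and the helper names (`float_bits`, `bits_to_float`, `max_abs_via_u32`, `max_abs_reference`) are hypothetical. It relies on the fact that non-negative IEEE-754 floats sort in the same order as their unsigned 32-bit bit patterns, so taking the absolute value first and then an unsigned integer max yields the same value as a floating-point max of absolute values.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

// Reinterpret a float's bits as an unsigned 32-bit integer
// (a host-side stand-in for a device bit cast; hypothetical helper).
inline std::uint32_t float_bits(float x) {
  std::uint32_t u;
  std::memcpy(&u, &x, sizeof(u));
  return u;
}

// Inverse of float_bits (hypothetical helper).
inline float bits_to_float(std::uint32_t u) {
  float x;
  std::memcpy(&x, &u, sizeof(x));
  return x;
}

// Emulates the fallback idea: per-lane abs, then an unsigned integer max
// across the lanes. The vector plays the role of the values held by a warp.
inline float max_abs_via_u32(const std::vector<float>& lanes) {
  std::uint32_t best = 0;  // bit pattern of +0.0f
  for (float v : lanes) {
    assert(!std::isnan(v));  // assumption of this sketch: no NaN inputs
    best = std::max(best, float_bits(std::fabs(v)));
  }
  return bits_to_float(best);
}

// Reference: plain floating-point max of absolute values.
inline float max_abs_reference(const std::vector<float>& lanes) {
  float best = 0.0f;
  for (float v : lanes) {
    best = std::max(best, std::fabs(v));
  }
  return best;
}
```

For any NaN-free input, `max_abs_via_u32` and `max_abs_reference` return the same value. The sketch does not model how either PTX path treats NaN.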

Confidence Score: 5/5

This PR is safe to merge - it fixes a compilation error with a minimal, targeted change
The change is small, focused, and correctly fixes a known compilation issue. It uses existing infrastructure (NVTE_CUDA_ARCH_MATCHES, FamilySpecific) that is already proven in the codebase. The fallback path ensures functionality remains correct for architectures that don't support the optimized instruction.
No files require special attention

Important Files Changed

File Analysis

Filename	Score	Overview
transformer_engine/common/util/ptx.cuh	5/5	Fixes SM120 compilation by restricting `redux.sync.max.abs.f32` instruction to SM100 family only using `if constexpr` with `NVTE_CUDA_ARCH_MATCHES`. The fallback path using `redux.sync.max.u32` is used for other architectures.

Sequence Diagram

sequenceDiagram
    participant Compiler as CUDA Compiler
    participant PTX as ptx.cuh
    participant Func as reduce_sync_max_abs_f32()
    
    Compiler->>PTX: Compile for target arch
    PTX->>PTX: NVTE_CUDA_ARCH_MATCHES(FamilySpecific<100>)
    alt SM100 Family (100, 101, 103, etc.)
        PTX->>Func: is_sm_100f = true
        Func->>Func: Use redux.sync.max.abs.f32
    else SM110 or SM120 Family
        PTX->>Func: is_sm_100f = false
        Func->>Func: Use fallback (abs.f32 + redux.sync.max.u32)
    end
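
The branch in the diagram above can be sketched in host-side C++. This is an illustration under stated assumptions, not the implementation of `NVTE_CUDA_ARCH_MATCHES` or `ptx::FamilySpecific`: the names `FamilySpecificSketch` and `select_max_abs_path` are hypothetical, and the rule that a family is identified by the SM number divided by 10 is inferred from the diagram (100, 101, 103 in one family; 110 and 120 in others). The point it shows is that `if constexpr` discards the untaken branch at compile time, so the SM100-only instruction is never emitted for other targets.

```cpp
#include <string_view>

// Hypothetical analog of a family-specific arch matcher. `sm` is the target
// SM number (e.g. 100, 103, 120); `has_family_features` is true when the
// build targets a family-specific variant. Both inputs are assumptions of
// this sketch; real code would derive them from compiler-defined macros.
template <int Family>
struct FamilySpecificSketch {
  static constexpr bool matches(int sm, bool has_family_features) {
    return has_family_features && (sm / 10 == Family / 10);
  }
};

// Returns the name of the instruction sequence that would be selected.
// Only one branch is instantiated for a given pair of template arguments.
template <int Sm, bool HasFamilyFeatures>
constexpr std::string_view select_max_abs_path() {
  constexpr bool is_sm_100f =
      FamilySpecificSketch<100>::matches(Sm, HasFamilyFeatures);
  if constexpr (is_sm_100f) {
    return "redux.sync.max.abs.f32";
  } else {
    return "abs.f32 + redux.sync.max.u32";
  }
}
```

With these assumptions, `select_max_abs_path<103, true>()` selects the single-instruction path, while `select_max_abs_path<120, true>()` selects the fallback.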

greptile-apps

1 file reviewed, no comments


timmoon10

LGTM

ptrendx · 2025-12-05T00:57:50Z

/te-ci

ptrendx · 2025-12-05T05:58:24Z

Will look into the failure on B200.

Fix the sm120 compilation with CUDA 12

eadf5f1

Signed-off-by: Przemek Tredak <[email protected]>

ptrendx requested a review from timmoon10 December 5, 2025 00:27

greptile-apps bot reviewed Dec 5, 2025


timmoon10 approved these changes Dec 5, 2025


ptrendx mentioned this pull request Dec 5, 2025

[Common] Persistent NVFP4 cast + transpose kernel #2412

Open

13 tasks
