Skip to content

Conversation

@ptrendx
Copy link
Member

@ptrendx ptrendx commented Dec 5, 2025

Description

PR #2062 used the redux.sync.f32 instruction with arch 120a compilation incorrectly (since this instruction is only available on sm100f). This is the reason for our pyTorch Build GH action failures.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@ptrendx ptrendx requested a review from timmoon10 December 5, 2025 00:27
@greptile-apps
Copy link
Contributor

greptile-apps bot commented Dec 5, 2025

Greptile Overview

Greptile Summary

This PR fixes a compilation error that occurred when building for SM120 architecture with CUDA 12. The issue was introduced in PR #2062 where the redux.sync.max.abs.f32 PTX instruction was incorrectly used for SM120 architecture, despite this instruction only being available on SM100 family-specific architectures (sm100f).

  • Root cause: The original code used __CUDA_ARCH_HAS_FEATURE__ macros to check for SM100_ALL, SM101_ALL, and SM120_ALL, incorrectly assuming the instruction was available on all these architectures
  • Fix: Changed from preprocessor conditionals to if constexpr with NVTE_CUDA_ARCH_MATCHES(ptx::FamilySpecific<100>) which correctly restricts the optimized instruction to SM100 family only
  • Fallback behavior: SM110 and SM120 architectures now correctly use the fallback implementation (abs.f32 + redux.sync.max.u32)
  • No functional impact: The fallback path produces equivalent results, just without the optimized single-instruction path

Confidence Score: 5/5

  • This PR is safe to merge - it fixes a compilation error with a minimal, targeted change
  • The change is small, focused, and correctly fixes a known compilation issue. It uses existing infrastructure (NVTE_CUDA_ARCH_MATCHES, FamilySpecific) that is already proven in the codebase. The fallback path ensures functionality remains correct for architectures that don't support the optimized instruction.
  • No files require special attention

Important Files Changed

File Analysis

Filename Score Overview
transformer_engine/common/util/ptx.cuh 5/5 Fixes SM120 compilation by restricting redux.sync.max.abs.f32 instruction to SM100 family only using if constexpr with NVTE_CUDA_ARCH_MATCHES. The fallback path using redux.sync.max.u32 is used for other architectures.

Sequence Diagram

sequenceDiagram
    participant Compiler as CUDA Compiler
    participant PTX as ptx.cuh
    participant Func as reduce_sync_max_abs_f32()
    
    Compiler->>PTX: Compile for target arch
    PTX->>PTX: NVTE_CUDA_ARCH_MATCHES(FamilySpecific<100>)
    alt SM100 Family (100, 101, 103, etc.)
        PTX->>Func: is_sm_100f = true
        Func->>Func: Use redux.sync.max.abs.f32
    else SM110 or SM120 Family
        PTX->>Func: is_sm_100f = false
        Func->>Func: Use fallback (abs.f32 + redux.sync.max.u32)
    end
Loading

Copy link
Contributor

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

1 file reviewed, no comments

Edit Code Review Agent Settings | Greptile

Copy link
Collaborator

@timmoon10 timmoon10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@ptrendx
Copy link
Member Author

ptrendx commented Dec 5, 2025

/te-ci

@ptrendx
Copy link
Member Author

ptrendx commented Dec 5, 2025

Will look into the failure on B200.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants