[Draft][PyTorch][MOE] Support NVFP4 Grouped Linear #2215
base: main
Conversation
/te-ci pytorch L1
Force-pushed from 3c9e5ea to 4ac9df6 (Compare)
/te-ci pytorch L1
…ck the vec_load_size to 1 to unblock
// The current unit test won't capture this issue, but in E2E,
// using vec_load_size other than 1 will lead to a misaligned
// address error in MOE training.
vec_load_size = all_nvfp4 ? 1 : std::min(vec_load_size, vec_load_size_i);
@yaox12 do you have any idea why? Note that NVFP4 is TN-only, i.e. a transpose happens, unlike MXFP8. Also, the error only happens for WGRAD. That leads me to believe that padding to 32 for NVFP4 in m_splits may not be enough if we want vec_load_size greater than 1.
I then increased the padding from 32 to 64 and 128, and found that only 128 works. However, I haven't figured out why the vec_load_size calculation logic goes wrong, so I am overriding it to 1 as a hack.
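For reference, a minimal standalone sketch (not the actual TE kernel logic; `max_vec_bytes`, `pad_rows`, and the sizes below are hypothetical) of the kind of alignment reasoning involved: the usable vector load width is bounded by the largest power of two dividing the byte offset at which each group starts, and the padding applied to m_splits changes those offsets.

```cpp
#include <cstdint>
#include <cstdio>
#include <initializer_list>

// Largest power-of-two load width (in bytes, capped at 16) that divides the
// byte offset of a group's first element. NVFP4 packs two elements per byte.
int max_vec_bytes(std::int64_t elem_offset) {
  std::int64_t byte_offset = elem_offset / 2;
  int vec = 16;
  while (vec > 1 && byte_offset % vec != 0) vec /= 2;
  return vec;
}

// Round a per-expert row count up to a multiple of `align` rows, as the
// comment above experiments with for 32, 64, and 128.
std::int64_t pad_rows(std::int64_t rows, std::int64_t align) {
  return (rows + align - 1) / align * align;
}

int main() {
  const std::int64_t cols = 2880;  // hypothetical hidden size
  for (std::int64_t align : {32, 64, 128}) {
    const std::int64_t offset = pad_rows(100, align) * cols;  // second group's start
    std::printf("pad to multiple of %3lld rows -> group offset %lld elems, vec load %d bytes\n",
                static_cast<long long>(align), static_cast<long long>(offset),
                max_vec_bytes(offset));
  }
  return 0;
}
```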
// Check for size (not just pointer) for 0-dim or no token cases.
bool has_data() const noexcept { return data.dptr != nullptr || data.shape.size() != 0; }
Mathematically, a 0-D tensor is a scalar with 1 entry.
Suggested change:
- // Check for size (not just pointer) for 0-dim or no token cases.
- bool has_data() const noexcept { return data.dptr != nullptr || data.shape.size() != 0; }
+ bool has_data() const noexcept { return data.dptr != nullptr; }
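As a standalone illustration of that point (toy code, not TE's tensor class): the element count of a tensor is the product of its dimensions, so an empty shape (0-D) denotes one entry, while a zero-token shape such as {0, hidden} is the case that genuinely has no data.

```cpp
#include <cstddef>
#include <cstdio>
#include <functional>
#include <numeric>
#include <vector>

// Number of entries is the product of the dims; the empty product is 1,
// so a 0-D (scalar) tensor holds exactly one entry.
std::size_t numel(const std::vector<std::size_t>& shape) {
  return std::accumulate(shape.begin(), shape.end(), std::size_t{1},
                         std::multiplies<std::size_t>());
}

int main() {
  std::printf("0-D tensor entries: %zu\n", numel({}));                // 1
  std::printf("zero-token tensor entries: %zu\n", numel({0, 4096}));  // 0
  return 0;
}
```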
  TensorWrapper fake_te_output(
-     nullptr, te_input.shape(),
+     amax_ptr, te_input.shape(),
      DType::kFloat8E4M3,  // It doesn't matter because we only compute amax.
-     amax.data_ptr<float>());
+     amax_ptr, nullptr, amax_ptr);
This horrifying hack is needed because the tensor checking functions assume that the output tensor requires data:
NVTE_CHECK(t.has_data() || t.has_columnwise_data(), "Output ", name, " is not allocated!");
The right answer is to modify the API for nvte_compute_amax
so that the output tensor is an FP32 tensor with one entry. We might use that amax value later to compute an FP8 tensor, an NVFP4 tensor, whatever, but that is completely irrelevant.
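To make that concrete, here is a minimal CPU reference of the semantics being asked for (a sketch only; `compute_amax_reference` is hypothetical and this is not the existing nvte_compute_amax signature): the output is nothing more than a one-entry FP32 value holding max(|x|), and which format later consumes it is irrelevant.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Hypothetical reference of the proposed semantics: write max(|x|) into a
// single float. Whether the value later drives FP8, NVFP4, or any other
// quantization is irrelevant to the amax computation itself.
void compute_amax_reference(const float* input, std::size_t n, float* amax_out) {
  float amax = 0.0f;
  for (std::size_t i = 0; i < n; ++i) amax = std::max(amax, std::fabs(input[i]));
  *amax_out = amax;
}

int main() {
  const float x[] = {0.5f, -3.25f, 1.0f};
  float amax = 0.0f;
  compute_amax_reference(x, 3, &amax);  // amax == 3.25f
  return 0;
}
```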
/te-ci pytorch L1
Description
NVFP4 Grouped Linear support.
Fixes # (issue)
Type of change
Unit test
Checklist: