
Conversation

@zhongbozhu (Collaborator) commented on Sep 30, 2025:

Description

NVFP4 Group Linear Support.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

  • Pipe clean, fix NVFP4 padding
  • Numerical test pass
  • Fused Bulk Alloc
  • Fused multi-swizzle kernel

Unit test

PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 pytest -s -v tests/pytorch/test_numerics.py::test_grouped_linear_accuracy

PYTORCH_JIT=0 NVTE_TORCH_COMPILE=0 NVTE_ALLOW_NONDETERMINISTIC_ALGO=0 pytest -s -v tests/pytorch/test_numerics.py::test_padding_grouped_linear_accuracy

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@zhongbozhu self-assigned this Sep 30, 2025
@zhongbozhu (Collaborator, Author) commented:

/te-ci pytorch L1

@zhongbozhu (Collaborator, Author) commented:

/te-ci pytorch L1

…ck the vec_load_size to 1 to unblock

Signed-off-by: Zhongbo Zhu <[email protected]>
// Current unit tests won't capture this issue, but in E2E MoE training,
// using a vec_load_size other than 1 leads to a misaligned address error.
vec_load_size = all_nvfp4 ? 1 : std::min(vec_load_size, vec_load_size_i);

@zhongbozhu (Collaborator, Author) commented on Oct 4, 2025:


@yaox12 do you have any idea why? Note that NVFP4 is TN-only, i.e. a transpose happens, unlike MXFP8. Also, the error only occurs for WGRAD, which leads me to believe that padding m_splits to 32 for NVFP4 may not be enough if we want a vec_load_size greater than 1.

I then increased the padding from 32 to 64 and then 128, and found that only 128 works. However, I haven't figured out why the vec_load_size calculation logic is wrong, so I am overriding it to 1 as a hack.
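
For illustration, here is a minimal sketch of the alignment constraint being discussed. This is not the actual TE kernel logic; max_safe_vec_load, split_byte_offsets, and elem_bits are hypothetical names introduced only for this example.

#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical sketch (not Transformer Engine code): find the largest vector
// load width, in elements, such that every split's starting byte offset stays
// aligned to the corresponding load width in bytes. If the per-split padding
// is too small, this collapses to 1.
int max_safe_vec_load(const std::vector<std::size_t>& split_byte_offsets,
                      int requested_vec_load_size, int elem_bits) {
  int vec = requested_vec_load_size;
  while (vec > 1) {
    const std::size_t vec_bytes =
        static_cast<std::size_t>(vec) * static_cast<std::size_t>(elem_bits) / 8;
    const bool all_aligned =
        vec_bytes > 0 &&
        std::all_of(split_byte_offsets.begin(), split_byte_offsets.end(),
                    [vec_bytes](std::size_t off) { return off % vec_bytes == 0; });
    if (all_aligned) break;
    vec /= 2;  // fall back to a narrower vector load
  }
  return vec;
}

This only illustrates the failure mode: if the per-split padding does not keep every split's starting address aligned to the chosen vector width, the load width has to fall back to 1, which matches the WGRAD misaligned-address error described above.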

Comment on lines +157 to +158
// Check for size (not just pointer) for 0-dim or no token cases.
bool has_data() const noexcept { return data.dptr != nullptr || data.shape.size() != 0; }
A collaborator commented:

Mathematically, a 0-D tensor is a scalar with 1 entry.

Suggested change:
- // Check for size (not just pointer) for 0-dim or no token cases.
- bool has_data() const noexcept { return data.dptr != nullptr || data.shape.size() != 0; }
+ bool has_data() const noexcept { return data.dptr != nullptr; }
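
To make the semantic difference concrete, a small self-contained illustration (SimpleData, has_data_pr, and has_data_suggested are simplified stand-ins, not TE code):

#include <cstddef>
#include <vector>

struct SimpleData {
  void* dptr = nullptr;
  std::vector<std::size_t> shape;
};

// Version from this PR: a null pointer with a non-empty shape (e.g. a
// zero-token split with shape {0, K}) still counts as "has data".
bool has_data_pr(const SimpleData& d) {
  return d.dptr != nullptr || d.shape.size() != 0;
}

// Reviewer's suggestion: only the pointer decides.
bool has_data_suggested(const SimpleData& d) { return d.dptr != nullptr; }

// A 0-D tensor (shape == {}) is a scalar holding one entry, so an empty
// shape by itself does not mean "no data"; that is the objection above.

The observable difference is the {nullptr, {0, K}} case: has_data_pr reports true while has_data_suggested reports false, which is the zero-token situation the PR's code comment mentions.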

Comment on lines 23 to +27
  TensorWrapper fake_te_output(
-     nullptr, te_input.shape(),
+     amax_ptr, te_input.shape(),
      DType::kFloat8E4M3,  // It doesn't matter because we only compute amax.
-     amax.data_ptr<float>());
+     amax_ptr, nullptr, amax_ptr);
A collaborator commented:

This horrifying hack is needed because the tensor checking functions assume that the output tensor requires data:

NVTE_CHECK(t.has_data() || t.has_columnwise_data(), "Output ", name, " is not allocated!");

The right answer is to modify the API for nvte_compute_amax so that the output tensor is an FP32 tensor with one entry. We might use that amax value later to compute an FP8 tensor, an NVFP4 tensor, whatever, but that is completely irrelevant.
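
As a rough sketch of that direction (Fp32Tensor and compute_amax_sketch are hypothetical names with a CPU stand-in implementation, not the real nvte_compute_amax API), the output would simply be a one-entry FP32 tensor, independent of whatever quantized dtype the amax is later used for:

#include <cmath>
#include <cstddef>
#include <vector>

// Stand-in for "an FP32 tensor with one entry" (hypothetical type, not TE's).
struct Fp32Tensor {
  float* dptr;
  std::vector<std::size_t> shape;  // expected to be {1}
};

// Hypothetical shape of the proposed API: no fake FP8/NVFP4 output tensor,
// just the amax itself.
void compute_amax_sketch(const float* input, std::size_t numel, Fp32Tensor amax_out) {
  float amax = 0.0f;
  for (std::size_t i = 0; i < numel; ++i) {
    amax = std::fmax(amax, std::fabs(input[i]));
  }
  *amax_out.dptr = amax;
}

The point is only that the amax output is a plain FP32 value; how it is later consumed (FP8, NVFP4, ...) would not need to appear in the API.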

@zhongbozhu (Collaborator, Author) commented:

/te-ci pytorch L1
