
Conversation

@vthumbe1503 (Collaborator) commented Nov 26, 2025

Description

Please include a brief summary of the changes, relevant motivation and context.

Fixes #2422

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • For MXFP8, we were retrieving the shape from the list of split scale-inverse tensors rather than from the split scale-inverse tensors themselves. Fixed it now, and added a unit test for the same.
  • Changed the contiguous API for the float8 tensor to also handle the transpose for L40/Hopper. Also fixed an issue where requires_grad was not maintained on the tensor after calling contiguous on it.

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

greptile-apps bot (Contributor) commented Nov 26, 2025

Greptile Overview

Greptile Summary

This PR fixes a critical bug in MXFP8 tensor splitting that caused AttributeError: 'list' object has no attribute 'shape' during model checkpointing. The issue occurred because torch.split returns a sequence of tensors, but the code incorrectly tried to access .shape on that container directly instead of iterating over the individual split tensors.
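
As a rough illustration of the failure mode, here is a standalone sketch (not the actual Transformer Engine code) of the buggy pattern and the fix:

import torch

scale_inv = torch.ones(8, 4)
scale_inv_out = torch.split(scale_inv, 4)  # returns a tuple of tensors, not a tensor

# Buggy pattern: treating the container as if it were a tensor
# scale_inv_out.shape  ->  AttributeError: the container has no attribute 'shape'

# Fixed pattern: convert to a list and handle each split tensor individually
scale_inv_out = list(scale_inv_out)
for idx, split_scale_inv in enumerate(scale_inv_out):
    print(idx, split_scale_inv.shape)  # each element is a tensor with a shape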

Key Changes:

  • Fixed mxfp8_tensor.py line 437: converts scale_inv_out to list explicitly and iterates over each split tensor for padding (lines 440-446)
  • Modified float8_tensor.py dequantize() to call contiguous() first to handle transpose for L40/Hopper
  • Updated contiguous() to make both _data and _transpose contiguous (if present) and preserve requires_grad (see the sketch after this list)
  • Added comprehensive test coverage for torch.chunk on all quantized tensor types including MXFP8
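
For reference, the requires_grad behavior that the updated contiguous() is expected to match mirrors plain torch.Tensor semantics; a minimal sketch in plain PyTorch (no Float8Tensor internals):

import torch

x = torch.randn(4, 4, requires_grad=True).t()  # non-contiguous view of a leaf tensor
y = x.contiguous()                             # returns a contiguous copy
assert y.is_contiguous()
assert y.requires_grad                         # autograd flag is preserved through the copy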

Impact:
The MXFP8 split tensor bug fix directly resolves issue #2422, allowing checkpoint saving to work correctly with MXFP8 tensors in distributed training scenarios.

Confidence Score: 4/5

  • This PR is safe to merge with minor performance consideration
  • Score of 4 reflects that the core bug fix is correct and well-tested, but the contiguous() method change removes an early-return optimization that could cause unnecessary tensor copies when data is already contiguous
  • Pay attention to transformer_engine/pytorch/tensor/float8_tensor.py - the contiguous() method performance impact should be evaluated

Important Files Changed

File Analysis

Filename | Score | Overview
transformer_engine/pytorch/tensor/mxfp8_tensor.py | 4/5 | Fixed bug where scale_inv_out.shape was called on a list returned by torch.split; now correctly iterates over split tensors and pads each individually
transformer_engine/pytorch/tensor/float8_tensor.py | 3/5 | Modified dequantize() to call contiguous() first, and changed contiguous() to always create a new tensor with both data and transpose contiguous, but this removes the early-return optimization
tests/pytorch/test_quantized_tensor.py | 5/5 | New comprehensive test file covering all quantized tensor types, including an MXFP8 chunk test that validates the bug fix
qa/L0_pytorch_unittest/test.sh | 5/5 | Updated test runner to use the renamed test file test_quantized_tensor.py instead of test_float8tensor.py

Sequence Diagram

sequenceDiagram
    participant User
    participant torch
    participant MXFP8Tensor
    participant Float8Tensor
    
    User->>torch: torch.chunk(mxfp8_tensor, 2, dim=0)
    torch->>MXFP8Tensor: __torch_dispatch__(aten.split.Tensor)
    
    Note over MXFP8Tensor: Split rowwise_data and columnwise_data
    MXFP8Tensor->>MXFP8Tensor: split _rowwise_data and _columnwise_data
    
    Note over MXFP8Tensor: Split scale_inv tensors (BUG WAS HERE)
    MXFP8Tensor->>MXFP8Tensor: split _rowwise_scale_inv
    MXFP8Tensor->>MXFP8Tensor: split _columnwise_scale_inv
    
    Note over MXFP8Tensor: OLD: scale_inv_out.shape ❌<br/>NEW: for each in scale_inv_out ✓
    
    loop For each split scale_inv tensor
        MXFP8Tensor->>MXFP8Tensor: pad split_scale_inv_out if needed
    end
    
    MXFP8Tensor->>MXFP8Tensor: Create MXFP8Tensor for each split
    MXFP8Tensor-->>torch: List[MXFP8Tensor]
    torch-->>User: Tuple of split tensors
    
    Note over User: Checkpoint save succeeds ✓

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment


current_shape = split_scale_inv_out.shape
pad_dim0 = (pad_multiple - current_shape[0] % pad_multiple) % pad_multiple
if pad_dim0 > 0:
    scale_inv_out[idx] = torch.nn.functional.pad(

Contributor

logic: Potential tuple mutability issue: if __torch_dispatch__ returns a tuple (immutable), this item assignment will fail with TypeError. Consider converting to list first:

Suggested change
- scale_inv_out[idx] = torch.nn.functional.pad(
+ scale_inv_out = list(scale_inv_out) if isinstance(scale_inv_out, tuple) else scale_inv_out
+ scale_inv_out[idx] = torch.nn.functional.pad(

Or convert immediately after dispatch on line 427.
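
To illustrate why the conversion matters, a minimal standalone sketch (not the Transformer Engine code itself):

import torch

x = torch.arange(8)
splits = torch.split(x, 4)       # torch.split returns a tuple of tensor views
try:
    splits[0] = torch.zeros(4)   # item assignment on a tuple raises TypeError
except TypeError:
    splits = list(splits)        # converting to a list makes the container mutable
    splits[0] = torch.zeros(4)   # now padding/replacing individual splits is possible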

@vthumbe1503 (Collaborator Author)

/te-ci pytorch

Signed-off-by: Varun Thumbe <[email protected]>
@vthumbe1503 vthumbe1503 changed the title from "[Pytorch][Bug]MXFP8 Split tensor had bug in padding the scale inverse tensors." to "[Pytorch][Bug]MXFP8 Split tensor Bug fix" on Nov 26, 2025

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments


@vthumbe1503 (Collaborator Author)

/te-ci pytorch

@greptile-apps greptile-apps bot left a comment

3 files reviewed, 1 comment


Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
Signed-off-by: vthumbe1503 <[email protected]>

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments


        ((64, 128), 2, 1),  # Split along second dimension, goes down dequantization path for mxfp8
    ],
)
def test_fp8_split_functionality(quantization_type, shape, chunks, dim):

Collaborator

test_numerics.py is mostly focused on modules. This would be a better fit for test_float8tensor.py, which is more focused on granular functionality in Float8Tensor. Instead of creating a new file for MXFP8Tensor (very clunky, especially as we add more tensor classes in the future), I propose renaming that file to test_quantized_tensor.py so we have a common place for general tests like this.

Comment on lines 2981 to 3050
@pytest.mark.parametrize(
    "quantization_type",
    [
        "fp8",
        "mxfp8",
    ],
)
@pytest.mark.parametrize(
    "shape,chunks,dim",
    [
        ((64, 128), 2, 0),  # Split along first dimension, needs padding for mxfp8
        ((64, 128), 2, 1),  # Split along second dimension, goes down dequantization path for mxfp8
    ],
)
def test_fp8_split_functionality(quantization_type, shape, chunks, dim):
    """Test torch.chunk on FP8 and MXFP8 tensors and verify correctness via dequantization."""
    if quantization_type == "fp8" and not fp8_available:
        pytest.skip(reason_for_no_fp8)
    if quantization_type == "mxfp8" and not mxfp8_available:
        pytest.skip(reason_for_no_mxfp8)

    device = "cuda"
    dtype = torch.bfloat16

    # Create reference tensor
    torch.manual_seed(1234)
    torch.cuda.manual_seed(1234)
    ref_tensor = torch.randn(shape, device=device, dtype=dtype)

    # Quantize the tensor
    if quantization_type == "fp8":
        quantizer = Float8Quantizer(
            scale=torch.ones(1, dtype=torch.float32, device=device).squeeze(),
            amax=torch.zeros(1, dtype=torch.float32, device=device),
            fp8_dtype=tex.DType.kFloat8E4M3,
        )
        quantized_tensor = quantizer(ref_tensor)
    elif quantization_type == "mxfp8":
        quantizer = MXFP8Quantizer(fp8_dtype=tex.DType.kFloat8E4M3)
        quantized_tensor = quantizer(ref_tensor)

    # Apply torch.chunk on quantized tensor
    quantized_tensor_dispatch_out = torch.chunk(quantized_tensor, chunks, dim=dim)
    # Need to make tensor contiguous for dim=1 splitting.
    outs = [out.contiguous() for out in quantized_tensor_dispatch_out]
    if dim == 0 or quantization_type == "fp8":
        # Dequantize the chunked results
        chunked_dequantized = [chunk.dequantize() for chunk in outs]
    else:
        # When splitting along second dimension, we go down the dequantization
        # route in case of mxfp8 for now.
        chunked_dequantized = outs

    # Reference: chunk the dequantized tensor directly
    ref_dequantized = quantized_tensor.dequantize()
    ref_chunked = torch.chunk(ref_dequantized, chunks, dim=dim)

    # Compare results
    assert len(chunked_dequantized) == len(
        ref_chunked
    ), f"Number of chunks mismatch: {len(chunked_dequantized)} vs {len(ref_chunked)}"

    for i, (chunk_deq, ref_chunk) in enumerate(zip(chunked_dequantized, ref_chunked)):
        assert (
            chunk_deq.shape == ref_chunk.shape
        ), f"Chunk {i} shape mismatch: {chunk_deq.shape} vs {ref_chunk.shape}"
        torch.testing.assert_close(
            chunk_deq,
            ref_chunk,
        )

Collaborator

We can easily make this into a general QuantizedTensor test with minimal knowledge of the internal implementation:

Suggested change
(the test_fp8_split_functionality test quoted above is removed and replaced with the following:)
@pytest.mark.parametrize(
    "quantization", ["fp8", "mxfp8", "fp8_blockwise", "nvfp4"],
)
@pytest.mark.parametrize("dim", [0, 1])
def test_chunk(
    *,
    quantization: str,
    shape: Iterable[int] = (128, 128),
    chunks: int = 2,
    dim: int,
    dtype: torch.dtype = torch.bfloat16,
    device: torch.device = "cuda",
) -> None:

    # Skip invalid configs
    if quantization == "fp8" and not fp8_available:
        pytest.skip(reason_for_no_fp8)
    if quantization == "fp8_blockwise" and not fp8_blockwise_available:
        pytest.skip(reason_for_no_fp8_blockwise)
    if quantization == "mxfp8" and not mxfp8_available:
        pytest.skip(reason_for_no_mxfp8)
    if quantization == "nvfp4" and not nvfp4_available:
        pytest.skip(reason_for_no_nvfp4)

    # Create quantizer
    if quantization == "fp8":
        quantizer = Float8CurrentScalingQuantizer(fp8_dtype=tex.DType.kFloat8E4M3)
    elif quantization == "mxfp8":
        quantizer = MXFP8Quantizer(fp8_dtype=tex.DType.kFloat8E4M3)
    elif quantization == "fp8_blockwise":
        quantizer = Float8BlockQuantizer(
            fp8_dtype=tex.DType.kFloat8E4M3,
            block_scaling_dim=1,
        )
    elif quantization == "nvfp4":
        quantizer = NVFP4Quantizer(
            with_rht=False,
            with_post_rht_amax=False,
            with_2d_quantization=False,
            stochastic_rounding=False,
            with_random_sign_mask=False,
        )
    else:
        raise ValueError(f"Unknown quantizer ({quantization})")

    # Create reference and quantized tensor
    ref_tensor = torch.randn(shape, device=device, dtype=dtype)
    quantized_tensor = quantizer(ref_tensor)
    ref_tensor.copy_(quantized_tensor)

    # Chunk tensors
    ref_splits = torch.chunk(ref_tensor, chunks, dim=dim)
    quantized_splits = torch.chunk(quantized_tensor, chunks, dim=dim)

    # Check splits
    for ref_split, quantized_split in zip(ref_splits, quantized_splits):

        # Check split shapes
        assert ref_split.size() == quantized_split.size()

        # Check that splits are quantized when expected
        if quantization == "fp8":
            assert isinstance(quantized_split, Float8Tensor)
        if quantization == "mxfp8" and dim == 0:
            assert isinstance(quantized_split, MXFP8Tensor)

        # Check values
        torch.testing.assert_close(quantized_split, ref_split)

    # Apply torch.chunk on quantized tensor
    quantized_tensor_dispatch_out = torch.chunk(quantized_tensor, chunks, dim=dim)
    # Need to make tensor contiguous for dim=1 splitting.
    outs = [out.contiguous() for out in quantized_tensor_dispatch_out]

Collaborator

Can we move this contiguous into dequantize? Normal torch.Tensors work without it, so it's a problem in our implementation if we force users to do this extra unintuitive step.
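
One way this could look, as a hedged sketch only (the method body and helper name below are assumptions, not the actual Transformer Engine API):

# Hypothetical sketch: let dequantize() normalize the layout itself so callers
# never need an explicit contiguous() step after torch.chunk.
def dequantize(self, *, dtype=None):
    tensor = self.contiguous()  # handle non-contiguous data/transpose up front
    return tensor._dequantize_impl(dtype=dtype)  # hypothetical internal helper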

    quantized_tensor_dispatch_out = torch.chunk(quantized_tensor, chunks, dim=dim)
    # Need to make tensor contiguous for dim=1 splitting.
    outs = [out.contiguous() for out in quantized_tensor_dispatch_out]
    if dim == 0 or quantization_type == "fp8":

Collaborator

Stylistic nit: This doesn't generalize to new recipes. If we generalize, we get something like:

Suggested change
- if dim == 0 or quantization_type == "fp8":
+ if quantization_type == "fp8" or (quantization_type == "mxfp8" and dim == 0):

The extra robustness is not that important, but notice how much more readable this is. Basically we are enumerating the "special cases" where we have to dequantize. It's worth putting thought into generalization, even if we never plan on doing it, because it forces you to understand the code at a logical level.

@greptile-apps greptile-apps bot left a comment

3 files reviewed, no comments


@vthumbe1503 (Collaborator Author)

/te-ci pytorch

@vthumbe1503 (Collaborator Author)

/te-ci pytorch

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments


@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. transformer_engine/pytorch/tensor/float8_tensor.py, line 557-565

    style: always creates a new tensor even when already contiguous - consider an early-return optimization:
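
    For example, an early-return guard along these lines (a sketch using the _data/_transpose attributes mentioned in the summary; not the actual implementation):

    def contiguous(self, memory_format=torch.contiguous_format):
        # Skip the copy when all underlying buffers are already laid out contiguously.
        data_ok = self._data is None or self._data.is_contiguous(memory_format=memory_format)
        transpose_ok = self._transpose is None or self._transpose.is_contiguous()
        if data_ok and transpose_ok:
            return self
        ...  # otherwise fall through to the existing copy path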

4 files reviewed, 1 comment


@vthumbe1503 (Collaborator Author)

/te-ci pytorch

Signed-off-by: Varun Thumbe <[email protected]>

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments


Signed-off-by: Varun Thumbe <[email protected]>

@vthumbe1503 (Collaborator Author)

/te-ci pytorch

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment



Development

Successfully merging this pull request may close these issues.

AttributeError on mxfp8_tensor while checkpointing a model
