
[PyTorch] Support TP Overlap in Per-Tensor Current Scaling Recipe #1554

Merged: 20 commits into NVIDIA:main, Mar 15, 2025

Conversation

@BestJuly (Contributor) commented on Mar 10, 2025

Description

Enable TP/Comm Overlap for Per-Tensor Current Scaling Recipe.

Type of change

  • Documentation change (changes only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

  • Unit-test-related code changes
  • Module-related code changes (i.e., linear, layernorm_linear, layernorm_mlp)
  • Quantization-related code changes (to help use the correct scale_inv value)

Unit Tests

Python Unit Tests

# Test Current Scaling only
pytest -s tests/pytorch/distributed/test_comm_gemm_overlap.py -k "FP8 and CURRENT" -v

# Test all overlapping cases
pytest -s tests/pytorch/distributed/test_comm_gemm_overlap.py -v

# Particular pattern test, take `LayernormLinear` module as an example
# Set UB_SKIPMC=1 if necessary
torchrun --nproc_per_node 2 tests/pytorch/distributed/run_layer_with_overlap.py --seq-length=1024 --batch-size=1 --num-heads=8 --head-dim=64 --layer-type=layernormlinear --fp8 --quantization=fp8_current_scaling

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

opts = opts.dtype(torch::kFloat32).device(torch::kCUDA);
- at::Tensor scale_inv = at::empty(scale_inv_torch_shape, opts);
+ // In current scaling, scale is not known but we initialize it with 1 to avoid division by zero. If scale is already calculated, it can be correctly set.
+ at::Tensor scale_inv = at::reciprocal(scale);
Collaborator:

In principle we don't know if the value of scale is valid at this point. It is a workspace buffer for an intermediate value during quantization, and we only expose it outside quantization to make debugging more convenient: #1471 (comment)

The actual bug is that the UB comm doesn't handle scales properly in copy_into_buffer and get_buffer. This is an indirect fix that assumes that UB is constructing the gathered tensor after it has called the quantizer. This might not always be true: pipeline parallelism involves multiple forward passes and then multiple backward passes, so input_quantizer might have been called multiple times in between a forward and backward pass.

I'm working on a more general bugfix, so I'm fine accepting this as a temporary fix.
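
To make the pipeline-parallel hazard concrete, here is a small self-contained toy example (all names are hypothetical; this is not TE code): a shared quantizer's scale workspace is overwritten by later forward passes before an earlier backward pass reads it, so reconstructing scale_inv from it at backward time uses a stale value.

    class ToyQuantizer:
        """Stand-in for a quantizer whose `scale` is a reusable workspace value."""
        def __init__(self):
            self.scale = 1.0  # initialized to 1 to avoid division by zero
        def quantize(self, amax):
            self.scale = 448.0 / amax  # per-tensor current scaling from this call's amax
            return self.scale

    q = ToyQuantizer()
    forward_scales = []
    for amax in (2.0, 16.0):  # two pipeline microbatches run their forward passes first
        forward_scales.append(q.quantize(amax))
    # The backward pass of microbatch 0 now reads q.scale, but it holds microbatch 1's value,
    # so computing scale_inv as 1 / q.scale here would use a stale scale.
    assert q.scale == forward_scales[1] and q.scale != forward_scales[0]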

Contributor Author (BestJuly):

UB is initialized without any quantization-related info; I think that is why copy_into_buffer does not handle the scale. get_buffer then uses the scale information from the quantizer, which duplicates the calculation of scale_inv. Agreed that the change here is indeed a temporary fix.

# reduce duplicated transpose in `_fix_gathered_fp8_transpose`
input_quantizer.set_usage(rowwise=True, columnwise=False)
ub_obj_fprop = get_ub(ub_name + "_fprop")
ln_out_tmp = ln_out
Contributor:

maybe we should give it a better name

Contributor Author (BestJuly):

Use ln_out_local instead, which should be more accurate.

)
if fp8 and ub_bulk_dgrad:
    # reduce duplicated transpose in `_fix_gathered_fp8_transpose`
    input_quantizer.set_usage(rowwise=True, columnwise=False)
Contributor:

This optimization can also be applied to grad_output_preprocess in base.py, where we do grad_output = quantizer(grad_output).

If we want to all-gather grad_output (dY) under sequence + row parallelism, we expect UB to perform that all-gather, so we should also call set_usage(rowwise=True, columnwise=False) before it. Otherwise we waste one at::empty() malloc and one cast_transpose that could have been just a cast; see the sketch after the quoted code below.

The actual transpose of grad_output happens in linear/layernormlinear after we call the ub get_buffer.

            if isinstance(grad_output, QuantizedTensor):
                # This is a no-op if platform supports non-TN FP8 GEMM or the transpose
                # already exists.
                grad_output.update_usage(rowwise_usage=True, columnwise_usage=True)
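
A hedged sketch of the suggested change (the names ctx.ub_overlap_ag and ctx.grad_output_quantizer are assumptions for illustration, not necessarily the exact attributes in base.py): restrict the quantizer to row-wise usage before quantizing dY, so the quantize becomes a plain cast instead of a cast_transpose plus the extra at::empty().

    # Sketch only; attribute names are assumed for illustration.
    if ctx.ub_overlap_ag and isinstance(ctx.grad_output_quantizer, Float8CurrentScalingQuantizer):
        # UB will all-gather the row-wise dY, so skip producing the transposed copy here;
        # the transpose can be reconstructed after get_buffer, as discussed above.
        ctx.grad_output_quantizer.set_usage(rowwise=True, columnwise=False)
    grad_output = ctx.grad_output_quantizer(grad_output)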

Contributor:

There is one thing I discussed offline with @timmoon10 about this issue, for the day we might add TP GEMM overlap to MXFP8.

For MXFP8 the scaling granularity is 1x32, so the "transpose" isn't just a data transpose but also a change of scaling granularity to 32x1. We might therefore need two all-gather ops in UB to overlap with the GEMM.

In that case we would still need set_usage(rowwise=True, columnwise=True) to make sure we have both copies of the data to communicate.

So instead of doing if ub_bulk_dgrad:, maybe we should do

if ub_bulk_dgrad and is_per_tensor_recipe():

The is_per_tensor_scaling covers both delayed scaling and per-tensor current scaling. Both recipes should benefit from this optimization.
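
A hedged sketch of that guard, since is_per_tensor_recipe() is not an existing helper here; this version spells it out with the quantizer classes instead (Float8Quantizer is assumed to be the delayed-scaling quantizer class):

    if ub_bulk_dgrad and isinstance(input_quantizer, (Float8Quantizer, Float8CurrentScalingQuantizer)):
        # Per-tensor recipes (delayed and current scaling) can drop the columnwise copy;
        # MXFP8 would still need both, since its 1x32 block scales change under transpose.
        input_quantizer.set_usage(rowwise=True, columnwise=False)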

if ub_overlap_ag_fprop and isinstance(input_quantizer, Float8CurrentScalingQuantizer):
if ub_bulk_dgrad:
    # reduce duplicated transpose in `_fix_gathered_fp8_transpose`
    input_quantizer.set_usage(rowwise=True, columnwise=False)
Contributor:

why should we call set_usage(rowwise=True, columnwise=False) twice?

Contributor Author (BestJuly):

set_usage will indeed be called several times. Currently I disable the duplicated transpose processing with another set_usage call; otherwise we would need to add such expressions under many conditions to cover each possible case. I am not sure about the overhead of calling set_usage multiple times.

@timmoon10 (Collaborator) left a comment:

Overall LGTM

@timmoon10 (Collaborator):

/te-ci pytorch L1

ub_obj_fprop = None
ln_out = None
- if ub_overlap_ag_fprop:
+ if ub_overlap_ag_fprop and not isinstance(input_quantizer, Float8CurrentScalingQuantizer):
Contributor:

We should leave a comment here to explain why we skip the current scaling quantizer.

I guess the reason is: for current scaling, we want the layernorm to output bf16 rather than fp8, right?

Contributor Author (BestJuly):

Have added comments to address this.
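
A hedged sketch of the kind of explanatory comment described here (the wording is mine, not necessarily what was added in the PR), attached to the condition from the diff above:

    # With delayed scaling the FP8 scale is already known, so the layernorm can write its
    # output directly into the FP8 UB buffer. With current scaling the scale depends on the
    # amax of the full output, so the layernorm emits high precision and the quantization
    # into the UB buffer happens afterwards.
    if ub_overlap_ag_fprop and not isinstance(input_quantizer, Float8CurrentScalingQuantizer):
        ub_obj_fprop = get_ub(ub_name + "_fprop")
        ln_out = ub_obj_fprop.get_buffer(input_quantizer, local_chunk=True)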

)
if ub_bulk_dgrad:
    # reduce duplicated transpose in `_fix_gathered_fp8_transpose`
    input_quantizer.set_usage(rowwise=True, columnwise=False)
Contributor:

Should we also add a per_tensor_scaling check to the if condition here?

Contributor Author (BestJuly):

Done.

@@ -424,6 +429,7 @@ def backward(ctx, grad_output: torch.Tensor) -> Tuple[Union[torch.Tensor, None],
ub_obj_dgrad = ctx.ub_obj_gradout
ub_type_dgrad = tex.CommOverlapType.AG
ub_obj_dgrad.copy_into_buffer(inputmat, ctx.input_quantizer, local_chunk=True)
inputmat = ub_obj_dgrad.get_buffer(ctx.input_quantizer)
@zhongbozhu (Contributor) commented on Mar 11, 2025:

why do we need this line?

We use inputmat_total for the wgrad calculation, and the userbuffers handle the all-gather of inputmat to get inputmat_total, so we already have a line like the one below that fetches the gathered inputmat_total via ub get_buffer.

            wgrad = None
            if ctx.requires_wgrad:
                if ctx.ub_bulk_dgrad:
                    inputmat_total = ub_obj_dgrad.get_buffer(ctx.input_quantizer)

If this line is not necessary then we should remove it because get_buffer is not a free op.

Contributor Author (BestJuly):

Good catch. Removing this line does not affect any results. Have deleted this line.

@zhongbozhu (Contributor) commented on Mar 11, 2025:

Nit in the description section:

# Particular pattern test, take `LayernormLinear` module as an example
# Set UB_SKIPMC=1 if necessary
torchrun --nproc_per_node 2 tests/pytorch/distributed/run_layer_with_overlap.py --seq-length=1024 --batch-size=1 --num-heads=8 --head-dim=64 --layer-type=layernormlinear --fp8 --fp8-recipe=tensorwise

should be --quantization=fp8_current_scaling

# Set UB_SKIPMC=1 if necessary
torchrun --nproc_per_node 2 tests/pytorch/distributed/run_layer_with_overlap.py --seq-length=1024 --batch-size=1 --num-heads=8 --head-dim=64 --layer-type=layernormlinear --fp8 --quantization=fp8_current_scaling

@zhongbozhu (Contributor):

Done some tests myself, LGTM

ub_obj_fprop = get_ub(ub_name + "_fprop")
ln_out_local = ln_out
ln_out = ub_obj_fprop.get_buffer(input_quantizer, local_chunk=True)
input_quantizer.quantize(ln_out_local, out=ln_out)
@zhongbozhu (Contributor) commented on Mar 12, 2025:

Hmmm, I think I still have questions about this line, and we need to figure this out before we merge.

This line (plus another one in layernorm_mlp) are the only places where you call get_buffer first and then use quantizer.quantize(inp, out) to do the quantization (i.e., an in-place quantize without needing to malloc an output tensor). This pattern implies that the output tensor's dataptr is managed by the userbuffers system.

It's nice that this might save us some time copying data around, but I am worried that in actual E2E training, is it okay to save such a tensor for backward if its dataptr belongs to the ubuf? Won't the data be corrupted when we move on to the next layer?

Unfortunately our layer-level unit test cannot catch this, so I am not sure how to prove my point. @timmoon10 what do you think?

Collaborator:

Good catch, I think this is a bug from the 2.0 release. Before that, the GEMM function had an option to copy values from the UB buffer into a plain PyTorch tensor:

extra_output_tensor=ln_out if ub_overlap_ag else None,

We didn't handle this when updating to 2.0, and I agree that it seems likely that ln_out is corrupted in between the forward and backward pass.

Contributor Author (BestJuly):

The key point is that we use get_buffer, so ln_out is a UB buffer that gets saved and restored during training. All transformer layers share the same UB, so its values will change.
I think this can be replaced with copy_into_buffer, so that ln_out stays a normal tensor. That ensures UB has the correct values and that the tensor saved in prepare_for_saving is not a UB buffer.
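
A hedged sketch of the copy_into_buffer variant described above (not the exact PR code; the argument order follows the copy_into_buffer call quoted earlier in this conversation): quantize into a tensor owned by PyTorch, then copy the local chunk into the UB buffer, so the tensor saved for backward is not backed by the shared UB workspace.

    ub_obj_fprop = get_ub(ub_name + "_fprop")
    ln_out = input_quantizer(ln_out_local)  # regular FP8 tensor, safe to save for backward
    ub_obj_fprop.copy_into_buffer(ln_out, input_quantizer, local_chunk=True)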

Contributor:

yes

ub_obj_lnout = get_ub("fc1_fprop")
ln_out_local = ln_out
ln_out = ub_obj_lnout.get_buffer(fc1_input_quantizer, local_chunk=True)
fc1_input_quantizer.quantize(ln_out_local, out=ln_out)
Contributor:

And here is the other one.

- if ub_overlap_ag_fprop:
+ # For DelayedScaling, the output of normalization will be in fp8.
+ # For Float8CurrentScaling, we want the output of normalization in high precision, then quantize to fp8.
+ if ub_overlap_ag_fprop and not fp8:
      ub_obj_fprop = get_ub(ub_name + "_fprop")
      ln_out = ub_obj_fprop.get_buffer(input_quantizer, local_chunk=True)
Contributor:

Why are we distinguishing between the bf16 case and the fp8 case?

Contributor Author (BestJuly):

Because in the bf16 case UB can still be used, so I did not change the logic here.
The main fix is that the return value is changed to ln_out.clone() if ub_overlap_ag_fprop else ln_out.
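
A minimal sketch of that return-value change (names follow the surrounding code): return a clone whenever the layernorm output lives in the shared UB buffer, so the value handed back to the caller is not aliased to the workspace.

    ln_out_return = ln_out.clone() if ub_overlap_ag_fprop else ln_out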

    ub_obj_fprop = get_ub(ub_name + "_fprop")
    ln_out = ub_obj_fprop.get_buffer(input_quantizer, local_chunk=True)
elif ub_overlap_ag_fprop and not isinstance(input_quantizer, Float8CurrentScalingQuantizer):
    ln_out = input_quantizer.make_empty(inputmat.shape, dtype=inputmat.dtype, device="cuda")
Contributor:

This elif used to belong to FP8 delayed scaling, right? For delayed scaling we still want the layernorm to output directly into an fp8 buffer; the only difference is whether that fp8 buffer is allocated by PyTorch or by the ubuf.

If we just allocate with inputmat.dtype here, are we wasting some memory? We were only supposed to allocate enough for an fp8 tensor.

@timmoon10 (Collaborator):

/te-ci pytorch L1

@zhongbozhu (Contributor):

Cannot find any more issues, guess we are good to go? @timmoon10

@timmoon10 (Collaborator):

/te-ci pytorch L1

@zhongbozhu force-pushed the lit/enable_cs_tp_overlap branch from 673ee32 to f686a49 on March 14, 2025 at 22:54
@timmoon10 (Collaborator):

/te-ci pytorch L1

@timmoon10 (Collaborator):

This fixes some TransformerLayer UB test failures (by capping to TP=4), but it introduces some other TransformerLayer UB test failures (numerical error slightly beyond bounds with FP8 current scaling).

This PR has a critical bugfix (see #1554 (comment)), so I think it's better to merge quickly and fiddle with the test tolerances later.

@timmoon10 merged commit a7eeb28 into NVIDIA:main on Mar 15, 2025 (11 of 12 checks passed).