FSDP grad fusion support #2191
base: main
Conversation
Signed-off-by: Selvaraj Anandaraj <[email protected]>
for more information, see https://pre-commit.ci
I don't think this makes sense. If you configure a TE module with fuse_wgrad_accumulation=True (e.g. here), the correct behavior is to fuse wgrad accumulation. If Mcore FSDP doesn't support it, then it should be Mcore's responsibility to not set that arg.
The root problem is that Mcore DDP and FSDP have different behaviors and require different contracts with TE.
I don't like this PR's approach of switching between these two cases based on whether Mcore is using DDP or FSDP, since that's not actually the important thing. It also needlessly blocks some possible optimizations (e.g. DDP might want to overwrite main_grad rather than accumulate into it). There are a few possible redesigns, e.g.:
grad_weight: torch.Tensor
accumulate: bool = False
if output_wgrad_to_main_grad:
    if getattr(weight, "get_main_grad", None) is not None:
        grad_weight = weight.get_main_grad()
    else:
        grad_weight = weight.main_grad
    # Accumulate into main_grad unless the buffer owner asks for an overwrite.
    accumulate = not getattr(weight, "_overwrite_main_grad", False)
else:
    grad_weight = torch.empty(...)
gemm(..., out=grad_weight, accumulate=accumulate)

Ensuring backward compatibility will be tricky.
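As an illustration of the proposed contract, here is a minimal, self-contained sketch. The helper `resolve_wgrad_output` is hypothetical, and the `output_wgrad_to_main_grad`, `get_main_grad`, and `_overwrite_main_grad` names are taken from the pseudocode above rather than from an existing API:

```python
import torch

def resolve_wgrad_output(weight, output_wgrad_to_main_grad: bool):
    """Pick the wgrad output buffer and accumulation flag per the sketch above."""
    if output_wgrad_to_main_grad:
        # Buffer owners that materialize main_grad lazily can expose a getter.
        if getattr(weight, "get_main_grad", None) is not None:
            grad_weight = weight.get_main_grad()
        else:
            grad_weight = weight.main_grad
        # Default contract: accumulate into main_grad; owners may request overwrite.
        accumulate = not getattr(weight, "_overwrite_main_grad", False)
    else:
        grad_weight = torch.empty_like(weight)
        accumulate = False
    return grad_weight, accumulate

# DDP-style owner: persistent main_grad buffer, accumulated across microbatches.
w_ddp = torch.nn.Parameter(torch.randn(4, 4))
w_ddp.main_grad = torch.zeros_like(w_ddp)
buf, acc = resolve_wgrad_output(w_ddp, output_wgrad_to_main_grad=True)
assert buf is w_ddp.main_grad and acc

# FSDP-style owner: buffer fetched through a getter and overwritten each step.
w_fsdp = torch.nn.Parameter(torch.randn(4, 4))
w_fsdp.get_main_grad = lambda: torch.zeros_like(w_fsdp)
w_fsdp._overwrite_main_grad = True
buf, acc = resolve_wgrad_output(w_fsdp, output_wgrad_to_main_grad=True)
assert not acc
```

The point is that the behavior is driven by attributes on the parameter, which the gradient-buffer owner controls, rather than by whether the caller happens to be DDP or FSDP.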
Signed-off-by: Selvaraj Anandaraj <[email protected]>
We should include this behavior in the documentation:
fuse_wgrad_accumulation : bool, default = `False`
accumulate_into_main_grad : bool, default = `False`
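As one possible wording (a sketch only, not final documentation text; the `get_main_grad` and `_overwrite_main_grad` attribute names follow the pseudocode above and are assumptions), the expanded docstring could read:

```
fuse_wgrad_accumulation : bool, default = `False`
    If `True`, the weight gradient GEMM writes into the buffer exposed by the
    parameter (`weight.main_grad`, or the tensor returned by
    `weight.get_main_grad()` when that method exists) instead of producing a
    separate `weight.grad`. The result is accumulated into that buffer unless
    `weight._overwrite_main_grad` is set, in which case the buffer is
    overwritten. The caller (e.g. Megatron-core DDP/FSDP) owns allocation and
    zeroing of the buffer.
```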
transformer_engine/pytorch/ops/fused/userbuffers_backward_linear.py
Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]>
Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]>
…ar.py Co-authored-by: Tim Moon <[email protected]> Signed-off-by: Selvaraj Anandaraj <[email protected]>
Signed-off-by: Selvaraj Anandaraj <[email protected]>
for more information, see https://pre-commit.ci
/te-ci pytorch
LGTM, pending CI.
It seems that DCO didn't like some commits, but they look fine to me. Maybe there's something misconfigured with your GitHub account's emails or maybe DCO is just buggy? In any case, I'm happy leaving this PR as-is and overriding DCO.
This PR adds gradient fusion support for MCore FSDP.
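For context, a minimal usage sketch of the fused wgrad accumulation path (illustrative only; assumes a CUDA build of Transformer Engine and that the caller, like Mcore, attaches the `main_grad` buffers):

```python
import torch
import transformer_engine.pytorch as te

# Fused wgrad accumulation: the weight gradient GEMM accumulates directly
# into weight.main_grad instead of materializing a separate weight.grad.
layer = te.Linear(1024, 1024, fuse_wgrad_accumulation=True)

# The caller owns the main_grad buffers, as Mcore's DDP/FSDP grad buffers do.
for p in layer.parameters():
    p.main_grad = torch.zeros_like(p, dtype=torch.float32)

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
layer(x).sum().backward()  # weight gradient lands in layer.weight.main_grad
```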