
Add Hybrid Transformer block fusion#4463

Open
janEbert wants to merge 10 commits into NVIDIA:main from janEbert:hybrid-transformer-fusion

Conversation

@janEbert
Contributor

@janEbert janEbert commented Apr 24, 2026

Add a fusion operator, [...], to --hybrid-layer-pattern, which packs consecutive sequence-mixer + channel-mixer operations into a single TransformerLayer. This way, we have stronger expectations about the operations inside a layer and can leverage the existing optimizations in TransformerLayer.
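For illustration, here is a minimal sketch of how such a bracket operator could be parsed. The pattern symbols (`M` = Mamba mixer, `*` = attention, `-` = MLP) and the function name are assumptions for this example, not this PR's actual implementation:

```python
def parse_hybrid_pattern(pattern: str) -> list[str]:
    """Split a hybrid layer pattern into per-layer groups.

    Unbracketed symbols become single-op layers; a bracketed group such
    as "[M-]" becomes one fused TransformerLayer containing a sequence
    mixer followed by a channel mixer.
    """
    layers = []
    i = 0
    while i < len(pattern):
        if pattern[i] == "[":
            end = pattern.index("]", i)
            layers.append(pattern[i + 1:end])  # fused group, e.g. "M-"
            i = end + 1
        else:
            layers.append(pattern[i])  # single, unfused op
            i += 1
    return layers

# "[M-][*-]": two fused layers (Mamba+MLP, attention+MLP);
# "M-*-": the same ops as four separate, unfused layers.
assert parse_hybrid_pattern("[M-][*-]") == ["M-", "*-"]
assert parse_hybrid_pattern("M-*-") == ["M", "-", "*", "-"]
```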

For checkpointing, we save the model as if it were unfused, giving us a canonical format and backward compatibility. A fused model can then be loaded from an unfused checkpoint. The state-dict transformations are applied only when saving or loading.
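A minimal sketch of the save-time key mapping, assuming fused parameters live under `mixer.`/`mlp.` sub-prefixes and that every entry in the pattern is a fused pair; the key layout here is hypothetical, and the PR applies the real mapping in `generate_state_dict`:

```python
import re

def fused_to_canonical(fused_sd: dict) -> dict:
    """Map a fused state dict onto the canonical (unfused) key layout.

    Fused layer i is written out as two unfused layers: its sequence
    mixer at index 2*i and its channel mixer at index 2*i + 1.
    """
    out = {}
    for key, value in fused_sd.items():
        m = re.match(r"layers\.(\d+)\.(mixer|mlp)\.(.*)", key)
        if m is None:
            out[key] = value  # embeddings, final norm, etc.
            continue
        i, part, rest = int(m.group(1)), m.group(2), m.group(3)
        j = 2 * i if part == "mixer" else 2 * i + 1
        out[f"layers.{j}.{part}.{rest}"] = value
    return out
```

Loading applies the inverse mapping, so existing unfused checkpoints remain loadable into a fused model.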

@janEbert janEbert requested review from a team as code owners April 24, 2026 19:44
@svcnvidia-nemo-ci svcnvidia-nemo-ci marked this pull request as draft April 24, 2026 19:44
@github-actions
Contributor

This PR has been automatically converted to draft because all PRs must start as drafts.

When you are ready for review, click Ready for Review to begin the review process. This will:

  1. Add the oncall reviewer (optional reviewer)
  2. Add required review teams based on your changes

See the contribution guide for more details.

@copy-pr-bot

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 24, 2026
@janEbert janEbert changed the title Hybrid transformer fusion Add Hybrid Transformer block fusion Apr 27, 2026
janEbert and others added 3 commits April 27, 2026 19:34
Should yield the best forward and backward compatibility. Unfused
checkpoints can be loaded into a fused model.
Do not modify the `sharded_state_dict` method; instead, use
`generate_state_dict` so that the transformation is applied only when
loading/saving.
@janEbert janEbert force-pushed the hybrid-transformer-fusion branch from 626f895 to b7e2b05 Compare April 27, 2026 17:34
@janEbert janEbert force-pushed the hybrid-transformer-fusion branch from 62292ad to c5ca092 Compare April 27, 2026 17:37
@janEbert janEbert marked this pull request as ready for review April 27, 2026 17:37
@janEbert janEbert requested review from a team as code owners April 27, 2026 17:37
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 27, 2026 17:37
Member

@Phlip79 Phlip79 left a comment


  • Should we reject MTP usage within brackets?
  • Can you add a working forward/backward test? TestFusedLayerValidation only covers the failure path; one possible shape is sketched below.
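A hypothetical sketch of such a test, assuming a `build_hybrid_model` helper and the bracket pattern from the PR description; neither is taken from the repository's actual test fixtures:

```python
import torch

def test_fused_layer_forward_backward(build_hybrid_model):
    # `build_hybrid_model` is a hypothetical fixture, not an existing one.
    model = build_hybrid_model(hybrid_layer_pattern="[M-][*-]")
    x = torch.randn(2, 16, model.config.hidden_size, requires_grad=True)

    # Forward: fusion must not change the module's interface.
    out = model(x)
    assert out.shape == x.shape

    # Backward: gradients must flow through the fused block.
    out.sum().backward()
    assert x.grad is not None and torch.isfinite(x.grad).all()
```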

janEbert and others added 2 commits April 28, 2026 13:53
- Disallowed in the same way as the pipe symbol.
- Mentioned and handled more explicitly.
@janEbert janEbert force-pushed the hybrid-transformer-fusion branch from 3fefa13 to ed1aeee Compare April 28, 2026 16:06
@janEbert
Contributor Author

Great points! Addressed both.

