Honor COMPACT data_format for FP8 blockwise scales in MoE up-projection path to remove 5× redundant rowwise_scale_inv.T.contiguous() passes #2199
Description
In Megatron-Core + Transformer Engine (TE), we quantize activations to FP8 before the MoE up-projection and then run the dispatch. This is compatible with TE’s FP8 fprop for MoE up-projections. However, the current code path implicitly assumes a GEMM-ready scale layout and ends up transposing scales multiple times.
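For a sense of the cost: each such pass forces a full copy of the scale tensor. A minimal PyTorch illustration (the shapes are invented for a 1x128 blockwise example):

```python
import torch

# Illustrative only: rowwise scale-inverses for a [4096, 7168] activation with
# 1x128 blocks, stored compactly as one scale per block along each row.
rowwise_scale_inv = torch.rand(4096, 56)

# The GEMM-ready layout is the transpose. `.contiguous()` on a transposed view
# launches a copy kernel and allocates a second scale buffer, so every extra
# pass like this costs both time and memory on the critical path.
gemm_ready = rowwise_scale_inv.T.contiguous()
assert gemm_ready.data_ptr() != rowwise_scale_inv.data_ptr()
```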
Pipeline (FP8 forward path), sketched below:

(1) Quantize
(2) DeepEP dispatch
(3) Create Float8BlockwiseQTensor
(4) Permute
(5) GroupedLinear fprop (dequantize)
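A toy, runnable sketch of where the three passes fire today. Every function here is a placeholder for the real Megatron-Core/TE call site, and which step wants which scale layout is an assumption drawn from the description above:

```python
import torch

def quantize(x):
    # (1) Toy stand-in: compute compact 1x128 blockwise scale factors, then
    # force the GEMM-ready transpose.
    scale_inv = x.abs().unflatten(1, (-1, 128)).amax(-1)  # compact layout
    return x, scale_inv.T.contiguous()                    # redundant pass #1

def deepep_dispatch(data, scale_inv):
    # (2) DeepEP ships the buffers across ranks; assumed here to want the
    # compact (row-major) layout for transport.
    return data, scale_inv.T.contiguous()

def make_qtensor(data, scale_inv):
    # (3) Rebuild the Float8BlockwiseQTensor, forcing GEMM-ready layout again.
    return data, scale_inv.T.contiguous()                 # redundant pass #2

def permute(data, scale_inv, order):
    # (4) Token permutation: compact rows permute cheaply, but the GEMM-ready
    # transpose is re-materialized afterwards.
    return data[order], scale_inv.T[order].T.contiguous() # redundant pass #3

x = torch.randn(8, 256)
data, s = quantize(x)
data, s = deepep_dispatch(data, s)
data, s = make_qtensor(data, s)
data, s = permute(data, s, torch.randperm(8))
# (5) GroupedLinear fprop would consume (data, s) here.
```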
For Float8BlockQuantizer / Float8BlockwiseQTensor, honor data_format == tex.Float8BlockScaleTensorFormat.COMPACT in this path and keep scales in the compact layout, instead of always materializing rowwise_scale_inv.T.contiguous().
Using COMPACT here eliminates three rowwise_scale_inv.T.contiguous() passes across the pipeline, in steps (1), (3), and (4). This removes needless transposes/copies from the critical path, improving end-to-end MFU and reducing peak GPU memory usage during training; a sketch of the intended check follows.
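A minimal sketch of the intended behavior. The enum comes from TE as cited above, while the helper itself and the `_data_format` / `_rowwise_scale_inv` attribute names are assumptions about TE internals:

```python
import transformer_engine_torch as tex

def rowwise_scales_for_fprop(qtensor):
    """Hypothetical helper: return scales in the layout the quantizer asked for.

    When the quantizer was built with COMPACT data_format, hand the compact
    buffer through unchanged instead of materializing the GEMM-ready
    transpose at every pipeline step.
    """
    if qtensor._data_format == tex.Float8BlockScaleTensorFormat.COMPACT:
        return qtensor._rowwise_scale_inv              # zero-copy passthrough
    return qtensor._rowwise_scale_inv.T.contiguous()   # legacy GEMM_READY path
```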
The other two rowwise_scale_inv.T.contiguous() passes, in step (4) (creating the Float8BlockwiseQTensor) and step (5), could be eliminated as well, but that requires coordinated changes to GroupedLinear and is not included in this PR.