feat(attention): Add rotary_base_per_layer for Step-3.5-Flash #4473
shifangx wants to merge 7 commits into NVIDIA:dev
Force-pushed 3d6565f to 422246b
…r for Step-3.5-Flash

Adds two new optional, off-by-default features to TransformerConfig and SelfAttention to faithfully represent the Step-3.5-Flash architecture.

- attention_per_head_gate: adds a separate ColumnParallelLinear(hidden_size -> num_attention_heads) whose sigmoid output gates each head independently (Step-3.5-Flash g_proj). Applied after core attention, before linear_proj.
- rotary_base_per_layer: Optional[List[float]] -- per-layer RoPE theta values. When set, each SelfAttention creates its own RotaryEmbedding; the shared model-level rotary_pos_emb in GPTModel is not created.

Both features default to False/None and have no effect on existing models.

Co-Authored-By: Claude Sonnet 4.6 <[email protected]>
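For illustration only, a minimal sketch of the per-head gating described in the commit message. It is not the PR's code: `g_proj` is the name given in the commit message, but the tensor shapes, the use of a plain `nn.Linear` instead of `ColumnParallelLinear`, and the exact insertion point are assumptions.

```python
import torch
import torch.nn as nn


class PerHeadGateSketch(nn.Module):
    """Hypothetical stand-in for the Step-3.5-Flash g_proj head gate.

    A plain nn.Linear replaces ColumnParallelLinear so the sketch runs
    without Megatron; the real feature is tensor-parallel.
    """

    def __init__(self, hidden_size: int, num_attention_heads: int):
        super().__init__()
        self.g_proj = nn.Linear(hidden_size, num_attention_heads)

    def forward(self, hidden_states: torch.Tensor, core_attn_out: torch.Tensor) -> torch.Tensor:
        # hidden_states:  [seq, batch, hidden_size]
        # core_attn_out:  [seq, batch, num_heads, head_dim] (after core attention, before linear_proj)
        gate = torch.sigmoid(self.g_proj(hidden_states))   # [seq, batch, num_heads]
        return core_attn_out * gate.unsqueeze(-1)           # gate each head independently
```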
Force-pushed 422246b to 6f80c19
PR adds two opt-in features (use_head_wise_attn_gate + rotary_base_per_layer) for Step-3.5-Flash; defaults are off, existing models unaffected.
Main suggestions (see inline comments for details):
1. [CRITICAL] Fold `use_head_wise_attn_gate` into `linear_qkv` and merge it with the existing `attention_output_gate` path. Share `_split_qkv` / `_apply_output_gate`; broadcasting handles both gate shapes, so no new branch is needed (see the sketch after this list). Benefits:
   - Eliminates the TE/local backend divergence in `g_proj` math (`submodules.linear_qkv` resolves to `TELayerNormColumnParallelLinear` under TE, which adds an independent learnable LayerNorm).
   - Avoids two near-parallel "output gating" implementations coexisting long-term and drifting apart.
   - Saves one GEMM kernel launch.
   - Makes the two gate flags naturally mutually exclusive.
2. [IMPORTANT] The `rotary_base_per_layer` forward override depends on the model-level `rotary_pos_emb` already existing, contradicting the PR description's claim that "model-level rotary_pos_emb is not created": this PR does not modify `GPTModel`, so both rotaries actually coexist.
3. [IMPORTANT] `_build_per_layer_rotary_pos_emb` duplicates the rotary-construction logic from `gpt_model.py`; recommend extracting a shared factory function.
4. [IMPORTANT] No test coverage at all for `rotary_base_per_layer`.
5/6. [SUGGESTION] `assert False, "Invalid position embedding type"` is uninformative; `getattr(self.config, 'rotary_base_per_layer', None)` is unnecessary (the field is added by this PR).
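A minimal sketch of the broadcasting point in suggestion 1. This is an illustration, not the PR's code: the helper name `apply_output_gate`, the sigmoid placement, and the gate shapes are assumptions; a per-element gate and a per-head gate can both multiply the attention output through the same path.

```python
import torch


def apply_output_gate(core_attn_out: torch.Tensor, gate: torch.Tensor) -> torch.Tensor:
    """Single gating path: broadcasting covers both gate shapes.

    core_attn_out: [seq, batch, num_heads, head_dim]
    gate:          [seq, batch, num_heads, head_dim]  (elementwise attention_output_gate)
                or [seq, batch, num_heads, 1]         (per-head gate)
    """
    return core_attn_out * torch.sigmoid(gate)


s, b, h, d = 4, 2, 8, 16
out = torch.randn(s, b, h, d)
elementwise_gate = torch.randn(s, b, h, d)   # e.g. split out of a fused linear_qkv output
per_head_gate = torch.randn(s, b, h, 1)      # one scalar per head
assert apply_output_gate(out, elementwise_gate).shape == (s, b, h, d)
assert apply_output_gate(out, per_head_gate).shape == (s, b, h, d)
```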
…isting attention_output_gate path.
/ok to test 6f80c19

@shifangx, there was an error processing your request: See the following link for more information: https://docs.gha-runners.nvidia.com/cpr/e/2/
Force-pushed f728d71 to 9bd8555
Force-pushed 9bd8555 to 6ad0ceb
/ok to test 7485da4
Force-pushed c943f32 to c75616a
/ok to test c75616a
Force-pushed c75616a to 5671195
/ok to test 5671195
Force-pushed c250903 to bea28f6
/ok to test bea28f6
Please also submit a PR to the main branch, Thanks!
What does this PR do?
feat(attention): Add rotary_base_per_layer for Step-3.5-Flash
When set, each SelfAttention creates its own RotaryEmbedding
There are quite a few places that assume the GPT model includes RotaryEmbedding by default, so this PR needs to keep RotaryEmbedding within the GPT model.
The new feature defaults to None and has no effect on existing models.
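For illustration, a rough sketch of how a per-layer rotary base might be consumed. The field name `rotary_base_per_layer` comes from this PR; the example values, the 1-based layer indexing, and the `RotaryEmbedding` keyword mentioned in the comment are assumptions and may differ from the actual code.

```python
# Hypothetical usage sketch: one RoPE theta per transformer layer.
num_layers = 4
rotary_base_per_layer = [10_000.0, 10_000.0, 1_000_000.0, 1_000_000.0]
assert len(rotary_base_per_layer) == num_layers

for layer_number in range(1, num_layers + 1):        # layer numbers assumed 1-based
    theta = rotary_base_per_layer[layer_number - 1]
    # Each SelfAttention would build its own rotary embedding with this theta,
    # e.g. something along the lines of RotaryEmbedding(..., rotary_base=theta).
    print(f"layer {layer_number}: rotary base {theta}")
```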
Issue tracking
For PRs from open-source community contributors:
Linked issue:
Contribution process
Pre-checks
Code review
Feel free to message or tag @mcore-oncall to help accelerate your merge into main. The less complex your PR is, the faster it will be approved and merged!
All PRs start as draft. If you open a non-draft PR, it will be automatically converted to draft.
Step 1: Mark PR as "Ready for Review"
.github/CODEOWNERS. Final Review might get declined if these requirements are not fulfilled.
Step 2: Final Review
For PRs that change megatron/core, once all expert reviewers have approved, the Final Review label is applied automatically and final reviewers are assigned. For PRs outside megatron/core, this step is skipped.
Step 3: Approved
Once all required reviewers have approved, the Approved label is applied automatically.
Merge
Any member of mcore-engineers will be able to merge your PR.
For MRs into `dev` branch
The proposed review process for the `dev` branch is under active discussion. MRs are mergeable after one approval by either [email protected] or [email protected].