docs(moe): correct moe_router_topk_scaling_factor docstring #4470

Open
edenfunf wants to merge 2 commits into NVIDIA:main from edenfunf:fix/1875-moe-topk-scaling-factor-docstring

Conversation

@edenfunf

Summary

Fixes #1875.

The docstring of moe_router_topk_scaling_factor (transformer_config.py:694) said it "only works when moe_router_pre_softmax enabled", but the code in topk_routing_with_score_function (moe_utils.py:818) applies the scaling unconditionally to the post-top-k probabilities, for every score_function and regardless of moe_router_pre_softmax:

if scaling_factor:
    probs = probs * scaling_factor
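
For concreteness, here is a minimal, self-contained sketch of that contract; toy_topk_routing is a made-up illustration of the order of operations, not the Megatron function:

import torch

def toy_topk_routing(logits, topk, use_pre_softmax, scaling_factor):
    """Toy stand-in: scaling happens AFTER top-k, for either pre_softmax setting."""
    if use_pre_softmax:
        scores = torch.softmax(logits, dim=-1)            # softmax over all experts first
        probs, indices = torch.topk(scores, topk, dim=-1)
    else:
        top_logits, indices = torch.topk(logits, topk, dim=-1)
        probs = torch.softmax(top_logits, dim=-1)         # softmax over the selected k only
    if scaling_factor:                                    # unconditional post-top-k scaling,
        probs = probs * scaling_factor                    # mirroring the moe_utils.py snippet
    return probs, indices

logits = torch.randn(4, 8)  # 4 tokens, 8 experts
for pre in (True, False):
    base, idx = toy_topk_routing(logits, 2, pre, None)
    scaled, idx2 = toy_topk_routing(logits, 2, pre, 2.5)
    assert torch.equal(idx, idx2)               # expert selection unchanged
    assert torch.allclose(scaled, base * 2.5)   # probs scaled by exactly the factor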

Several shipped recipes confirm that the current code behavior is the intended contract, since each sets a non-None scaling factor while keeping the default moe_router_pre_softmax=False:

  • examples/post_training/modelopt/conf/deepseek-ai/DeepSeek-R1.sh (2.5)
  • examples/post_training/modelopt/conf/moonshotai/Kimi-K2-Instruct.sh (2.827)
  • examples/post_training/modelopt/conf/meta-llama/Llama-4-Scout-17B-16E-Instruct.sh (1.0)
  • examples/post_training/modelopt/conf/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.sh (5.0)
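
In TransformerConfig terms, each of these recipes amounts to a combination like the following (a hedged sketch: the model dimensions are placeholder values, and some Megatron-Core versions may require additional fields to construct the config):

from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,                         # placeholder
    hidden_size=128,                      # placeholder
    num_attention_heads=8,                # placeholder
    num_moe_experts=8,                    # placeholder
    moe_router_topk=2,                    # placeholder
    moe_router_topk_scaling_factor=2.5,   # as in the DeepSeek-R1 recipe
    moe_router_pre_softmax=False,         # the default; scaling still applies
)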

So the fix is on the docs side, not the code side.

Changes

  • megatron/core/transformer/transformer_config.py — rewrite the docstring to describe the actual unconditional post-top-k scaling behavior (a possible wording is sketched after this list).
  • tests/unit_tests/transformer/moe/test_routers.py — add test_topk_scaling_factor_applies_for_all_pre_softmax_settings, a parametrized unit test that pins the contract over the cross product of {use_pre_softmax: True/False} × {score_function: softmax/sigmoid}.
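
For reference, the corrected docstring reads along these lines (paraphrased from the change description, so the exact wording and type annotation in the diff may differ):

moe_router_topk_scaling_factor: Optional[float] = None
"""Scaling factor applied to the routing probabilities after top-k
selection. Applied for every score_function and regardless of
moe_router_pre_softmax; None (the default) disables scaling."""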

Test plan

Run on a Linux box / CI:

pytest tests/unit_tests/transformer/moe/test_routers.py::test_topk_scaling_factor_applies_for_all_pre_softmax_settings -v

Local result (4/4 PASS, ~3.3 s, CPU-only; no CUDA or process group required):

test_topk_scaling_factor_applies_for_all_pre_softmax_settings[softmax-True]  PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[softmax-False] PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[sigmoid-True]  PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[sigmoid-False] PASSED

The test asserts:

  1. The selected expert top-k mask is identical with and without scaling (the factor is applied after top-k selection, and is monotonic in any case, so the choice of experts is preserved).
  2. The non-zero entries of the routing-probability tensor differ by exactly the scaling factor.
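
A condensed, self-contained sketch of this test logic, written against a toy router like the one in the Summary so it runs anywhere (the real test in test_routers.py exercises the actual Megatron routing code):

import pytest
import torch

def toy_route(logits, topk, score_function, use_pre_softmax, scaling_factor):
    # Toy stand-in for the routing under test (illustration only).
    if score_function == "softmax" and use_pre_softmax:
        probs, idx = torch.topk(torch.softmax(logits, dim=-1), topk, dim=-1)
    elif score_function == "softmax":
        top, idx = torch.topk(logits, topk, dim=-1)
        probs = torch.softmax(top, dim=-1)
    else:  # "sigmoid" is element-wise, so the pre/post top-k order is immaterial
        probs, idx = torch.topk(torch.sigmoid(logits), topk, dim=-1)
    if scaling_factor:
        probs = probs * scaling_factor
    # Scatter into a dense [tokens, experts] map so masks are easy to compare.
    return torch.zeros_like(logits).scatter_(-1, idx, probs)

@pytest.mark.parametrize("score_function", ["softmax", "sigmoid"])
@pytest.mark.parametrize("use_pre_softmax", [True, False])
def test_topk_scaling_factor_applies_for_all_pre_softmax_settings(
    score_function, use_pre_softmax
):
    torch.manual_seed(0)
    logits = torch.randn(16, 8)  # 16 tokens, 8 experts
    factor = 2.5
    base = toy_route(logits, 2, score_function, use_pre_softmax, None)
    scaled = toy_route(logits, 2, score_function, use_pre_softmax, factor)
    # 1. The selected expert mask is identical with and without scaling.
    assert torch.equal(base != 0, scaled != 0)
    # 2. Non-zero routing probs differ by exactly the scaling factor.
    assert torch.allclose(scaled[scaled != 0], base[base != 0] * factor)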

If a future change ever re-introduces the "only when pre_softmax" gating, this test will fail in the pre_softmax=False cases, surfacing the silent breakage in production recipes.

The docstring stated the scaling factor "only works when moe_router_pre_softmax
enabled", but topk_routing_with_score_function applies it unconditionally on the
post-top-k probabilities for every score_function and pre_softmax setting. All
shipped recipes (DeepSeek-R1, Kimi-K2, Llama-4, Nemotron, etc.) rely on this by
combining a non-None scaling factor with the default moe_router_pre_softmax=False.

Update the docstring to describe the actual behavior, and add a parametrized unit
test in test_routers.py that pins the contract across {pre_softmax, score_function}.
@copy-pr-bot

copy-pr-bot Bot commented Apr 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaox12 yaox12 marked this pull request as ready for review April 27, 2026 02:56
@yaox12 yaox12 requested review from a team as code owners April 27, 2026 02:56
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 27, 2026 02:56
@yaox12
Member

yaox12 commented Apr 27, 2026

/ok to test 864f518

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 27, 2026

@yaox12 yaox12 left a comment (Member)

LGTM. Docstring- and test-only changes.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer and waiting-on-maintainers labels and removed the waiting-on-customer label Apr 27, 2026
Development

Successfully merging this pull request may close these issues.

--moe-router-topk-scaling-factor argument works wrongly
