docs(moe): correct moe_router_topk_scaling_factor docstring #4470

Open
edenfunf wants to merge 2 commits into NVIDIA:main from edenfunf:fix/1875-moe-topk-scaling-factor-docstring

Conversation

@edenfunf

Summary

Fixes #1875.

The docstring of moe_router_topk_scaling_factor (transformer_config.py:694) said it "only works when moe_router_pre_softmax enabled", but the code in topk_routing_with_score_function (moe_utils.py:818) applies the scaling unconditionally to the post-top-k probabilities, for every score_function and regardless of moe_router_pre_softmax:

if scaling_factor:
    probs = probs * scaling_factor
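
For concreteness, here is a minimal, self-contained sketch of that contract; toy_topk_routing is a made-up illustration of the order of operations, not the Megatron function:

import torch

def toy_topk_routing(logits, topk, use_pre_softmax, scaling_factor):
    """Toy stand-in: scaling happens AFTER top-k, for either pre_softmax setting."""
    if use_pre_softmax:
        scores = torch.softmax(logits, dim=-1)            # softmax over all experts first
        probs, indices = torch.topk(scores, topk, dim=-1)
    else:
        top_logits, indices = torch.topk(logits, topk, dim=-1)
        probs = torch.softmax(top_logits, dim=-1)         # softmax over the selected k only
    if scaling_factor:                                    # unconditional post-top-k scaling,
        probs = probs * scaling_factor                    # mirroring the moe_utils.py snippet
    return probs, indices

logits = torch.randn(4, 8)  # 4 tokens, 8 experts
for pre in (True, False):
    base, idx = toy_topk_routing(logits, 2, pre, None)
    scaled, idx2 = toy_topk_routing(logits, 2, pre, 2.5)
    assert torch.equal(idx, idx2)               # expert selection unchanged
    assert torch.allclose(scaled, base * 2.5)   # probs scaled by exactly the factor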

Several shipped recipes confirm that the current code behavior is the intended contract, since each sets a non-None scaling factor while keeping the default moe_router_pre_softmax=False:

  • examples/post_training/modelopt/conf/deepseek-ai/DeepSeek-R1.sh (2.5)
  • examples/post_training/modelopt/conf/moonshotai/Kimi-K2-Instruct.sh (2.827)
  • examples/post_training/modelopt/conf/meta-llama/Llama-4-Scout-17B-16E-Instruct.sh (1.0)
  • examples/post_training/modelopt/conf/nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.sh (5.0)
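
In TransformerConfig terms, each of these recipes amounts to a combination like the following (a hedged sketch: the model dimensions are placeholder values, and some Megatron-Core versions may require additional fields to construct the config):

from megatron.core.transformer.transformer_config import TransformerConfig

config = TransformerConfig(
    num_layers=2,                         # placeholder
    hidden_size=128,                      # placeholder
    num_attention_heads=8,                # placeholder
    num_moe_experts=8,                    # placeholder
    moe_router_topk=2,                    # placeholder
    moe_router_topk_scaling_factor=2.5,   # as in the DeepSeek-R1 recipe
    moe_router_pre_softmax=False,         # the default; scaling still applies
)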

So the fix is on the docs side, not the code side.

Changes

  • megatron/core/transformer/transformer_config.py — rewrite the docstring to describe the actual unconditional post-top-k scaling behavior (a possible wording is sketched after this list).
  • tests/unit_tests/transformer/moe/test_routers.py — add test_topk_scaling_factor_applies_for_all_pre_softmax_settings, a parametrized unit test that pins the contract over the cross product of {use_pre_softmax: True/False} × {score_function: softmax/sigmoid}.
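
For reference, the corrected docstring reads along these lines (paraphrased from the change description, so the exact wording and type annotation in the diff may differ):

moe_router_topk_scaling_factor: Optional[float] = None
"""Scaling factor applied to the routing probabilities after top-k
selection. Applied for every score_function and regardless of
moe_router_pre_softmax; None (the default) disables scaling."""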

Test plan

Run on a Linux box / CI:

pytest tests/unit_tests/transformer/moe/test_routers.py::test_topk_scaling_factor_applies_for_all_pre_softmax_settings -v

Local result (4/4 PASS, ~3.3 s, CPU-only; no CUDA or process group required):

test_topk_scaling_factor_applies_for_all_pre_softmax_settings[softmax-True]  PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[softmax-False] PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[sigmoid-True]  PASSED
test_topk_scaling_factor_applies_for_all_pre_softmax_settings[sigmoid-False] PASSED

The test asserts:

  1. The selected expert top-k mask is identical with and without scaling (the factor is applied after top-k selection, and is monotonic in any case, so the choice of experts is preserved).
  2. The non-zero entries of the routing-probability tensor differ by exactly the scaling factor.
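
A condensed, self-contained sketch of this test logic, written against a toy router like the one in the Summary so it runs anywhere (the real test in test_routers.py exercises the actual Megatron routing code):

import pytest
import torch

def toy_route(logits, topk, score_function, use_pre_softmax, scaling_factor):
    # Toy stand-in for the routing under test (illustration only).
    if score_function == "softmax" and use_pre_softmax:
        probs, idx = torch.topk(torch.softmax(logits, dim=-1), topk, dim=-1)
    elif score_function == "softmax":
        top, idx = torch.topk(logits, topk, dim=-1)
        probs = torch.softmax(top, dim=-1)
    else:  # "sigmoid" is element-wise, so the pre/post top-k order is immaterial
        probs, idx = torch.topk(torch.sigmoid(logits), topk, dim=-1)
    if scaling_factor:
        probs = probs * scaling_factor
    # Scatter into a dense [tokens, experts] map so masks are easy to compare.
    return torch.zeros_like(logits).scatter_(-1, idx, probs)

@pytest.mark.parametrize("score_function", ["softmax", "sigmoid"])
@pytest.mark.parametrize("use_pre_softmax", [True, False])
def test_topk_scaling_factor_applies_for_all_pre_softmax_settings(
    score_function, use_pre_softmax
):
    torch.manual_seed(0)
    logits = torch.randn(16, 8)  # 16 tokens, 8 experts
    factor = 2.5
    base = toy_route(logits, 2, score_function, use_pre_softmax, None)
    scaled = toy_route(logits, 2, score_function, use_pre_softmax, factor)
    # 1. The selected expert mask is identical with and without scaling.
    assert torch.equal(base != 0, scaled != 0)
    # 2. Non-zero routing probs differ by exactly the scaling factor.
    assert torch.allclose(scaled[scaled != 0], base[base != 0] * factor)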

If a future change ever re-introduces the "only when pre_softmax" gating, this test will fail in the pre_softmax=False cases, surfacing the silent breakage in production recipes.

The docstring stated the scaling factor "only works when moe_router_pre_softmax
enabled", but topk_routing_with_score_function applies it unconditionally on the
post-top-k probabilities for every score_function and pre_softmax setting. All
shipped recipes (DeepSeek-R1, Kimi-K2, Llama-4, Nemotron, etc.) rely on this by
combining a non-None scaling factor with the default moe_router_pre_softmax=False.

Update the docstring to describe the actual behavior, and add a parametrized unit
test in test_routers.py that pins the contract across {pre_softmax, score_function}.
@copy-pr-bot

copy-pr-bot Bot commented Apr 25, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yaox12 yaox12 marked this pull request as ready for review April 27, 2026 02:56
@yaox12 yaox12 requested review from a team as code owners April 27, 2026 02:56
@svcnvidia-nemo-ci svcnvidia-nemo-ci requested a review from a team April 27, 2026 02:56
@yaox12
Member

yaox12 commented Apr 27, 2026

/ok to test 864f518

@svcnvidia-nemo-ci svcnvidia-nemo-ci added this to the Core 0.16 milestone Apr 27, 2026

@yaox12 yaox12 left a comment (Member)

LGTM. Docstring- and test-only changes.

@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer and waiting-on-maintainers labels and removed the waiting-on-customer label Apr 27, 2026
Development

Successfully merging this pull request may close these issues.

--moe-router-topk-scaling-factor argument works wrongly
