wip MoE refactor #2600

Open · wants to merge 1 commit into main

Conversation

HDCharles (Contributor):

Summary:

Now that the PyTorch grouped_mm kernels no longer require padding, this refactors the MoE implementation to use them instead of the previous approach (see the sketch after the TODO list).

DONE
- implement MoE with grouped_mm [x]
- add handling for generic module swap to AOQuantizable (MoEMapping) [x]
- refactor MoEQuantConfig to swap generic modules [x]

TODO
- add dispatch from grouped_mm to the linear decomposition of the quantized kernel
- compare the old linear decomposition vs the new linear decomposition vs grouped_mm for eager, compile, and autotuned compile
- compare the old linear decomposition vs the new linear decomposition for quantized kernels
- add scaled_group_gemm and fbgemm kernels (probably in a new PR)
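
For reference, a minimal sketch of what the grouped_mm-based expert computation could look like. This is not the PR's actual code: the function name, weight layout, and activation are assumptions, and the exact torch._grouped_mm signature and constraints can differ between PyTorch versions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch, not the PR's code: names, weight layout, and activation
# are assumptions, and torch._grouped_mm's exact signature/constraints can
# differ between PyTorch versions.
def moe_experts_grouped_mm(
    x_sorted: torch.Tensor,  # (total_tokens, hidden), tokens pre-sorted by expert, bf16
    counts: torch.Tensor,    # (num_experts,), number of tokens routed to each expert
    w_up: torch.Tensor,      # (num_experts, hidden, intermediate), bf16
    w_down: torch.Tensor,    # (num_experts, intermediate, hidden), bf16
) -> torch.Tensor:
    # grouped_mm takes cumulative end offsets per expert, so no per-group padding
    offs = torch.cumsum(counts, dim=0, dtype=torch.int32)
    h = torch._grouped_mm(x_sorted, w_up, offs=offs)  # per-expert up projection
    h = F.silu(h)                                     # placeholder activation
    return torch._grouped_mm(h, w_down, offs=offs)    # per-expert down projection
```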

ISSUE:
The autotuned grouped_mm kernels don't give correct output, yet the same code works in eager and with compile using reduce-overhead. Why?

See the new_run.log output: the first two runs are fine, but line 144 is nonsense.

Test Plan:

sh run.sh

Reviewers:

Subscribers:

Tasks:

Tags:

pytorch-bot bot commented Jul 25, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2600

Note: Links to docs will display an error until the docs builds have been completed.

❗ 1 Active SEV

There is 1 currently active SEV. If your PR is affected, please view it below:

❌ 11 New Failures, 1 Cancelled Job

As of commit d41d7b9 with merge base 0e00df3:

NEW FAILURES - The following jobs have failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Jul 25, 2025.
HDCharles requested a review from alexsamardzic on Jul 25, 2025 at 02:30.
alexsamardzic (Collaborator):

This one replaces #2325, right?

I'm struggling to run the run.sh script (i.e., the generate.py script); I keep getting "CUDA out of memory" errors on an H100. Are you using PyTorch built from source, and if not, which PyTorch package version do you have installed?

Would you mind finding the mm_grouped.py file in your PyTorch installation, changing the can_use_triton_kernel() function there to just return False, and then re-trying? This will force the eager (non-Triton) version of the grouped MM kernel to be used even for max-autotune; I suspect the garbage output may come not from the grouped MM Triton kernel itself but from max-autotuning the whole layer, and this would test that.
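
If editing the installed file is awkward, a monkey-patch along these lines should have the same effect; the module path below is an assumption based on the file name mentioned above and may differ between PyTorch versions. Apply it before any torch.compile run so the kernel choice isn't already cached.

```python
# Assumed module path for the mm_grouped.py mentioned above; adjust if it
# lives elsewhere in your PyTorch build. Apply before compiling.
import torch._inductor.kernel.mm_grouped as mm_grouped

mm_grouped.can_use_triton_kernel = lambda *args, **kwargs: False  # force the eager grouped MM fallback
```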

As a side note, it seems that MoEFeedForwardAOQuantizable should be imported for this and this.

HDCharles (Contributor, Author) commented Jul 26, 2025:

> This one replaces #2325, right?
>
> I'm struggling to run the run.sh script (i.e., the generate.py script); I keep getting "CUDA out of memory" errors on an H100. Are you using PyTorch built from source, and if not, which PyTorch package version do you have installed?
>
> Would you mind finding the mm_grouped.py file in your PyTorch installation, changing the can_use_triton_kernel() function there to just return False, and then re-trying? This will force the eager (non-Triton) version of the grouped MM kernel to be used even for max-autotune; I suspect the garbage output may come not from the grouped MM Triton kernel itself but from max-autotuning the whole layer, and this would test that.
>
> As a side note, it seems that MoEFeedForwardAOQuantizable should be imported for this and this.

Can you run it with batch_size 1?

I'll try the fix.

Yeah, I haven't done the quantization dispatch stuff yet.

"""Configuration for applying quantization to MoE
Args:
`base_config`: normal AO Config
class DummyModule(torch.nn.Module):

Contributor:

I think a better solution is to make torchao APIs work on parameters. The current workaround is fine for prototype, but we'd want more proper support for non-prototype.
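
For readers following along, a minimal, hypothetical sketch of the wrapper pattern under discussion: wrap a bare expert-weight parameter in a throwaway module so a module-targeted config can be applied to it, then read the converted weight back. The PR's actual DummyModule and the quantize_ flow shown in the comments here are assumptions and may differ from the real code.

```python
import torch
from torch import nn

class DummyModule(nn.Module):
    # Hypothetical sketch of the workaround, not the PR's actual class:
    # expose a bare parameter as `.weight` so module-targeted quantization
    # configs can be applied to it.
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        self.weight = nn.Parameter(weight, requires_grad=False)

# usage sketch (the flow is assumed, not taken from the PR):
#   dummy = DummyModule(expert_weight)
#   quantize_(dummy, base_config, filter_fn=lambda m, fqn: isinstance(m, DummyModule))
#   expert_weight = dummy.weight
```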

@@ -310,7 +310,7 @@ def apply_rotary_emb(x: Tensor, freqs_cis: Tensor) -> Tensor:
     # T'(e) tokens for expert e


-class MOEFeedForwardAOQuantizable(nn.Module):
+class MoEFeedForwardAOQuantizable(nn.Module):

Contributor:

It seems unlikely that people are going to swap their MoE module to AO's version. Can we just target torch._grouped_mm calls directly without requiring a module swap?

Collaborator:

What would it mean to "target" it specifically? If the model is compiled, the compiled version of this operator will be used anyway; I'm not sure what else torchao could do about it...
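
One illustrative reading of "targeting" the op directly (purely a sketch, not something proposed in the PR): intercept torch._grouped_mm at the torch-function level so no module swap is needed; a real version would substitute a quantized grouped GEMM inside the branch instead of falling through.

```python
import torch
from torch.overrides import TorchFunctionMode

class GroupedMMInterceptor(TorchFunctionMode):
    # Illustrative only: catch torch._grouped_mm calls so a quantized grouped
    # GEMM could be substituted without swapping the user's MoE module.
    def __torch_function__(self, func, types, args=(), kwargs=None):
        kwargs = kwargs or {}
        if func is torch._grouped_mm:
            # hook point: inspect the activation / stacked expert weights and
            # call a quantized kernel here; the sketch just falls through
            return func(*args, **kwargs)
        return func(*args, **kwargs)

# usage sketch:
#   with GroupedMMInterceptor():
#       out = moe_layer(tokens)
```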

alexsamardzic (Collaborator):

> Can you run it with batch_size 1?

Nope, with both batch_size 1 and 8, it runs out of memory.
