
Conversation

@sushraja-msft
Contributor

Description

This change moves quantization away from subgroup ops. On AMD GPUs the subgroup size is 64, which our quantization function does not handle, resulting in garbage output. Supporting subgroup size 64 would require changing the workgroup size, and supporting subgroup size 128 after that becomes a challenge, so the new implementation avoids depending on the subgroup size altogether.
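
For illustration only (a hypothetical sketch, not the shader from this change), a workgroup-memory reduction like the one below computes the per-block absolute maximum, from which the quantization scale is derived, without assuming any particular subgroup size. The names `block_absmax`, `BLOCK_SIZE`, and `max_values` are invented for this sketch:

```wgsl
// Hypothetical sketch: per-block absmax for quantization via workgroup
// memory instead of subgroup ops, so the result does not depend on the
// hardware subgroup size (16, 32, 64, or 128 all behave the same).

const BLOCK_SIZE : u32 = 64u;

var<workgroup> max_values : array<f32, BLOCK_SIZE>;

// Must be called from uniform control flow; every invocation in the
// workgroup contributes one value and reads back the block maximum.
fn block_absmax(local_idx : u32, value : f32) -> f32 {
  max_values[local_idx] = abs(value);
  workgroupBarrier();
  // Tree reduction entirely in workgroup memory.
  for (var stride : u32 = BLOCK_SIZE / 2u; stride > 0u; stride = stride / 2u) {
    if (local_idx < stride) {
      max_values[local_idx] = max(max_values[local_idx], max_values[local_idx + stride]);
    }
    workgroupBarrier();
  }
  return max_values[0];
}
```

A `subgroupMax`-based reduction, by contrast, bakes the subgroup size into how values are grouped, which is what produced wrong results once the subgroup size was 64 rather than 32.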

With the new implementation, perf on Intel ADL remains about the same: 4.36s for 1000K prefill.

Tests for this change are present here:
https://github.com/microsoft/onnxruntime/blob/e66650350b85cb5e3a408f6576fe6a7f4f4ddebc/onnxruntime/test/contrib_ops/matmul_4bits_test.cc

However, to trigger the issue fixed here, they must be run on a GPU with a subgroup size of 64.

@sushraja-msft
Contributor Author

@qjia7 - FYI, I am not able to add you as a reviewer but wanted to share for awareness.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 11, 2025
@sushraja-msft sushraja-msft changed the title from "DP4AMatMul fix matmul for subgoup size 64 GPUs" to "WIP: DP4AMatMul fix matmul for subgoup size 64 GPUs" Feb 12, 2025
@sushraja-msft sushraja-msft force-pushed the user/sushraja/fix_dp4_quantization branch 2 times, most recently from 1ab5a6a to 1decc48 on February 12, 2025 at 20:59
@sushraja-msft sushraja-msft force-pushed the user/sushraja/fix_dp4_quantization branch from 1decc48 to 4f473cb on February 12, 2025 at 22:11
@guschmue guschmue merged commit 4e24d37 into main Feb 13, 2025
96 of 98 checks passed
@guschmue guschmue deleted the user/sushraja/fix_dp4_quantization branch February 13, 2025 21:24
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025