
Conversation

@sushraja-msft
Contributor

Description

This change moves quantization away from subgroup ops. On AMD GPUs the subgroup size is 64, which our quantization function does not handle, resulting in garbage output. Supporting subgroup size 64 would require changing the workgroup size, and supporting subgroup size 128 after that becomes a challenge, so the new implementation avoids depending on the subgroup size altogether.
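
For illustration only (a hypothetical sketch, not the shader from this change), a workgroup-memory reduction like the one below computes the per-block absolute maximum, from which the quantization scale is derived, without assuming any particular subgroup size. The names `block_absmax`, `BLOCK_SIZE`, and `max_values` are invented for this sketch:

```wgsl
// Hypothetical sketch: per-block absmax for quantization via workgroup
// memory instead of subgroup ops, so the result does not depend on the
// hardware subgroup size (16, 32, 64, or 128 all behave the same).

const BLOCK_SIZE : u32 = 64u;

var<workgroup> max_values : array<f32, BLOCK_SIZE>;

// Must be called from uniform control flow; every invocation in the
// workgroup contributes one value and reads back the block maximum.
fn block_absmax(local_idx : u32, value : f32) -> f32 {
  max_values[local_idx] = abs(value);
  workgroupBarrier();
  // Tree reduction entirely in workgroup memory.
  for (var stride : u32 = BLOCK_SIZE / 2u; stride > 0u; stride = stride / 2u) {
    if (local_idx < stride) {
      max_values[local_idx] = max(max_values[local_idx], max_values[local_idx + stride]);
    }
    workgroupBarrier();
  }
  return max_values[0];
}
```

A `subgroupMax`-based reduction, by contrast, bakes the subgroup size into how values are grouped, which is what produced wrong results once the subgroup size was 64 rather than 32.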

With the new implementation, perf on Intel ADL remains about the same: 4.36s for 1000K prefill.

Tests for this change are present here:
https://github.com/microsoft/onnxruntime/blob/e66650350b85cb5e3a408f6576fe6a7f4f4ddebc/onnxruntime/test/contrib_ops/matmul_4bits_test.cc

However, to trigger the issue fixed here, they must be run on a GPU with a subgroup size of 64.

@sushraja-msft
Contributor Author

@qjia7 - FYI, I am not able to add you as a reviewer but wanted to share for awareness.

@guschmue guschmue added the ep:WebGPU ort-web webgpu provider label Feb 11, 2025
@sushraja-msft sushraja-msft changed the title from "DP4AMatMul fix matmul for subgoup size 64 GPUs" to "WIP: DP4AMatMul fix matmul for subgoup size 64 GPUs" Feb 12, 2025
@sushraja-msft sushraja-msft force-pushed the user/sushraja/fix_dp4_quantization branch 2 times, most recently from 1ab5a6a to 1decc48 on February 12, 2025 at 20:59
@sushraja-msft sushraja-msft force-pushed the user/sushraja/fix_dp4_quantization branch from 1decc48 to 4f473cb on February 12, 2025 at 22:11
@guschmue guschmue merged commit 4e24d37 into main Feb 13, 2025
96 of 98 checks passed
@guschmue guschmue deleted the user/sushraja/fix_dp4_quantization branch February 13, 2025 21:24
guschmue pushed a commit that referenced this pull request Mar 6, 2025
ashrit-ms pushed a commit that referenced this pull request Mar 17, 2025