[ET-VK] Quantized Int8 Linear #13816
Conversation
Title says it all! This PR adds implementations for int8 linear layers. Convolution is implemented in a later step, computing convolution as matrix multiplication via the im2col procedure.

For both linear and convolution, two versions are implemented:

1. `q8ta_q8csw` variant, which quantizes the input tensor and then performs integer accumulation via the int8 dot product extension.
2. `q8csw` variant, which dequantizes the weight tensor in-shader and performs floating point accumulation.

The second is needed to provide an alternative path for executing quantized models when the target GPU does not support the int8 dot product extension.

These new ops are tested via the custom op testing + benchmarking framework introduced in the previous diff.

Differential Revision: [D81323424](https://our.internmc.facebook.com/intern/diff/D81323424/)
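To make the distinction between the two variants concrete, here is a minimal C++ sketch of the per-output-element math. This is not the actual shader code and all names are illustrative: the `q8ta_q8csw`-style path accumulates int8 products in int32 and applies the scales once at the end, while the `q8csw`-style path dequantizes the weights on the fly and accumulates in float.

```cpp
#include <cstdint>
#include <vector>

// q8ta_q8csw-style: input and weight are both int8; accumulate in int32,
// then apply the combined input/weight scales at the end.
float q8ta_q8csw_dot(
    const std::vector<int8_t>& x_q, float x_scale, int32_t x_zp,
    const std::vector<int8_t>& w_q, float w_scale) {
  int32_t acc = 0;
  for (size_t k = 0; k < x_q.size(); ++k) {
    acc += (int32_t(x_q[k]) - x_zp) * int32_t(w_q[k]);
  }
  return float(acc) * x_scale * w_scale;
}

// q8csw-style: the input stays in floating point and the weight is
// dequantized element by element, so accumulation happens in float.
float q8csw_dot(
    const std::vector<float>& x,
    const std::vector<int8_t>& w_q, float w_scale) {
  float acc = 0.0f;
  for (size_t k = 0; k < x.size(); ++k) {
    acc += x[k] * (float(w_q[k]) * w_scale);
  }
  return acc;
}
```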
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13816
Note: Links to docs will display an error until the docs builds have been completed.
❌ 1 New Failure, 1 Cancelled Job as of commit a29d61c with merge base 1a7441f.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
This pull request was exported from Phabricator. Differential Revision: D81323424
still looking...
// Perform quantized linear transformation (matrix multiplication)
for (int64_t b = 0; b < batch_size; ++b) {
  for (int64_t out_f = 0; out_f < out_features; ++out_f) {
    float sum = 0.0f;
No int32 accumulation in the reference? Why not use the canonical PyTorch q->dq->linear->q->dq pattern then?
I updated the test file to use int accumulation for the 8-bit activation quantized path, as you suggested!
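For reference, here is a hedged sketch of what an int32-accumulating reference could look like; parameter names and tensor layout are illustrative, not the actual test code:

```cpp
#include <cstdint>
#include <vector>

// Activation-quantized (q8ta_q8csw-style) reference: accumulate int8
// products in int32, then apply scales and bias once per output element.
void quantized_linear_ref(
    const std::vector<int8_t>& input_q,       // [batch_size, in_features]
    float input_scale, int32_t input_zero_point,
    const std::vector<int8_t>& weight_q,      // [out_features, in_features]
    const std::vector<float>& weight_scales,  // per-channel, [out_features]
    const std::vector<float>& bias,           // [out_features]
    std::vector<float>& output,               // [batch_size, out_features]
    int64_t batch_size, int64_t in_features, int64_t out_features) {
  for (int64_t b = 0; b < batch_size; ++b) {
    for (int64_t out_f = 0; out_f < out_features; ++out_f) {
      int32_t acc = 0;  // integer accumulation, matching the shader path
      for (int64_t in_f = 0; in_f < in_features; ++in_f) {
        const int32_t x_q =
            int32_t(input_q[b * in_features + in_f]) - input_zero_point;
        const int32_t w_q = int32_t(weight_q[out_f * in_features + in_f]);
        acc += x_q * w_q;
      }
      // Scales are applied only after the integer accumulation finishes.
      output[b * out_features + out_f] =
          float(acc) * input_scale * weight_scales[out_f] + bias[out_f];
    }
  }
}
```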
 * from prefill to decode. GEMM is typically a compute bound operation, which
 * will benefit from accelerated int8 accumulation. On the other hand, GEMV
 * is usually memory bound, which means it may actually suffer from the extra
 * cost of having to quantize and pack the input tensor. Therefore,
Did we benchmark this?
Haven't had the time to validate, but I decided to go this path for now because weight-only quant is needed anyway to support devices that don't have the int8 dot product extension, and ML drift also utilizes this approach. I plan to do some more experimentation to confirm post-GA.
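Purely to illustrate the trade-off being discussed (this dispatch logic is not part of the PR; the names and the single-row heuristic are hypothetical), kernel selection could look something like:

```cpp
#include <cstdint>

// Hypothetical sketch: pick the activation-quantized kernel for GEMM-shaped
// (compute-bound) workloads, and fall back to the weight-only kernel when
// the int8 dot product extension is unavailable or the workload is GEMV.
enum class LinearKernel { kQ8taQ8csw, kQ8csw };

LinearKernel pick_linear_kernel(int64_t num_input_rows, bool has_int8_dot_product) {
  if (!has_int8_dot_product) {
    return LinearKernel::kQ8csw;  // no int8 dot product extension available
  }
  // Single-row (decode-style) workloads are GEMV and typically memory bound,
  // so the extra quantize-and-pack pass on the input may not pay off.
  return (num_input_rows == 1) ? LinearKernel::kQ8csw : LinearKernel::kQ8taQ8csw;
}
```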
  VEC4_T data[TILE_M][TILE_K4];
};

#ifdef DEBUG_MODE
not defined otherwise?
The usage of this is to be able to `#define DEBUG_MODE` in order to access debugging functions in the shader template.
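For illustration, here is a small C++ sketch of the same preprocessor pattern (the shader templates use the C preprocessor as well; the helper names here are made up for the example):

```cpp
#include <cstdio>

// Uncomment (or compile with -DDEBUG_MODE) to pull in the debug-only helpers.
// #define DEBUG_MODE

#ifdef DEBUG_MODE
// Debug-only helper, compiled in only when DEBUG_MODE is defined.
void debug_print_value(float v) {
  std::printf("debug: %f\n", v);
}
#endif

float scale_value(float v) {
#ifdef DEBUG_MODE
  debug_print_value(v);  // only present in debug builds
#endif
  return v * 2.0f;
}
```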
#include "linear_fp_input_tile.glslh"

#ifdef INPUT_BUFFER
Nit: why is this not inside `load_input_x4`?
That's a great point. I will include this and other ifdef cleanups in a follow-up diff.
    const int m_start,
    const int K4,
    const int M) {
#if TILE_K4 == 1
Why do we need this specialization? Does the compiler not do this for you?
Yeah, I plan to simplify these ifdefs in my most recent diff.
  const int m = out_tile_y * TILE_M;

  const int n4 = div_4(n);
  const int m4 = div_4(m);
`m4` seems redundant, remove?
  const int out_tile_x = int(gl_GlobalInvocationID.x);
  const int out_tile_y = int(gl_GlobalInvocationID.y);
I am still trying to process this and figure out the right way to do it, but just as a hint: in the absence of shared-memory-based reuse, at a high level it feels like we want each local group to generate a tile of output tiles, to ensure memory reuse on both weight tiles and input tiles.
IIUC this produces a row or a column of output tiles, but not a 2D tile of tiles.
Currently each thread produces a 4x4 output block. The header files have been configured to make the tile size configurable in groups of 4.
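As a hedged C++ sketch of that mapping (the actual shader uses `gl_GlobalInvocationID`; these names are illustrative), each invocation owns one TILE_M x TILE_N block of the output:

```cpp
constexpr int TILE_M = 4;
constexpr int TILE_N = 4;

struct OutputTile {
  int m_start;  // first output row handled by this invocation
  int n_start;  // first output column handled by this invocation
};

// Invocation (x, y) computes output rows [y*TILE_M, y*TILE_M + TILE_M)
// and columns [x*TILE_N, x*TILE_N + TILE_N), i.e. a 4x4 block by default.
OutputTile tile_for_invocation(int invocation_x, int invocation_y) {
  return OutputTile{invocation_y * TILE_M, invocation_x * TILE_N};
}
```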
Merged commit 6a61722 into gh/SS-JIA/316/base.