
Conversation

SS-JIA
Contributor

@SS-JIA SS-JIA commented Aug 29, 2025

Stack from ghstack (oldest at bottom):

Title says it all!

This PR adds implementations for int8 linear layers. Convolution is implemented in a later step, computing convolution as matrix multiplication via the im2col procedure.

For both linear and convolution, two versions are implemented:

  1. `q8ta_q8csw` variant, which quantizes the input tensor and then performs integer accumulation via the int8 dot product extension.
  2. `q8csw` variant, which dequantizes the weight tensor in-shader and performs floating point accumulation.

The second variant provides an alternative path for executing quantized models when the target GPU does not support the int8 dot product extension.
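As a rough illustration (a CPU-side sketch with hypothetical names, not the actual shader code), the two accumulation strategies look like this:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// q8ta_q8csw path: quantized activation, integer accumulation (mirrors what
// the int8 dot product extension does in hardware), then a float rescale.
float q8ta_q8csw_dot(
    const std::vector<int8_t>& x_q,  // quantized input row (affine)
    const std::vector<int8_t>& w_q,  // quantized weight row (symmetric, per-channel)
    float x_scale,
    int32_t x_zero_point,
    float w_scale) {
  int32_t acc = 0;
  int32_t weight_sum = 0;
  for (size_t k = 0; k < x_q.size(); ++k) {
    acc += int32_t(x_q[k]) * int32_t(w_q[k]);
    weight_sum += int32_t(w_q[k]);
  }
  // Subtract the activation zero point contribution, then rescale to float.
  return float(acc - x_zero_point * weight_sum) * x_scale * w_scale;
}

// q8csw path: dequantize the weights on the fly and accumulate in float;
// this is the fallback when the int8 dot product extension is unavailable.
float q8csw_dot(
    const std::vector<float>& x,     // float input row
    const std::vector<int8_t>& w_q,  // quantized weight row
    float w_scale) {
  float acc = 0.0f;
  for (size_t k = 0; k < x.size(); ++k) {
    acc += x[k] * (float(w_q[k]) * w_scale);
  }
  return acc;
}
```

On GPUs with the int8 dot product extension, the first loop maps onto hardware dot-product instructions; without it, the second path keeps everything in floating point.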

These new ops are tested via the custom op testing + benchmarking framework introduced in the previous diff.

Differential Revision: D81323424


pytorch-bot bot commented Aug 29, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/13816

Note: Links to docs will display an error until the docs builds have been completed.

❌ 1 New Failure, 1 Cancelled Job

As of commit a29d61c with merge base 1a7441f:

NEW FAILURE - The following job has failed:

CANCELLED JOB - The following job was cancelled. Please retry:

This comment was automatically generated by Dr. CI and updates every 15 minutes.


This PR needs a release notes: label

If your change should be included in the release notes (i.e. would users of this library care about this change?), please use a label starting with release notes:. This helps us keep track and include your important work in the next release notes.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "release notes: none"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.

@facebook-github-bot
Contributor

This pull request was exported from Phabricator. Differential Revision: D81323424


Contributor

@digantdesai digantdesai left a comment

still looking...

// Perform quantized linear transformation (matrix multiplication)
for (int64_t b = 0; b < batch_size; ++b) {
  for (int64_t out_f = 0; out_f < out_features; ++out_f) {
    float sum = 0.0f;
Contributor

No int32 accumulation in the reference? Why not use q->dq->linear->q->dq canonical pytorch pattern then?

Contributor Author

I updated the test file to use int accumulation for the 8bit activation quantized path, as you suggested!
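For context, a minimal sketch of an int32-accumulating reference for the activation-quantized path (names, layouts, and the bias handling are assumptions for illustration, not the actual test code):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical reference: affine-quantized int8 input, symmetric int8 weights
// with per-output-channel scales, int32 accumulation, float rescale + bias.
std::vector<float> linear_q8ta_q8csw_ref(
    const std::vector<int8_t>& input_q,       // [batch_size, in_features]
    float input_scale,
    int32_t input_zero_point,
    const std::vector<int8_t>& weight_q,      // [out_features, in_features]
    const std::vector<float>& weight_scales,  // [out_features]
    const std::vector<float>& bias,           // [out_features]
    int64_t batch_size,
    int64_t in_features,
    int64_t out_features) {
  std::vector<float> output(batch_size * out_features);
  for (int64_t b = 0; b < batch_size; ++b) {
    for (int64_t out_f = 0; out_f < out_features; ++out_f) {
      int32_t acc = 0;
      for (int64_t in_f = 0; in_f < in_features; ++in_f) {
        const int32_t x_q = input_q[b * in_features + in_f];
        const int32_t w_q = weight_q[out_f * in_features + in_f];
        acc += (x_q - input_zero_point) * w_q;  // integer accumulation
      }
      output[b * out_features + out_f] =
          float(acc) * input_scale * weight_scales[out_f] + bias[out_f];
    }
  }
  return output;
}
```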

* from prefill to decode. GEMM is typically a compute bound operation, which
* will benefit from accelerated int8 accumulation. On the other hand, GEMV
* is usually memory bound, which means it may actually suffer from the extra
* cost of having to quantize and pack the input tensor. Therefore,
Contributor

Did we benchmark this?

Contributor Author

Haven't had the time to validate yet, but I decided to just go with this path for now because weight-only quant is needed anyway to support devices that don't have the int8 dot product extension, plus the fact that ML Drift also utilizes this approach. I plan to do some more experimentation to confirm post-GA.

VEC4_T data[TILE_M][TILE_K4];
};

#ifdef DEBUG_MODE
Contributor

not defined otherwise?

Contributor Author

This is so that one can

#define DEBUG_MODE

to access debugging functions in the shader template.
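As a small illustration of the opt-in pattern (the helper name and body are hypothetical):

```cpp
// #define DEBUG_MODE  // uncomment while developing to enable debug helpers

#ifdef DEBUG_MODE
// Debug-only utilities are compiled in only when DEBUG_MODE is defined;
// in normal builds this block disappears entirely.
#include <cstdio>
inline void debug_print(const char* tag, float value) {  // hypothetical helper
  std::printf("%s: %f\n", tag, value);
}
#endif
```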


#include "linear_fp_input_tile.glslh"

#ifdef INPUT_BUFFER
Contributor

Nit: why is this not inside load_input_x4?

Contributor Author

That's a great point. I will include this + other ifdef cleanups in a follow-up diff.

const int m_start,
const int K4,
const int M) {
#if TILE_K4 == 1
Contributor

why do we need this specialization? Does compiler not do this for you?

Contributor Author

Yeah, I plan to simplify these ifdefs in my most recent diff.

const int m = out_tile_y * TILE_M;

const int n4 = div_4(n);
const int m4 = div_4(m);
Contributor

m4 seems redundant, remove?

Comment on lines +58 to +59
const int out_tile_x = int(gl_GlobalInvocationID.x);
const int out_tile_y = int(gl_GlobalInvocationID.y);
Contributor

I am still trying to process this and figure out the right way to do it, but just as a hint: at a high level, in the absence of shared-memory-based reuse, it feels like we want each local group to generate a tile of output tiles, to ensure memory reuse on both weight tiles and input tiles.

IIUC this produces a row or a column of output tiles, but not a tile of tiles in 2D.

Contributor Author

Currently each thread produces a 4x4 output block. The header files have been configured to make the tile size configurable in groups of 4.
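A host-side sketch of that mapping, using the tile-size names from the quoted shader (other details are assumed for illustration):

```cpp
// Each compute invocation (x, y) owns one TILE_N x TILE_M block of the output
// matrix; tile sizes are multiples of 4 so loads and stores stay vectorized.
constexpr int TILE_M = 4;  // output rows produced per invocation
constexpr int TILE_N = 4;  // output columns produced per invocation

struct OutputTileOrigin {
  int n;  // first output column covered by this invocation
  int m;  // first output row covered by this invocation
};

// invocation_x / invocation_y correspond to gl_GlobalInvocationID.x / .y.
OutputTileOrigin output_tile_origin(int invocation_x, int invocation_y) {
  return OutputTileOrigin{invocation_x * TILE_N, invocation_y * TILE_M};
}
```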

ssjia and others added 2 commits September 7, 2025 10:54

@facebook-github-bot facebook-github-bot merged commit 6a61722 into gh/SS-JIA/316/base Sep 8, 2025
113 of 118 checks passed
@facebook-github-bot facebook-github-bot deleted the gh/SS-JIA/316/head branch September 8, 2025 00:04
Labels: CLA Signed, fb-exported