Conversation

@cyxlily commented Sep 13, 2025

Add act_pre_scale into Int4OpaqueTensor for AWQ.
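For context, a rough sketch of what the activation pre-scale means in AWQ (conceptual only, not the Int4OpaqueTensor implementation; the exact direction of the scaling convention in torchao may differ):

import torch
import torch.nn.functional as F

# AWQ rescales weight channels before quantization and compensates by applying
# the inverse factor to the activation at runtime; act_pre_scale stores that
# per-input-channel factor. In float math the product is unchanged:
#   (x * s) @ (W / s).T == x @ W.T
def awq_prescaled_linear(x: torch.Tensor, w: torch.Tensor, act_pre_scale: torch.Tensor) -> torch.Tensor:
    # x: (..., K), w: (N, K), act_pre_scale: (K,)
    return F.linear(x * act_pre_scale, w / act_pre_scale)

x = torch.randn(2, 16)
w = torch.randn(8, 16)
s = torch.rand(16) + 0.5
torch.testing.assert_close(awq_prescaled_linear(x, w, s), F.linear(x, w), rtol=1e-4, atol=1e-4)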

pytorch-bot bot commented Sep 13, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2997

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit a5675fe with merge base c4d4799:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla bot added the CLA Signed label Sep 13, 2025
Add act_pre_scale into Int4OpaqueTensor for AWQ.

Signed-off-by: Cui, Yuxin <[email protected]>
@Xia-Weiwen (Collaborator) left a comment

LGTM but I have a few questions.

@@ -254,14 +295,21 @@ def quantize_and_eval(
     quantize_(model, quant_config)
     print(f"time for convert: {time.time() - t0:.02f} seconds")
     quant_config = AWQConfig(base_config, step="prepare_for_loading")
-    model.config.quantization_config = TorchAoConfig(quant_config)
+    #model.config.quantization_config = TorchAoConfig(quant_config)
Collaborator:

Is this change needed?

Author:

No, it isn't needed. I've updated the PR and removed this change.

@Xia-Weiwen (Collaborator):

CC @mingfeima for review. Thanks.

Comment on lines 263 to 270
if device == "cuda":
    base_config = Int4WeightOnlyConfig(group_size=group_size, version=2)
elif device == "cpu":
    base_config = Int4WeightOnlyConfig(
        group_size=group_size, packing_format="opaque", version=2
    )
else:
    assert False, "Unsupported device: {}".format(device)


I am not very familiar with the concept here. Could you explain why CPU needs the opaque packing_format?

Contributor:

It's because packing_format describes a fixed format for how the quantized weight data are laid out in memory, but int4 on CPU uses a format that depends on the specific hardware, tensor shapes, etc.:

We use AVX512 to compute TINYGEMM on CPU. We can also leverage AVX512_VNNI and AMX instructions with torch.compile and max-autotune.
For data locality, we pre-shuffle the data from the plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64/32/16.
See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details.
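As a rough illustration of that pre-shuffle, here is a sketch at the level of unpacked int4 values (the actual ATen kernel additionally packs two 4-bit values per byte, which is where the /2 in the shapes above comes from):

import torch

def preshuffle_plain_to_blocked(w_plain: torch.Tensor, block_n: int = 64) -> torch.Tensor:
    # w_plain: (N, K) unpacked int4 values; returns (N/block_n, K, block_n),
    # so that block_n output channels are contiguous for each K index.
    N, K = w_plain.shape
    assert N % block_n == 0
    return w_plain.view(N // block_n, block_n, K).transpose(1, 2).contiguous()

w = torch.randint(0, 16, (128, 256), dtype=torch.uint8)
print(preshuffle_plain_to_blocked(w).shape)  # torch.Size([2, 256, 64])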

    base_config = Int4WeightOnlyConfig(group_size=group_size, version=2)
elif device == "cpu":
    base_config = Int4WeightOnlyConfig(
        group_size=group_size, packing_format="opaque", version=2
Contributor:

version=2 can be removed; it's the default now.

Author:

Thanks, I've updated the PR and removed version=2.

Collaborator:

Please remove the version=2 here since it's the default.

Author:

OK, I removed version=2 for the CPU path.

@Xia-Weiwen (Collaborator) left a comment

Thanks.


inductor_config.cpp_wrapper = True
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CPP,ATEN"
Collaborator:

This script will also be used for CUDA, so I think Triton is needed here. Or let's simply remove this line and use the defaults.

Author:

Removed
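For reference, the reviewer's first option (keeping the override but making it device-aware) would look roughly like this; a sketch of the alternative, not what the PR ended up doing:

from torch._inductor import config as inductor_config

def configure_inductor(device: str) -> None:
    # Pick GEMM autotune backends per device instead of hard-coding the CPU ones.
    inductor_config.cpp_wrapper = True
    inductor_config.max_autotune = True
    if device == "cpu":
        inductor_config.max_autotune_gemm_backends = "CPP,ATEN"
    else:  # "cuda"
        inductor_config.max_autotune_gemm_backends = "TRITON,ATEN"

configure_inductor("cpu")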

Comment on lines 103 to 106
# Making sure activation pre scaling is successfully applied to the activation.
# manual_scaled_quantized (input * 2 → quantize with act_pre_scale=None) should equal
# auto_scaled_quantized (original input → quantize with act_pre_scale=2),
# Proving that the act_pre_scale factor correctly applies input scaling
Collaborator:

Suggested change
# Making sure activation pre scaling is successfully applied to the activation.
# manual_scaled_quantized (input * 2 → quantize with act_pre_scale=None) should equal
# auto_scaled_quantized (original input → quantize with act_pre_scale=2),
# Proving that the act_pre_scale factor correctly applies input scaling
# Make sure activation pre scaling is successfully applied to the activation.

Let's make it more concise.

Author:

Thanks, updated
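For context, a self-contained toy version of the equivalence that comment describes (a simple fake-quant linear stands in for the real Int4OpaqueTensor path; all names here are illustrative, not the actual test code):

import torch
import torch.nn.functional as F

def fake_quant_linear(x, w, act_pre_scale=None):
    # Optionally pre-scale the activation, then matmul against an int4-style
    # fake-quantized weight (per-row symmetric, 4-bit range).
    if act_pre_scale is not None:
        x = x * act_pre_scale
    scale = w.abs().amax(dim=1, keepdim=True) / 7
    w_dq = torch.clamp((w / scale).round(), -8, 7) * scale
    return F.linear(x, w_dq)

x = torch.randn(4, 16)
w = torch.randn(8, 16)
manual_scaled_quantized = fake_quant_linear(x * 2.0, w)             # scale the input by hand
auto_scaled_quantized = fake_quant_linear(x, w, act_pre_scale=2.0)  # let act_pre_scale do it
torch.testing.assert_close(manual_scaled_quantized, auto_scaled_quantized)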

Comment on lines 109 to 113
# Making sure quantization with pre-scaling is successfully applied to the activation.
# The error > 20 indicats that quantized computation with activation pre-scaling
# produces significantly different results from simply scaling the original
# floating-point output, confirming that pre-scaling is applied during
# quantization rather than post-processing.
Collaborator:

Suggested change
# Making sure quantization with pre-scaling is successfully applied to the activation.
# The error > 20 indicats that quantized computation with activation pre-scaling
# produces significantly different results from simply scaling the original
# floating-point output, confirming that pre-scaling is applied during
# quantization rather than post-processing.
# If pre-scaling is auto-applied, the quantization error should be low, i.e., compute_error (SQNR) is high

This may be simpler.

Author:

Thanks, updated
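For reference, compute_error here is torchao's SQNR helper (in dB); a minimal equivalent definition looks roughly like this (a sketch, the actual utility may differ in details):

import torch

def sqnr(ref: torch.Tensor, actual: torch.Tensor) -> torch.Tensor:
    # Signal-to-quantization-noise ratio in dB: higher means the quantized
    # output tracks the reference more closely; > 20 dB is the bar used in the test.
    return 20 * torch.log10(torch.linalg.norm(ref) / torch.linalg.norm(ref - actual))

ref = torch.randn(64)
noisy = ref + 0.01 * torch.randn(64)
print(sqnr(ref, noisy))  # roughly 40 dB for ~1% relative noise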

    base_config = Int4WeightOnlyConfig(group_size=group_size, version=2)
elif device == "cpu":
    base_config = Int4WeightOnlyConfig(
        group_size=group_size, packing_format="opaque", version=2
Collaborator:

Please remove the version=2 here since it's the default.

Cui, Yuxin added 4 commits September 16, 2025 14:56
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
@@ -28,7 +29,7 @@
 def get_config(group_size):
     return Int4WeightOnlyConfig(
         group_size=group_size,
-        int4_packing_format="opaque",
+        packing_format="opaque",
Contributor:

This should be int4_packing_format, I think.

Author:

Thanks, updated
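With that rename applied, the test helper would read roughly as follows (reconstructed from the quoted diff; the import path is an assumption):

from torchao.quantization import Int4WeightOnlyConfig

def get_config(group_size):
    return Int4WeightOnlyConfig(
        group_size=group_size,
        int4_packing_format="opaque",
    )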

# If pre-scaling is auto-applied, the quantization error should be low,
# i.e., compute_error (SQNR) is high
self.assertTrue(
    compute_error(original * _ACT_PRE_SCALE, auto_scaled_quantized) > 20
Contributor:

nit: original --> original_output

Author:

Thanks, updated

@jerryzh168 (Contributor) left a comment

LGTM

@Xia-Weiwen added the topic: not user facing label Sep 17, 2025
@cyxlily (Author) commented Sep 17, 2025

@pytorchbot merge

@pytorchmergebot:

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • Facebook CLA Check

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: superuser

@cyxlily (Author) commented Sep 17, 2025

@pytorchbot merge

@pytorchmergebot:

Merge failed

Reason: 1 mandatory check(s) are pending/not yet run. The first few are:

  • Facebook CLA Check

Dig deeper by viewing the pending checks on hud

Details for Dev Infra team Raised by workflow job

Failing merge rule: superuser

@Xia-Weiwen merged commit 067b273 into pytorch:main Sep 17, 2025
22 of 23 checks passed
@cyxlily deleted the awq branch September 17, 2025 08:24
Signed-off-by: Cui, Yuxin <[email protected]>
Cui, Yuxin added 2 commits September 17, 2025 09:42
Signed-off-by: Cui, Yuxin <[email protected]>