Support Int4OpaqueTensor for AWQ #2997
Conversation
🔗 Helpful Links: 🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/ao/2997
Note: Links to docs will display an error until the docs builds have been completed. ✅ No failures as of commit a5675fe with merge base c4d4799. This comment was automatically generated by Dr. CI and updates every 15 minutes.
Add act_pre_scale into Int4OpaqueTensor for AWQ.
Signed-off-by: Cui, Yuxin <[email protected]>
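For context, a minimal sketch (not the actual torchao implementation) of what act_pre_scale is meant to do: when it is set on the quantized weight container, the activation is multiplied by it before the quantized matmul, which is how AWQ's input scales travel with the weight. The wrapper and the dequantize() helper below are illustrative assumptions.

import torch
import torch.nn.functional as F

def linear_with_act_pre_scale(input, weight_tensor, bias=None):
    # Illustrative only: `weight_tensor` stands in for an Int4OpaqueTensor-like
    # container; `dequantize()` is a hypothetical helper so the example runs on
    # plain tensors instead of the fused int4 CPU kernel.
    if getattr(weight_tensor, "act_pre_scale", None) is not None:
        # AWQ-style pre-scaling: scale the activation before the matmul.
        input = input * weight_tensor.act_pre_scale
    return F.linear(input, weight_tensor.dequantize(), bias)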
LGTM but I have a few questions.
test/quantization/quantize_/workflows/int4/test_int4_opaque_tensor.py
Outdated
torchao/prototype/awq/example.py
Outdated
@@ -254,14 +295,21 @@ def quantize_and_eval(
     quantize_(model, quant_config)
     print(f"time for convert: {time.time() - t0:.02f} seconds")
     quant_config = AWQConfig(base_config, step="prepare_for_loading")
-    model.config.quantization_config = TorchAoConfig(quant_config)
+    #model.config.quantization_config = TorchAoConfig(quant_config)
Is this change needed?
No, I've updated it and removed this change.
CC @mingfeima for review. Thanks.
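For readers following this thread, a rough sketch of the surrounding AWQ flow in example.py that the diff above belongs to. The "prepare" and "convert" step names and the run_calibration helper are assumptions for illustration; only "prepare_for_loading" appears in the diff itself.

from torchao.quantization import quantize_
from torchao.prototype.awq import AWQConfig

# Assumed step names; only "prepare_for_loading" is confirmed by the diff above.
quantize_(model, AWQConfig(base_config, step="prepare"))    # insert observers
run_calibration(model, calibration_data)                    # hypothetical helper
quantize_(model, AWQConfig(base_config, step="convert"))    # apply AWQ scales and quantize

# When reloading a saved quantized checkpoint, the example uses:
quant_config = AWQConfig(base_config, step="prepare_for_loading")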
torchao/prototype/awq/example.py
Outdated
if device == "cuda": | ||
base_config = Int4WeightOnlyConfig(group_size=group_size, version=2) | ||
elif device == "cpu": | ||
base_config = Int4WeightOnlyConfig( | ||
group_size=group_size, packing_format="opaque", version=2 | ||
) | ||
else: | ||
assert False, "Unsupported device: {}".format(device) |
I am not very familiar with the concept here, could you explain why CPU needs the opaque packing_format?
It's because packing_format describes a fixed format of how the quantized weight data are laid out in memory, but int4 on CPU has a format that depends on the specific hardware, tensor shapes, etc.:
We use AVX512 to compute TINYGEMM on CPU. We can also leverage AVX512_VNNI and AMX instructions with torch.compile and max-autotune.
For data locality, we preshuffle the data in plain layout (N, K/2) to (N/block_n, K, block_n/2), where block_n = 64/32/16.
See https://github.com/pytorch/pytorch/blob/32eee8ed225d9f10fbbcb38c24b8b44c24c0c97c/aten/src/ATen/native/cpu/int4mm_kernel.cpp#L583 for more details.
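To make the quoted layout description concrete, here is a purely illustrative sketch of moving a plain (N, K/2) packed-int4 weight into an (N/block_n, K, block_n/2) blocked layout. It only demonstrates the shape transformation for data locality; the real packing in int4mm_kernel.cpp also reorders nibbles per ISA, so this is not the actual kernel format.

import torch

def to_blocked_layout(weight_plain: torch.Tensor, block_n: int = 64) -> torch.Tensor:
    # weight_plain: uint8 tensor of shape (N, K/2), two int4 values per byte.
    N, K_half = weight_plain.shape
    K = K_half * 2
    assert N % block_n == 0
    # Unpack nibbles so we can reason in (N, K) int4 values (nibble order assumed).
    low = weight_plain & 0x0F
    high = (weight_plain >> 4) & 0x0F
    unpacked = torch.stack([low, high], dim=-1).reshape(N, K)
    # (N, K) -> (N/block_n, block_n, K) -> (N/block_n, K, block_n)
    blocked = unpacked.reshape(N // block_n, block_n, K).permute(0, 2, 1).contiguous()
    # Re-pack pairs of int4 values along the last dim -> (N/block_n, K, block_n/2)
    return (blocked[..., 1::2] << 4) | blocked[..., 0::2]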
torchao/prototype/awq/example.py
Outdated
    base_config = Int4WeightOnlyConfig(group_size=group_size, version=2)
elif device == "cpu":
    base_config = Int4WeightOnlyConfig(
        group_size=group_size, packing_format="opaque", version=2
version=2 can be removed now; it's the default.
Thanks, I've updated it and removed version=2.
Please remove the version=2 here since it's the default.
OK, I removed version=2 for CPU.
Thanks.
torchao/prototype/awq/example.py
Outdated
inductor_config.cpp_wrapper = True
inductor_config.max_autotune = True
inductor_config.max_autotune_gemm_backends = "CPP,ATEN"
This script will also be used for CUDA, so I think triton is needed here. Or let's simply remove this line to use the defaults.
Removed
# Making sure activation pre scaling is successfully applied to the activation.
# manual_scaled_quantized (input * 2 → quantize with act_pre_scale=None) should equal
# auto_scaled_quantized (original input → quantize with act_pre_scale=2),
# proving that the act_pre_scale factor correctly applies input scaling
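As a runnable illustration of the check that comment describes (fake_quant_linear is a stand-in for the real Int4OpaqueTensor linear path, not the actual API under test):

import torch
import torch.nn.functional as F

torch.manual_seed(0)
weight = torch.randn(64, 128)
x = torch.randn(32, 128)
scale = 2.0

def fake_quant_linear(inp, act_pre_scale=None):
    # Stand-in for the quantized linear: pre-scale the activation, then matmul.
    if act_pre_scale is not None:
        inp = inp * act_pre_scale
    return F.linear(inp, weight)

manual_scaled_quantized = fake_quant_linear(x * scale, act_pre_scale=None)
auto_scaled_quantized = fake_quant_linear(x, act_pre_scale=scale)

# The two paths should match when act_pre_scale is applied to the activation
# before the (quantized) compute, which is what the test asserts.
torch.testing.assert_close(manual_scaled_quantized, auto_scaled_quantized)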
Suggested change:
- # Making sure activation pre scaling is successfully applied to the activation.
- # manual_scaled_quantized (input * 2 → quantize with act_pre_scale=None) should equal
- # auto_scaled_quantized (original input → quantize with act_pre_scale=2),
- # Proving that the act_pre_scale factor correctly applies input scaling
+ # Make sure activation pre scaling is successfully applied to the activation.
Let's make it more concise.
Thanks, updated
# Making sure quantization with pre-scaling is successfully applied to the activation.
# The error > 20 indicates that quantized computation with activation pre-scaling
# produces significantly different results from simply scaling the original
# floating-point output, confirming that pre-scaling is applied during
# quantization rather than post-processing.
Suggested change:
- # Making sure quantization with pre-scaling is successfully applied to the activation.
- # The error > 20 indicates that quantized computation with activation pre-scaling
- # produces significantly different results from simply scaling the original
- # floating-point output, confirming that pre-scaling is applied during
- # quantization rather than post-processing.
+ # If pre-scaling is auto-applied, the quantization error should be low, i.e., compute_error (SQNR) is high
This may be simpler.
Thanks, updated
torchao/prototype/awq/example.py
Outdated
    base_config = Int4WeightOnlyConfig(group_size=group_size, version=2)
elif device == "cpu":
    base_config = Int4WeightOnlyConfig(
        group_size=group_size, packing_format="opaque", version=2
Please remove the version=2 here since it's the default.
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
@@ -28,7 +29,7 @@
 def get_config(group_size):
     return Int4WeightOnlyConfig(
         group_size=group_size,
-        int4_packing_format="opaque",
+        packing_format="opaque",
This should be int4_packing_format, I think.
Thanks, updated
# If pre-scaling is auto-applied, the quantization error should be low,
# i.e., compute_error (SQNR) is high
self.assertTrue(
    compute_error(original * _ACT_PRE_SCALE, auto_scaled_quantized) > 20
)
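For reference, compute_error here is an SQNR-style metric where larger means closer; a minimal sketch of that idea (not necessarily identical to torchao's utility):

import torch

def sqnr_db(reference: torch.Tensor, candidate: torch.Tensor) -> float:
    # Signal-to-quantization-noise ratio in dB: large values mean `candidate`
    # closely matches `reference`, so "> 20" is a similarity threshold.
    noise = reference - candidate
    return (20 * torch.log10(reference.norm() / noise.norm())).item()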
nit: original --> original_output
Thanks, updated
LGTM
@pytorchbot merge
Merge failed. Reason: 1 mandatory check(s) are pending/not yet run. Dig deeper by viewing the pending checks on hud.
@pytorchbot merge
Merge failed. Reason: 1 mandatory check(s) are pending/not yet run. Dig deeper by viewing the pending checks on hud.
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
Signed-off-by: Cui, Yuxin <[email protected]>
Add act_pre_scale into Int4OpaqueTensor for AWQ.