Support loading for static quant weight fp8 act fp8 #730
Conversation
Signed-off-by: yiliu30 <[email protected]>
Pull Request Overview
This PR adds support for loading static quantized models with FP8 weights and FP8 activations by implementing a new quantized linear layer class and updating the model conversion infrastructure.
Key changes:
- Implemented a `WeightFP8ActFP8StaticQuantLinear` class for handling FP8 weight and activation quantization (see the sketch below)
- Updated model conversion logic to detect and handle FP8 static quantization configurations
- Enhanced test coverage to verify both export and loading functionality for static FP8 quantization
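
For illustration, here is a minimal sketch of what such a linear layer could look like. The class name mirrors the PR, but the buffer names (`weight_scale`, `input_scale`), the `quant_fp8` helper, and the reference dequantize-then-matmul forward path are assumptions, not the PR's actual implementation.

```python
import torch
import torch.nn as nn

FP8_E4M3_MAX = torch.finfo(torch.float8_e4m3fn).max  # ~448.0


def quant_fp8(tensor: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    """Quantize a float tensor to FP8 (e4m3) using a pre-computed static scale."""
    return (tensor / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)


class WeightFP8ActFP8StaticQuantLinearSketch(nn.Module):
    """Hypothetical linear layer with FP8 weights and statically quantized FP8 activations."""

    def __init__(self, in_features: int, out_features: int, bias: bool = True):
        super().__init__()
        self.register_buffer(
            "weight", torch.empty(out_features, in_features, dtype=torch.float8_e4m3fn)
        )
        self.register_buffer("weight_scale", torch.ones(1))
        # Static activation scale, calibrated offline and stored in the checkpoint.
        self.register_buffer("input_scale", torch.ones(1))
        self.register_buffer("bias", torch.zeros(out_features) if bias else None)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Quantize activations with the static scale, then run a reference path that
        # dequantizes both operands before the matmul (not an optimized FP8 GEMM).
        x_fp8 = quant_fp8(x, self.input_scale)
        x_deq = x_fp8.to(x.dtype) * self.input_scale
        w_deq = self.weight.to(x.dtype) * self.weight_scale
        return torch.nn.functional.linear(x_deq, w_deq, self.bias)
```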
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| test/test_cpu/test_export.py | Extended test to verify loading of static FP8 quantized models and renamed the test method |
| auto_round/inference/convert_model.py | Added support for the act_dynamic parameter and FP8 static quantization detection in model conversion |
| auto_round/inference/backend.py | Added an FP8 static quantization detection function and updated the dynamic import logic |
| auto_round/export/export_to_autoround/export_to_fp8_woq.py | Implemented the new WeightFP8ActFP8StaticQuantLinear class with quantization/dequantization methods |
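
As a rough sketch of what the detection helper in `backend.py` might check: the function name and config keys below (`data_type`, `act_data_type`, `act_dynamic`) are assumptions based on typical auto-round quantization configs and are not verified against the PR.

```python
def is_fp8_static_quant(quant_config: dict) -> bool:
    """Heuristic check for static (non-dynamic) FP8 weight/activation quantization."""
    return (
        quant_config.get("data_type") == "fp8"
        and quant_config.get("act_data_type") == "fp8"
        and quant_config.get("act_dynamic") is False
    )
```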
This PR is unnecessary for now; you need to work with Heng to fix the FP8
@wenhuach21 The purpose of this PR is to support loading an existing qmodel from disk and then evaluating its accuracy. cc @n1ck-guo
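
For context, a minimal sketch of the workflow described here, assuming the usual auto-round loading path through transformers; whether a static FP8 qmodel actually loads this way depends on this PR, and the checkpoint path below is hypothetical.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRoundConfig  # registers the AutoRound quantization backend

qmodel_path = "path/to/exported_fp8_static_qmodel"  # hypothetical path
model = AutoModelForCausalLM.from_pretrained(qmodel_path, device_map="auto", torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(qmodel_path)
# The reloaded model can then be handed to an evaluation harness to measure accuracy.
```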
Yes, but the primary purpose is evaluation, which the fake model should cover well (#731). This is not a product feature, and it involves changes to critical product code. As discussed earlier, please hold this PR for now, or move the code elsewhere without modifying the important HF model inference code.