Mistral Large 3 NVFP4 support #14485
base: main
Conversation
Signed-off-by: Linda-Stadter <[email protected]>
Signed-off-by: Daniel Campora <[email protected]>
Warning: You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!
Signed-off-by: Xinyuan Tong <[email protected]>
/tag-and-rerun-ci
This is a matmul op; why is it placed under the attention layer?
```python
from .compressed_tensors_scheme import CompressedTensorsScheme
from .compressed_tensors_w4a4_nvfp4 import CompressedTensorsW4A4Fp4
from .compressed_tensors_w4a16_nvfp4 import CompressedTensorsW4A16Fp4
```
Have we tested the w4a16 code path? If not, it would be better to do it in another PR. We may need it on Hopper or earlier architectures, and we don't have w4a16 MoE support for now.
Same comment, can do in follow-up PR.
```python
if is_activation_quantization_format(self.quant_format):
    if self._is_fp4a4_nvfp4(weight_quant, input_quant):
        if cutlass_fp4_supported():
```
w4a4 supports both FlashInfer and CUTLASS, right? I think we should do something similar to the method below and check the device capability.
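A minimal sketch of the kind of capability-based dispatch this comment suggests. The function names and the compute-capability threshold are assumptions for illustration, not the PR's actual API:

```python
import torch


def _fp4_capable() -> bool:
    # Assumption: both the CUTLASS and FlashInfer w4a4 NVFP4 GEMMs need a
    # Blackwell-class GPU (compute capability >= 10.0).
    if not torch.cuda.is_available():
        return False
    major, minor = torch.cuda.get_device_capability()
    return major * 10 + minor >= 100


def pick_w4a4_nvfp4_backend(prefer_flashinfer: bool = False) -> str:
    """Choose between the CUTLASS and FlashInfer w4a4 NVFP4 kernels,
    failing clearly when the device supports neither."""
    if not _fp4_capable():
        raise ValueError(
            "w4a4 NVFP4 requires a GPU with CUTLASS/FlashInfer FP4 support")
    return "flashinfer" if prefer_flashinfer else "cutlass"
```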
This seems to be only used by w4a16.
```python
def swizzle_blockscale(scale: torch.Tensor):
```
Add a comment to clarify that this method is NVFP4-specific.
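A hedged sketch of what that clarifying comment could look like, assuming `swizzle_blockscale` performs the usual 128x4 block-scale interleave expected by the CUTLASS NVFP4 GEMM; the exact body in the PR may differ:

```python
import torch


def swizzle_blockscale(scale: torch.Tensor) -> torch.Tensor:
    """Swizzle NVFP4 block scales into the 128x4 interleaved layout
    expected by the CUTLASS FP4 GEMM kernels.

    NOTE: this helper is NVFP4-specific; do not reuse it for other
    quantization formats.
    """
    ndim = scale.ndim
    if ndim == 2:
        scale = scale.unsqueeze(0)  # treat a 2-D scale as a single batch
    B, M, K = scale.shape

    # Pad rows to a multiple of 128 and columns to a multiple of 4.
    M_pad = (M + 127) // 128 * 128
    K_pad = (K + 3) // 4 * 4
    padded = torch.zeros((B, M_pad, K_pad), dtype=scale.dtype, device=scale.device)
    padded[:, :M, :K] = scale

    # Split into 128x4 tiles and permute so each tile is stored contiguously.
    tiled = padded.reshape(B, M_pad // 128, 4, 32, K_pad // 4, 4)
    swizzled = tiled.permute(0, 1, 4, 3, 2, 5).contiguous()

    return (swizzled.reshape(M_pad, K_pad) if ndim == 2
            else swizzled.reshape(B, M_pad, K_pad))
```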
Support Mistral Large 3 NVFP4.
Depends on #14466.
Checklist