Fix DSR1 accuracy for flashinfer_trtllm MoE with FP8 quantization #11081
Conversation
Summary of Changes
Hello @trevor-m, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed: this pull request addresses a critical accuracy regression in FP8-quantized MoE models running on the flashinfer_trtllm backend.
Force-pushed from fe5b41a to e79924e
Code Review
This pull request correctly fixes a critical accuracy issue for DSR1 models using MoE with FP8 quantization and the flashinfer_trtllm backend. The change extends the w13 -> w31 weight mapping to ModelOptFp8MoEMethod, which was previously only applied for ModelOptNvFp4FusedMoEMethod. The fix is well-motivated, and the provided accuracy metrics demonstrate a significant improvement. The code change is logical and improves the robustness of the type checking. I have one minor suggestion to further improve code readability.
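For reference, the broadened check the review describes could look like the following minimal sketch. The two class names come from the PR text; everything else (the function name, the stub classes, the boolean flag) is a hypothetical stand-in for sglang's actual MoE dispatch code, stubbed here so the snippet runs on its own.

```python
# Minimal, self-contained sketch of the broadened quant-method check.
# The real classes live in sglang's ModelOpt quantization module; they are
# stubbed here so the snippet is runnable stand-alone.
class ModelOptNvFp4FusedMoEMethod: ...
class ModelOptFp8MoEMethod: ...

def needs_w13_to_w31_remap(quant_method, uses_flashinfer_trtllm: bool) -> bool:
    # A single isinstance() with a tuple covers both quant methods (and any
    # subclasses), which is more robust than comparing concrete types one by one.
    return uses_flashinfer_trtllm and isinstance(
        quant_method, (ModelOptNvFp4FusedMoEMethod, ModelOptFp8MoEMethod)
    )
```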
cc @zhyncs @LorrinWWW Can you check if this is OK for the problem you fixed in #10758?
Force-pushed from e79924e to 507332f
Motivation
#10758 introduced a change that applies the w13 -> w31 weight mapping with flashinfer_trtllm only for NVFP4 quantization. However, FP8 quantization is also supported with this backend and needs the same mapping; without it, GSM8K accuracy drops to 0.
Modifications
Apply the w13 -> w31 weight mapping for FP8 quantization as well; see the sketch below.
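As a rough illustration of what the mapping itself does (not the actual sglang loader code): the fused gate/up expert weight is stored under a w13-style name, and the flashinfer_trtllm kernel expects the two halves in the opposite (w31) order, assuming the conventional w1 = gate_proj, w3 = up_proj naming. The helper below is hypothetical; the real logic sits in sglang's MoE weight-loading path.

```python
# Hypothetical sketch of the checkpoint weight-name remap. "w13" denotes the
# fused [w1; w3] (gate; up) projection; the flashinfer_trtllm kernel expects
# the swapped "w31" layout, so names are rewritten before loading.
def remap_expert_weight_name(name: str, apply_w31_mapping: bool) -> str:
    if apply_w31_mapping and "w13" in name:
        return name.replace("w13", "w31")
    return name

# The remap only touches the fused gate/up weight; w2 (down_proj) is unchanged.
assert remap_expert_weight_name("experts.w13_weight", True) == "experts.w31_weight"
assert remap_expert_weight_name("experts.w2_weight", True) == "experts.w2_weight"
```

With this fix, the remap fires for FP8 checkpoints exactly as it already did for NVFP4 ones, which is what restores GSM8K accuracy.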
Accuracy Tests
Command
Before fix
After fix
Benchmarking and Profiling
N/A
Checklist