[sgl-kernel] Support float64 moe_sum_reduce cuda kernel #11068
Conversation
Code Review
This pull request successfully adds float64 support to the moe_sum_reduce CUDA kernel and its corresponding Triton benchmark, addressing a failure in MLA tests. It also enhances flexibility by introducing a fallback mechanism for runtime topk_num values. The use of opmath_t in the CUDA kernel for handling various precisions is a solid implementation choice. My review highlights a critical precision loss issue in the Triton kernel for float64 inputs and provides medium-severity suggestions to refactor duplicated code in the CUDA kernel's dispatch logic, which will enhance maintainability.
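The opmath_t approach the review refers to is the usual ATen pattern of accumulating in a wider "op math" type. Below is a minimal sketch of that pattern for a sum-reduce over the topk dimension; it is illustrative only, not the PR's actual code, and the kernel name, argument list, and launch geometry are assumptions:

```cuda
// Hedged sketch: accumulate in at::opmath_type<scalar_t>, so bfloat16/float32
// inputs are summed in float32 and float64 inputs are summed in float64.
#include <ATen/OpMathType.h>

template <typename scalar_t>
__global__ void moe_sum_reduce_kernel(
    const scalar_t* __restrict__ input,   // [token_num, topk_num, hidden_dim]
    scalar_t* __restrict__ output,        // [token_num, hidden_dim]
    int token_num, int topk_num, int hidden_dim,
    double routed_scaling_factor) {
  using opmath_t = at::opmath_type<scalar_t>;  // float for bf16/fp32, double for fp64
  const int token = blockIdx.x;
  if (token >= token_num) return;
  for (int d = blockIdx.y * blockDim.x + threadIdx.x; d < hidden_dim;
       d += gridDim.y * blockDim.x) {
    opmath_t acc = opmath_t(0);
    // Plain runtime loop over topk_num: correct for any value, unlike a path
    // that only handles an unrolled 2/4/8/9 and falls through otherwise.
    for (int k = 0; k < topk_num; ++k) {
      const long idx = (static_cast<long>(token) * topk_num + k) * hidden_dim + d;
      acc += static_cast<opmath_t>(input[idx]);
    }
    acc *= static_cast<opmath_t>(routed_scaling_factor);
    output[static_cast<long>(token) * hidden_dim + d] = static_cast<scalar_t>(acc);
  }
}
```

Keeping the accumulation and the scaling in opmath_t is what preserves full float64 precision on the double path while leaving the bf16/fp32 paths unchanged.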
Using torch.tensor(xxx, ...) launches a ~2 µs H2D memcpy kernel, and it may not be compatible with CUDA graphs?
This is a reference kernel used to verify the float64 accumulation result. In production code, the Triton kernel is unchanged.
Motivation
The deterministic feature introduced the float64 data type in the MLA test, and the current moe_sum_reduce CUDA kernel does not cover this data type. As a result, when this new CUDA kernel is used, as in #10654, the following srt test failure occurred:
This PR fixes the problem by supporting bfloat16, float32, and float64. It also fixes the handling of runtime topk_num values that the previous version did not support: when topk_num is not 2/4/8/9, the old kernel hit a dead end and returned an unexpected result.
With the fix, the test case passes.
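To make the dtype coverage concrete, here is a hedged sketch of what the host-side dispatch over bfloat16/float32/float64 could look like with PyTorch's AT_DISPATCH macros. It launches the illustrative kernel sketched in the review section above; the function name, signature, and launch configuration are assumptions rather than the PR's code:

```cuda
#include <ATen/Dispatch.h>
#include <ATen/cuda/CUDAContext.h>
#include <torch/extension.h>

// Illustrative kernel from the sketch above (generic runtime topk loop).
template <typename scalar_t>
__global__ void moe_sum_reduce_kernel(const scalar_t* input, scalar_t* output,
                                      int token_num, int topk_num,
                                      int hidden_dim, double routed_scaling_factor);

void moe_sum_reduce(torch::Tensor input,   // [token_num, topk_num, hidden_dim]
                    torch::Tensor output,  // [token_num, hidden_dim]
                    double routed_scaling_factor) {
  TORCH_CHECK(input.is_cuda() && input.is_contiguous());
  const int token_num = input.size(0);
  const int topk_num = input.size(1);
  const int hidden_dim = input.size(2);
  const dim3 block(256);
  const dim3 grid(token_num, (hidden_dim + block.x - 1) / block.x);
  // AT_DISPATCH_FLOATING_TYPES covers float32/float64; the _AND variant adds bfloat16.
  AT_DISPATCH_FLOATING_TYPES_AND(
      at::ScalarType::BFloat16, input.scalar_type(), "moe_sum_reduce", [&] {
        moe_sum_reduce_kernel<scalar_t>
            <<<grid, block, 0, at::cuda::getCurrentCUDAStream()>>>(
                input.data_ptr<scalar_t>(), output.data_ptr<scalar_t>(),
                token_num, topk_num, hidden_dim, routed_scaling_factor);
      });
}
```

In this shape, specialized unrolled paths for common topk values (2/4/8/9) can still be selected first, with the generic runtime-topk kernel acting as the fallback instead of silently producing a wrong result.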
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist