Context parallelism with MLA #1552
Comments
Hi @SuperCB You mean Multi-head Latent Attention, which is used by DeepSeek? Technically, nothing should stop us from doing it; we just have not done it yet. Considering the popularity of MLA/DeepSeek, we should add this support for sure. We will do it. Thanks for bringing this to our attention.
I am working on it too. I found that the function
Yeah, the A2A implementation probably can work with MLA out of the box. P2P cannot work because it concats K and V into a single tensor for communication; the different head_dim of K and V prevents us from doing the concat, but this should be addressable. As I said, technically there should be no reason preventing MLA+CP, at least none that I know of now; I might find something after I start working on this.
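A minimal sketch of the K/V concat issue described above, using illustrative MLA-style head dims (the shapes and the padding workaround below are assumptions for illustration, not TransformerEngine's actual internals):

```python
import torch
import torch.nn.functional as F

# Illustrative shapes only. In MLA-style attention (e.g. DeepSeek),
# the K and V heads have different head dims.
seq, heads = 1024, 16
head_dim_k = 192   # hypothetical: query/key head dim (includes RoPE dims)
head_dim_v = 128   # hypothetical: value head dim

k = torch.randn(seq, heads, head_dim_k)
v = torch.randn(seq, heads, head_dim_v)

# The P2P ring exchange sends K and V as one tensor; with equal head dims
# they could simply be stacked into a single communication buffer:
try:
    kv = torch.stack([k, v], dim=0)  # fails: last dims differ (192 vs 128)
except RuntimeError as err:
    print("stack fails:", err)

# One possible workaround (an assumption, not the library's chosen fix):
# pad V up to K's head_dim so a single buffer still works, then slice V
# back out after the exchange.
v_padded = F.pad(v, (0, head_dim_k - head_dim_v))
kv = torch.stack([k, v_padded], dim=0)            # single tensor to send/recv
k_recv, v_recv = kv[0], kv[1][..., :head_dim_v]   # unpack after communication
assert torch.equal(v_recv, v)
```

Padding wastes some communication bandwidth; communicating K and V as separate buffers would be another option, which is presumably why this is described above as addressable.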
I have a question regarding FusedAttention: why doesn't it support context parallelism with MLA (Multi-head Latent Attention)? What are the technical limitations preventing this compatibility?