
Context parallelism with MLA #1552

Open
SuperCB opened this issue Mar 8, 2025 · 4 comments

SuperCB commented Mar 8, 2025

I have a question regarding FusedAttention: why doesn't it support context parallelism with MLA (Multi-head Layer Attention)? What are the technical limitations preventing this compatibility?

xrennvidia (Collaborator) commented Mar 10, 2025

Hi @SuperCB

You mean Multi-head Latent Attention, which is used by DeepSeek? Technically, nothing should stop us from doing it; we just have not done it yet. Considering the popularity of MLA/DeepSeek, we should definitely add this support. We will do it. Thanks for bringing this to our attention.

SuperCB (Author) commented Mar 11, 2025

I am working on it too. I found that the function AttnFuncWithCPAndQKVOA2A seems able to support context parallelism for MLA. Is my conclusion correct, and what are the main reasons currently preventing MLA from supporting context parallelism?

xrennvidia (Collaborator) commented

Yeah, the A2A implementation can probably work with MLA out of the box. AttnFuncWithCPAndKVAllGather might work for MLA as well.
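
To illustrate why the A2A path is agnostic to K/V head_dim, here is a minimal Ulysses-style sketch (my own illustration, not TE code; the `cp_group` handle and the `[s/cp, b, h, d]` layout are assumptions): each of q, k and v is redistributed independently, so K and V never share a communication buffer and their last dimensions are free to differ.

```python
import torch
import torch.distributed as dist

def seq_to_head_shard(x: torch.Tensor, cp_group) -> torch.Tensor:
    """All-to-all that trades a sequence shard for a head shard:
    [s/cp, b, h, d] -> [s, b, h/cp, d]. Called separately on q, k and v,
    so v's head_dim (e.g. 128 in MLA) never has to match k's (e.g. 192)."""
    cp = dist.get_world_size(cp_group)
    s_shard, b, h, d = x.shape
    # one chunk of h/cp heads per destination rank, chunk dim leading
    x = x.reshape(s_shard, b, cp, h // cp, d).permute(2, 0, 1, 3, 4).contiguous()
    out = torch.empty_like(x)
    dist.all_to_all_single(out, x, group=cp_group)
    # the received chunks are sequence shards; flatten back into one full sequence
    return out.reshape(cp * s_shard, b, h // cp, d)
```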

P2P cannot work because it concatenates K and V into a single tensor for communication; the different head_dim of K and V prevents us from doing the concat. But this should be addressable.
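
A toy repro of that shape conflict, using DeepSeek-V2's MLA head dims as an example (this paraphrases the P2P code path rather than quoting it):

```python
import torch

s, b, h = 8, 1, 4              # toy sizes
k = torch.randn(s, b, h, 192)  # MLA qk head_dim: 128 (nope) + 64 (rope)
v = torch.randn(s, b, h, 128)  # MLA v_head_dim: 128

# The P2P ring packs K and V into one buffer before each send/recv,
# conceptually torch.stack([k, v]) -- which requires equal shapes:
try:
    kv = torch.stack([k, v])
except RuntimeError as e:
    print(e)  # "stack expects each tensor to be equal size ..."
```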

As I said, technically there should be no reason preventing MLA+CP; at least I do not know of any right now. I might find something once I start working on this.

SuperCB (Author) commented Mar 11, 2025

I think we can support MLA+CP in P2P by padding the V tensor, which keeps modifications to the original code minimal. I am currently attempting this approach.
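
A minimal single-GPU sketch of the padding idea as I read it (my own illustration, not the actual patch): zero-pad V's head_dim up to K's so the stack for the ring exchange succeeds, then slice the padding off the attention output. The zero columns of V contribute only zero columns to softmax(QK^T/sqrt(d))V, so the sliced result is numerically unchanged.

```python
import torch
import torch.nn.functional as F

def attn_with_padded_v(q, k, v):
    d_k, d_v = k.shape[-1], v.shape[-1]
    v_pad = F.pad(v, (0, d_k - d_v))  # zero-pad last dim: [..., d_v] -> [..., d_k]
    kv = torch.stack([k, v_pad])      # now K and V fit in one P2P buffer
    k_, v_ = kv[0], kv[1]             # (ring send/recv of kv would happen here)
    scores = torch.softmax(q @ k_.transpose(-2, -1) / d_k**0.5, dim=-1)
    return (scores @ v_)[..., :d_v]   # drop the zero columns again

q = torch.randn(2, 4, 16, 192)        # [b, h, s, d_qk], MLA-style dims
k = torch.randn(2, 4, 16, 192)
v = torch.randn(2, 4, 16, 128)
ref = torch.softmax(q @ k.transpose(-2, -1) / 192**0.5, dim=-1) @ v
assert torch.allclose(attn_with_padded_v(q, k, v), ref, atol=1e-6)
```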

[image attached]
