Context Parallel w/ Ring & Ulysses & Unified Attention #11941
Conversation
Co-Authored-By: Dhruv Nair <[email protected]>
I am going to review it very soon. But before I do, I'd like to read a bit about unified attention. Simple searches returned results that didn't seem relevant, hence the ask.
Unified CP is a generalization of running Ulysses and Ring attention together; both methods become special cases of unified attention. Paper: https://arxiv.org/abs/2405.07719v3
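For intuition, here is a minimal sketch of the 2D decomposition the paper describes (names and layout are illustrative, not from this PR): the world is viewed as a `ring_degree x ulysses_degree` mesh, with Ulysses-style all-to-all along one axis and ring P2P along the other, so setting either degree to 1 recovers the pure method.

```python
import torch.distributed as dist

def make_unified_groups(world_size: int, ring_degree: int, ulysses_degree: int):
    """Build the two process-group axes behind unified attention.

    Ranks are laid out row-major on a (ring_degree x ulysses_degree) mesh:
    rank = i * ulysses_degree + j. Ulysses all-to-all runs within a row,
    ring P2P within a column.
    """
    assert world_size == ring_degree * ulysses_degree
    rank = dist.get_rank()
    # One ring group per mesh column (fixed j, varying i).
    ring_groups = [
        dist.new_group([i * ulysses_degree + j for i in range(ring_degree)])
        for j in range(ulysses_degree)
    ]
    # One Ulysses group per mesh row (fixed i, varying j).
    ulysses_groups = [
        dist.new_group([i * ulysses_degree + j for j in range(ulysses_degree)])
        for i in range(ring_degree)
    ]
    return ring_groups[rank % ulysses_degree], ulysses_groups[rank // ulysses_degree]

# ulysses_degree == 1 -> pure ring attention; ring_degree == 1 -> pure Ulysses.
```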
Adds native support for Ring, Ulysses, and Unified attention. For a minimal PoC, I've limited the changes to Flux.
Supported attention backends with CP: cuDNN, FA2, Sage.
Requires #11916 to be merged first.
Minimal example
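A hypothetical sketch of the intended usage, assuming a `ContextParallelConfig`-style config object and an `enable_parallelism` entry point (the exact names in this PR may differ). Launch with e.g. `torchrun --nproc-per-node=2 example.py`; the product of the two degrees must equal the world size.

```python
import torch
import torch.distributed as dist
from diffusers import FluxPipeline, ContextParallelConfig  # ContextParallelConfig assumed

dist.init_process_group("nccl")
device = torch.device("cuda", dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to(device)

# ring_degree * ulysses_degree must equal the world size; setting either
# degree to 1 recovers pure Ring or pure Ulysses attention.
pipe.transformer.enable_parallelism(
    config=ContextParallelConfig(ring_degree=2, ulysses_degree=1)
)

image = pipe(
    "A photo of a cat",
    num_inference_steps=28,
    generator=torch.Generator().manual_seed(0),
).images[0]

if dist.get_rank() == 0:
    image.save("output.png")
dist.destroy_process_group()
```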
Benchmarks
TODO
Explanation
Each model should define a `_cp_plan` attribute that contains information on how to shard/gather tensors at different stages of the forward.
TODO
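As a rough illustration of the idea (the class and field names below are assumptions, not necessarily this PR's API), a plan maps submodule paths to per-tensor shard/gather metadata:

```python
from dataclasses import dataclass

# Illustrative stand-ins; the PR presumably ships its own versions of these.
@dataclass
class ContextParallelInput:
    split_dim: int      # dimension to shard across CP ranks on the way in
    expected_dims: int  # sanity check on the tensor's rank

@dataclass
class ContextParallelOutput:
    gather_dim: int     # dimension to all-gather on the way out
    expected_dims: int

# A hypothetical plan for a Flux-like transformer: keys are submodule
# paths ("" = the root forward), values describe how named tensors are
# sharded on entry and gathered on exit.
_cp_plan = {
    "": {
        "hidden_states": ContextParallelInput(split_dim=1, expected_dims=3),
        "encoder_hidden_states": ContextParallelInput(split_dim=1, expected_dims=3),
        "img_ids": ContextParallelInput(split_dim=0, expected_dims=2),
        "txt_ids": ContextParallelInput(split_dim=0, expected_dims=2),
    },
    # Gather the sequence dimension back at the final projection.
    "proj_out": ContextParallelOutput(gather_dim=1, expected_dims=3),
}
```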
Note: there were some merge conflicts that I'm not sure I resolved correctly, so some things may be broken. For this reason, I've removed training support and only tested inference. I'll update some of the TODOs tomorrow.