Context Parallel w/ Ring & Ulysses & Unified Attention #11941
Conversation
Co-Authored-By: Dhruv Nair <[email protected]>
I am going to review it very soon. But before I do, I'd like to read a bit about unified attention. Simple searches returned results that didn't seem relevant, hence the ask.
Unified CP is a generalization of running Ulysses and Ring attention together; both methods become special cases of unified attention. Paper: https://arxiv.org/abs/2405.07719v3
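For intuition, here is a minimal sketch of the 2D decomposition the paper describes (names and layout are illustrative, not from this PR): the world is viewed as a `ring_degree x ulysses_degree` mesh, with Ulysses-style all-to-all along one axis and ring P2P along the other, so setting either degree to 1 recovers the pure method.

```python
import torch.distributed as dist

def make_unified_groups(world_size: int, ring_degree: int, ulysses_degree: int):
    """Build the two process-group axes behind unified attention.

    Ranks are laid out row-major on a (ring_degree x ulysses_degree) mesh:
    rank = i * ulysses_degree + j. Ulysses all-to-all runs within a row,
    ring P2P within a column.
    """
    assert world_size == ring_degree * ulysses_degree
    rank = dist.get_rank()
    # One ring group per mesh column (fixed j, varying i).
    ring_groups = [
        dist.new_group([i * ulysses_degree + j for i in range(ring_degree)])
        for j in range(ulysses_degree)
    ]
    # One Ulysses group per mesh row (fixed i, varying j).
    ulysses_groups = [
        dist.new_group([i * ulysses_degree + j for j in range(ulysses_degree)])
        for i in range(ring_degree)
    ]
    return ring_groups[rank % ulysses_degree], ulysses_groups[rank // ulysses_degree]

# ulysses_degree == 1 -> pure ring attention; ring_degree == 1 -> pure Ulysses.
```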
Adds native support for Ring, Ulysses, and Unified attention. For a minimal PoC, I've limited the changes to Flux.
Supported attention backends with CP: cuDNN, FA2, Sage.
Requires #11916 to be merged first.
Minimal example
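A hypothetical sketch of the intended usage, assuming a `ContextParallelConfig`-style config object and an `enable_parallelism` entry point (the exact names in this PR may differ). Launch with e.g. `torchrun --nproc-per-node=2 example.py`; the product of the two degrees must equal the world size.

```python
import torch
import torch.distributed as dist
from diffusers import FluxPipeline, ContextParallelConfig  # ContextParallelConfig assumed

dist.init_process_group("nccl")
device = torch.device("cuda", dist.get_rank())

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to(device)

# ring_degree * ulysses_degree must equal the world size; setting either
# degree to 1 recovers pure Ring or pure Ulysses attention.
pipe.transformer.enable_parallelism(
    config=ContextParallelConfig(ring_degree=2, ulysses_degree=1)
)

image = pipe(
    "A photo of a cat",
    num_inference_steps=28,
    generator=torch.Generator().manual_seed(0),
).images[0]

if dist.get_rank() == 0:
    image.save("output.png")
dist.destroy_process_group()
```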
Benchmarks
TODO
Explanation
Each model should define a `_cp_plan` attribute that contains information on how to shard/gather tensors at different stages of the forward.
TODO
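As a rough illustration of the idea (the class and field names below are assumptions, not necessarily this PR's API), a plan maps submodule paths to per-tensor shard/gather metadata:

```python
from dataclasses import dataclass

# Illustrative stand-ins; the PR presumably ships its own versions of these.
@dataclass
class ContextParallelInput:
    split_dim: int      # dimension to shard across CP ranks on the way in
    expected_dims: int  # sanity check on the tensor's rank

@dataclass
class ContextParallelOutput:
    gather_dim: int     # dimension to all-gather on the way out
    expected_dims: int

# A hypothetical plan for a Flux-like transformer: keys are submodule
# paths ("" = the root forward), values describe how named tensors are
# sharded on entry and gathered on exit.
_cp_plan = {
    "": {
        "hidden_states": ContextParallelInput(split_dim=1, expected_dims=3),
        "encoder_hidden_states": ContextParallelInput(split_dim=1, expected_dims=3),
        "img_ids": ContextParallelInput(split_dim=0, expected_dims=2),
        "txt_ids": ContextParallelInput(split_dim=0, expected_dims=2),
    },
    # Gather the sequence dimension back at the final projection.
    "proj_out": ContextParallelOutput(gather_dim=1, expected_dims=3),
}
```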
Note: there were some merge conflicts that I'm not sure I resolved correctly, so some things may be broken. For this reason, I've removed training support and only tested inference. I'll update some of the TODOs tomorrow.