Add KV cache for paged/non-paged attention #1355

cyanguwa · 2024-12-04T05:03:04Z

Description

This PR adds KV cache support for FusedAttention, FlashAttention, and UnfusedDotProductAttention backends in TE-PyTorch.

backend  | precision      |    KV cache     | architecture | qkv_format    | page_size
---------------------------------------------------------------------------------------
Fused    | FP16/BF16      | non-paged/paged | sm80+        | bshd,sbhd,thd | >= 1
Flash v2 | FP16/BF16      | non-paged/paged | sm80+        | bshd,sbhd,thd | >= 256
Flash v3 | FP16/BF16      | non-paged/paged | sm90         | bshd,sbhd,thd | >= 1
         | FP8            | non-paged/paged | sm90         | thd           | >= 1
Unfused  | FP32/FP16/BF16 | non-paged/paged | all          | bshd,sbhd,thd | >= 1
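
The support matrix above can be read as a simple lookup. The following is a minimal sketch, not code from TE: `Row`, `SUPPORT`, `is_supported`, and the backend keys are hypothetical names chosen for illustration. It assumes that "sm80+" means compute capability 80 or higher, that "sm90" means exactly 90, and that the page_size bound only matters when a paged cache is used.

```python
from typing import NamedTuple, Optional


class Row(NamedTuple):
    backend: str
    precisions: frozenset
    min_sm: int             # lowest supported compute capability (0 = any)
    max_sm: Optional[int]   # highest supported one, None = no upper bound
    qkv_formats: frozenset
    min_page_size: int


ALL_FORMATS = frozenset({"bshd", "sbhd", "thd"})
HALF = frozenset({"fp16", "bf16"})

# One entry per row of the table (hypothetical encoding).
SUPPORT = [
    Row("fused", HALF, 80, None, ALL_FORMATS, 1),
    Row("flash_v2", HALF, 80, None, ALL_FORMATS, 256),
    Row("flash_v3", HALF, 90, 90, ALL_FORMATS, 1),
    Row("flash_v3", frozenset({"fp8"}), 90, 90, frozenset({"thd"}), 1),
    Row("unfused", HALF | {"fp32"}, 0, None, ALL_FORMATS, 1),
]


def is_supported(backend, precision, sm, qkv_format, page_size=None):
    """Return True if some table row admits this configuration.

    page_size=None stands for a non-paged cache, so the page_size
    bound is skipped in that case (an assumption of this sketch).
    """
    for row in SUPPORT:
        if row.backend != backend or precision not in row.precisions:
            continue
        if sm < row.min_sm:
            continue
        if row.max_sm is not None and sm > row.max_sm:
            continue
        if qkv_format not in row.qkv_formats:
            continue
        if page_size is not None and page_size < row.min_page_size:
            continue
        return True
    return False
```

For example, `is_supported("flash_v2", "bf16", 80, "bshd", page_size=64)` is rejected by this sketch because the Flash v2 row requires page_size >= 256.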

KV cache is in bshd format, and incoming tokens can be in bshd, sbhd, or thd
FusedAttention processes QKV in bshd, sbhd_2bshd, or thd_2bshd format directly
FlashAttention v2 converts Q to thd and uses flash_attn_varlen_func for attention
FlashAttention v3 (commit 39e7197 or later) converts Q to thd and uses flash_attn_with_kvcache for attention
UnfusedDotProductAttention converts Q to bshd for attention; for paged, it converts the cache tensors to non-paged first, based on the page table
All backends support pure context, pure generation, and mixed context/generation phases
FusedAttention and FlashAttention support CUDA graphs
K cache and V cache should have the same page table
fp8_dpa=True is supported (KV cache is still in FP16/BF16 precision); fp8_mha=True is not
Context parallelism is not supported
RoPE for inference will be fixed in "RoPE enhancements" (#1478)
Requires FE 1.11 from "Update FE to 1.11" (#1580)
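
The paged-to-non-paged conversion mentioned for UnfusedDotProductAttention can be illustrated with a small gather over a page table. This is a sketch under stated assumptions, not the TE implementation: `paged_to_contiguous` is a hypothetical helper, plain Python lists stand in for tensors, and each page is assumed to hold `page_size` consecutive tokens of one sequence.

```python
def paged_to_contiguous(pages, page_table, seq_lens):
    """Gather a paged cache into one contiguous token list per sequence.

    pages:      list of pages; each page is a list of page_size token entries
    page_table: page_table[i] lists the page indices of sequence i,
                in logical order
    seq_lens:   seq_lens[i] is the number of valid tokens in sequence i
    """
    contiguous = []
    for page_ids, seq_len in zip(page_table, seq_lens):
        tokens = []
        for page_id in page_ids:
            tokens.extend(pages[page_id])
        # The last page may be only partly filled, so trim the padding.
        contiguous.append(tokens[:seq_len])
    return contiguous


# Two sequences, page_size = 2, pages interleaved in physical memory.
k_pages = [["k_a0", "k_a1"], ["k_b0", "k_b1"], ["k_a2", "k_a3"], ["k_b2", None]]
v_pages = [["v_a0", "v_a1"], ["v_b0", "v_b1"], ["v_a2", "v_a3"], ["v_b2", None]]
page_table = [[0, 2], [1, 3]]
seq_lens = [4, 3]

# K and V share the same page table, so one table serves both gathers.
k_cache = paged_to_contiguous(k_pages, page_table, seq_lens)
v_cache = paged_to_contiguous(v_pages, page_table, seq_lens)
```

Because the K cache and V cache use the same page table, the same index lists drive both gathers and the resulting K and V sequences stay aligned token by token.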

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Add KV caching support for FusedAttention, FlashAttention, and UnfusedDotProductAttention
Add mixed q/kv format support to FusedAttention for F16
Adapt to the new FA3 APIs from FA2.7.3+/hopper for CP and non-CP cases

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

cyanguwa · 2024-12-04T05:45:41Z

/te-ci pytorch L0

cyanguwa · 2025-01-06T12:27:19Z

/te-ci pytorch L0

cyanguwa · 2025-03-14T04:09:05Z

/te-ci pytorch L0

cyanguwa · 2025-03-14T04:14:52Z

/te-ci pytorch L1

cyanguwa · 2025-03-14T04:15:16Z

/te-ci pytorch L3

cyanguwa · 2025-03-14T22:13:55Z

/te-ci pytorch L0 L1 L3

cyanguwa · 2025-03-14T22:16:35Z

/te-ci jax L0

cyanguwa · 2025-03-14T23:13:59Z

/te-ci pytorch L0

cyanguwa · 2025-03-15T03:30:35Z

/te-ci pytorch L0 L1 L3

cyanguwa and others added 10 commits December 3, 2024 17:01

add paged attention; test_kv_cache_accuray and test_paged_attn pass

44f6ff2

Signed-off-by: Charlene Yang <[email protected]>

remove unnecessary change from last commit

06605e5

Signed-off-by: Charlene Yang <[email protected]>

test_fused_attn pass

0b2eb88

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into paged_attention

d243b79

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

b0a5da4

for more information, see https://pre-commit.ci

remove unnecessary import in test_numerics

b4efd71

Signed-off-by: Charlene Yang <[email protected]>

add license for test

e637a07

Signed-off-by: Charlene Yang <[email protected]>

fix lint

767c8f5

Signed-off-by: Charlene Yang <[email protected]>

add to L0 test

a3bb14f

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

d65933c

for more information, see https://pre-commit.ci

cyanguwa requested review from sudhakarsingh27 and ptrendx December 4, 2024 16:45

cyanguwa added 3 commits January 6, 2025 04:16

Merge branch 'main' into paged_attention

cd626b8

Signed-off-by: Charlene Yang <[email protected]>

update license for test_paged_attn

7c23b96

Signed-off-by: Charlene Yang <[email protected]>

update kv_cache_manager license

2dbf2e1

Signed-off-by: Charlene Yang <[email protected]>

sudhakarsingh27 reviewed Jan 7, 2025

transformer_engine/pytorch/attention.py

tests/pytorch/fused_attn/test_paged_attn.py

cyanguwa added 2 commits January 6, 2025 17:09

fix build issue from previous merge

d2f1549

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into paged_attention

81a07e0

sudhakarsingh27 reviewed Jan 28, 2025

cyanguwa and others added 7 commits January 29, 2025 07:47

Merge branch 'main' into paged_attention

76282cf

Merge branch 'NVIDIA:main' into paged_attention

366fa65

Merge branch 'main' into paged_attention

9f31f09

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

8dc06e0

for more information, see https://pre-commit.ci

WIP: minor fix/preparation for inference/cuda graph

59dcf48

Signed-off-by: Charlene Yang <[email protected]>

WIP: non-paged

09448a9

Signed-off-by: Charlene Yang <[email protected]>

WIP: non-paged, bshd/sbhd

612637c

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa force-pushed the paged_attention branch from 8e80771 to 612637c Compare February 12, 2025 06:28

WIP: non-paged, thd, no CG

f9bd83c

Signed-off-by: Charlene Yang <[email protected]>

pre-commit-ci bot and others added 3 commits March 14, 2025 03:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

a3bc1b4

for more information, see https://pre-commit.ci

minor tweaks

0fd197f

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

e284346

for more information, see https://pre-commit.ci

update FA3 note and L3 test

7a9f357

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa added 6 commits March 13, 2025 21:25

fix lint

2495c80

Signed-off-by: Charlene Yang <[email protected]>

remove redundant import in test

28d9983

Signed-off-by: Charlene Yang <[email protected]>

Merge branch 'main' into paged_attention

674535d

Merge branch 'main' into paged_attention

de48ef6

adopt new FA3 APIs from FA2.7.3+/hopper for CP and non-CP

496776b

Signed-off-by: Charlene Yang <[email protected]>

fix lint

7f1c765

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa force-pushed the paged_attention branch from 4a2fd47 to 7f1c765 Compare March 14, 2025 22:10

cyanguwa and others added 2 commits March 15, 2025 06:11

Merge branch 'main' into paged_attention

de5a2f6

[pre-commit.ci] auto fixes from pre-commit.com hooks

0cf5c0d

for more information, see https://pre-commit.ci

relax tols for TransformerLayers

5578b69

Signed-off-by: Charlene Yang <[email protected]>

cyanguwa and others added 9 commits March 15, 2025 08:29

Merge branch 'main' into paged_attention

6a26e0e

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

2b1b72f

for more information, see https://pre-commit.ci

fix merge

a6c8455

Signed-off-by: Charlene Yang <[email protected]>

fix merge 2

b598cb9

Signed-off-by: Charlene Yang <[email protected]>

fix FA import comments

5e45442

Signed-off-by: Charlene Yang <[email protected]>

relax tols for Ampere

d770116

Signed-off-by: Charlene Yang <[email protected]>

fix fa3 version and reduce messaging

0025478

Signed-off-by: Charlene Yang <[email protected]>

[pre-commit.ci] auto fixes from pre-commit.com hooks

bec87e7

for more information, see https://pre-commit.ci

Merge branch 'main' into paged_attention

5475163
