Simplify the attention function #2609

Merged
merged 2 commits into main from maintenance/simplify-attention on Oct 17, 2024

Conversation

@danieldk (Member) commented on Oct 4, 2024

What does this PR do?

  • Use one definition rather than multiple (will make it easier to do shared things once, such as calculating the FP8 KV cache reciprocal).
  • Add key/value arguments, so that we don't need the PREFILL_IN_KV_CACHE constant.
  • Make it kwargs-only (to avoid mixing up the various Tensor args); a rough sketch of such a signature is shown after this list.
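
As a rough illustration of the interface described above (a sketch only: the argument names, shapes, and the SDPA fallback body are assumptions, not the exact TGI API), a kwargs-only attention function with explicit `key`/`value` arguments could look like this:

```python
import torch
import torch.nn.functional as F

def attention(
    *,                            # kwargs-only: every tensor must be passed by name
    query: torch.Tensor,          # [batch, heads, q_len, head_dim]
    key: torch.Tensor,            # [batch, heads, kv_len, head_dim]
    value: torch.Tensor,          # [batch, heads, kv_len, head_dim]
    softmax_scale: float | None = None,
    causal: bool = True,
) -> torch.Tensor:
    # Reference semantics only; the real backends dispatch to flash-attention,
    # flashinfer, paged attention, etc. Passing key/value explicitly lets each
    # backend decide what to read, instead of consulting a global
    # PREFILL_IN_KV_CACHE flag.
    return F.scaled_dot_product_attention(
        query, key, value, is_causal=causal, scale=softmax_scale
    )
```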

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@danieldk marked this pull request as draft on October 4, 2024 14:18
@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from ba0f068 to 4b9aa9c on October 4, 2024 14:49
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from 3d9a95b to 9581f20 on October 7, 2024 08:04
@danieldk changed the title from "Simplify attention function" to "Simplify the attention function" on Oct 7, 2024
@danieldk marked this pull request as ready for review on October 7, 2024 13:40
@danieldk force-pushed the maintenance/simplify-attention branch from 9581f20 to 69bd333 on October 8, 2024 13:05
@danieldk marked this pull request as draft on October 8, 2024 13:28
@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from 69c9d0d to c56df2d on October 8, 2024 14:34
@danieldk marked this pull request as ready for review on October 8, 2024 15:54
@drbh (Collaborator) left a comment

LGTM

  • just the small nits about function signatures

server/text_generation_server/layers/attention/rocm.py (outdated; resolved)
server/text_generation_server/layers/attention/ipex.py (outdated; resolved)
@danieldk requested a review from drbh on October 11, 2024 13:22
@drbh previously approved these changes on Oct 11, 2024
@drbh (Collaborator) left a comment

nice! looks good to me ✨

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).
@danieldk force-pushed the maintenance/simplify-attention branch from 638d7ab to 07128cc on October 16, 2024 13:51

Review thread on the line `elif ATTENTION == "paged":`
Collaborator:
You're breaking paged here.

ATTENTION="paged" text-generation-launcher ...

shows the issue.

PAGED still uses v2, not v1 (unless sm is too low)

Collaborator:
paged attention is not V1 vs V2, those are separate concerns.

Collaborator:
And we cannot use the block_tables implementation for paged + v2, because that requires BLOCK_SIZE=256, where paged attention uses block_size = 16.
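
As an aside, an illustrative back-of-the-envelope check of the block-size mismatch described above (the sequence length and the helper are hypothetical, not TGI's actual bookkeeping):

```python
# How many KV-cache blocks a sequence occupies at the two block sizes mentioned.
def num_blocks(seq_len: int, block_size: int) -> int:
    # Ceiling division: a partially filled block still occupies a whole slot.
    return -(-seq_len // block_size)

seq_len = 1000
for block_size in (16, 256):
    print(f"block_size={block_size:>3}: {num_blocks(seq_len, block_size)} blocks")
# block_size= 16: 63 blocks (granularity paged attention uses)
# block_size=256:  4 blocks (granularity the block_tables path expects)
```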

@danieldk (Member, PR author):
Should be fixed now, tested Llama & Mistral with paged, flashattention and flashinfer.

@Narsil (Collaborator) left a comment

LGTM

@Narsil merged commit 59ea38c into main on Oct 17, 2024
11 of 12 checks passed
@Narsil deleted the maintenance/simplify-attention branch on October 17, 2024 08:42