Simplify the attention function #2609

Merged
merged 2 commits into main from maintenance/simplify-attention on Oct 17, 2024

Conversation

@danieldk (Member) commented on Oct 4, 2024

What does this PR do?

  • Use one definition rather than multiple (will make it easier to do shared things once, such as calculating the FP8 KV cache reciprocal).
  • Add key/value arguments, so that we don't need the PREFILL_IN_KV_CACHE constant.
  • Make it kwargs-only (to avoid mixing up the various Tensor args); a rough sketch of such a signature is shown after this list.
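
As a rough illustration of the interface described above (a sketch only: the argument names, shapes, and the SDPA fallback body are assumptions, not the exact TGI API), a kwargs-only attention function with explicit `key`/`value` arguments could look like this:

```python
import torch
import torch.nn.functional as F

def attention(
    *,                            # kwargs-only: every tensor must be passed by name
    query: torch.Tensor,          # [batch, heads, q_len, head_dim]
    key: torch.Tensor,            # [batch, heads, kv_len, head_dim]
    value: torch.Tensor,          # [batch, heads, kv_len, head_dim]
    softmax_scale: float | None = None,
    causal: bool = True,
) -> torch.Tensor:
    # Reference semantics only; the real backends dispatch to flash-attention,
    # flashinfer, paged attention, etc. Passing key/value explicitly lets each
    # backend decide what to read, instead of consulting a global
    # PREFILL_IN_KV_CACHE flag.
    return F.scaled_dot_product_attention(
        query, key, value, is_causal=causal, scale=softmax_scale
    )
```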

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@danieldk marked this pull request as draft on October 4, 2024 14:18
@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from ba0f068 to 4b9aa9c on October 4, 2024 14:49
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from 3d9a95b to 9581f20 on October 7, 2024 08:04
@danieldk changed the title from "Simplify attention function" to "Simplify the attention function" on Oct 7, 2024
@danieldk marked this pull request as ready for review on October 7, 2024 13:40
@danieldk force-pushed the maintenance/simplify-attention branch from 9581f20 to 69bd333 on October 8, 2024 13:05
@danieldk marked this pull request as draft on October 8, 2024 13:28
@danieldk force-pushed the maintenance/simplify-attention branch 2 times, most recently from 69c9d0d to c56df2d on October 8, 2024 14:34
@danieldk marked this pull request as ready for review on October 8, 2024 15:54
@drbh (Collaborator) left a comment

LGTM

  • just the small nits about function signatures

server/text_generation_server/layers/attention/rocm.py (outdated; resolved)
server/text_generation_server/layers/attention/ipex.py (outdated; resolved)
@danieldk requested a review from drbh on October 11, 2024 13:22
@drbh previously approved these changes on Oct 11, 2024
@drbh (Collaborator) left a comment

nice! looks good to me ✨

- Use one definition rather than multiple.
- Add `key`/`value` arguments, so that we don't need the
  `PREFILL_IN_KVCACHE` constant.
- Make it kwargs-only (to avoid mixing up the various `Tensor` args).
@danieldk force-pushed the maintenance/simplify-attention branch from 638d7ab to 07128cc on October 16, 2024 13:51

Review thread on the line `elif ATTENTION == "paged":`
Collaborator:
You're breaking paged here.

ATTENTION="paged" text-generation-launcher ...

shows the issue.

PAGED still uses v2, not v1 (unless sm is too low)

Collaborator:
paged attention is not V1 vs V2, those are separate concerns.

Collaborator:
And we cannot use the block_tables implementation for paged + v2, because that requires BLOCK_SIZE=256, where paged attention uses block_size = 16.
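
As an aside, an illustrative back-of-the-envelope check of the block-size mismatch described above (the sequence length and the helper are hypothetical, not TGI's actual bookkeeping):

```python
# How many KV-cache blocks a sequence occupies at the two block sizes mentioned.
def num_blocks(seq_len: int, block_size: int) -> int:
    # Ceiling division: a partially filled block still occupies a whole slot.
    return -(-seq_len // block_size)

seq_len = 1000
for block_size in (16, 256):
    print(f"block_size={block_size:>3}: {num_blocks(seq_len, block_size)} blocks")
# block_size= 16: 63 blocks (granularity paged attention uses)
# block_size=256:  4 blocks (granularity the block_tables path expects)
```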

@danieldk (Member, PR author):
Should be fixed now, tested Llama & Mistral with paged, flashattention and flashinfer.

@Narsil (Collaborator) left a comment

LGTM

@Narsil merged commit 59ea38c into main on Oct 17, 2024
11 of 12 checks passed
@Narsil deleted the maintenance/simplify-attention branch on October 17, 2024 08:42