Request to replace the model acceleration technique "flash attention" with the more versatile "vllm" #138
jake123456789ok started this conversation in Ideas
Replies: 0
I hope that in the future the Yi model can replace its default "flash attention" acceleration with vLLM, so that we can benefit from the latest inference speed techniques.
(To clarify: vLLM already supports the latest "flash decoding", and there are plans to support "flash decoding++" in the future.)