Faster generation times #38
Inference is taking too long. Are there any plans to optimise the processing times? Has anyone been successful in bringing generation times down? An A100 80 GB takes up to 8-13 minutes for a 30-second clip, which is very expensive for a mono model.

Comments
You can try increasing the batch size. Issue #8 is also working on quantization.
I think that if you use the NF4 models you can easily increase the batch size to 16. I'm going to run this experiment to see whether using NF4, which takes up less VRAM than BF16, makes it possible to increase the batch size.
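A minimal sketch of what loading a model in NF4 could look like with transformers and bitsandbytes; the model ID, batch size, and whether YuE's inference code accepts a quantized model this way are assumptions, not something confirmed in this thread:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# NF4 4-bit quantization via bitsandbytes; compute still runs in bf16.
nf4_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Placeholder model ID; substitute the actual stage checkpoint being used.
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s2-1B-general",
    quantization_config=nf4_config,
    device_map="auto",
)

# The smaller 4-bit footprint is what would leave room for a larger
# batch size (e.g. 16) compared with the BF16 weights.
```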
Optimizing only stage 2 is of limited value, since inference in stage 1 is very time-consuming and the GPU is not fully utilized there.
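If someone wants to verify where the stage-1 time goes, a rough profiling sketch with torch.profiler, assuming `model` and `inputs` are already prepared (for example as in the loading sketch above):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# Profile one generation call; low GPU utilization shows up as most of the
# wall time landing in CPU-side ops rather than CUDA kernels.
with torch.no_grad():
    with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA]) as prof:
        model.generate(**inputs, max_new_tokens=256)

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=15))
```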
According to multimodal-art-projection/YuE#38, this could help with inference times: someone made a fork that adds the option of using SDPA instead of flash attention and applies a patch to transformers, roughly doubling the speed.
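For context, selecting SDPA as the attention backend in transformers is a load-time option; a sketch with a placeholder model ID (the fork's transformers patch mentioned above is separate from this):

```python
import torch
from transformers import AutoModelForCausalLM

# Use PyTorch's scaled_dot_product_attention kernel instead of flash-attention-2,
# which also removes the flash-attn build dependency.
model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s1-7B-anneal-en-cot",  # placeholder model ID
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",
    device_map="auto",
)
```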
Interesting, thanks for sharing this!
There's also SageAttention (Triton-based), which is about twice as fast as flash attention. Maybe someone can implement this.
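A sketch of the plug-and-play pattern the SageAttention project describes, routing PyTorch's SDPA entry point through its Triton kernel; whether this works with YuE's inference code is untested here:

```python
import torch.nn.functional as F
from sageattention import sageattn  # Triton-based attention kernels

# Monkey-patch SDPA so models loaded with attn_implementation="sdpa"
# pick up SageAttention; calls that pass an explicit attention mask
# may not be supported by this drop-in.
F.scaled_dot_product_attention = sageattn
```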