
Faster generation times #38

Open
bluenucleus opened this issue Jan 31, 2025 · 6 comments

Comments

@bluenucleus

Inference is taking too long. Are there any plans to optimise the processing times? Has anyone managed to bring generation times down? An A100 80GB takes up to 8-13 minutes for a 30-second clip. That's very expensive for a mono model.

@a43992899
Collaborator

a43992899 commented Jan 31, 2025

You can increase --stage2_batch_size to speed up inference, since you have plenty of VRAM.

Try --stage2_batch_size 16.

Issue #8 is also tracking work on quantization.

@alisson-anjos

alisson-anjos commented Jan 31, 2025

I think that if you use the NF4 models you can easily increase the batch size to 16. I'm going to run that experiment to see whether using NF4, which takes up less VRAM, makes it possible to raise the batch size compared with BF16.
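For anyone trying this, here is a minimal sketch (not the repo's actual loading code, and the checkpoint name is only a placeholder) of loading a model in NF4 with transformers and bitsandbytes, which is what frees up the VRAM headroom for a larger --stage2_batch_size:

```python
# Hypothetical NF4 loading sketch; the checkpoint id below is a placeholder,
# not necessarily the weights the YuE pipeline actually loads.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",               # NF4 4-bit quantization
    bnb_4bit_compute_dtype=torch.bfloat16,   # run the matmuls in BF16
    bnb_4bit_use_double_quant=True,          # quantize the quantization constants too
)

model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s1-7B-anneal-en-cot",         # placeholder model id
    quantization_config=bnb_config,
    device_map="auto",
)
```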

@austin2035

austin2035 commented Feb 1, 2025

There is little point in optimizing only stage 2, since inference in stage 1 is very time-consuming and the GPU is not fully utilized at all.

tvararu added a commit to tvararu/YuE that referenced this issue Feb 1, 2025
According to multimodal-art-projection/YuE#38,
this could help with inference times.
@alisson-anjos

There was someone who made a fork, added the option of using SDPA instead of FlashAttention, and applied a patch to transformers to double the speed.

https://github.com/deepbeepmeep/YuEGP
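For context, the SDPA part of that is just a load-time switch in transformers; a minimal sketch (the checkpoint id is a placeholder, not necessarily what YuE loads):

```python
# Hypothetical sketch: select PyTorch's scaled_dot_product_attention (SDPA)
# backend instead of FlashAttention when loading the model with transformers.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "m-a-p/YuE-s1-7B-anneal-en-cot",   # placeholder model id
    torch_dtype=torch.bfloat16,
    attn_implementation="sdpa",        # instead of "flash_attention_2"
    device_map="auto",
)
```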

@jrked

jrked commented Feb 2, 2025

> There was someone who made a fork, added the option of using SDPA instead of FlashAttention, and applied a patch to transformers to double the speed.
>
> https://github.com/deepbeepmeep/YuEGP

Interesting, thanks for sharing this!

@Mozer

Mozer commented Feb 2, 2025

There's also SageAttention (using Triton), which is 2x faster than FlashAttention. Maybe someone can implement it:
thu-ml/SageAttention#21 (comment)
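If the model is running on the SDPA backend, SageAttention's README suggests a drop-in monkey-patch; a minimal sketch (assuming the sageattention package is installed, and untested against YuE):

```python
# Hypothetical sketch: globally replace PyTorch's SDPA kernel with
# SageAttention's sageattn, so any module that calls
# F.scaled_dot_product_attention picks it up. Apply before loading the model.
import torch.nn.functional as F
from sageattention import sageattn

F.scaled_dot_product_attention = sageattn
```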
