An exllamav2 implementation of YuE: 2-10x speedup and 8GB minimum VRAM #44
Comments
Can you share some benchmark numbers? For example, the time taken for 30 seconds of generation on a specific GPU, with and without quantization? So far, I'm getting 17 tok/sec on stage_1 with the defaults in the repo. Using vLLM gives me around 54 tok/sec on a 4090. I'm using n_segments = 3, so I hit a max context length of ~9000.
Speed on my 2080ti-22GB using the original bf16 model:
I updated the relevant info in the repo.
I am testing in a Colab notebook, and will update this comment with news.

EDIT: Cells so far:

Cell 1:
# Note: installing from PyPI uses the JIT version and requires nvcc + a compiler
!pip install exllamav2
# Make sure you have git-lfs installed (https://git-lfs.com)
# if you don't have root, see git-lfs/git-lfs#4134 (comment)
!sudo apt update

Cell 2:
Next step:
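A plausible Cell 2 (my guess at a continuation, not taken from the original comment; the repository URL and directory are placeholders) would finish the git-lfs setup started in Cell 1 and clone the fork:

```
# Hypothetical Cell 2: finish git-lfs setup and clone the exllamav2 fork
!sudo apt install -y git-lfs
!git lfs install
!git clone <fork-url>            # placeholder for the YuE-exllamav2 fork URL
%cd <repo-dir>                   # placeholder for the cloned directory
!pip install -r requirements.txt # assuming the repo ships a requirements file
```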
I can confirm that I tested that repo and saw significant performance improvements over the upstream solution. I recommend merging the changes made there.
I managed to get this to run on my laptop with a 6GB RTX 3060 Mobile using exl2-quantized models and a quantized k/v cache, and it seems to at least produce coherent output. Updated the page with perf stats for anyone interested, lol.
Oh boy. =)
I have the same issue. Interesting! ——— Great repo, thanks!
How do I fix this?
@rednessisaffair It seems you might have empty lyrics after preprocessing, and thus no output. Can you verify that your prompt works in the official pipeline?
I added the ICL model in exl2 quants (3, 4, 8 bpw): https://huggingface.co/Ftfyhh/YuE-s1-7B-anneal-en-icl-exl2
Hey guys, exllamav2 seems to be working well, saving memory and improving speed. It loses some musicality, but the output still sounds coherent.
I made a Gradio interface for this fork, a Docker image, and a template for RunPod: https://github.com/alisson-anjos/YuE-exllamav2-UI
@jrked Did you find a solution for this issue?
Nope.
Try adding more sections to lyrics.txt; 3-4 should be OK.
When we use the provided genre.txt, lyrics.txt, and MP3, it works, but when we supply a custom genre.txt, lyrics.txt, and MP3, we get this error.
Make sure your lyrics strictly follow the example format, for example:
RTX 4090 (24GB VRAM), 61GB RAM, base BF16 model.
Test 1, Stage 1: cache size 16384, cache mode Q4, 6 segments, 10 minutes 15 seconds
Test 2, Stage 1: cache size 16384, cache mode Q6, 6 segments, 9 minutes 35 seconds
Test 3, Stage 1: cache size 16384, cache mode Q8, 6 segments, 10 minutes 5 seconds
Test 4, Stage 1: cache size 16384, cache mode BF16, 6 segments, 9 minutes 51 seconds
Note: the cache mode probably had an impact on the VRAM used, but I didn't monitor that data. Next I need to quantize the models and test on the quantized versions. Audio files here
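For anyone wondering what those cache modes correspond to under the hood, here is a minimal sketch (my assumption about how the fork wires this up, not its actual code) using exllamav2's quantized cache classes; the model path is a hypothetical local directory:

```python
# Sketch: mapping the "cache mode" settings above onto exllamav2 cache classes.
# Assumes a recent exllamav2 release; the model path is hypothetical.
from exllamav2 import (
    ExLlamaV2, ExLlamaV2Config,
    ExLlamaV2Cache, ExLlamaV2Cache_Q4, ExLlamaV2Cache_Q6, ExLlamaV2Cache_Q8,
)

config = ExLlamaV2Config("models/YuE-s1-7B-anneal-en-cot")  # hypothetical path
model = ExLlamaV2(config)

cache_classes = {
    "BF16": ExLlamaV2Cache,    # full-precision k/v cache
    "Q4": ExLlamaV2Cache_Q4,   # 4-bit quantized k/v cache
    "Q6": ExLlamaV2Cache_Q6,
    "Q8": ExLlamaV2Cache_Q8,
}

# Cache size 16384, as in the tests above; lazy=True defers allocation
# so the model can be auto-split across available VRAM.
cache = cache_classes["Q4"](model, max_seq_len=16384, lazy=True)
model.load_autosplit(cache)
```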
Check and make sure that your custom files are formatted the same way as the examples. The way the script processes the prompt is... very particular. If it helps, there is a rough sanity check sketched below.
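The following is my own sketch, not part of the repo; it assumes the example lyrics format with bracketed section labels like [verse]/[chorus] and blank lines between sections:

```python
# Sketch: sanity-check a custom lyrics.txt before running inference.
# Assumes sections labeled like [verse] / [chorus], separated by blank lines,
# each with at least one lyric line under the label.
import re

def check_lyrics(path="lyrics.txt"):
    text = open(path, encoding="utf-8").read()
    segments = [s.strip() for s in re.split(r"\n\s*\n", text) if s.strip()]
    print(f"Found {len(segments)} segment(s)")
    for i, seg in enumerate(segments, 1):
        lines = seg.splitlines()
        label_ok = bool(re.match(r"\[\w+.*\]", lines[0].strip()))
        body_lines = [l for l in lines[1:] if l.strip()]
        print(f"  segment {i}: label ok = {label_ok}, lyric lines = {len(body_lines)}")
        if not label_ok or not body_lines:
            print("    -> this segment may be dropped by preprocessing")

check_lyrics()
```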
Could someone write up a practical, step-by-step guide for idiots (read: me)?
Thanks for any idiot-proof guide.
When using the exact same parameters, the results generated by this method are inconsistent with those generated by the official method. The vocal gender and music style differ, and sometimes the quality is worse than the official method's. Why is this happening? I use FP16.
I can't get either to finish running at all yet, but if I were you, it might be good to systematically report your sample size and testing parameters (top-p, temperature, the same random seed in both repos) and compare more directly. Unless you have a large sample size, I suspect the top-p, temperature, and random seed may be the culprit (i.e., just a large standard deviation in quality).
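For an apples-to-apples comparison, a small seeding helper could look like the sketch below. This is my own sketch; I don't know whether infer.py exposes a seed flag, so it assumes you can call this in the script before generation:

```python
# Sketch: fix the RNG state so both pipelines see the same seed.
# Call once before generation in each repo you are comparing.
import random
import numpy as np
import torch

def seed_everything(seed: int = 42):
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

seed_everything(42)
```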
YuE-exllamav2 uses exllamav2 as the inference backend for stage1 and stage2. Exllamav2 can load the Hugging Face bf16 checkpoint directly, and we see a 2x speedup in stage1 and a 10x speedup in stage2. Additionally, exllamav2 allows easy quantization of the model, and we find that at 4.25 bpw the stage1 models still work well, enabling song generation with a minimum of 8GB VRAM.
Usage is the same:
python src/yue/infer.py --stage1_use_exl2 --stage2_use_exl2 --stage2_cache_size 32768 [original args]
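For the quantized route mentioned above (4.25 bpw), exllamav2's standard convert script should be usable; a sketch, assuming a local checkout of the exllamav2 repo and hypothetical paths:

```
# Sketch: quantize the stage1 model to 4.25 bpw with exllamav2's converter.
# Paths are hypothetical; -o is scratch/working space, -cf is the output folder.
python convert.py \
    -i models/YuE-s1-7B-anneal-en-cot \
    -o /tmp/exl2-work \
    -cf models/YuE-s1-7B-anneal-en-cot-exl2-4.25bpw \
    -b 4.25
```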