Feature Request: YuE (music gen) #11467
Comments
Please see also:
I think the missing part isn't the audio token generation (it's just a llama model) but the xcodec decoder.
So it is just a matter of rewriting xcodec in C/C++?
Hmm, it uses torch.
xcodec is the important part for this and Llasa. For KoboldCpp and other downstream projects that would probably be enough. For llama.cpp it would also need an example app similar to OuteTTS so people can test/use YuE.
For YuE we are actually using xcodec 1, NOT 2, as xcodec2 is speech-only. Here's the code we use. How should I assist?

```shell
git lfs install
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
```
Refer to the OuteTTS PR in https://github.com/ggerganov/llama.cpp/pull/10784/files, specifically https://github.com/ggerganov/llama.cpp/blob/c0df192838f51507e06b7293030b43232cd2670f/src/llama.cpp#L9497 and https://github.com/ggerganov/llama.cpp/blob/c0df192838f51507e06b7293030b43232cd2670f/src/llama.cpp#L17176. That is the GGML implementation of the decoder for WavTokenizer. I would imagine a similar implementation would have to be created for xcodec. The LLM part is already done, since llama.cpp supports the llama architecture, so you should already be able to run inference and generate the output tokens.
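For orientation, the general shape of a codec decoder like the WavTokenizer one referenced above (token ids → codebook embeddings → upsampling stack → waveform) can be sketched in plain Python/NumPy. Everything here is a toy stand-in with made-up sizes, not xcodec's real architecture; the actual port would build an equivalent GGML compute graph in C:

```python
import numpy as np

# Hypothetical sizes; real xcodec/WavTokenizer configs differ.
VOCAB = 1024      # codec codebook size
DIM = 64          # latent channel dimension
UPSAMPLE = 320    # audio samples produced per codec token

rng = np.random.default_rng(0)
codebook = rng.standard_normal((VOCAB, DIM))        # token id -> latent vector
proj = rng.standard_normal((DIM, UPSAMPLE)) * 0.01  # stand-in for the ConvTranspose1d stack

def decode(tokens: np.ndarray) -> np.ndarray:
    """Toy decoder: codec token ids -> mono PCM waveform."""
    latents = codebook[tokens]          # (T, DIM) embedding lookup
    audio = latents @ proj              # (T, UPSAMPLE); real models use conv upsampling
    return np.tanh(audio).reshape(-1)   # flatten to PCM samples in (-1, 1)

wave = decode(np.array([1, 2, 3]))
print(wave.shape)  # (960,) = 3 tokens * 320 samples each
```

The GGML version would express the same lookup and upsampling as graph ops (`ggml_get_rows`, convolutions) rather than NumPy matmuls.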
Semi-related (?): an exllama version
Prerequisites
Feature Description
YuE would work similarly to the OuteTTS implementation, where an LLM (in this case two separate llama models) is involved in generating the audio. YuE does not appear to use WavTokenizer. Instead of just speech, YuE is capable of music and sung vocals.
A demo page with links to all the relevant code and models can be found here: https://map-yue.github.io/
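The two-stage flow described above (a first llama model turns the text prompt into coarse audio tokens, a second model refines them, and a codec decoder renders audio) could be wired up roughly as follows. Every function here is a hypothetical stub for illustration, not YuE's real API:

```python
from typing import List

def stage1_llm(lyrics: str, genre: str) -> List[int]:
    """Stub for the first llama model: text prompt -> coarse audio tokens."""
    return [(len(lyrics) * 31 + len(genre) * 7 + i * 13) % 1024 for i in range(8)]

def stage2_llm(coarse: List[int]) -> List[int]:
    """Stub for the second llama model: coarse tokens -> refined codec tokens."""
    return [(t * 7 + 3) % 1024 for t in coarse]

def xcodec_decode(tokens: List[int]) -> List[float]:
    """Stub for the xcodec decoder: codec tokens -> PCM samples in [-1, 1]."""
    return [((t % 200) - 100) / 100.0 for t in tokens for _ in range(4)]

coarse = stage1_llm("city lights at midnight", "synthpop")
refined = stage2_llm(coarse)
pcm = xcodec_decode(refined)
print(len(pcm))  # 32 = 8 tokens * 4 samples each
```

In llama.cpp terms, the two LLM stages map onto existing llama inference, and only the `xcodec_decode` step needs a new GGML implementation.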
Motivation
For end users: music generation is a use case currently missing from the llama.cpp ecosystem. Users could leverage quantized versions of the LLM to generate songs on their own or rented hardware, and because the model is capable of singing it is more flexible than established non-LLM audio models.
For developers: I think this is an interesting next step in llama.cpp's TTS experiments, since this is also LLM-based. We first saw how language models running in llama.cpp could be paired with WavTokenizer to produce audible speech. This would rely on the same existing llama infrastructure, paired with new implementations on the music audio side. It seems similar to Llasa-3B, since both use xcodec, so the implementation may be sharable between the two models.
Possible Implementation
The llama models should be able to leverage the existing llama implementation; for the audio side, this open-source project can be used as a reference for the audio parts: https://github.com/multimodal-art-projection/YuE (paper is pending).
This seems to require xcodec, which, if compatible, would also be progress toward support for Llasa-3B.