
Feature Request: YuE (music gen) #11467

henk717 opened this issue Jan 28, 2025 · 9 comments
henk717 commented Jan 28, 2025

Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

YuE would work similarly to the OuteTTS implementation, where an LLM (in this case two separate llama models) is involved in generating the audio. YuE does not appear to use WavTokenizer. Instead of just speech, YuE is capable of music and sung vocals.

A demo page with links to all the relevant code and models can be found here: https://map-yue.github.io/
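For illustration, a rough sketch of that two-stage flow is below. The model ids, prompt layout, generation lengths, and the hand-off between the stages are assumptions on my part (the reference repo has the exact token format), and the final decode is left as a placeholder since that is the missing piece:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# assumed checkpoint names under the m-a-p org; double-check against the demo page
STAGE1 = "m-a-p/YuE-s1-7B-anneal-en-cot"  # lyrics/genre prompt -> coarse audio tokens
STAGE2 = "m-a-p/YuE-s2-1B-general"        # coarse tokens -> full codec tokens

tok = AutoTokenizer.from_pretrained(STAGE1)
s1 = AutoModelForCausalLM.from_pretrained(STAGE1, torch_dtype=torch.float16, device_map="auto")

# the real prompt format uses special segment/audio tokens; this is simplified
prompt = "[Genre] upbeat pop, female vocal\n[Verse]\nSunlight on the water..."
inputs = tok(prompt, return_tensors="pt").to(s1.device)
coarse = s1.generate(**inputs, max_new_tokens=3000, do_sample=True, temperature=1.0)

s2 = AutoModelForCausalLM.from_pretrained(STAGE2, torch_dtype=torch.float16, device_map="auto")
full = s2.generate(coarse.to(s2.device), max_new_tokens=6000)  # simplified hand-off

# The part this issue asks for: xcodec (not WavTokenizer) turns the generated
# codec token ids back into a waveform.
# waveform = decode_with_xcodec(full)  # placeholder for the xcodec_mini_infer decoder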

Motivation

For end users: Music generation is a use case currently missing from the llama.cpp ecosystem. Users could leverage quantized versions of the LLM to generate songs on their own or rented hardware, and this model is capable of singing, which makes it more flexible than established non-LLM audio models.
For developers: I think this is an interesting next step in llama.cpp's TTS experiments, since this is also LLM based. We first saw how language models running in llama.cpp could be paired with WavTokenizer to produce audible speech. This would rely on the same existing llama infrastructure, paired with new implementations on the music audio side. It seems similar to Llasa-3B, since both use xcodec, so the implementation may be sharable between the two models.

Possible Implementation

The llama models should be able to leverage the existing llama implementation; for the audio side, this open-source project can be used as a reference for the audio parts: https://github.com/multimodal-art-projection/YuE (paper is pending).

This seems to require xcodec, which, if compatible, would also be progress towards support for Llasa-3B.

henk717 added the enhancement (New feature or request) label Jan 28, 2025
@bennmann

Please see also:

multimodal-art-projection/YuE#2

@LostRuins
Collaborator

I think the missing part isn't the audio token generation (it's just a llama model) but the xcodec decoder.

@Kreijstal

So is it just a matter of writing xcodec in C/C++?


@henk717
Author

henk717 commented Jan 31, 2025

xcodec is the important part for this and for Llasa. For KoboldCpp and other downstream projects, that would probably be enough. For llama.cpp, it would also need an example app similar to OuteTTS so people can test/use YuE.

@Kreijstal

https://github.com/zhenye234/X-Codec-2.0/blob/main/vq/codec_decoder.py

@a43992899

a43992899 commented Feb 1, 2025

For YuE we are actually using xcodec 1, NOT 2, as Xcodec2 is speech-only.

Here's the code we use. How should I assist?

git lfs install
git clone https://huggingface.co/m-a-p/xcodec_mini_infer
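
For anyone looking at what a port would involve, the decoder half has roughly this shape (layer names and sizes below are made up for illustration, not the actual xcodec architecture): a codebook lookup followed by an upsampling convolutional stack that maps codec frames back to samples.

import torch
import torch.nn as nn

class ToyCodecDecoder(nn.Module):
    def __init__(self, codebook_size=1024, dim=512):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)  # VQ code id -> latent vector
        self.upsample = nn.Sequential(                    # latent frames -> waveform
            nn.ConvTranspose1d(dim, 256, kernel_size=16, stride=8, padding=4),
            nn.GELU(),
            nn.ConvTranspose1d(256, 128, kernel_size=16, stride=8, padding=4),
            nn.GELU(),
            nn.Conv1d(128, 1, kernel_size=7, padding=3),
            nn.Tanh(),
        )

    def forward(self, codes):                     # codes: (batch, frames)
        x = self.codebook(codes).transpose(1, 2)  # (batch, dim, frames)
        return self.upsample(x)                   # (batch, 1, samples)

print(ToyCodecDecoder()(torch.randint(0, 1024, (1, 75))).shape)  # -> torch.Size([1, 1, 4800])

The lookup and the (transposed) convolutions are the kinds of ops GGML already provides, so the WavTokenizer decoder referenced in the next comment looks like a reasonable template.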

@LostRuins
Collaborator

> For YuE we are actually using xcodec 1, NOT 2, as Xcodec2 is speech-only.
>
> Here's the code we use. How should I assist?
>
> git lfs install
> git clone https://huggingface.co/m-a-p/xcodec_mini_infer

Refer to the OuteTTS PR at https://github.com/ggerganov/llama.cpp/pull/10784/files, specifically https://github.com/ggerganov/llama.cpp/blob/c0df192838f51507e06b7293030b43232cd2670f/src/llama.cpp#L9497 and https://github.com/ggerganov/llama.cpp/blob/c0df192838f51507e06b7293030b43232cd2670f/src/llama.cpp#L17176.

That is the GGML implementation of the WavTokenizer decoder. I would imagine a similar implementation would have to be created for xcodec. The LLM part is already done, since llama.cpp supports the llama format, and you should already be able to run inference and generate the output tokens.
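
To make the hand-off concrete, here is a toy illustration of the glue between the two halves (the offset and token layout are assumptions, not YuE's actual vocabulary): the LLM emits audio tokens as ordinary vocabulary entries, and an offset is subtracted to recover codec ids before they are passed to the (to-be-written) xcodec decoder.

AUDIO_TOKEN_OFFSET = 32000  # hypothetical id of the first audio token in the vocab

def vocab_to_codec_ids(token_ids):
    # keep only audio tokens and map them back to raw codec ids
    return [t - AUDIO_TOKEN_OFFSET for t in token_ids if t >= AUDIO_TOKEN_OFFSET]

print(vocab_to_codec_ids([1, 32005, 32100, 2]))  # -> [5, 100]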

@bennmann

bennmann commented Feb 2, 2025

Semi-related (?): exllama version

multimodal-art-projection/YuE#44
