Streaming audio generation? #44
Comments
I'm working on the streaming feature. It's not very straightforward due to the audio encoder model, but I hope to push an update soon that adds this feature, for GGUF models first.
Can't wait to try it out! 😊 I checked out this PR and it seems to have a good implementation for EXL2. Can we extend it to llama_cpp?
Yes, all the backends will have streaming support, that's the plan. It will probably be released first for llama.cpp, as mentioned. The implementation I've developed is quite different from the PR, in order to handle audio streaming better. I'll get back to that PR for the EXL2 implementation when this is ready.
+1 for streaming audio.
Well, I am also trying to implement it on my own if possible. Stay tuned. Either @edwko or I will update soon 😊
Doing GGUF was pretty easy:

```python
# Excerpt from the GGUF backend module; relies on the module's existing
# imports (typing.Generator, llama_cpp's Llama and llama_token_is_eog,
# and the repo's GenerationConfig / _GGUF_AVAILABLE).
class GGUFModel:
    def __init__(
        self,
        model_path: str,
        n_gpu_layers: int = 0,
        max_seq_length: int = 4096,
        additional_model_config: dict = {}
    ) -> None:
        if not _GGUF_AVAILABLE:
            raise ImportError(
                "llama_cpp python module not found. "
                "To use the GGUF model you must install llama cpp python manually."
            )
        additional_model_config["n_ctx"] = max_seq_length
        self.model = Llama(
            model_path=model_path,
            n_gpu_layers=n_gpu_layers,
            **additional_model_config
        )

    def generate(self, input_ids: list[int], config: GenerationConfig) -> list[int]:
        tokens = []
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            tokens.append(token)
            if (llama_token_is_eog(self.model._model.model, token) or
                    len(tokens) >= config.max_length):
                break
        return tokens

    def generate_stream(self, input_ids: list[int], config: GenerationConfig) -> Generator[int, None, None]:
        generated = 0
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            yield token
            generated += 1
            # Stop on an end-of-generation token, or once the budget of generated
            # tokens is used up (mirrors the length check in generate above).
            if (llama_token_is_eog(self.model._model.model, token) or
                    generated >= config.max_length):
                break
```

and edit `InterfaceGGUF`:

```python
# Excerpt from the interface module; relies on the module's existing imports
# (torch, logger, AudioCodec, PromptProcessor, ModelOutput, GGUFModelConfig, etc.).
class InterfaceGGUF(InterfaceHF):
    def __init__(
        self,
        config: GGUFModelConfig
    ) -> None:
        self.device = torch.device(
            config.device if config.device is not None
            else "cuda" if torch.cuda.is_available()
            else "cpu"
        )
        self.config = config
        self._device = config.device
        self.languages = config.languages
        self.language = config.language
        self.verbose = config.verbose
        self.audio_codec = AudioCodec(self.device, config.wavtokenizer_model_path)
        self.prompt_processor = PromptProcessor(config.tokenizer_path, self.languages)
        self.model = GGUFModel(
            model_path=config.model_path,
            n_gpu_layers=config.n_gpu_layers,
            max_seq_length=config.max_seq_length,
            additional_model_config=config.additional_model_config
        )

    def prepare_prompt(self, text: str, speaker: dict = None):
        prompt = self.prompt_processor.get_completion_prompt(text, self.language, speaker)
        return self.prompt_processor.tokenizer.encode(prompt, add_special_tokens=False)

    def generate(
        self,
        text: str,
        speaker: dict = None,
        temperature: float = 0.1,
        repetition_penalty: float = 1.1,
        max_length = 4096,
        additional_gen_config = {},
    ) -> ModelOutput:
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Generating audio...")
        self.check_generation_max_length(max_length)
        output = self.model.generate(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        )
        audio = self.get_audio(output)
        if self.verbose:
            logger.info("Audio generation completed")
        return ModelOutput(audio, self.audio_codec.sr)

    def generate_stream(
        self,
        text: str,
        speaker: dict = None,
        temperature: float = 0.1,
        repetition_penalty: float = 1.1,
        max_length = 4096,
        chunk_size = 50,
        additional_gen_config = {},
    ) -> Generator[ModelOutput, None, None]:
        """
        Generate audio tokens in a streaming manner.

        :param text: Input text to generate audio for
        :param speaker: Optional speaker information
        :param temperature: Sampling temperature
        :param repetition_penalty: Penalty for token repetition
        :param max_length: Maximum number of tokens to generate
        :param chunk_size: Number of tokens to generate per chunk
        :param additional_gen_config: Additional generation configurations
        :yield: Incremental ModelOutput with audio chunks
        """
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Streaming audio generation...")
        self.check_generation_max_length(max_length)

        # Track tokens for progressive audio generation
        generated_tokens = []

        # Stream generation
        for token in self.model.generate_stream(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        ):
            generated_tokens.append(token)
            # Periodically convert the accumulated tokens to an audio chunk.
            # Adjust chunk_size to trade latency against per-chunk quality.
            if len(generated_tokens) % chunk_size == 0:
                try:
                    audio_chunk = self.get_audio(generated_tokens)
                    yield ModelOutput(audio_chunk, self.audio_codec.sr)
                    generated_tokens = []
                except Exception as e:
                    if self.verbose:
                        logger.warning(f"Error generating audio chunk: {e}")

        # Final audio chunk from any remaining tokens
        if generated_tokens:
            final_audio = self.get_audio(generated_tokens)
            yield ModelOutput(final_audio, self.audio_codec.sr)

        if self.verbose:
            logger.info("Streaming audio generation completed")
```
opened up #46 for this
@jadams777 added WIP support for audio streaming with GGUF models, check out the example: #46 (comment)
I know that this model is llama based, and I would like to see if it is capable of doing streaming TTS. All those llama models are able to generate text as a stream, right? So allowing this model to do streaming TTS would be a great idea.
Could we do something like obtaining partial tensors from the model, converting them to audio tokens, and then joining the partial audio to make the full audio?
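This chunked decode-and-join approach is essentially what the `generate_stream` patch above does: decode every `chunk_size` generated audio tokens with the codec and concatenate the resulting waveforms. A rough, backend-agnostic sketch of the idea (the `token_stream` and `decode_tokens` arguments are placeholders, not actual OuteTTS APIs):

```python
from typing import Callable, Iterable, List

import torch


def stream_and_join(
    token_stream: Iterable[int],
    decode_tokens: Callable[[List[int]], torch.Tensor],
    chunk_size: int = 50,
) -> torch.Tensor:
    """Decode partial audio-token chunks as they arrive and join the waveforms.

    Both arguments are placeholders: `token_stream` stands in for the model's
    streamed audio tokens, `decode_tokens` for the codec step that turns a
    list of audio tokens into a waveform tensor.
    """
    pending: List[int] = []
    pieces: List[torch.Tensor] = []
    for token in token_stream:
        pending.append(token)
        if len(pending) >= chunk_size:
            pieces.append(decode_tokens(pending))  # partial audio, playable right away
            pending = []
    if pending:  # decode whatever is left at the end of the stream
        pieces.append(decode_tokens(pending))
    return torch.cat(pieces, dim=-1)
```

One caveat: decoding chunks independently can leave audible seams at chunk boundaries, which is presumably part of why the maintainer mentions a different implementation that handles audio streaming better.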