Streaming audio generation? #44
Comments
I'm working on the streaming feature. It's not very straightforward due to the audio encoder model, but I hope to push an update soon that adds this feature, for GGUF models first.
Can't wait to try it out! 😊 I checked out this PR and it seems to have a good implementation for EXL2. Can we extend it to llama_cpp?
Yes, all the backends will have streaming support, that's the plan. It will probably be released first for llama.cpp, as mentioned. The implementation I've developed is quite different from the PR, in order to handle audio streaming better. I'll get back to that PR for the EXL2 implementation when this is ready.
+1 for streaming audio.
Well, I am also trying to implement it on my own if possible. Stay tuned. Either @edwko or I will update soon 😊
Doing GGUF was pretty easy:

```python
# Excerpt from the GGUF backend module; relies on the module's existing
# imports (typing.Generator, llama_cpp's Llama and llama_token_is_eog,
# and the repo's GenerationConfig / _GGUF_AVAILABLE).
class GGUFModel:
    def __init__(
        self,
        model_path: str,
        n_gpu_layers: int = 0,
        max_seq_length: int = 4096,
        additional_model_config: dict = {}
    ) -> None:
        if not _GGUF_AVAILABLE:
            raise ImportError(
                "llama_cpp python module not found. "
                "To use the GGUF model you must install llama cpp python manually."
            )
        additional_model_config["n_ctx"] = max_seq_length
        self.model = Llama(
            model_path=model_path,
            n_gpu_layers=n_gpu_layers,
            **additional_model_config
        )

    def generate(self, input_ids: list[int], config: GenerationConfig) -> list[int]:
        tokens = []
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            tokens.append(token)
            if (llama_token_is_eog(self.model._model.model, token) or
                    len(tokens) >= config.max_length):
                break
        return tokens

    def generate_stream(self, input_ids: list[int], config: GenerationConfig) -> Generator[int, None, None]:
        generated = 0
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            yield token
            generated += 1
            # Stop on an end-of-generation token, or once the budget of generated
            # tokens is used up (mirrors the length check in generate above).
            if (llama_token_is_eog(self.model._model.model, token) or
                    generated >= config.max_length):
                break
```

and edit `InterfaceGGUF`:

```python
# Excerpt from the interface module; relies on the module's existing imports
# (torch, logger, AudioCodec, PromptProcessor, ModelOutput, GGUFModelConfig, etc.).
class InterfaceGGUF(InterfaceHF):
    def __init__(
        self,
        config: GGUFModelConfig
    ) -> None:
        self.device = torch.device(
            config.device if config.device is not None
            else "cuda" if torch.cuda.is_available()
            else "cpu"
        )
        self.config = config
        self._device = config.device
        self.languages = config.languages
        self.language = config.language
        self.verbose = config.verbose
        self.audio_codec = AudioCodec(self.device, config.wavtokenizer_model_path)
        self.prompt_processor = PromptProcessor(config.tokenizer_path, self.languages)
        self.model = GGUFModel(
            model_path=config.model_path,
            n_gpu_layers=config.n_gpu_layers,
            max_seq_length=config.max_seq_length,
            additional_model_config=config.additional_model_config
        )

    def prepare_prompt(self, text: str, speaker: dict = None):
        prompt = self.prompt_processor.get_completion_prompt(text, self.language, speaker)
        return self.prompt_processor.tokenizer.encode(prompt, add_special_tokens=False)

    def generate(
        self,
        text: str,
        speaker: dict = None,
        temperature: float = 0.1,
        repetition_penalty: float = 1.1,
        max_length = 4096,
        additional_gen_config = {},
    ) -> ModelOutput:
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Generating audio...")
        self.check_generation_max_length(max_length)
        output = self.model.generate(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        )
        audio = self.get_audio(output)
        if self.verbose:
            logger.info("Audio generation completed")
        return ModelOutput(audio, self.audio_codec.sr)

    def generate_stream(
        self,
        text: str,
        speaker: dict = None,
        temperature: float = 0.1,
        repetition_penalty: float = 1.1,
        max_length = 4096,
        chunk_size = 50,
        additional_gen_config = {},
    ) -> Generator[ModelOutput, None, None]:
        """
        Generate audio tokens in a streaming manner.

        :param text: Input text to generate audio for
        :param speaker: Optional speaker information
        :param temperature: Sampling temperature
        :param repetition_penalty: Penalty for token repetition
        :param max_length: Maximum number of tokens to generate
        :param chunk_size: Number of tokens to generate per chunk
        :param additional_gen_config: Additional generation configurations
        :yield: Incremental ModelOutput with audio chunks
        """
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Streaming audio generation...")
        self.check_generation_max_length(max_length)

        # Track tokens for progressive audio generation
        generated_tokens = []

        # Stream generation
        for token in self.model.generate_stream(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        ):
            generated_tokens.append(token)
            # Periodically convert the accumulated tokens to an audio chunk.
            # Adjust chunk_size to trade latency against per-chunk quality.
            if len(generated_tokens) % chunk_size == 0:
                try:
                    audio_chunk = self.get_audio(generated_tokens)
                    yield ModelOutput(audio_chunk, self.audio_codec.sr)
                    generated_tokens = []
                except Exception as e:
                    if self.verbose:
                        logger.warning(f"Error generating audio chunk: {e}")

        # Final audio chunk from any remaining tokens
        if generated_tokens:
            final_audio = self.get_audio(generated_tokens)
            yield ModelOutput(final_audio, self.audio_codec.sr)

        if self.verbose:
            logger.info("Streaming audio generation completed")
```
opened up #46 for this
@jadams777 added WIP support for audio streaming with GGUF models, check out the example: #46 (comment)
I know that this model is llama based, and I would like to see if it is capable of doing streaming TTS. All those llama models are able to generate text as a stream, right? So allowing this model to do streaming TTS would be a great idea.
Could we do something like obtaining partial tensors from the model, converting them to audio tokens, and then joining the partial audio to make the full audio?
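This chunked decode-and-join approach is essentially what the `generate_stream` patch above does: decode every `chunk_size` generated audio tokens with the codec and concatenate the resulting waveforms. A rough, backend-agnostic sketch of the idea (the `token_stream` and `decode_tokens` arguments are placeholders, not actual OuteTTS APIs):

```python
from typing import Callable, Iterable, List

import torch


def stream_and_join(
    token_stream: Iterable[int],
    decode_tokens: Callable[[List[int]], torch.Tensor],
    chunk_size: int = 50,
) -> torch.Tensor:
    """Decode partial audio-token chunks as they arrive and join the waveforms.

    Both arguments are placeholders: `token_stream` stands in for the model's
    streamed audio tokens, `decode_tokens` for the codec step that turns a
    list of audio tokens into a waveform tensor.
    """
    pending: List[int] = []
    pieces: List[torch.Tensor] = []
    for token in token_stream:
        pending.append(token)
        if len(pending) >= chunk_size:
            pieces.append(decode_tokens(pending))  # partial audio, playable right away
            pending = []
    if pending:  # decode whatever is left at the end of the stream
        pieces.append(decode_tokens(pending))
    return torch.cat(pieces, dim=-1)
```

One caveat: decoding chunks independently can leave audible seams at chunk boundaries, which is presumably part of why the maintainer mentions a different implementation that handles audio streaming better.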