
Streaming audio generation? #44

Open
Meshwa428 opened this issue Dec 7, 2024 · 8 comments

Comments

@Meshwa428

I know this model is Llama-based, and I'd like to see whether it's capable of streaming TTS. Llama models can generate text as a stream, right? So it would be great if this model could stream audio the same way.
Could we obtain partial tensors from the model, convert them to audio tokens, and then join the partial audio into the full audio?
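
Roughly what I have in mind, as a sketch (token_stream and decode_fn below are placeholders for the model's streamed token ids and whatever converts audio tokens into a waveform; they aren't anything that exists in this repo):

import torch

def stream_tts(token_stream, decode_fn, chunk_size=50):
    """Accumulate streamed audio tokens, decode every chunk_size tokens, then join."""
    pending, chunks = [], []
    for token in token_stream:                  # any iterator yielding audio-token ids
        pending.append(token)
        if len(pending) >= chunk_size:
            chunks.append(decode_fn(pending))   # list[int] -> 1D waveform tensor
            pending = []
    if pending:                                 # flush whatever is left at the end
        chunks.append(decode_fn(pending))
    return torch.cat(chunks, dim=-1)            # join the partial audio into full audio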

@edwko
Owner

edwko commented Dec 7, 2024

I'm working on the streaming feature. It's not entirely straightforward because of the audio encoder model, but I hope to push an update soon that adds it for GGUF models first.

@Meshwa428
Author

Meshwa428 commented Dec 7, 2024

Can't wait to try it out! 😊 I checked out this PR, and it seems to have a good implementation for EXL2. Can we extend it to llama_cpp?

@edwko
Owner

edwko commented Dec 7, 2024

Yes, the plan is for all backends to have streaming support. It will probably be released for llama.cpp first, as mentioned. The implementation I've developed is quite different from the PR, in order to handle audio streaming better. I'll get back to that PR for the EXL2 implementation once this is ready.

@jadams777

+1 for streaming audio.

@Meshwa428
Author

Well, I'm also trying to implement it on my own if possible. Stay tuned; either @edwko or I will update soon 😊

@Meshwa428
Author

Meshwa428 commented Dec 8, 2024

Doing this for GGUF was pretty easy.
Edit model.py with the code below (it assumes the imports already present in model.py, e.g. Llama, llama_token_is_eog, GenerationConfig, and typing.Generator):

class GGUFModel:
    def __init__(
            self,
            model_path: str,
            n_gpu_layers: int = 0,
            max_seq_length: int = 4096,
            additional_model_config: dict = {}
    ) -> None:
        
        if not _GGUF_AVAILABLE:
            raise ImportError(
                "llama_cpp python module not found. "
                "To use the GGUF model you must install llama-cpp-python manually."
            )

        additional_model_config["n_ctx"] = max_seq_length
        self.model = Llama(
            model_path=model_path,
            n_gpu_layers=n_gpu_layers,
            **additional_model_config
        )
    
    def generate(self, input_ids: list[int], config: GenerationConfig) -> list[int]:
        tokens = []
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            tokens.append(token)
            if (llama_token_is_eog(self.model._model.model, token) or 
                len(tokens) >= config.max_length):
                break

        return tokens
    
    def generate_stream(self, input_ids: list[int], config: GenerationConfig) -> Generator[int, None, None]:
        # Count generated tokens so max_length bounds the output, not the prompt length
        num_generated = 0
        for token in self.model.generate(
            input_ids,
            temp=config.temperature,
            repeat_penalty=config.repetition_penalty,
            **config.additional_gen_config,
        ):
            yield token
            num_generated += 1

            if (llama_token_is_eog(self.model._model.model, token) or 
                num_generated >= config.max_length):
                break

Then edit interface.py with the code below:

class InterfaceGGUF(InterfaceHF):
    def __init__(
        self,
        config: GGUFModelConfig
    ) -> None:
        self.device = torch.device(
            config.device if config.device is not None
            else "cuda" if torch.cuda.is_available()
            else "cpu"
        )
        self.config = config
        self._device = config.device
        self.languages = config.languages
        self.language = config.language
        self.verbose = config.verbose

        self.audio_codec = AudioCodec(self.device, config.wavtokenizer_model_path)
        self.prompt_processor = PromptProcessor(config.tokenizer_path, self.languages)
        self.model = GGUFModel(
            model_path=config.model_path,
            n_gpu_layers=config.n_gpu_layers,
            max_seq_length=config.max_seq_length,
            additional_model_config=config.additional_model_config
        )

    def prepare_prompt(self, text: str, speaker: dict = None):
        prompt = self.prompt_processor.get_completion_prompt(text, self.language, speaker)
        return self.prompt_processor.tokenizer.encode(prompt, add_special_tokens=False)

    def generate(
            self, 
            text: str, 
            speaker: dict = None, 
            temperature: float = 0.1, 
            repetition_penalty: float = 1.1,
            max_length = 4096,
            additional_gen_config = {},
        ) -> ModelOutput:
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Generating audio...")
        
        self.check_generation_max_length(max_length)
        
        output = self.model.generate(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        )
        audio = self.get_audio(output)
        if self.verbose:
            logger.info("Audio generation completed")

        return ModelOutput(audio, self.audio_codec.sr)
    
    def generate_stream(
            self, 
            text: str, 
            speaker: dict = None, 
            temperature: float = 0.1, 
            repetition_penalty: float = 1.1,
            max_length = 4096,
            chunk_size = 50,
            additional_gen_config = {},
    ) -> Generator[ModelOutput, None, None]:
        """
        Generate audio tokens in a streaming manner.
        
        :param text: Input text to generate audio for
        :param speaker: Optional speaker information
        :param temperature: Sampling temperature
        :param repetition_penalty: Penalty for token repetition
        :param max_length: Maximum number of tokens to generate
        :param chunk_size: Number of tokens to generate per chunk
        :param additional_gen_config: Additional generation configurations
        :yield: Incremental ModelOutput with audio chunks
        """
        input_ids = self.prepare_prompt(text, speaker)
        if self.verbose:
            logger.info(f"Input tokens: {len(input_ids)}")
            logger.info("Streaming audio generation...")
        
        self.check_generation_max_length(max_length)
        
        # Track tokens for progressive audio generation
        generated_tokens = []
        
        # Stream generation
        for token in self.model.generate_stream(
            input_ids=input_ids,
            config=GenerationConfig(
                temperature=temperature,
                max_length=max_length,
                repetition_penalty=repetition_penalty,
                additional_gen_config=additional_gen_config,
            )
        ):
            generated_tokens.append(token)
            
            # Periodically decode the accumulated tokens into an audio chunk.
            # Adjust chunk_size to trade lower latency against more chunk boundaries.
            if len(generated_tokens) % chunk_size == 0:
                try:
                    audio_chunk = self.get_audio(generated_tokens)
                    yield ModelOutput(audio_chunk, self.audio_codec.sr)
                    generated_tokens = []
                except Exception as e:
                    # Keep the tokens and retry at the next chunk boundary
                    if self.verbose:
                        logger.warning(f"Error generating audio chunk: {e}")
        
        # Final audio chunk
        if generated_tokens:
            final_audio = self.get_audio(generated_tokens)
            yield ModelOutput(final_audio, self.audio_codec.sr)
        
        if self.verbose:
            logger.info("Streaming audio generation completed")

@Meshwa428
Author

Opened #46 for this.

@edwko
Owner

edwko commented Dec 9, 2024

@jadams777 Added WIP support for audio streaming with GGUF models; check out the example: #46 (comment)
