Streaming support #41
base: main
Conversation
Current commits are completely untested, hence the draft status. |
Okay, this works with the following script:

import outetts
import sounddevice as sd
import threading
import time

model_config = outetts.EXL2ModelConfig_v1(
    model_path="OuteTTSexl2",
    language="en",
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
speaker = interface.load_default_speaker(name="male_1")

audio_queue = []

def player():
    global audio_queue
    while True:
        time.sleep(0.01)
        while audio_queue != []:
            if isinstance(audio_queue[0], bool):
                return
            sd.play(audio_queue[0], samplerate=24000)
            sd.wait()
            audio_queue.pop(0)

threading.Thread(target=player).start()

print("running interface")
for i in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    audio_queue.append(i.audio.cpu().numpy().squeeze())
audio_queue.append(False)

Currently it's only implemented in EXL2. Except for |
Also, this doesn't include any fading or other way to reduce chunk clipping noise (yet) |
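For illustration, a minimal sketch of a per-chunk edge fade that could be applied before playback (assuming each chunk is a 1-D NumPy float array at 24 kHz; edge_fade and fade_ms are hypothetical names, not part of the library):

import numpy as np

def edge_fade(chunk: np.ndarray, sr: int = 24000, fade_ms: float = 5.0) -> np.ndarray:
    # Apply a short linear fade-in and fade-out to soften chunk boundaries.
    n = min(len(chunk) // 2, int(sr * fade_ms / 1000))
    if n == 0:
        return chunk
    out = chunk.astype(np.float32, copy=True)
    ramp = np.linspace(0.0, 1.0, n)
    out[:n] *= ramp
    out[-n:] *= ramp[::-1]
    return out

Something like audio_queue.append(edge_fade(i.audio.cpu().numpy().squeeze())) would slot into the loop above.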
You should look for the code_end token:

def generate_stream(
    self,
    text: str,
    speaker: dict = None,
    temperature: float = 0.1,
    repetition_penalty: float = 1.1,
    max_length=4096,
    additional_gen_config={},
    additional_dynamic_generator_config={},
    chunk_size: int = 8,
):
    if chunk_size < 1:
        raise ValueError("Chunk size should be 1 or more")

    code_end_token = self.prompt_processor.tokenizer.encode(
        self.prompt_processor.special_tokens["code_end"], add_special_tokens=False
    )[0]
    logger.info(f"Code end token: {code_end_token}")

    # you can use .cpu() instead of .to("cpu")
    input_ids = self.prepare_prompt(text, speaker).cpu()
    if self.verbose:
        logger.info(f"Input tokens: {len(input_ids)}")
        logger.info("Generating audio...")

    self.check_generation_max_length(max_length)

    audio_buffer = []
    token_buffer = []
    for piece in self.model.generate_stream(
        input_ids=input_ids,
        config=GenerationConfig(
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            max_length=max_length,
            additional_gen_config=additional_gen_config,
        ),
        additional_dynamic_generator_config=additional_dynamic_generator_config,
    ):
        token_buffer.append(piece)
        if piece == code_end_token:
            audio_buffer.append(token_buffer)
            token_buffer = []
            if len(audio_buffer) == chunk_size:
                output = ModelOutput(
                    self.get_audio([item for sublist in audio_buffer for item in sublist]),
                    self.audio_codec.sr,
                )
                audio_buffer = []
                yield output
    if audio_buffer:
        output = ModelOutput(
            self.get_audio([item for sublist in audio_buffer for item in sublist]),
            self.audio_codec.sr,
        )
        yield output

As mentioned before, we should use an audio queue for better handling of audio playback. Also, the output can handle playing directly without needing a separate function:

import outetts
import threading
import queue

model_config = outetts.EXL2ModelConfig_v1(
    model_path="",
    language="en",
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
speaker = interface.load_default_speaker(name="male_1")

audio_queue = queue.Queue()

def audio_player():
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            # No more audio chunks to process
            break
        chunk.play()
        audio_queue.task_done()

audio_thread = threading.Thread(target=audio_player)
audio_thread.start()

for chunk in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    print(chunk)
    audio_queue.put(chunk)

# Signal that audio generation is complete
audio_queue.put(None)
audio_thread.join() |
You should also look into the EXL2 model. For some reason, it fails to generate properly. It keeps generating indefinitely and does not output the |
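If generation never terminates, one thing worth checking (a hedged debugging suggestion, related to the tokenizer concern raised further down) is how the relevant special token actually encodes; if it splits into several ids, or into a different id than the one the streaming loop compares against, the end of a code group will never be detected:

# Hypothetical debugging snippet, reusing attributes shown in the generate_stream sketch above.
tokenizer = interface.prompt_processor.tokenizer
end_ids = tokenizer.encode(
    interface.prompt_processor.special_tokens["code_end"],
    add_special_tokens=False,
)
print("code_end encodes to:", end_ids)  # expected: exactly one token id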
That works. I assumed decodable tokens would be in a similar place to code_end, and this would also leave open the possibility of yielding the associated text along with each audio chunk.
I wasn't confident implementing this, thanks for the example
Will do |
We should avoid decoding each time; this approach is better and faster:

for piece in self.model.generate_stream(
    input_ids=input_ids,
    config=GenerationConfig(
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_length=max_length,
        additional_gen_config=additional_gen_config,
    ),
    additional_dynamic_generator_config=additional_dynamic_generator_config,
):
    token_buffer.append(piece)
    if piece == code_end_token:
        audio_buffer.append(token_buffer)
        token_buffer = []
        if len(audio_buffer) == chunk_size:
            output = ModelOutput(
                self.get_audio([item for sublist in audio_buffer for item in sublist]),
                self.audio_codec.sr,
            )
            audio_buffer = []
            yield output
if audio_buffer:
    output = ModelOutput(
        self.get_audio([item for sublist in audio_buffer for item in sublist]),
        self.audio_codec.sr,
    )
    yield output
Maybe an issue with the tokenizer? Not sure what this |
I also noticed in your implementation that you were decoding 8 tokens at a time:

if isinstance(piece, int):
    pieces.append(piece)
if isinstance(piece, str):
    size += 1
    if size == chunk_size:
        audio = self.get_audio(pieces)

This chunk size is way too small; it's about 0.1 seconds of audio per decode. This likely leads to the issues you described, such as clicking. Also calling
The implementation I provided above should address these issues. |
This is every 8 decodable tokens, not every 8 audio tokens. I agree that 8 audio tokens would be far too few to decode and would probably sound like noise. Decodable tokens only seem to occur after each part of the audio generation completes. The model's generate_stream() yields every token id, plus a string every time there's a decodable token. If there's a decodable token, size goes up by one until it reaches 8. |
I've pushed a commit for GGUF streaming support: 4ef8afb. I think we can build on this for the other backends.

interface.generate_stream(
    text="",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
    chunk_size=8,          # word chunks that will be streamed
    stream_segments=False  # if set to True, streams split segments instead of word chunks
) |
Sounds good, I'll check it out. Sorry for going silent on this; things on my side got really busy pretty quickly. I'll work on this PR once I'm able to. The generate_stream function should be directly reusable in InterfaceHF once the logic for moving input_ids to CPU/GPU is implemented straight in the respective engine (super easy); you might have already seen that. |
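A minimal sketch of what that device handling could look like inside an engine (illustrative only; the class name, self.device, and the wrapped call are assumptions, not the actual OuteTTS internals):

import torch

class ExampleEngine:
    # Hypothetical engine wrapper; names are illustrative, not the library's API.
    def __init__(self, model):
        self.model = model
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def generate_stream(self, input_ids: torch.Tensor, **gen_kwargs):
        # Move the prompt to whatever device this engine runs on, so the
        # interface-level generate_stream no longer needs its own .cpu()/.to() call.
        input_ids = input_ids.to(self.device)
        yield from self.model.generate_stream(input_ids=input_ids, **gen_kwargs)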
In addition, what could work well for chunking is decoding one overlapping token, then fading it out while that same overlapping token is faded in at the start of the next chunk. That way the audio volume stays consistent and the popping disappears; see the sketch after this comment. The fade curve can even be exponential or quadratic, as long as the fade-in and fade-out sum to 1 at every point.
Problems with this:
Solves:
|
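A minimal sketch of that overlap-and-crossfade idea, assuming both chunks are 1-D NumPy arrays and the overlap length corresponds to the one shared token (crossfade_join is a hypothetical helper; the linear ramp is just one choice of complementary curves summing to 1):

import numpy as np

def crossfade_join(prev_chunk: np.ndarray, next_chunk: np.ndarray, overlap: int) -> np.ndarray:
    # Fade out the tail of the previous chunk while fading in the head of the
    # next one; the two gains sum to 1 everywhere, so loudness stays constant.
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    blended = prev_chunk[-overlap:] * fade_out + next_chunk[:overlap] * fade_in
    return np.concatenate([prev_chunk[:-overlap], blended, next_chunk[overlap:]])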
All good! No rush :)
That's a good idea; we could add some kind of dummy token and then fade it. The popping mostly happens because the audio is cut abruptly at the chunk boundaries. If you stream with
After some testing you'll see that in the implementation I keep the token history as context for the WavTokenizer decoder. That fixes an issue where it couldn't handle smaller chunks, since it now has the previous history to work with; we then just slice out the audio for the latest chunk after it's decoded. |
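A rough sketch of that decode-with-history-then-slice approach (hedged: history_tokens, new_tokens, and samples_per_token are placeholders, not the exact names or math used in the actual implementation):

# Hedged sketch: decode the new token group together with the retained history,
# then keep only the audio that corresponds to the newest tokens.
def decode_latest(self, history_tokens, new_tokens, samples_per_token):
    audio = self.get_audio(history_tokens + new_tokens)  # decode with context
    keep = len(new_tokens) * samples_per_token            # samples belonging to the new chunk
    return audio[..., -keep:]                             # slice off everything decoded before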
My only issue with this is that decoding time also increases as the decode size gets bigger. I think keeping the past ~1-3 tokens should be sufficient without sacrificing too much performance (this is still a decent chunk of decoding-performance loss, though, since decoding 1 token is a lot different from decoding 4). Perhaps only use this with chunks < 4? Or scale it according to size, i.e. chunks of 8 keep 1 extra token while chunks of 2 keep 2 or 3. |
Decoding should be fast, and it's quite dependent on history to keep the audio quality. Maybe don't keep all of it, but it should keep a solid chunk of words in the context, say something like 16-32 words; after that it can use a sliding window that keeps the newer words. |
In my original implementation, chunks consist of chunk_size words (technically not words, since text is split via the tokenizer), not a single word. I think 8 words/chunks/tokens (words) would be enough to keep audio quality, as that alone is enough for coherent audio. So 8 + 8 would give 16 words of audio context. |
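A small sketch of such a sliding-window history (assuming the decoder context is tracked as a flat list of code tokens; MAX_CONTEXT_WORDS and TOKENS_PER_WORD are hypothetical knobs, not values from the codebase):

from collections import deque

MAX_CONTEXT_WORDS = 16   # e.g. 8 words of new chunk + 8 words of retained history
TOKENS_PER_WORD = 75     # placeholder; depends on the codec's actual token rate

# deque with maxlen drops the oldest tokens automatically as new ones arrive.
history = deque(maxlen=MAX_CONTEXT_WORDS * TOKENS_PER_WORD)

def update_history(new_chunk_tokens):
    # Append the latest chunk's tokens and return the current decoder context.
    history.extend(new_chunk_tokens)
    return list(history)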
Dear experts, have you figured out this feature? When will it be released? |
Implementing a basic version should be pretty easy; I just haven't had the time or motivation to do it. If you're looking for a decent placeholder in the meantime, look at #37 (comment) (demo two comments above). This won't be the same as the final behavior. |
Additionally, stuff changed a bit in the codebase, so I'll need to sort that out. |
Will add streaming support.