Streaming support #41

Draft · wants to merge 11 commits into main
Conversation

Ednaordinary (Contributor)

Will add streaming support.

@Ednaordinary (Contributor Author)

Current commits are completely untested, hence the draft.

@Ednaordinary (Contributor Author)

Okay, this works with the following script:

import outetts
import sounddevice as sd
import threading
import time

model_config = outetts.EXL2ModelConfig_v1(
    model_path="OuteTTSexl2",
    language="en",
)

interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)

speaker = interface.load_default_speaker(name="male_1")

audio_queue = []  # chunks appended by the generation loop, consumed by player()

def player():
    # Poll the queue and play chunks in order; a bool sentinel stops playback.
    while True:
        time.sleep(0.01)
        while audio_queue != []:
            if isinstance(audio_queue[0], bool):
                return
            sd.play(audio_queue[0], samplerate=24000)
            sd.wait()
            audio_queue.pop(0)

threading.Thread(target=player).start()

print("running interface")

for i in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    audio_queue.append(i.audio.cpu().numpy().squeeze())
audio_queue.append(False)  # sentinel: tells the player thread to stop

Currently it's only implemented for EXL2. Except for the input_ids = self.prepare_prompt(text, speaker).to("cpu") line, generate_stream in the interface should just be movable to InterfaceHF; after that, implementing generate_stream for HF and GGUF shouldn't be too hard.

@Ednaordinary (Contributor Author)

Also, this doesn't include any sort of fading or other way to reduce chunk-clipping noise (yet).

@edwko (Owner) commented Dec 1, 2024

You should look for the code_end token and append word by word for buffering the audio. Here’s an example of how you could implement it:

def generate_stream(
    self,
    text: str,
    speaker: dict = None,
    temperature: float = 0.1,
    repetition_penalty: float = 1.1,
    max_length=4096,
    additional_gen_config={},
    additional_dynamic_generator_config={},
    chunk_size: int = 8,
):
    if chunk_size < 1:
        raise ValueError("Chunk size should be 1 or more")

    code_end_token = self.prompt_processor.tokenizer.encode(
        self.prompt_processor.special_tokens["code_end"], add_special_tokens=False
    )[0]
    logger.info(f"Code end token: {code_end_token}")

    # you can use .cpu() instead of .to("cpu")
    input_ids = self.prepare_prompt(text, speaker).cpu()
    if self.verbose:
        logger.info(f"Input tokens: {len(input_ids)}")
        logger.info("Generating audio...")

    self.check_generation_max_length(max_length)

    audio_buffer = []
    token_buffer = []

    for piece in self.model.generate_stream(
        input_ids=input_ids,
        config=GenerationConfig(
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            max_length=max_length,
            additional_gen_config=additional_gen_config,
        ),
        additional_dynamic_generator_config=additional_dynamic_generator_config,
    ):
        token_buffer.append(piece)

        if piece == code_end_token:
            # A full word's worth of audio codes has been generated
            audio_buffer.append(token_buffer)
            token_buffer = []

        if len(audio_buffer) == chunk_size:
            output = ModelOutput(
                self.get_audio([item for sublist in audio_buffer for item in sublist]),
                self.audio_codec.sr,
            )
            audio_buffer = []
            yield output

    if audio_buffer:
        output = ModelOutput(
            self.get_audio([item for sublist in audio_buffer for item in sublist]),
            self.audio_codec.sr,
        )
        yield output

As mentioned before, we should use an audio queue for better handling of audio playback. Also, the output can handle playing directly without needing a separate function:

import outetts
import threading
import queue

model_config = outetts.EXL2ModelConfig_v1(
    model_path="",
    language="en",
)

interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)

speaker = interface.load_default_speaker(name="male_1")

audio_queue = queue.Queue()

def audio_player():
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            # No more audio chunks to process
            break
        chunk.play() 
        audio_queue.task_done()

audio_thread = threading.Thread(target=audio_player)
audio_thread.start()

for chunk in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    print(chunk)
    audio_queue.put(chunk)

# Signal that audio generation is complete
audio_queue.put(None)

audio_thread.join()

@edwko (Owner) commented Dec 1, 2024

You should also look into the EXL2 model. For some reason, it fails to generate properly: it keeps generating indefinitely and does not output the EOS token.

@Ednaordinary (Contributor Author)

You should look for the code_end token and append word by word for buffering the audio.

That works. I assumed decodable tokens would appear in a similar place to code_end, and that approach would also allow the future possibility of yielding the associated text alongside the audio chunk.

As mentioned before, we should use an audio queue for better handling of audio playback.

I wasn't confident about implementing this; thanks for the example.

You should also look into the EXL2 model. For some reason, it fails to generate properly.

Will do

@edwko (Owner) commented Dec 1, 2024

You should look for the code_end token and append word by word for buffering the audio. Here’s an example of how you could implement it

We should avoid decoding each time; this approach is better and faster:

for piece in self.model.generate_stream(
    input_ids=input_ids,
    config=GenerationConfig(
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_length=max_length,
        additional_gen_config=additional_gen_config,
    ),
    additional_dynamic_generator_config=additional_dynamic_generator_config
):

    token_buffer.append(piece)
    
    if piece == code_end_token:
        audio_buffer.append(token_buffer)
        token_buffer = []

    if len(audio_buffer) == chunk_size:
        output = ModelOutput(self.get_audio([item for sublist in audio_buffer for item in sublist]), self.audio_codec.sr)
        audio_buffer = []
        yield output
    
if audio_buffer:
    output = ModelOutput(self.get_audio([item for sublist in audio_buffer for item in sublist]), self.audio_codec.sr)
    yield output

You should also look into the EXL2 model. For some reason, it fails to generate properly. It keeps generating indefinitely and does not output the EOS token.

hardware<|t_0.43|><|code_start|><|1507|><|1714|><|1738|><|1745|><|1054|><|58|><|926|><|1277|><|589|><|1103|><|614|><|1121|><|475|><|1429|><|1124|><|1335|><|423|><|240|><|1598|><|1793|><|464|>immer<|1084|><|321|><|1322|>immer<|1666|><|697|><|1205|>immer<|code_end|>
productsimmer<|t_0.65|><|code_start|><|1179|><|1127|><|672|><|145|><|447|><|563|><|338|><|1361|><|1524|><|1742|><|1448|><|1461|><|871|><|1211|><|828|><|1372|><|724|><|37|><|1819|><|1634|><|853|><|1216|><|149|><|892|><|1186|><|929|><|684|><|1680|><|668|><|414|><|242|><|918|><|832|><|1226|><|767|><|1804|><|1686|><|924|><|671|><|1789|><|1564|><|1464|><|378|><|90|><|1325|><|1471|><|1443|><|1254|><|814|><|code_end|>
<|audio_end|>
Where is the EOS token?




certain<|t_0.35|><|code_start|><|731|><|1547|><|784|><|778|><|1586|><|541|><|28|><|1133|><|1208|><|327|><|658|><|598|><|300|><|444|><|1192|><|605|><|1152|><|353|><|1590|><|663|><|1453|><|1585|><|883|><|1693|><|1062|><|216|><|code_end|>
that<|t_0.13|><|code_start|><|1413|><|530|><|385|><|1340|><|616|><|388|><|1350|><|1473|><|895|><|174|><|code_end|>

Maybe an issue with the tokenizer? I'm not sure what this immer -> <|464|>immer<|1084|><|321|><|1322|>immer<|1666|><|697|><|1205|>immer<|code_end|> is, or where the EOS token should be; it just seems to skip it.

@edwko (Owner) commented Dec 1, 2024

I also noticed in your implementation that you were decoding 8 tokens at a time:

if isinstance(piece, int):
    pieces.append(piece)  # audio token id
if isinstance(piece, str):
    size += 1             # marker that a decodable token was produced
if size == chunk_size:
    audio = self.get_audio(pieces)

This chunk size is way too small; it's about 0.1 seconds of audio per decode. This likely leads to the issues you described, such as clicking. Also, calling self.get_audio that frequently will significantly slow down generation.
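For a rough sense of scale, a minimal sketch (the 75 tokens-per-second rate is an assumption implied by the ~0.1 s figure above, not taken from the codebase):

AUDIO_TOKENS_PER_SECOND = 75  # assumed codec rate, consistent with "8 tokens ≈ 0.1 s"

def chunk_duration_seconds(num_audio_tokens: int) -> float:
    # Approximate playback length of a decoded chunk of audio tokens.
    return num_audio_tokens / AUDIO_TOKENS_PER_SECOND

print(chunk_duration_seconds(8))  # ~0.11 s per decode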

The implementation I provided above should address these issues.

@Ednaordinary (Contributor Author)

I also noticed in your implementation that you were decoding 8 tokens at a time

This is every 8 decodable tokens, not every 8 audio tokens. I agree that 8 audio tokens would be way too few to decode and would probably sound like noise. Decodable tokens seem to occur only after each part of the audio generation completes. The model's generate_stream() yields every token_id, plus a string every time there's a decodable token. When there's a decodable token, size goes up by one until it reaches 8.

@edwko (Owner) commented Dec 9, 2024

I've pushed a commit for GGUF streaming support (4ef8afb). I think we can build on this for the other backends.

interface.generate_stream(
    text="",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
    chunk_size=8,  # word chunks that will be streamed
    stream_segments=False  # if set to True, streams split segments instead of word chunks
)

@Ednaordinary (Contributor Author)

Sounds good, I'll check it out. Sorry for going silent on this, things on my side got really busy pretty quick. I'll work on this PR once I'm able to. The generate_stream function should be directly reusable in InterfaceHF once the logic for moving input_ids to CPU/GPU is implemented in the respective engine (super easy); you might have already seen that. A rough sketch of that idea follows below.
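A minimal sketch, assuming hypothetical names (this is not the actual outetts API): each backend engine owns its device and moves input_ids itself, so the shared generate_stream never needs a .to("cpu") or .to("cuda") call.

import torch

class HFEngine:
    # Hypothetical backend wrapper; class and method names are assumptions.
    def __init__(self, model: torch.nn.Module):
        self.model = model
        self.device = next(model.parameters()).device

    def prepare_input_ids(self, input_ids: torch.Tensor) -> torch.Tensor:
        # Each backend decides where its inputs live.
        return input_ids.to(self.device)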

@Ednaordinary (Contributor Author)

In addition, what could work well for chunking is decoding one overlapping token and then fading it out as that same overlapping token is faded in for the next chunk. That way, the audio volume stays consistent while the popping disappears. The fade function can even be exponential or quadratic, as long as the fade-in and fade-out sum to 1 at any point in time (see the sketch after these lists). Problems with this:

  • One chunk has to wait for the next to decode
  • Complex to implement, especially timing which tokens are where in time
  • Slightly higher latency from the decoder since the same token is decoded multiple times

Solves:

  • No more popping
  • Chunk length can go from 8 down to ~2-3
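A minimal sketch of the crossfade described above (the function name, the linear fade shape, and the overlap handling are illustrative assumptions, not this PR's implementation):

import numpy as np

def crossfade_overlap(prev_tail: np.ndarray, next_head: np.ndarray) -> np.ndarray:
    # Blend the overlapping region so fade_out + fade_in == 1 at every sample,
    # keeping volume constant while removing the chunk-boundary pop.
    n = min(len(prev_tail), len(next_head))
    t = np.linspace(0.0, 1.0, n)
    return prev_tail[:n] * (1.0 - t) + next_head[:n] * t

Any monotone curve works in place of the linear ramp, as long as the two fades sum to 1.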

@Ednaordinary (Contributor Author) commented Dec 10, 2024

A (transformed) sigmoid could be a good fade curve for this:

[attached image: plot of the sigmoid function]
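A small sketch of that curve, assuming a steepness parameter k (illustrative, not from the PR):

import numpy as np

def sigmoid_fades(n: int, k: float = 10.0):
    # fade_in(t) = sigmoid(k * (t - 0.5)); fade_out = 1 - fade_in, so the two
    # always sum to 1. Rescale the ends if they must hit exactly 0 and 1.
    t = np.linspace(0.0, 1.0, n)
    fade_in = 1.0 / (1.0 + np.exp(-k * (t - 0.5)))
    return fade_in, 1.0 - fade_in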

@edwko (Owner) commented Dec 10, 2024

Sorry for going silent on this, things on my side got really busy pretty quick.

All good! No rush :)

what could work well for chunking is decoding one overlapping token and then fade it out as that same overlapping token is faded in for the next chunk.

That's a good idea; we could add some kind of dummy token and then fade it. The popping mostly happens because the audio gets cut abruptly between chunks. If you stream with stream_segments enabled, it streams roughly sentence by sentence, so it's mostly fine and any popping should be minimal.

After some testing, you'll see that in the implementation I keep the token history as context for the wav tokenizer decoder. That fixes an issue where it couldn't handle smaller chunks, since it now has the previous history to work with; then we just slice out the audio for the latest chunk after it's decoded.
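A rough sketch of that decode-with-history idea (decode_fn, samples_per_token, and slicing by sample count are illustrative assumptions, not the actual 4ef8afb code):

token_history = []

def decode_latest(decode_fn, new_tokens, samples_per_token: int):
    # Decode the full history plus the new chunk, then return only the
    # samples that belong to the newest chunk.
    token_history.extend(new_tokens)
    audio = decode_fn(token_history)
    return audio[-len(new_tokens) * samples_per_token:]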

@Ednaordinary (Contributor Author)

I keep the token history as context for wav tokenizer decoder.

My only issue with this is that decoding time will also increase as the decode size gets bigger. I think keeping the past ~1-3 tokens should be sufficient without sacrificing too much performance (though this is still a decent decoding-performance loss, since decoding 1 token is a lot different from decoding 4). Perhaps only use this with chunks < 4? Or scale it with chunk size, i.e. chunks of 8 get 1 extra token while chunks of 2 get 2 or 3.

@edwko (Owner) commented Dec 17, 2024

I keep the token history as context for wav tokenizer decoder.

My only issue with this is that decoding time will also increase as the decode size gets bigger. I think keeping the past ~1-3 tokens should be sufficient without sacrificing too much performance (though this is still a decent decoding-performance loss, since decoding 1 token is a lot different from decoding 4). Perhaps only use this with chunks < 4? Or scale it with chunk size, i.e. chunks of 8 get 1 extra token while chunks of 2 get 2 or 3.

Decoding should be fast, and it's quite dependent on history to keep the audio quality. Maybe don't keep all of it, but it should keep a solid chunk of words in the context, say something like 16-32 words; after that it can use a sliding window that keeps the newer words (see the sketch below).
As for "~1-3 tokens": we should look at this not by tokens but by chunk (word).

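A small sketch of that sliding window (the 32-chunk cap and the names are illustrative assumptions):

from collections import deque

context = deque(maxlen=32)  # keep roughly the last 16-32 word-chunks

def decoder_context(new_chunk_tokens):
    # Append the newest word-chunk; the deque drops the oldest once full.
    context.append(list(new_chunk_tokens))
    return [tok for chunk in context for tok in chunk]
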
@Ednaordinary (Contributor Author)

In my original implementation, chunks consist of chunk_size words (technically not words, since the text is split via the tokenizer), not a single word. I think 8 words would be enough to keep audio quality, as that alone is enough to produce coherent audio. So 8 + 8 would give 16 words of audio context.

@fackweb commented Jan 16, 2025

Dear experts, have you figured out this feature? When will it be released?

@Ednaordinary (Contributor Author) commented Jan 16, 2025

Implementing a basic version should be pretty easy; I just haven't had the time or motivation to do it. If you're looking for a decent placeholder in the meantime, look at #37 (comment) (demo two comments above).

This won't be the same as the final behavior.

@Ednaordinary (Contributor Author)

Additionally, the codebase has changed a bit, so I'll need to sort that out.
