Streaming support #41
base: main
Conversation
Current commits are completely untested, hence the draft status. |
Okay, this works with the following script:

import outetts
import sounddevice as sd
import threading
import time

model_config = outetts.EXL2ModelConfig_v1(
    model_path="OuteTTSexl2",
    language="en",
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
speaker = interface.load_default_speaker(name="male_1")

audio_queue = []

def player():
    global audio_queue
    while True:
        time.sleep(0.01)
        while audio_queue != []:
            if isinstance(audio_queue[0], bool):
                return
            sd.play(audio_queue[0], samplerate=24000)
            sd.wait()
            audio_queue.pop(0)

threading.Thread(target=player).start()

print("running interface")
for i in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    audio_queue.append(i.audio.cpu().numpy().squeeze())
audio_queue.append(False)

Currently it's only implemented in EXL2. Except for |
Also, this doesn't include any fading or other way to reduce chunk clipping noise (yet) |
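For illustration, a minimal sketch of a per-chunk edge fade that could be applied before playback (assuming each chunk is a 1-D NumPy float array at 24 kHz; edge_fade and fade_ms are hypothetical names, not part of the library):

import numpy as np

def edge_fade(chunk: np.ndarray, sr: int = 24000, fade_ms: float = 5.0) -> np.ndarray:
    # Apply a short linear fade-in and fade-out to soften chunk boundaries.
    n = min(len(chunk) // 2, int(sr * fade_ms / 1000))
    if n == 0:
        return chunk
    out = chunk.astype(np.float32, copy=True)
    ramp = np.linspace(0.0, 1.0, n)
    out[:n] *= ramp
    out[-n:] *= ramp[::-1]
    return out

Something like audio_queue.append(edge_fade(i.audio.cpu().numpy().squeeze())) would slot into the loop above.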
You should look for the code_end token:

def generate_stream(
    self,
    text: str,
    speaker: dict = None,
    temperature: float = 0.1,
    repetition_penalty: float = 1.1,
    max_length=4096,
    additional_gen_config={},
    additional_dynamic_generator_config={},
    chunk_size: int = 8,
):
    if chunk_size < 1:
        raise ValueError("Chunk size should be 1 or more")

    code_end_token = self.prompt_processor.tokenizer.encode(
        self.prompt_processor.special_tokens["code_end"], add_special_tokens=False
    )[0]
    logger.info(f"Code end token: {code_end_token}")

    # you can use .cpu() instead of .to("cpu")
    input_ids = self.prepare_prompt(text, speaker).cpu()
    if self.verbose:
        logger.info(f"Input tokens: {len(input_ids)}")
        logger.info("Generating audio...")

    self.check_generation_max_length(max_length)

    audio_buffer = []
    token_buffer = []
    for piece in self.model.generate_stream(
        input_ids=input_ids,
        config=GenerationConfig(
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            max_length=max_length,
            additional_gen_config=additional_gen_config,
        ),
        additional_dynamic_generator_config=additional_dynamic_generator_config,
    ):
        token_buffer.append(piece)
        if piece == code_end_token:
            audio_buffer.append(token_buffer)
            token_buffer = []
            if len(audio_buffer) == chunk_size:
                output = ModelOutput(
                    self.get_audio([item for sublist in audio_buffer for item in sublist]),
                    self.audio_codec.sr,
                )
                audio_buffer = []
                yield output
    if audio_buffer:
        output = ModelOutput(
            self.get_audio([item for sublist in audio_buffer for item in sublist]),
            self.audio_codec.sr,
        )
        yield output

As mentioned before, we should use an audio queue for better handling of audio playback. Also, the output can handle playing directly without needing a separate function:

import outetts
import threading
import queue

model_config = outetts.EXL2ModelConfig_v1(
    model_path="",
    language="en",
)
interface = outetts.InterfaceEXL2(model_version="0.2", cfg=model_config)
speaker = interface.load_default_speaker(name="male_1")

audio_queue = queue.Queue()

def audio_player():
    while True:
        chunk = audio_queue.get()
        if chunk is None:
            # No more audio chunks to process
            break
        chunk.play()
        audio_queue.task_done()

audio_thread = threading.Thread(target=audio_player)
audio_thread.start()

for chunk in interface.generate_stream(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
):
    print(chunk)
    audio_queue.put(chunk)

# Signal that audio generation is complete
audio_queue.put(None)
audio_thread.join() |
You should also look into the EXL2 model. For some reason, it fails to generate properly. It keeps generating indefinitely and does not output the |
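If generation never terminates, one thing worth checking (a hedged debugging suggestion, related to the tokenizer concern raised further down) is how the relevant special token actually encodes; if it splits into several ids, or into a different id than the one the streaming loop compares against, the end of a code group will never be detected:

# Hypothetical debugging snippet, reusing attributes shown in the generate_stream sketch above.
tokenizer = interface.prompt_processor.tokenizer
end_ids = tokenizer.encode(
    interface.prompt_processor.special_tokens["code_end"],
    add_special_tokens=False,
)
print("code_end encodes to:", end_ids)  # expected: exactly one token id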
That works. I assumed decodable tokens would be in a similar place to code_end, and this would also leave open the possibility of yielding the associated text along with each audio chunk.
I wasn't confident implementing this, thanks for the example
Will do |
We should avoid decoding each time; this approach is better and faster:

for piece in self.model.generate_stream(
    input_ids=input_ids,
    config=GenerationConfig(
        temperature=temperature,
        repetition_penalty=repetition_penalty,
        max_length=max_length,
        additional_gen_config=additional_gen_config,
    ),
    additional_dynamic_generator_config=additional_dynamic_generator_config,
):
    token_buffer.append(piece)
    if piece == code_end_token:
        audio_buffer.append(token_buffer)
        token_buffer = []
        if len(audio_buffer) == chunk_size:
            output = ModelOutput(
                self.get_audio([item for sublist in audio_buffer for item in sublist]),
                self.audio_codec.sr,
            )
            audio_buffer = []
            yield output
if audio_buffer:
    output = ModelOutput(
        self.get_audio([item for sublist in audio_buffer for item in sublist]),
        self.audio_codec.sr,
    )
    yield output
Maybe an issue with the tokenizer? Not sure what this |
I also noticed in your implementation that you were decoding 8 tokens at a time:

if isinstance(piece, int):
    pieces.append(piece)
if isinstance(piece, str):
    size += 1
    if size == chunk_size:
        audio = self.get_audio(pieces)

This chunk size is way too small; it's about 0.1 seconds of audio per decode. This likely leads to the issues you described, such as clicking. Also calling
The implementation I provided above should address these issues. |
This is every 8 decodable tokens, not every 8 audio tokens. I agree that 8 audio tokens would be far too few to decode and would probably sound like noise. Decodable tokens only seem to occur after each part of the audio generation completes. The model's generate_stream() yields every token id, plus a string every time there's a decodable token. If there's a decodable token, size goes up by one until it reaches 8. |
I've pushed a commit for GGUF streaming support: 4ef8afb. I think we can build on this for the other backends.

interface.generate_stream(
    text="",
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,
    speaker=speaker,
    chunk_size=8,          # word chunks that will be streamed
    stream_segments=False  # if set to True, streams split segments instead of word chunks
) |
Sounds good, I'll check it out. Sorry for going silent on this; things on my side got really busy pretty quickly. I'll work on this PR once I'm able to. The generate_stream function should be directly reusable in InterfaceHF once the logic for moving input_ids to CPU/GPU is implemented straight in the respective engine (super easy); you might have already seen that. |
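A minimal sketch of what that device handling could look like inside an engine (illustrative only; the class name, self.device, and the wrapped call are assumptions, not the actual OuteTTS internals):

import torch

class ExampleEngine:
    # Hypothetical engine wrapper; names are illustrative, not the library's API.
    def __init__(self, model):
        self.model = model
        self.device = "cuda" if torch.cuda.is_available() else "cpu"

    def generate_stream(self, input_ids: torch.Tensor, **gen_kwargs):
        # Move the prompt to whatever device this engine runs on, so the
        # interface-level generate_stream no longer needs its own .cpu()/.to() call.
        input_ids = input_ids.to(self.device)
        yield from self.model.generate_stream(input_ids=input_ids, **gen_kwargs)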
In addition, what could work well for chunking is decoding one overlapping token, then fading it out while that same overlapping token is faded in at the start of the next chunk. That way the audio volume stays consistent and the popping disappears; see the sketch after this comment. The fade curve can even be exponential or quadratic, as long as the fade-in and fade-out sum to 1 at every point.
Problems with this:
Solves:
|
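A minimal sketch of that overlap-and-crossfade idea, assuming both chunks are 1-D NumPy arrays and the overlap length corresponds to the one shared token (crossfade_join is a hypothetical helper; the linear ramp is just one choice of complementary curves summing to 1):

import numpy as np

def crossfade_join(prev_chunk: np.ndarray, next_chunk: np.ndarray, overlap: int) -> np.ndarray:
    # Fade out the tail of the previous chunk while fading in the head of the
    # next one; the two gains sum to 1 everywhere, so loudness stays constant.
    fade_in = np.linspace(0.0, 1.0, overlap)
    fade_out = 1.0 - fade_in
    blended = prev_chunk[-overlap:] * fade_out + next_chunk[:overlap] * fade_in
    return np.concatenate([prev_chunk[:-overlap], blended, next_chunk[overlap:]])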
All good! No rush :)
That's a good idea; we could add some kind of dummy token and then fade it. The popping mostly happens because the audio is cut abruptly at the chunk boundaries. If you stream with
After some testing you'll see that in the implementation I keep the token history as context for the WavTokenizer decoder. That fixes an issue where it couldn't handle smaller chunks, since it now has the previous history to work with; we then just slice out the audio for the latest chunk after it's decoded. |
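A rough sketch of that decode-with-history-then-slice approach (hedged: history_tokens, new_tokens, and samples_per_token are placeholders, not the exact names or math used in the actual implementation):

# Hedged sketch: decode the new token group together with the retained history,
# then keep only the audio that corresponds to the newest tokens.
def decode_latest(self, history_tokens, new_tokens, samples_per_token):
    audio = self.get_audio(history_tokens + new_tokens)  # decode with context
    keep = len(new_tokens) * samples_per_token            # samples belonging to the new chunk
    return audio[..., -keep:]                             # slice off everything decoded before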
My only issue with this is that decoding time also increases as the decode size gets bigger. I think keeping the past ~1-3 tokens should be sufficient without sacrificing too much performance (this is still a decent chunk of decoding-performance loss, though, since decoding 1 token is a lot different from decoding 4). Perhaps only use this with chunks < 4? Or scale it according to size, i.e. chunks of 8 keep 1 extra token while chunks of 2 keep 2 or 3. |
Decoding should be fast, and it's quite dependent on history to keep the audio quality. Maybe don't keep all of it, but it should keep a solid chunk of words in the context, say something like 16-32 words; after that it can use a sliding window that keeps the newer words. |
In my original implementation, chunks consist of chunk_size words (technically not words, since text is split via the tokenizer), not a single word. I think 8 words/chunks/tokens (words) would be enough to keep audio quality, as that alone is enough for coherent audio. So 8 + 8 would give 16 words of audio context. |
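A small sketch of such a sliding-window history (assuming the decoder context is tracked as a flat list of code tokens; MAX_CONTEXT_WORDS and TOKENS_PER_WORD are hypothetical knobs, not values from the codebase):

from collections import deque

MAX_CONTEXT_WORDS = 16   # e.g. 8 words of new chunk + 8 words of retained history
TOKENS_PER_WORD = 75     # placeholder; depends on the codec's actual token rate

# deque with maxlen drops the oldest tokens automatically as new ones arrive.
history = deque(maxlen=MAX_CONTEXT_WORDS * TOKENS_PER_WORD)

def update_history(new_chunk_tokens):
    # Append the latest chunk's tokens and return the current decoder context.
    history.extend(new_chunk_tokens)
    return list(history)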
Dear experts, have you figured out this feature? When will it be released? |
Implementing a basic version should be pretty easy; I just haven't had the time or motivation to do it. If you're looking for a decent placeholder in the meantime, look at #37 (comment) (demo two comments above). This won't be the same as the final behavior. |
Additionally, stuff changed a bit in the codebase, so I'll need to sort that out. |
Will add streaming support.