Step 1: Open gradio_interface.py
Insert the following code above def build_interface(): (around line 196) and save the changes.
##### Plugin code Start #####
# Note: this block is pasted into gradio_interface.py, so it relies on names
# already in scope there (torch, torchaudio, Path, load_model_if_needed, make_cond_dict).
def batch_generate_audio(
    model_choice: str,
    text_dir: str,
    output_dir: str,
    language: str = "en-us",
    speaker_audio_path: str = None,
    cfg_scale: float = 2.0,
    min_p: float = 0.15,
    seed: int = 420,
    # Fixed emotion parameters matching the Gradio defaults
    e1: float = 1.0,   # Happiness
    e2: float = 0.05,  # Sadness
    e3: float = 0.05,  # Disgust
    e4: float = 0.05,  # Fear
    e5: float = 0.05,  # Surprise
    e6: float = 0.05,  # Anger
    e7: float = 0.1,   # Other
    e8: float = 0.2,   # Neutral
):
    """Batch generate audio from text files with a consistent tone."""
    # Load the model once
    model = load_model_if_needed(model_choice)
    device = model.device
    # Force deterministic settings
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)
    # Create the output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)
    # Process the speaker embedding once, if provided
    speaker_embedding = None
    if speaker_audio_path:
        wav, sr = torchaudio.load(speaker_audio_path)
        speaker_embedding = model.make_speaker_embedding(wav, sr)
        speaker_embedding = speaker_embedding.to(device, dtype=torch.bfloat16)
    # Fixed emotion tensor
    emotion_tensor = torch.tensor(
        [e1, e2, e3, e4, e5, e6, e7, e8],
        device=device,
        dtype=torch.float32,
    )
    # Process the text files
    text_files = list(Path(text_dir).glob("*.txt"))
    for text_file in text_files:
        with open(text_file, "r", encoding="utf-8") as f:
            text = f.read().strip()[:500]  # hard cap on characters per file
        # Create consistent conditioning
        cond_dict = make_cond_dict(
            text=text,
            language=language,
            speaker=speaker_embedding,
            emotion=emotion_tensor,
            device=device,
            # Only disable these if your model supports them
            unconditional_keys=["vqscore_8", "dnsmos_ovrl"],
        )
        conditioning = model.prepare_conditioning(cond_dict)
        # Deterministic generation
        with torch.no_grad():
            codes = model.generate(
                prefix_conditioning=conditioning,
                max_new_tokens=86 * 30,
                cfg_scale=cfg_scale,
                batch_size=1,
                sampling_params=dict(min_p=min_p),
            )
        # Audio processing and saving (keep the previous fix)
        wav_out = model.autoencoder.decode(codes).cpu().detach()
        if wav_out.dim() == 3:
            wav_out = wav_out.squeeze(0)
        if wav_out.dim() == 1:
            wav_out = wav_out.unsqueeze(0)
        if wav_out.dim() == 2 and wav_out.size(0) > 1:
            wav_out = wav_out[0:1, :]
        output_file = output_path / f"{text_file.stem}.wav"
        torchaudio.save(str(output_file), wav_out, model.autoencoder.sampling_rate)
    return f"Processed {len(text_files)} files with consistent tone to {output_dir}"
##### Plugin code End #####
Step 2: Create batch_process.py
In the same directory as gradio_interface.py, create a new file, insert the following code, and save it.
##### batch_process.py #####
import argparse

from gradio_interface import batch_generate_audio

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Batch process text files into audio')
    parser.add_argument('--model', type=str, default="Zyphra/Zonos-v0.1-transformer",
                        help='Model name from Hugging Face Hub')
    parser.add_argument('--text-dir', type=str, required=True,
                        help='Directory containing text files')
    parser.add_argument('--output-dir', type=str, required=True,
                        help='Directory to save audio files')
    parser.add_argument('--speaker-audio', type=str, default=None,
                        help='Path to speaker audio for voice cloning')
    parser.add_argument('--language', type=str, default="en-us",
                        help='Language code for synthesis')
    parser.add_argument('--seed', type=int, default=420,
                        help='Random seed for reproducibility')
    args = parser.parse_args()

    result = batch_generate_audio(
        model_choice=args.model,
        text_dir=args.text_dir,
        output_dir=args.output_dir,
        language=args.language,
        speaker_audio_path=args.speaker_audio,
        seed=args.seed,
    )
    print(result)
Step 3: Prepare your text
You can either manually divide your text into small chunks that produce at most 27 to 30 seconds of audio, or use a text splitter to segment it into smaller portions. Zonos generates a maximum of about 30 seconds of audio per run and does not count words, so a sentence with few words can still produce longer audio depending on the settings. With the default settings, I found 40 to 50 words work well. The script below splits your text at sentences ending with '.', '!', or '?' (you can adjust this filter) and keeps each chunk to at most 50 words; change the limit by editing the value of word_limit.
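As a rough sanity check on the 40-to-50-word guideline (the speaking rate below is an assumed average, not something Zonos guarantees):

# Back-of-the-envelope chunk sizing, assuming roughly 150 spoken words per minute.
words_per_second = 150 / 60      # ~2.5 words/s (assumption)
print(50 / words_per_second)     # 50 words -> ~20 s, safely under the 30 s cap
print(86 * 30)                   # 2580: the plugin's max_new_tokens, which matches
                                 # the 30 s cap if the model emits ~86 tokens/s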
Put your text into a text file and save it as data.txt.
You can change the file name, but update the input_file value to match.
Step 4: Split the text
Create a new file named text_split.py, copy and paste the following code into it, and save it. Then run:
python text_split.py
This script will create a folder named 'text_chunks' and save the text chunks there as numbered files. If any sentence exceeds 50 words, it is written to its own file (chunk_plus_50_1.txt, and so on) for you to address manually.
##### text_split.py #####
from pathlib import Path

def split_into_sentences(text):
    # Split text into sentences and keep track of line numbers
    lines = text.split('\n')
    sentences = []
    line_mappings = []  # Store (sentence, start_line, end_line)
    current_sentence = []
    start_line = 1
    for line_num, line in enumerate(lines, 1):
        words = line.strip().split()
        if not words:  # Skip empty lines
            if current_sentence:
                sentence = ' '.join(current_sentence)
                sentences.append(sentence)
                line_mappings.append((sentence, start_line, line_num))
                current_sentence = []
                start_line = line_num + 1
            continue
        current_sentence.extend(words)
        # Check for sentence endings
        if words[-1].endswith(('.', '!', '?')):
            sentence = ' '.join(current_sentence)
            sentences.append(sentence)
            line_mappings.append((sentence, start_line, line_num))
            current_sentence = []
            start_line = line_num + 1
    # Handle any remaining text
    if current_sentence:
        sentence = ' '.join(current_sentence)
        sentences.append(sentence)
        line_mappings.append((sentence, start_line, len(lines)))
    return sentences, line_mappings

def count_words(text):
    return len(text.split())

def create_mapping_file(chunk_mappings, output_dir):
    with open(output_dir / 'chunk_mapping.txt', 'w', encoding='utf-8') as f:
        f.write("Chunk Mapping Reference:\n")
        f.write("=" * 80 + "\n\n")
        for chunk_info in chunk_mappings:
            f.write(f"File: {chunk_info['filename']}\n")
            f.write(f"Lines: {chunk_info['start_line']} to {chunk_info['end_line']}\n")
            f.write(f"Preview: {chunk_info['preview'][:100]}...\n")
            f.write("-" * 80 + "\n\n")

def create_chunks(input_file, word_limit=50):
    output_dir = Path('text_chunks')
    output_dir.mkdir(exist_ok=True)
    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()
    sentences, line_mappings = split_into_sentences(text)
    current_chunk = []
    current_word_count = 0
    chunk_number = 1
    long_sentence_number = 1
    chunk_mappings = []
    sentence_index = 0
    while sentence_index < len(sentences):
        sentence = sentences[sentence_index]
        sentence_word_count = count_words(sentence)
        current_mapping = line_mappings[sentence_index]
        # Handle sentences longer than the word limit
        if sentence_word_count > word_limit:
            # Save any accumulated chunk first
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
                current_chunk = []
                current_word_count = 0
            # Save the long sentence to its own file for manual editing
            filename = f'chunk_plus_{word_limit}_{long_sentence_number}.txt'
            with open(output_dir / filename, 'w', encoding='utf-8') as f:
                f.write(sentence)
            chunk_mappings.append({
                'filename': filename,
                'start_line': current_mapping[1],
                'end_line': current_mapping[2],
                'preview': sentence
            })
            long_sentence_number += 1
            sentence_index += 1
            continue
        if current_word_count + sentence_word_count <= word_limit:
            current_chunk.append(sentence)
            current_word_count += sentence_word_count
        else:
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
            current_chunk = [sentence]
            current_word_count = sentence_word_count
        sentence_index += 1
    # Save any remaining chunk
    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        filename = f'chunk_{chunk_number}.txt'
        with open(output_dir / filename, 'w', encoding='utf-8') as f:
            f.write(chunk_text)
        chunk_mappings.append({
            'filename': filename,
            'start_line': line_mappings[sentence_index - len(current_chunk)][1],
            'end_line': line_mappings[sentence_index - 1][2],
            'preview': chunk_text
        })
    # Create the mapping reference file
    create_mapping_file(chunk_mappings, output_dir)

if __name__ == '__main__':
    input_file = 'data.txt'
    word_limit = 50  # default word limit
    try:
        create_chunks(input_file, word_limit)
        print("Text has been split into chunks in the 'text_chunks' directory")
        print("Check 'text_chunks/chunk_mapping.txt' for chunk locations")
    except Exception as e:
        print(f"An error occurred: {str(e)}")
Step 5: Clean your data & run
Examine the text files and review the results. 'chunk_mapping.txt' gives a detailed description of each chunk; move it out of the 'text_chunks' folder so that Zonos does not process it.
Place your speaker audio (your_voice.mp3 or *.wav) in the root directory of Zonos, or specify its path in the command below.
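Since the plugin picks up every *.txt file in the text directory, one way to keep the mapping file out of the run is a small move step (paths follow the defaults used above):

# Move chunk_mapping.txt out of text_chunks so it is not synthesized
from pathlib import Path
import shutil

mapping = Path('text_chunks/chunk_mapping.txt')
if mapping.exists():
    shutil.move(str(mapping), 'chunk_mapping.txt')  # park it next to data.txt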
I think you could concatenate the audio chunks into one single waveform tensor, which could then be written to a final file.
Yes, but it depends on the use case. For cases like this [https://www.youtube.com/watch?v=3OFkEtTFM84], it is practical to keep the output in parts.
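If you do want a single file, here is a minimal sketch of that concatenation (assuming all chunks came from the same batch run, so they share one sampling rate; the numeric sort key keeps chunk_10 from landing before chunk_2):

##### concat_audio.py (sketch) #####
import re
from pathlib import Path

import torch
import torchaudio

output_dir = Path("output_audio")

def chunk_index(p: Path) -> int:
    # Extract the chunk number so files sort numerically, not lexicographically
    m = re.search(r"(\d+)", p.stem)
    return int(m.group(1)) if m else 0

wavs = []
sample_rate = None
for wav_file in sorted(output_dir.glob("chunk_*.wav"), key=chunk_index):
    wav, sr = torchaudio.load(wav_file)
    if sample_rate is None:
        sample_rate = sr
    assert sr == sample_rate, f"Sampling rate mismatch in {wav_file}"
    wavs.append(wav)

# Concatenate along the time dimension and write one file
full_audio = torch.cat(wavs, dim=1)
torchaudio.save("final_audio.wav", full_audio, sample_rate)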
Usage:
python batch_process.py --text-dir ./text_chunks --output-dir ./output_audio --speaker-audio ./your_voice.mp3 --language en-us --seed 420
In my tests, this produces up to 80 to 85% consistent voice tone.
Another solution: #98 (comment)