
[Plug-in] Unlimited Text Length. #105

Open
rzgarespo opened this issue Feb 17, 2025 · 2 comments

@rzgarespo

rzgarespo commented Feb 17, 2025

Usage:

python batch_process.py --text-dir ./text_chunks --output-dir ./output_audio --speaker-audio ./your_voice.mp3 --language en-us --seed 420

Another solution: #98 (comment)


Step 1: Open gradio_interface.py

Insert the code below just above `def build_interface():` (around line 196) and save the changes.

##### Plugin code Start #####

def batch_generate_audio(
    model_choice: str,
    text_dir: str,
    output_dir: str,
    language: str = "en-us",
    speaker_audio_path: str = None,
    cfg_scale: float = 2.0,
    min_p: float = 0.15,
    seed: int = 420,
    # Fixed emotion parameters matching Gradio defaults
    e1: float = 1.0,    # Happiness
    e2: float = 0.05,   # Sadness
    e3: float = 0.05,   # Disgust
    e4: float = 0.05,   # Fear
    e5: float = 0.05,   # Surprise
    e6: float = 0.05,   # Anger
    e7: float = 0.1,    # Other
    e8: float = 0.2,    # Neutral
):
    """Batch generate audio from text files with consistent tone"""

    # Load model once
    model = load_model_if_needed(model_choice)
    device = model.device

    # Force deterministic settings
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    torch.manual_seed(seed)
    if torch.cuda.is_available():
        torch.cuda.manual_seed_all(seed)

    # Create output directory
    output_path = Path(output_dir)
    output_path.mkdir(parents=True, exist_ok=True)

    # Process speaker embedding once if provided
    speaker_embedding = None
    if speaker_audio_path:
        wav, sr = torchaudio.load(speaker_audio_path)
        speaker_embedding = model.make_speaker_embedding(wav, sr)
        speaker_embedding = speaker_embedding.to(device, dtype=torch.bfloat16)

    # Fixed emotion tensor
    emotion_tensor = torch.tensor(
        [e1, e2, e3, e4, e5, e6, e7, e8],
        device=device,
        dtype=torch.float32
    )

    # Process text files
    text_files = list(Path(text_dir).glob("*.txt"))
    for text_file in text_files:
        with open(text_file, "r", encoding="utf-8") as f:
            text = f.read().strip()[:500]

        # Create consistent conditioning
        cond_dict = make_cond_dict(
            text=text,
            language=language,
            speaker=speaker_embedding,
            emotion=emotion_tensor,
            device=device,
            # Only disable these if your model supports them
            unconditional_keys=["vqscore_8", "dnsmos_ovrl"]
        )

        conditioning = model.prepare_conditioning(cond_dict)

        # Deterministic generation
        with torch.no_grad():
            codes = model.generate(
                prefix_conditioning=conditioning,
                max_new_tokens=86 * 30,
                cfg_scale=cfg_scale,
                batch_size=1,
                sampling_params=dict(min_p=min_p),
            )

        # Audio processing and saving (keep previous fix)
        wav_out = model.autoencoder.decode(codes).cpu().detach()
        if wav_out.dim() == 3:
            wav_out = wav_out.squeeze(0)
        if wav_out.dim() == 1:
            wav_out = wav_out.unsqueeze(0)
        if wav_out.dim() == 2 and wav_out.size(0) > 1:
            wav_out = wav_out[0:1, :]  # keep only the first channel

        output_file = output_path / f"{text_file.stem}.wav"
        torchaudio.save(str(output_file), wav_out, model.autoencoder.sampling_rate)

    return f"Processed {len(text_files)} files with consistent tone to {output_dir}"


##### Plugin code End #####

Step 2: Create batch_process.py

In the same directory as gradio_interface.py, create a new file named batch_process.py, insert the following code, and save it.

##### batch_process.py #####
import argparse
from gradio_interface import batch_generate_audio

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description='Batch process text files into audio')
    parser.add_argument('--model', type=str, default="Zyphra/Zonos-v0.1-transformer",
                       help='Model name from Hugging Face Hub')
    parser.add_argument('--text-dir', type=str, required=True,
                       help='Directory containing text files')
    parser.add_argument('--output-dir', type=str, required=True,
                       help='Directory to save audio files')
    parser.add_argument('--speaker-audio', type=str, default=None,
                       help='Path to speaker audio for voice cloning')
    parser.add_argument('--language', type=str, default="en-us",
                       help='Language code for synthesis')
    parser.add_argument('--seed', type=int, default=420,
                       help='Random seed for reproducibility')

    args = parser.parse_args()

    result = batch_generate_audio(
        model_choice=args.model,
        text_dir=args.text_dir,
        output_dir=args.output_dir,
        language=args.language,
        speaker_audio_path=args.speaker_audio,
        seed=args.seed
    )

    print(result)

Step 3: Prepare your text

You can either manually divide your text into chunks of at most 27 to 30 seconds of speech, or use a text splitter to segment it automatically. Zonos generates a maximum of 30 seconds of audio per run and does not count words, so a sentence with few words can still produce long audio depending on the settings. With the default settings, I found that 40 to 50 words per chunk work well. The script below splits your text at sentence endings ('.', '!', '?' — you can adjust this filter) into chunks no longer than 50 words. You can change the limit by adjusting the `word_limit` value.

Put your text into a text file and save it as data.txt.

You can change the file name, but then update the `input_file` value in text_split.py accordingly.

Step 4: Split the text

Create a new file named text_split.py, paste the following code, and save it. Then run:

python text_split.py

This script will create a folder named 'text_chunks' and save the text files there. If any sentence exceeds 50 words, the script writes it to its own file (e.g. 'chunk_plus_50_1.txt') for you to address manually.

##### text_split.py #####
import os
import re
from pathlib import Path

def split_into_sentences(text):
    # Split text into sentences and keep track of line numbers
    lines = text.split('\n')
    sentences = []
    line_mappings = []  # Store (sentence, start_line, end_line)
    current_sentence = []
    start_line = 1

    for line_num, line in enumerate(lines, 1):
        words = line.strip().split()
        if not words:  # Skip empty lines
            if current_sentence:
                sentence = ' '.join(current_sentence)
                sentences.append(sentence)
                line_mappings.append((sentence, start_line, line_num))
                current_sentence = []
                start_line = line_num + 1
            continue

        current_sentence.extend(words)

        # Check for sentence endings
        if words[-1].endswith(('.', '!', '?')):
            sentence = ' '.join(current_sentence)
            sentences.append(sentence)
            line_mappings.append((sentence, start_line, line_num))
            current_sentence = []
            start_line = line_num + 1

    # Handle any remaining text
    if current_sentence:
        sentence = ' '.join(current_sentence)
        sentences.append(sentence)
        line_mappings.append((sentence, start_line, len(lines)))

    return sentences, line_mappings

def count_words(text):
    return len(text.split())

def create_mapping_file(chunk_mappings, output_dir):
    with open(output_dir / 'chunk_mapping.txt', 'w', encoding='utf-8') as f:
        f.write("Chunk Mapping Reference:\n")
        f.write("=" * 80 + "\n\n")

        for chunk_info in chunk_mappings:
            f.write(f"File: {chunk_info['filename']}\n")
            f.write(f"Lines: {chunk_info['start_line']} to {chunk_info['end_line']}\n")
            f.write(f"Preview: {chunk_info['preview'][:100]}...\n")
            f.write("-" * 80 + "\n\n")

def create_chunks(input_file, word_limit=50):
    output_dir = Path('text_chunks')
    output_dir.mkdir(exist_ok=True)

    with open(input_file, 'r', encoding='utf-8') as f:
        text = f.read()

    sentences, line_mappings = split_into_sentences(text)
    current_chunk = []
    current_word_count = 0
    chunk_number = 1
    long_sentence_number = 1
    chunk_mappings = []

    sentence_index = 0
    while sentence_index < len(sentences):
        sentence = sentences[sentence_index]
        sentence_word_count = count_words(sentence)
        current_mapping = line_mappings[sentence_index]

        # Handle sentences longer than word limit
        if sentence_word_count > word_limit:
            # Save any accumulated chunk first
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
                current_chunk = []
                current_word_count = 0

            # Save long sentence
            filename = f'chunk_plus_{word_limit}_{long_sentence_number}.txt'
            with open(output_dir / filename, 'w', encoding='utf-8') as f:
                f.write(sentence)
            chunk_mappings.append({
                'filename': filename,
                'start_line': current_mapping[1],
                'end_line': current_mapping[2],
                'preview': sentence
            })
            long_sentence_number += 1
            sentence_index += 1
            continue

        if current_word_count + sentence_word_count <= word_limit:
            current_chunk.append(sentence)
            current_word_count += sentence_word_count
        else:
            if current_chunk:
                chunk_text = ' '.join(current_chunk)
                filename = f'chunk_{chunk_number}.txt'
                with open(output_dir / filename, 'w', encoding='utf-8') as f:
                    f.write(chunk_text)
                chunk_mappings.append({
                    'filename': filename,
                    'start_line': line_mappings[sentence_index - len(current_chunk)][1],
                    'end_line': line_mappings[sentence_index - 1][2],
                    'preview': chunk_text
                })
                chunk_number += 1
            current_chunk = [sentence]
            current_word_count = sentence_word_count

        sentence_index += 1

    # Save any remaining chunk
    if current_chunk:
        chunk_text = ' '.join(current_chunk)
        filename = f'chunk_{chunk_number}.txt'
        with open(output_dir / filename, 'w', encoding='utf-8') as f:
            f.write(chunk_text)
        chunk_mappings.append({
            'filename': filename,
            'start_line': line_mappings[sentence_index - len(current_chunk)][1],
            'end_line': line_mappings[sentence_index - 1][2],
            'preview': chunk_text
        })

    # Create mapping reference file
    create_mapping_file(chunk_mappings, output_dir)

if __name__ == '__main__':
    input_file = 'data.txt'
    word_limit = 50  # default word limit

    try:
        create_chunks(input_file, word_limit)
        print("Text has been split into chunks in the 'text_chunks' directory")
        print("Check 'text_chunks/chunk_mapping.txt' for chunk locations")
    except Exception as e:
        print(f"An error occurred: {str(e)}")

Step 5: Clean your data & run

Examine the generated text files and review the results. 'chunk_mapping.txt' provides a detailed description of each chunk; move it out of the 'text_chunks' folder so Zonos does not process it.

Place the 'Speaker Audio (your_voice.mp3 or *.wav)' in the root directory of Zonos or specify the path in the command below.

In Terminal run:
python batch_process.py --text-dir ./text_chunks --output-dir ./output_audio --speaker-audio ./your_voice.mp3 --language en-us --seed 420

In my tests, this produces roughly 80 to 85% consistent voice tone across chunks.

@xdevfaheem

I think you could concatenate the audio chunks into a single waveform tensor and write it out as one final file.

@rzgarespo (Author)

> I think you could concatenate the audio chunks into a single waveform tensor and write it out as one final file.

Yes, but it depends on the use case. For projects like this one [https://www.youtube.com/watch?v=3OFkEtTFM84], it is practical to have the output in parts.
