Release OuteTTS v0.2.0
### Major Changes
- Added support for OuteTTS-0.2-500M model
- Introduced default speaker presets for each supported language
- **Breaking Changes**:
  - Incompatible speaker files from versions <0.2.0
  - Revised interface usage (see README.md)

### New Features
- Added voice cloning guidelines and interface usage in README.md
- Implemented Gradio example playground for OuteTTS-0.2-500M
- Multi-language alignment support
- Enhanced speaker management:
  - Methods: `print_default_speakers()` and `load_default_speaker(name)`
  - JSON format for speaker saving with language info
- Option to load WavTokenizer from custom path (fixes #24)
- Support for multiple interface version initialization

### Improvements
- Restructured library files for better organization
- Added hash verification for WavTokenizer downloads (fixes #3)
- Reworked interface for improved usability
- Made sounddevice optional with better error handling
- Included training data preparation examples

### Error Handling
- Improved validation for audio token detection
- Enhanced error messages for long inputs and EOS cases
- Better library-wide error handling and feedback
edwko committed Nov 25, 2024
1 parent d90c645 commit c2d413b
Showing 64 changed files with 15,954 additions and 208 deletions.
109 changes: 73 additions & 36 deletions README.md
# OuteTTS

[![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-OuteTTS_0.2_500M-blue)](https://huggingface.co/OuteAI/OuteTTS-0.2-500M)
[![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-OuteTTS_0.2_500M_GGUF-blue)](https://huggingface.co/OuteAI/OuteTTS-0.2-500M-GGUF)
[![HuggingFace](https://img.shields.io/badge/🤗%20Hugging%20Face-Demo_Space-pink)](https://huggingface.co/spaces/OuteAI/OuteTTS-0.2-500M-Demo)
[![PyPI](https://img.shields.io/badge/PyPI-OuteTTS-orange)](https://pypi.org/project/outetts/)

OuteTTS is an experimental text-to-speech model that uses a pure language modeling approach to generate speech, without architectural changes to the foundation model itself.

## Installation

Visit https://github.com/abetlen/llama-cpp-python for specific installation instructions.

### Interface Usage
```python
import outetts

# Configure the model
model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
)

# Initialize the interface
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

# Optional: Create a speaker profile (use a 10-15 second audio clip)
# speaker = interface.create_speaker(
#     audio_path="path/to/audio/file",
#     transcript="Transcription of the audio file."
# )

# Optional: Save and load speaker profiles
# interface.save_speaker(speaker, "speaker.json")
# speaker = interface.load_speaker("speaker.json")

# Optional: Load a speaker from the default presets
interface.print_default_speakers()
speaker = interface.load_default_speaker(name="male_1")

output = interface.generate(
    text="Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and it can be implemented in software or hardware products.",
    # Lower temperature values may result in a more stable tone,
    # while higher values can introduce varied and expressive speech
    temperature=0.1,
    repetition_penalty=1.1,
    max_length=4096,

    # Optional: Use a speaker profile for consistent voice characteristics
    # Without a speaker profile, the model will generate a voice with random characteristics
    speaker=speaker,
)

# Save the synthesized speech to a file
output.save("output.wav")

# Optional: Play the synthesized speech
# output.play()
```
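The model's 4096-token context caps a single `generate` call at roughly 54 seconds of audio. For longer scripts, one simple workaround (not part of the library — a hypothetical helper) is to split the text into sentence-sized chunks and call `generate` once per chunk:

```python
import re

def chunk_text(text: str, max_chars: int = 300) -> list[str]:
    """Greedily pack whole sentences into chunks of at most max_chars,
    so each generate() call stays well inside the context window."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + 1 + len(sentence) > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks

# Each chunk can then be passed to interface.generate() and the
# resulting audio segments concatenated.
```

The audio for each chunk can be saved separately or stitched together with any audio library; chunk boundaries at sentence ends keep the prosody natural.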

### Using GGUF Model
```python
# Configure the GGUF model
model_config = outetts.GGUFModelConfig_v1(
    model_path="local/path/to/model.gguf",
    language="en",  # Supported languages in v0.2: en, zh, ja, ko
    n_gpu_layers=0,
)

# Initialize the GGUF interface
interface = outetts.InterfaceGGUF(model_version="0.2", cfg=model_config)
```

### Creating a Speaker for Voice Cloning

To achieve the best results when creating a speaker profile, consider the following recommendations:

1. **Audio Clip Duration:**
- Use an audio clip of around **10-15 seconds**.
- This duration provides sufficient data for the model to learn the speaker's characteristics while keeping the input manageable. The model's context length is 4096 tokens, allowing it to generate around 54 seconds of audio in total. However, when a speaker profile is included, this capacity is reduced proportionally to the length of the speaker's audio clip.

2. **Audio Quality:**
- Ensure the audio is **clear and noise-free**. Background noise or distortions can reduce the model's ability to extract accurate voice features.

3. **Accurate Transcription:**
- Provide a highly **accurate transcription** of the audio clip. Mismatches between the audio and transcription can lead to suboptimal results.

4. **Speaker Familiarity:**
- The model performs best with voices that are similar to those seen during training. Using a voice that is **significantly different from typical training samples** (e.g., unique accents, rare vocal characteristics) might result in inaccurate replication.
- In such cases, you may need to **fine-tune the model** specifically on your target speaker's voice to achieve a better representation.

5. **Parameter Adjustments:**
- Adjust parameters like `temperature` in the `generate` function to refine the expressive quality and consistency of the synthesized voice.
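The context arithmetic above can be sketched numerically. This is a rough estimate, not a library function: the 75 tokens-per-second rate matches the codec rate used in `examples/v1/data_creation.py`, and text-token overhead is ignored.

```python
TOKENS_PER_SECOND = 75   # codec token rate used in examples/v1/data_creation.py
CONTEXT_LENGTH = 4096    # model context window

def remaining_speech_seconds(speaker_clip_seconds: float) -> float:
    """Rough upper bound on how much speech can still be generated
    once a speaker profile consumes part of the context."""
    used = int(speaker_clip_seconds * TOKENS_PER_SECOND)
    return max(CONTEXT_LENGTH - used, 0) / TOKENS_PER_SECOND

# With no speaker profile: about 54.6 s of capacity.
# With a 15 s reference clip: about 39.6 s remains.
```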

## Blogs
https://www.outeai.com/blog/OuteTTS-0.2-500M
https://www.outeai.com/blog/OuteTTS-0.1-350M


## Credits
- WavTokenizer: [GitHub Repository](https://github.com/jishengpeng/WavTokenizer)
- Decoder and encoder folder files are from this repository
- CTC Forced Alignment: [PyTorch Tutorial](https://pytorch.org/audio/stable/tutorials/ctc_forced_alignment_api_tutorial.html)
- Uroman: [GitHub Repository](https://github.com/isi-nlp/uroman)
- "This project uses the universal romanizer software 'uroman' written by Ulf Hermjakob, USC Information Sciences Institute (2015-2020)".
- mecab-python3: [GitHub Repository](https://github.com/SamuraiT/mecab-python3)
72 changes: 72 additions & 0 deletions examples/v1/data_creation.py
```python
import os
import polars as pl
import torch
from tqdm import tqdm
import outetts

df = pl.read_parquet("sample.parquet")

language = "en"
device = "cuda"

interface = outetts.InterfaceHF(
    model_version="0.2",
    cfg=outetts.HFModelConfig_v1(
        model_path="OuteAI/OuteTTS-0.2-500M",
        language=language,
    )
)

# The language model itself is not needed for data preparation; free its memory
del interface.model

ctc = outetts.CTCForcedAlignment([language], device)

def create_speaker(audio_path: str, transcript: str, language: str):
    words = ctc.align(audio_path, transcript, language)

    full_codes = interface.audio_codec.encode(
        interface.audio_codec.convert_audio_tensor(
            audio=torch.cat([i["audio"] for i in words], dim=1),
            sr=ctc.sample_rate
        ).to(interface.audio_codec.device)
    ).tolist()

    data = []
    start = 0
    for i in words:
        # Map the word-end time (in samples) to a codec token index at 75 tokens/s
        end = int(round((i["x1"] / ctc.sample_rate) * 75))
        word_tokens = full_codes[0][0][start:end]
        start = end
        if not word_tokens:
            word_tokens = [1]

        data.append({
            "word": i["word"],
            "duration": round(len(word_tokens) / 75, 2),
            "codes": word_tokens
        })

    return {
        "text": transcript,
        "words": data,
    }

data = []

for i in tqdm(df.to_dicts()):
    text = i["text"]
    language = i["language"]

    file = i["audio"]["path"]
    with open(file, 'wb') as f:
        f.write(i["audio"]["bytes"])

    data.append(interface.prompt_processor.get_training_prompt(
        text=text,
        language=language,
        speaker=create_speaker(file, text, language)
    ))

    os.remove(file)

pl.DataFrame({"data": data}).write_parquet("processed_data.parquet")
```
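The per-word token slicing inside `create_speaker` can be isolated into a small standalone helper. The names below are hypothetical; the 75 tokens-per-second rate and the `[1]` padding for words that map to zero tokens mirror the script above.

```python
def slice_word_codes(word_end_samples, full_codes, sample_rate, tokens_per_sec=75):
    """Split a flat list of codec tokens into per-word chunks, using
    CTC word-end times given in audio samples."""
    chunks, start = [], 0
    for x1 in word_end_samples:
        # Convert the word-end sample index to a codec token index
        end = int(round((x1 / sample_rate) * tokens_per_sec))
        chunk = full_codes[start:end]
        start = end
        chunks.append(chunk if chunk else [1])  # pad empty words with one token
    return chunks
```

Because token indices are derived from cumulative word-end times, the chunks partition the token stream without gaps or overlaps, which keeps per-word durations consistent with the full clip.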
160 changes: 160 additions & 0 deletions examples/v1/gradio_playground.py
```python
import os
import gradio as gr
import outetts
from outetts.version.v1.interface import _DEFAULT_SPEAKERS

model_config = outetts.HFModelConfig_v1(
    model_path="OuteAI/OuteTTS-0.2-500M",
    language="en",
)
interface = outetts.InterfaceHF(model_version="0.2", cfg=model_config)

def get_available_speakers(language):
    """Get available speakers for the selected language."""
    if language not in interface.languages:
        return []
    speakers = list(_DEFAULT_SPEAKERS[language].keys())
    speakers.insert(0, "None")
    return speakers

def change_interface_language(language):
    """Change interface language and update available speakers."""
    try:
        interface.change_language(language)
        speakers = get_available_speakers(language)
        return gr.update(choices=speakers, value="male_1"), gr.update(visible=True)
    except ValueError:
        return gr.update(choices=["None"], value="None"), gr.update(visible=False)

def generate_tts(
    text, temperature, repetition_penalty, language,
    speaker_selection, reference_audio, reference_text
):
    """Generate TTS with error handling and new features."""
    try:
        # Validate inputs for custom speaker
        if reference_audio and reference_text:
            if not os.path.exists(reference_audio):
                raise ValueError("Reference audio file not found")
            if not reference_text.strip():
                raise ValueError("Reference transcription text is required")
            speaker = interface.create_speaker(reference_audio, reference_text)

        # Use selected default speaker
        elif speaker_selection and speaker_selection != "None":
            speaker = interface.load_default_speaker(speaker_selection)

        # No speaker - random characteristics
        else:
            speaker = None

        # Generate audio
        output = interface.generate(
            text=text,
            speaker=speaker,
            temperature=temperature,
            repetition_penalty=repetition_penalty,
            max_length=4096
        )

        # Verify output
        if output.audio is None:
            raise ValueError("Model failed to generate audio. This may be due to input length constraints or early EOS token.")

        # Save and return output
        output_path = "output.wav"
        output.save(output_path)
        return output_path, None

    except Exception as e:
        return None, str(e)

with gr.Blocks() as demo:
    gr.Markdown("# OuteTTS-0.2-500M Text-to-Speech Demo")

    error_box = gr.Textbox(label="Error Messages", visible=False)

    with gr.Row():
        with gr.Column():
            # Language selection
            language_dropdown = gr.Dropdown(
                choices=list(interface.languages),
                value="en",
                label="Interface Language"
            )

            # Speaker selection
            speaker_dropdown = gr.Dropdown(
                choices=get_available_speakers("en"),
                value="male_1",
                label="Speaker Selection"
            )

            text_input = gr.Textbox(
                label="Text to Synthesize",
                placeholder="Enter text here..."
            )

            temperature = gr.Slider(
                0.1, 1.0,
                value=0.1,
                label="Temperature (lower = more stable tone, higher = more expressive)"
            )

            repetition_penalty = gr.Slider(
                0.5, 2.0,
                value=1.1,
                label="Repetition Penalty"
            )

            gr.Markdown("""
            ### Voice Cloning Guidelines:
            - Use 10-15 seconds of clear, noise-free audio
            - Provide accurate transcription
            - Longer audio clips will reduce maximum output length
            - Custom speaker overrides speaker selection
            """)

            reference_audio = gr.Audio(
                label="Reference Audio (for voice cloning)",
                type="filepath"
            )

            reference_text = gr.Textbox(
                label="Reference Transcription Text",
                placeholder="Enter exact transcription of reference audio"
            )

            submit_button = gr.Button("Generate Speech")

        with gr.Column():
            audio_output = gr.Audio(
                label="Generated Audio",
                type="filepath"
            )

    language_dropdown.change(
        fn=change_interface_language,
        inputs=[language_dropdown],
        outputs=[speaker_dropdown, speaker_dropdown]
    )

    submit_button.click(
        fn=generate_tts,
        inputs=[
            text_input,
            temperature,
            repetition_penalty,
            language_dropdown,
            speaker_dropdown,
            reference_audio,
            reference_text
        ],
        outputs=[audio_output, error_box]
    ).then(
        fn=lambda x: gr.update(visible=bool(x)),
        inputs=[error_box],
        outputs=[error_box]
    )

demo.launch()
```
Binary file added examples/v1/sample.parquet
8 changes: 8 additions & 0 deletions examples/v1/train.md
# Training Instructions

The model can be trained similarly to other transformer-based models. An example for preparing datasets is included in `examples/v1/data_creation.py`. After generating the dataset, you can begin training using your preferred library. Below are some suggested libraries for tasks like supervised fine-tuning (SFT):

- [Hugging Face's SFT Trainer](https://huggingface.co/docs/trl/sft_trainer)
- [TorchTune](https://github.com/pytorch/torchtune)

Refer to the respective documentation for detailed setup and instructions.
4 changes: 4 additions & 0 deletions outetts/__init__.py
```python
__version__ = "0.2.0"

from .interface import InterfaceHF, InterfaceGGUF, display_available_models
from .interface import HFModelConfig_v1, GGUFModelConfig_v1
from .version.v1.alignment import CTCForcedAlignment
```
