feat(irodori-tts): add v2 model support with VoiceDesign and chunked DACVAE decode#660

Open
yoshphys wants to merge 4 commits into Blaizzy:main from yoshphys:feature/irodori-tts

@yoshphys
Contributor

Summary

  • Add Irodori-TTS v2 support (latent_dim=32, Semantic-DACVAE) alongside existing v1
  • Add VoiceDesign variant: caption/voice-description conditioning instead of reference audio
  • Accept instruct kwarg as alias for caption (mlx-audio convention)
  • Use chunked DACVAE decoding (chunk_size=50) to stay within 16 GB memory limits on Apple Silicon
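The chunked decoding idea in the last bullet can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `decode_in_chunks` and `toy_decode` are hypothetical names, the latents are plain Python lists rather than MLX arrays, and the sketch assumes chunks can be decoded independently and concatenated. A real convolutional decoder may need overlap or cross-fading at chunk boundaries, which is not shown here.

```python
from typing import Callable, List, Sequence

Frame = List[float]  # one latent frame (e.g. 32 values for v2)


def decode_in_chunks(
    latents: Sequence[Frame],
    decode_fn: Callable[[Sequence[Frame]], List[float]],
    chunk_size: int = 50,
) -> List[float]:
    """Decode latent frames chunk by chunk and concatenate the samples.

    Only `chunk_size` frames are passed to the decoder at a time, so the
    decoder's peak activation memory scales with the chunk, not the clip.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    audio: List[float] = []
    for start in range(0, len(latents), chunk_size):
        chunk = latents[start : start + chunk_size]
        audio.extend(decode_fn(chunk))
    return audio


def toy_decode(frames: Sequence[Frame]) -> List[float]:
    """Hypothetical stand-in decoder: each frame becomes 4 samples."""
    samples: List[float] = []
    for frame in frames:
        samples.extend([sum(frame)] * 4)
    return samples


# 120 frames of a 32-dim latent, decoded in chunks of 50, 50 and 20 frames.
latents = [[float(i)] * 32 for i in range(120)]
chunked = decode_in_chunks(latents, toy_decode, chunk_size=50)
full = toy_decode(latents)
```

With a frame-local toy decoder the chunked and full outputs are identical; for a real decoder the two would only match if boundary effects are handled.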

Uploaded models

| Model | HuggingFace |
| --- | --- |
| fp16 | mlx-community/Irodori-TTS-500M-v2-fp16 |
| 8bit | mlx-community/Irodori-TTS-500M-v2-8bit |
| 4bit | mlx-community/Irodori-TTS-500M-v2-4bit |
| VoiceDesign fp16 | mlx-community/Irodori-TTS-500M-v2-VoiceDesign-fp16 |
| VoiceDesign 8bit | mlx-community/Irodori-TTS-500M-v2-VoiceDesign-8bit |
| VoiceDesign 4bit | mlx-community/Irodori-TTS-500M-v2-VoiceDesign-4bit |

Audio comparison

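Text: 「その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。」
(English: "There was an old legend about that forest. On the night the moon rises highest, it is said, you can hear the wind singing if you listen quietly. I only half believed it, but that night I did hear a voice calling me.")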

| Source | Audio |
| --- | --- |
| Original (Aratako/Irodori-TTS-500M-v2, PyTorch) | standard_sample2.wav |
| MLX port (fp16) | standard_sample2_mlx.wav |

Usage

Standard (voice cloning)

from mlx_audio.tts.generate import generate_audio

generate_audio(
    model="mlx-community/Irodori-TTS-500M-v2-fp16",
    text="こんにちは、テストです。",
    ref_audio="path/to/reference.wav",
    file_prefix="output",
)

VoiceDesign (voice description)

generate_audio(
    model="mlx-community/Irodori-TTS-500M-v2-VoiceDesign-fp16",
    text="こんにちは、テストです。",
    instruct="穏やかで落ち着いた女性の声。ゆっくりと話す。",
    file_prefix="output",
)
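How the `instruct` alias might be resolved to `caption` can be sketched as below. `resolve_caption` is a hypothetical helper, not a function from mlx-audio, and raising an error when both arguments are given with different values is an assumption of this sketch; the PR does not say which one takes precedence.

```python
from typing import Optional


def resolve_caption(
    caption: Optional[str] = None,
    instruct: Optional[str] = None,
) -> Optional[str]:
    """Return the voice description, treating `instruct` as an alias for `caption`."""
    # Assumption: conflicting values are rejected rather than silently prioritised.
    if caption is not None and instruct is not None and caption != instruct:
        raise ValueError("pass either 'caption' or 'instruct', not two different values")
    return caption if caption is not None else instruct


# Both spellings resolve to the same conditioning text.
via_instruct = resolve_caption(instruct="穏やかで落ち着いた女性の声。")
via_caption = resolve_caption(caption="穏やかで落ち着いた女性の声。")
```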

🤖 Generated with Claude Code

- # Audio latent dimensions (DACVAE: 128-dim, 48kHz)
- latent_dim: int = 128
+ # Audio latent dimensions (v2: 32-dim Semantic-DACVAE, v1: 128-dim DACVAE)
+ latent_dim: int = 32
Collaborator

Will this break the v1 model?

Contributor Author

No, it will not.
The default value of latent_dim in IrodoriDiTConfig is used only as a fallback when no config is provided (e.g., in unit tests).
When loading a real model, ModelConfig.from_dict() reads the value from the model's config.json, which explicitly specifies "latent_dim": 128 for v1 and "latent_dim": 32 for v2.
So the default never affects actual model loading.
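The behaviour described above can be illustrated with a small stand-alone sketch. `DiTConfigSketch` is a hypothetical, single-field stand-in for `IrodoriDiTConfig`, and its `from_dict` is an assumed implementation of the usual "keep only known fields" pattern, not the actual mlx-audio code.

```python
from dataclasses import dataclass, fields
from typing import Any, Dict


@dataclass
class DiTConfigSketch:
    # Default is only used when the caller supplies no value.
    latent_dim: int = 32

    @classmethod
    def from_dict(cls, params: Dict[str, Any]) -> "DiTConfigSketch":
        # Keep only keys that match dataclass fields; explicit values win over defaults.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in params.items() if k in known})


# Values as they would appear in each model's config.json, per the reply above.
v1 = DiTConfigSketch.from_dict({"latent_dim": 128})
v2 = DiTConfigSketch.from_dict({"latent_dim": 32})
# No config at all (e.g. a unit test): the class default applies.
fallback = DiTConfigSketch()
```

Under these assumptions, changing the default from 128 to 32 only changes `fallback`; a v1 checkpoint whose config.json sets `"latent_dim": 128` still loads with 128.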

@lucasnewman
Collaborator

@yoshphys Can you update the README.md for the model with the supported model repo ids, and make sure you run the formatter? pre-commit run --all

yoshphys and others added 4 commits April 21, 2026 08:00
…DACVAE decode

- Support Irodori-TTS v2 (latent_dim=32, Semantic-DACVAE) alongside v1
- Add VoiceDesign caption conditioning (use_caption_condition=True)
- Accept instruct kwarg as alias for caption in model.generate()
- Use chunked DACVAE decoding (chunk_size=50) to stay within 16GB memory limits

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e compat

- Fix _FakeDACVAE.decode() to accept **kwargs (chunk_size support)
- Add _small_irodori_dit_config_voicedesign() and model config helpers
- Add TestIrodoriVoiceDesignShapes: forward pass with caption conditioning
- Add TestIrodoriVoiceDesignGenerate: generate() with caption and instruct alias

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…fix formatting

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@yoshphys force-pushed the feature/irodori-tts branch from 2328e1b to f56f57c on April 20, 2026 23:01