feat(irodori-tts): add v2 model support with VoiceDesign and chunked DACVAE decode by yoshphys · Pull Request #660 · Blaizzy/mlx-audio

yoshphys · 2026-04-20T13:49:20Z

Summary

Add Irodori-TTS v2 support (latent_dim=32, Semantic-DACVAE) alongside existing v1
Add VoiceDesign variant: caption/voice-description conditioning instead of reference audio
Accept instruct kwarg as alias for caption (mlx-audio convention)
Use chunked DACVAE decoding (chunk_size=50) to stay within 16 GB memory limits on Apple Silicon
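
The chunked decoding in the last item can be illustrated with a minimal sketch. It assumes the decoder maps a sequence of latent frames to a flat list of samples and that chunks can be decoded independently and concatenated. `fake_decode`, `decode_in_chunks`, and `HOP` are hypothetical stand-ins, not the PR's DACVAE code, which may treat chunk boundaries differently.

```python
from typing import Callable, List, Sequence

Frame = Sequence[float]
HOP = 4  # hypothetical number of samples produced per latent frame


def fake_decode(latents: Sequence[Frame]) -> List[float]:
    """Stand-in decoder: each latent frame becomes HOP identical samples."""
    samples: List[float] = []
    for frame in latents:
        samples.extend([sum(frame) / len(frame)] * HOP)
    return samples


def decode_in_chunks(
    latents: Sequence[Frame],
    decode: Callable[[Sequence[Frame]], List[float]],
    chunk_size: int = 50,
) -> List[float]:
    """Decode chunk_size frames at a time and concatenate the audio.

    Only one chunk is passed to the decoder at once, so the decoder's
    working memory depends on chunk_size rather than the full length.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    audio: List[float] = []
    for start in range(0, len(latents), chunk_size):
        audio.extend(decode(latents[start : start + chunk_size]))
    return audio


# 120 frames of 32-dim latents (the v2 latent_dim), decoded 50 frames at a time.
latents = [[float(i + j) for j in range(32)] for i in range(120)]
audio = decode_in_chunks(latents, fake_decode, chunk_size=50)
```

With this toy decoder the chunked result matches a single full-length decode. A real convolutional decoder may need overlap or context at chunk edges to avoid audible seams.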

Uploaded models

Variant	Hugging Face repo ID
fp16	mlx-community/Irodori-TTS-500M-v2-fp16
8bit	mlx-community/Irodori-TTS-500M-v2-8bit
4bit	mlx-community/Irodori-TTS-500M-v2-4bit
VoiceDesign fp16	mlx-community/Irodori-TTS-500M-v2-VoiceDesign-fp16
VoiceDesign 8bit	mlx-community/Irodori-TTS-500M-v2-VoiceDesign-8bit
VoiceDesign 4bit	mlx-community/Irodori-TTS-500M-v2-VoiceDesign-4bit

Audio comparison

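Text: 「その森には、古い言い伝えがありました。月が最も高く昇る夜、静かに耳を澄ませば、風の歌声が聞こえるというのです。私は半信半疑でしたが、その夜、確かに誰かが私を呼ぶ声を聞いたのです。」
(English: "There was an old legend about that forest. It was said that on the night when the moon rises highest, if you listen quietly, you can hear the wind singing. I was half in doubt, but that night I did hear a voice calling me.")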

Source	Audio
Original (Aratako/Irodori-TTS-500M-v2, PyTorch)	standard_sample2.wav
MLX port (fp16)	standard_sample2_mlx.wav

Usage

Standard (voice cloning)

from mlx_audio.tts.generate import generate_audio

generate_audio(
    model="mlx-community/Irodori-TTS-500M-v2-fp16",
    text="こんにちは、テストです。",  # "Hello, this is a test."
    ref_audio="path/to/reference.wav",
    file_prefix="output",
)

VoiceDesign (voice description)

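from mlx_audio.tts.generate import generate_audio

generate_audio(
    model="mlx-community/Irodori-TTS-500M-v2-VoiceDesign-fp16",
    text="こんにちは、テストです。",  # "Hello, this is a test."
    instruct="穏やかで落ち着いた女性の声。ゆっくりと話す。",  # "A calm, composed female voice. Speaks slowly."
    file_prefix="output",
)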
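
The `instruct` argument used above is described in the summary as an alias for `caption`. A minimal sketch of how such an alias could be resolved follows. The helper name `resolve_caption` and the choice to prefer `caption` when both are given are assumptions for illustration, not the PR's actual logic.

```python
from typing import Optional


def resolve_caption(
    caption: Optional[str] = None,
    instruct: Optional[str] = None,
) -> Optional[str]:
    """Return the voice description from either keyword.

    Assumption: an explicit `caption` wins over `instruct` if both are passed.
    """
    return caption if caption is not None else instruct


# Callers following the mlx-audio convention pass only `instruct`.
description = resolve_caption(instruct="A calm, composed female voice.")
```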

🤖 Generated with Claude Code

lucasnewman · 2026-04-20T18:30:42Z

-    # Audio latent dimensions (DACVAE: 128-dim, 48kHz)
-    latent_dim: int = 128
+    # Audio latent dimensions (v2: 32-dim Semantic-DACVAE, v1: 128-dim DACVAE)
+    latent_dim: int = 32


Will this break the v1 model?

yoshphys · 2026-04-20

No, it will not.
The default value of latent_dim in IrodoriDiTConfig is used only as a fallback when no config is provided (e.g., in unit tests).
When loading a real model, ModelConfig.from_dict() reads the value from the model's config.json, which explicitly specifies "latent_dim": 128 for v1 and "latent_dim": 32 for v2.
So the default never affects actual model loading.
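
The fallback behaviour described in this reply can be sketched with a plain dataclass. `DiTConfigSketch` is a hypothetical stand-in for `IrodoriDiTConfig`, and its `from_dict` only approximates what `ModelConfig.from_dict()` is described as doing. The real classes have more fields and may differ in detail.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class DiTConfigSketch:
    """Hypothetical stand-in for IrodoriDiTConfig."""

    latent_dim: int = 32  # default, used only when no value is supplied

    @classmethod
    def from_dict(cls, params: dict) -> "DiTConfigSketch":
        # Keep only keys that match declared fields; explicit values
        # from config.json override the dataclass defaults.
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in params.items() if k in known})


# Values as they would appear in each model's config.json, per the reply above.
v1 = DiTConfigSketch.from_dict(json.loads('{"latent_dim": 128}'))
v2 = DiTConfigSketch.from_dict(json.loads('{"latent_dim": 32}'))
fallback = DiTConfigSketch()  # no config supplied, e.g. in a unit test
```

Here `v1.latent_dim` is 128 regardless of the default, which is the point the reply makes: changing the default affects only code paths that construct the config without a `config.json`.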

lucasnewman · 2026-04-20T18:31:36Z

@yoshphys Can you update the README.md for the model with the supported model repo ids, and make sure you run the formatter? pre-commit run --all

…DACVAE decode
- Support Irodori-TTS v2 (latent_dim=32, Semantic-DACVAE) alongside v1
- Add VoiceDesign caption conditioning (use_caption_condition=True)
- Accept instruct kwarg as alias for caption in model.generate()
- Use chunked DACVAE decoding (chunk_size=50) to stay within 16 GB memory limits
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


…e compat
- Fix _FakeDACVAE.decode() to accept **kwargs (chunk_size support)
- Add _small_irodori_dit_config_voicedesign() and model config helpers
- Add TestIrodoriVoiceDesignShapes: forward pass with caption conditioning
- Add TestIrodoriVoiceDesignGenerate: generate() with caption and instruct alias
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>


lucasnewman reviewed Apr 20, 2026


yoshphys and others added 4 commits April 21, 2026 08:00

style(irodori-tts): fix black formatting in config.py

6f50f11

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

docs(irodori-tts): update README with v2/VoiceDesign model repo IDs; …

f56f57c

…fix formatting Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

yoshphys force-pushed the feature/irodori-tts branch from 2328e1b to f56f57c · April 20, 2026 23:01


feat(irodori-tts): add v2 model support with VoiceDesign and chunked DACVAE decode #660
yoshphys wants to merge 4 commits into Blaizzy:main from yoshphys:feature/irodori-tts


2 participants
