Path D is the most promising path for voice cloning. It uses a reference audio clip to guide both prompt generation (via timbre captioning) and final voice matching (via voice conversion). The result sounds like the reference speaker performing with the sampled emotional and stylistic direction.
Passing voice_ref directly to DramaBox TTS leads to unstable, garbled generations. Instead, Path D uses a two-stage approach:
- Text-only DramaBox TTS — guided by a timbre whisper caption that describes the reference speaker's vocal qualities
- Chatterbox Voice Conversion — converts the TTS output to match the reference speaker's actual voice
This produces stable, high-quality voice matching while preserving the emotional performance.
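The two-stage flow can be sketched as a small orchestration function. This is only an illustration: `path_d_synthesize`, the `tts` and `voice_convert` callables, and the `Audio` type are hypothetical placeholders, not the actual DramaBox or Chatterbox APIs. The point it shows is that the reference clip never reaches the TTS stage and is used only during voice conversion.

```python
from typing import Callable

Audio = bytes  # placeholder audio type for this sketch (assumption)


def path_d_synthesize(
    script: str,
    reference_audio: Audio,
    tts: Callable[[str], Audio],
    voice_convert: Callable[[Audio, Audio], Audio],
) -> Audio:
    """Run text-only TTS, then convert the result toward the reference voice.

    `tts` and `voice_convert` are hypothetical wrappers supplied by the caller;
    they stand in for DramaBox TTS and Chatterbox Voice Conversion.
    """
    # Stage 1: text-only synthesis. The reference clip is deliberately
    # not passed here, since the script already carries the timbre caption.
    performance = tts(script)
    # Stage 2: voice conversion is the only step that sees the reference clip.
    return voice_convert(performance, reference_audio)
```

Passing the two stages in as callables keeps the sketch independent of any particular model wrapper and makes the flow easy to exercise with stubs.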
- Reference Audio: Load a reference clip (5-30 seconds of clean, single-speaker audio)
- Timbre Caption: Generate or load a timbre description via laion/timbre-whisper
  - Example: "A warm, resonant baritone with slight breathiness and a smooth, unhurried delivery"
- Situation-Dependent VoiceNet Dims: Filter the 57 VoiceNet dimensions to exclude identity-related ones (age, gender, timbre, resonance). Sample 5 from the remaining situation-dependent dimensions.
- Emotions: Sample 1-3 emotions from EmoNet with intensity
- Tempo: Sampled from VoiceNet
Identity-related dimensions (voice age, perceived gender, timbre qualities, resonance placement) would conflict with the reference speaker's actual characteristics. Only situation-dependent dimensions (emotional delivery, pacing, intensity, speaking style) are sampled — these describe how the speaker performs, not who they are.
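A minimal sketch of the filtering and sampling step follows. The dimension names in `EXAMPLE_DIMS` and `IDENTITY_DIMS` are hypothetical stand-ins, since the real 57 VoiceNet dimension names are not listed here, and `sample_situation_dims` is an illustrative helper rather than part of any library.

```python
import random
from typing import List, Optional, Sequence, Set

# Hypothetical dimension names for illustration only.
IDENTITY_DIMS: Set[str] = {
    "voice_age",
    "perceived_gender",
    "timbre_brightness",
    "resonance_placement",
}
EXAMPLE_DIMS: List[str] = [
    "voice_age", "perceived_gender", "timbre_brightness", "resonance_placement",
    "pacing", "intensity", "emotional_delivery", "speaking_style",
    "articulation_precision", "pause_frequency", "vocal_tension",
]


def sample_situation_dims(
    all_dims: Sequence[str],
    identity_dims: Set[str],
    k: int = 5,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Drop identity-related dimensions, then sample k of the rest without replacement."""
    rng = rng or random.Random()
    candidates = [d for d in all_dims if d not in identity_dims]
    if len(candidates) < k:
        raise ValueError(f"need at least {k} situation-dependent dimensions")
    return rng.sample(candidates, k)
```

Filtering before sampling guarantees that no identity-related dimension can appear in the prompt, whatever the random draw.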
The prompt includes:
- Timbre caption — describes the reference speaker's vocal qualities
- Sampled situation-dependent dimensions — performance attributes
- Sampled emotions — emotional coloring
- Language — target language for dialogue
Gemma 4 generates a DramaBox script that is vocally consistent with the timbre description while incorporating the sampled performance attributes.
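One way the four prompt components might be combined into a single instruction for the script-writing model is sketched below. The wording, field labels, and the `build_script_prompt` helper are assumptions made for illustration, as is the representation of each emotion as a `(name, intensity)` pair. The actual prompt template is not shown in this document.

```python
from typing import Sequence, Tuple


def build_script_prompt(
    timbre_caption: str,
    dims: Sequence[str],
    emotions: Sequence[Tuple[str, str]],
    language: str,
) -> str:
    """Assemble a plain-text prompt from the four components (hypothetical layout)."""
    emotion_text = ", ".join(f"{name} ({intensity})" for name, intensity in emotions)
    lines = [
        "Write a DramaBox script for a single speaker.",
        f"Speaker timbre: {timbre_caption}",
        "Performance attributes: " + ", ".join(dims),
        f"Emotions: {emotion_text}",
        f"Language: {language}",
    ]
    return "\n".join(lines)
```

For example, `build_script_prompt("A warm, resonant baritone", ["pacing", "intensity"], [("joy", "moderate")], "English")` yields a five-line prompt with one line per component.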
- DramaBox TTS: Text-only synthesis (no voice_ref parameter); the timbre caption in the script guides the vocal quality
- Chatterbox Voice Conversion: Convert TTS output to match reference speaker
- Best-of-N Scoring: Score by WER + content enjoyment, select best candidate
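The best-of-N selection can be sketched as follows. WER is computed here as the word-level edit distance divided by the number of reference words, which is the standard definition. How WER and the content-enjoyment score are actually combined is not specified in this document, so the linear formula and `wer_weight` parameter below are assumptions, as are the `Candidate` class and the idea that each candidate already carries an ASR transcript and an enjoyment score.

```python
from dataclasses import dataclass
from typing import Sequence


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances from an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


@dataclass
class Candidate:
    transcript: str   # ASR transcript of the generated audio (assumed available)
    enjoyment: float  # content-enjoyment score, higher is better (assumed scale)


def select_best(
    target_text: str, candidates: Sequence[Candidate], wer_weight: float = 1.0
) -> Candidate:
    """Pick the candidate maximizing enjoyment minus weighted WER (assumed formula)."""
    return max(
        candidates,
        key=lambda c: c.enjoyment - wer_weight * word_error_rate(target_text, c.transcript),
    )
```

With this weighting, a candidate that drops a word can lose to a slightly less enjoyable candidate that reproduces the script exactly.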
Listen to Path D samples in the main demo grid.