Path D — Reference Audio

Path D is the most promising path for voice cloning. It uses a reference audio clip to guide both prompt generation (via timbre captioning) and final voice matching (via voice conversion). The result sounds like the reference speaker performing with the sampled emotional and stylistic direction.

Why Two Stages?

Passing voice_ref directly to DramaBox TTS leads to unstable, garbled generations. Instead, Path D uses a two-stage approach:

1. Text-only DramaBox TTS: guided by a timbre-whisper caption that describes the reference speaker's vocal qualities
2. Chatterbox Voice Conversion: converts the TTS output to match the reference speaker's actual voice

This produces stable, high-quality voice matching while preserving the emotional performance.
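The two-stage flow can be sketched as a small composition of two callables. This is a minimal illustration only: `PathDPipeline`, `synthesize`, and `convert_voice` are hypothetical names, and the stand-in functions below do not reflect the real DramaBox or Chatterbox interfaces, which this document does not show.

```python
from dataclasses import dataclass
from typing import Callable

Audio = bytes  # placeholder audio type for this sketch


@dataclass
class PathDPipeline:
    # Both callables are hypothetical stand-ins, not real DramaBox/Chatterbox APIs.
    synthesize: Callable[[str], Audio]              # stage 1: text-only TTS
    convert_voice: Callable[[Audio, Audio], Audio]  # stage 2: (source, reference) -> converted

    def run(self, script: str, reference: Audio) -> Audio:
        # Stage 1: the reference clip is deliberately NOT passed to TTS.
        performance = self.synthesize(script)
        # Stage 2: only the voice conversion step sees the reference audio.
        return self.convert_voice(performance, reference)


# Fake stages so the sketch runs without any model installed.
def fake_tts(script: str) -> Audio:
    return b"tts:" + script.encode("utf-8")


def fake_vc(source: Audio, reference: Audio) -> Audio:
    return source + b"|vc:" + reference


pipeline = PathDPipeline(synthesize=fake_tts, convert_voice=fake_vc)
result = pipeline.run("Hello there.", b"ref-clip")
```

Keeping the reference out of stage 1 mirrors the motivation above: the TTS step stays text-only, and voice identity is applied afterwards.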

Sampling Strategy

1. Reference Audio: Load a reference clip (5-30 seconds of clean, single-speaker audio).
2. Timbre Caption: Generate or load a timbre description via laion/timbre-whisper.
   - Example: "A warm, resonant baritone with slight breathiness and a smooth, unhurried delivery"
3. Situation-Dependent VoiceNet Dims: Filter the 57 VoiceNet dimensions to exclude identity-related ones (age, gender, timbre, resonance), then sample 5 from the remaining situation-dependent dimensions.
4. Emotions: Sample 1-3 emotions from EmoNet, each with an intensity.
5. Tempo: Sampled from VoiceNet.
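The sampling steps might look roughly like the sketch below. The dimension, emotion, and tempo names are invented placeholders rather than the real VoiceNet or EmoNet vocabularies, and the 0-1 intensity scale is an assumption; only the counts (5 dimensions, 1-3 emotions) come from the description above.

```python
import random


def sample_direction(situation_dims, emotions, tempos, rng, n_dims=5):
    """Sample one performance direction from already-filtered pools."""
    n_emotions = rng.randint(1, 3)  # 1-3 emotions, inclusive
    return {
        "dims": rng.sample(situation_dims, n_dims),
        # Intensity on a 0-1 scale is an assumption of this sketch.
        "emotions": [
            (name, round(rng.uniform(0.0, 1.0), 2))
            for name in rng.sample(emotions, n_emotions)
        ],
        "tempo": rng.choice(tempos),
    }


# Placeholder pools (hypothetical names, for illustration only).
SITUATION_DIMS = ["pacing", "intensity", "formality", "breath_control",
                  "articulation", "volume", "pitch_variation"]
EMOTIONS = ["joy", "sadness", "anger", "awe", "relief"]
TEMPOS = ["slow", "moderate", "fast"]

rng = random.Random(0)  # seeded so runs are reproducible
direction = sample_direction(SITUATION_DIMS, EMOTIONS, TEMPOS, rng)
```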

Why Filter VoiceNet Dimensions?

Identity-related dimensions (voice age, perceived gender, timbre qualities, resonance placement) would conflict with the reference speaker's actual characteristics. Only situation-dependent dimensions (emotional delivery, pacing, intensity, speaking style) are sampled — these describe how the speaker performs, not who they are.
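As a rough illustration of the filter, assume each dimension carries a category label. The labels and dimension names here are hypothetical; the document does not specify how VoiceNet tags its 57 dimensions, only that age, gender, timbre, and resonance dimensions are excluded.

```python
# Categories treated as speaker identity (from the description above).
IDENTITY_CATEGORIES = {"age", "gender", "timbre", "resonance"}


def situation_dependent(dimensions: dict[str, str]) -> list[str]:
    """Return the dimension names whose category is not identity-related."""
    return [
        name
        for name, category in dimensions.items()
        if category not in IDENTITY_CATEGORIES
    ]


# Hypothetical mapping of dimension name -> category, for illustration only.
example_dims = {
    "voice_age": "age",
    "perceived_gender": "gender",
    "brightness": "timbre",
    "chest_resonance": "resonance",
    "pacing": "delivery",
    "emotional_intensity": "delivery",
    "speaking_style": "style",
}

kept = situation_dependent(example_dims)
```

With this example mapping, only `pacing`, `emotional_intensity`, and `speaking_style` remain available for sampling.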

LLM Prompt Construction

The prompt includes:

- Timbre caption: describes the reference speaker's vocal qualities
- Sampled situation-dependent dimensions: performance attributes
- Sampled emotions: emotional coloring
- Language: target language for dialogue

Gemma 4 generates a DramaBox script that is vocally consistent with the timbre description while incorporating the sampled performance attributes.
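A prompt assembled from these four components could be sketched as follows. The wording of the instructions is invented for illustration; the actual prompt template sent to the LLM is not shown in this document, and `build_prompt` is a hypothetical helper.

```python
def build_prompt(timbre_caption, dims, emotions, language):
    """Assemble an LLM prompt from the four components listed above."""
    emotion_text = ", ".join(
        f"{name} (intensity {intensity})" for name, intensity in emotions
    )
    lines = [
        # The instruction wording is an assumption of this sketch.
        "Write a DramaBox script for a single speaker.",
        f"Speaker timbre: {timbre_caption}",
        "Performance attributes: " + ", ".join(dims),
        "Emotions: " + emotion_text,
        f"Language: {language}",
        "Keep the script vocally consistent with the timbre description.",
    ]
    return "\n".join(lines)


prompt = build_prompt(
    timbre_caption="A warm, resonant baritone with slight breathiness",
    dims=["pacing", "speaking_style"],
    emotions=[("relief", 0.7)],
    language="English",
)
```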

Audio Processing

1. DramaBox TTS: Text-only synthesis (no voice_ref parameter); the timbre caption in the script guides the vocal quality.
2. Chatterbox Voice Conversion: Convert the TTS output to match the reference speaker.
3. Best-of-N Scoring: Score each candidate by WER and content enjoyment, then select the best one.
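The best-of-N selection can be sketched with a plain word-level WER and a combined score. How the two signals are actually weighted is not stated here, so the formula `enjoyment - wer_weight * WER` and the 0-1 enjoyment scale are assumptions of this example; the candidate transcripts would come from an ASR pass that is not shown.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))  # distances for an empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def select_best(script_text, candidates, wer_weight=1.0):
    """Pick the candidate with the highest (enjoyment - weighted WER)."""
    def score(candidate):
        penalty = wer_weight * wer(script_text, candidate["transcript"])
        return candidate["enjoyment"] - penalty
    return max(candidates, key=score)


script_text = "the cat sat"
candidates = [
    {"id": "a", "transcript": "the cat sat", "enjoyment": 0.6},
    {"id": "b", "transcript": "the bat", "enjoyment": 0.9},
]
best = select_best(script_text, candidates)
```

In this toy case candidate `b` is rated more enjoyable but drops and garbles words, so the WER penalty makes candidate `a` the winner.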

Demo

Listen to Path D samples in the main demo grid.
