The first voice acting pipeline with open-weights components and open post-training data that combines zero-shot voice cloning with natural-language performance direction. Vocalino lets you provide a reference voice (or generate one from scratch) and direct how each line is performed with free-form text instructions. The generated speech stays consistent with the reference voice while following your emotional and stylistic prompts, giving you control over both the actor and the performance without any model training.
End-to-end voice prompt generation and audio synthesis using the DramaBox TTS model (22B DiT transformer) and structured voice taxonomy sampling. Based on the voice taxonomy research from Schuhmann et al., 2025.
This pipeline generates richly annotated voice performance prompts in the DramaBox format — single-speaker scenes with stage directions (English) and spoken dialogue (target language) — then synthesizes them into audio. Each prompt is procedurally constructed by sampling from structured taxonomies, then expanded by an LLM (Gemma 4 E4B-it) into a full performance script.
The pipeline supports 12 generation paths organized into three families. Each path uses a different sampling strategy to produce diverse voice acting data.
| Path | Sampling | Description | Details |
|---|---|---|---|
| A (VoiceNet) | 57 VoiceNet dims + EmoNet + Vocal Bursts | Full taxonomy sampling: 3 mandatory dims (Tempo, Gender, Age) + 5 random, 1-3 emotions, flow style, mandatory words | Path A Details |
| B (Archetype) | 920 archetypes (92 genres × 10 each) | Genre/character archetype-based: random archetype + emotions + Tempo/Arousal | Path B Details |
| C (Archetype Named) | Same as B + explicit naming | Archetype with explicit role naming in the DramaBox script (e.g. "a battle-hardened noble knight") | Path C Details |
| D (Reference Audio) | Timbre whisper + VoiceNet + Chatterbox VC | Reference audio pipeline: timbre caption guides prompt, DramaBox TTS + voice conversion to match reference speaker | Path D Details |
| AC (Acting Challenge) | 1478 acting challenges + VoiceNet gender/age | Audition-style method acting from challenge scenarios — naturalistic, genuine, dynamic emotional arc | AC Details |
All CC paths generate two scenes with the same speaker in contrasting emotional states, separated by a "CUT TO:" marker. The speaker's fundamental voice (age, gender, timbre) stays identical — only the emotional delivery changes. Audio is later split into Scene 1 / Scene 2 using Qwen3-ASR word-level timestamps.
| Path | Sampling | Key Improvement | Details |
|---|---|---|---|
| CC-A (VoiceNet) | VoiceNet + contrasting emotions | Original two-scene format | CC Details |
| CC-B (Archetype) | Archetype + contrasting emotions | Original two-scene format | CC Details |
| CC-C (Archetype Named) | Archetype named + contrasting emotions | Original two-scene format | CC Details |
| CC2-A (VoiceNet v2) | VoiceNet + contrasting emotions | Enhanced: explicit emotional scene setup + dramatic transition descriptions | CC2 Details |
| CC2-B (Archetype v2) | Archetype + contrasting emotions | Enhanced: genuine/spontaneous/authentic delivery emphasis | CC2 Details |
| CC2-C (Archetype Named v2) | Archetype named + contrasting emotions | Enhanced: visceral emotional contrast, human-sounding | CC2 Details |
| ACCC (Acting Challenge CC) | Acting challenge + VoiceNet gender/age | Challenge-driven two-scene format — same actor, same challenge, contrasting emotional moments | ACCC Details |
    Sampling → Gemma 4 LLM → DramaBox TTS → RE-USE Enhancement → Best-of-N Scoring
                                                                        ↓
                                        Parakeet ASR (WER) + Empathic Insight (enjoyment)
                                        reward = (1 - WER) × content_enjoyment
For CC/CC2/ACCC paths, an additional step splits the two-scene audio:
RE-USE audio → Qwen3-ASR (word timestamps) → Find "CUT TO:" boundary → Split into Scene 1 + Scene 2
Listen to generated samples from all paths:
| Demo | Description |
|---|---|
| Main Grid (Paths A/B/C/D) | 40 prompts across 4 standalone paths, 3 candidates each, Best-of-3 scoring |
| RE-USE Enhancement | Before/after RE-USE speech enhancement comparison |
| Character Consistent v1 | CC-A/B/C: two-scene pairs with Scene 1 + Scene 2 split players |
| Character Consistent v2 | CC2-A/B/C: improved prompting with emotional scene setup |
| Acting Challenge | AC standalone + ACCC two-scene acting challenges |
Interactive grids comparing 29 ranking methods across 10 prompts × 100 candidates. Each grid lets you switch ranking methods via dropdown and see how candidate ordering changes.
| Grid | Description |
|---|---|
| DramaBox + RE-USE | RE-USE enhanced DramaBox TTS, 10 prompts × 100+10 candidates |
| DramaBox Raw | Raw DramaBox TTS (no enhancement), same prompts |
| DramaBox + RE-USE + ChatterboxVC | RE-USE enhanced + self voice conversion via ChatterboxVC |
| Scenema + RE-USE | Scenema TTS with RE-USE enhancement, 10 prompts × 100+10 candidates |
| Scenema Raw | Raw Scenema TTS (no enhancement) |
| Scenema + RE-USE + ChatterboxVC | Scenema RE-USE enhanced + self voice conversion via ChatterboxVC |
Ranking methods include: Standard (WER × Enjoyment), VoiceCLAP-Large/Small × Quality/Prompt text, 20 multi-text CLAP variants (natural, authentic, professional, expressive, cinematic, warm — with and without negative prompts), and 4 sanitized-prompt methods (directions-only, no quoted speech content).
The pipeline samples from several structured taxonomies to create diverse, controlled voice performances:
| Taxonomy | Size | Format | Documentation |
|---|---|---|---|
| VoiceNet | 57 dimensions × 7 levels | HTML | Taxonomy docs · Interactive viewer |
| VoiceNet Extension | Situation-dependent dims | HTML | Interactive viewer |
| EmoNet | 40 emotions × 4 intensity levels | JSON | Taxonomy docs |
| Vocal Bursts | 120 non-linguistic sounds | JSON | Taxonomy docs |
| Character Archetypes | 920 archetypes (92 genres × 10 each) | JSON | Taxonomy docs |
| Acting Challenges | 1,478 challenge scenarios | JSON | Preview (100 samples) |
| Situation Taxonomy | Poses, activities, social contexts | JSON | Data file |
Paper reference: Schuhmann et al., 2025 — arXiv:2505.20033. See docs/paper_reference.md for citation and BibTeX.
Full 57-dimension voice attribute sampling. The most granular control over voice performance.
- Sample language + accent
- Sample 1-3 emotions from EmoNet with intensity
- Sample 3 mandatory VoiceNet dims (Tempo, Gender, Age) + 5 random from 54 remaining
- Determine flow style (scattered/flowing/mixed), emotion alignment, direction style
- Optionally include vocal bursts taxonomy
- Inject 3 mandatory words from language-specific word list
- Construct structured LLM prompt with all constraints → Gemma 4 generates DramaBox script
See docs/path_a_voicenet.md for full details.
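The dimension and emotion sampling steps of Path A can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the dimension names other than Tempo, Gender and Age, the 1-7 level numbering, the emotion names, and both function names are placeholders.

```python
import random

MANDATORY_DIMS = ["Tempo", "Gender", "Age"]
# Placeholder names for the 54 remaining VoiceNet dimensions (assumption).
OTHER_DIMS = [f"dim_{i:02d}" for i in range(54)]
LEVELS = range(1, 8)  # 7 levels per dimension; numbering 1-7 is assumed


def sample_voicenet_dims(rng, n_random=5):
    """Return {dimension: level} for the 3 mandatory dims plus n_random others."""
    chosen = MANDATORY_DIMS + rng.sample(OTHER_DIMS, n_random)
    return {dim: rng.choice(LEVELS) for dim in chosen}


def sample_emotions(rng, emotions, max_emotions=3, intensities=(1, 2, 3, 4)):
    """Pick 1 to max_emotions distinct emotions, each with an intensity level."""
    k = rng.randint(1, max_emotions)
    return {emo: rng.choice(intensities) for emo in rng.sample(emotions, k)}


rng = random.Random(0)
dims = sample_voicenet_dims(rng)
emotions = sample_emotions(rng, ["joy", "anger", "fear", "awe"])  # placeholder names
print(dims)
print(emotions)
```

The sampled dictionaries would then be rendered into the structured LLM prompt together with the flow style, vocal bursts and mandatory words.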
Genre/character archetype-based sampling. Focuses on character identity over individual vocal dimensions.
- Pick a random genre and archetype from 920 options
- Sample language + accent
- Sample 1-3 emotions with intensity
- Sample Tempo (with fast bias) and Arousal (uniform)
- Construct archetype-focused LLM prompt — no flow/alignment/direction constraints
See docs/path_b_archetype.md for full details.
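As a rough illustration of the difference between the biased Tempo draw and the uniform Arousal draw, the sketch below uses a weighted choice for one and a plain choice for the other. The level numbering and the weights are invented for the example; the pipeline's real bias is not documented in this section.

```python
import random

LEVELS = [1, 2, 3, 4, 5, 6, 7]  # assumed: 1 = slowest/calmest, 7 = fastest/most aroused
# Hypothetical weights that favour faster tempo levels.
TEMPO_WEIGHTS = [1, 1, 2, 3, 4, 4, 3]


def sample_tempo_and_arousal(rng):
    """Draw Tempo with a bias towards fast levels and Arousal uniformly."""
    tempo = rng.choices(LEVELS, weights=TEMPO_WEIGHTS, k=1)[0]
    arousal = rng.choice(LEVELS)
    return tempo, arousal


rng = random.Random(42)
draws = [sample_tempo_and_arousal(rng) for _ in range(5000)]
mean_tempo = sum(t for t, _ in draws) / len(draws)
mean_arousal = sum(a for _, a in draws) / len(draws)
print(round(mean_tempo, 2), round(mean_arousal, 2))
```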
Same as Path B but with explicit instruction to name the archetype role in the DramaBox script output (e.g. "a battle-hardened noble knight" in the speaker description and stage directions). This gives DramaBox TTS a stronger character signal.
See docs/path_c_archetype_named.md for full details.
The most promising path for voice cloning. Uses reference audio's timbre whisper caption to guide prompt generation, then voice-converts the DramaBox TTS output to match the reference speaker.
- Load reference audio metadata (timbre whisper caption)
- Generate timbre caption on-the-fly if missing (via `laion/timbre-whisper`)
- Filter VoiceNet dimensions to situation-dependent only (exclude identity: age, gender, timbre, resonance)
- Sample 1-3 emotions + tempo + 5 situation-dependent dimensions
- Construct LLM prompt with timbre caption + sampled performance attributes
- Synthesize with DramaBox TTS (text-only, no voice reference); passing `voice_ref` directly to DramaBox leads to unstable/garbled generations
- Voice-convert generated audio to match reference via Chatterbox VC
- Score and rank with Best-of-N
Why text-only TTS + VC? The timbre whisper caption gives Gemma 4 a rich description of the target speaker's vocal qualities, which guides the LLM to produce a speaker-consistent DramaBox script. Chatterbox VC then handles the actual voice transfer. This two-stage approach is far more stable than passing `voice_ref` directly to DramaBox, which causes garbled or incoherent audio output.
See docs/path_d_reference.md for full details.
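The identity filter in Path D can be pictured as a simple exclusion step before sampling. The sketch below assumes dimensions are identified by plain names; apart from the four excluded identity attributes, every dimension name and the function name are hypothetical.

```python
import random

# Identity attributes are fixed by the reference speaker, so they are not sampled.
IDENTITY_DIMS = {"age", "gender", "timbre", "resonance"}


def sample_situational_dims(all_dims, rng, n=5):
    """Drop identity dimensions, then sample n situation-dependent ones."""
    situational = [d for d in all_dims if d.lower() not in IDENTITY_DIMS]
    return rng.sample(situational, n)


# Placeholder dimension names for illustration only.
all_dims = ["Age", "Gender", "Timbre", "Resonance", "Tempo", "Loudness",
            "Breathiness", "Tension", "Pitch variation", "Articulation", "Pausing"]
picked = sample_situational_dims(all_dims, random.Random(1))
print(picked)
```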
Audition-style method acting performances driven by acting challenge scenarios. Samples from 1,478 structured challenges covering diverse emotional and situational contexts.
- Sample a random acting challenge (title + instruction) from the challenge database
- Sample speaker gender (VoiceNet GEND dimension, 7 levels) and age (AGEV dimension, 7 levels)
- Sample word count (40-80 words)
- Gemma 4 generates a DramaBox prompt — actor performs the challenge naturalistically
- DramaBox TTS → RE-USE enhancement → Best-of-N scoring
Key characteristics:
- No self-introduction — the actor simply begins performing
- Dynamic emotional arc with at least one turning point or new insight
- Naturalistic, genuine, spontaneous delivery — method acting, not theatrical performance
- Diverse delivery — whispered, loud, sensual, ranting, all valid if authentic
See docs/path_ac_acting_challenge.md for full details.
All CC paths produce two scenes with the same speaker in contrasting emotional states, separated by a "CUT TO:" marker.
The original two-scene format. Three sampling variants matching standalone Paths A, B, C:
- CC-A (VoiceNet): Full 57-dim sampling + contrasting emotions between scenes
- CC-B (Archetype): Archetype-based + contrasting emotions
- CC-C (Archetype Named): Named archetype + contrasting emotions
Emotion contrast logic: If Scene 1 has positive emotions → Scene 2 samples from negative emotions (and vice versa). Word count: 50-80 total (~25-40 per scene).
See docs/path_cc_character_consistent.md for full details.
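The emotion contrast rule can be sketched as sampling the two scenes from opposite valence pools. The two pools below are placeholders; how the 40 EmoNet emotions are actually grouped into positive and negative is an assumption of this example.

```python
import random

# Placeholder valence groups (assumption, not the pipeline's real grouping).
POSITIVE = ["joy", "affection", "pride", "relief"]
NEGATIVE = ["anger", "fear", "sadness", "shame"]


def sample_contrasting_emotions(rng, max_emotions=3):
    """Sample Scene 1 from one valence pool and Scene 2 from the opposite pool."""
    if rng.random() < 0.5:
        first_pool, second_pool = POSITIVE, NEGATIVE
    else:
        first_pool, second_pool = NEGATIVE, POSITIVE
    scene_1 = rng.sample(first_pool, rng.randint(1, max_emotions))
    scene_2 = rng.sample(second_pool, rng.randint(1, max_emotions))
    return scene_1, scene_2


scene_1, scene_2 = sample_contrasting_emotions(random.Random(7))
print(scene_1, "->", scene_2)
```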
Improved version of CC with enhanced LLM prompting:
- Scene 1 setup: Before the first dialogue, 1-2 sentences vividly set the emotional situation — social context, speaker's state of mind, emotional energy
- Scene 2 transition: After "CUT TO:", 1-3 sentences explicitly describe the dramatic shift in emotional tone, talking style, delivery, and pace
- Performance quality emphasis: Delivery must sound like a real, living, breathing human being — genuine, spontaneous, authentic, with natural hesitations and organic pacing
See docs/path_cc2_character_consistent_v2.md for full details.
Challenge-driven two-scene format: same actor performing the same acting challenge at two different emotional moments with dramatically shifted delivery.
- Sample acting challenge + gender + age (same as standalone AC)
- Sample word count (40-80 total, split ~evenly between scenes)
- Gemma 4 generates two contrasting scenes from the same challenge
- DramaBox TTS → chunked RE-USE enhancement → Best-of-N scoring
- Qwen3-ASR word timestamps → split into Scene 1 + Scene 2
See docs/path_ac_acting_challenge.md#accc-character-consistent for full details.
All standalone paths (A, B, C, AC) and character consistent paths use nvidia/RE-USE (SEMamba) for speech enhancement:
- Standalone/short audio: Direct enhancement (single pass)
- CC/CC2/ACCC (long audio): Chunked enhancement (15s chunks, 1s overlap, cross-faded)
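Chunked enhancement with overlap and cross-fading could look roughly like the sketch below. It works on a plain list of samples and takes the enhancement model as a callable, so `enhance` is a stand-in for RE-USE rather than its real interface; the linear cross-fade is also an assumption, since the fade shape is not specified here.

```python
def chunk_bounds(n_samples, chunk_len, overlap):
    """Return (start, end) indices of chunks that overlap by `overlap` samples."""
    assert 0 <= overlap < chunk_len
    if n_samples <= chunk_len:
        return [(0, n_samples)]
    hop = chunk_len - overlap
    bounds, start = [], 0
    while True:
        end = min(start + chunk_len, n_samples)
        bounds.append((start, end))
        if end == n_samples:
            return bounds
        start += hop


def enhance_chunked(audio, enhance, chunk_len, overlap):
    """Enhance each chunk separately and linearly cross-fade the overlaps."""
    out = [0.0] * len(audio)
    for start, end in chunk_bounds(len(audio), chunk_len, overlap):
        chunk = enhance(audio[start:end])
        for i, sample in enumerate(chunk):
            pos = start + i
            if start > 0 and i < overlap:
                w = (i + 1) / (overlap + 1)  # fade-in weight of the new chunk
                out[pos] = (1 - w) * out[pos] + w * sample
            else:
                out[pos] = sample
    return out


# With a 16 kHz signal (assumed rate), 15 s chunks and 1 s overlap would be
# chunk_len=15 * 16000 and overlap=1 * 16000. A tiny example with an identity "model":
audio = [float(i) for i in range(100)]
restored = enhance_chunked(audio, lambda chunk: list(chunk), chunk_len=30, overlap=5)
```

With an identity `enhance`, the output reproduces the input, which is a convenient sanity check that the cross-fade weights sum to one across each overlap.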
Two-scene audio is split using Qwen3-ASR-1.7B with forced alignment:
- Transcribe with word-level timestamps
- Parse the DramaBox prompt to find first words of Scene 2 dialogue
- Match ASR timestamps to find the split boundary
- Split with 100ms fades at the boundary
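Finding the split boundary from word-level timestamps can be sketched as a sequence match between the first words of Scene 2 and the ASR output. The tuple layout `(word, start_sec, end_sec)` and the function names are assumptions for illustration; the actual Qwen3-ASR output format is not shown here, and the 100 ms fades are left out.

```python
import re


def normalize(word):
    """Lowercase and strip punctuation so prompt words and ASR words compare equal."""
    return re.sub(r"[^\w]", "", word.lower())


def find_split_time(asr_words, scene2_first_words):
    """Return the start time of the first ASR word run matching Scene 2's opening.

    asr_words: list of (word, start_sec, end_sec) tuples (assumed layout).
    scene2_first_words: first few words of the Scene 2 dialogue from the prompt.
    """
    target = [normalize(w) for w in scene2_first_words]
    if not target:
        return None
    spoken = [normalize(w) for w, _, _ in asr_words]
    for i in range(len(spoken) - len(target) + 1):
        if spoken[i:i + len(target)] == target:
            return asr_words[i][1]
    return None  # no match; the caller needs a fallback strategy


words = [("Hello", 0.0, 0.4), ("there.", 0.4, 0.9),
         ("I", 2.0, 2.1), ("can't", 2.1, 2.5), ("believe", 2.5, 3.0)]
split_at = find_split_time(words, ["I", "can't"])
print(split_at)
```

An exact match is the simplest possible criterion; ASR errors near the boundary would call for a fuzzier comparison.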
- Generate N candidate audio samples (default 3)
- Score each with:
  - WER (Word Error Rate): Parakeet v3 ASR transcription vs expected dialogue
  - Content Enjoyment: Empathic Insight Plus (BUD-E-Whisper encoder + MLP)
- Composite reward: `(1 - min(WER, 1.0)) × content_enjoyment`
- Select the candidate with the highest reward
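The scoring and selection steps can be written down in a few lines. The sketch below computes WER with a standard word-level edit distance and applies the composite reward; the candidate dictionary keys (`transcript`, `content_enjoyment`) are a hypothetical schema, and the enjoyment scores are taken as given rather than computed by Empathic Insight.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


def reward(wer, content_enjoyment):
    """Composite reward; WER is clamped to 1.0 so the reward never goes negative."""
    return (1.0 - min(wer, 1.0)) * content_enjoyment


def best_of_n(candidates, expected_text):
    """Return the candidate with the highest composite reward."""
    return max(
        candidates,
        key=lambda c: reward(word_error_rate(expected_text, c["transcript"]),
                             c["content_enjoyment"]),
    )


candidates = [
    {"id": 0, "transcript": "the cat sat", "content_enjoyment": 0.5},
    {"id": 1, "transcript": "a dog ran", "content_enjoyment": 0.9},
    {"id": 2, "transcript": "the cat sat", "content_enjoyment": 0.7},
]
winner = best_of_n(candidates, "the cat sat")
```

Because the two terms are multiplied, a candidate with a high enjoyment score but a fully wrong transcript (candidate 1 here) scores zero.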
    git clone https://github.com/LAION-AI/Voice-Acting-Pipeline.git
    cd Voice-Acting-Pipeline
    pip install -e .

For TTS synthesis (requires GPU with ~24GB VRAM):

    pip install -e ".[tts]"

For audio refinement and scoring:

    pip install -e ".[refinement,scoring]"

    # Generate 1000 DramaBox prompts using GPUs 0 and 1
    dramabox generate-prompts --config config.json --total 1000 --gpus 0,1

    # Synthesize audio from an existing CSV
    dramabox synthesize --csv output/dramabox_chunk_000.csv --gpus 0,1,2,3

    # Generate prompts and immediately synthesize audio
    dramabox run --config config.json --total 1000 --gpus 0,1,2,3

    dramabox reference --config config.json --ref-dir /path/to/references --total 10 --gpus 6,7

    # Full 4-path demo: A + B + C + D, 10 prompts each, best-of-3 scoring
    dramabox demo --config config.json --full --n-prompts 10 --best-of-n 3 --gpus 6,7

    dramabox score --audio output/audio/sample_000000_raw.wav --prompt "prompt text" --gpu 0

All parameters are in `config.json`. See `config_schema.md` for full documentation of every field.
| Section | Parameter | Default | Description |
|---|---|---|---|
| `prompt_generation` | `llm_model` | `google/gemma-4-E4B-it` | LLM for prompt generation |
| `prompt_generation` | `total_prompts` | `100000` | Number of prompts to generate |
| `sampling` | `archetype_ratio` | `0.20` | Fraction using archetype path |
| `sampling` | `word_count_min/max` | `10 / 60` | Target dialogue word count range |
| `tts` | `cfg_scale` | `2.0` | Classifier-free guidance scale |
| `tts` | `steps` | `30` | Euler flow matching steps |
| `best_of_n` | `n_candidates` | `3` | Candidates per Best-of-N ranking |
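To show how these sections and parameters might fit together, here is a small sketch that overlays user overrides on the defaults from the table above. The nested section-then-parameter layout and the split of `word_count_min/max` into two keys are assumptions; `config_schema.md` is the authoritative description, and `merge_config` is a hypothetical helper, not part of the package.

```python
import json

# Defaults taken from the table above; the nesting is an assumed layout.
DEFAULTS = {
    "prompt_generation": {"llm_model": "google/gemma-4-E4B-it", "total_prompts": 100000},
    "sampling": {"archetype_ratio": 0.20, "word_count_min": 10, "word_count_max": 60},
    "tts": {"cfg_scale": 2.0, "steps": 30},
    "best_of_n": {"n_candidates": 3},
}


def merge_config(overrides):
    """Return a copy of DEFAULTS with per-section overrides applied."""
    merged = {section: dict(values) for section, values in DEFAULTS.items()}
    for section, values in overrides.items():
        merged.setdefault(section, {}).update(values)
    return merged


user_config = json.loads('{"tts": {"steps": 40}, "best_of_n": {"n_candidates": 5}}')
config = merge_config(user_config)
print(config["tts"], config["best_of_n"])
```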
Languages are configured in config.json. Currently active: English, German, French, Spanish. Ready to enable: Italian, Dutch, Russian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Polish, Swedish.
| Model | Purpose | VRAM |
|---|---|---|
| `google/gemma-4-E4B-it` | DramaBox prompt generation | ~16GB |
| `ResembleAI/Dramabox` | TTS synthesis (22B DiT) | ~24GB |
| `nvidia/RE-USE` | Speech enhancement (SEMamba) | ~1GB |
| `Qwen/Qwen3-ASR-1.7B` | Word-level timestamps for audio splitting | ~4GB |
| `nvidia/parakeet-tdt-0.6b-v3` | ASR for WER scoring | ~2GB |
| `laion/Empathic-Insight-Voice-Plus` | Content enjoyment scoring | ~2GB |
| `laion/timbre-whisper` | On-the-fly timbre captioning (Path D) | ~2GB |
| Chatterbox VC | Voice conversion (Path D) | ~4GB |
dramabox-pipeline/
├── config.json # All configurable parameters
├── config_schema.md # Documentation for config fields
├── pyproject.toml # Python packaging
├── data/
│ ├── voicenet_ext_taxonomy.html # VoiceNet (57 dims)
│ ├── all_acting_challenges.json # 1,478 acting challenge scenarios
│ ├── situation_taxonomy.json # Situation taxonomy (poses, activities, contexts)
│ ├── emonet_taxonomy.json # EmoNet (40 emotions)
│ ├── vocal_bursts_taxonomy.json # Vocal bursts (120 types)
│ ├── archetypes.json # Archetypes (92 genres × 10)
│ └── wordlists/ # Per-language word lists
├── dramabox/
│ ├── cli.py # CLI entry point
│ ├── config_loader.py # Config loading and validation
│ ├── taxonomy.py # Taxonomy parsers and loaders
│ ├── sampling.py # Path A + Path B sampling
│ ├── reference_sampling.py # Path D: reference audio sampling
│ ├── prompts.py # LLM prompt construction
│ ├── prompt_generator.py # Multi-GPU LLM batch generation
│ ├── tts_synthesizer.py # Multi-GPU DramaBox TTS
│ ├── reuse_enhance.py # RE-USE speech enhancement
│ ├── scoring.py # ASR WER + content enjoyment scoring
│ ├── demo_grid.py # HTML demo grid generator
│ └── pipeline.py # Mode 1–6 orchestrator
├── docs/
│ ├── voicenet_taxonomy.md # VoiceNet 57-dim taxonomy
│ ├── voicenet_extension_taxonomy.html # Interactive VoiceNet viewer
│ ├── emonet_taxonomy.md # EmoNet 40 emotions
│ ├── vocal_bursts_taxonomy.md # 120 vocal bursts
│ ├── archetypes.md # 920 archetypes
│ ├── acting_challenges_preview.html # Acting challenge preview (100 samples)
│ ├── paper_reference.md # Citation and BibTeX
│ ├── path_a_voicenet.md # Path A detailed docs
│ ├── path_b_archetype.md # Path B detailed docs
│ ├── path_c_archetype_named.md # Path C detailed docs
│ ├── path_d_reference.md # Path D detailed docs
│ ├── path_ac_acting_challenge.md # AC + ACCC detailed docs
│ ├── path_cc_character_consistent.md # CC v1 detailed docs
│ ├── path_cc2_character_consistent_v2.md # CC2 v2 detailed docs
│ └── demo/ # HTML demo grids with audio
│ ├── index.html # Main 4-path grid
│ ├── reuse.html # RE-USE before/after
│ ├── cc.html # Character Consistent v1
│ ├── cc2.html # Character Consistent v2
│ └── ac.html # Acting Challenge
└── examples/ # Example prompts
| Component | Minimum | Recommended |
|---|---|---|
| Prompt generation | 1 GPU, 16GB VRAM | 4+ GPUs, 16GB+ each |
| TTS synthesis | 1 GPU, 24GB VRAM | 4+ GPUs, 24GB+ each |
| Refinement + scoring | 1 GPU, 8GB VRAM | 1 GPU, 16GB+ |
| RE-USE enhancement | CPU or GPU | 1 GPU |
| RAM | 32GB | 64GB+ |
The Vocalino server provides a web UI and API for interactive voice design and zero-shot voice cloning. It is independent of the DramaBox data pipeline above.
Standard TTS can generate emotions but with random voices. Standard Voice Conversion (VC) can clone a specific person but requires pre-acted source audio. Vocalino decouples vocal identity from performance style by chaining advanced stylistic generation with high-fidelity voice conversion.
┌────────────────────┐
Text + Style ──> │ Qwen3-TTS 1.7B │ ──> Raw TTS audio
│ (VoiceDesign) │ (12 Hz codec tokens → wav)
└────────────────────┘
│
▼
┌────────────────────┐
Reference WAV ─> │ Seed-VC V2 │ ──> Voice-converted audio
│ (CFM + AR) │ (matches reference timbre)
└────────────────────┘
│
▼
┌────────────────────┐
│ ECAPA-TDNN │ ──> 2048-dim embedding
│ (Speaker Encoder) │ → cosine similarity vs ref
└────────────────────┘
- Web UI: dark-themed browser interface served at `/ui` for interactive voice design
- Batched TTS: generate K candidates in a single forward pass (~2x faster)
- SSE Streaming: candidates stream to the UI as they complete
- Speaker Similarity Ranking: ECAPA-TDNN embeddings rank candidates by voice consistency
- INT8 Quantization: optional bitsandbytes INT8 reduces TTS VRAM from ~15 GB to ~7 GB
- Multi-GPU: split TTS and VC across GPUs for VRAM isolation
    # Basic launch (single GPU, bfloat16)
    python server.py

    # With INT8 quantization (halves TTS VRAM)
    TTS_QUANTIZE=int8 python server.py

    # Multi-GPU (TTS on GPU 0, VC on GPU 1)
    CUDA_VISIBLE_DEVICES=0,1 VC_DEVICE=cuda:1 python server.py

The server starts on `http://0.0.0.0:8000`. Open the web UI at `http://<server-ip>:8000/ui/`.
- Enter text and a natural-language voice/style description
- Generate N samples (batched for speed)
- Listen, download, or select any sample as reference
- Upload or select a reference audio (target speaker identity)
- Enter text and emotion/style instruction
- Generate K candidates — each streamed to the UI as it completes
- Candidates ranked by speaker embedding similarity (green = best match)
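Ranking candidates by speaker similarity comes down to a cosine similarity between each candidate's embedding and the reference embedding. The sketch below uses short toy vectors in place of real ECAPA-TDNN embeddings, and the function names are illustrative rather than the server's own.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_by_similarity(reference_embedding, candidate_embeddings):
    """Return (candidate_index, similarity) pairs, best match first."""
    scored = [(i, cosine_similarity(reference_embedding, emb))
              for i, emb in enumerate(candidate_embeddings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


reference = [1.0, 0.0, 0.0]
candidates = [[0.0, 1.0, 0.0], [2.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ranking = rank_by_similarity(reference, candidates)
print(ranking)
```

Cosine similarity ignores vector length, so candidate 1 (a scaled copy of the reference) ranks first with a similarity of 1.0.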
| Endpoint | Method | Description |
|---|---|---|
| `/tts/generate-voice-design` | POST | Generate speech with style prompt |
| `/voice-design/batch` | POST | Batched voice design (N samples) |
| `/vc/convert` | POST | Voice conversion with Seed-VC V2 |
| `/pipeline/tts-then-vc` | POST | TTS + voice conversion combined |
| `/pipeline/ranked` | POST | Generate K candidates, rank by similarity |
| `/pipeline/ranked-stream` | POST (SSE) | Streaming version of ranked pipeline |
| `/health` | GET | Server status and configuration |
| Variable | Default | Description |
|---|---|---|
| `TTS_DEVICE` | `cuda:0` | GPU for Qwen3-TTS |
| `VC_DEVICE` | (same as TTS) | GPU for Seed-VC |
| `TTS_QUANTIZE` | `none` | `none` = bfloat16, `int8` = INT8 |
| `DEFAULT_DIFF_STEPS` | `12` | VC diffusion steps |
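The defaults and the fallback of `VC_DEVICE` to the TTS device can be expressed as below. This is a sketch of how such variables are commonly read, not the code in `server.py`; `load_server_settings` and the returned keys are hypothetical.

```python
import os


def load_server_settings(env=os.environ):
    """Read the server's environment variables, applying the documented defaults."""
    tts_device = env.get("TTS_DEVICE", "cuda:0")
    quantize = env.get("TTS_QUANTIZE", "none")
    if quantize not in ("none", "int8"):
        raise ValueError(f"TTS_QUANTIZE must be 'none' or 'int8', got {quantize!r}")
    return {
        "tts_device": tts_device,
        "vc_device": env.get("VC_DEVICE", tts_device),  # defaults to the TTS device
        "tts_quantize": quantize,
        "diff_steps": int(env.get("DEFAULT_DIFF_STEPS", "12")),
    }


defaults = load_server_settings({})
multi_gpu = load_server_settings({"VC_DEVICE": "cuda:1", "TTS_QUANTIZE": "int8"})
print(defaults)
print(multi_gpu)
```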
- This pipeline code — Apache 2.0
- DramaBox — see model card
- Qwen3-TTS — Apache 2.0
- Seed-VC — MIT