Vocalino V0.1: Voice Acting Pipeline

The first voice acting pipeline with open-weights components and open post-training data that combines zero-shot voice cloning with natural-language performance direction. You provide a reference voice (or generate one from scratch) and use free-form text instructions to direct how each line is performed. The generated speech stays consistent with the reference voice while following your emotional and stylistic prompts, so you control both the actor and the performance without training a model.


DramaBox Voice Acting Data Pipeline

End-to-end voice prompt generation and audio synthesis using the DramaBox TTS model (22B DiT) and structured voice taxonomy sampling. Based on the voice taxonomy research of Schuhmann et al., 2025.

This pipeline generates richly annotated voice performance prompts in the DramaBox format — single-speaker scenes with stage directions (English) and spoken dialogue (target language) — then synthesizes them into audio. Each prompt is procedurally constructed by sampling from structured taxonomies, then expanded by an LLM (Gemma 4 E4B-it) into a full performance script.

All Paths at a Glance

The pipeline supports 12 generation paths organized into three families. Each path uses a different sampling strategy to produce diverse voice acting data.

Standalone Paths (Single Scene)

Path	Sampling	Description	Details
A (VoiceNet)	57 VoiceNet dims + EmoNet + Vocal Bursts	Full taxonomy sampling: 3 mandatory dims (Tempo, Gender, Age) + 5 random, 1-3 emotions, flow style, mandatory words	Path A Details
B (Archetype)	920 archetypes across 92 genres	Genre/character archetype-based: random archetype + emotions + Tempo/Arousal	Path B Details
C (Archetype Named)	Same as B + explicit naming	Archetype with explicit role naming in the DramaBox script (e.g. "a battle-hardened noble knight")	Path C Details
D (Reference Audio)	Timbre whisper + VoiceNet + Chatterbox VC	Reference audio pipeline: timbre caption guides prompt, DramaBox TTS + voice conversion to match reference speaker	Path D Details
AC (Acting Challenge)	1478 acting challenges + VoiceNet gender/age	Audition-style method acting from challenge scenarios — naturalistic, genuine, dynamic emotional arc	AC Details

Character Consistent Paths (Two Scenes — "CUT TO:")

All CC paths generate two scenes with the same speaker in contrasting emotional states, separated by a "CUT TO:" marker. The speaker's fundamental voice (age, gender, timbre) stays identical — only the emotional delivery changes. Audio is later split into Scene 1 / Scene 2 using Qwen3-ASR word-level timestamps.

Path	Sampling	Key Improvement	Details
CC-A (VoiceNet)	VoiceNet + contrasting emotions	Original two-scene format	CC Details
CC-B (Archetype)	Archetype + contrasting emotions	Original two-scene format	CC Details
CC-C (Archetype Named)	Archetype named + contrasting emotions	Original two-scene format	CC Details
CC2-A (VoiceNet v2)	VoiceNet + contrasting emotions	Enhanced: explicit emotional scene setup + dramatic transition descriptions	CC2 Details
CC2-B (Archetype v2)	Archetype + contrasting emotions	Enhanced: genuine/spontaneous/authentic delivery emphasis	CC2 Details
CC2-C (Archetype Named v2)	Archetype named + contrasting emotions	Enhanced: visceral emotional contrast, human-sounding	CC2 Details
ACCC (Acting Challenge CC)	Acting challenge + VoiceNet gender/age	Challenge-driven two-scene format — same actor, same challenge, contrasting emotional moments	ACCC Details

Processing Pipeline (All Paths)

Sampling → Gemma 4 LLM → DramaBox TTS → RE-USE Enhancement → Best-of-N Scoring
                                                                    ↓
                                                    Parakeet ASR (WER) + Empathic Insight (enjoyment)
                                                    reward = (1 - min(WER, 1.0)) × content_enjoyment

For CC/CC2/ACCC paths, an additional step splits the two-scene audio:

RE-USE audio → Qwen3-ASR (word timestamps) → Find "CUT TO:" boundary → Split into Scene 1 + Scene 2

Demo

Listen to generated samples from all paths:

Demo	Description
Main Grid (Paths A/B/C/D)	40 prompts across 4 standalone paths, 3 candidates each, Best-of-3 scoring
RE-USE Enhancement	Before/after RE-USE speech enhancement comparison
Character Consistent v1	CC-A/B/C: two-scene pairs with Scene 1 + Scene 2 split players
Character Consistent v2	CC2-A/B/C: improved prompting with emotional scene setup
Acting Challenge	AC standalone + ACCC two-scene acting challenges

Best-of-N Ranking Analysis (29 Methods)

Interactive grids comparing 29 ranking methods across 10 prompts × 100 candidates. Each grid lets you switch ranking methods via dropdown and see how candidate ordering changes.

Grid	Description
DramaBox + RE-USE	RE-USE enhanced DramaBox TTS, 10 prompts × 100+10 candidates
DramaBox Raw	Raw DramaBox TTS (no enhancement), same prompts
DramaBox + RE-USE + ChatterboxVC	RE-USE enhanced + self voice conversion via ChatterboxVC
Scenema + RE-USE	Scenema TTS with RE-USE enhancement, 10 prompts × 100+10 candidates
Scenema Raw	Raw Scenema TTS (no enhancement)
Scenema + RE-USE + ChatterboxVC	Scenema RE-USE enhanced + self voice conversion via ChatterboxVC

Ranking methods include: Standard (WER × Enjoyment), VoiceCLAP-Large/Small × Quality/Prompt text, 20 multi-text CLAP variants (natural, authentic, professional, expressive, cinematic, warm — with and without negative prompts), and 4 sanitized-prompt methods (directions-only, no quoted speech content).

Taxonomies & Data

The pipeline samples from several structured taxonomies to create diverse, controlled voice performances:

Taxonomy	Size	Format	Documentation
VoiceNet	57 dimensions × 7 levels	HTML	Taxonomy docs · Interactive viewer
VoiceNet Extension	Situation-dependent dims	HTML	Interactive viewer
EmoNet	40 emotions × 4 intensity levels	JSON	Taxonomy docs
Vocal Bursts	120 non-linguistic sounds	JSON	Taxonomy docs
Character Archetypes	920 archetypes across 92 genres	JSON	Taxonomy docs
Acting Challenges	1,478 challenge scenarios	JSON	Preview (100 samples)
Situation Taxonomy	Poses, activities, social contexts	JSON	Data file

Paper reference: Schuhmann et al., 2025 — arXiv:2505.20033. See docs/paper_reference.md for citation and BibTeX.

Standalone Paths — Details

Path A — VoiceNet (default 80%)

Full 57-dimension voice attribute sampling. The most granular control over voice performance.

Sample language + accent
Sample 1-3 emotions from EmoNet with intensity
Sample 3 mandatory VoiceNet dims (Tempo, Gender, Age) + 5 random from 54 remaining
Determine flow style (scattered/flowing/mixed), emotion alignment, direction style
Optionally include vocal bursts taxonomy
Inject 3 mandatory words from language-specific word list
Construct structured LLM prompt with all constraints → Gemma 4 generates DramaBox script
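The dimension-sampling step above can be sketched roughly as follows. This is an illustration only, not the pipeline's actual code: the function name `sample_voicenet_dims` and the placeholder dimension names are hypothetical, and the sketch assumes levels are integers from 1 to 7 drawn uniformly, which the real sampler may not do.

```python
import random

# The three dimensions that Path A always includes.
MANDATORY = ["Tempo", "Gender", "Age"]

def sample_voicenet_dims(all_dims, n_random=5, n_levels=7, rng=None):
    """Pick the mandatory dims plus n_random others, each with a level 1..n_levels."""
    rng = rng or random.Random()
    remaining = [d for d in all_dims if d not in MANDATORY]
    chosen = MANDATORY + rng.sample(remaining, n_random)
    # Assumption: levels are sampled uniformly; the real pipeline may bias them.
    return {dim: rng.randint(1, n_levels) for dim in chosen}

# Placeholder names standing in for the 54 non-mandatory VoiceNet dimensions.
ALL_DIMS = MANDATORY + [f"dim_{i:02d}" for i in range(54)]
```

Calling `sample_voicenet_dims(ALL_DIMS)` returns a dict with eight entries: the three mandatory dimensions and five drawn without replacement from the remaining 54.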

See docs/path_a_voicenet.md for full details.

Path B — Archetype (default 20%)

Genre/character archetype-based sampling. Focuses on character identity over individual vocal dimensions.

Pick a random genre and archetype from 920 options
Sample language + accent
Sample 1-3 emotions with intensity
Sample Tempo (with fast bias) and Arousal (uniform)
Construct archetype-focused LLM prompt — no flow/alignment/direction constraints

See docs/path_b_archetype.md for full details.

Path C — Archetype Named

Same as Path B but with explicit instruction to name the archetype role in the DramaBox script output (e.g. "a battle-hardened noble knight" in the speaker description and stage directions). This gives DramaBox TTS a stronger character signal.

See docs/path_c_archetype_named.md for full details.

Path D — Reference Audio

The most promising path for voice cloning. Uses reference audio's timbre whisper caption to guide prompt generation, then voice-converts the DramaBox TTS output to match the reference speaker.

Load reference audio metadata (timbre whisper caption)
Generate timbre caption on-the-fly if missing (via laion/timbre-whisper)
Filter VoiceNet dimensions to situation-dependent only (exclude identity: age, gender, timbre, resonance)
Sample 1-3 emotions + tempo + 5 situation-dependent dimensions
Construct LLM prompt with timbre caption + sampled performance attributes
Synthesize with DramaBox TTS (text-only, no voice reference) — passing voice_ref directly to DramaBox leads to unstable/garbled generations
Voice-convert generated audio to match reference via Chatterbox VC
Score and rank with Best-of-N

Why text-only TTS + VC? The timbre whisper caption gives Gemma 4 a rich description of the target speaker's vocal qualities, which guides the LLM to produce a speaker-consistent DramaBox script. Chatterbox VC then handles the actual voice transfer. This two-stage approach is far more stable than passing voice_ref directly to DramaBox, which causes garbled or incoherent audio output.

See docs/path_d_reference.md for full details.

Path AC — Acting Challenge

Audition-style method acting performances driven by acting challenge scenarios. Samples from 1,478 structured challenges covering diverse emotional and situational contexts.

Sample a random acting challenge (title + instruction) from the challenge database
Sample speaker gender (VoiceNet GEND dimension, 7 levels) and age (AGEV dimension, 7 levels)
Sample word count (40-80 words)
Gemma 4 generates a DramaBox prompt — actor performs the challenge naturalistically
DramaBox TTS → RE-USE enhancement → Best-of-N scoring

Key characteristics:

No self-introduction — the actor simply begins performing
Dynamic emotional arc with at least one turning point or new insight
Naturalistic, genuine, spontaneous delivery — method acting, not theatrical performance
Diverse delivery — whispered, loud, sensual, ranting, all valid if authentic

See docs/path_ac_acting_challenge.md for full details.

Character Consistent Paths — Details

All CC paths produce two scenes with the same speaker in contrasting emotional states, separated by a "CUT TO:" marker.

CC v1 (A/B/C) — Character Consistent

The original two-scene format. Three sampling variants matching standalone Paths A, B, C:

CC-A (VoiceNet): Full 57-dim sampling + contrasting emotions between scenes
CC-B (Archetype): Archetype-based + contrasting emotions
CC-C (Archetype Named): Named archetype + contrasting emotions

Emotion contrast logic: If Scene 1 has positive emotions → Scene 2 samples from negative emotions (and vice versa). Word count: 50-80 total (~25-40 per scene).
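A minimal sketch of the contrast rule, assuming emotions are pre-grouped by valence. The two short lists below are hypothetical stand-ins for the EmoNet taxonomy, and sampling 1-3 emotions per scene is an assumption carried over from the standalone paths.

```python
import random

# Hypothetical valence groups; the real EmoNet taxonomy has 40 emotions.
POSITIVE = ["joy", "relief", "pride"]
NEGATIVE = ["anger", "sadness", "fear"]

def sample_contrasting_emotions(rng=None, max_per_scene=3):
    """Return (scene1_emotions, scene2_emotions) drawn from opposite valence groups."""
    rng = rng or random.Random()
    # Pick which valence Scene 1 gets; Scene 2 always gets the opposite one.
    if rng.random() < 0.5:
        first, second = POSITIVE, NEGATIVE
    else:
        first, second = NEGATIVE, POSITIVE
    scene1 = rng.sample(first, rng.randint(1, max_per_scene))
    scene2 = rng.sample(second, rng.randint(1, max_per_scene))
    return scene1, scene2
```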

See docs/path_cc_character_consistent.md for full details.

CC2 v2 (A/B/C) — Character Consistent v2

Improved version of CC with enhanced LLM prompting:

Scene 1 setup: Before the first dialogue, 1-2 sentences vividly set the emotional situation — social context, speaker's state of mind, emotional energy
Scene 2 transition: After "CUT TO:", 1-3 sentences explicitly describe the dramatic shift in emotional tone, talking style, delivery, and pace
Performance quality emphasis: Delivery must sound like a real, living, breathing human being — genuine, spontaneous, authentic, with natural hesitations and organic pacing

See docs/path_cc2_character_consistent_v2.md for full details.

ACCC — Acting Challenge Character Consistent

Challenge-driven two-scene format: same actor performing the same acting challenge at two different emotional moments with dramatically shifted delivery.

Sample acting challenge + gender + age (same as standalone AC)
Sample word count (40-80 total, split ~evenly between scenes)
Gemma 4 generates two contrasting scenes from the same challenge
DramaBox TTS → chunked RE-USE enhancement → Best-of-N scoring
Qwen3-ASR word timestamps → split into Scene 1 + Scene 2

See docs/path_ac_acting_challenge.md#accc-character-consistent for full details.

Audio Processing

RE-USE Speech Enhancement

Standalone paths A, B, C, and AC and all character consistent paths use nvidia/RE-USE (SEMamba) for speech enhancement:

Standalone/short audio: Direct enhancement (single pass)
CC/CC2/ACCC (long audio): Chunked enhancement (15s chunks, 1s overlap, cross-faded)
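The chunked mode can be pictured with the sketch below. It is not the code in `reuse_enhance.py`: `enhance` is a hypothetical callable standing in for the RE-USE model, it is assumed to return as many samples as it receives, and the linear cross-fade is one plausible choice among several.

```python
def split_chunks(n_samples, chunk_len, overlap):
    """Return (start, end) index pairs covering n_samples with the given overlap."""
    if overlap >= chunk_len:
        raise ValueError("overlap must be smaller than chunk_len")
    step = chunk_len - overlap
    spans, start = [], 0
    while True:
        end = min(start + chunk_len, n_samples)
        spans.append((start, end))
        if end == n_samples:
            return spans
        start += step

def enhance_chunked(samples, sample_rate, enhance, chunk_s=15.0, overlap_s=1.0):
    """Enhance long audio chunk by chunk and cross-fade the overlapping regions."""
    chunk_len = int(chunk_s * sample_rate)
    overlap = int(overlap_s * sample_rate)
    out = []
    for i, (start, end) in enumerate(split_chunks(len(samples), chunk_len, overlap)):
        chunk = list(enhance(samples[start:end]))
        if i == 0:
            out.extend(chunk)
            continue
        n = min(overlap, len(chunk), len(out))
        # Linear cross-fade: fade the previous chunk out while the new one fades in.
        for k in range(n):
            w = (k + 1) / (n + 1)
            out[-n + k] = (1 - w) * out[-n + k] + w * chunk[k]
        out.extend(chunk[n:])
    return out
```

Overlapping the chunks gives the cross-fade material to blend, which avoids audible seams at chunk boundaries.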

Audio Splitting (CC/CC2/ACCC)

Two-scene audio is split using Qwen3-ASR-1.7B with forced alignment:

Transcribe with word-level timestamps
Parse the DramaBox prompt to find first words of Scene 2 dialogue
Match ASR timestamps to find the split boundary
Split with 100ms fades at the boundary
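The boundary search in the steps above might look like this sketch. The function names are hypothetical, the ASR output is assumed to be a list of `(word, start_s, end_s)` tuples, and exact matching after normalization is a simplification; a real implementation may need fuzzy matching to tolerate ASR errors.

```python
import re

def normalize(word):
    """Lowercase and strip punctuation so ASR words and script words compare equal."""
    return re.sub(r"[^\w']", "", word.lower())

def find_split_time(asr_words, scene2_first_words):
    """Return the start time (seconds) of the first Scene 2 word, or None if not found.

    asr_words: list of (word, start_s, end_s) tuples from the ASR model.
    scene2_first_words: the first few words of the Scene 2 dialogue from the prompt.
    """
    target = [w for w in (normalize(x) for x in scene2_first_words) if w]
    if not target:
        return None
    tokens = [normalize(w) for w, _, _ in asr_words]
    # Slide a window over the transcript and look for the exact word sequence.
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            return asr_words[i][1]
    return None
```

The returned timestamp is where the audio would be cut before applying the fades.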

Best-of-N Ranking

Generate N candidate audio samples (default 3)
Score each with:
- WER (Word Error Rate): Parakeet v3 ASR transcription vs expected dialogue
- Content Enjoyment: Empathic Insight Plus (BUD-E-Whisper encoder + MLP)
Composite reward: (1 - min(WER, 1.0)) × content_enjoyment
Select the candidate with the highest reward
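As a rough illustration of the scoring rule, the sketch below computes a word-level WER with a standard edit distance and applies the composite reward. The helper names are hypothetical, the candidate dicts are an assumed data shape, and the real pipeline obtains transcripts from Parakeet and enjoyment scores from Empathic Insight rather than from hand-written values.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    if not ref:
        return 0.0 if not hyp else 1.0
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            # Deletion, insertion, or substitution (free when the words match).
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)

def reward(wer, content_enjoyment):
    """Composite reward; WER is clamped because it can exceed 1.0."""
    return (1.0 - min(wer, 1.0)) * content_enjoyment

def pick_best(candidates):
    """candidates: list of dicts with 'wer' and 'enjoyment'; returns the top-reward one."""
    return max(candidates, key=lambda c: reward(c["wer"], c["enjoyment"]))
```

Because the two terms are multiplied, a candidate with an enjoyable delivery but a badly garbled transcript scores near zero, and one with WER of 1.0 or more scores exactly zero.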

Quick Start

Installation

git clone https://github.com/LAION-AI/Voice-Acting-Pipeline.git
cd Voice-Acting-Pipeline
pip install -e .

For TTS synthesis (requires GPU with ~24GB VRAM):

pip install -e ".[tts]"

For audio refinement and scoring:

pip install -e ".[refinement,scoring]"

Generate Prompts (Mode 1)

# Generate 1000 DramaBox prompts using GPUs 0 and 1
dramabox generate-prompts --config config.json --total 1000 --gpus 0,1

Synthesize Audio (Mode 2)

# Synthesize audio from an existing CSV
dramabox synthesize --csv output/dramabox_chunk_000.csv --gpus 0,1,2,3

End-to-End (Mode 3)

# Generate prompts and immediately synthesize audio
dramabox run --config config.json --total 1000 --gpus 0,1,2,3

Reference Audio Pipeline — Path D (Mode 4)

dramabox reference --config config.json --ref-dir /path/to/references --total 10 --gpus 6,7

Demo Grid (Mode 5)

# Full 4-path demo: A + B + C + D, 10 prompts each, best-of-3 scoring
dramabox demo --config config.json --full --n-prompts 10 --best-of-n 3 --gpus 6,7

Score Audio

dramabox score --audio output/audio/sample_000000_raw.wav --prompt "prompt text" --gpu 0

Configuration

All parameters are in config.json. See config_schema.md for full documentation of every field.

Key Settings

Section	Parameter	Default	Description
`prompt_generation`	`llm_model`	`google/gemma-4-E4B-it`	LLM for prompt generation
`prompt_generation`	`total_prompts`	`100000`	Number of prompts to generate
`sampling`	`archetype_ratio`	`0.20`	Fraction using archetype path
`sampling`	`word_count_min/max`	`10 / 60`	Target dialogue word count range
`tts`	`cfg_scale`	`2.0`	Classifier-free guidance scale
`tts`	`steps`	`30`	Euler flow matching steps
`best_of_n`	`n_candidates`	`3`	Candidates per Best-of-N ranking
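To make the `archetype_ratio` and word-count settings concrete, here is a small sketch of how such values could drive sampling. The function names are hypothetical and the uniform word-count draw is an assumption; only the default values come from the table above.

```python
import random

def choose_path(archetype_ratio=0.20, rng=None):
    """Return 'B' (archetype) with probability archetype_ratio, else 'A' (VoiceNet)."""
    rng = rng or random.Random()
    return "B" if rng.random() < archetype_ratio else "A"

def sample_word_count(word_count_min=10, word_count_max=60, rng=None):
    """Draw a target dialogue length, inclusive of both bounds (assumed uniform)."""
    rng = rng or random.Random()
    return rng.randint(word_count_min, word_count_max)
```

With the default ratio of 0.20, roughly one prompt in five goes down the archetype path, matching the 80% / 20% split quoted for Paths A and B.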

Adding Languages

Languages are configured in config.json. Currently active: English, German, French, Spanish. Ready to enable: Italian, Dutch, Russian, Portuguese, Chinese, Japanese, Korean, Arabic, Hindi, Turkish, Polish, Swedish.

Models Used

Model	Purpose	VRAM
`google/gemma-4-E4B-it`	DramaBox prompt generation	~16GB
`ResembleAI/Dramabox`	TTS synthesis (22B DiT)	~24GB
`nvidia/RE-USE`	Speech enhancement (SEMamba)	~1GB
`Qwen/Qwen3-ASR-1.7B`	Word-level timestamps for audio splitting	~4GB
`nvidia/parakeet-tdt-0.6b-v3`	ASR for WER scoring	~2GB
`laion/Empathic-Insight-Voice-Plus`	Content enjoyment scoring	~2GB
`laion/timbre-whisper`	On-the-fly timbre captioning (Path D)	~2GB
Chatterbox VC	Voice conversion (Path D)	~4GB

Project Structure

dramabox-pipeline/
├── config.json                 # All configurable parameters
├── config_schema.md            # Documentation for config fields
├── pyproject.toml              # Python packaging
├── data/
│   ├── voicenet_ext_taxonomy.html   # VoiceNet (57 dims)
│   ├── all_acting_challenges.json   # 1,478 acting challenge scenarios
│   ├── situation_taxonomy.json      # Situation taxonomy (poses, activities, contexts)
│   ├── emonet_taxonomy.json         # EmoNet (40 emotions)
│   ├── vocal_bursts_taxonomy.json   # Vocal bursts (120 types)
│   ├── archetypes.json              # Archetypes (92 genres × 10)
│   └── wordlists/                   # Per-language word lists
├── dramabox/
│   ├── cli.py                  # CLI entry point
│   ├── config_loader.py        # Config loading and validation
│   ├── taxonomy.py             # Taxonomy parsers and loaders
│   ├── sampling.py             # Path A + Path B sampling
│   ├── reference_sampling.py   # Path D: reference audio sampling
│   ├── prompts.py              # LLM prompt construction
│   ├── prompt_generator.py     # Multi-GPU LLM batch generation
│   ├── tts_synthesizer.py      # Multi-GPU DramaBox TTS
│   ├── reuse_enhance.py        # RE-USE speech enhancement
│   ├── scoring.py              # ASR WER + content enjoyment scoring
│   ├── demo_grid.py            # HTML demo grid generator
│   └── pipeline.py             # Mode 1–6 orchestrator
├── docs/
│   ├── voicenet_taxonomy.md              # VoiceNet 57-dim taxonomy
│   ├── voicenet_extension_taxonomy.html  # Interactive VoiceNet viewer
│   ├── emonet_taxonomy.md                # EmoNet 40 emotions
│   ├── vocal_bursts_taxonomy.md          # 120 vocal bursts
│   ├── archetypes.md                     # 920 archetypes
│   ├── acting_challenges_preview.html    # Acting challenge preview (100 samples)
│   ├── paper_reference.md                # Citation and BibTeX
│   ├── path_a_voicenet.md                # Path A detailed docs
│   ├── path_b_archetype.md               # Path B detailed docs
│   ├── path_c_archetype_named.md         # Path C detailed docs
│   ├── path_d_reference.md               # Path D detailed docs
│   ├── path_ac_acting_challenge.md       # AC + ACCC detailed docs
│   ├── path_cc_character_consistent.md   # CC v1 detailed docs
│   ├── path_cc2_character_consistent_v2.md  # CC2 v2 detailed docs
│   └── demo/                             # HTML demo grids with audio
│       ├── index.html                    # Main 4-path grid
│       ├── reuse.html                    # RE-USE before/after
│       ├── cc.html                       # Character Consistent v1
│       ├── cc2.html                      # Character Consistent v2
│       └── ac.html                       # Acting Challenge
└── examples/                   # Example prompts

Hardware Requirements

Component	Minimum	Recommended
Prompt generation	1 GPU, 16GB VRAM	4+ GPUs, 16GB+ each
TTS synthesis	1 GPU, 24GB VRAM	4+ GPUs, 24GB+ each
Refinement + scoring	1 GPU, 8GB VRAM	1 GPU, 16GB+
RE-USE enhancement	CPU or GPU	1 GPU
RAM	32GB	64GB+

Vocalino V0.1 — Interactive Voice Design Server

The Vocalino server provides a web UI and API for interactive voice design and zero-shot voice cloning. It is independent of the DramaBox data pipeline above.

How It Works

The Concept: "Directing" AI Speech

Standard TTS can render emotions, but in an arbitrary voice. Standard voice conversion (VC) can clone a specific speaker, but it needs source audio that has already been acted. Vocalino decouples vocal identity from performance style by chaining style-directed speech generation with high-fidelity voice conversion.

Architecture

                     ┌────────────────────┐
    Text + Style ──> │  Qwen3-TTS 1.7B    │ ──> Raw TTS audio
                     │  (VoiceDesign)      │     (12 Hz codec tokens → wav)
                     └────────────────────┘
                              │
                              ▼
                     ┌────────────────────┐
    Reference WAV ─> │  Seed-VC V2        │ ──> Voice-converted audio
                     │  (CFM + AR)        │     (matches reference timbre)
                     └────────────────────┘
                              │
                              ▼
                     ┌────────────────────┐
                     │  ECAPA-TDNN        │ ──> 2048-dim embedding
                     │  (Speaker Encoder) │     → cosine similarity vs ref
                     └────────────────────┘

Features

Web UI — dark-themed browser interface served at /ui for interactive voice design
Batched TTS — generate K candidates in a single forward pass (~2x faster)
SSE Streaming — candidates stream to the UI as they complete
Speaker Similarity Ranking — ECAPA-TDNN embeddings rank candidates by voice consistency
INT8 Quantization — optional bitsandbytes INT8 reduces TTS VRAM from ~15 GB to ~7 GB
Multi-GPU — split TTS and VC across GPUs for VRAM isolation

Server Quick Start

# Basic launch (single GPU, bfloat16)
python server.py

# With INT8 quantization (halves TTS VRAM)
TTS_QUANTIZE=int8 python server.py

# Multi-GPU (TTS on GPU 0, VC on GPU 1)
CUDA_VISIBLE_DEVICES=0,1 VC_DEVICE=cuda:1 python server.py

The server starts on http://0.0.0.0:8000. Open the web UI at http://<server-ip>:8000/ui/.

Web UI

Section 1: Voice Design (Reference Creation)

Enter text and a natural-language voice/style description
Generate N samples (batched for speed)
Listen, download, or select any sample as reference

Section 2: Full Pipeline (Voice-Consistent Generation)

Upload or select a reference audio (target speaker identity)
Enter text and emotion/style instruction
Generate K candidates — each streamed to the UI as it completes
Candidates ranked by speaker embedding similarity (green = best match)
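The ranking step reduces to cosine similarity between speaker embeddings. The sketch below assumes the embeddings have already been extracted as plain lists of floats; the function names are hypothetical and the server's own implementation may differ.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def rank_by_similarity(reference_emb, candidate_embs):
    """Return candidate indices sorted from most to least similar to the reference."""
    scores = [cosine_similarity(reference_emb, e) for e in candidate_embs]
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

The first index in the returned list corresponds to the candidate the UI would highlight as the best match.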

API Reference

Endpoint	Method	Description
`/tts/generate-voice-design`	POST	Generate speech with style prompt
`/voice-design/batch`	POST	Batched voice design (N samples)
`/vc/convert`	POST	Voice conversion with Seed-VC V2
`/pipeline/tts-then-vc`	POST	TTS + voice conversion combined
`/pipeline/ranked`	POST	Generate K candidates, rank by similarity
`/pipeline/ranked-stream`	POST (SSE)	Streaming version of ranked pipeline
`/health`	GET	Server status and configuration
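A quick way to check that the server is up is to call `/health`. The sketch below uses only the standard library; the helper names are hypothetical, and parsing the response as JSON is an assumption, since the response format is not documented here. Request bodies for the POST endpoints are likewise not shown because their fields are not specified in this README.

```python
import json
import urllib.request

def endpoint_url(path, host="localhost", port=8000):
    """Build a full URL for one of the server's endpoints."""
    return f"http://{host}:{port}/{path.lstrip('/')}"

def fetch_health(host="localhost", port=8000, timeout=5.0):
    """GET /health and return the parsed body (assumes the server replies with JSON)."""
    url = endpoint_url("/health", host, port)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

With a server running locally, `fetch_health()` should return its status and configuration.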

Server Configuration

Variable	Default	Description
`TTS_DEVICE`	`cuda:0`	GPU for Qwen3-TTS
`VC_DEVICE`	(same as TTS)	GPU for Seed-VC
`TTS_QUANTIZE`	`none`	`none` = bfloat16, `int8` = INT8
`DEFAULT_DIFF_STEPS`	`12`	VC diffusion steps
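The defaults in the table can be read as in the following sketch. It illustrates the fallback behaviour only (for example, `VC_DEVICE` defaulting to the TTS device) and is not the code in `server.py`; the function name and the returned dict keys are hypothetical.

```python
import os

def load_server_config(env=None):
    """Read server settings from environment variables, applying the documented defaults."""
    env = os.environ if env is None else env
    tts_device = env.get("TTS_DEVICE", "cuda:0")
    quantize = env.get("TTS_QUANTIZE", "none")
    if quantize not in ("none", "int8"):
        raise ValueError(f"TTS_QUANTIZE must be 'none' or 'int8', got {quantize!r}")
    return {
        "tts_device": tts_device,
        # VC runs on the TTS device unless VC_DEVICE is set explicitly.
        "vc_device": env.get("VC_DEVICE", tts_device),
        "tts_quantize": quantize,
        "diff_steps": int(env.get("DEFAULT_DIFF_STEPS", "12")),
    }
```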

License

This pipeline code — Apache 2.0
DramaBox — see model card
Qwen3-TTS — Apache 2.0
Seed-VC — MIT
