feat: Add Vibevoice7B model support #640

Open
Talpik wants to merge 15 commits into Blaizzy:main from Talpik:codex-vibevoice-pr-clean

Conversation

@Talpik Talpik commented Apr 7, 2026

Context

This PR improves non-streaming VibeVoice-7B support in mlx-audio and makes it substantially more
practical on Apple Silicon.

The work had two goals:

  1. align the MLX non-streaming inference path more closely with the original vibevoice-community/VibeVoice implementation
  2. improve deployment options by adding model-aware selective quantization and validating practical
    speed/quality tradeoffs

The implementation was tuned and validated against the original upstream repo on a MacBook Pro 16" (M4 Max, 48GB RAM).

Description

The non-streaming VibeVoice-7B path was adapted and corrected to better match upstream behavior,
including prompt handling, reference voice prefill, negative-branch logic, grouped-query attention
behavior, and acoustic encoder parity.

After parity was improved, the runtime was profiled and selective quantization support was added in a
model-aware way, instead of relying only on the generic “quantize everything quantizable” path.
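As a sketch of what "model-aware" means here, the policy can be thought of as a predicate over weight paths: the language model is always eligible, the prediction head is opted in via the environment variable used later in this description, and everything else stays in full precision. The prefixes below are hypothetical stand-ins, not the PR's actual module names:

```python
import os

# Illustrative selective-quantization policy for a VibeVoice-style model.
# LM_PREFIX and PREDICTION_HEAD_PREFIX are hypothetical stand-ins, not the
# PR's actual module paths.
LM_PREFIX = "language_model."
PREDICTION_HEAD_PREFIX = "prediction_head."

def should_quantize(path: str) -> bool:
    """Return True if the weight at `path` belongs to a component the
    policy allows to be quantized (LM always; prediction head opt-in)."""
    if path.startswith(LM_PREFIX):
        return True
    if os.environ.get("MLX_AUDIO_VIBEVOICE_QUANTIZE_PREDICTION_HEAD") == "1":
        return path.startswith(PREDICTION_HEAD_PREFIX)
    return False
```

A predicate like this slots naturally into MLX-style quantization helpers that accept a per-module filter, which is what keeps the acoustic components out of the quantized set.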

The PR also makes selectively-quantized checkpoints self-describing by saving quantized component
metadata into config.json, and updates model loading so quantized packed weights can be reconstructed
correctly during reload.

Experimental preset layers, speculative CFG speed paths, and investigation-only CLI tuning flags were
intentionally left out of this PR to keep the public surface area small and stable.

Changes in the codebase

  • improved non-streaming VibeVoice-7B inference parity with upstream
  • fixed multi-speaker and control-flow related issues in the non-streaming path
  • fixed grouped-query attention parity in the Qwen2 LM path
  • fixed acoustic encoder causal/strided padding parity for reference prefill
  • added VibeVoice-specific selective quantization policy
  • added support for persisting quantized_components metadata during conversion
  • fixed quantized checkpoint reload so packed quantized layers are reconstructed correctly
  • cleaned the public mlx_audio.tts.generate CLI by removing VibeVoice-specific experimental tuning
    flags
  • removed internal non-streaming preset wrappers and speculative CFG speed paths from the PR version
  • expanded VibeVoice test coverage for parity-sensitive and quantized-loading behavior

Changes outside the codebase

No infrastructure, database, or external service changes were made.

The original vibevoice-community/VibeVoice repository was used as the behavioral reference for parity
checks and runtime comparisons.

Additional information

Main comparison setup:

  • same reference speaker
  • same text
  • cfg_scale=1.5
  • ddpm_steps=10

Main comparison matrix

| Mode | Audio Duration | Processing Time | Speed Factor | Peak Memory | Notes |
|---|---|---|---|---|---|
| Original PyTorch full | 25.07s | 42.65s | 0.59x | n/a | Quality reference |
| MLX full | 26.00s | 103.02s | 0.25x | 25.21GB | MLX quality baseline |
| MLX full Q6 | 25.60s | 21.12s | 1.21x | 9.57GB | Best general speed/quality tradeoff |
| MLX selective LM + prediction_head Q8 | 25.73s | 24.02s | 1.07x | 14.19GB | Strong accelerated-full candidate |
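For reference, the Speed Factor column is the real-time factor: generated audio duration divided by wall-clock processing time, so values above 1x are faster than real time. The table's values can be reproduced from the other two columns:

```python
def speed_factor(audio_duration_s: float, processing_time_s: float) -> float:
    """Real-time factor: >1.0 means generation is faster than playback."""
    return round(audio_duration_s / processing_time_s, 2)

# e.g. the MLX full Q6 row: 25.60s of audio in 21.12s of processing
# speed_factor(25.60, 21.12) -> 1.21
```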

Model preparation

Before running the commands below, download the original upstream
vibevoice-community/VibeVoice weights locally.

Assume the original PyTorch checkpoint is available at:

  • <ORIGINAL_VIBEVOICE_WEIGHTS>

1. Full MLX port

python -m mlx_audio.convert \
  --hf-path <ORIGINAL_VIBEVOICE_WEIGHTS> \
  --mlx-path <OUTPUT_FULL_MLX_PATH>

2. Full-model quantized Q6

python -m mlx_audio.convert \
  --hf-path <OUTPUT_FULL_MLX_PATH> \
  --mlx-path <OUTPUT_Q6_MLX_PATH> \
  --quantize \
  --q-bits 6

3. Selective accelerated-full mode (LM + prediction_head Q8)

MLX_AUDIO_VIBEVOICE_QUANTIZE_PREDICTION_HEAD=1 \
python -m mlx_audio.convert \
  --hf-path <OUTPUT_FULL_MLX_PATH> \
  --mlx-path <OUTPUT_LM_PRED_Q8_MLX_PATH> \
  --quantize \
  --q-bits 8

Generation command

Use the same generation command for all three variants by changing only
the --model path.

python -m mlx_audio.tts.generate \
  --model <MODEL_PATH> \
  --ref_audio <REFERENCE_AUDIO> \
  --text "$(cat <TEXT_FILE>)" \
  --cfg_scale 1.5 \
  --ddpm_steps 10 \
  --output_path <OUTPUT_DIR> \
  --file_prefix <RUN_NAME> \
  --audio_format wav \
  --verbose

Examples for <MODEL_PATH>:

- full MLX: <OUTPUT_FULL_MLX_PATH>
- full Q6: <OUTPUT_Q6_MLX_PATH>
- selective LM + prediction_head Q8: <OUTPUT_LM_PRED_Q8_MLX_PATH>

Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [ ] Issue referenced (e.g., "Closes #...")

@Talpik Talpik changed the title feat(vibevoice): improve non-streaming parity and selective quantization feat(vibevoice7b): improve non-streaming parity and selective quantization Apr 7, 2026
@Talpik Talpik changed the title feat(vibevoice7b): improve non-streaming parity and selective quantization feat: Add Vibevoice7B model support Apr 9, 2026
Collaborator

@lucasnewman lucasnewman left a comment


@Talpik Thanks for the contribution!

Can we split this into two PRs to reduce the size of the change set to review?

  1. The model updates, which should be verifiable against the reference implementation.
  2. The selective quantization enhancements that you've added for the performance tuning.

