feat: Add Vibevoice7B model support #640

Open
Talpik wants to merge 15 commits into Blaizzy:main from Talpik:codex-vibevoice-pr-clean

Conversation

@Talpik Talpik commented Apr 7, 2026

Context

This PR improves non-streaming VibeVoice-7B support in mlx-audio and makes it substantially more
practical on Apple Silicon.

The work had two goals:

  1. align the MLX non-streaming inference path more closely with the original vibevoice-community/VibeVoice implementation
  2. improve deployment options by adding model-aware selective quantization and validating practical
    speed/quality tradeoffs

The implementation was tuned and validated against the original upstream repo on a MacBook Pro 16" (M4 Max, 48GB RAM).

Description

The non-streaming VibeVoice-7B path was adapted and corrected to better match upstream behavior,
including prompt handling, reference voice prefill, negative-branch logic, grouped-query attention
behavior, and acoustic encoder parity.

After parity was improved, the runtime was profiled and selective quantization support was added in a
model-aware way, instead of relying only on the generic “quantize everything quantizable” path.
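As a sketch of what "model-aware" means here, the policy can be thought of as a predicate over weight paths: the language model is always eligible, the prediction head is opted in via the environment variable used later in this description, and everything else stays in full precision. The prefixes below are hypothetical stand-ins, not the PR's actual module names:

```python
import os

# Illustrative selective-quantization policy for a VibeVoice-style model.
# LM_PREFIX and PREDICTION_HEAD_PREFIX are hypothetical stand-ins, not the
# PR's actual module paths.
LM_PREFIX = "language_model."
PREDICTION_HEAD_PREFIX = "prediction_head."

def should_quantize(path: str) -> bool:
    """Return True if the weight at `path` belongs to a component the
    policy allows to be quantized (LM always; prediction head opt-in)."""
    if path.startswith(LM_PREFIX):
        return True
    if os.environ.get("MLX_AUDIO_VIBEVOICE_QUANTIZE_PREDICTION_HEAD") == "1":
        return path.startswith(PREDICTION_HEAD_PREFIX)
    return False
```

A predicate like this slots naturally into MLX-style quantization helpers that accept a per-module filter, which is what keeps the acoustic components out of the quantized set.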

The PR also makes selectively-quantized checkpoints self-describing by saving quantized component
metadata into config.json, and updates model loading so quantized packed weights can be reconstructed
correctly during reload.

Experimental preset layers, speculative CFG speed paths, and investigation-only CLI tuning flags were
intentionally left out of this PR to keep the public surface area small and stable.

Changes in the codebase

  • improved non-streaming VibeVoice-7B inference parity with upstream
  • fixed multi-speaker and control-flow related issues in the non-streaming path
  • fixed grouped-query attention parity in the Qwen2 LM path
  • fixed acoustic encoder causal/strided padding parity for reference prefill
  • added VibeVoice-specific selective quantization policy
  • added support for persisting quantized_components metadata during conversion
  • fixed quantized checkpoint reload so packed quantized layers are reconstructed correctly
  • cleaned the public mlx_audio.tts.generate CLI by removing VibeVoice-specific experimental tuning
    flags
  • removed internal non-streaming preset wrappers and speculative CFG speed paths from the PR version
  • expanded VibeVoice test coverage for parity-sensitive and quantized-loading behavior

Changes outside the codebase

No infrastructure, database, or external service changes were made.

The original vibevoice-community/VibeVoice repository was used as the behavioral reference for parity
checks and runtime comparisons.

Additional information

Main comparison setup:

  • same reference speaker
  • same text
  • cfg_scale=1.5
  • ddpm_steps=10

Main comparison matrix

| Mode | Audio Duration | Processing Time | Speed Factor | Peak Memory | Notes |
|---|---|---|---|---|---|
| Original PyTorch full | 25.07s | 42.65s | 0.59x | n/a | Quality reference |
| MLX full | 26.00s | 103.02s | 0.25x | 25.21GB | MLX quality baseline |
| MLX full Q6 | 25.60s | 21.12s | 1.21x | 9.57GB | Best general speed/quality tradeoff |
| MLX selective LM + prediction_head Q8 | 25.73s | 24.02s | 1.07x | 14.19GB | Strong accelerated-full candidate |
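For reference, the Speed Factor column is the real-time factor: generated audio duration divided by wall-clock processing time, so values above 1x are faster than real time. The table's values can be reproduced from the other two columns:

```python
def speed_factor(audio_duration_s: float, processing_time_s: float) -> float:
    """Real-time factor: >1.0 means generation is faster than playback."""
    return round(audio_duration_s / processing_time_s, 2)

# e.g. the MLX full Q6 row: 25.60s of audio in 21.12s of processing
# speed_factor(25.60, 21.12) -> 1.21
```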

Model preparation

Before running the commands below, download the original upstream
vibevoice-community/VibeVoice weights locally.

Assume the original PyTorch checkpoint is available at:

  • <ORIGINAL_VIBEVOICE_WEIGHTS>

1. Full MLX port

python -m mlx_audio.convert \
  --hf-path <ORIGINAL_VIBEVOICE_WEIGHTS> \
  --mlx-path <OUTPUT_FULL_MLX_PATH>

2. Full-model quantized Q6

python -m mlx_audio.convert \
  --hf-path <OUTPUT_FULL_MLX_PATH> \
  --mlx-path <OUTPUT_Q6_MLX_PATH> \
  --quantize \
  --q-bits 6

3. Selective accelerated-full mode (LM + prediction_head Q8)

MLX_AUDIO_VIBEVOICE_QUANTIZE_PREDICTION_HEAD=1 \
python -m mlx_audio.convert \
  --hf-path <OUTPUT_FULL_MLX_PATH> \
  --mlx-path <OUTPUT_LM_PRED_Q8_MLX_PATH> \
  --quantize \
  --q-bits 8

Generation command

Use the same generation command for all three variants by changing only
the --model path.

python -m mlx_audio.tts.generate \
  --model <MODEL_PATH> \
  --ref_audio <REFERENCE_AUDIO> \
  --text "$(cat <TEXT_FILE>)" \
  --cfg_scale 1.5 \
  --ddpm_steps 10 \
  --output_path <OUTPUT_DIR> \
  --file_prefix <RUN_NAME> \
  --audio_format wav \
  --verbose

Examples for <MODEL_PATH>:

- full MLX: <OUTPUT_FULL_MLX_PATH>
- full Q6: <OUTPUT_Q6_MLX_PATH>
- selective LM + prediction_head Q8: <OUTPUT_LM_PRED_Q8_MLX_PATH>

Checklist

- [x] Tests added/updated
- [x] Documentation updated
- [ ] Issue referenced (e.g., "Closes #...")

@Talpik Talpik changed the title feat(vibevoice): improve non-streaming parity and selective quantization feat(vibevoice7b): improve non-streaming parity and selective quantization Apr 7, 2026
@Talpik Talpik changed the title feat(vibevoice7b): improve non-streaming parity and selective quantization feat: Add Vibevoice7B model support Apr 9, 2026
Collaborator

@lucasnewman lucasnewman left a comment


@Talpik Thanks for the contribution!

Can we split this into two PRs to reduce the size of the change set to review?

  1. The model updates, which should be verifiable against the reference implementation.
  2. The selective quantization enhancements that you've added for the performance tuning.

