feat: add Higgs Audio v2 — 3B Llama-backed TTS with voice cloning by Kairos-a · Pull Request #656 · Blaizzy/mlx-audio

Kairos-a · 2026-04-18T09:38:58Z

Closes #1.

This PR adds Higgs Audio v2, Boson AI's 3B Llama-3.2-backed TTS with multi-codebook acoustic tokens and delay-pattern streaming. The port reuses the in-tree HiggsAudioTokenizer (added 2026-04-14 for the OmniVoice PR) for the codec path and ships a complete serve-engine equivalent for voice cloning.

Benchmarks (M5 Max, warm cache, long-prompt RTF)

variant	RTF	weights size	source
bf16	0.60×	6.8 GB	`bosonai/higgs-audio-v2-generation-3B-base` (authoritative)
q8	0.36×	6.18 GB	`mlx-community/higgs-audio-v2-3B-mlx-q8`
q6	0.33×	4.75 GB	`mlx-community/higgs-audio-v2-3B-mlx-q6`

Real-time voice cloning on Apple Silicon. bf16 loads directly from the authoritative bosonai upload — no redundant mlx-community re-host. q8 and q6 are MLX-specific selectively-quantized variants.
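The PR does not spell out how RTF is computed. As a minimal sketch, assuming the usual convention (wall-clock generation time divided by the duration of the audio produced, so values below 1.0 are faster than real time), the table can be read as follows. The helper names are hypothetical and are not part of mlx-audio.

```python
def real_time_factor(generation_seconds: float, audio_seconds: float) -> float:
    """Assumed convention: RTF = generation time / duration of generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return generation_seconds / audio_seconds


def generation_time(rtf: float, audio_seconds: float) -> float:
    """Wall-clock seconds needed to produce `audio_seconds` of audio at a given RTF."""
    return rtf * audio_seconds


# RTF values copied from the benchmark table above.
for name, rtf in {"bf16": 0.60, "q8": 0.36, "q6": 0.33}.items():
    print(f"{name}: about {generation_time(rtf, 10.0):.1f} s per 10 s of audio")
```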

What's added

Model — mlx_audio/tts/models/higgs_audio/

HiggsAudioConfig / HiggsTextConfig — from-dict constructor for the bosonai config.json.
HiggsAudioModel — Llama-3.2-3B backbone with Higgs-specific dual-FFN audio layers (shared attention, split pre-attn-norm + post-attn-norm + MLP between text and audio paths, routed by audio_out_mask). HiggsAudioDecoderProjector with separate text and audio LM heads. Full generate() implementing the AUDIO_INIT + delay-pattern ramp-in + EOS ramp-out state machine.
generation.py — build_delay_pattern_mask, revert_delay_pattern, apply_delay_pattern, lookup_audio_embedding, greedy_sample_audio, sample_audio (temperature + top-p + top-k via Gumbel-max).
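To illustrate the delay-pattern round trip that generation.py is built around, here is a minimal pure-Python sketch. It assumes the common layout in which codebook i is shifted right by i frames, left-padded with a stream-BOS id and right-padded with a stream-EOS id. The function names, signatures, and token ids are placeholders and do not reproduce the PR's build_delay_pattern_mask / apply_delay_pattern / revert_delay_pattern.

```python
from typing import List

BOS, EOS = 1024, 1025  # placeholder ids, chosen only for this illustration


def apply_delay(codes: List[List[int]], bos: int = BOS, eos: int = EOS) -> List[List[int]]:
    """Shift codebook i right by i frames: (K, T) -> (K, T + K - 1)."""
    k = len(codes)
    return [[bos] * i + list(row) + [eos] * (k - 1 - i) for i, row in enumerate(codes)]


def revert_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: (K, T + K - 1) -> (K, T)."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


codes = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
print(revert_delay(delayed) == codes)
```

The round-trip property (revert after apply returns the original codes) is the kind of invariant the unit tests in this PR are described as checking.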

Serving — serve.py

HiggsAudioServer.from_pretrained() — loads model + codec + tokenizer in one call.
generate() — ChatML prompt assembly (voice-clone mode when reference_audio_path is provided, smart-voice mode otherwise), reference audio encoded via the in-tree HiggsAudioTokenizer, inputs_embeds stitched with the audio_out_mask routing through the dual-FFN audio path.
generate_stream() — yields PCM chunks at chunk_ms boundaries for Pipecat / streaming consumers. Quality-preserving (per-chunk identical to non-streaming). Mid-generation emission with overlap-add is noted as follow-up work.
RAS (repetition-avoidance sampling) on by default (Higgs: ras_win_len=7, ras_max_repeat=2).
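The PR names the RAS parameters but does not describe the rule itself. One plausible reading, sketched below under that assumption, follows the general repetition-aware sampling idea: take a candidate from the sharpened sampler, and if it already appears at least ras_max_repeat times within the last ras_win_len tokens of that codebook, redraw from the full distribution. The function name, the greedy stand-in for the top-k/top-p sampler, and the exact threshold comparison are all assumptions.

```python
import random
from typing import List, Optional, Sequence


def ras_sample(
    probs: Sequence[float],
    history: List[int],
    win_len: int = 7,
    max_repeat: int = 2,
    rng: Optional[random.Random] = None,
) -> int:
    """Pick a token, falling back to a full-distribution draw when it repeats too often."""
    if win_len <= 0:
        raise ValueError("win_len must be positive")
    rng = rng or random.Random()
    # Stand-in for the real temperature/top-k/top-p sampler: take the most likely token.
    candidate = max(range(len(probs)), key=probs.__getitem__)
    window = history[-win_len:]
    if window.count(candidate) >= max_repeat:
        # Candidate is over-represented in the recent window: redraw without truncation.
        candidate = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return candidate
```

With the defaults quoted above (window of 7, at most 2 repeats), a token that has not been emitted recently passes through unchanged, while a token that dominates the window is redrawn and may be replaced by a less likely alternative.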

Docs — docs/models/tts/higgs_audio.md covers basic usage, voice cloning, streaming, quantization (including the class_predicate pattern that protects the audio head from quantization), sampling controls, and the generation state machine notes.

Tests — mlx_audio/tts/tests/test_higgs_audio.py — 9 unit tests (0.11s on M5 Max) covering delay-pattern shape + round-trip invariants, audio embedding lookup, sampling equivalence at T=0, and model forward on a tiny synthetic config.

Example — examples/higgs_audio_clone_demo.py — CLI demo matching the omnivoice_clone_demo.py convention.

The non-obvious piece — AUDIO_INIT

The generation state machine is the gotcha for this port. The first audio frame must not be sampled from audio_logits at the <|audio_out_bos|> text position: those logits have a huge bias toward audio_stream_eos_id because the model was trained with the <|AUDIO_OUT|> placeholder-substitution convention, not direct audio prediction. Instead, a synthetic frame with every codebook set to audio_stream_bos_id is force-injected (AUDIO_INIT), and then the K-frame delay-pattern ramp-in begins: codebook i starts emitting at frame i, and the rest stay at BOS. When any codebook emits EOS, a K-frame ramp-out forces the trailing codebooks to EOS before termination.

Skipping this produces a stuck pitch (first-frame EOS on half the codebooks → rest is garbage). I've documented this in docs/models/tts/higgs_audio.md so future ports/maintainers don't rediscover it.
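The state machine described above can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the PR's Model.generate: sample_frame is a hypothetical stand-in for the model plus sampler, the token ids are placeholders, ramp-in frames are counted from the first sampled frame after AUDIO_INIT, and the ramp-out is read as forcing codebook i to EOS i frames after the first EOS is seen.

```python
from typing import Callable, List

BOS, EOS = 1024, 1025  # placeholder stream-BOS / stream-EOS ids


def generate_frames(
    sample_frame: Callable[[List[List[int]]], List[int]],
    num_codebooks: int,
    max_new_frames: int,
) -> List[List[int]]:
    k = num_codebooks
    frames = [[BOS] * k]  # AUDIO_INIT: injected, never sampled
    eos_step = None  # index of the sampled frame where EOS first appeared
    for n in range(max_new_frames):
        frame = list(sample_frame(frames))
        # Ramp-in: codebook i may only emit from sampled frame i onward.
        for i in range(k):
            if i > n:
                frame[i] = BOS
        if eos_step is None and EOS in frame:
            eos_step = n
        if eos_step is not None:
            # Ramp-out: one more codebook is forced to EOS on each frame.
            for i in range(min(n - eos_step + 1, k)):
                frame[i] = EOS
        frames.append(frame)
        if eos_step is not None and n - eos_step >= k - 1:
            break
    return frames


def toy_sampler(frames: List[List[int]]) -> List[int]:
    """Emit token 7 on every codebook, then EOS on codebook 0 from the fifth call."""
    calls_so_far = len(frames) - 1
    return [7, 7, 7] if calls_so_far < 4 else [EOS, 7, 7]


for frame in generate_frames(toy_sampler, num_codebooks=3, max_new_frames=50):
    print(frame)
```

In this toy run each codebook ends up with the same number of real tokens once the delay pattern is reverted, which is the property the ramp-in and ramp-out exist to preserve.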

Checkpoints

bf16 — use bosonai/higgs-audio-v2-generation-3B-base directly (authoritative, no re-host needed)
q8 — mlx-community/higgs-audio-v2-3B-mlx-q8
q6 — mlx-community/higgs-audio-v2-3B-mlx-q6
codec — mlx-community/higgs-audio-v2-tokenizer (upstream bosonai repacked into the audio_tokenizer/ subdir layout that mlx-audio's HiggsAudioTokenizer.from_pretrained expects)

Quantized variants carry a quantization block in config.json so HiggsAudioServer.from_pretrained auto-quantizes the skeleton before loading weights.
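As a rough illustration of how a loader might detect such a block, the sketch below parses a config.json string and extracts the quantization settings. The key names group_size and bits follow the convention used by mlx-lm checkpoints and are an assumption here; the PR does not show the actual block, and quantization_settings is a hypothetical helper.

```python
import json
from typing import Dict, Optional


def quantization_settings(config_json: str) -> Optional[Dict[str, int]]:
    """Return the quantization settings if the config carries them, else None."""
    config = json.loads(config_json)
    block = config.get("quantization")
    if block is None:
        return None  # unquantized checkpoint: load weights as-is
    # Assumed key names; a real checkpoint may carry additional fields.
    return {"group_size": int(block["group_size"]), "bits": int(block["bits"])}


print(quantization_settings('{"quantization": {"group_size": 64, "bits": 8}}'))
print(quantization_settings("{}"))
```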

Quantization notes

MLX's native nn.quantize works on the Llama backbone. The docs recommend protecting audio_codebook_embeddings and audio_decoder_proj.audio_lm_head from quantization via class_predicate: quantizing those specific layers introduces voice-character artifacts (pitch-register drift at q6 without protection, trajectory instability at q4).
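A predicate of the kind described here could look like the sketch below. It is written without importing MLX so it runs anywhere; in practice it would be passed as the class_predicate argument of mlx.nn.quantize, which calls it with each module's dotted path and the module itself. The function name and the substring matching on paths are assumptions, and the PR's own Model.model_quant_predicate may differ.

```python
# Layer names taken from the PR text; matching by substring is an assumption.
PROTECTED = ("audio_codebook_embeddings", "audio_decoder_proj.audio_lm_head")


def audio_safe_predicate(path: str, module: object) -> bool:
    """Return True only for quantizable modules outside the protected audio layers."""
    if not hasattr(module, "to_quantized"):
        return False  # module type has no quantized form (norms, activations, ...)
    return not any(name in path for name in PROTECTED)


# Hypothetical usage, assuming `model` is an mlx.nn.Module and group_size=64:
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=8, class_predicate=audio_safe_predicate)
```

Keeping the two audio layers at bf16 costs little, since the Llama backbone accounts for most of the weights, and it avoids the artifacts described above.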

q4 is deferred from this PR. Rounding noise at 4-bit pushes per-codebook logits close to the sampling decision threshold, so specific Gumbel draws produce clean output while others diverge. The pattern is seed-dependent, not fixable by temperature tuning alone. Follow-up candidates: AWQ-style post-training calibration, smaller group_size, or active retry-on-spiral detection.

Testing

python -m unittest mlx_audio.tts.tests.test_higgs_audio -v
# 9 tests pass in 0.11s

python examples/higgs_audio_clone_demo.py \
    --ref_audio path/to/reference.wav \
    --ref_text "Transcript of the reference clip." \
    --text "Hello from Higgs Audio on MLX."

Credits

Port built on top of the in-tree HiggsAudioTokenizer from the OmniVoice PR (#630). Model code reimplemented from bosonai/higgs-audio against the Llama-3.2-3B architecture in mlx-lm.

lucasnewman · 2026-04-18T22:53:14Z

@Kairos-a Thanks for the contribution! Overall this looks pretty good -- see a couple of comments around how it can better integrate into the existing codebase.

Port of Boson AI's Higgs Audio v2 to MLX. Llama-3.2-3B backbone with a dual-FFN decoder layer (text + audio paths share self-attention; LN + MLP are per-path, routed by audio_out_mask) and delay-pattern audio emission (codebook i lags by i frames).

Framework interface
-------------------
Exposes `Model` and `ModelConfig` in the shape expected by `mlx_audio.tts.utils.load()` and `python -m mlx_audio.tts.generate`:

    python -m mlx_audio.tts.generate \
        --model mlx-community/higgs-audio-v2-3B-mlx-q8 \
        --text "Hello" --ref_audio voice.wav --ref_text "..."

`Model` subclasses the underlying `HiggsAudioModel` so safetensors keys land unchanged. Tokenizer + codec attach via `post_load_hook`. The generate() signature matches the standard TTS convention (text, voice, ref_audio, ref_text, ...) and yields a mlx_audio.tts.models.base GenerationResult. `HiggsAudioServer` is kept as an additional Python entrypoint for Higgs-specific kwargs (max_new_frames, ras_win_len, fade_in_ms, etc.).

Load-bearing details
--------------------
The first generated audio frame must be a synthetic all-stream_bos frame: sampling from the model's audio_logits at the <|audio_out_bos|> text position collapses to stream-EOS on half the codebooks (those positions were never trained for direct audio prediction). See Model.generate and HiggsAudioModel._generate_raw_frames for the full AUDIO_INIT + K-frame ramp-in + EOS ramp-out state machine.

Quantization
------------
MLX native 4/6/8-bit on the Llama backbone. `Model.model_quant_predicate` protects `audio_codebook_embeddings` and `audio_decoder_proj.audio_lm_head` at bf16; quantizing them introduces voice-character drift (pitch register shifts at q6, trajectory instability at q4). RTF on M5 Max: bf16 0.60× / q8 0.36× / q6 0.33×.

Bundled assets
--------------
Three drop-in reference voices in examples/voice_prompts/ (en_woman, en_man, en_man_deep), generated via smart-voice mode so they're license-clean. Example: examples/higgs_audio_clone_demo.py.

Tests
-----
16 unit tests in mlx_audio/tts/tests/test_higgs_audio.py: delay-pattern round-trip, audio-embedding lookup, sampling equivalence at T=0, tiny config model forward, selective-quant predicate verification, and the framework-interface contract (Model subclass, ModelConfig.from_dict, sample_rate, model_quant_predicate, generate-before-load guard).

References
----------
Original: https://github.com/boson-ai/higgs-audio
HF (bf16): https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base
HF (q8): https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8
HF (q6): https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6
HF codec: https://huggingface.co/mlx-community/higgs-audio-v2-tokenizer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Matches the pre-commit config (.pre-commit-config.yaml — black 24.2.0, isort 5.13.2 profile=black) enforced by the PR CI style job. No code changes — pure formatting + import-order + uv.lock reconciliation with the pydub dep already declared in pyproject.toml.

lucasnewman · 2026-04-20T18:34:28Z

@Kairos-a Can you add a README.md in the model directory (see the other models for examples) and then we can merge? Thanks!

@lucasnewman

Addresses @lucasnewman's review request. Covers usage (CLI + standard Python API + Higgs-specific HiggsAudioServer), voice cloning, parameter table, available models / quantizations / RTF / memory, conversion, architecture, and license. Links to the full port docs at docs/models/tts/higgs_audio.md for depth.

lucasnewman

Thanks!

lucasnewman reviewed Apr 18, 2026

Comment thread mlx_audio/tts/models/higgs_audio/higgs_audio.py Outdated

lucasnewman reviewed Apr 18, 2026

Comment thread mlx_audio/tts/models/higgs_audio/serve.py

Kairos-a force-pushed the higgs-audio-v2-port branch from 52a0c2a to 6cf56a5 Compare April 19, 2026 00:12

lucasnewman approved these changes Apr 21, 2026

lucasnewman merged commit 3be6eb4 into Blaizzy:main Apr 21, 2026
11 checks passed

Kairos-a mentioned this pull request Apr 22, 2026

feat(higgs_audio): add ReferenceContext for reusable encoded-reference state #666

Merged

Kairos-a deleted the higgs-audio-v2-port branch April 22, 2026 02:47

lucasnewman merged 3 commits into Blaizzy:main from kaioct-labs:higgs-audio-v2-port
