feat: add Higgs Audio v2 — 3B Llama-backed TTS with voice cloning#656
Merged
lucasnewman
reviewed
Apr 18, 2026
Collaborator
@Kairos-a Thanks for the contribution! Overall this looks pretty good -- see a couple of comments on how it can better integrate into the existing codebase.
Port of Boson AI's Higgs Audio v2 to MLX. Llama-3.2-3B backbone with a
dual-FFN decoder layer (text + audio paths share self-attention; LN + MLP
are per-path, routed by audio_out_mask) and delay-pattern audio emission
(codebook i lags by i frames).
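The delay pattern described above can be illustrated with a small, dependency-free sketch. The names `apply_delay`, `revert_delay`, and the `PAD` filler are hypothetical and chosen for illustration only; the port's own helpers may use a different layout and different filler tokens.

```python
PAD = -1  # hypothetical filler for slots outside a codebook's active window


def apply_delay(codes):
    """Shift codebook i right by i frames. `codes` is K lists of T tokens."""
    k, t = len(codes), len(codes[0])
    width = t + k - 1  # the last codebook finishes k - 1 frames late
    return [
        [PAD] * i + list(row) + [PAD] * (width - t - i)
        for i, row in enumerate(codes)
    ]


def revert_delay(delayed):
    """Undo apply_delay: read T tokens of row i starting at column i."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)
    return [row[i:i + t] for i, row in enumerate(delayed)]


codes = [[10, 11, 12], [20, 21, 22], [30, 31, 32]]
delayed = apply_delay(codes)
for row in delayed:
    print(row)
```

Each column of `delayed` is one emitted frame, so codebook 0 leads and codebook `K - 1` trails by `K - 1` frames. Reverting the pattern recovers the time-aligned codes that a codec decoder would consume.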
Framework interface
-------------------
Exposes `Model` and `ModelConfig` in the shape expected by
`mlx_audio.tts.utils.load()` and `python -m mlx_audio.tts.generate`:
    python -m mlx_audio.tts.generate \
        --model mlx-community/higgs-audio-v2-3B-mlx-q8 \
        --text "Hello" --ref_audio voice.wav --ref_text "..."
`Model` subclasses the underlying `HiggsAudioModel` so safetensors keys
land unchanged. Tokenizer + codec attach via `post_load_hook`. The
generate() signature matches the standard TTS convention (text, voice,
ref_audio, ref_text, ...) and yields a mlx_audio.tts.models.base
GenerationResult.
`HiggsAudioServer` is kept as an additional Python entrypoint for
Higgs-specific kwargs (max_new_frames, ras_win_len, fade_in_ms, etc.).
Load-bearing details
--------------------
The first generated audio frame must be a synthetic all-stream_bos
frame — sampling from the model's audio_logits at the <|audio_out_bos|>
text position collapses to stream-EOS on half the codebooks (those
positions were never trained for direct audio prediction). See
Model.generate and HiggsAudioModel._generate_raw_frames for the full
AUDIO_INIT + K-frame ramp-in + EOS ramp-out state machine.
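A minimal sketch of the state machine described above, under stated assumptions: the token ids `BOS` and `EOS`, the function `generate_frames`, and the stub `fake_sampler` are all hypothetical, and the ramp-out rule is a simplified reading of the description (once a codebook emits EOS, one more trailing codebook is forced to EOS per step). It is not the code in `HiggsAudioModel._generate_raw_frames`.

```python
BOS, EOS = 1024, 1025  # hypothetical stream-BOS / stream-EOS token ids


def generate_frames(sample_frame, num_codebooks, max_frames):
    """sample_frame(step) returns K proposed tokens (stand-in for the model)."""
    frames = [[BOS] * num_codebooks]  # AUDIO_INIT: forced frame, never sampled
    eos_step = eos_cb = None          # where and when EOS was first emitted
    for step in range(max_frames):
        frame = list(sample_frame(step))
        # Ramp-in: codebook i stays at BOS until step i.
        for i in range(step + 1, num_codebooks):
            frame[i] = BOS
        if eos_step is None:
            active = range(min(step + 1, num_codebooks))
            hits = [i for i in active if frame[i] == EOS]
            if hits:
                eos_step, eos_cb = step, hits[0]
        last_forced = -1
        if eos_step is not None:
            # Ramp-out: force one more trailing codebook to EOS per step.
            last_forced = min(eos_cb + (step - eos_step), num_codebooks - 1)
            for i in range(last_forced + 1):
                frame[i] = EOS
        frames.append(frame)
        if last_forced == num_codebooks - 1:
            break  # every codebook has reached EOS
    return frames


def fake_sampler(step):
    # Stand-in for the model: codebook 0 proposes EOS at step 4.
    tok = 100 + step
    return [EOS if step == 4 else tok, tok, tok]


frames = generate_frames(fake_sampler, num_codebooks=3, max_frames=20)
for f in frames:
    print(f)
```

The point of the sketch is the ordering: the all-BOS frame is injected before any sampling happens, so the model's logits at the `<|audio_out_bos|>` position are never used to pick audio tokens.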
Quantization
------------
MLX native 4/6/8-bit on the Llama backbone. `Model.model_quant_predicate`
protects `audio_codebook_embeddings` and `audio_decoder_proj.audio_lm_head`
at bf16 — quantizing them introduces voice-character drift (pitch
register shifts at q6, trajectory instability at q4). RTF on M5 Max:
bf16 0.60× / q8 0.36× / q6 0.33×.
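The selective-quantization idea can be sketched as a predicate over module paths. This is an illustration, not the body of `Model.model_quant_predicate`: the function name `quant_predicate`, the substring matching, the example paths, and the `FakeLinear` / `FakeNorm` stand-ins are assumptions. The `(path, module)` signature and the `to_quantized` check follow the convention that `mlx.nn.quantize` uses for its `class_predicate` argument.

```python
# Layers named in the commit message as kept at bf16.
PROTECTED = ("audio_codebook_embeddings", "audio_decoder_proj.audio_lm_head")


def quant_predicate(path, module):
    """Return True if the module at `path` should be quantized."""
    # Only modules that know how to quantize themselves are candidates.
    if not hasattr(module, "to_quantized"):
        return False
    # Skip the audio embedding table and the audio LM head.
    return not any(name in path for name in PROTECTED)


class FakeLinear:
    """Stand-in for a quantizable layer (hypothetical, for the demo only)."""

    def to_quantized(self, **kwargs):
        return self


class FakeNorm:
    """Stand-in for a layer that cannot be quantized."""


for path, mod in [
    ("layers.0.mlp.up_proj", FakeLinear()),  # hypothetical backbone path
    ("audio_decoder_proj.audio_lm_head", FakeLinear()),
    ("audio_codebook_embeddings", FakeLinear()),
    ("layers.0.input_layernorm", FakeNorm()),
]:
    print(path, quant_predicate(path, mod))
```

With MLX installed, a predicate of this shape would typically be passed as `mlx.nn.quantize(model, bits=8, class_predicate=quant_predicate)`. That call is shown as an assumed usage and is not taken from the PR.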
Bundled assets
--------------
Three drop-in reference voices in examples/voice_prompts/ (en_woman,
en_man, en_man_deep), generated via smart-voice mode so they're
license-clean. Example: examples/higgs_audio_clone_demo.py.
Tests
-----
16 unit tests in mlx_audio/tts/tests/test_higgs_audio.py — delay-pattern
round-trip, audio-embedding lookup, sampling equivalence at T=0, tiny
config model forward, selective-quant predicate verification, and the
framework-interface contract (Model subclass, ModelConfig.from_dict,
sample_rate, model_quant_predicate, generate-before-load guard).
References
----------
Original: https://github.com/boson-ai/higgs-audio
HF (bf16): https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base
HF (q8): https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8
HF (q6): https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6
HF codec: https://huggingface.co/mlx-community/higgs-audio-v2-tokenizer
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
52a0c2a to
6cf56a5
Matches the pre-commit config (.pre-commit-config.yaml — black 24.2.0, isort 5.13.2 profile=black) enforced by the PR CI style job. No code changes — pure formatting + import-order + uv.lock reconciliation with the pydub dep already declared in pyproject.toml.
Collaborator
@Kairos-a Can you add a README.md in the model directory (see the other models for examples) and then we can merge? Thanks!
Addresses @lucasnewman's review request. Covers usage (CLI + standard Python API + Higgs-specific HiggsAudioServer), voice cloning, parameter table, available models / quantizations / RTF / memory, conversion, architecture, and license. Links to the full port docs at docs/models/tts/higgs_audio.md for depth.
Closes #1.
This PR adds Higgs Audio v2, Boson AI's 3B Llama-3.2-backed TTS with multi-codebook acoustic tokens and delay-pattern streaming. The port reuses the in-tree `HiggsAudioTokenizer` (added 2026-04-14 for the OmniVoice PR) for the codec path and ships a complete serve-engine equivalent for voice cloning.

Benchmarks (M5 Max, warm cache, long-prompt RTF)

- `bosonai/higgs-audio-v2-generation-3B-base` (authoritative)
- `mlx-community/higgs-audio-v2-3B-mlx-q8`
- `mlx-community/higgs-audio-v2-3B-mlx-q6`

Real-time voice cloning on Apple Silicon. bf16 loads directly from the authoritative bosonai upload, with no redundant mlx-community re-host. q8 and q6 are MLX-specific selectively-quantized variants.
What's added
Model: `mlx_audio/tts/models/higgs_audio/`

- `HiggsAudioConfig` / `HiggsTextConfig`: from-dict constructor for the bosonai `config.json`.
- `HiggsAudioModel`: Llama-3.2-3B backbone with Higgs-specific dual-FFN audio layers (shared attention; pre-attention norm, post-attention norm, and MLP split between text and audio paths, routed by `audio_out_mask`). `HiggsAudioDecoderProjector` with separate text and audio LM heads. Full `generate()` implementing the AUDIO_INIT + delay-pattern ramp-in + EOS ramp-out state machine.
- `generation.py`: `build_delay_pattern_mask`, `revert_delay_pattern`, `apply_delay_pattern`, `lookup_audio_embedding`, `greedy_sample_audio`, `sample_audio` (temperature + top-p + top-k via Gumbel-max).

Serving: `serve.py`

- `HiggsAudioServer.from_pretrained()`: loads model + codec + tokenizer in one call.
- `generate()`: ChatML prompt assembly (voice-clone mode when `reference_audio_path` is provided, smart-voice mode otherwise), reference audio encoded via the in-tree `HiggsAudioTokenizer`, inputs_embeds stitched with the `audio_out_mask` routing through the dual-FFN audio path.
- `generate_stream()`: yields PCM chunks at `chunk_ms` boundaries for Pipecat / streaming consumers. Quality-preserving (per-chunk identical to non-streaming). Mid-generation emission with overlap-add is noted as follow-up work.
- (`ras_win_len=7`, `ras_max_repeat=2`).

Docs: `docs/models/tts/higgs_audio.md` covers basic usage, voice cloning, streaming, quantization (including the `class_predicate` pattern that protects the audio head from quantization), sampling controls, and the generation state machine notes.

Tests: `mlx_audio/tts/tests/test_higgs_audio.py`, 9 unit tests (0.11s on M5 Max) covering delay-pattern shape + round-trip invariants, audio embedding lookup, sampling equivalence at T=0, and model forward on a tiny synthetic config.

Example: `examples/higgs_audio_clone_demo.py`, a CLI demo matching the `omnivoice_clone_demo.py` convention.

The non-obvious piece: AUDIO_INIT
The generation state machine is the gotcha for this port. The first audio frame must not be sampled from `audio_logits` at the `<|audio_out_bos|>` text position: those logits are heavily biased toward `audio_stream_eos_id` because the model was trained with the `<|AUDIO_OUT|>` placeholder-substitution convention, not direct audio prediction. Instead, a synthetic all-`audio_stream_bos_id` frame is force-injected (AUDIO_INIT), then the K-frame delay-pattern ramp-in begins. Codebook `i` starts emitting at frame `i`; the rest stay at BOS. When any codebook emits EOS, a K-frame ramp-out forces the trailing codebooks to EOS before termination.

Skipping this produces a stuck pitch (first-frame EOS on half the codebooks → rest is garbage). I've documented this in `docs/models/tts/higgs_audio.md` so future ports/maintainers don't rediscover it.

Checkpoints
- `bosonai/higgs-audio-v2-generation-3B-base` directly (authoritative, no re-host needed)
- `mlx-community/higgs-audio-v2-3B-mlx-q8`
- `mlx-community/higgs-audio-v2-3B-mlx-q6`
- `mlx-community/higgs-audio-v2-tokenizer` (upstream bosonai repacked into the `audio_tokenizer/` subdir layout that mlx-audio's `HiggsAudioTokenizer.from_pretrained` expects)

Quantized variants carry a `quantization` block in `config.json` so `HiggsAudioServer.from_pretrained` auto-quantizes the skeleton before loading weights.

Quantization notes
MLX native `nn.quantize` works on the Llama backbone. The docs recommend protecting `audio_codebook_embeddings` and `audio_decoder_proj.audio_lm_head` from quantization via `class_predicate`: quantizing those specific layers introduces voice-character artifacts (pitch-register drift at q6 without protection, trajectory instability at q4).

q4 is deferred from this PR. Rounding noise at 4-bit pushes per-codebook logits close to the sampling decision threshold, so specific Gumbel draws produce clean output while others diverge. The pattern is seed-dependent, not fixable by temperature tuning alone. Follow-up candidates: AWQ-style post-training calibration, smaller `group_size`, or active retry-on-spiral detection.
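The seed dependence mentioned above follows from how Gumbel-max sampling works, which a small standard-library sketch can show. The function `gumbel_max_sample` and the example logits are hypothetical; top-k and top-p filtering are omitted, and this is not the port's `sample_audio`.

```python
import math
import random


def gumbel_max_sample(logits, temperature, rng):
    """Pick an index by adding Gumbel noise to scaled logits and taking argmax."""
    if temperature <= 0:
        # T=0 reduces to greedy decoding.
        return max(range(len(logits)), key=logits.__getitem__)

    def gumbel():
        u = max(rng.random(), 1e-12)  # clamp away from 0 to avoid log(0)
        return -math.log(-math.log(u))

    scores = [x / temperature + gumbel() for x in logits]
    return max(range(len(scores)), key=scores.__getitem__)


wide_margin = [50.0, 0.0, 0.0]  # clear winner: noise rarely changes the pick
near_tie = [2.00, 1.98]         # tiny gap: the pick depends on the draw

wide_picks = {gumbel_max_sample(wide_margin, 1.0, random.Random(s)) for s in range(100)}
tie_picks = {gumbel_max_sample(near_tie, 1.0, random.Random(s)) for s in range(200)}
print(wide_picks, tie_picks)
```

When two logits are nearly tied, a small perturbation of either one (such as quantization rounding noise) can flip which index wins for a given draw, which is consistent with the observation that some seeds produce clean output while others diverge.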
Testing
Credits
Port built on top of the in-tree `HiggsAudioTokenizer` from the OmniVoice PR (#630). Model code reimplemented from `bosonai/higgs-audio` against the Llama-3.2-3B architecture in `mlx-lm`.