feat: add Higgs Audio v2 - 3B Llama-backed TTS with voice cloning #656

Merged: lucasnewman merged 3 commits into Blaizzy:main from kaioct-labs:higgs-audio-v2-port on Apr 21, 2026

@Kairos-a

Contributor

Closes #1.

This PR adds Higgs Audio v2, Boson AI's 3B Llama-3.2-backed TTS with multi-codebook acoustic tokens and delay-pattern streaming. The port reuses the in-tree HiggsAudioTokenizer (added 2026-04-14 for the OmniVoice PR) for the codec path and ships a complete serve-engine equivalent for voice cloning.

Benchmarks (M5 Max, warm cache, long-prompt RTF)

| variant | RTF | weights size | source |
|---|---|---|---|
| bf16 | 0.60× | 6.8 GB | bosonai/higgs-audio-v2-generation-3B-base (authoritative) |
| q8 | 0.36× | 6.18 GB | mlx-community/higgs-audio-v2-3B-mlx-q8 |
| q6 | 0.33× | 4.75 GB | mlx-community/higgs-audio-v2-3B-mlx-q6 |

All three variants run voice cloning faster than real time (RTF below 1) on Apple Silicon. bf16 loads directly from the authoritative bosonai upload, so there is no redundant mlx-community re-host. q8 and q6 are MLX-specific, selectively quantized variants.
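As a quick reading aid for the RTF column, here is a minimal sketch that assumes the usual convention of RTF = synthesis wall-clock time divided by the duration of the audio produced. The helper name `real_time_factor` is hypothetical and is not part of this PR.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


# Example: at the reported q8 RTF of 0.36, 10 s of speech needs about 3.6 s of compute.
audio_seconds = 10.0
q8_rtf = 0.36
synthesis_seconds = q8_rtf * audio_seconds
print(f"{synthesis_seconds:.1f} s to synthesize {audio_seconds:.0f} s of audio")
```

Any value below 1.0 means audio is produced faster than it plays back.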

What's added

Model: mlx_audio/tts/models/higgs_audio/

  • HiggsAudioConfig / HiggsTextConfig: from-dict constructor for the bosonai config.json.
  • HiggsAudioModel: Llama-3.2-3B backbone with Higgs-specific dual-FFN audio layers (shared attention; pre-attention norm, post-attention norm, and MLP split between text and audio paths, routed by audio_out_mask). HiggsAudioDecoderProjector provides separate text and audio LM heads. generate() implements the full AUDIO_INIT + delay-pattern ramp-in + EOS ramp-out state machine.
  • generation.py: build_delay_pattern_mask, revert_delay_pattern, apply_delay_pattern, lookup_audio_embedding, greedy_sample_audio, sample_audio (temperature + top-p + top-k via Gumbel-max).
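To make the delay pattern concrete, the following NumPy sketch shifts codebook i right by i frames and then undoes the shift. It is an illustration under assumptions, not the code in generation.py: the function names `apply_delay_sketch` and `revert_delay_sketch`, the (K, T) layout, and the pad ids are all hypothetical.

```python
import numpy as np


def apply_delay_sketch(codes: np.ndarray, bos_id: int, eos_id: int) -> np.ndarray:
    """Shift codebook i right by i frames. codes: (K, T) -> (K, T + K - 1)."""
    num_codebooks, num_frames = codes.shape
    out = np.empty((num_codebooks, num_frames + num_codebooks - 1), dtype=codes.dtype)
    for i in range(num_codebooks):
        out[i, :i] = bos_id                     # leading pad before codebook i starts
        out[i, i:i + num_frames] = codes[i]     # payload, delayed by i frames
        out[i, i + num_frames:] = eos_id        # trailing pad after codebook i ends
    return out


def revert_delay_sketch(delayed: np.ndarray) -> np.ndarray:
    """Undo apply_delay_sketch. delayed: (K, T + K - 1) -> (K, T)."""
    num_codebooks, length = delayed.shape
    num_frames = length - (num_codebooks - 1)
    return np.stack([delayed[i, i:i + num_frames] for i in range(num_codebooks)])


codes = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])
delayed = apply_delay_sketch(codes, bos_id=-1, eos_id=-2)
print(delayed)
# [[ 1  2  3  4 -2 -2]
#  [-1  5  6  7  8 -2]
#  [-1 -1  9 10 11 12]]
assert np.array_equal(revert_delay_sketch(delayed), codes)  # round-trip invariant
```

The round-trip assertion mirrors the kind of invariant the unit tests in this PR are described as checking.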

Serving: serve.py

  • HiggsAudioServer.from_pretrained(): loads model + codec + tokenizer in one call.
  • generate(): ChatML prompt assembly (voice-clone mode when reference_audio_path is provided, smart-voice mode otherwise), reference audio encoded via the in-tree HiggsAudioTokenizer, inputs_embeds stitched with the audio_out_mask routing through the dual-FFN audio path.
  • generate_stream(): yields PCM chunks at chunk_ms boundaries for Pipecat / streaming consumers. Quality-preserving: each chunk is identical to the corresponding span of the non-streaming output. Mid-generation emission with overlap-add is noted as follow-up work.
  • RAS (repetition-avoidance sampling) on by default (Higgs: ras_win_len=7, ras_max_repeat=2).
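The PR names the RAS parameters but does not spell out the rule, so the sketch below shows one common formulation only: draw from the filtered (top-p / top-k) distribution, and if that token already appears at least max_repeat times in the last win_len tokens, redraw from the unfiltered distribution. Whether serve.py does exactly this is an assumption, and `ras_sample_sketch` is a hypothetical name.

```python
import numpy as np


def ras_sample_sketch(probs, filtered_probs, history, rng, win_len=7, max_repeat=2):
    """Sample one token id with a simple repetition check over recent history.

    probs:          full distribution over the vocabulary (sums to 1)
    filtered_probs: the same distribution after top-p / top-k filtering (sums to 1)
    history:        list of previously emitted token ids for this codebook
    """
    token = int(rng.choice(len(filtered_probs), p=filtered_probs))
    recent = history[-win_len:]
    if recent.count(token) >= max_repeat:
        # The candidate is over-represented in the window: fall back to the
        # unfiltered distribution to give other tokens a chance.
        token = int(rng.choice(len(probs), p=probs))
    return token


rng = np.random.default_rng(0)
filtered = np.array([1.0, 0.0, 0.0, 0.0])   # filtering left only token 0
full = np.array([0.0, 0.0, 0.0, 1.0])       # toy full distribution, all mass on token 3
print(ras_sample_sketch(full, filtered, history=[0], rng=rng))     # 0: no repetition yet
print(ras_sample_sketch(full, filtered, history=[0, 0], rng=rng))  # 3: token 0 repeated twice
```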

Docs: docs/models/tts/higgs_audio.md covers basic usage, voice cloning, streaming, quantization (including the class_predicate pattern that protects the audio head from quantization), sampling controls, and the generation state machine notes.

Tests: mlx_audio/tts/tests/test_higgs_audio.py has 9 unit tests (0.11s on M5 Max) covering delay-pattern shape + round-trip invariants, audio embedding lookup, sampling equivalence at T=0, and model forward on a tiny synthetic config.
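The "sampling equivalence at T=0" check can be pictured with a small Gumbel-max sampler. This is a NumPy sketch of the general trick, not the PR's sample_audio; it omits top-p and top-k, and the name `gumbel_max_sample` is hypothetical.

```python
import numpy as np


def gumbel_max_sample(logits, temperature, rng):
    """Draw a token id from softmax(logits / temperature) via the Gumbel-max trick."""
    logits = np.asarray(logits, dtype=np.float64)
    if temperature <= 0.0:
        return int(np.argmax(logits))            # T=0 degenerates to greedy decoding
    noise = rng.gumbel(size=logits.shape)        # i.i.d. standard Gumbel noise
    return int(np.argmax(logits / temperature + noise))


rng = np.random.default_rng(0)
logits = [0.1, 2.0, -1.0]
# At T=0 the sampler must agree with plain argmax.
assert gumbel_max_sample(logits, 0.0, rng) == int(np.argmax(logits))
```

Adding Gumbel noise and taking the argmax is distributionally equivalent to sampling from the softmax, which is why the T=0 limit and greedy decoding should coincide.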

Example: examples/higgs_audio_clone_demo.py is a CLI demo matching the omnivoice_clone_demo.py convention.

The non-obvious piece — AUDIO_INIT

The generation state machine is the gotcha for this port. The first audio frame must not be sampled from audio_logits at the <|audio_out_bos|> text position: those logits are heavily biased toward audio_stream_eos_id because the model was trained with the <|AUDIO_OUT|> placeholder-substitution convention, not direct audio prediction. Instead, a synthetic frame filled with audio_stream_bos_id is force-injected (AUDIO_INIT), and then the K-frame delay-pattern ramp-in begins: codebook i starts emitting at frame i, and the rest stay at BOS. When any codebook emits EOS, a K-frame ramp-out forces the trailing codebooks to EOS before termination.

Skipping this produces a stuck pitch (first-frame EOS on half the codebooks → rest is garbage). I've documented this in docs/models/tts/higgs_audio.md so future ports/maintainers don't rediscover it.
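The state machine described above can be sketched in plain Python. This is a simplified illustration, not the PR's generate(): `sample_fn` is a hypothetical stand-in for the model's per-codebook sampler, frames are counted from 0 after the injected init frame, and the ramp-out bookkeeping is one plausible reading of the description.

```python
from typing import Callable, List, Optional


def generate_frames_sketch(
    sample_fn: Callable[[int, int], int],  # (generated_frame_index, codebook) -> token id
    num_codebooks: int,
    bos_id: int,
    eos_id: int,
    max_frames: int = 64,
) -> List[List[int]]:
    frames = [[bos_id] * num_codebooks]        # AUDIO_INIT: injected, never sampled
    eos_upto: Optional[int] = None             # highest codebook index forced to EOS so far
    for g in range(max_frames):                # g counts frames after AUDIO_INIT
        # Ramp-in: codebook c only becomes live once g >= c; before that it stays at BOS.
        frame = [sample_fn(g, c) if c <= g else bos_id for c in range(num_codebooks)]
        if eos_upto is None:
            hits = [c for c, tok in enumerate(frame) if tok == eos_id]
            if hits:
                eos_upto = max(hits)           # first EOS seen: ramp-out starts here
        else:
            eos_upto += 1                      # ramp-out advances one codebook per frame
        if eos_upto is not None:
            for c in range(eos_upto + 1):
                frame[c] = eos_id              # sampled values are overridden during ramp-out
        frames.append(frame)
        if eos_upto is not None and eos_upto >= num_codebooks - 1:
            break                              # every codebook has reached EOS
    return frames


BOS, EOS = 100, 101
toy_sampler = lambda g, c: EOS if (c == 0 and g == 3) else 7
for frame in generate_frames_sketch(toy_sampler, num_codebooks=3, bos_id=BOS, eos_id=EOS):
    print(frame)
# [100, 100, 100]
# [7, 100, 100]
# [7, 7, 100]
# [7, 7, 7]
# [101, 7, 7]
# [101, 101, 7]
# [101, 101, 101]
```

With three codebooks the ramp-out spans three frames, matching the K-frame ramp-out in the description.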

Checkpoints

Quantized variants carry a quantization block in config.json so HiggsAudioServer.from_pretrained auto-quantizes the skeleton before loading weights.

Quantization notes

MLX native nn.quantize works on the Llama backbone. The docs recommend protecting audio_codebook_embeddings and audio_decoder_proj.audio_lm_head from quantization via class_predicate — quantizing those specific layers introduces voice-character artifacts (pitch-register drift at q6 without protection, trajectory instability at q4).
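A predicate along these lines could implement the protection. It is a sketch under assumptions: the function name `keep_audio_head_unquantized` is hypothetical, and matching by substring assumes the two layer names appear verbatim in the module paths MLX passes to the predicate.

```python
PROTECTED_NAMES = (
    "audio_codebook_embeddings",
    "audio_decoder_proj.audio_lm_head",
)


def keep_audio_head_unquantized(path: str, module) -> bool:
    """Return True when `module` at dotted `path` should be quantized."""
    if any(name in path for name in PROTECTED_NAMES):
        return False                           # leave the audio embeddings / head untouched
    # Mirror the default behaviour: only modules that know how to quantize themselves.
    return hasattr(module, "to_quantized")


# Usage with MLX (not executed here; requires mlx and a loaded model):
#   import mlx.nn as nn
#   nn.quantize(model, group_size=64, bits=8, class_predicate=keep_audio_head_unquantized)
```

The group_size and bits values in the usage comment are illustrative only and are not taken from the published checkpoints.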

q4 is deferred from this PR. Rounding noise at 4-bit pushes per-codebook logits close to the sampling decision threshold, so specific Gumbel draws produce clean output while others diverge. The pattern is seed-dependent, not fixable by temperature tuning alone. Follow-up candidates: AWQ-style post-training calibration, smaller group_size, or active retry-on-spiral detection.

Testing

python -m unittest mlx_audio.tts.tests.test_higgs_audio -v
# 9 tests pass in 0.11s

python examples/higgs_audio_clone_demo.py \
    --ref_audio path/to/reference.wav \
    --ref_text "Transcript of the reference clip." \
    --text "Hello from Higgs Audio on MLX."

Credits

Port built on top of the in-tree HiggsAudioTokenizer from the OmniVoice PR (#630). Model code reimplemented from boson-ai/higgs-audio against the Llama-3.2-3B architecture in mlx-lm.

@lucasnewman

Collaborator

@Kairos-a Thanks for the contribution! Overall this looks pretty good -- see a couple of comments around how it can better integrate into the existing codebase.

Port of Boson AI's Higgs Audio v2 to MLX. Llama-3.2-3B backbone with a
dual-FFN decoder layer (text + audio paths share self-attention; LN + MLP
are per-path, routed by audio_out_mask) and delay-pattern audio emission
(codebook i lags by i frames).
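The per-path routing can be illustrated with a NumPy sketch. It is not the PR's decoder layer: it evaluates both MLPs for every token and selects per position, which is simple but not necessarily how the MLX code avoids redundant work, and `dual_path_mlp_sketch` is a hypothetical name.

```python
import numpy as np


def dual_path_mlp_sketch(hidden, audio_out_mask, text_mlp, audio_mlp):
    """Route each token through the text or the audio MLP.

    hidden:         (T, D) activations after the shared self-attention
    audio_out_mask: (T,) boolean, True where the token belongs to the audio path
    """
    text_out = text_mlp(hidden)
    audio_out = audio_mlp(hidden)
    # Broadcast the mask over the feature dimension and pick per token.
    return np.where(audio_out_mask[:, None], audio_out, text_out)


hidden = np.ones((3, 2))
mask = np.array([False, True, False])
routed = dual_path_mlp_sketch(hidden, mask, text_mlp=lambda x: x + 1, audio_mlp=lambda x: x * 10)
print(routed)
# [[ 2.  2.]
#  [10. 10.]
#  [ 2.  2.]]
```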

Framework interface
-------------------
Exposes `Model` and `ModelConfig` in the shape expected by
`mlx_audio.tts.utils.load()` and `python -m mlx_audio.tts.generate`:

    python -m mlx_audio.tts.generate \
        --model mlx-community/higgs-audio-v2-3B-mlx-q8 \
        --text "Hello" --ref_audio voice.wav --ref_text "..."

`Model` subclasses the underlying `HiggsAudioModel` so safetensors keys
land unchanged. Tokenizer + codec attach via `post_load_hook`. The
generate() signature matches the standard TTS convention (text, voice,
ref_audio, ref_text, ...) and yields a mlx_audio.tts.models.base
GenerationResult.

`HiggsAudioServer` is kept as an additional Python entrypoint for
Higgs-specific kwargs (max_new_frames, ras_win_len, fade_in_ms, etc.).

Load-bearing details
--------------------
The first generated audio frame must be a synthetic all-stream_bos
frame — sampling from the model's audio_logits at the <|audio_out_bos|>
text position collapses to stream-EOS on half the codebooks (those
positions were never trained for direct audio prediction). See
Model.generate and HiggsAudioModel._generate_raw_frames for the full
AUDIO_INIT + K-frame ramp-in + EOS ramp-out state machine.

Quantization
------------
MLX native 4/6/8-bit on the Llama backbone. `Model.model_quant_predicate`
protects `audio_codebook_embeddings` and `audio_decoder_proj.audio_lm_head`
at bf16 — quantizing them introduces voice-character drift (pitch
register shifts at q6, trajectory instability at q4). RTF on M5 Max:
bf16 0.60× / q8 0.36× / q6 0.33×.

Bundled assets
--------------
Three drop-in reference voices in examples/voice_prompts/ (en_woman,
en_man, en_man_deep), generated via smart-voice mode so they're
license-clean. Example: examples/higgs_audio_clone_demo.py.

Tests
-----
16 unit tests in mlx_audio/tts/tests/test_higgs_audio.py — delay-pattern
round-trip, audio-embedding lookup, sampling equivalence at T=0, tiny
config model forward, selective-quant predicate verification, and the
framework-interface contract (Model subclass, ModelConfig.from_dict,
sample_rate, model_quant_predicate, generate-before-load guard).

References
----------
Original: https://github.com/boson-ai/higgs-audio
HF (bf16): https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base
HF (q8):   https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8
HF (q6):   https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6
HF codec:  https://huggingface.co/mlx-community/higgs-audio-v2-tokenizer

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Kairos-a force-pushed the higgs-audio-v2-port branch from 52a0c2a to 6cf56a5 on April 19, 2026 00:12
Matches the pre-commit config (.pre-commit-config.yaml — black 24.2.0,
isort 5.13.2 profile=black) enforced by the PR CI style job. No code
changes — pure formatting + import-order + uv.lock reconciliation with
the pydub dep already declared in pyproject.toml.
@lucasnewman

Collaborator

@Kairos-a Can you add a README.md in the model directory (see the other models for examples) and then we can merge? Thanks!

Addresses @lucasnewman's review request. Covers usage (CLI + standard
Python API + Higgs-specific HiggsAudioServer), voice cloning, parameter
table, available models / quantizations / RTF / memory, conversion,
architecture, and license. Links to the full port docs at
docs/models/tts/higgs_audio.md for depth.

@lucasnewman left a comment

Collaborator

Thanks!

@lucasnewman merged commit 3be6eb4 into Blaizzy:main on Apr 21, 2026
11 checks passed
@Kairos-a deleted the higgs-audio-v2-port branch on April 22, 2026 02:47

Development

Successfully merging this pull request may close these issues.

TTS and STS Models to port to MLX-Audio (Roadmap)
