Blaizzy · lucasnewman · Apr 21, 2026 · Apr 19, 2026 · Apr 20, 2026 · Apr 20, 2026
diff --git a/README.md b/README.md
@@ -105,6 +105,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
 | **Voxtral TTS** | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | [mlx-community/Voxtral-4B-TTS-2603-mlx-bf16](https://huggingface.co/mlx-community/Voxtral-4B-TTS-2603-mlx-bf16) |
 | **LongCat-AudioDiT** | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | [mlx-community/LongCat-AudioDiT-1B-bf16](https://huggingface.co/mlx-community/LongCat-AudioDiT-1B-bf16) |
 | **MeloTTS** | Lightweight VITS2-based TTS with streaming | EN (more coming) | [mlx-community/MeloTTS-English-MLX](https://huggingface.co/mlx-community/MeloTTS-English-MLX) |
+| **Higgs Audio v2** | 3B Llama-backed TTS with real-time voice cloning | EN, ZH, KO, DE, ES | [bf16 (upstream)](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base), [q8](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8), [q6](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6) |
 
 ### Speech-to-Text (STT)
 

diff --git a/docs/models/tts/higgs_audio.md b/docs/models/tts/higgs_audio.md
@@ -0,0 +1,183 @@
+# Higgs Audio v2
+
+Higgs Audio v2 is a Llama-3.2-3B-backed TTS with multi-codebook acoustic tokens and delay-pattern streaming. The MLX port targets the 3B open-weights release from Boson AI and reuses the in-tree HiggsAudio acoustic tokenizer (originally added for OmniVoice).
+
+## Highlights
+
+- Real-time voice cloning on Apple Silicon (RTF ≈ 0.6× bf16 / 0.36× q8 / 0.33× q6 on M5 Max)
+- Reference-audio voice cloning via ChatML prompt format
+- Full `AUDIO_INIT` + delay-pattern ramp-in/out state machine
+- Repetition-avoidance sampling (RAS) for stable long-form output
+- MLX native 4/6/8-bit quantization with optional per-layer protection
+
+## Basic usage
+
+### Top-level CLI
+
+```bash
+python -m mlx_audio.tts.generate \
+  --model mlx-community/higgs-audio-v2-3B-mlx-q8 \
+  --text "Hello from Higgs Audio on MLX." \
+  --ref_audio path/to/reference.wav \
+  --ref_text "Transcript of the reference clip."
+```
+
+The `Model` class conforms to the standard mlx-audio interface, so the
+existing `mlx_audio.tts.generate` CLI and `mlx_audio.server` both work
+unchanged against Higgs.
+
+### Python API (standard)
+
+```python
+from mlx_audio.tts.utils import load
+import soundfile as sf
+
+model = load("mlx-community/higgs-audio-v2-3B-mlx-q8")
+
+for result in model.generate(
+    text="Hello from Higgs Audio on MLX.",
+    ref_audio="path/to/reference.wav",   # optional; strongly recommended
+    ref_text="Transcript of the reference clip.",
+    temperature=0.7,
+    top_p=0.95,
+    max_new_frames=1200,
+    fade_in_ms=30.0,
+):
+    sf.write("output.wav", result.audio, result.sample_rate)
+```
+
+Without `ref_audio`, generation runs in "smart voice" mode (random voice
+per sample). This works but is less reliable than voice cloning — the
+sampling occasionally collapses to `stream_eos` early and produces silent
+output. If that happens, rerun (each call draws fresh noise) or pass
+`ref_audio`. For production use, a reference voice is strongly recommended.
+
+### Python API (Higgs-specific kwargs)
+
+For direct access to the full Higgs parameter surface (RAS windowing,
+sampling warmup, pre-loaded codec override, etc.), use `HiggsAudioServer`:
+
+```python
+from mlx_audio.tts.models.higgs_audio import HiggsAudioServer
+import soundfile as sf
+
+server = HiggsAudioServer.from_pretrained(
+    model_path="bosonai/higgs-audio-v2-generation-3B-base",     # bf16 base
+    codec_path="mlx-community/higgs-audio-v2-tokenizer",        # acoustic tokenizer
+)
+
+result = server.generate(
+    target_text="Hello from Higgs Audio on MLX.",
+    temperature=0.7,
+    top_p=0.95,
+    max_new_frames=1200,
+    fade_in_ms=30.0,
+)
+sf.write("output.wav", result.pcm, result.sampling_rate)
+```
+
+### Recommended parameters
+
+- `temperature=0.7`, `top_p=0.95`: stable across prompt lengths in the M5 Max benchmark
+- `max_new_frames=1200`: a generous cap; generation stops naturally at the EOS ramp
+- `fade_in_ms=30.0`, `fade_out_ms=15.0`: suppress the first-frame transient that the 5 ms default fade occasionally lets through
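The effect of the fade parameters can be pictured with a small NumPy sketch. It assumes the fade is a plain linear ramp applied to the edges of a 24 kHz float PCM buffer, which is how these docs describe it; `apply_linear_fade` is a hypothetical helper for illustration, not the in-tree implementation.

```python
import numpy as np


def apply_linear_fade(pcm, sample_rate=24_000, fade_in_ms=30.0, fade_out_ms=15.0):
    """Return a copy of `pcm` with linear ramps applied to both edges."""
    out = np.asarray(pcm, dtype=np.float32).copy()
    # Convert milliseconds to sample counts, clamped to the buffer length.
    n_in = min(len(out), int(sample_rate * fade_in_ms / 1000.0))
    n_out = min(len(out), int(sample_rate * fade_out_ms / 1000.0))
    if n_in > 0:
        out[:n_in] *= np.linspace(0.0, 1.0, n_in, dtype=np.float32)
    if n_out > 0:
        out[-n_out:] *= np.linspace(1.0, 0.0, n_out, dtype=np.float32)
    return out


# One second of constant signal stands in for decoded audio.
faded = apply_linear_fade(np.ones(24_000, dtype=np.float32))
```

With the recommended values, the fade-in spans 720 samples and the fade-out 360 samples at 24 kHz, so only the first 30 ms and last 15 ms of the clip are attenuated.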
+
+## Voice cloning
+
+Pass `ref_audio` (a path or a pre-loaded `mx.array` at 24 kHz mono) together with
+`ref_text` (the transcript of that clip). Reference audio is encoded through
+the in-tree `HiggsAudioTokenizer` and stitched into the assistant turn of a
+ChatML prompt — the transcript is required for stable alignment between the
+cloned voice and the target text.
+
+```python
+for result in model.generate(
+    text="Hello, this is a cloned voice.",
+    ref_audio="reference.wav",
+    ref_text="Transcript of the reference clip.",
+    temperature=0.7,
+    top_p=0.95,
+    max_new_frames=1200,
+    fade_in_ms=30.0,
+):
+    sf.write("output.wav", result.audio, result.sample_rate)
+```
+
+Best results come from 5–15 seconds of clean reference speech.
+
+### Bundled sample voices
+
+Three drop-in reference voices ship in `examples/voice_prompts/`, generated via Higgs smart-voice mode so they're license-clean:
+
+- `en_woman.wav` — English, feminine register
+- `en_man.wav` — English, masculine register
+- `en_man_deep.wav` — English, masculine register, lower pitch
+
+Each `.wav` is paired with a matching `.txt` transcript. See `examples/voice_prompts/README.md` for the usage snippet.
+
+## Streaming
+
+For chunked streaming output (e.g. Pipecat pipelines), use
+`HiggsAudioServer.generate_stream`:
+
+```python
+for pcm_chunk in server.generate_stream(
+    target_text="Generating in chunks for live playback.",
+    reference_audio_path="reference.wav",
+    reference_text="...",
+    chunk_ms=640.0,
+):
+    # emit or resample pcm_chunk (float32 at 24 kHz)
+    ...
+```
+
+Current behavior: the full utterance is generated first and the resulting PCM is then split into chunks, so per-chunk quality matches non-streaming output exactly. Mid-generation streaming (emit-as-you-go) is not yet supported: the neural-vocoder codec produces subtly different PCM at the same sample position when called with different accumulated lengths, so chunk boundaries become audibly discontinuous. Proper overlap-add streaming is follow-up work.
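The generate-then-chunk behavior can be sketched as follows. This is an illustration only: `chunk_pcm` is a hypothetical helper, not the code behind `generate_stream`, and it assumes a mono float32 buffer at 24 kHz.

```python
import numpy as np


def chunk_pcm(pcm, sample_rate=24_000, chunk_ms=640.0):
    """Yield consecutive slices of a finished PCM buffer, each chunk_ms long."""
    samples_per_chunk = max(1, int(round(sample_rate * chunk_ms / 1000.0)))
    for start in range(0, len(pcm), samples_per_chunk):
        # The final slice may be shorter than samples_per_chunk.
        yield pcm[start : start + samples_per_chunk]


# One second of silence stands in for a fully generated utterance.
pcm = np.zeros(24_000, dtype=np.float32)
chunks = list(chunk_pcm(pcm))
```

At 24 kHz, `chunk_ms=640.0` corresponds to 15,360 samples per chunk. Because the slices are cut from a single decoded buffer, concatenating them reproduces the non-streaming output sample for sample.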
+
+## Quantization
+
+MLX native 4/6/8-bit quantization works on the Llama backbone. The audio head and audio codebook embeddings benefit from staying at bf16 — quantizing them introduces voice-character drift (pitch register shifts at q6, trajectory instability at q4).
+
+Already-quantized checkpoints load transparently via `load(...)`: `config.json` carries a `quantization` block that the framework applies before loading weights. To quantize a freshly loaded bf16 model in place, pass `model.model_quant_predicate` as the `class_predicate` of `nn.quantize`:
+
+```python
+import mlx.core as mx
+import mlx.nn as nn
+from mlx_audio.tts.utils import load
+
+model = load("bosonai/higgs-audio-v2-generation-3B-base")
+nn.quantize(model, group_size=64, bits=8, class_predicate=model.model_quant_predicate)
+mx.eval(model.parameters())
+```
+
+Benchmark on M5 Max (warm), long-prompt RTF (wall-clock time divided by audio duration, so lower is faster):
+
+| variant | RTF   | weights size | notes                                       |
+|---------|-------|--------------|---------------------------------------------|
+| bf16    | 0.60× | 6.8 GB       | `bosonai/higgs-audio-v2-generation-3B-base` (authoritative) |
+| q8      | 0.36× | 6.18 GB      | `mlx-community/higgs-audio-v2-3B-mlx-q8`    |
+| q6      | 0.33× | 4.75 GB      | `mlx-community/higgs-audio-v2-3B-mlx-q6`    |
+| q4      | 0.26× | 3.32 GB      | deferred — seed-sensitive, follow-up PR     |
+
+bf16 is served directly from the authoritative `bosonai/*` upload — no need for a redundant mlx-community re-host. q8 and q6 are MLX-specific selectively-quantized variants.
+
+## Sampling controls
+
+- `temperature=0.7`, `top_p=0.95` are the Higgs defaults.
+- `ras_win_len=7`, `ras_max_repeat=2` enable repetition-avoidance sampling, which catches near-tie mispicks that would otherwise compound into loops. Set `ras_win_len=None` to disable it.
+- `sampling_warmup_frames=N` uses greedy sampling for the first N frames, then switches to temperature sampling. It is exposed for experimentation and is not helpful at default settings.
+- `fade_in_ms=5.0`, `fade_out_ms=5.0` (the defaults) apply a short linear fade to the decoded PCM boundaries. The fade is below the onset perception threshold on bf16/q8 and masks rounding-click transients on quantized variants.
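One common formulation of repetition-aware sampling is sketched below, as a rough picture of what `ras_win_len` and `ras_max_repeat` control. It is an assumption about the mechanism, not the in-tree implementation: a token is first drawn with top-p sampling, and if it already occurs at least `ras_max_repeat` times in the last `ras_win_len` tokens, it is redrawn from the full distribution. `sample_top_p` and `sample_with_ras` are hypothetical names.

```python
import numpy as np


def sample_top_p(probs, top_p, rng):
    """Draw one token id from the smallest set of tokens whose mass reaches top_p."""
    order = np.argsort(probs)[::-1]  # token ids, most probable first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]
    kept = probs[keep] / probs[keep].sum()  # renormalize the nucleus
    return int(rng.choice(keep, p=kept))


def sample_with_ras(probs, history, rng, top_p=0.95, ras_win_len=7, ras_max_repeat=2):
    """Top-p sampling with a repetition check over the recent token history."""
    probs = np.asarray(probs, dtype=np.float64)
    token = sample_top_p(probs, top_p, rng)
    if ras_win_len is None:  # repetition check disabled
        return token
    window = list(history)[-ras_win_len:]
    if window.count(token) >= ras_max_repeat:
        # Assumed fallback: redraw from the full, untruncated distribution.
        token = int(rng.choice(len(probs), p=probs))
    return token
```

The fallback gives lower-probability tokens a chance exactly when the nucleus keeps producing the same id, which is the looping situation the bullet above describes.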
+
+## Implementation notes
+
+The generation state machine is the non-obvious piece of this port; see `HiggsAudioModel._generate_raw_frames` in `mlx_audio/tts/models/higgs_audio/higgs_audio.py`. The first audio frame is **synthetic**: every codebook is set to `audio_stream_bos_id` (AUDIO_INIT) rather than sampled from the audio logits at the `<|audio_out_bos|>` text position, because those logits were never trained for direct audio prediction. Without this, the model emits the stream-EOS token on half the codebooks at step 1 and output collapses to a stuck pitch.
+
+Codebook `i` is emitted with an `i`-frame delay, so the first K frames (K being the number of codebooks) form a progressive ramp-in: cb₀ is sampled at frame 1, cb₁ at frame 2, and so on, with the remaining codebooks forced to BOS. When any codebook emits EOS, a K-frame ramp-out begins, during which trailing codebooks are forced to EOS before termination. After `revert_delay_pattern`, the first and last aligned columns are dropped: they are the BOS seed and the EOS seal, which otherwise decode to arbitrary codec token 1023 and produce audible clicks.
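The delay pattern can be illustrated with a small NumPy round trip. This is a sketch under stated assumptions, not the in-tree code: `apply_delay_sketch` and `revert_delay_sketch` are hypothetical helpers, the `BOS_ID` and `EOS_ID` values are placeholders, and the sketch leaves out the AUDIO_INIT frame and the dropping of the first and last aligned columns.

```python
import numpy as np

BOS_ID, EOS_ID = 1024, 1025  # placeholder ids, assumed for illustration only


def apply_delay_sketch(codes, bos=BOS_ID, eos=EOS_ID):
    """Shift codebook i right by i frames; pad the front with BOS, the tail with EOS."""
    k, t = codes.shape
    delayed = np.full((k, t + k - 1), eos, dtype=codes.dtype)
    for i in range(k):
        delayed[i, :i] = bos            # ramp-in: codebook i is not yet active
        delayed[i, i : i + t] = codes[i]
    return delayed


def revert_delay_sketch(delayed):
    """Undo the shift so that column j holds frame j for every codebook."""
    k, total = delayed.shape
    t = total - (k - 1)
    return np.stack([delayed[i, i : i + t] for i in range(k)])


codes = np.arange(12).reshape(3, 4)   # 3 codebooks, 4 frames
delayed = apply_delay_sketch(codes)
restored = revert_delay_sketch(delayed)
```

In the delayed layout, each column mixes codebooks from different frames, which is why generation needs K-1 extra steps at the end before every codebook has emitted its last real token.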
+
+## References
+
+- Original repo: <https://github.com/boson-ai/higgs-audio>
+- Paper / blog: <https://boson.ai/blog/higgs-audio-v2>
+- HF model (reference): <https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base>
+- HF model (MLX q8): <https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8>
+- HF model (MLX q6): <https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6>
+- HF codec: <https://huggingface.co/mlx-community/higgs-audio-v2-tokenizer>
diff --git a/examples/higgs_audio_clone_demo.py b/examples/higgs_audio_clone_demo.py
@@ -0,0 +1,148 @@
+#!/usr/bin/env python3
+"""Higgs Audio v2 voice cloning demo.
+
+Uses the Higgs-specific HiggsAudioServer API for full parameter surface.
+For a drop-in example against the standard mlx_audio.tts.generate CLI,
+see docs/models/tts/higgs_audio.md.
+
+Quick start with the bundled `en_woman` sample voice:
+    python examples/higgs_audio_clone_demo.py \\
+        --text "Text to synthesize in the cloned voice."
+
+Supply your own reference:
+    python examples/higgs_audio_clone_demo.py \\
+        --ref_audio reference.wav \\
+        --ref_text "Reference transcript text." \\
+        --text "Text to synthesize in the cloned voice."
+
+Reference audio is encoded through the in-tree HiggsAudioTokenizer and
+stitched into the assistant turn of a ChatML prompt. ref_text is the
+transcript of the reference clip — required for stable alignment
+between the cloned voice and the target text.
+
+Best results come from 5-15 seconds of clean reference speech.
+Three sample voices live in examples/voice_prompts/ (en_woman,
+en_man, en_man_deep) for drop-in use.
+"""
+
+import argparse
+import sys
+import time
+from pathlib import Path
+
+import mlx.core as mx
+import mlx.nn as nn
+import numpy as np
+import soundfile as sf
+
+from mlx_audio.tts.models.higgs_audio import HiggsAudioServer
+
+
+def _quantize_predicate(name: str, module: nn.Module) -> bool:
+    """Keep audio head + audio codebook embeddings at bf16 — they're most
+    sensitive to quantization noise. Everything else (Llama backbone + text
+    head) gets compressed."""
+    if not isinstance(module, (nn.Linear, nn.Embedding)):
+        return False
+    protected = ("audio_codebook_embeddings", "audio_decoder_proj.audio_lm_head")
+    return not any(b in name for b in protected)
+
+
+def _default_voice_prompt() -> tuple[str, str]:
+    """Return the bundled `en_woman` voice prompt (wav path + transcript)."""
+    here = Path(__file__).resolve().parent / "voice_prompts"
+    return str(here / "en_woman.wav"), (here / "en_woman.txt").read_text().strip()
+
+
+def main() -> int:
+    p = argparse.ArgumentParser(description="Higgs Audio v2 voice cloning demo")
+    default_ref_audio, default_ref_text = _default_voice_prompt()
+    p.add_argument(
+        "--ref_audio",
+        default=default_ref_audio,
+        help="Reference audio WAV (defaults to bundled en_woman sample)",
+    )
+    p.add_argument(
+        "--ref_text", default=default_ref_text, help="Transcript of the reference audio"
+    )
+    p.add_argument("--text", required=True, help="Target text to synthesize")
+    p.add_argument("--output", default="higgs_clone_output.wav", help="Output WAV path")
+    p.add_argument(
+        "--model",
+        default="bosonai/higgs-audio-v2-generation-3B-base",
+        help="Higgs Audio v2 MLX model repo or path",
+    )
+    p.add_argument(
+        "--codec",
+        default="mlx-community/higgs-audio-v2-tokenizer",
+        help="Higgs Audio v2 tokenizer repo or path",
+    )
+    p.add_argument(
+        "--quantize_bits",
+        type=int,
+        default=None,
+        choices=[4, 6, 8],
+        help="Optionally quantize the loaded model in-place (4/6/8-bit)",
+    )
+    p.add_argument("--temperature", type=float, default=0.7)
+    p.add_argument("--top_p", type=float, default=0.95)
+    p.add_argument("--max_new_frames", type=int, default=1200)
+    p.add_argument("--ras_win_len", type=int, default=7)
+    p.add_argument("--ras_max_repeat", type=int, default=2)
+    p.add_argument(
+        "--fade_in_ms",
+        type=float,
+        default=30.0,
+        help="Leading fade (ms) — 30ms suppresses the first-frame transient cleanly",
+    )
+    p.add_argument("--fade_out_ms", type=float, default=15.0)
+    args = p.parse_args()
+
+    print(f"[load] HiggsAudioServer from {args.model}")
+    t0 = time.monotonic()
+    server = HiggsAudioServer.from_pretrained(
+        model_path=args.model,
+        codec_path=args.codec,
+    )
+    if args.quantize_bits is not None:
+        print(
+            f"[quantize] group_size=64 bits={args.quantize_bits} (audio head protected)"
+        )
+        nn.quantize(
+            server.model,
+            group_size=64,
+            bits=args.quantize_bits,
+            class_predicate=_quantize_predicate,
+        )
+        mx.eval(server.model.parameters())
+    print(f"  loaded in {time.monotonic() - t0:.2f}s")
+
+    print(f"[generate] target: {args.text!r}")
+    t_gen = time.monotonic()
+    result = server.generate(
+        target_text=args.text,
+        reference_audio_path=args.ref_audio,
+        reference_text=args.ref_text,
+        max_new_frames=args.max_new_frames,
+        temperature=args.temperature,
+        top_p=args.top_p,
+        ras_win_len=args.ras_win_len,
+        ras_max_repeat=args.ras_max_repeat,
+        fade_in_ms=args.fade_in_ms,
+        fade_out_ms=args.fade_out_ms,
+    )
+    wall = time.monotonic() - t_gen
+    audio_sec = len(result.pcm) / result.sampling_rate
+    rtf = wall / audio_sec if audio_sec > 0 else float("inf")
+
+    sf.write(args.output, result.pcm, result.sampling_rate)
+    print(
+        f"[done] {audio_sec:.2f}s audio in {wall:.2f}s wall "
+        f"(RTF {rtf:.2f}×, {result.num_frames_raw} frames, "
+        f"stop={result.stop_reason}) → {args.output}"
+    )
+    return 0
+
+
+if __name__ == "__main__":
+    sys.exit(main())
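The real-time factor printed by the demo is wall-clock time divided by audio duration, so values below 1 mean faster than real time. A minimal standalone version of that arithmetic is sketched below; `real_time_factor` is a hypothetical helper, and the 24 kHz default is taken from the sample rate stated in the docs.

```python
def real_time_factor(wall_seconds, num_samples, sample_rate=24_000):
    """Return wall-clock seconds spent per second of generated audio."""
    audio_seconds = num_samples / sample_rate
    # Guard against empty output, mirroring the demo's behavior.
    return wall_seconds / audio_seconds if audio_seconds > 0 else float("inf")


# 240,000 samples at 24 kHz is 10 s of audio; 3.6 s of wall time gives an RTF near 0.36.
rtf = real_time_factor(wall_seconds=3.6, num_samples=240_000)
```

Read against the benchmark table, an RTF of 0.36 means a 10-second clip takes roughly 3.6 seconds to generate.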
diff --git a/examples/voice_prompts/README.md b/examples/voice_prompts/README.md
@@ -0,0 +1,39 @@
+# Higgs Audio v2 — Sample Voice Prompts
+
+Drop-in reference voices for `HiggsAudioServer.generate(..., reference_audio_path=...)`. Each `.wav` is paired with a `.txt` containing the transcript of that clip (required for stable alignment between the cloned voice and the target text).
+
+| File | Character |
+| --- | --- |
+| `en_woman.wav` | English, feminine register |
+| `en_man.wav` | English, masculine register |
+| `en_man_deep.wav` | English, masculine register, lower pitch |
+
+All three were generated via Higgs Audio v2 smart-voice mode (no human recordings), so they're license-clean and can be freely redistributed.
+
+## Usage
+
+```python
+from mlx_audio.tts.models.higgs_audio import HiggsAudioServer
+from pathlib import Path
+
+voice_dir = Path("examples/voice_prompts")
+ref_wav = voice_dir / "en_woman.wav"
+ref_txt = (voice_dir / "en_woman.txt").read_text().strip()
+
+server = HiggsAudioServer.from_pretrained(
+    model_path="mlx-community/higgs-audio-v2-3B-mlx-q8",
+    codec_path="mlx-community/higgs-audio-v2-tokenizer",
+)
+
+result = server.generate(
+    target_text="Anything you want cloned in the chosen voice.",
+    reference_audio_path=str(ref_wav),
+    reference_text=ref_txt,
+    temperature=0.7,
+    top_p=0.95,
+    max_new_frames=1200,
+    fade_in_ms=30.0,
+)
+```
+
+For the recommended parameter set, see [`docs/models/tts/higgs_audio.md`](../../docs/models/tts/higgs_audio.md).
diff --git a/examples/voice_prompts/en_man.txt b/examples/voice_prompts/en_man.txt
@@ -0,0 +1 @@
+The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
diff --git a/examples/voice_prompts/en_man.wav b/examples/voice_prompts/en_man.wav
diff --git a/examples/voice_prompts/en_man_deep.txt b/examples/voice_prompts/en_man_deep.txt
@@ -0,0 +1 @@
+The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
diff --git a/examples/voice_prompts/en_man_deep.wav b/examples/voice_prompts/en_man_deep.wav
diff --git a/examples/voice_prompts/en_woman.txt b/examples/voice_prompts/en_woman.txt
@@ -0,0 +1 @@
+The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
diff --git a/examples/voice_prompts/en_woman.wav b/examples/voice_prompts/en_woman.wav