1 change: 1 addition & 0 deletions README.md
@@ -105,6 +105,7 @@ for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
| **Voxtral TTS** | Mistral's 4B multilingual TTS (20 voices, 9 languages) | EN, FR, ES, DE, IT, PT, NL, AR, HI | [mlx-community/Voxtral-4B-TTS-2603-mlx-bf16](https://huggingface.co/mlx-community/Voxtral-4B-TTS-2603-mlx-bf16) |
| **LongCat-AudioDiT** | SOTA diffusion TTS in waveform latent space with voice cloning | ZH, EN | [mlx-community/LongCat-AudioDiT-1B-bf16](https://huggingface.co/mlx-community/LongCat-AudioDiT-1B-bf16) |
| **MeloTTS** | Lightweight VITS2-based TTS with streaming | EN (more coming) | [mlx-community/MeloTTS-English-MLX](https://huggingface.co/mlx-community/MeloTTS-English-MLX) |
| **Higgs Audio v2** | 3B Llama-backed TTS with real-time voice cloning | EN, ZH, KO, DE, ES | [bf16 (upstream)](https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base), [q8](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8), [q6](https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6) |

### Speech-to-Text (STT)

183 changes: 183 additions & 0 deletions docs/models/tts/higgs_audio.md
@@ -0,0 +1,183 @@
# Higgs Audio v2

Higgs Audio v2 is a TTS model built on a Llama-3.2-3B backbone, with multi-codebook acoustic tokens and delay-pattern streaming. The MLX port targets the 3B open-weights release from Boson AI and reuses the in-tree HiggsAudio acoustic tokenizer (originally added for OmniVoice).

## Highlights

- Real-time voice cloning on Apple Silicon (RTF ≈ 0.60× bf16 / 0.36× q8 / 0.33× q6 on M5 Max; RTF is wall-clock time divided by audio duration, so lower is faster)
- Reference-audio voice cloning via ChatML prompt format
- Full `AUDIO_INIT` + delay-pattern ramp-in/out state machine
- Repetition-avoidance sampling (RAS) for stable long-form output
- MLX native 4/6/8-bit quantization with optional per-layer protection

## Basic usage

### Top-level CLI

```bash
python -m mlx_audio.tts.generate \
    --model mlx-community/higgs-audio-v2-3B-mlx-q8 \
    --text "Hello from Higgs Audio on MLX." \
    --ref_audio path/to/reference.wav \
    --ref_text "Transcript of the reference clip."
```

The `Model` class conforms to the standard mlx-audio interface, so the
existing `mlx_audio.tts.generate` CLI and `mlx_audio.server` both work
unchanged against Higgs.

### Python API (standard)

```python
from mlx_audio.tts.utils import load
import soundfile as sf

model = load("mlx-community/higgs-audio-v2-3B-mlx-q8")

for result in model.generate(
    text="Hello from Higgs Audio on MLX.",
    ref_audio="path/to/reference.wav",  # optional; strongly recommended
    ref_text="Transcript of the reference clip.",
    temperature=0.7,
    top_p=0.95,
    max_new_frames=1200,
    fade_in_ms=30.0,
):
    sf.write("output.wav", result.audio, result.sample_rate)
```

Without `ref_audio`, generation runs in "smart voice" mode (random voice
per sample). This works but is less reliable than voice cloning — the
sampling occasionally collapses to `stream_eos` early and produces silent
output. If that happens, rerun (each call draws fresh noise) or pass
`ref_audio`. For production use, a reference voice is strongly recommended.
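
The "rerun if the output is silent" advice can be automated. The following is a minimal sketch, not part of mlx-audio: `rms`, `generate_with_retry`, and the `silence_rms` threshold are hypothetical names and values chosen for illustration, and the `synthesize` callable is assumed to wrap a generation call and return the samples as a plain Python sequence of floats.

```python
import math
from typing import Callable, Optional, Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a block of float PCM samples."""
    if len(samples) == 0:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def generate_with_retry(
    synthesize: Callable[[], Sequence[float]],
    max_attempts: int = 3,
    silence_rms: float = 1e-3,  # assumed threshold; tune for your material
) -> Optional[Sequence[float]]:
    """Call `synthesize` until it returns audible PCM or attempts run out."""
    for _ in range(max_attempts):
        pcm = synthesize()
        if rms(pcm) >= silence_rms:
            return pcm
    return None  # every attempt was (near-)silent
```

A caller would pass a zero-argument function that runs one generation and returns its samples; a `None` result signals that a reference voice is probably the better fix.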

### Python API (Higgs-specific kwargs)

For direct access to the full Higgs parameter surface (RAS windowing,
sampling warmup, pre-loaded codec override, etc.), use `HiggsAudioServer`:

```python
from mlx_audio.tts.models.higgs_audio import HiggsAudioServer
import soundfile as sf

server = HiggsAudioServer.from_pretrained(
    model_path="bosonai/higgs-audio-v2-generation-3B-base",  # bf16 base
    codec_path="mlx-community/higgs-audio-v2-tokenizer",  # acoustic tokenizer
)

result = server.generate(
    target_text="Hello from Higgs Audio on MLX.",
    temperature=0.7,
    top_p=0.95,
    max_new_frames=1200,
    fade_in_ms=30.0,
)
sf.write("output.wav", result.pcm, result.sampling_rate)
```

### Recommended parameters

- `temperature=0.7`, `top_p=0.95`: stable across prompt lengths in the M5 Max benchmark
- `max_new_frames=1200`: a generous cap; generation stops naturally at the EOS ramp
- `fade_in_ms=30.0`, `fade_out_ms=15.0`: suppress the first-frame transient that the 5 ms default occasionally lets through
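
To make the fade parameters concrete, here is a small sketch of a linear fade-in and fade-out over float PCM. It is illustrative only: `apply_linear_fades` is a hypothetical helper operating on a plain Python list, and the port's actual fade code may differ in detail. The 24 kHz default matches the sample rate used elsewhere in this document.

```python
from typing import List, Sequence


def apply_linear_fades(
    pcm: Sequence[float],
    sample_rate: int = 24_000,
    fade_in_ms: float = 30.0,
    fade_out_ms: float = 15.0,
) -> List[float]:
    """Return a copy of `pcm` with linear gain ramps at both ends."""
    out = list(pcm)
    n = len(out)
    n_in = min(n, int(sample_rate * fade_in_ms / 1000.0))
    n_out = min(n, int(sample_rate * fade_out_ms / 1000.0))
    for i in range(n_in):
        out[i] *= i / n_in  # gain rises from 0 toward 1
    for i in range(n_out):
        out[n - 1 - i] *= i / n_out  # gain falls to 0 at the last sample
    return out
```

At 24 kHz, a 30 ms fade-in spans 720 samples and a 15 ms fade-out spans 360 samples.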

## Voice cloning

Pass `ref_audio` (a path or a pre-loaded `mx.array` at 24 kHz mono) together with
`ref_text` (the transcript of that clip). Reference audio is encoded through
the in-tree `HiggsAudioTokenizer` and stitched into the assistant turn of a
ChatML prompt. The transcript is required for stable alignment between the
cloned voice and the target text.

```python
# `model` and `sf` are set up as in the standard Python API example above.
for result in model.generate(
    text="Hello, this is a cloned voice.",
    ref_audio="reference.wav",
    ref_text="Transcript of the reference clip.",
    temperature=0.7,
    top_p=0.95,
    max_new_frames=1200,
    fade_in_ms=30.0,
):
    sf.write("output.wav", result.audio, result.sample_rate)
```

Best results come from 5–15 seconds of clean reference speech.
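
A quick pre-flight check on a reference clip can catch clips that fall outside these guidelines. The sketch below uses only the standard-library `wave` module, so it handles plain PCM WAV files only. `describe_reference` and `reference_warnings` are hypothetical helpers, not part of mlx-audio. Whether a file path is resampled automatically on load is not stated here, so the sample-rate note is informational.

```python
import wave
from typing import Dict, List


def describe_reference(path: str) -> Dict[str, float]:
    """Read sample rate, channel count, and duration from a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        sample_rate = wav.getframerate()
        channels = wav.getnchannels()
        duration_s = wav.getnframes() / sample_rate
    return {"sample_rate": sample_rate, "channels": channels, "duration_s": duration_s}


def reference_warnings(
    info: Dict[str, float], min_s: float = 5.0, max_s: float = 15.0
) -> List[str]:
    """List ways the clip departs from the 5-15 s, 24 kHz mono guidance."""
    notes = []
    if not (min_s <= info["duration_s"] <= max_s):
        notes.append(f"duration {info['duration_s']:.1f}s is outside {min_s}-{max_s}s")
    if info["channels"] != 1:
        notes.append("clip is not mono")
    if info["sample_rate"] != 24_000:
        notes.append("clip is not sampled at 24 kHz")
    return notes
```

An empty list from `reference_warnings(describe_reference("reference.wav"))` means the clip matches the guidance above.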

### Bundled sample voices

Three drop-in reference voices ship in `examples/voice_prompts/`, generated via Higgs smart-voice mode so they're license-clean:

- `en_woman.wav` — English, feminine register
- `en_man.wav` — English, masculine register
- `en_man_deep.wav` — English, masculine register, lower pitch

Each `.wav` is paired with a matching `.txt` transcript. See `examples/voice_prompts/README.md` for the usage snippet.

## Streaming

For chunked streaming output (e.g. Pipecat pipelines), use
`HiggsAudioServer.generate_stream`:

```python
for pcm_chunk in server.generate_stream(
    target_text="Generating in chunks for live playback.",
    reference_audio_path="reference.wav",
    reference_text="...",
    chunk_ms=640.0,
):
    # emit or resample pcm_chunk (float32 at 24 kHz)
    ...
```

`generate_stream` currently runs the full generation first and then splits the resulting PCM into chunks, so per-chunk quality matches non-streaming output exactly. Mid-generation streaming (emit-as-you-go) is not yet supported: the neural-vocoder codec produces subtly different PCM at the same sample position when called with different accumulated lengths, so chunk boundaries become audibly discontinuous. Proper overlap-add streaming is follow-up work.
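
The "generate fully, then chunk" behavior amounts to slicing a finished PCM buffer into fixed-duration pieces. The following sketch illustrates that idea on a plain Python sequence; `chunk_pcm` is a hypothetical helper and is not the implementation behind `generate_stream`.

```python
from typing import Iterator, List, Sequence


def chunk_pcm(
    pcm: Sequence[float], sample_rate: int = 24_000, chunk_ms: float = 640.0
) -> Iterator[List[float]]:
    """Yield consecutive slices of `pcm`, each at most `chunk_ms` long."""
    chunk_len = max(1, int(sample_rate * chunk_ms / 1000.0))
    for start in range(0, len(pcm), chunk_len):
        yield list(pcm[start : start + chunk_len])
```

With the values used above (24 kHz, 640 ms), each full chunk holds 15,360 samples and the last chunk holds whatever remains.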

## Quantization

MLX native 4/6/8-bit quantization works on the Llama backbone. The audio head and audio codebook embeddings benefit from staying at bf16 — quantizing them introduces voice-character drift (pitch register shifts at q6, trajectory instability at q4).

Already-quantized checkpoints load transparently via `load(...)`: `config.json` carries a `quantization` block that the framework applies before loading weights. To quantize a fresh bf16 load in place, pass `model.model_quant_predicate` as the `class_predicate` argument of `nn.quantize`:

```python
import mlx.core as mx
import mlx.nn as nn
from mlx_audio.tts.utils import load

model = load("bosonai/higgs-audio-v2-generation-3B-base")
nn.quantize(model, group_size=64, bits=8, class_predicate=model.model_quant_predicate)
mx.eval(model.parameters())
```

Benchmark on M5 Max (warm), long-prompt RTF:

| variant | RTF | weights size | notes |
|---------|-------|--------------|---------------------------------------------|
| bf16 | 0.60× | 6.8 GB | `bosonai/higgs-audio-v2-generation-3B-base` (authoritative) |
| q8 | 0.36× | 6.18 GB | `mlx-community/higgs-audio-v2-3B-mlx-q8` |
| q6 | 0.33× | 4.75 GB | `mlx-community/higgs-audio-v2-3B-mlx-q6` |
| q4 | 0.26× | 3.32 GB | deferred — seed-sensitive, follow-up PR |

bf16 is served directly from the authoritative `bosonai/*` upload — no need for a redundant mlx-community re-host. q8 and q6 are MLX-specific selectively-quantized variants.
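
The RTF column in the table above can be read as wall-clock seconds spent per second of audio produced, which is how the demo script in this change computes it. A minimal sketch of that arithmetic, with hypothetical helper names:

```python
def real_time_factor(wall_seconds: float, audio_seconds: float) -> float:
    """Wall-clock time divided by audio duration; below 1.0 is faster than real time."""
    return wall_seconds / audio_seconds if audio_seconds > 0 else float("inf")


def estimated_wall_seconds(audio_seconds: float, rtf: float) -> float:
    """Rough generation time for a clip of the given length at a given RTF."""
    return audio_seconds * rtf
```

For example, at the q8 figure of 0.36×, a 10-second clip would take roughly 3.6 seconds to generate on the benchmarked hardware.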

## Sampling controls

- `temperature=0.7`, `top_p=0.95` are the Higgs defaults.
- `ras_win_len=7`, `ras_max_repeat=2` enable repetition-avoidance sampling, which catches near-tie mispicks that would otherwise compound into loops. Set `ras_win_len=None` to disable it.
- `sampling_warmup_frames=N` uses greedy sampling for the first N frames, then switches to temperature sampling. It is exposed for experimentation and does not help at default settings.
- `fade_in_ms=5.0`, `fade_out_ms=5.0` apply a short linear fade to the decoded PCM boundaries. The fade is below the onset perception threshold on bf16/q8 and masks rounding-click transients on quantized variants.
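
The sketch below shows one common way a repetition-aware sampling step can work for a single codebook: draw a candidate with temperature and top-p, and if that candidate already appears at least `ras_max_repeat` times in the last `ras_win_len` tokens, redraw from the full distribution instead. This is an assumption-laden illustration in pure Python. The function names are hypothetical, and the port's actual RAS rule may differ.

```python
import math
import random
from typing import List, Optional, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_top_p(probs: Sequence[float], top_p: float, rng: random.Random) -> int:
    """Sample from the smallest high-probability set whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


def ras_sample(
    logits: Sequence[float],
    history: Sequence[int],
    rng: random.Random,
    temperature: float = 0.7,
    top_p: float = 0.95,
    ras_win_len: Optional[int] = 7,
    ras_max_repeat: int = 2,
) -> int:
    probs = softmax(logits, temperature)
    token = sample_top_p(probs, top_p, rng)
    if not ras_win_len:
        return token  # RAS disabled
    window = list(history[-ras_win_len:])
    if window.count(token) >= ras_max_repeat:
        # Candidate already repeats in the recent window: redraw from the
        # full (untruncated) distribution to give other tokens a chance.
        token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return token
```

The redraw does not forbid the repeated token; it only widens the candidate set, which is usually enough to break a loop without distorting normal output.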

## Implementation notes

The generation state machine is the non-obvious piece of this port; see `HiggsAudioModel._generate_raw_frames` in `mlx_audio/tts/models/higgs_audio/higgs_audio.py`. The first audio frame is **synthetic**, filled entirely with `audio_stream_bos_id` (AUDIO_INIT). It is not sampled from `audio_logits` at the `<|audio_out_bos|>` text position, because those logits were never trained for direct audio prediction. Without this synthetic frame, the model emits the stream-EOS token on half the codebooks at step 1 and the output collapses to a stuck pitch.

Codebook `i` is emitted with an `i`-frame delay, so the first K frames (K being the number of codebooks) form a progressive ramp-in: cb₀ is sampled at frame 1, cb₁ at frame 2, and so on, with the remaining codebooks forced to BOS. When any codebook emits EOS, a K-frame ramp-out begins, during which trailing codebooks are forced to EOS before termination. After `revert_delay_pattern`, the first and last aligned columns (the BOS seed and the EOS seal) are dropped; otherwise they decode to the arbitrary codec token 1023 and produce audible clicks.
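
The delay pattern itself is easy to see on a toy example. The sketch below shifts codebook `i` right by `i` frames and then undoes the shift. It is an illustration only: `apply_delay` and `revert_delay` are hypothetical helpers, the `BOS` and `EOS` ids are made-up values, and the extra step of dropping the first and last aligned columns is not modeled.

```python
from typing import List

BOS = 1024  # hypothetical stream-BOS id, for illustration only
EOS = 1025  # hypothetical stream-EOS id, for illustration only


def apply_delay(codes: List[List[int]]) -> List[List[int]]:
    """Shift codebook i right by i frames; pad with BOS before and EOS after."""
    k = len(codes)
    return [[BOS] * i + list(codes[i]) + [EOS] * (k - 1 - i) for i in range(k)]


def revert_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: realign every codebook to the same frame index."""
    k = len(delayed)
    t = len(delayed[0]) - (k - 1)  # number of aligned frames
    return [list(delayed[i][i : i + t]) for i in range(k)]
```

With three codebooks and three frames, the delayed grid is five columns wide: the leading columns show the BOS ramp-in and the trailing columns show the EOS ramp-out described above.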

## References

- Original repo: <https://github.com/boson-ai/higgs-audio>
- Paper / blog: <https://boson.ai/blog/higgs-audio-v2>
- HF model (reference): <https://huggingface.co/bosonai/higgs-audio-v2-generation-3B-base>
- HF model (MLX q8): <https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q8>
- HF model (MLX q6): <https://huggingface.co/mlx-community/higgs-audio-v2-3B-mlx-q6>
- HF codec: <https://huggingface.co/mlx-community/higgs-audio-v2-tokenizer>
148 changes: 148 additions & 0 deletions examples/higgs_audio_clone_demo.py
@@ -0,0 +1,148 @@
#!/usr/bin/env python3
"""Higgs Audio v2 voice cloning demo.

Uses the Higgs-specific HiggsAudioServer API for full parameter surface.
For a drop-in example against the standard mlx_audio.tts.generate CLI,
see docs/models/tts/higgs_audio.md.

Quick start with the bundled `en_woman` sample voice:
    python examples/higgs_audio_clone_demo.py \\
        --text "Text to synthesize in the cloned voice."

Supply your own reference:
    python examples/higgs_audio_clone_demo.py \\
        --ref_audio reference.wav \\
        --ref_text "Reference transcript text." \\
        --text "Text to synthesize in the cloned voice."

Reference audio is encoded through the in-tree HiggsAudioTokenizer and
stitched into the assistant turn of a ChatML prompt. ref_text is the
transcript of the reference clip — required for stable alignment
between the cloned voice and the target text.

Best results come from 5-15 seconds of clean reference speech.
Three sample voices live in examples/voice_prompts/ (en_woman,
en_man, en_man_deep) for drop-in use.
"""

import argparse
import sys
import time
from pathlib import Path

import mlx.core as mx
import mlx.nn as nn
import numpy as np
import soundfile as sf

from mlx_audio.tts.models.higgs_audio import HiggsAudioServer


def _quantize_predicate(name: str, module: nn.Module) -> bool:
    """Keep audio head + audio codebook embeddings at bf16; they're most
    sensitive to quantization noise. Everything else (Llama backbone + text
    head) gets compressed."""
    if not isinstance(module, (nn.Linear, nn.Embedding)):
        return False
    protected = ("audio_codebook_embeddings", "audio_decoder_proj.audio_lm_head")
    return not any(b in name for b in protected)


def _default_voice_prompt() -> tuple[str, str]:
    """Return the bundled `en_woman` voice prompt (wav path + transcript)."""
    here = Path(__file__).resolve().parent / "voice_prompts"
    return str(here / "en_woman.wav"), (here / "en_woman.txt").read_text().strip()


def main() -> int:
    p = argparse.ArgumentParser(description="Higgs Audio v2 voice cloning demo")
    default_ref_audio, default_ref_text = _default_voice_prompt()
    p.add_argument(
        "--ref_audio",
        default=default_ref_audio,
        help="Reference audio WAV (defaults to bundled en_woman sample)",
    )
    p.add_argument(
        "--ref_text", default=default_ref_text, help="Transcript of the reference audio"
    )
    p.add_argument("--text", required=True, help="Target text to synthesize")
    p.add_argument("--output", default="higgs_clone_output.wav", help="Output WAV path")
    p.add_argument(
        "--model",
        default="mlx-community/higgs-audio-v2-3B-mlx-bf16",
        help="Higgs Audio v2 MLX model repo or path",
    )
    p.add_argument(
        "--codec",
        default="mlx-community/higgs-audio-v2-tokenizer",
        help="Higgs Audio v2 tokenizer repo or path",
    )
    p.add_argument(
        "--quantize_bits",
        type=int,
        default=None,
        choices=[4, 6, 8],
        help="Optionally quantize the loaded model in-place (4/6/8-bit)",
    )
    p.add_argument("--temperature", type=float, default=0.7)
    p.add_argument("--top_p", type=float, default=0.95)
    p.add_argument("--max_new_frames", type=int, default=1200)
    p.add_argument("--ras_win_len", type=int, default=7)
    p.add_argument("--ras_max_repeat", type=int, default=2)
    p.add_argument(
        "--fade_in_ms",
        type=float,
        default=30.0,
        help="Leading fade (ms); 30ms suppresses the first-frame transient cleanly",
    )
    p.add_argument("--fade_out_ms", type=float, default=15.0)
    args = p.parse_args()

    print(f"[load] HiggsAudioServer from {args.model}")
    t0 = time.monotonic()
    server = HiggsAudioServer.from_pretrained(
        model_path=args.model,
        codec_path=args.codec,
    )
    if args.quantize_bits is not None:
        print(
            f"[quantize] group_size=64 bits={args.quantize_bits} (audio head protected)"
        )
        nn.quantize(
            server.model,
            group_size=64,
            bits=args.quantize_bits,
            class_predicate=_quantize_predicate,
        )
        mx.eval(server.model.parameters())
    print(f" loaded in {time.monotonic() - t0:.2f}s")

    print(f"[generate] target: {args.text!r}")
    t_gen = time.monotonic()
    result = server.generate(
        target_text=args.text,
        reference_audio_path=args.ref_audio,
        reference_text=args.ref_text,
        max_new_frames=args.max_new_frames,
        temperature=args.temperature,
        top_p=args.top_p,
        ras_win_len=args.ras_win_len,
        ras_max_repeat=args.ras_max_repeat,
        fade_in_ms=args.fade_in_ms,
        fade_out_ms=args.fade_out_ms,
    )
    wall = time.monotonic() - t_gen
    audio_sec = len(result.pcm) / result.sampling_rate
    rtf = wall / audio_sec if audio_sec > 0 else float("inf")

    sf.write(args.output, result.pcm, result.sampling_rate)
    print(
        f"[done] {audio_sec:.2f}s audio in {wall:.2f}s wall "
        f"(RTF {rtf:.2f}×, {result.num_frames_raw} frames, "
        f"stop={result.stop_reason}) → {args.output}"
    )
    return 0


if __name__ == "__main__":
    sys.exit(main())
39 changes: 39 additions & 0 deletions examples/voice_prompts/README.md
@@ -0,0 +1,39 @@
# Higgs Audio v2 — Sample Voice Prompts

Drop-in reference voices for `HiggsAudioServer.generate(..., reference_audio_path=...)`. Each `.wav` is paired with a `.txt` containing the transcript of that clip (required for stable alignment between the cloned voice and the target text).

| File | Character |
| --- | --- |
| `en_woman.wav` | English, feminine register |
| `en_man.wav` | English, masculine register |
| `en_man_deep.wav` | English, masculine register, lower pitch |

All three were generated via Higgs Audio v2 smart-voice mode (no human recordings), so they're license-clean and can be freely redistributed.

## Usage

```python
from mlx_audio.tts.models.higgs_audio import HiggsAudioServer
from pathlib import Path

voice_dir = Path("examples/voice_prompts")
ref_wav = voice_dir / "en_woman.wav"
ref_txt = (voice_dir / "en_woman.txt").read_text().strip()

server = HiggsAudioServer.from_pretrained(
    model_path="mlx-community/higgs-audio-v2-3B-mlx-q8",
    codec_path="mlx-community/higgs-audio-v2-tokenizer",
)

result = server.generate(
    target_text="Anything you want cloned in the chosen voice.",
    reference_audio_path=str(ref_wav),
    reference_text=ref_txt,
    temperature=0.7,
    top_p=0.95,
    max_new_frames=1200,
    fade_in_ms=30.0,
)
```

For the recommended parameter set, see [`docs/models/tts/higgs_audio.md`](../../docs/models/tts/higgs_audio.md).
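
Because every `.wav` here has a matching `.txt`, the prompts can be discovered by name rather than hard-coded. The helper below is a hypothetical convenience sketch, not part of mlx-audio. It pairs each WAV file with its transcript and skips any WAV that has no transcript.

```python
from pathlib import Path
from typing import Dict, Tuple


def load_voice_prompts(voice_dir: str) -> Dict[str, Tuple[str, str]]:
    """Map prompt name -> (wav path, transcript) for every paired file."""
    prompts: Dict[str, Tuple[str, str]] = {}
    for wav in sorted(Path(voice_dir).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():  # skip clips that have no transcript
            prompts[wav.stem] = (str(wav), txt.read_text().strip())
    return prompts
```

The returned pair can be passed straight to `reference_audio_path` and `reference_text` in the usage snippet above.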
1 change: 1 addition & 0 deletions examples/voice_prompts/en_man.txt
@@ -0,0 +1 @@
The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
Binary file added examples/voice_prompts/en_man.wav
Binary file not shown.
1 change: 1 addition & 0 deletions examples/voice_prompts/en_man_deep.txt
@@ -0,0 +1 @@
The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
Binary file added examples/voice_prompts/en_man_deep.wav
Binary file not shown.
1 change: 1 addition & 0 deletions examples/voice_prompts/en_woman.txt
@@ -0,0 +1 @@
The radio quietly played a familiar song. Outside, rain tapped against the window in a steady rhythm. Coffee cooled slowly in a ceramic mug. Somewhere down the hall, a door clicked shut.
Binary file added examples/voice_prompts/en_woman.wav
Binary file not shown.