35 changes: 35 additions & 0 deletions README.md
@@ -147,6 +147,41 @@ python infer.py \

This writes audio to `generated_audio/infer_output.wav` by default.

#### Voice cloning details

A few questions come up frequently (see [#9](https://github.com/OpenMOSS/MOSS-TTS-Nano/issues/9)):

1. **Can I pass the transcript of the reference audio?**
Yes — use `--prompt-text "<transcript>"` (or `--prompt-text-file path.txt`).
It is honoured by both `--mode voice_clone` (the default) and
`--mode continuation`. Supplying it generally improves cloning quality
because the model can align the prompt text with the prompt audio.

```bash
python infer.py \
--prompt-audio-path assets/audio/zh_1.wav \
--prompt-text "欢迎收听今日新闻播报。" \
--text "今天的天气非常好。"
```

2. **What length should the reference audio be?**
We don't enforce a hard limit — the audio tokenizer accepts arbitrary
lengths and the prompt is internally clipped by
`--max-new-frames` / `--voice-clone-max-text-tokens`. Empirically,
short clips (≈ 3–10 seconds) of *clean* speech tend to give the best
results: long clips spend more of the model's prompt budget on
acoustic context, and very short ones (< 2 s) often don't carry
enough timbre. If you see degraded output, try clipping a clean,
single-speaker passage at around 5 seconds.
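
If your reference recording is long, you can cut a clip without extra tooling. A minimal sketch using Python's standard-library `wave` module (PCM WAV only; the 5-second figure follows the guidance above, and the file paths are illustrative placeholders):

```python
import wave

def clip_wav(src: str, dst: str, seconds: float = 5.0) -> None:
    """Copy the first `seconds` of a PCM WAV file into a new file."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(int(seconds * reader.getframerate()))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # nframes is corrected automatically on close
        writer.writeframes(frames)

# e.g. clip_wav("assets/audio/zh_1.wav", "prompt_clip.wav")
```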

3. **How do I cache a voice profile across multiple generations?**
There's no separate "voice profile" object yet — the cleanest pattern
is to keep the model loaded in process (e.g. via `python -i infer.py`,
`moss-tts-nano serve`, or by reusing a `MossTtsNanoRuntime` instance
in your own script) and call `model.inference(...)` repeatedly with
the same `prompt_audio_path` and `prompt_text`. The audio tokenizer
will re-encode the prompt each call, but the model weights stay warm.
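
The keep-it-warm pattern above can be sketched as a process-level cache. `MossTtsNanoRuntime` and `inference(...)` are taken from the description above and are assumptions about the real API; the commented lines mark where the real calls would go, and the stand-in object only makes the sketch runnable:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_runtime(checkpoint: str):
    """Load the model once per process; repeat calls return the same instance."""
    # Assumed real API, per the text above:
    # from moss_tts_nano import MossTtsNanoRuntime
    # return MossTtsNanoRuntime(checkpoint)
    return object()  # stand-in so the sketch runs without the package

def synthesize(text: str, prompt_audio_path: str, prompt_text: str):
    runtime = get_runtime("checkpoints/moss-tts-nano")  # hypothetical path
    # The prompt audio is still re-encoded per call, but the weights stay warm:
    # return runtime.inference(text=text, prompt_audio_path=prompt_audio_path,
    #                          prompt_text=prompt_text)
```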

### Local Web Demo with `app.py`

You can launch the local FastAPI demo for browser-based testing:
23 changes: 23 additions & 0 deletions README_zh.md
@@ -142,6 +142,29 @@ python infer.py \

By default, this writes audio to `generated_audio/infer_output.wav`.

#### Voice cloning details

A few questions come up repeatedly in the community (see [#9](https://github.com/OpenMOSS/MOSS-TTS-Nano/issues/9)):

1. **Can I pass the transcript of the reference audio?**
Yes: use `--prompt-text "<transcript>"` (or `--prompt-text-file path.txt`).
Both `--mode voice_clone` (the default) and `--mode continuation` support it.
Supplying the transcript generally improves cloning quality because the
model can align the prompt text with its audio.

2. **How long should the reference audio be?**
There is no hard limit: the audio tokenizer accepts arbitrary lengths, and
the prompt is clipped internally by
`--max-new-frames` / `--voice-clone-max-text-tokens`. Empirically, around
3–10 seconds of *clean* single-speaker speech works best: overly long clips
spend the model's prompt budget on acoustic context, while very short ones
(< 2 s) carry too little timbre information. If output quality degrades,
try again with a clear single-speaker clip of about 5 seconds.

3. **How do I cache a voice profile across multiple generations?**
There is no standalone "voice profile" object yet. The cleanest pattern is
to keep the model resident in the process (e.g. `python -i infer.py`,
`moss-tts-nano serve`, or reusing a `MossTtsNanoRuntime` instance in your
own script) and call `model.inference(...)` repeatedly with the same
`prompt_audio_path` and `prompt_text`. The audio tokenizer re-encodes the
prompt on every call, but the model weights stay loaded.

### Launching the Local Web Demo with `app.py`

You can launch the local FastAPI demo for browser-based testing:
14 changes: 12 additions & 2 deletions infer.py
@@ -53,8 +53,18 @@ def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
text_group.add_argument("--text-file", help="Path to a UTF-8 text file to synthesize.")

prompt_text_group = parser.add_mutually_exclusive_group(required=False)
prompt_text_group.add_argument("--prompt-text", help="Reference transcript used by continuation mode.")
prompt_text_group.add_argument("--prompt-text-file", help="UTF-8 reference transcript file used by continuation mode.")
prompt_text_group.add_argument(
"--prompt-text",
help=(
"Transcript of the reference audio. Used by both continuation mode and "
"voice_clone mode — supplying it generally improves cloning quality "
"because the model can align text-to-audio for the prompt clip."
),
)
prompt_text_group.add_argument(
"--prompt-text-file",
help="UTF-8 file alternative to --prompt-text. Same behaviour for both modes.",
)

parser.add_argument("--text-tokenizer-path", default=None, help="Override the checkpoint-bundled text tokenizer.")
parser.add_argument(
7 changes: 6 additions & 1 deletion moss_tts_nano/cli.py
@@ -71,7 +71,12 @@ def _build_parser() -> argparse.ArgumentParser:
generate_parser.add_argument(
"--prompt-text",
default=None,
help="PyTorch backend only. Reference transcript used by continuation mode.",
help=(
"PyTorch backend only. Transcript of the reference audio. Used by "
"both continuation and voice_clone modes, and supplying it "
"generally improves cloning quality because the model can align "
"text-to-audio for the prompt."
),
)
generate_parser.add_argument(
"--voice",