35 changes: 35 additions & 0 deletions README.md
@@ -147,6 +147,41 @@ python infer.py \

This writes audio to `generated_audio/infer_output.wav` by default.

#### Voice cloning details

A few questions come up frequently (see [#9](https://github.com/OpenMOSS/MOSS-TTS-Nano/issues/9)):

1. **Can I pass the transcript of the reference audio?**
Yes — use `--prompt-text "<transcript>"` (or `--prompt-text-file path.txt`).
It is honoured by both `--mode voice_clone` (the default) and
`--mode continuation`. Supplying it generally improves cloning quality
because the model can align the prompt text with the prompt audio.

```bash
python infer.py \
--prompt-audio-path assets/audio/zh_1.wav \
--prompt-text "欢迎收听今日新闻播报。" \
--text "今天的天气非常好。"
```

2. **What length should the reference audio be?**
We don't enforce a hard limit — the audio tokenizer accepts arbitrary
lengths and the prompt is internally clipped by
`--max-new-frames` / `--voice-clone-max-text-tokens`. Empirically,
short clips (≈ 3–10 seconds) of *clean* speech tend to give the best
results: long clips spend more of the model's prompt budget on
acoustic context, and very short ones (< 2 s) often don't carry
enough timbre. If you see degraded output, try clipping a clean,
single-speaker passage at around 5 seconds.
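
If your reference recording is long, you can cut a clip without extra tooling. A minimal sketch using Python's standard-library `wave` module (PCM WAV only; the 5-second figure follows the guidance above, and the file paths are illustrative placeholders):

```python
import wave

def clip_wav(src: str, dst: str, seconds: float = 5.0) -> None:
    """Copy the first `seconds` of a PCM WAV file into a new file."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        frames = reader.readframes(int(seconds * reader.getframerate()))
    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # nframes is corrected automatically on close
        writer.writeframes(frames)

# e.g. clip_wav("assets/audio/zh_1.wav", "prompt_clip.wav")
```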

3. **How do I cache a voice profile across multiple generations?**
There's no separate "voice profile" object yet — the cleanest pattern
is to keep the model loaded in process (e.g. via `python -i infer.py`,
`moss-tts-nano serve`, or by reusing a `MossTtsNanoRuntime` instance
in your own script) and call `model.inference(...)` repeatedly with
the same `prompt_audio_path` and `prompt_text`. The audio tokenizer
will re-encode the prompt each call, but the model weights stay warm.
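
The keep-it-warm pattern above can be sketched as a process-level cache. `MossTtsNanoRuntime` and `inference(...)` are taken from the description above and are assumptions about the real API; the commented lines mark where the real calls would go, and the stand-in object only makes the sketch runnable:

```python
from functools import lru_cache

@lru_cache(maxsize=1)
def get_runtime(checkpoint: str):
    """Load the model once per process; repeat calls return the same instance."""
    # Assumed real API, per the text above:
    # from moss_tts_nano import MossTtsNanoRuntime
    # return MossTtsNanoRuntime(checkpoint)
    return object()  # stand-in so the sketch runs without the package

def synthesize(text: str, prompt_audio_path: str, prompt_text: str):
    runtime = get_runtime("checkpoints/moss-tts-nano")  # hypothetical path
    # The prompt audio is still re-encoded per call, but the weights stay warm:
    # return runtime.inference(text=text, prompt_audio_path=prompt_audio_path,
    #                          prompt_text=prompt_text)
```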

### Local Web Demo with `app.py`

You can launch the local FastAPI demo for browser-based testing:
23 changes: 23 additions & 0 deletions README_zh.md
@@ -142,6 +142,29 @@ python infer.py \

By default, this writes audio to `generated_audio/infer_output.wav`.

#### Voice cloning details

A few questions come up repeatedly in the community (see [#9](https://github.com/OpenMOSS/MOSS-TTS-Nano/issues/9)):

1. **Can I pass the transcript of the reference audio?**
Yes: use `--prompt-text "<transcript>"` (or `--prompt-text-file path.txt`).
Both `--mode voice_clone` (the default) and `--mode continuation` support it.
Supplying the transcript generally improves cloning quality because the
model can align the prompt text with its audio.

2. **How long should the reference audio be?**
There is no hard limit: the audio tokenizer accepts arbitrary lengths, and
the prompt is clipped internally by
`--max-new-frames` / `--voice-clone-max-text-tokens`. Empirically, around
3–10 seconds of *clean* single-speaker speech works best: overly long clips
spend the model's prompt budget on acoustic context, while very short ones
(< 2 s) carry too little timbre information. If output quality degrades,
try again with a clear single-speaker clip of about 5 seconds.

3. **How do I cache a voice profile across multiple generations?**
There is no standalone "voice profile" object yet. The cleanest pattern is
to keep the model resident in the process (e.g. `python -i infer.py`,
`moss-tts-nano serve`, or reusing a `MossTtsNanoRuntime` instance in your
own script) and call `model.inference(...)` repeatedly with the same
`prompt_audio_path` and `prompt_text`. The audio tokenizer re-encodes the
prompt on every call, but the model weights stay loaded.

### Launching the Local Web Demo with `app.py`

You can launch the local FastAPI demo for browser-based testing:
14 changes: 12 additions & 2 deletions infer.py
@@ -53,8 +53,18 @@ def parse_args(argv: Optional[Sequence[str]] = None) -> argparse.Namespace:
text_group.add_argument("--text-file", help="Path to a UTF-8 text file to synthesize.")

prompt_text_group = parser.add_mutually_exclusive_group(required=False)
prompt_text_group.add_argument("--prompt-text", help="Reference transcript used by continuation mode.")
prompt_text_group.add_argument("--prompt-text-file", help="UTF-8 reference transcript file used by continuation mode.")
prompt_text_group.add_argument(
"--prompt-text",
help=(
"Transcript of the reference audio. Used by both continuation mode and "
"voice_clone mode — supplying it generally improves cloning quality "
"because the model can align text-to-audio for the prompt clip."
),
)
prompt_text_group.add_argument(
"--prompt-text-file",
help="UTF-8 file alternative to --prompt-text. Same behaviour for both modes.",
)

parser.add_argument("--text-tokenizer-path", default=None, help="Override the checkpoint-bundled text tokenizer.")
parser.add_argument(
7 changes: 6 additions & 1 deletion moss_tts_nano/cli.py
@@ -71,7 +71,12 @@ def _build_parser() -> argparse.ArgumentParser:
generate_parser.add_argument(
"--prompt-text",
default=None,
help="PyTorch backend only. Reference transcript used by continuation mode.",
help=(
"PyTorch backend only. Transcript of the reference audio. Used by "
"both continuation and voice_clone modes, and supplying it "
"generally improves cloning quality because the model can align "
"text-to-audio for the prompt."
),
)
generate_parser.add_argument(
"--voice",