Skip to content

docs(voice-clone): clarify --prompt-text scope, length tuning, profile reuse (#9)#13

Open
MukundaKatta wants to merge 1 commit intoOpenMOSS:mainfrom
MukundaKatta:docs/voice-cloning-clarifications
Open

docs(voice-clone): clarify --prompt-text scope, length tuning, profile reuse (#9)#13
MukundaKatta wants to merge 1 commit intoOpenMOSS:mainfrom
MukundaKatta:docs/voice-cloning-clarifications

Conversation

@MukundaKatta
Copy link
Copy Markdown

Summary

Closes #9.

The reporter asked three concrete questions about voice cloning that the docs don't currently answer cleanly. This PR addresses all three.

1. `--prompt-text` help was misleading

In both `infer.py` and `moss_tts_nano/cli.py`, the help text on `--prompt-text` said it was "used by continuation mode". But `model.inference` actually receives `prompt_text` regardless of mode (`infer.py:344-362`), and supplying it improves voice-clone quality because the model can text-align the prompt clip.

→ Updated both CLI help strings to make this explicit.

2. Length recommendation

We don't enforce a hard limit on prompt-audio length, but the reporter found that 3 s gave decent results while 2 / 6 / 10 / 30 s did not. Added an honest note in both READMEs:

Empirically, short clips (≈ 3–10 seconds) of clean speech tend to give the best results: long clips spend more of the model's prompt budget on acoustic context, and very short ones (< 2 s) often don't carry enough timbre. If you see degraded output, try clipping a clean, single-speaker passage at around 5 seconds.

3. Voice-profile caching

There's no profile abstraction in the codebase yet, so the cleanest pattern is to keep the model in process and call `inference()` repeatedly with the same prompt args. Documented exactly that, plus the three obvious ways to do it (`python -i infer.py`, `moss-tts-nano serve`, `MossTtsNanoRuntime` reuse).

Files changed

  • `infer.py` — corrected `--prompt-text` / `--prompt-text-file` help.
  • `moss_tts_nano/cli.py` — same correction on the packaged CLI.
  • `README.md` — new "Voice cloning details" subsection under Voice Clone with infer.py.
  • `README_zh.md` — same subsection translated.

Test plan

  • `python3 -c "import ast; ast.parse(open('infer.py').read()); ast.parse(open('moss_tts_nano/cli.py').read())"` — syntax clean.
  • No behavioural change; help-text + docs only.

AI-assisted disclosure

Drafted with Claude Code. I traced `prompt_text` through `infer.py:300-362` to confirm it reaches `model.inference` for both modes, then wrote the docs around what the code actually does (rather than asserting a length recipe I can't ground — the README phrasing is intentionally hedged, citing the issue reporter's empirical observation).

…e reuse (OpenMOSS#9)

The CLI help on `--prompt-text` (in both `infer.py` and `moss_tts_nano/cli.py`)
said it was "used by continuation mode" — but `model.inference` accepts
`prompt_text` for voice_clone mode too, and supplying it improves
cloning quality. Update the help to reflect that.

Also adds a "Voice cloning details" subsection to README.md and
README_zh.md that addresses the three questions from OpenMOSS#9 directly:

1. Yes, you can pass the source audio's transcript via --prompt-text /
   --prompt-text-file. It works for both modes.
2. Reference audio length: no enforced limit, but ~3–10 seconds of
   clean single-speaker speech tends to give the best results.
   Acknowledges the empirical observation that very short or very long
   clips degrade output, with a concrete suggestion (clip ~5 s).
3. There's no separate "voice profile" cache yet — keep the model
   loaded in process (via `python -i infer.py`, `moss-tts-nano serve`,
   or a reused `MossTtsNanoRuntime`) and call inference repeatedly
   with the same prompt args.

No behavioural change; help text + docs only. Closes OpenMOSS#9.
@MukundaKatta MukundaKatta force-pushed the docs/voice-cloning-clarifications branch from 72ba67f to 3e7465f Compare April 18, 2026 09:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Unclear voice cloning instructions / documentation

1 participant