docs(voice-clone): clarify --prompt-text scope, length tuning, profile reuse (#9)#13
Open
MukundaKatta wants to merge 1 commit intoOpenMOSS:mainfrom
Open
Conversation
…e reuse (OpenMOSS#9) The CLI help on `--prompt-text` (in both `infer.py` and `moss_tts_nano/cli.py`) said it was "used by continuation mode" — but `model.inference` accepts `prompt_text` for voice_clone mode too, and supplying it improves cloning quality. Update the help to reflect that. Also adds a "Voice cloning details" subsection to README.md and README_zh.md that addresses the three questions from OpenMOSS#9 directly: 1. Yes, you can pass the source audio's transcript via --prompt-text / --prompt-text-file. It works for both modes. 2. Reference audio length: no enforced limit, but ~3–10 seconds of clean single-speaker speech tends to give the best results. Acknowledges the empirical observation that very short or very long clips degrade output, with a concrete suggestion (clip ~5 s). 3. There's no separate "voice profile" cache yet — keep the model loaded in process (via `python -i infer.py`, `moss-tts-nano serve`, or a reused `MossTtsNanoRuntime`) and call inference repeatedly with the same prompt args. No behavioural change; help text + docs only. Closes OpenMOSS#9.
72ba67f to
3e7465f
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #9.
The reporter asked three concrete questions about voice cloning that the docs don't currently answer cleanly. This PR addresses all three.
1. `--prompt-text` help was misleading
In both `infer.py` and `moss_tts_nano/cli.py`, the help text on `--prompt-text` said it was "used by continuation mode". But `model.inference` actually receives `prompt_text` regardless of mode (`infer.py:344-362`), and supplying it improves voice-clone quality because the model can text-align the prompt clip.
→ Updated both CLI help strings to make this explicit.
2. Length recommendation
We don't enforce a hard limit on prompt-audio length, but the reporter found that 3 s gave decent results while 2 / 6 / 10 / 30 s did not. Added an honest note in both READMEs:
3. Voice-profile caching
There's no profile abstraction in the codebase yet, so the cleanest pattern is to keep the model in process and call `inference()` repeatedly with the same prompt args. Documented exactly that, plus the three obvious ways to do it (`python -i infer.py`, `moss-tts-nano serve`, `MossTtsNanoRuntime` reuse).
Files changed
Test plan
AI-assisted disclosure
Drafted with Claude Code. I traced `prompt_text` through `infer.py:300-362` to confirm it reaches `model.inference` for both modes, then wrote the docs around what the code actually does (rather than asserting a length recipe I can't ground — the README phrasing is intentionally hedged, citing the issue reporter's empirical observation).