- Combination of `ref_audio` and `instruct`: When both `ref_audio` and `instruct` are provided and they conflict, the model will most likely follow the style of the reference audio. When the two are consistent, `instruct` can improve cloning stability for the attributes it describes. A typical example is Chinese dialect cloning: provide both dialect reference audio and a matching dialect instruct (e.g., `ref_audio="sichuan.wav", instruct="四川话"`) for more stable dialect output.
- Short Audio Generation: The model may not reliably generate short audio clips (e.g., 1–2 seconds) without reference audio. If you need to generate short clips, provide reference audio to the model.
- Min Nan Chinese (Hokkien) Input Format: Min Nan Chinese (闽南语, also known as Hokkien) can only be synthesized using Tai-lo romanization as input; Chinese characters are not supported for Min Nan Chinese in the current model version.
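
The notes above can be turned into a small pre-flight check that runs before a synthesis call. The sketch below is illustrative only: `check_request`, the `language` tag value `"min-nan"`, the caller-supplied `expected_seconds` estimate, and the 2-second threshold are assumptions made for this example and are not part of the model's API. Only the `ref_audio` and `instruct` names come from the notes above.

```python
import re
from typing import List, Optional

# Han ideographs: CJK Unified Ideographs plus Extension A.
_HAN = re.compile(r"[\u3400-\u4dbf\u4e00-\u9fff]")

# Assumed cutoff for a "short" clip, taken from the "1-2 seconds" example.
SHORT_CLIP_SECONDS = 2.0


def check_request(
    text: str,
    ref_audio: Optional[str] = None,
    instruct: Optional[str] = None,
    language: Optional[str] = None,          # hypothetical tag, e.g. "min-nan"
    expected_seconds: Optional[float] = None,  # caller's own duration estimate
) -> List[str]:
    """Return human-readable warnings for a planned synthesis request.

    Hypothetical helper: it does not call the model, it only inspects
    the arguments against the usage notes.
    """
    warnings: List[str] = []

    # Both controls set: the reference audio most likely wins on conflict.
    if ref_audio and instruct:
        warnings.append(
            "ref_audio and instruct are both set: make sure they agree, "
            "since the reference audio style will most likely win on conflict"
        )

    # Short clips are unreliable without reference audio.
    if (
        expected_seconds is not None
        and expected_seconds <= SHORT_CLIP_SECONDS
        and not ref_audio
    ):
        warnings.append("short clip requested without ref_audio")

    # Min Nan input must be Tai-lo romanization, not Chinese characters.
    if language == "min-nan" and _HAN.search(text):
        warnings.append("Min Nan text contains Chinese characters: use Tai-lo")

    return warnings
```

For example, `check_request("Lí hó", language="min-nan", ref_audio="ref.wav")` returns an empty list, while passing `"你好"` with the same arguments returns the Tai-lo warning. The check on `ref_audio` plus `instruct` is a reminder rather than an error, because the combination is recommended when the two agree.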