Voice Design

Voice Design mode lets you describe the desired speaker with a set of speaker attributes, passed through the instruct parameter. No reference audio is needed; the model generates a matching voice on the fly.

Quick Example

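import torch
from omnivoice import OmniVoice

# Load the pretrained model onto the first CUDA device in half precision.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16,
)

# No reference audio is passed; the voice is described entirely by instruct.
audio = model.generate(
    text="This is a test for voice design.",
    instruct="female, young adult, high pitch, british accent",
)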

How It Works

The instruct parameter accepts a comma-separated string of speaker attributes. Each attribute belongs to a category (gender, age, pitch, style, accent, or dialect). Within a category, only one attribute may be selected at a time. Attributes from different categories can be freely combined.

The model auto-detects the language of the instruct text and normalises it internally, so you can write in English, Chinese, or a mix of both.
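The one-attribute-per-category rule can be checked before calling the model. The sketch below is a hypothetical helper (find_conflicts is not part of the omnivoice package); its category table is copied from the English names in the Supported Attributes section, and it assumes half-width commas and English attribute names only.

```python
# Hypothetical pre-flight check; nothing here comes from the omnivoice package.
# Category contents are copied from the English names listed on this page.
CATEGORIES = {
    "gender": {"male", "female"},
    "age": {"child", "teenager", "young adult", "middle-aged", "elderly"},
    "pitch": {
        "very low pitch", "low pitch", "moderate pitch",
        "high pitch", "very high pitch",
    },
    "style": {"whisper"},
    "accent": {
        "american accent", "british accent", "australian accent",
        "canadian accent", "indian accent", "chinese accent",
        "korean accent", "japanese accent", "portuguese accent",
        "russian accent",
    },
}


def find_conflicts(instruct: str) -> dict[str, list[str]]:
    """Return every category that received more than one attribute."""
    seen: dict[str, list[str]] = {}
    for raw in instruct.split(","):
        attr = raw.strip().lower()
        for category, names in CATEGORIES.items():
            if attr in names:
                seen.setdefault(category, []).append(attr)
    return {cat: attrs for cat, attrs in seen.items() if len(attrs) > 1}


print(find_conflicts("female, young adult, high pitch, british accent"))
print(find_conflicts("male, low pitch, high pitch"))
```

The first call reports no conflicts; the second flags the pitch category because it was given two values.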

Supported Attributes

Gender

English Chinese
male 男
female 女

Age

English Chinese
child 儿童
teenager 少年
young adult 青年
middle-aged 中年
elderly 老年

Pitch

English Chinese
very low pitch 极低音调
low pitch 低音调
moderate pitch 中音调
high pitch 高音调
very high pitch 极高音调

Style

English Chinese
whisper 耳语

English Accent

Only effective when the synthesis text is in English.

Accent
american accent
british accent
australian accent
canadian accent
indian accent
chinese accent
korean accent
japanese accent
portuguese accent
russian accent

Chinese Dialect

Only effective when the synthesis text is in Chinese.

Dialect
河南话
陕西话
四川话
贵州话
云南话
桂林话
济南话
石家庄话
甘肃话
宁夏话
青岛话
东北话
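Because accents only take effect on English text and dialects only on Chinese text, a quick check can catch a mismatched pair before synthesis. The sketch below is a hypothetical helper, not part of omnivoice. It relies on two assumptions: text containing any CJK ideograph is treated as Chinese (a crude heuristic that ignores mixed-language input), and attributes are recognised by their suffix (" accent" or "话"), which matches every entry in the two tables above.

```python
import re

# Hypothetical pre-flight check; the omnivoice package is not involved.
# Basic CJK Unified Ideographs block, used as a rough "is this Chinese?" test.
CJK = re.compile(r"[\u4e00-\u9fff]")


def accent_or_dialect_mismatch(text: str, instruct: str) -> bool:
    """True if an English accent is paired with Chinese text,
    or a Chinese dialect is paired with text that has no Chinese characters."""
    # Accept both half-width and full-width commas as separators.
    attrs = [a.strip().lower() for a in re.split(r"[,，]", instruct)]
    has_accent = any(a.endswith(" accent") for a in attrs)
    has_dialect = any(a.endswith("话") for a in attrs)
    text_is_chinese = bool(CJK.search(text))
    return (has_accent and text_is_chinese) or (has_dialect and not text_is_chinese)


print(accent_or_dialect_mismatch("Hello there.", "female, british accent"))
print(accent_or_dialect_mismatch("你好，世界", "female, british accent"))
```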

Writing Instruct Strings

Separate attributes with commas: a half-width comma (,) for English and a full-width comma (，) for Chinese. The model auto-fixes mismatches.

# English
"female, young adult, high pitch, british accent"

# Chinese
"女，青年，高音调，四川话"

# Mixed (auto-normalised)
"female, young adult, 四川话"

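To make the comma and case handling concrete, here is a minimal sketch of the kind of clean-up described in this section. It is an illustration only: split_attributes is a hypothetical name, and this is not the library's actual normalisation code, which may behave differently.

```python
import re


def split_attributes(instruct: str) -> list[str]:
    """Split an instruct string on half-width or full-width commas,
    trim whitespace, lowercase each attribute, and drop empty entries."""
    parts = re.split(r"[,，]", instruct)
    return [p.strip().lower() for p in parts if p.strip()]


print(split_attributes("Female， young adult,四川话"))
```

Chinese attributes pass through unchanged, since lowercasing has no effect on them.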
Tips

  • Combine freely across categories: "male, elderly, low pitch, whisper".

  • Leave it to the model: omit attributes you don't care about and the model fills in the rest. For example, "female" alone is valid.

  • Case-insensitive: "Male", "MALE", and "male" are all accepted; the code normalises them to lowercase.

  • Accent vs Dialect: English accents are only applied to English speech, and Chinese dialects are only applied to Chinese speech.

  • Attribute combinations: Due to training data limitations, some attribute combinations may not work well, and the model may ignore certain attributes in a combination. If the output doesn't match your expectations, try simplifying the instruct string.
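One way to act on the last tip is to try progressively shorter instruct strings. The sketch below is a hypothetical helper (simplified_variants is not part of omnivoice) that drops the last attribute at each step. It assumes you list attributes in order of importance and separate them with half-width commas.

```python
def simplified_variants(instruct: str) -> list[str]:
    """Return instruct strings from the full one down to a single attribute,
    dropping the last (least important) attribute at each step."""
    attrs = [a.strip() for a in instruct.split(",") if a.strip()]
    return [", ".join(attrs[:n]) for n in range(len(attrs), 0, -1)]


for variant in simplified_variants("male, elderly, low pitch, whisper"):
    print(variant)
```

Each variant can be passed as instruct to model.generate, as in the Quick Example, stopping at the first one whose output sounds right.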