Voice Design

Voice Design mode lets you describe the desired speaker with a list of speaker attributes, passed through the instruct parameter. No reference audio is needed: the model generates a matching voice on the fly.

Quick Example

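import torch
from omnivoice import OmniVoice

# Load the pretrained model onto the first GPU in half precision.
model = OmniVoice.from_pretrained(
    "k2-fsa/OmniVoice",
    device_map="cuda:0",
    dtype=torch.float16
)

# Describe the voice with attributes instead of supplying reference audio.
audio = model.generate(
    text="This is a test for voice design.",
    instruct="female, young adult, high pitch, british accent",
)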

How It Works

The instruct parameter accepts a comma-separated string of speaker attributes. Each attribute belongs to a category (gender, age, pitch, style, accent, or dialect). Within a category, only one attribute may be selected at a time. Attributes from different categories can be freely combined.

The model auto-detects the language of the instruct text and normalises it internally — you can write in English, Chinese, or a mix of both.
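
The one-attribute-per-category rule can be checked on the client before calling the model. The following is a minimal sketch, not part of the OmniVoice API: `CATEGORY_OF`, `split_attributes`, and `find_conflicts` are hypothetical names, the table covers only the English attribute names from this document, and treating any item that ends in " accent" as an accent is an assumption made for brevity.

```python
from __future__ import annotations

import re

# Hypothetical lookup table: English attribute names from this document only.
CATEGORY_OF = {
    "male": "gender", "female": "gender",
    "child": "age", "teenager": "age", "young adult": "age",
    "middle-aged": "age", "elderly": "age",
    "very low pitch": "pitch", "low pitch": "pitch", "moderate pitch": "pitch",
    "high pitch": "pitch", "very high pitch": "pitch",
    "whisper": "style",
}


def split_attributes(instruct: str) -> list[str]:
    """Split on half-width or full-width commas and lowercase each item."""
    parts = re.split(r"[,，]", instruct)
    return [p.strip().lower() for p in parts if p.strip()]


def find_conflicts(instruct: str) -> dict[str, list[str]]:
    """Return every category that received more than one attribute."""
    seen: dict[str, list[str]] = {}
    for attr in split_attributes(instruct):
        category = CATEGORY_OF.get(attr)
        # Assumption: anything ending in " accent" belongs to the accent category.
        if category is None and attr.endswith(" accent"):
            category = "accent"
        if category is not None:  # unknown attributes are ignored here
            seen.setdefault(category, []).append(attr)
    return {c: attrs for c, attrs in seen.items() if len(attrs) > 1}
```

An empty result means no category was used twice. Attributes the table does not know are skipped rather than rejected, so this check is advisory only.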

Supported Attributes

Gender

English	Chinese
male	男
female	女

Age

English	Chinese
child	儿童
teenager	少年
young adult	青年
middle-aged	中年
elderly	老年

Pitch

English	Chinese
very low pitch	极低音调
low pitch	低音调
moderate pitch	中音调
high pitch	高音调
very high pitch	极高音调

Style

English	Chinese
whisper	耳语

English Accent

Only effective when the synthesis text is in English.

Accent
american accent
british accent
australian accent
canadian accent
indian accent
chinese accent
korean accent
japanese accent
portuguese accent
russian accent

Chinese Dialect

Only effective when the synthesis text is in Chinese.

Dialect
河南话
陕西话
四川话
贵州话
云南话
桂林话
济南话
石家庄话
甘肃话
宁夏话
青岛话
东北话
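
Because accents apply only to English text and dialects only to Chinese text, it can help to flag attributes that will probably have no effect. The sketch below is an assumption-laden heuristic, not the model's own language detection: `ineffective_attributes` is a hypothetical helper, "Chinese" is approximated as "contains at least one CJK ideograph", and accents and dialects are recognised only by their suffixes (" accent" and "话").

```python
from __future__ import annotations

import re

# Rough heuristic: any character in the basic CJK Unified Ideographs block.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def ineffective_attributes(text: str, instruct: str) -> list[str]:
    """List accent/dialect attributes that likely do nothing for this text."""
    is_chinese = bool(_CJK.search(text))
    attrs = [a.strip().lower() for a in re.split(r"[,，]", instruct) if a.strip()]
    flagged = []
    for attr in attrs:
        if attr.endswith(" accent") and is_chinese:
            flagged.append(attr)  # English accent on Chinese text
        elif attr.endswith("话") and not is_chinese:
            flagged.append(attr)  # Chinese dialect on non-Chinese text
    return flagged
```

For example, pairing the text "Hello there." with the instruct string "female, 四川话" flags 四川话, since the dialect only takes effect on Chinese text.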

Writing Instruct Strings

Separate attributes with commas: half-width (,) for English and full-width (，) for Chinese. The model auto-fixes mismatches.

# English
"female, young adult, high pitch, british accent"

# Chinese
"女，青年，高音调，四川话"

# Mixed (auto-normalised)
"female, young adult, 四川话"

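Instruct strings can also be assembled from structured fields rather than typed by hand. This is a minimal sketch under stated assumptions: `VoiceSpec` and `to_instruct` are hypothetical names and not part of the OmniVoice API. Using one field per category keeps each category to a single attribute by construction, and unset fields are simply left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class VoiceSpec:
    """Hypothetical container with one optional field per attribute category."""

    gender: str | None = None
    age: str | None = None
    pitch: str | None = None
    style: str | None = None
    accent: str | None = None
    dialect: str | None = None

    def to_instruct(self, chinese: bool = False) -> str:
        """Join the fields that are set, using the comma style for the language."""
        sep = "，" if chinese else ", "
        values = [self.gender, self.age, self.pitch,
                  self.style, self.accent, self.dialect]
        return sep.join(v for v in values if v)


english = VoiceSpec(gender="female", age="young adult",
                    pitch="high pitch", accent="british accent")
chinese = VoiceSpec(gender="女", age="青年", pitch="高音调", dialect="四川话")
```

Here `english.to_instruct()` reproduces the English example above, and `chinese.to_instruct(chinese=True)` reproduces the Chinese one. The resulting string would then be passed as the instruct argument of `model.generate`.
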
Tips

Combine freely across categories: "male, elderly, low pitch, whisper".
Leave it to the model: omit attributes you don't care about, and the model fills in the rest. For example, "female" alone is valid.
Case-insensitive: "Male", "MALE", and "male" are all accepted; the code normalizes them to lowercase.
Accent vs Dialect: English accents are applied only to English speech; Chinese dialects are applied only to Chinese speech.
Attribute combinations: Due to training data limitations, some attribute combinations may not work well, and the model may ignore certain attributes in a combination. If the output doesn't match your expectations, try simplifying the instruct string.
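
One way to simplify an instruct string systematically is to generate progressively shorter variants and try them in order. The sketch below is only an illustration: `simplification_candidates` is a hypothetical helper, and dropping attributes from the end assumes that the string lists the attributes you care about most first.

```python
from __future__ import annotations

import re


def simplification_candidates(instruct: str) -> list[str]:
    """Return the full string followed by versions with trailing attributes dropped."""
    attrs = [a.strip() for a in re.split(r"[,，]", instruct) if a.strip()]
    # Longest first, so the closest match to the original request is tried first.
    return [", ".join(attrs[:n]) for n in range(len(attrs), 0, -1)]


candidates = simplification_candidates("male, elderly, low pitch, whisper")
```

Each candidate can be passed to `model.generate` in turn until the output matches what you want. Whether a given output is acceptable still has to be judged by listening.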
