
Qwen3-TTS-12Hz-0.6B-Base fine-tuning fails due to embedding dimension mismatch (2048 vs 1024) #198

@shubhamragade

Description

@shubhamragade

Description

Hi team,

Thank you for open-sourcing Qwen3-TTS.

I am following the official fine-tuning guide from the repository for single-speaker SFT.

Data preparation using prepare_data.py works correctly and audio_codes are generated successfully.

However, when starting training with the 0.6B Base checkpoint, the process crashes with an embedding dimension mismatch error.

It appears that the training script assumes the hidden size of the 1.7B model (2048), while the 0.6B model uses a smaller hidden size (1024), which matches the shapes in the error below.

Could you please clarify whether fine-tuning for the 0.6B Base model is currently supported with the provided scripts?
If yes, is there a different configuration or branch we should use?

Thank you!

Reproduction

pip install -U qwen-tts
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/finetuning

python sft_12hz.py \
    --init_model_path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --output_model_path output \
    --train_jsonl train_with_codes.jsonl \
    --batch_size 6 \
    --lr 1e-5 \
    --num_epochs 1 \
    --speaker_name test

Logs

RuntimeError: Shapes are not compatible for broadcasting:
bf16[*,*,2048] vs bf16[*,*,1024]

Environment Information

  • OS: Google Colab
  • Python: 3.12
  • GPU: A100
  • CUDA: default Colab runtime
  • qwen-tts: latest from pip
  • Repository: latest main branch
  • dtype: bfloat16

Known Issue

  • This issue has not already been addressed in the Documentation, Issues, or Discussions.
