Description
Hi team,
Thank you for open-sourcing Qwen3-TTS.
I am following the official fine-tuning guide from the repository for single-speaker SFT.
Data preparation using prepare_data.py works correctly and audio_codes are generated successfully.
However, when starting training with the 0.6B Base checkpoint, the process crashes with an embedding dimension mismatch error.
It appears that the training script may assume the 1.7B model's hidden size (2048), while the 0.6B model uses 1024.
Could you please clarify whether fine-tuning for the 0.6B Base model is currently supported with the provided scripts?
If yes, is there a different configuration or branch we should use?
Thank you!
Reproduction
pip install -U qwen-tts
git clone https://github.com/QwenLM/Qwen3-TTS.git
cd Qwen3-TTS/finetuning
python sft_12hz.py \
    --init_model_path Qwen/Qwen3-TTS-12Hz-0.6B-Base \
    --output_model_path output \
    --train_jsonl train_with_codes.jsonl \
    --batch_size 6 \
    --lr 1e-5 \
    --num_epochs 1 \
    --speaker_name test
Logs
RuntimeError: Shapes are not compatible for broadcasting:
bf16[*,*,2048] vs bf16[*,*,1024]
Environment Information
- OS: Google Colab
- Python: 3.12
- GPU: A100
- CUDA: default Colab runtime
- qwen-tts: latest from pip
- Repository: latest main branch
- dtype: bfloat16
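For reference, the shape pair in the log (trailing dims 2048 vs 1024, neither equal nor 1) cannot broadcast under standard broadcasting rules. A minimal NumPy sketch of the same class of failure (illustrative only; the actual error is raised by the model runtime, not NumPy, and the batch/sequence dims here are arbitrary):

```python
import numpy as np

# Shapes taken from the error log: the last dims (2048 vs 1024)
# differ and neither is 1, so broadcasting must fail.
act_2048 = np.zeros((1, 8, 2048), dtype=np.float32)  # 1.7B-sized hidden state
act_1024 = np.zeros((1, 8, 1024), dtype=np.float32)  # 0.6B-sized hidden state

try:
    _ = act_2048 + act_1024
except ValueError as e:
    print("broadcast error:", e)
```

This is consistent with the script building 2048-dim tensors against the 0.6B checkpoint's 1024-dim embeddings.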
Known Issue
- The issue hasn't already been addressed in the Documentation, Issues, or Discussions.