Finetuning Base results in progressively faster speech with every epoch #179

@dragonfyre13

Description

See pull request #178

Now that finetuning is basically working (given the change to sft_12hz.py in commit 680d4e9), it seems there are only a few things left to hammer out. This is the big one for me personally; without it, finetuning isn't fully functional IMHO.

Each successive epoch of finetuning skews how fast the generated audio is: later checkpoints produce progressively faster speech.

Reproduction

Run a normal finetuning job on a few hundred audio files, setting --num_epochs to 20:

python ./finetuning/sft_12hz.py --init_model_path ./Qwen3-TTS-12Hz-1.7B-Base --output_model_path output --train_jsonl output-with-codes.jsonl --num_epochs 20

Then run the following to perform basic inference using each of the checkpoints:

import torch
import soundfile as sf
from qwen_tts import Qwen3TTSModel

for i in range(20):  # one checkpoint per epoch (--num_epochs 20)
    model = Qwen3TTSModel.from_pretrained(
        f"output/checkpoint-epoch-{i}",
        device_map="cuda:0", dtype=torch.bfloat16, attn_implementation="flash_attention_2",
    )
    wavs, sr = model.generate_custom_voice(
        text="She said she would be here by noon.", language="english", speaker="speaker_test"
    )
    sf.write(f"output/output_file_{i}.wav", wavs[0], sr)
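
To quantify the speed-up instead of judging it by ear, a minimal sketch like the following (assuming the files were written by the loop above) compares the duration of each checkpoint's output for the same sentence:

import soundfile as sf

baseline = None
for i in range(20):
    wav, sr = sf.read(f"output/output_file_{i}.wav")
    seconds = len(wav) / sr  # duration of the same sentence at this checkpoint
    if baseline is None:
        baseline = seconds  # epoch 0 as the reference duration
    print(f"epoch {i:2d}: {seconds:5.2f}s ({seconds / baseline:.0%} of epoch 0)")

If the speech is getting faster, the durations should shrink steadily from one checkpoint to the next.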

Listen to each checkpoint to hear the difference (a few example outputs from before pull request #178, which resolves this, are attached):
output_file_5.wav
output_file_7.wav
output_file_10.wav
output_file_12.wav
output_file_15.wav
output_file_19.wav

Logs

Environment Information

Tested on Debian 12 with NVIDIA driver 535.129.03, CUDA 12.2, and an RTX 3090.

Known Issue

  • The issue hasn't already been addressed in the Documentation, Issues, or Discussions.
