Hello! I preprocessed the wavs in my dataset to extract features for training EA-SVC. Specifically, I extracted the following features:
- PPG from the hidden state of a model trained on the TIMIT dataset (768-dim)
- f0 from WORLD, using pyworld directly (1-dim; zeros in f0 are left unprocessed)
- speaker embeddings extracted with pyannote.audio
I trained the first two stages (i.e. without adversarial generator training, then with it) on both LibriSpeech dev-clean and NUS-48E singing data. The disentanglement loss was not used in these experiments. For the 1st stage, loss_g (g_mag + g_sc) is about 1.0; for the 2nd, loss_g increased to 5.0 (g_mag + g_sc + g_adv + g_feat) and loss_d is about 3.0e-01 (d_real + d_fake). The model was not trained for the 3rd stage. Results are nearly identical across both datasets.
Since the generated audio sounds poor at both stages, I wonder whether I made a mistake somewhere in the training process. I hope the loss values above give a clearer picture of the situation.
P.S. The stage numbers refer to these parameter settings in the config:
1. `"adv_ag": false, "adv_fd": false`
2. `"adv_ag": true, "adv_fd": false`
3. `"adv_ag": true, "adv_fd": true`