Hello! I preprocessed the wavs in my dataset to extract features for training EA-SVC. Specifically, I extracted the following features:
- PPG from the hidden state of a model trained on the TIMIT dataset (768-dim)
- f0 from WORLD, using pyworld directly (1-dim; zeros in f0 are left unprocessed)
- speaker embeddings extracted with pyannote.audio
I trained the first two stages (i.e. without adversarial generator training, then with it) on both LibriSpeech dev-clean and NUS-48E singing data. The disentanglement loss was not used in these experiments. For the 1st stage, loss_g (g_mag + g_sc) is about 1.0; for the 2nd, loss_g increased to 5.0 (g_mag + g_sc + g_adv + g_feat) and loss_d is about 3.0e-01 (d_real + d_fake). The model was not trained for the 3rd stage. Results are nearly identical across both datasets.
Since the generated audio sounds poor at both stages, I wonder whether I made a mistake somewhere in the training process. I hope the loss values above give a clearer picture of the situation.
P.S. The stage numbers refer to these parameter settings in the config:
1. `"adv_ag": false, "adv_fd": false`
2. `"adv_ag": true, "adv_fd": false`
3. `"adv_ag": true, "adv_fd": true`