
Bad quality of generated speech after training #5

@SolomidHero

Description

Hello! I did some preprocessing to extract features from the wavs in the dataset for training EA-SVC. Specifically, I extract the following features:

  • PPG from the hidden state of a model trained on the TIMIT dataset (768-dim)
  • f0 with WORLD via direct use of pyworld (1-dim; zeros in f0 are left unprocessed)
  • speaker embeddings using pyannote.audio
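Since the zeros in f0 are left unprocessed, one thing worth checking is whether the model expects a continuous f0 contour. A common SVC preprocessing step is to fill the unvoiced (zero) frames by linear interpolation between neighbouring voiced frames. Here is a minimal numpy sketch of that idea; the function name is illustrative, not from the EA-SVC repo:

```python
import numpy as np

def interpolate_f0(f0: np.ndarray) -> np.ndarray:
    """Replace zero (unvoiced) frames with values linearly
    interpolated from the neighbouring voiced frames."""
    voiced = f0 > 0
    if not voiced.any():
        # Fully unvoiced contour: nothing to interpolate from.
        return f0.copy()
    idx = np.arange(len(f0))
    # np.interp clamps to the edge values, so leading/trailing
    # unvoiced runs take the nearest voiced value.
    return np.interp(idx, idx[voiced], f0[voiced])

# Example: frames 0, 2 and 5 are unvoiced (f0 == 0).
f0 = np.array([0.0, 100.0, 0.0, 120.0, 130.0, 0.0])
print(interpolate_f0(f0))  # [100. 100. 110. 120. 130. 130.]
```

Whether this helps depends on how the generator conditions on f0; some pipelines instead pass a separate voiced/unvoiced flag alongside the interpolated contour.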

I tried training the first 2 stages (i.e. without adversarial generator training, and then with it) on both LibriSpeech dev-clean and NUS48E singing data. The disentanglement loss wasn't used in these experiments. In the 1st stage, loss_g (g_mag + g_sc) is about 1.0; in the 2nd, loss_g increases to 5.0 (g_mag + g_sc + g_adv + g_feat) and loss_d is about 3.0e-01 (d_real + d_fake). The model wasn't trained for the 3rd stage. The results of both dataset experiments are about the same.

Because the generated audio in both stages is not good, I wonder whether I made a mistake in the training process or elsewhere. I hope the loss values above give you a better view of the situation.

P.S. The stage numbers refer to these parameters in the config:

  1. "adv_ag": false, "adv_fd": false
  2. "adv_ag": true, "adv_fd": false
  3. "adv_ag": true, "adv_fd": true
