Implementation of MelNet: A Generative Model for Audio in the Frequency Domain (Work in progress)
- Tested with Python 3.6.8 & 3.7.4, PyTorch 1.2.0 & 1.3.0.
- pip install -r requirements.txt
- YAML files for Blizzard, VoxCeleb2, and KSS are provided under config/. For other datasets, write your own YAML file modeled on the provided ones.
- Unconditional training is possible for all kinds of datasets, provided that they have a consistent file extension specified by data.extension within the YAML file.
- Conditional training is currently only implemented for KSS and a subset of the Blizzard dataset.
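As a quick sanity check before unconditional training, the consistent-extension requirement above can be verified with a few lines of Python. This is only a sketch: find_audio_files is a hypothetical helper, not part of this repository, and because the exact format expected for data.extension is not stated here, it accepts "wav", ".wav", or "*.wav".

```python
from pathlib import Path


def find_audio_files(root, extension):
    """Recursively list files under `root` whose suffix matches `extension`.

    Hypothetical helper for checking a dataset folder; the repository's own
    data loader may collect files differently.
    """
    # Normalize "wav", ".wav", and "*.wav" to ".wav".
    suffix = "." + extension.lstrip("*.").lower()
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == suffix
    )


def count_other_files(root, extension):
    """Count files under `root` that do NOT match the extension."""
    all_files = [p for p in Path(root).rglob("*") if p.is_file()]
    return len(all_files) - len(find_audio_files(root, extension))
```

If count_other_files returns a non-zero value, the folder mixes file types and may be worth cleaning up before training.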
- python trainer.py -c [config YAML file path] -n [name of run] -t [tier number] -b [batch size] -s [TTS]
- Each tier can be trained separately. Since each tier is larger than the one before it (with the exception of tier 1), modify the batch size for each tier.
- Tier 6 of the Blizzard dataset does not fit on a 16GB P100, even with a batch size of 1.
- The -s flag is a boolean that determines whether to train a TTS tier. Since a TTS tier only differs at tier 1, this flag is ignored when [tier number] != 1. Warning: this flag is set to True no matter what value follows it, so omit it entirely if you do not plan to use it.
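Because every tier is trained by a separate invocation with its own batch size, it can help to generate the commands from a small table. The sketch below only builds and prints the command lines. The batch sizes, the config path config/my_dataset.yaml, and the run names are hypothetical placeholders, and passing -s True assumes the flag takes a value, as the usage line suggests.

```python
import shlex

# Hypothetical batch sizes (not taken from the repository): later tiers are
# larger, so they get smaller batches. Tune these to your GPU memory.
BATCH_SIZES = {1: 16, 2: 16, 3: 8, 4: 4, 5: 2, 6: 1}


def build_train_command(config, name, tier, batch_size, tts=False):
    """Build the argument list for one trainer.py run."""
    cmd = [
        "python", "trainer.py",
        "-c", config,
        "-n", name,
        "-t", str(tier),
        "-b", str(batch_size),
    ]
    if tts:
        # Only add -s when TTS training is wanted: per the warning above,
        # any value after the flag is treated as True.
        cmd += ["-s", "True"]
    return cmd


for tier, batch_size in sorted(BATCH_SIZES.items()):
    cmd = build_train_command(
        "config/my_dataset.yaml",  # hypothetical config path
        "my_run_t%d" % tier,       # hypothetical run name
        tier,
        batch_size,
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

To actually launch a run, each list can be passed to subprocess.run(cmd, check=True) from the repository root.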
- The checkpoints must be stored under chkpt/.
- A YAML file named inference.yaml must be provided under config/.
- inference.yaml must specify the number of tiers, the names of the checkpoints, and whether or not the generation is conditional.
- python inference.py -c [config YAML file path] -p [inference YAML file path] -t [timestep of generated mel spectrogram] -n [name of sample] -i [input sentence for conditional generation]
- Timestep refers to the length of the mel spectrogram. The ratio of timestep to seconds is roughly [sample rate] : [hop length of FFT], i.e. one second of audio corresponds to about [sample rate] / [hop length of FFT] timesteps.
- The -i flag is optional and only needed for conditional generation. Surround the sentence with double quotes ("") and end it with a period (.).
- Neither unconditional nor conditional generation currently supports primed generation (extrapolating from provided data).
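To choose a value for -t, the timestep-to-seconds relation can be computed directly. The following is a minimal sketch; the function names are hypothetical, and the sample rate and hop length in the example are placeholder values that should be replaced by the ones in your config YAML file.

```python
def seconds_to_timesteps(seconds, sample_rate, hop_length):
    """Approximate number of mel spectrogram frames for a given duration."""
    return int(round(seconds * sample_rate / hop_length))


def timesteps_to_seconds(timesteps, sample_rate, hop_length):
    """Approximate duration in seconds of a mel spectrogram of given length."""
    return timesteps * hop_length / sample_rate


# Placeholder audio settings (assumptions, not read from any config here).
example_sample_rate = 22050
example_hop_length = 256

# Roughly how many timesteps to request for a 5-second sample.
print(seconds_to_timesteps(5, example_sample_rate, example_hop_length))
```

The relation is approximate because framing and padding at the edges of the signal can add or drop a frame.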
- Implement upsampling procedure
- GMM sampling + loss function
- Unconditional audio generation
- TTS synthesis
- Tensorboard logging
- Multi-GPU training
- Primed generation
- Seungwon Park, June Young Yi, Yoonhyung Lee, Joowhan Song @ Deepest Season 6
MIT License