This directory provides a script and recipe for running FastSpeech2 models on Graphcore IPUs. The model is based on the original paper, FastSpeech 2: Fast and High-Quality End-to-End Text to Speech, with a few changes as noted in TensorFlowTTS.
| File/folder | Description |
|---|---|
| `config/` | Model configuration files |
| `custom_op/` | Implementation of the `LengthRegulator` custom op |
| `preprocessor/` | Dataset preparation for training |
| `tests/` | Unit tests |
| `dataloader.py` | Dataset loading and post-processing |
| `fastspeech2.py` | FastSpeech2 model implementation |
| `optimizer.py` | AdamW optimizer and warmup scheduler implementation |
| `options.py` | Command line options for model configuration and IPU-specific options |
| `train.py` | Training process |
| `utils.py` | Utilities for callbacks, IPU configuration and other helpers |
| `ckpt_utils.py` | Utilities for parsing and mapping saved `*.ckpt` or `*.h5` checkpoints |
Download and install the Poplar SDK following the Getting Started guide for your IPU system, then source the `enable.sh` script for Poplar.
Create a virtual environment and install the appropriate Graphcore TensorFlow 2 wheel from inside the SDK directory:

```bash
virtualenv --python python3 fs2_venv
source fs2_venv/bin/activate
pip install -r requirements.txt
pip install <path to the tensorflow-2 wheel from the Poplar SDK>
```
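To verify that the Graphcore wheel (rather than stock TensorFlow) is active in the environment, a quick sanity check like the following can help. This is a sketch that assumes a recent SDK where the `ipu.config.IPUConfig` API is available:

```python
# Sanity check (a sketch): the `ipu` namespace is only present in Graphcore's
# TensorFlow 2 builds, so this import fails on a stock TensorFlow install.
import tensorflow as tf
from tensorflow.python import ipu

print(tf.__version__)
cfg = ipu.config.IPUConfig()  # constructing an IPUConfig confirms the IPU API loads
```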
We implement the `LengthRegulator` custom op to avoid dynamic operations that are not supported on the IPU. Make sure `MAX_TOTAL_SIZE` in `custom_op/length_regulator/poplar_code.cpp` matches `max_wave_length` in the config JSON file. Then run:

```bash
cd custom_op/length_regulator
make
```
Note that the Poplar SDK needs to be sourced for the compilation to work.
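Since a mismatch between the two values only surfaces at runtime, you may want to check them before compiling. Below is a small helper sketch (not part of the repo); it assumes `MAX_TOTAL_SIZE` is defined with an integer literal and that `max_wave_length` is a top-level key in `config/fastspeech2.json`:

```python
# Sketch: verify MAX_TOTAL_SIZE in the custom op source matches
# max_wave_length in the model config before running make.
import json
import re

with open("custom_op/length_regulator/poplar_code.cpp") as f:
    # Assumes MAX_TOTAL_SIZE is followed by an integer literal.
    cpp_value = int(re.search(r"MAX_TOTAL_SIZE\D*(\d+)", f.read()).group(1))

with open("config/fastspeech2.json") as f:
    # Assumes max_wave_length is a top-level key in the config JSON.
    json_value = json.load(f)["max_wave_length"]

assert cpp_value == json_value, (
    f"MAX_TOTAL_SIZE ({cpp_value}) != max_wave_length ({json_value})"
)
```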
The data used for this example is LJSpeech; you can download the dataset from https://keithito.com/LJ-Speech-Dataset/. The whole dataset tarball is about 2.6GB.
- Extract the dataset to a specific folder:

```bash
mkdir -p /path/to/unzipped_dataset
tar -xvf /path/to/LJSpeech-1.1.tar.bz2 -C /path/to/unzipped_dataset
```
- Generate the training and validation datasets. Go into the `preprocessor` folder and run:

```bash
python3 audio.py \
    --config ljspeech_preprocess.yaml \
    --rootdir /path/to/unzipped_dataset \
    --outdir /path/to/preprocessed_dataset \
    --n_cpus 4 \
    --test_size 0.05 \
    --seed 1234
```
It will take about 10 minutes to generate the datasets; increase `--n_cpus` to speed up the process. You should then see the following files/folders under `/path/to/preprocessed_dataset`:
| Files/folders | Description |
|---|---|
| `vocab.txt` | Vocabulary of the LJSpeech dataset, with one character, phoneme or punctuation symbol per line. |
| `ljspeech_mapper.json` | Records the mapping between IDs and symbols. |
| `length.json` | Records the maximum sequence length, maximum mel-spectrogram length and vocab size of the dataset. |
| `stats{,_f0,_energy}.npy` | Contains the mean and std of the training-split mel-spectrogram/f0/energy data. |
| `{train,valid}_utt_ids.npy` | Contains the training/validation utterance IDs respectively. |
| `train/` and `valid/` | Pre-computed features of the training/validation data. `ids/`: IDs of the symbols in the text. `norm-feats/`: normalized mel-spectrogram features. `raw-feats/`: raw mel-spectrogram features. `raw-f0/`: pitch features. `raw-energies/`: energy features. `wavs/`: waveforms. |

The processed dataset is about 13GB.
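If you want to inspect the preprocessed output, a minimal sketch like the following can load some of it. It assumes `stats.npy` stores the mel-spectrogram mean and standard deviation stacked along the first axis, which is the usual TensorFlowTTS convention:

```python
import numpy as np

# Mean/std of the training-split mel-spectrograms (assumed stacked as [mean, std]).
mean, std = np.load("/path/to/preprocessed_dataset/stats.npy")
print(f"num mel bins: {mean.shape[0]}")

# Utterance IDs for the training split.
train_ids = np.load("/path/to/preprocessed_dataset/train_utt_ids.npy")
print(f"{len(train_ids)} training utterances")
```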
FastSpeech2 also needs duration data. We use the pre-trained duration alignments provided by Anh Minh Nguyen Quan on Google Drive instead of training an aligner from scratch. However, the duration data does not match our train/valid split, so you need to relocate the files into the corresponding folders according to their utterance IDs.
```bash
# unzip the downloaded duration datasets
unzip /path/to/downloaded_duration -d /path/to/unzipped_dataset
cd /path/to/unzipped_dataset && mv train/* valid/* . && rm -r train/ valid/
python3 relocate_duration.py --root-path /path/to/preprocessed_dataset --duration-path /path/to/unzipped_dataset
```
After relocation, you will see a `duration` folder under both the `train/` and `valid/` directories.
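For reference, the relocation amounts to moving each duration file into the split its utterance ID belongs to. The sketch below illustrates this under the assumption that duration files are named `<utt_id>-durations.npy` (the TensorFlowTTS convention); `relocate_duration.py` is the authoritative implementation:

```python
# Sketch of the relocation logic; assumes <utt_id>-durations.npy naming.
import shutil
from pathlib import Path

import numpy as np

root = Path("/path/to/preprocessed_dataset")
duration_dir = Path("/path/to/unzipped_dataset")

for split in ("train", "valid"):
    # Utterance IDs belonging to this split, written by the preprocessor.
    utt_ids = np.load(root / f"{split}_utt_ids.npy")
    out_dir = root / split / "duration"
    out_dir.mkdir(parents=True, exist_ok=True)
    for utt_id in utt_ids:
        src = duration_dir / f"{utt_id}-durations.npy"  # assumed naming scheme
        if src.exists():
            shutil.move(str(src), str(out_dir / src.name))
```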
Now that the data is ready, we can start training our FastSpeech2 model on the IPU! Run this script:

```bash
python3 train.py --config config/fastspeech2.json \
    --data-path /path/to/preprocessed_dataset \
    --train
```
All command line options can be found in `options.py`. For example, you can specify `--wandb true` and `--wandb-name project_name` to monitor the training process with Weights & Biases, or use `--generated-data` to feed generated data if you don't have the preprocessed dataset.

**[NOTE]** Options in the configuration JSON file will be overridden by command line options.
If you want to generate profiles, you can try:
```bash
export POPLAR_ENGINE_OPTIONS='{"autoReport.directory":"./profiles","debug.allowOutOfMemory": "true", "autoReport.outputSerializedGraph": "false", "debug.outputAllSymbols": "true", "autoReport.all": "true"}'
python3 train.py --config config/fastspeech2.json \
    --train \
    --data-path /path/to/preprocessed_dataset \
    --batch-size 1 \
    --gradient-accumulation-count 4 \
    --epochs 1
```

All profiles will then be written to the `./profiles` folder, and you can use the PopVision tools for deeper analysis.
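Equivalently, you can set the engine options from Python before any IPU compilation takes place, which avoids shell quoting issues. A minimal sketch:

```python
import json
import os

# Must be set before the first IPU graph is compiled.
os.environ["POPLAR_ENGINE_OPTIONS"] = json.dumps({
    "autoReport.directory": "./profiles",
    "autoReport.all": "true",
    "debug.allowOutOfMemory": "true",
})
```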
The code presented here is licensed under the Apache License Version 2.0, see the LICENSE file in this directory.
This directory includes derived work from the following:
TensorFlowTTS, https://github.com/TensorSpeech/TensorFlowTTS
Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0 Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.
- TensorFlowTTS: https://github.com/TensorSpeech/TensorFlowTTS
- FastSpeech 2: https://arxiv.org/abs/2006.04558