This project was carried out as an individual project for Ewha Womans University's Challenge Semester (도전학기제) program in the second semester of 2024.
Implementation of Korean SV2TTS
Based on CorentinJ/Real-Time-Voice-Cloning and esoyeon/KoreanTTS
- GPU: NVIDIA RTX A6000
- CUDA version: 10.1.243
You can use the pretrained weights (Korean) for each component; they are located in the /weights folder.
It is highly recommended to create a new Anaconda environment and run pip install -r requirements.txt.
Before running the code, modify the paths to match your project directory.
- Input: log mel-spectrogram (from raw speech data w/ arbitrary length)
- Output: speaker embedding w/ fixed dimension
- Action: Learn a speaker embedding via a speaker-verification task (see the sketch below)
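A minimal sketch of this mapping, assuming a GE2E-style LSTM encoder as in CorentinJ's implementation (the layer sizes here are illustrative, not necessarily the trained configuration):

```python
import torch
import torch.nn as nn

class SpeakerEncoder(nn.Module):
    """Maps a log mel-spectrogram of arbitrary length to a fixed-size embedding."""
    def __init__(self, n_mels=40, hidden_size=256, embed_size=256, num_layers=3):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden_size, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_size, embed_size)

    def forward(self, mels):
        # mels: (batch, frames, n_mels); the frame count may vary per batch
        _, (hidden, _) = self.lstm(mels)
        embed = torch.relu(self.proj(hidden[-1]))      # last layer's final state
        # L2-normalize so speaker similarity reduces to a dot product
        return embed / embed.norm(dim=1, keepdim=True)
```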
The dataset used is a subset of the AIHub 화자 인식용 음성 데이터 (AIHub Speaker Recognition Speech Data), with a total size of 98 GB.
The 'command' type has been excluded from the dataset. Speakers with fewer than 8 samples have been excluded. As a result, the dataset contains preprocessed data from a total of 2,535 speakers.
First, the raw dataset has been reorganized into a speaker-specific folder structure. Each folder is named after an individual speaker and contains their respective audio samples.
$ python speaker_folder.py
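For reference, a sketch of what this reorganization might look like (the actual speaker_folder.py may differ; the filename convention `<speakerID>_<utteranceID>.wav` is a hypothetical assumption):

```python
import shutil
from pathlib import Path

RAW_DIR = Path("datasets/raw")        # adjust to your project directory
OUT_DIR = Path("datasets/speakers")

for wav in RAW_DIR.rglob("*.wav"):
    # Hypothetical convention: <speakerID>_<utteranceID>.wav
    speaker_id = wav.stem.split("_")[0]
    dest = OUT_DIR / speaker_id
    dest.mkdir(parents=True, exist_ok=True)
    shutil.copy2(wav, dest / wav.name)    # one folder per speaker
```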
Next, load and preprocess the audio files (e.g., generating mel spectrograms) while filtering out short or corrupted files.
$ python encoder_preprocess.py
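The core of that step looks roughly like this (a sketch using librosa; the 16 kHz / 40-mel settings follow the upstream repo, and the 1.6 s minimum length is an illustrative threshold):

```python
import librosa
import numpy as np

def wav_to_log_mel(path, sr=16000, n_mels=40, min_seconds=1.6):
    """Return a log mel-spectrogram, or None for short/corrupted files."""
    try:
        wav, _ = librosa.load(path, sr=sr)
    except Exception:
        return None                                  # skip corrupted files
    if len(wav) < min_seconds * sr:
        return None                                  # skip too-short utterances
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=400, hop_length=160, n_mels=n_mels)
    return np.log(mel + 1e-6).T                      # (frames, n_mels)
```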
Then train the encoder.
$ python encoder_train.py
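SV2TTS trains the encoder with the GE2E loss (Wan et al., 2018). Below is a simplified softmax-variant sketch, assuming a batch of already L2-normalized embeddings; the full loss also excludes each utterance from its own speaker centroid, which is omitted here for brevity:

```python
import torch
import torch.nn.functional as F

def ge2e_softmax_loss(embeds, w, b):
    """embeds: (n_speakers, n_utts, dim); w, b: learnable scaling scalars."""
    n_speakers, n_utts, dim = embeds.shape
    centroids = F.normalize(embeds.mean(dim=1), dim=1)   # (n_speakers, dim)
    sim = embeds.reshape(-1, dim) @ centroids.T          # cosine similarity matrix
    sim = w * sim + b                                    # learned affine scaling
    # Each utterance should be most similar to its own speaker's centroid
    target = torch.arange(n_speakers).repeat_interleave(n_utts)
    return F.cross_entropy(sim, target)
```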
- Input: grapheme or phoneme sequence
- Output: log-mel spectrogram
- Action: Text-to-spectrogram synthesis for the target speaker (see the sketch below)
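Korean text is commonly decomposed into jamo before being mapped to symbol IDs; whether this project uses jamo or syllable-level graphemes is left open above, so the following is a self-contained sketch of the standard Unicode decomposition:

```python
# Decompose precomposed Hangul syllables (U+AC00-U+D7A3) into jamo.
LEADS = [chr(0x1100 + i) for i in range(19)]
VOWELS = [chr(0x1161 + i) for i in range(21)]
TAILS = [""] + [chr(0x11A8 + i) for i in range(27)]

def to_jamo(text):
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:                       # inside the syllable block
            out.append(LEADS[code // 588])          # initial consonant
            out.append(VOWELS[(code % 588) // 28])  # medial vowel
            if code % 28:
                out.append(TAILS[code % 28])        # optional final consonant
        else:
            out.append(ch)                          # pass through non-Hangul
    return out

print(to_jamo("안녕"))  # ['ᄋ', 'ᅡ', 'ᆫ', 'ᄂ', 'ᅧ', 'ᆼ']
```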
The dataset used is a subset of the AIHub 다화자 음성합성 데이터 (AIHub Multi-Speaker Speech Synthesis Data). The ZIP files were randomly selected and downloaded; the total raw data size is 704 GB across 791 speakers.
Run the code below to organize the data into speaker-specific folders and extract the Korean transcripts.
The extracted transcription labels (translabels) are saved as .txt files derived from the raw label files.
$ python korean_dataset.py
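The extraction step is conceptually simple (a sketch; the JSON layout and the "transcription" key below are assumptions, as the actual AIHub label schema differs per dataset):

```python
import json
from pathlib import Path

LABEL_DIR = Path("datasets/labels")   # adjust to your project directory

for label_path in LABEL_DIR.rglob("*.json"):
    with open(label_path, encoding="utf-8") as f:
        label = json.load(f)
    # Hypothetical key -- replace with the actual field in the AIHub schema
    transcript = label["transcription"]
    label_path.with_suffix(".txt").write_text(transcript, encoding="utf-8")
```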
Run the code below to normalize the Korean translabels.
$ python korean_normalize.py
Preprocessing steps (a condensed sketch follows the list):
- Remove leading and trailing spaces (using strip())
- Remove characters such as ⺀-⺙, ⺛-⻳, 〠-⿕, 々, and 〇
- Normalize using a Korean dictionary
- Normalize English text (convert to uppercase)
- Normalize quotation marks
- Normalize numbers
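A condensed sketch of these steps (dictionary-based normalization and number spelling are omitted; the real korean_normalize.py implements the full list):

```python
import re

# CJK radical ranges from the list above (shown partially) plus 々 and 〇
_CJK_RE = re.compile(r"[\u2E80-\u2E99\u2E9B-\u2EF3\u3005\u3007]")
_QUOTES = {"“": '"', "”": '"', "‘": "'", "’": "'"}

def normalize_translabel(text):
    text = text.strip()                              # leading/trailing spaces
    text = _CJK_RE.sub("", text)                     # remove CJK radicals etc.
    for curly, plain in _QUOTES.items():
        text = text.replace(curly, plain)            # normalize quotation marks
    text = re.sub(r"[a-z]+", lambda m: m.group().upper(), text)  # uppercase English
    return text
```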
Final Dataset:
- Data used from 202 out of 791 speakers
- Total of 633,517 samples used
Then, preprocess the audio files from the dataset, encode them as mel spectrograms, and write them to disk. The audio files are also saved, to be used later for vocoder training.
$ python synthesizer_preprocess_audio.py
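In essence, each utterance is persisted twice: the mel spectrogram for synthesizer training and the processed audio for later vocoder training (a sketch; the directory layout is illustrative):

```python
import numpy as np
from pathlib import Path

OUT = Path("datasets/synthesizer")    # illustrative layout

def write_utterance(name, wav, mel):
    """Save the mel for the synthesizer and the wav for the vocoder."""
    (OUT / "mels").mkdir(parents=True, exist_ok=True)
    (OUT / "audio").mkdir(parents=True, exist_ok=True)
    np.save(OUT / "mels" / f"mel-{name}.npy", mel)
    np.save(OUT / "audio" / f"audio-{name}.npy", wav)
```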
Next, create speaker embeddings for the synthesizer from the preprocessed utterances.
$ python synthesizer_preprocess_embeds.py
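Conceptually, this runs the trained encoder over every preprocessed utterance and stores one embedding per utterance (a sketch; the module layout and weight path follow CorentinJ's repo and may differ here):

```python
import numpy as np
from pathlib import Path
from encoder import inference as encoder   # module layout as in CorentinJ's repo

encoder.load_model(Path("weights/encoder.pt"))       # path is illustrative

embeds_dir = Path("datasets/synthesizer/embeds")
embeds_dir.mkdir(parents=True, exist_ok=True)

for wav_path in Path("datasets/synthesizer/audio").rglob("*.npy"):
    wav = np.load(wav_path)                          # preprocessed waveform
    embed = encoder.embed_utterance(encoder.preprocess_wav(wav))
    np.save(embeds_dir / wav_path.name, embed)
```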
Then train the synthesizer; it was run with the no_alignment setting.
$ python synthesizer_train.py
- Input: log-mel spectrogram
- Output: time-domain waveform
- Action: Convert the mel-spectrogram into a waveform (see the sketch below)
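For intuition, the mel-to-waveform mapping can be approximated classically with Griffin-Lim; the trained vocoder replaces this with a learned autoregressive model (librosa sketch, parameters illustrative and assuming natural-log power mels):

```python
import librosa
import numpy as np

def mel_to_wav_griffinlim(log_mel, sr=16000, n_fft=800, hop_length=200):
    """Classical baseline for mel -> waveform; WaveRNN learns this mapping instead."""
    mel = np.exp(log_mel)                       # undo log compression
    return librosa.feature.inverse.mel_to_audio(
        mel, sr=sr, n_fft=n_fft, hop_length=hop_length)
```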
This script generates ground truth-aligned mel-spectrograms (GTA mels) for vocoder training by synthesizing them with a pre-trained Tacotron model. It processes a dataset, removes padding from the generated mels, and saves the results along with metadata.
$ python vocoder_preprocess.py
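The padding-removal part is straightforward: batch synthesis pads every mel to the longest item in the batch, so each one is trimmed back to its true frame count (a sketch):

```python
import numpy as np

def unpad_gta_mels(padded_mels, frame_lengths):
    """Trim batch-padded GTA mels (n_mels, max_frames) to their true lengths."""
    return [mel[:, :n] for mel, n in zip(padded_mels, frame_lengths)]

# e.g. a batch padded to 500 frames whose true lengths were 312 and 471
mels = [np.zeros((80, 500)), np.zeros((80, 500))]
print([m.shape for m in unpad_gta_mels(mels, [312, 471])])  # [(80, 312), (80, 471)]
```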
Trains a WaveRNN vocoder model on mel-spectrograms to generate high-quality audio.
$ python vocoder_train.py
Prepare reference audio files before running the code. Here, my voice is used as a reference voice.
$ python demo_cli.py
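End to end, the demo chains the three components. A sketch of that pipeline (the module layout, weight paths, and filenames follow CorentinJ's repo and are assumptions for this project):

```python
import numpy as np
import soundfile as sf
from pathlib import Path
from encoder import inference as encoder
from synthesizer.inference import Synthesizer
from vocoder import inference as vocoder

# Weight paths are illustrative; see the /weights folder mentioned above.
encoder.load_model(Path("weights/encoder.pt"))
synthesizer = Synthesizer(Path("weights/synthesizer.pt"))
vocoder.load_model(Path("weights/vocoder.pt"))

ref_wav = encoder.preprocess_wav(Path("my_voice.wav"))   # reference recording
embed = encoder.embed_utterance(ref_wav)                 # fixed-size speaker embedding
specs = synthesizer.synthesize_spectrograms(["안녕하세요"], [embed])
wav = vocoder.infer_waveform(specs[0])                   # mel -> waveform
sf.write("cloned.wav", wav.astype(np.float32), Synthesizer.sample_rate)
```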
- AIHub 화자 인식용 음성 데이터(AIHub Speaker Recognition Speech Data)
- AIHub 다화자 음성합성 데이터(AIHub Multi-Speaker Speech Synthesis Data)
- CorentinJ/Real-Time-Voice-Cloning
- esoyeon/KoreanTTS
