- 
This is for speech recognition including models and train, evaluate, inference scripts based tensorflow 2 
- 
You can execute script examples on below descriptions with test data 
- 
resources/configsdirectory contains default datasets (LibriSpeech, KsponSpeech, Clovacall) and models (LAS, DeepSpeech2) configs.
- 
resources/sp-modelsdirectory contains default sentencepiece tokenizer for each datasets
- 
I trained LAS small model using LibriSpeech dataset. You can download pretrained model on release page 
Trained model performance is below.
| LibriSpeech dev-clean | LibriSpeech dev-other | |
|---|---|---|
| WER (Word Error Rate) | 9.35% | 24.53% | 
| CER (Character Error Rate) | 4.24% | 13.29% | 
- Listen, Attend and Spell
- On the Choice of Modeling Unit for Sequence-to-Sequence Speech Recognition
- SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition
- Dataset File is tsv(tab separated values) format.
- The dataset file should have header line.
- The 1st column is audio file path relative to directory that contains dataset tsv file.
- The 2nd column is recognized text.
- Refer to tests/data/dataset.tsvfile.
| FilePath | Text | 
|---|---|
| audio/001.wav | 안녕하세요 | 
| audio/002.wav | 반갑습니다 | 
| audio/003.wav | 근데 이름이 어떻게 되세요? | 
| ... | ... | 
- This is tsv file example.
You can start training by running script like below example.
$ python -m speech_recognition.run.train \
    --data-config resources/configs/libri_config.yml \
    --model-config resources/configs/las_small.yml \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --train-dataset-paths tests/data/wav_dataset.tsv \
    --dev-dataset-paths tests/data/wav_dataset.tsv \
    --train-dataset-size 1000 \
    --steps-per-epoch 100 \
    --epochs 10 \
    --batch-size 32 \
    --dev-batch-size 32 \
    --learning-rate 2e-4 \
    --mixed-precision \
    --device CPUYou can also start training with train configuration file using --from-file parameter.
$ python -m speech_recognition.run.train --from-file resources/configs/train_config_sample.ymlAnd you can override the parameter of file by command line arguments like below.
$ python -m speech_recognition.run.train \
    --from-file resources/configs/train_config_sample.yml \
    --epochs 1 \
    --batch-size 128 \
    --device GPU  --from-file FROM_FILE
                        load configs from file
  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --train-dataset-paths TRAIN_DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --dev-dataset-paths DEV_DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --train-dataset-size TRAIN_DATASET_SIZE
                        the number of training dataset examples
  --output-path OUTPUT_PATH
                        output directory to save log and model checkpoints
  --pretrained-model-path PRETRAINED_MODEL_PATH
                        pretrained model checkpoint
  --epochs EPOCHS
  --steps-per-epoch STEPS_PER_EPOCH
  --learning-rate LEARNING_RATE
  --min-learning-rate MIN_LEARNING_RATE
  --warmup-rate WARMUP_RATE
  --warmup-steps WARMUP_STEPS
  --batch-size BATCH_SIZE
  --dev-batch-size DEV_BATCH_SIZE
  --shuffle-buffer-size SHUFFLE_BUFFER_SIZE
                        shuffle buffer size
  --max-over-policy {filter,slice}
                        policy for sequence whose length is over max
  --use-tfrecord        use tfrecord dataset
  --tensorboard-update-freq TENSORBOARD_UPDATE_FREQ
  --mixed-precision     use mixed precision FP16
  --seed SEED           Set random seed
  --skip-epochs SKIP_EPOCHS
                        skip first N epochs and start N + 1 epoch
  --device {CPU,GPU,TPU}
                        device to use (TPU or GPU or CPU)
- data-configis config file path for data processing. example config is- resources/configs/libri_config.yml.
- model-configis config model file path for model initialize. default config is- resources/configs/las_small.yml.
- sp-model-pathis sentencepiece model path to tokenize target text.
- pretrained-model-pathis pretrained model checkpoint path if you continue to train from pretrained model.
- warmup-rateor- warmup-stepsspecify warmup steps. default is zero.- warmup-stepsis used if both of params provided.
- max-over-policyoption is for sequences whose length is over than max sequence. You can filter longer example or slice to fit length.
- use-tfrecordoption should be provided when using TFRecord format dataset.
- mixed-precisionoption is enabling FP16 mixed precision.
You can evaluate your trained model using evaluate.py script.
You'll get to know CER or WER as a result of evaluation like below example.
$ python -m speech_recognition.run.evaluate \
    --data-config resources/configs/libri_config.yml \
    --model-config tests/data/model-configs/las_mini_for_test.yml \
    --dataset-paths tests/data/wav_dataset.tsv \
    --model-path tests/data/model-checkpoints/las.ckpt \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --device CPU
...
[2021-06-07 13:22:48,599] [+] Load Tokenizer from resources/sp-models/sp_model_unigram_16K_libri.model
[2021-06-07 13:22:48,626] [+] Load Data Config from resources/configs/libri_config.yml
[2021-06-07 13:22:48,629] [+] Load dataset from tests/data/wav_dataset.tsv
2021-06-07 13:22:49.018137: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
[2021-06-07 13:22:49,662] [+] Use delta and deltas accelerate
[2021-06-07 13:22:53,122] [+] Load weights of model from tests/data/model-checkpoints/las.ckpt
Model: "las"
...
[2021-06-07 13:22:53,135] [+] Start Inference
2021-06-07 13:22:53.171394: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:22:53.188758: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
[2021-06-07 13:22:56,352] [+] Ended Inference
[2021-06-07 13:22:56,589] [+] Average WER: 2494.6429%
[2021-06-07 13:22:56,589] [+] Average CER: 7256.3131%  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --dataset-paths DATASET_PATHS
                        a tsv/tfrecord dataset file or multiple files ex)
                        *.tsv
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --output-path OUTPUT_PATH
                        output tsv file path to save generated sentences
  --batch-size BATCH_SIZE
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size
  --use-tfrecord        use tfrecord dataset
  --mixed-precision     Use mixed precision FP16
  --device DEVICE       device to train- dataset-pathsis same as- dataset-pathsin train script.
- If you pass output-pathargument, recognized text and real target text, distance metric is exported in tsv format.
- You can select your metric of CER or WER by passing metricargument.
You can infer with trained model to your audio files like below example.
$ python -m speech_recognition.run.inference \
    --data-config resources/configs/libri_config.yml \
    --model-config tests/data/model-configs/las_mini_for_test.yml \
    --audio-files "tests/data/audio_files/*.wav"  \
    --model-path tests/data/model-checkpoints/las.ckpt \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --batch-size 3 \
    --device CPU \
    --beam-size 2
...
[2021-06-07 13:28:27,696] [+] Use delta and deltas accelerate
[2021-06-07 13:28:31,202] Loaded weights of model from tests/data/model-checkpoints/las.ckpt
Model: "las"
(MODEL SUMMARY)
[2021-06-07 13:28:31,204] Start Inference
2021-06-07 13:28:31.238552: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:28:31.256769: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
[2021-06-07 13:28:35,693] Ended Inference, Start to save...
[2021-06-07 13:28:35,694] Saved (audio path,decoded sentence) pairs to output.tsvThen inferenced files is saved to output path.
  --data-config DATA_CONFIG
                        data processing config file
  --model-config MODEL_CONFIG
                        model config file
  --audio-files AUDIO_FILES
                        an audio file or glob pattern of multiple files ex)
                        *.pcm
  --model-path MODEL_PATH
                        pretrained model checkpoint
  --output-path OUTPUT_PATH
                        output tsv file path to save generated sentences
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
  --batch-size BATCH_SIZE
  --beam-size BEAM_SIZE
                        not given, use greedy search else beam search with
                        this value as beam size
  --mixed-precision     Use mixed precision FP16
  --device DEVICE       device to train- audio-filesis audio files glob pattern. i.e) "*.pcm", "data[0-9]+.wav"
- model-pathis tensorflow model checkpoint path.
You can convert dataset into TFRecord format like below example.
$ python -m speech_recognition.run.make_tfrecord \
    --data-config resources/configs/libri_config.yml \
    --dataset-paths tests/data/wav_dataset.tsv \
    --sp-model-path resources/sp-models/sp_model_unigram_16K_libri.model \
    --output-dir .
[2021-06-07 13:31:10,444] [+] Number of Dataset Files: 1
[2021-06-07 13:31:10,445] [+] Load Config From resources/configs/libri_config.yml
[2021-06-07 13:31:10,447] [+] Load Tokenizer From resources/sp-models/sp_model_unigram_16K_libri.model
...
2021-06-07 13:31:10.491991: I tensorflow/compiler/jit/xla_gpu_device.cc:99] Not creating XLA devices, tf_xla_enable_xla_devices not set
[2021-06-07 13:31:10,519] [+] Start Saving Dataset...
  0%|                                                                                                                                                                                        | 0/1 [00:00<?, ?it/s]2021-06-07 13:31:10.848397: I tensorflow_io/core/kernels/cpu_check.cc:128] Your CPU supports instructions that this TensorFlow IO binary was not compiled to use: AVX2 FMA
2021-06-07 13:31:11.530043: I tensorflow/compiler/mlir/mlir_graph_optimization_pass.cc:116] None of the MLIR optimization passes are enabled (registered 2)
2021-06-07 13:31:11.548833: I tensorflow/core/platform/profile_utils/cpu_utils.cc:112] CPU Frequency: 2198835000 Hz
100%|█| 1/1 [00:01<00:00,  1.35s/it]
[2021-06-07 13:31:11,867] [+] Done  --data-config DATA_CONFIG
                        data processing config file
  --dataset-paths DATASET_PATHS
                        dataset file path glob pattern
  --output-dir OUTPUT_DIR
                        output directory path, default is input dataset file
                        directoruy
  --sp-model-path SP_MODEL_PATH
                        sentencepiece model path
- The arguments is same as train script arguments.
- The output TFRecord file contains already pre-processed audio tensors and tokenized tensors, so you can train with only TFRecord file without tsv or audio files.