Update README.md with training details (#1637)
Paschalis authored Sep 26, 2024
1 parent 1b308a3 commit a9f27eb
Showing 1 changed file with 119 additions and 2 deletions.
121 changes: 119 additions & 2 deletions training/README.md
@@ -1,3 +1,120 @@
A proper simple setup to train a Vosk model
# Vosk API Training

This directory contains scripts and tools for training speech recognition models using the Kaldi toolkit.

## Table of Contents

1. [Overview](#overview)
2. [Directory Structure](#directory-structure)
3. [Installation](#installation)
4. [Training Process](#training-process)
   - [Data Preparation](#data-preparation)
   - [Dictionary Preparation](#dictionary-preparation)
   - [MFCC Feature Extraction](#mfcc-feature-extraction)
   - [Acoustic Model Training](#acoustic-model-training)
   - [TDNN Chain Model Training](#tdnn-chain-model-training)
   - [Decoding](#decoding)
5. [Results](#results)
6. [Contributing](#contributing)

## Overview

This repository provides tools for training custom speech recognition models using Kaldi. It supports acoustic model training, language model creation, and decoding pipelines.

## Directory Structure

```plaintext
.
├── cmd.sh # Command configuration for training and decoding
├── conf/
│ ├── mfcc.conf # Configuration for MFCC feature extraction
│ └── online_cmvn.conf # Online Cepstral Mean Variance Normalization (currently empty)
├── local/
│ ├── chain/
│ │ ├── run_ivector_common.sh # Script for i-vector extraction during chain model training
│ │ └── run_tdnn.sh # Script for training a TDNN model
│ ├── data_prep.sh # Data preparation script for creating Kaldi data directories
│ ├── download_and_untar.sh # Script for downloading and extracting datasets
│ ├── download_lm.sh # Downloads language models
│ ├── prepare_dict.sh # Prepares the pronunciation dictionary
│ └── score.sh # Scoring script for evaluation
├── path.sh # Script for setting Kaldi paths
├── RESULTS # Script for printing the best WER results
├── RESULTS.txt # Contains WER results from decoding
├── run.sh # Main script for the entire training pipeline
├── steps -> ../../wsj/s5/steps/ # Link to Kaldi’s WSJ steps for acoustic model training
└── utils -> ../../wsj/s5/utils/ # Link to Kaldi’s utility scripts
```

### Key Files:
- **cmd.sh**: Defines commands for running training and decoding tasks.
- **path.sh**: Sets up paths for Kaldi binaries and scripts.
- **run.sh**: Main entry point for the training pipeline, running tasks in stages.
- **RESULTS**: Displays Word Error Rate (WER) for the trained models.
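
Running tasks in stages usually follows the Kaldi recipe convention of guarding each block with a comparison against the requested stage range. The sketch below only illustrates that pattern and is not copied from `run.sh`. The function name `stages_to_run` is hypothetical, and it assumes, consistent with the commands in this README, that `--stage` is the first stage to run and `--stop_stage` the last.

```bash
# Hypothetical illustration of stage gating; not the actual run.sh.
# Prints the stages that would run for a given --stage / --stop_stage pair.
stages_to_run() {
  local stage="$1" stop_stage="$2" s
  for s in 0 1 2 3 4 5; do
    # A block runs only when its number lies in [stage, stop_stage].
    if [ "$stage" -le "$s" ] && [ "$stop_stage" -ge "$s" ]; then
      echo "stage $s"
    fi
  done
}
```

For example, `stages_to_run 2 3` lists stages 2 and 3, which mirrors what `bash run.sh --stage 2 --stop_stage 3` would be expected to execute under this assumption.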

## Installation

### Prerequisites
- [Kaldi](https://github.com/kaldi-asr/kaldi): Kaldi toolkit must be installed and configured.
- Required tools: `ffmpeg`, `sox`, `sctk` for data preparation and scoring.
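
Before starting, it can help to confirm that the required tools are reachable on `PATH`. The following is a minimal sketch; the helper name `missing_tools` is hypothetical, and it assumes SCTK is detected through its `sclite` binary, which may instead live under Kaldi's `tools/` directory on your system.

```bash
# Print every command from the argument list that is not found on PATH.
missing_tools() {
  local tool
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || echo "$tool"
  done
}
```

Calling `missing_tools ffmpeg sox sclite` prints nothing when all three are available and prints the name of each missing tool otherwise.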

### Steps
1. Clone the Vosk API repository.
2. Install Kaldi and make sure `KALDI_ROOT` in `path.sh` points to your Kaldi installation.
3. Set the environment variables by sourcing `cmd.sh` and `path.sh`.
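
A quick sanity check for the `KALDI_ROOT` setting might look like the sketch below. The function name `looks_like_kaldi_root` is hypothetical, and the check assumes only that a Kaldi checkout contains `src/` and `tools/` subdirectories; it does not verify that Kaldi has been compiled.

```bash
# Succeed only if the given directory looks like a Kaldi checkout,
# i.e. it is non-empty and contains src/ and tools/ subdirectories.
looks_like_kaldi_root() {
  local root="$1"
  [ -n "$root" ] && [ -d "$root/src" ] && [ -d "$root/tools" ]
}
```

After sourcing `path.sh`, running `looks_like_kaldi_root "$KALDI_ROOT" || echo "KALDI_ROOT is not set correctly"` gives an early warning before any training stage starts.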

## Training Process

### Data Preparation
Run the data preparation stage in `run.sh`:
```bash
bash run.sh --stage 0 --stop_stage 0
```
This stage downloads and prepares the LibriSpeech dataset.
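
A Kaldi data directory normally contains `wav.scp`, `text`, `utt2spk`, and `spk2utt`. The sketch below reports how many entries each file has, which is a quick way to confirm that preparation produced something sensible. The helper name `data_dir_summary` is hypothetical, and the directory you pass (for example `data/train`) is an assumption, since the exact output paths depend on `run.sh`.

```bash
# Report the line count of each core Kaldi data-directory file,
# or "missing" when the file is absent.
data_dir_summary() {
  local dir="$1" f
  for f in wav.scp text utt2spk spk2utt; do
    if [ -f "$dir/$f" ]; then
      # tr strips the padding some wc implementations add.
      echo "$f $(wc -l < "$dir/$f" | tr -d ' ')"
    else
      echo "$f missing"
    fi
  done
}
```

For a more thorough check, Kaldi ships `utils/validate_data_dir.sh`, which also verifies sorting and consistency between the files.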

### Dictionary Preparation
Prepare the pronunciation dictionary with:
```bash
bash run.sh --stage 1 --stop_stage 1
```
This step generates the necessary files for Kaldi's `prepare_lang.sh` script.
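
A Kaldi lexicon lists one word per line followed by its phone sequence, for example `HELLO HH AH L OW`. The sketch below flags entries that have a word but no pronunciation, a common cause of failures in `prepare_lang.sh`. The helper name `bad_lexicon_lines` is hypothetical, and the lexicon path you pass is an assumption, since the location is decided by `local/prepare_dict.sh`.

```bash
# Print "line-number: content" for lexicon entries that lack a pronunciation,
# i.e. lines with fewer than two whitespace-separated fields.
bad_lexicon_lines() {
  awk 'NF < 2 { print NR ": " $0 }' "$1"
}
```

An empty result means every entry has at least one phone.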

### MFCC Feature Extraction
Run the MFCC extraction process:
```bash
bash run.sh --stage 2 --stop_stage 2
```
This step extracts Mel-frequency cepstral coefficient (MFCC) features and computes cepstral mean and variance normalization (CMVN) statistics.

### Acoustic Model Training
Train monophone, LDA+MLLT, and SAT models:
```bash
bash run.sh --stage 3 --stop_stage 3
```
This stage trains GMM-based models and aligns the data for TDNN training.

### TDNN Chain Model Training
Train a Time-Delay Neural Network (TDNN) chain model:
```bash
bash run.sh --stage 4 --stop_stage 4
```
The chain model uses i-vectors for speaker adaptation.

### Decoding
After training, decode the test data:
```bash
bash run.sh --stage 5 --stop_stage 5
```
This step decodes the test data with the trained model and reports the Word Error Rate (WER).
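
The stages above can also be driven one at a time from a small wrapper that stops at the first failure, which makes it easier to see where a long run broke. This is a sketch under assumptions: the function name `run_stages` is hypothetical, and it relies only on the `--stage` and `--stop_stage` options shown in this README.

```bash
# Run stages first..last one at a time through the given runner command,
# stopping at the first stage that fails.
run_stages() {
  local runner="$1" stage="$2" last="$3"
  while [ "$stage" -le "$last" ]; do
    echo "running stage $stage"
    # $runner is intentionally unquoted so "bash run.sh" splits into words.
    $runner --stage "$stage" --stop_stage "$stage" || {
      echo "stage $stage failed"
      return 1
    }
    stage=$((stage + 1))
  done
}
```

A full run would then be `run_stages "bash run.sh" 0 5`, and a failed run can be resumed from the reported stage.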

## Results

Display the WER results by running:
```bash
bash RESULTS
```
Example of `RESULTS.txt`:
```plaintext
%WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] exp/chain/tdnn/decode_test/wer_11_0.0
%WER 12.67 [ 2552 / 20138, 215 ins, 406 del, 1931 sub ] exp/chain/tdnn/decode_test_rescore/wer_11_0.0
```
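
Each line reports WER as the number of errors (insertions + deletions + substitutions) divided by the number of reference words; for the first line, 2839 / 20138 gives 14.10%. The sketch below recomputes that value from the counts and picks the line with the lowest WER. Both function names are hypothetical, and this is not the logic of the `RESULTS` script, only an illustration that assumes the line format shown above.

```bash
# Recompute WER from the ins/del/sub counts on each "%WER" line.
# Fields: %WER 14.10 [ 2839 / 20138, 214 ins, 487 del, 2138 sub ] path
wer_from_counts() {
  awk '/^%WER/ {
    total = $6 + 0                      # "20138," converts to 20138
    errors = $7 + $9 + $11              # ins + del + sub
    printf "%.2f %s\n", 100 * errors / total, $14
  }'
}

# Print the "%WER" line with the lowest reported WER.
best_wer() {
  grep '^%WER' | LC_ALL=C sort -n -k2,2 | head -n 1
}
```

Both functions read from standard input, so `wer_from_counts < RESULTS.txt` and `best_wer < RESULTS.txt` work directly on the results file.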
