Dor890/Speech-Processing

Speech Recognition and Processing

Written by: Dor Messica, Roi Zehavi.

This repository contains a collection of exercises and a final project focused on speech processing and recognition. The exercises cover a range of topics, including Fourier transforms, MFCCs (Mel-frequency cepstral coefficients), Mel spectrograms, audio and music classification, DTW, DNN-HMM, CTC, a full ASR model, and more.

Exercise 1

In Exercise 1 we explored Fourier transforms, MFCCs, Mel spectrograms, and basic audio handling and manipulation techniques. This exercise serves as a foundation for the rest of the repository: it covers fundamental concepts and techniques that are widely used in speech processing and recognition.

log_mel
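As a rough illustration of how a log-Mel spectrogram like the one pictured above can be computed, here is a minimal NumPy sketch. It is not the code from the exercise: the function names, the 25 ms / 10 ms framing at 16 kHz, the Hann window, and the 40 Mel bands are all assumptions chosen for the example.

```python
import numpy as np

def hz_to_mel(f):
    # Common HTK-style Mel scale formula
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels, n_fft, sr):
    # Triangular filters whose edges are equally spaced on the Mel scale
    hz_points = mel_to_hz(np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2))
    fft_freqs = np.linspace(0.0, sr / 2, n_fft // 2 + 1)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(n_mels):
        left, center, right = hz_points[m], hz_points[m + 1], hz_points[m + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        fb[m] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    # Assumed framing: 400-sample (25 ms) Hann windows with a 160-sample (10 ms) hop
    window = np.hanning(n_fft)
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack([signal[i * hop: i * hop + n_fft] * window
                       for i in range(n_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2   # (n_frames, n_fft // 2 + 1)
    mel_energy = power @ mel_filterbank(n_mels, n_fft, sr).T
    return np.log(mel_energy + 1e-10)                  # small floor avoids log(0)

# Usage: one second of a 440 Hz tone sampled at 16 kHz
t = np.arange(16000) / 16000
features = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
print(features.shape)
```

MFCCs are typically obtained by applying a discrete cosine transform to each row of this log-Mel matrix and keeping the first few coefficients.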

Exercise 2

Exercise 2 focuses on music classification using different features. We implemented logistic regression from scratch and used it to build a music classifier. This exercise provides hands-on experience with implementing a machine learning algorithm and applying it to music classification, and it gave us insight into feature selection and classification techniques.
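To make "logistic regression from scratch" concrete, the following is a minimal sketch of a binary classifier trained with batch gradient descent. It is an illustration under assumptions, not the exercise's implementation: the function names, the learning rate, the epoch count, and the toy one-dimensional data are hypothetical, and the actual exercise may use different features or more than two classes.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, epochs=500):
    # X: (n_samples, n_features) feature matrix, y: (n_samples,) labels in {0, 1}
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(epochs):
        p = sigmoid(X @ w + b)          # predicted probability of class 1
        grad_w = X.T @ (p - y) / n      # gradient of the mean cross-entropy loss
        grad_b = np.mean(p - y)
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def predict(X, w, b):
    return (sigmoid(X @ w + b) >= 0.5).astype(int)

# Usage on toy, linearly separable data (a stand-in for real audio features)
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0, 0, 1, 1])
w, b = train_logistic_regression(X, y)
print(predict(X, w, b))
```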

Exercise 3

In Exercise 3 we delved into spoken-digit classification. This exercise uses MFCCs and DTW (Dynamic Time Warping) to build a classifier that recognizes spoken digits, which gave us practical experience in applying both techniques to speech recognition.

image
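The idea of classifying a recording by its DTW distance to labeled examples can be sketched as follows. This is a minimal version under assumptions: each utterance is a sequence of feature vectors (for example MFCC frames), the local cost is the Euclidean distance, and the classifier is a plain nearest-template rule. The names `dtw_distance` and `classify` are hypothetical and not taken from the exercise.

```python
import math

def dtw_distance(a, b):
    # a, b: sequences of equal-dimension feature vectors (e.g. MFCC frames)
    n, m = len(a), len(b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = math.dist(a[i - 1], b[j - 1])
            # Extend the cheapest of the three allowed predecessor alignments
            cost[i][j] = local + min(cost[i - 1][j],      # a advances, b repeats
                                     cost[i][j - 1],      # b advances, a repeats
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]

def classify(query, templates):
    # templates: dict mapping a label to one reference sequence
    return min(templates, key=lambda label: dtw_distance(query, templates[label]))

# Usage with toy one-dimensional "features"
templates = {"rising": [(0,), (1,), (2,)], "flat": [(1,), (1,), (1,)]}
query = [(0,), (0,), (1,), (2,), (2,)]   # a time-stretched rising pattern
print(classify(query, templates))
```

Because DTW lets either sequence repeat frames, the stretched query aligns with the "rising" template at zero cost even though the two have different lengths.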

Exercise 4

In this exercise we implement the CTC loss in Python, which calculates the probability of a specific labeling given the model's output distribution over phonemes. We are given a sequence of phonemes p and the network's output y, a matrix of shape T × K, where T is the number of time steps and K is the number of phonemes. Each row t of y is a distribution over the K phonemes at time t. Our goal is to implement the CTC function that calculates P(p|x) using the following equations:

image

The final probability is obtained by filling in the following dynamic programming matrix:

image
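Since the equations and the matrix are shown only as images, here is a minimal sketch of the standard CTC forward recursion for reference. It is an illustration under assumptions rather than the exercise's code: it assumes the blank symbol is one of the K columns of y (index 0 by default), takes the phonemes as integer column indices, and works in plain probabilities, whereas practical implementations usually work in log space to avoid underflow.

```python
def ctc_probability(y, labels, blank=0):
    # y: T x K matrix, y[t][k] = probability of symbol k at time step t
    # labels: target phoneme indices (without blanks)
    T = len(y)
    # Extended target: a blank before, between, and after the labels
    z = [blank]
    for p in labels:
        z += [p, blank]
    S = len(z)

    # alpha[t][s]: total probability of all length-(t+1) paths ending at z[s]
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = y[0][z[0]]
    if S > 1:
        alpha[0][1] = y[0][z[1]]

    for t in range(1, T):
        for s in range(S):
            total = alpha[t - 1][s]                 # stay on the same symbol
            if s >= 1:
                total += alpha[t - 1][s - 1]        # move from the previous symbol
            if s >= 2 and z[s] != blank and z[s] != z[s - 2]:
                total += alpha[t - 1][s - 2]        # skip the blank between two different labels
            alpha[t][s] = total * y[t][z[s]]

    # Valid paths end on the last label or on the trailing blank
    prob = alpha[T - 1][S - 1]
    if S > 1:
        prob += alpha[T - 1][S - 2]
    return prob

# Usage: two time steps, symbols {blank, a}, uniform outputs.
# The paths "aa", "a-" and "-a" all collapse to "a": 3 * 0.25 = 0.75
y = [[0.5, 0.5], [0.5, 0.5]]
print(ctc_probability(y, [1]))  # 0.75
```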

Final Project

For the final project we built a full Automatic Speech Recognition (ASR) pipeline, drawing on everything we learned in the course: Dynamic Time Warping (DTW), Hidden Markov Models (HMMs), DNN-HMM, end-to-end models (CTC, ASG, RNN Transducers, LAS, etc.), language modeling, different acoustic features, ensemble methods, and different search methods.

Throughout the project we evaluated, analyzed, and documented the performance of each tested configuration so that we could make better-informed decisions when picking the final model. In that sense, the work was carried out as a full research project on a small scale.

We used the AN4 dataset, a small dataset recorded and distributed by Carnegie Mellon University (CMU). It consists of recordings of people spelling out addresses, names, etc. We evaluate our model using Word Error Rate (WER) and Character Error Rate (CER).
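Both metrics are the Levenshtein (edit) distance between the reference and the predicted transcript, normalized by the reference length, computed over words for WER and over characters for CER. The sketch below shows one way to compute them. The function names are hypothetical, the example sentences are made up, and the code assumes a non-empty reference.

```python
def edit_distance(ref, hyp):
    # Levenshtein distance between two sequences, keeping one row at a time
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,               # deletion
                            curr[j - 1] + 1,           # insertion
                            prev[j - 1] + (r != h)))   # substitution or match
        prev = curr
    return prev[-1]

def wer(reference, hypothesis):
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    return edit_distance(list(reference), list(hypothesis)) / len(reference)

# Usage: one wrong word out of four, one extra character out of seventeen
print(wer("go to main street", "go to maine street"))  # 0.25
print(cer("go to main street", "go to maine street"))
```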

The structure of the model is as follows: image

Throughout the project we explore several ways to build an ASR model. Our final model is a neural network based on the DeepSpeech2 architecture, trained with the CTC loss. To produce predictions we used a beam search decoder that gives a moderate weight to a 4-gram KenLM language model. On the test set we obtained the following results (WER on the left, CER on the right): image

These results are reflected in the alignments we produced for different samples from the dataset:

image
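For readers new to CTC decoding, the sketch below shows greedy (best-path) decoding, the simplest way to turn per-frame outputs into text. It is not the decoder used in the project, which is a beam search with a KenLM language model. Greedy decoding is shown only as the baseline that beam search generalizes, and the function name, alphabet, and frame values are made up for the example.

```python
def ctc_greedy_decode(y, alphabet, blank=0):
    # y: T x K matrix of per-frame scores; alphabet[k] is the character for column k
    best = [max(range(len(frame)), key=lambda k: frame[k]) for frame in y]
    out = []
    prev = None
    for k in best:
        # CTC collapse rule: merge repeated symbols, then drop blanks
        if k != prev and k != blank:
            out.append(alphabet[k])
        prev = k
    return "".join(out)

# Usage: the frame-wise argmax sequence is a, a, blank, a, b, b
alphabet = ["-", "a", "b"]
y = [
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.9, 0.05, 0.05],
    [0.1, 0.6, 0.3],
    [0.1, 0.2, 0.7],
    [0.2, 0.1, 0.7],
]
print(ctc_greedy_decode(y, alphabet))  # aab
```

A beam search keeps several candidate transcripts per time step instead of one, which is what makes it possible to add a language model score to each candidate.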

For more information, you are more than welcome to read the final report, written in Hebrew, which is located in the "Final Project" directory.

Conclusion

We hope this repository provides you with a valuable learning experience in speech processing and recognition. The exercises are designed to cover key concepts and techniques, allowing you to gain hands-on experience with various aspects of speech-related tasks. Feel free to explore the exercises, experiment with the code, and expand your knowledge in this exciting field!

Happy coding!
