This code uses a deep neural network (DNN) for speech enhancement. It is a Keras implementation of the paper:
[1] Xu, Y., Du, J., Dai, L.R. and Lee, C.H., 2015. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Transactions on Audio, Speech and Language Processing (TASLP), 23(1), pp.7-19.
The original C++ implementation is by Yong Xu ([email protected]).
The original Keras re-implementation is by Qiuqiang Kong ([email protected]).
    Noise(0dB)   PESQ
    ----------------------
    n64          1.36 +- 0.05
    n71          1.35 +- 0.18
    ----------------------
    Avg.         1.35 +- 0.12
You may replace the mini data with your own data. The data that needs to be prepared to re-run the experiments in [1] is listed in meta_data/. It contains:
Training: Speech: TIMIT, 4620 training sentences. Noise: 115 kinds of noise.
Testing: Speech: TIMIT, 168 testing sentences (10% selected from the 1680 testing sentences). Noise: Noise 92.
Some of the datasets are not published. You can collect your own data instead.
Download and prepare data.
Set MINIDATA=0 in Modify WORKSPACE, TR_SPEECH_DIR, TR_NOISE_DIR, TE_SPEECH_DIR, TE_NOISE_DIR in and some arguments (get_args() function)
    Iteration: 0, tr_loss: 1.228049, te_loss: 1.252313
    Iteration: 1000, tr_loss: 0.533825, te_loss: 0.677872
    Iteration: 2000, tr_loss: 0.505751, te_loss: 0.678816
    Iteration: 3000, tr_loss: 0.483631, te_loss: 0.666576
    Iteration: 4000, tr_loss: 0.480287, te_loss: 0.675403
    Iteration: 5000, tr_loss: 0.457020, te_loss: 0.676319
    Saved model to /vol/vssp/msos/qk/workspaces/speech_enhancement/models/0db/md_5000iters.h5
    Iteration: 6000, tr_loss: 0.461330, te_loss: 0.673847
    Iteration: 7000, tr_loss: 0.445159, te_loss: 0.668545
    Iteration: 8000, tr_loss: 0.447244, te_loss: 0.680740
    Iteration: 9000, tr_loss: 0.427652, te_loss: 0.678236
    Iteration: 10000, tr_loss: 0.421219, te_loss: 0.663294
    Saved model to /vol/vssp/msos/qk/workspaces/speech_enhancement/models/0db/md_10000iters.h5
    Training time: 202.551192045 s
The final PESQ looks like:
    Noise(0dB)        PESQ
    ---------------------------------
    pink              2.01 +- 0.23
    buccaneer1        1.88 +- 0.25
    factory2          2.21 +- 0.21
    hfchannel         1.63 +- 0.24
    factory1          1.93 +- 0.23
    babble            1.81 +- 0.28
    m109              2.13 +- 0.25
    leopard           2.49 +- 0.23
    volvo             2.83 +- 0.23
    buccaneer2        2.03 +- 0.25
    white             2.00 +- 0.21
    f16               1.86 +- 0.24
    destroyerops      1.99 +- 0.23
    destroyerengine   1.86 +- 0.23
    machinegun        2.55 +- 0.27
    ---------------------------------
    Avg.              2.08 +- 0.24
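The Avg. row in the PESQ table appears to be consistent with a plain average over the per-noise rows (mean of the means, mean of the deviations). The following is a minimal sketch that checks this with the values copied from the table; the variable names are illustrative and are not taken from the repository.

```python
# Per-noise PESQ (mean, deviation) at 0 dB, copied from the table.
pesq = {
    "pink": (2.01, 0.23), "buccaneer1": (1.88, 0.25),
    "factory2": (2.21, 0.21), "hfchannel": (1.63, 0.24),
    "factory1": (1.93, 0.23), "babble": (1.81, 0.28),
    "m109": (2.13, 0.25), "leopard": (2.49, 0.23),
    "volvo": (2.83, 0.23), "buccaneer2": (2.03, 0.25),
    "white": (2.00, 0.21), "f16": (1.86, 0.24),
    "destroyerops": (1.99, 0.23), "destroyerengine": (1.86, 0.23),
    "machinegun": (2.55, 0.27),
}

means = [m for m, _ in pesq.values()]
devs = [d for _, d in pesq.values()]

# Assumption: the summary row averages each column independently.
avg_mean = sum(means) / len(means)
avg_dev = sum(devs) / len(devs)

print("Avg. %.2f +- %.2f" % (avg_mean, avg_dev))  # prints "Avg. 2.08 +- 0.24"
```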
In the inference step, you may add --visualize to the arguments to plot the log magnitude spectrograms of the mixture, clean and enhanced speech.
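For readers unfamiliar with the term, a log magnitude spectrogram can be sketched as follows. This is not the repository's plotting code: the function name, the window length of 512 samples, the hop of 256 samples and the small `eps` offset are all assumptions chosen for illustration.

```python
import numpy as np

def log_magnitude_spectrogram(audio, n_window=512, n_hop=256, eps=1e-8):
    """Return an array of shape (n_frames, n_window // 2 + 1).

    Hypothetical helper: frames the signal, applies a Hamming window,
    takes the magnitude of the real FFT, then the natural log.
    """
    window = np.hamming(n_window)
    n_frames = 1 + (len(audio) - n_window) // n_hop
    frames = np.stack([
        audio[i * n_hop: i * n_hop + n_window] * window
        for i in range(n_frames)
    ])
    magnitude = np.abs(np.fft.rfft(frames, axis=1))
    # eps avoids log(0) on silent frames.
    return np.log(magnitude + eps)

# Example input: one second of a 440 Hz tone sampled at 16 kHz.
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)
log_spec = log_magnitude_spectrogram(tone)
print(log_spec.shape)  # prints (61, 257)
```

The transposed array (frequency on the vertical axis) can then be passed to an image plotting routine, once each for the mixture, the clean speech and the enhanced speech.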
PESQ does not support long path or folder names, so please keep your paths and folder names short. Otherwise you will get a wrong or low PESQ score. Alternatively, you can modify the PESQ source code to enlarge the path name variable.
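One possible workaround, if moving the workspace is inconvenient, is to copy each pair of files to a scratch folder under short names before scoring. The sketch below only does the copying; the helper name and the file names `ref.wav` and `deg.wav` are hypothetical, and how the PESQ tool is invoked afterwards is left out on purpose.

```python
import os
import shutil
import tempfile

def stage_with_short_names(ref_path, deg_path):
    """Copy a reference/degraded pair into a new temp folder.

    Hypothetical helper: returns the folder and the two short file
    names, so the scoring tool can be run with that folder as its
    working directory and only the bare file names as arguments.
    """
    work_dir = tempfile.mkdtemp(prefix="pesq_")
    shutil.copyfile(ref_path, os.path.join(work_dir, "ref.wav"))
    shutil.copyfile(deg_path, os.path.join(work_dir, "deg.wav"))
    return work_dir, "ref.wav", "deg.wav"
```

Running the tool from inside `work_dir` keeps the names it sees short regardless of how deep the original files are. Remove the folder with `shutil.rmtree(work_dir)` once the score has been read.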
For larger datasets that cannot be loaded into memory at one time, you can:
1. Prepare your training scp list.
2. Shuffle your training scp list.
3. Split your training scp list into several parts.
4. Read each part for training, one by one.
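The steps above can be sketched as follows. This is a minimal illustration, not code from the repository: `split_scp_list` and `train_in_parts` are hypothetical names, and `load_fn` and `train_fn` stand in for whatever feature loading and training update you already have.

```python
import random

def split_scp_list(scp_list, n_parts, seed=0):
    """Shuffle a copy of the scp list and split it into n_parts."""
    shuffled = list(scp_list)
    random.Random(seed).shuffle(shuffled)
    # Striding keeps the part sizes within one entry of each other.
    return [shuffled[i::n_parts] for i in range(n_parts)]

def train_in_parts(scp_list, n_parts, load_fn, train_fn, seed=0):
    """Load and train on one part at a time to bound memory use."""
    for part in split_scp_list(scp_list, n_parts, seed):
        data = load_fn(part)   # only this part is held in memory
        train_fn(data)         # e.g. one pass over the loaded part
```

A call such as `train_in_parts(scp_list, 10, load_fn, train_fn)` would then hold roughly a tenth of the data in memory at a time. Changing `seed` between epochs gives a different shuffle and split.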