Single Channel Speech Enhancement Using Outlier Detection

Distortion of the underlying speech is a common problem in single-channel speech enhancement algorithms and limits their wider use. A dictionary-based speech enhancement method that emphasizes preserving the underlying speech is proposed. Spectral patches of clean speech are sampled and clustered to train a dictionary. Given a noisy speech spectral patch, the best-matching dictionary entry is selected and used to estimate the noise power at each time-frequency bin. The noise estimation step is formulated as an outlier detection problem: noise is assumed present at a bin only if the observed value is an outlier relative to the corresponding bin of the best-matching dictionary entry. This framework gives higher priority to removing spectral elements that deviate strongly from a typical spoken unit stored in the trained dictionary. Even without a separate noise model, the method can achieve significant noise reduction for various non-stationary noises while effectively preserving the underlying speech in more challenging noisy environments.
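
The pipeline described above (cluster clean patches, pick the best-matching entry, flag outlier bins as noise) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The abstract does not specify the clustering algorithm, the matching criterion, or the outlier rule, so the sketch assumes plain k-means, squared Euclidean matching, and a per-bin threshold of `threshold` standard deviations above the matched entry. The function names `train_dictionary` and `estimate_noise_power` and all parameters are hypothetical.

```python
import numpy as np


def train_dictionary(patches, n_entries, n_iter=20, seed=0):
    """Cluster clean-speech spectral patches into a dictionary.

    Assumption: plain k-means on flattened patches; the abstract only
    says the patches are "sampled and clustered".
    Returns (centroids, spread), each of shape (n_entries, patch_size).
    """
    rng = np.random.default_rng(seed)
    data = patches.reshape(len(patches), -1).astype(float)
    start = rng.choice(len(data), size=n_entries, replace=False)
    centroids = data[start].copy()

    for _ in range(n_iter):
        dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for k in range(n_entries):
            members = data[labels == k]
            if len(members):
                centroids[k] = members.mean(axis=0)

    # Per-bin spread of each cluster, used later as the outlier scale.
    dists = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = dists.argmin(axis=1)
    spread = np.ones_like(centroids)  # fallback for tiny clusters
    for k in range(n_entries):
        members = data[labels == k]
        if len(members) > 1:
            spread[k] = members.std(axis=0) + 1e-8
    return centroids, spread


def estimate_noise_power(noisy_patch, centroids, spread, threshold=3.0):
    """Estimate per-bin noise power for one noisy spectral patch.

    Assumptions: best match = smallest squared Euclidean distance;
    a bin is an outlier if it exceeds the matched entry by more than
    `threshold` times that entry's per-bin spread.
    Returns (noise_estimate, index_of_best_entry).
    """
    x = noisy_patch.reshape(-1).astype(float)
    best = int(((centroids - x) ** 2).sum(axis=1).argmin())
    excess = x - centroids[best]
    is_outlier = excess > threshold * spread[best]
    # Noise is assumed present only at outlier bins; elsewhere it is zero.
    noise = np.where(is_outlier, excess, 0.0)
    return noise.reshape(noisy_patch.shape), best
```

Because bins that agree with the matched entry get a zero noise estimate, a subsequent gain function would leave them untouched, which is the speech-preserving behavior the abstract emphasizes. Euclidean matching is itself sensitive to the outlier bins it is meant to detect, so a practical system would likely need a more robust matching criterion.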
