A comparative study of noise estimation algorithms for nonlinear compensation in robust speech recognition

Nonlinear compensation models make use of a nonlinear mismatch function, which characterizes the joint effects of additive and convolutional noise, to realize noise-robust speech recognition. Representative compensation models include vector Taylor series (VTS), data-driven parallel model combination (DPMC), and the unscented transform (UT). The noise parameters of the compensation models, often estimated in the maximum likelihood (ML) sense, are known to play an important role in system performance under noisy conditions. In this paper, we conduct a systematic comparison between two popular approaches for estimating the noise parameters. The first approach employs the Gauss-Newton method in a generalized EM framework to iteratively maximize the EM auxiliary function. The second approach views the compensation models from a generative perspective, giving rise to an EM algorithm analogous to the ML estimation for factor analysis (EM-FA). We demonstrate a close connection between these two approaches: both belong to the family of gradient-based methods, but with different convergence rates, and the Gauss-Newton method is superior to the EM-FA method in terms of convergence. Note that the convergence property can be crucial to noise estimation, since model compensation may be carried out frequently in changing noisy environments to retain the desired performance. Furthermore, we present an in-depth discussion of the advantages and limitations of the two approaches, and illustrate how to extend them to allow for adaptive training. The investigated noise estimation approaches are evaluated on several tasks.
The first task fits a Gaussian mixture model (GMM) to artificially corrupted samples; speech recognition experiments are then performed on the Aurora 2 and Aurora 4 tasks.
