Model Adaptation for Automatic Speech Recognition Based on Multiple Time Scale Evolution

Changes in speech characteristics in real-world conversation originate from various factors and occur at various temporal rates. These temporal changes have their own dynamics; we therefore propose to extend single-time-scale incremental adaptation to multiscale adaptation, which can greatly increase a model's robustness by incorporating adaptation mechanisms that approximate the nature of the characteristic change. The formulation of incremental adaptation assumes a time-evolution system for the model, in which the posterior distributions used in the decision process are successively updated on a macroscopic time scale in accordance with Kalman filter theory. In this paper, we extend the original incremental adaptation scheme, based on a single time scale, to multiple time scales, and apply the method to the adaptation of both the acoustic model and the language model. We further investigate methods for integrating the multiscale adaptation schemes to realize robust speech recognition performance. Large-vocabulary continuous speech recognition experiments on English and Japanese lectures reveal the importance of modeling multiscale properties in speech recognition.

Index Terms: speech recognition, incremental adaptation, multiscale, time evolution system
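To illustrate the general idea of Kalman-style incremental adaptation at multiple time scales, the following is a minimal sketch, not the paper's actual formulation: it tracks a drifting scalar statistic with one scalar Kalman filter per time scale (a larger process variance corresponds to a faster scale) and combines the per-scale estimates by precision-weighted pooling. The function names (`kalman_update`, `multiscale_track`) and the pooling rule are assumptions introduced for illustration only.

```python
def kalman_update(mean, var, obs, process_var, obs_var):
    """One predictor-corrector step of a scalar Kalman filter."""
    # Predict: the state drifts, so the prior variance grows by process_var.
    var = var + process_var
    # Correct: the standard scalar Kalman gain blends prior and observation.
    gain = var / (var + obs_var)
    mean = mean + gain * (obs - mean)
    var = (1.0 - gain) * var
    return mean, var

def multiscale_track(observations, process_vars, obs_var=1.0):
    """Run one tracker per time scale and pool their posteriors.

    process_vars: one process variance per scale; a larger value lets
    that tracker follow fast changes, a smaller value averages slowly.
    """
    states = [(0.0, 1.0) for _ in process_vars]  # (mean, var) per scale
    pooled = []
    for obs in observations:
        states = [kalman_update(m, v, obs, q, obs_var)
                  for (m, v), q in zip(states, process_vars)]
        # Precision-weighted pooling of the per-scale posterior means.
        precisions = [1.0 / v for _, v in states]
        pooled.append(sum(m * p for (m, _), p in zip(states, precisions))
                      / sum(precisions))
    return pooled
```

With a slow and a fast scale, the pooled estimate stays smooth while the statistic is stable, yet the fast tracker lets it follow an abrupt change; this is the intuition behind combining multiple time-scale adaptations rather than committing to one rate.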
