Adaptive & discriminative speech modeling to cope with temporal changes of environments

Speech characteristics in real-world conversation change because of various factors, and these changes occur at various temporal rates. Because these temporal changes have their own dynamics, we propose extending single-time-scale incremental adaptation to multiscale adaptation, which has the potential to greatly increase the model’s robustness by including adaptation mechanisms that approximate the nature of the characteristic changes. The formulation of incremental adaptation assumes a time evolution system for the model, in which the posterior distributions used in the decision process are successively updated on a macroscopic time scale in accordance with Kalman filter theory. In this paper, we extend the original incremental adaptation scheme from a single time scale to multiple time scales and apply the method to the adaptation of both the acoustic model and the language model. We further investigate methods of integrating the multiscale adaptation schemes to achieve robust speech recognition performance. Large-vocabulary continuous speech recognition experiments on English and Japanese lectures revealed the importance of modeling multiscale properties in speech recognition.
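The idea of updating a posterior at several time scales can be illustrated with a toy sketch. It is not the formulation used in the paper: it tracks a single scalar statistic with one random-walk Kalman filter per time scale, and combines the scales by a plain average. The names `ScalarKalman` and `multiscale_adapt`, the scale values, the noise variances, and the averaging rule are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScalarKalman:
    """Random-walk Kalman filter for one scalar parameter (illustrative)."""
    mean: float = 0.0          # posterior mean of the tracked parameter
    var: float = 1.0           # posterior variance
    process_var: float = 0.01  # assumed drift of the parameter per update
    obs_var: float = 1.0       # assumed observation noise variance

    def update(self, observation: float) -> float:
        # Predictor step: under a random walk the mean is unchanged
        # and only the uncertainty grows.
        prior_var = self.var + self.process_var
        # Corrector step: blend prior and observation by the Kalman gain.
        gain = prior_var / (prior_var + self.obs_var)
        self.mean = self.mean + gain * (observation - self.mean)
        self.var = (1.0 - gain) * prior_var
        return self.mean


def multiscale_adapt(utterance_stats, scales=(1, 4, 16)):
    """Track a drifting scalar with one filter per time scale.

    A filter with scale s is updated once every s utterances, using the
    mean of the last s statistics. For simplicity the observation noise
    is kept the same at every scale. The returned list holds, for each
    utterance, the plain average of the per-scale posterior means
    (a hypothetical combination rule).
    """
    filters = {s: ScalarKalman() for s in scales}
    buffers = {s: [] for s in scales}
    combined = []
    for x in utterance_stats:
        for s in scales:
            buffers[s].append(x)
            if len(buffers[s]) == s:
                filters[s].update(sum(buffers[s]) / s)
                buffers[s].clear()
        combined.append(sum(f.mean for f in filters.values()) / len(filters))
    return combined


if __name__ == "__main__":
    # A statistic that jumps from 0 to 1 halfway through the session.
    stats = [0.0] * 16 + [1.0] * 16
    for step, estimate in enumerate(multiscale_adapt(stats), start=1):
        print(step, round(estimate, 3))
```

In this sketch the scale-1 filter reacts to the jump immediately, while the coarser filters move only when their blocks complete, so the combined estimate mixes fast and slow responses. A real system would adapt full acoustic-model and language-model posteriors and would likely weight the scales rather than average them.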
