Autoregressive product of multi-frame predictions can improve the accuracy of hybrid models

We describe a simple but effective way of using multi-frame targets to improve the accuracy of Artificial Neural Network-Hidden Markov Model (ANN-HMM) hybrid systems. In this approach, a Deep Neural Network (DNN) is trained to predict the forced-alignment state of multiple frames, using a separate softmax unit for each frame. This is in contrast to the usual method of training a DNN to predict only the state of the central frame. By itself, this is not sufficient to improve the accuracy of the system significantly. However, if we average the predictions for each frame across the different contexts it is associated with, we achieve state-of-the-art results on TIMIT using a fully connected Deep Neural Network, without convolutional architectures or dropout training. On a 14-hour subset of the Wall Street Journal (WSJ) corpus, a context-dependent DNN-HMM system trained this way yields a relative improvement of 6.4% on the dev set (test-dev93) and 9.3% on the test set (test-eval92).
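To make the combination step concrete, here is a minimal sketch in Python/NumPy. It assumes the network emits, for each input window centred on frame t, K softmax distributions over the S HMM states (one per predicted frame), and it combines the overlapping predictions for each frame by averaging their log-probabilities and renormalising, i.e. a geometric mean, which realises the "product" of the title. All names here (combine_multiframe_predictions, per_context_logprobs) are hypothetical, and the exact combination rule is an assumption based on the title and abstract, not the paper's verbatim recipe.

```python
import numpy as np

def combine_multiframe_predictions(per_context_logprobs, num_frames):
    """Combine per-frame state predictions from overlapping context windows.

    per_context_logprobs: list of length num_frames; entry t is an array of
        shape (K, S) holding log-probabilities over S HMM states for the K
        frames predicted when the input window is centred on frame t
        (row k corresponds to frame t + k - K//2, with K odd).
    Returns an array of shape (num_frames, S): for each frame, the
    renormalised geometric mean (product of experts) of all predictions
    made for it by the different contexts.
    """
    K, S = per_context_logprobs[0].shape
    half = K // 2
    sum_logp = np.zeros((num_frames, S))
    counts = np.zeros((num_frames, 1))
    for t, logp in enumerate(per_context_logprobs):
        for k in range(K):
            f = t + k - half           # absolute frame index this row predicts
            if 0 <= f < num_frames:    # drop predictions that fall off the utterance
                sum_logp[f] += logp[k]
                counts[f, 0] += 1
    avg_logp = sum_logp / counts       # average log-probs = geometric mean
    # renormalise each frame's distribution in the log domain
    avg_logp -= np.logaddexp.reduce(avg_logp, axis=1, keepdims=True)
    return avg_logp

# Toy usage: 10 frames, each window predicting K=5 frames over S=8 states.
rng = np.random.default_rng(0)
logits = rng.normal(size=(10, 5, 8))
logp = logits - np.logaddexp.reduce(logits, axis=2, keepdims=True)
combined = combine_multiframe_predictions(list(logp), num_frames=10)
```

In a standard hybrid setup, the combined per-frame log-probabilities would then replace the usual single-softmax scores fed to the HMM decoder (after the customary division by state priors).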
