Classifying Unprompted Speech by Retraining LSTM Nets

We apply Long Short-Term Memory (LSTM) recurrent neural networks to a large corpus of unprompted speech: the German part of the VERBMOBIL corpus. By training first on a fraction of the data, then retraining on another fraction, we both reduce time costs and significantly improve recognition rates. For comparison, we report recognition rates of Hidden Markov Models (HMMs) on the same corpus and provide a promising extrapolation for HMM-LSTM hybrids.
