Acoustic Modeling Using Deep Belief Networks

Gaussian mixture models are currently the dominant technique for modeling the emission distribution of hidden Markov models for speech recognition. We show that better phone recognition on the TIMIT dataset can be achieved by replacing Gaussian mixture models by deep neural networks that contain many layers of features and a very large number of parameters. These networks are first pre-trained as a multi-layer generative model of a window of spectral feature vectors without making use of any discriminative information. Once the generative pre-training has designed the features, we perform discriminative fine-tuning using backpropagation to adjust the features slightly to make them better at predicting a probability distribution over the states of monophone hidden Markov models.

[1]  Peter F. Brown,et al.  The acoustic-modeling problem in automatic speech recognition , 1987 .

[2]  Hsiao-Wuen Hon,et al.  Speaker-independent phone recognition using hidden Markov models , 1989, IEEE Trans. Acoust. Speech Signal Process..

[3]  Judea Pearl,et al.  Probabilistic reasoning in intelligent systems - networks of plausible inference , 1991, Morgan Kaufmann series in representation and reasoning.

[4]  Yoshua Bengio,et al.  Artificial neural networks and their application to sequence recognition , 1991 .

[5]  Radford M. Neal Connectionist Learning of Belief Networks , 1992, Artif. Intell..

[6]  Mari Ostendorf,et al.  Fast algorithms for phone classification and recognition using segment-based models , 1992, IEEE Trans. Signal Process..

[7]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[8]  Jont B. Allen,et al.  How do humans process and recognize speech? , 1994, IEEE Trans. Speech Audio Process..

[9]  Anthony J. Robinson,et al.  An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[10]  Geoffrey E. Hinton,et al.  The "wake-sleep" algorithm for unsupervised neural networks. , 1995, Science.

[11]  James R. Glass,et al.  Heterogeneous measurements and multiple classifiers for speech recognition , 1998, ICSLP.

[12]  Teuvo Kohonen,et al.  Speech recognition: a hybrid approach , 1998 .

[13]  Hynek Hermansky,et al.  TRAPS - classifiers of temporal patterns , 1998, ICSLP.

[14]  Francis Jack Smith,et al.  Improved phone recognition using Bayesian triphone models , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[15]  N. Shinkura,et al.  Pushing the envelope: chromatin boundaries at the nuclear pore. , 2002, Molecular cell.

[16]  Geoffrey E. Hinton Training Products of Experts by Minimizing Contrastive Divergence , 2002, Neural Computation.

[17]  Christopher K. I. Williams,et al.  An analysis of contrastive divergence learning in gaussian boltzmann machines , 2002 .

[18]  N. Morgan,et al.  Pushing the envelope - aside [speech recognition] , 2005, IEEE Signal Processing Magazine.

[19]  Miguel Á. Carreira-Perpiñán,et al.  On Contrastive Divergence Learning , 2005, AISTATS.

[20]  Pavel Matejka,et al.  Hierarchical Structures of Neural Networks for Phoneme Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[21]  Yoshua Bengio,et al.  Greedy Layer-Wise Training of Deep Networks , 2006, NIPS.

[22]  Geoffrey E. Hinton,et al.  Reducing the Dimensionality of Data with Neural Networks , 2006, Science.

[23]  Dong Yu,et al.  Structured speech modeling , 2006, IEEE Transactions on Audio, Speech, and Language Processing.

[24]  Yee Whye Teh,et al.  A Fast Learning Algorithm for Deep Belief Nets , 2006, Neural Computation.

[25]  Eric Fosler-Lussier,et al.  Combining phonetic attributes using conditional random fields , 2006, INTERSPEECH.

[26]  Lawrence K. Saul,et al.  Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition , 2006, 2006 IEEE International Conference on Acoustics Speech and Signal Processing Proceedings.

[27]  Yoshua Bengio,et al.  An empirical evaluation of deep architectures on problems with many factors of variation , 2007, ICML '07.

[28]  Dong Yu,et al.  Use of Differential Cepstra as Acoustic Features in Hidden Trajectory Modeling for Phonetic Recognition , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[29]  Aren Jansen,et al.  A hierarchical point process model for speech recognition , 2008, 2008 IEEE International Conference on Acoustics, Speech and Signal Processing.

[30]  Geoffrey E. Hinton,et al.  Using matrices to model symbolic relationships , 2008 .

[31]  Geoffrey E. Hinton,et al.  Using matrices to model symbolic relationship , 2008, NIPS.

[32]  Geoffrey E. Hinton,et al.  Using fast weights to improve persistent contrastive divergence , 2009, ICML '09.

[33]  Volodymyr Mnih,et al.  CUDAMat: a CUDA-based matrix class for Python , 2009 .

[34]  Geoffrey E. Hinton,et al.  3D Object Recognition with Deep Belief Nets , 2009, NIPS.

[35]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[36]  Tara N. Sainath,et al.  An exploration of large vocabulary tools for small vocabulary phonetic recognition , 2009, 2009 IEEE Workshop on Automatic Speech Recognition & Understanding.

[37]  Hermann Ney,et al.  A Deep Learning Approach to Machine Transliteration , 2009, WMT@EACL.

[38]  Geoffrey E. Hinton,et al.  Semantic hashing , 2009, Int. J. Approx. Reason..

[39]  Steve Renals,et al.  Speech Recognition Using Augmented Conditional Random Fields , 2009, IEEE Transactions on Audio, Speech, and Language Processing.

[40]  Geoffrey E. Hinton,et al.  Learning to Detect Roads in High-Resolution Aerial Images , 2010, ECCV.

[41]  Geoffrey E. Hinton,et al.  Phone Recognition with the Mean-Covariance Restricted Boltzmann Machine , 2010, NIPS.

[42]  Dong Yu,et al.  Investigation of full-sequence training of deep belief networks for speech recognition , 2010, INTERSPEECH.

[43]  Dong Yu,et al.  Context-Dependent Pre-Trained Deep Neural Networks for Large-Vocabulary Speech Recognition , 2012, IEEE Transactions on Audio, Speech, and Language Processing.

[44]  Geoffrey E. Hinton A Practical Guide to Training Restricted Boltzmann Machines , 2012, Neural Networks: Tricks of the Trade.