On supervised learning from sequential data with applications for speech recognition

Many problems of engineering interest, for example speech recognition, can be formulated in an abstract sense as supervised learning from sequential data, where an input sequence $x_1^T = \{x_1, x_2, x_3, \ldots, x_{T-1}, x_T\}$ has to be mapped to an output sequence $y_1^T = \{y_1, y_2, y_3, \ldots, y_{T-1}, y_T\}$. This thesis gives a unified view of the abstract problem and presents some models and algorithms for improved sequence recognition and modeling performance, measured on synthetic data and on real speech data. A powerful neural network structure to deal with sequential data is the recurrent neural network (RNN), which allows one to estimate $P(y_t \mid x_1, x_2, \ldots, x_t)$, the output probability distribution at time $t$ given all previous input. The first part of this thesis presents various extensions to the basic RNN structure, which are a) a bidirectional recurrent neural network (BRNN), which allows the estimation of expressions of the form $P(y_t \mid x_1^T)$, the output at $t$ given all sequential input, for uni-modal regression and classification problems, b) an extended BRNN to directly estimate the posterior probability of a symbol sequence, $P(y_1^T \mid x_1^T)$, by modeling $P(y_t \mid y_{t-1}, y_{t-2}, \ldots, y_1, x_1^T)$ without explicit assumptions about the shape of the distribution $P(y_1^T \mid x_1^T)$, c) a BRNN to model multi-modal input data that can be described by Gaussian mixture distributions conditioned on an output vector sequence, $P(x_t \mid y_1^T)$, assuming that neighboring $x_t, x_{t+1}$ are conditionally independent, and d) an extension to c) which removes the independence assumption by modeling $P(x_t \mid x_{t-1}, x_{t-2}, \ldots, x_1, y_1^T)$ to estimate the likelihood $P(x_1^T \mid y_1^T)$ of a given output sequence without any explicit approximations about the use of context. The second part of this thesis describes the details of a fast and memory-efficient one-pass stack decoder for speech recognition to perform the search for the most probable word sequence.
The use of this decoder, which can handle arbitrary-order N-gram language models and arbitrary-order context-dependent acoustic models with full cross-word expansion, led to the best reported recognition results on the standard test set of a widely used Japanese newspaper dictation task.
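
To make item a) concrete, the following is a minimal sketch of a BRNN forward pass for classification, not the implementation used in the thesis. It assumes tanh hidden units, a softmax output layer, and randomly initialized weights; the names `brnn_forward`, `init_params`, and all weight names are hypothetical. One recurrent pass runs forward in time and one runs backward, and the output at each $t$ combines both hidden states, so it can depend on the whole input sequence $x_1^T$.

```python
import numpy as np


def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def init_params(d_in, n_hidden, n_classes, seed=0, scale=0.5):
    # Random illustrative weights (assumption: no training is shown here).
    rng = np.random.default_rng(seed)
    return {
        "W_f": scale * rng.standard_normal((d_in, n_hidden)),      # input -> forward state
        "U_f": scale * rng.standard_normal((n_hidden, n_hidden)),  # forward recurrence
        "W_b": scale * rng.standard_normal((d_in, n_hidden)),      # input -> backward state
        "U_b": scale * rng.standard_normal((n_hidden, n_hidden)),  # backward recurrence
        "V_f": scale * rng.standard_normal((n_hidden, n_classes)),
        "V_b": scale * rng.standard_normal((n_hidden, n_classes)),
        "c": np.zeros(n_classes),
    }


def brnn_forward(x, p):
    """x has shape (T, d_in); returns (T, n_classes) estimates of P(y_t | x_1^T)."""
    T = x.shape[0]
    n_hidden = p["U_f"].shape[0]
    h_f = np.zeros((T, n_hidden))
    h_b = np.zeros((T, n_hidden))

    # Forward states: h_f[t] summarizes x_1 .. x_t.
    state = np.zeros(n_hidden)
    for t in range(T):
        state = np.tanh(x[t] @ p["W_f"] + state @ p["U_f"])
        h_f[t] = state

    # Backward states: h_b[t] summarizes x_t .. x_T.
    state = np.zeros(n_hidden)
    for t in reversed(range(T)):
        state = np.tanh(x[t] @ p["W_b"] + state @ p["U_b"])
        h_b[t] = state

    # The output at t sees both summaries, hence the full sequence.
    return softmax(h_f @ p["V_f"] + h_b @ p["V_b"] + p["c"])


if __name__ == "__main__":
    params = init_params(d_in=3, n_hidden=8, n_classes=4)
    x = np.random.default_rng(1).standard_normal((5, 3))
    print(brnn_forward(x, params).shape)
```

In a unidirectional RNN, changing the last input $x_T$ cannot affect the output at $t = 1$; in this sketch it does, through the backward states.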


[39]  Michael I. Jordan,et al.  Hierarchical Mixtures of Experts and the EM Algorithm , 1994, Neural Computation.

[40]  Yoshua Bengio,et al.  Learning long-term dependencies with gradient descent is difficult , 1994, IEEE Trans. Neural Networks.

[41]  S. Srihari Mixture Density Networks , 1994 .

[42]  Lalit R. Bahl,et al.  A tree search strategy for large-vocabulary continuous speech recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[43]  Jj Odell,et al.  The Use of Context in Large Vocabulary Speech Recognition , 1995 .

[44]  Anthony J. Robinson,et al.  Context-Dependent Classes in a Hybrid Recurrent Network-HMM Speech Recognition System , 1995, NIPS.

[45]  Yoshua Bengio,et al.  Pattern Recognition and Neural Networks , 1995 .

[46]  Hermann Ney,et al.  Search Strategies For Large-Vocabulary Continuous-Speech Recognition , 1995 .

[47]  Peter Beyerlein,et al.  Hamming distance approximation for a fast log-likelihood computation for mixture densities , 1995, EUROSPEECH.

[48]  Steve Renals,et al.  DECODER TECHNOLOGY FOR CONNECTIONIST LARGE VOCABULARY SPEECH RECOGNITION , 1995 .

[49]  Steve Renals,et al.  Efficient search using posterior phone probability estimates , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[50]  Anthony J. Robinson,et al.  Forward-backward retraining of recurrent neural networks , 1995, NIPS.

[51]  Robert A. Jacobs,et al.  Methods For Combining Experts' Probability Assessments , 1995, Neural Computation.

[52]  Steve Renals,et al.  THE USE OF RECURRENT NEURAL NETWORKS IN CONTINUOUS SPEECH RECOGNITION , 1996 .

[53]  Mosur Ravishankar,et al.  Efficient Algorithms for Speech Recognition. , 1996 .

[54]  Yoshua Bengio,et al.  Input-output HMMs for sequence processing , 1996, IEEE Trans. Neural Networks.

[55]  Mari Ostendorf,et al.  From HMMS to Segment Models: Stochastic Modeling for CSR , 1996 .

[56]  M. Schuster FAST K-MEANS VECTOR QUANTIZER FOR VERY LARGE AMOUNTS OF DATA , 1996 .

[57]  G. McLachlan,et al.  The EM algorithm and extensions , 1996 .

[58]  Long Nguyen,et al.  Multiple-Pass Search Strategies , 1996 .

[59]  M. Schuster Learning out of time series with an extended recurrent neural network , 1996, Neural Networks for Signal Processing VI. Proceedings of the 1996 IEEE Signal Processing Society Workshop.

[60]  Mei-Yuh Hwang,et al.  Improvements on the pronunciation prefix tree search organization , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[61]  Yoshinori Sagisaka,et al.  Reduction of Number of Word Hypotheses for Large Vocabulary Continuous Speech Recognition , 1996 .

[62]  Gerhard Rigoll,et al.  Fast online video image sequence recognition with statistical methods , 1996, 1996 IEEE International Conference on Acoustics, Speech, and Signal Processing Conference Proceedings.

[63]  Gerhard Rigoll,et al.  A new approach to video sequence recognition based on statistical methods , 1996, Proceedings of 3rd IEEE International Conference on Image Processing.

[64]  Herbert Gish,et al.  Parametric trajectory models for speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[65]  Alexander H. Waibel,et al.  Context-dependent hybrid HME/HMM speech recognition using polyphone clustering decision trees , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[66]  Frederick Jelinek,et al.  Statistical methods for speech recognition , 1997 .

[67]  A. Kosmala,et al.  High Performance Gesture Recognition Using Probabilistic Neural Networks and Hidden Markov Models , 1997 .

[68]  F. Alleva Search organization in the Whisper continuous speech recognition system , 1997, 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings.

[69]  Yoshinori Sagisaka,et al.  Segment boundary estimation using recurrent neural networks , 1997, EUROSPEECH.

[70]  MSc PhD Adrian J. Shepherd BA Second-Order Methods for Neural Networks , 1997, Perspectives in Neural Computing.

[71]  Kazumi Saito,et al.  Partial BFGS Update and Efficient Step-Length Calculation for Three-Layer Neural Networks , 1997, Neural Computation.

[72]  Hermann Ney,et al.  A word graph algorithm for large vocabulary continuous speech recognition , 1994, Comput. Speech Lang..

[73]  Kuldip K. Paliwal,et al.  Bidirectional recurrent neural networks , 1997, IEEE Trans. Signal Process..

[74]  Kuldip K. Paliwal,et al.  Model parameter estimation for mixture density polynomial segment models , 1998, Comput. Speech Lang..

[75]  Michael Finke,et al.  ACID/HNN: clustering hierarchies of neural networks for context-dependent connectionist acoustic modeling , 1998, Proceedings of the 1998 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP '98 (Cat. No.98CH36181).

[76]  Nobuaki Minematsu,et al.  Sharable software repository for Japanese large vocabulary continuous speech recognition , 1998, ICSLP.

[77]  M. Schuster Neural networks for speech processing , 1998 .

[78]  Zhengyou Zhang,et al.  Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron , 1998, Proceedings Third IEEE International Conference on Automatic Face and Gesture Recognition.

[79]  Li Deng,et al.  Initial evaluation of hidden dynamic models on conversational speech , 1999, 1999 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings. ICASSP99 (Cat. No.99CH36258).

[80]  S. Ortmanns,et al.  Progress in dynamic programming search for LVCSR , 1997, Proceedings of the IEEE.