A Survey on Automatic Speech Recognition with an Illustrative Example on Continuous Speech Recognition of Mandarin

For the past two decades, research in speech recognition has been intensively carried out worldwide, spurred on by advances in signal processing, algorithms, architectures, and hardware. Speech recognition systems have been developed for a wide variety of applications, ranging from small vocabulary keyword recognition over dial-up telephone lines, to medium size vocabulary voice interactive command and control systems on personal computers, to large vocabulary speech dictation, spontaneous speech understanding, and limited-domain speech translation. In this paper we review some of the key advances in several areas of automatic speech recognition. We also illustrate, by examples, how these key advances can be used for continuous speech recognition of Mandarin. Finally we elaborate the requirements in designing successful real-world applications and address technical challenges that need to be harnessed in order to reach the ultimate goal of providing an easy-to-use, natural, and flexible voice interface between people and machines.

[1]  S. Furui,et al.  Unsupervised speaker adaptation method based on hierarchical spectral clustering , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[2]  Shigeki Sagayama,et al.  A successive state splitting algorithm for efficient allophone modeling , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  Frank K. Soong,et al.  A Tree.Trellis Based Fast Search for Finding the N Best Sentence Hypotheses in Continuous Speech Recognition , 1990, HLT.

[4]  Richard C. Rose,et al.  Task independent wordspotting using decision tree based allophone clustering , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Frank K. Soong,et al.  A segment model based approach to speech recognition , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[6]  Chin-Hui Lee,et al.  A minimax classification approach with application to robust speech recognition , 1993, IEEE Trans. Speech Audio Process..

[7]  Kai-Fu Lee,et al.  Vocabulary learning and environment normalization in vocabulary-independent speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  P. Mermelstein,et al.  Speech sounds and features , 1975, Proceedings of the IEEE.

[9]  R. Schwartz,et al.  The N-best algorithms: an efficient and exact procedure for finding the N most likely sentence hypotheses , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[10]  Chin-Hui Lee,et al.  Large vocabulary speech recognition using subword units , 1993, Speech Commun..

[11]  Biing-Hwang Juang,et al.  A study on speaker adaptation of the parameters of continuous density hidden Markov models , 1991, IEEE Trans. Signal Process..

[12]  Chin-Hui Lee,et al.  Bayesian learning for hidden Markov model with Gaussian mixture state observation densities , 1991, Speech Commun..

[13]  Roger K. Moore,et al.  Hidden Markov model decomposition of speech and noise , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[14]  Biing-Hwang Juang,et al.  Robust utterance verification for connected digits recognition , 1995, 1995 International Conference on Acoustics, Speech, and Signal Processing.

[15]  Mari Ostendorf,et al.  ML estimation of a stochastic linear system with the EM algorithm and its application to speech recognition , 1993, IEEE Trans. Speech Audio Process..

[16]  Anthony J. Robinson,et al.  An application of recurrent nets to phone probability estimation , 1994, IEEE Trans. Neural Networks.

[17]  Lalit R. Bahl,et al.  Maximum mutual information estimation of hidden Markov model parameters for speech recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18]  Richard M. Stern,et al.  Efficient joint compensation of speech for the effects of additive noise and linear filtering , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[20]  Lawrence R. Rabiner,et al.  A segmental k-means training procedure for connected word recognition , 1986, AT&T Technical Journal.

[21]  Chin-Hui Lee,et al.  On-line adaptive learning of the continuous density hidden Markov model based on approximate recursive Bayes estimate , 1997, IEEE Trans. Speech Audio Process..

[22]  Mari Ostendorf,et al.  A stochastic segment model for phoneme-based continuous speech recognition , 1989, IEEE Trans. Acoust. Speech Signal Process..

[23]  Yunxin Zhao A new speaker adaptation technique using very short calibration speech , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[24]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[25]  S. Seneff A joint synchrony/mean-rate model of auditory speech processing , 1990 .

[26]  J. Flanagan Speech Analysis, Synthesis and Perception , 1971 .

[27]  Stephen Cox,et al.  Unsupervised speaker adaptation by probabilistic spectrum fitting , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[28]  Biing-Hwang Juang,et al.  Fundamentals of speech recognition , 1993, Prentice Hall signal processing series.

[29]  Andrej Ljolje,et al.  Optimal speech recognition using phone recognition and lexical access , 1992, ICSLP.

[30]  George Zavaliagkos,et al.  A hybrid segmental neural net/hidden Markov model system for continuous speech recognition , 1994, IEEE Trans. Speech Audio Process..

[31]  Keh-Yih Su,et al.  On jointly learning the parameters in a character-synchronous integrated speech and language model , 1996, IEEE Trans. Speech Audio Process..

[32]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[33]  Li Deng,et al.  Speech recognition using the atomic speech units constructed from overlapping articulatory features , 1994, EUROSPEECH.

[34]  Shigeki Sagayama,et al.  Vector field smoothing principle for speaker adaptation , 1992, ICSLP.

[35]  Frederick Jelinek,et al.  Interpolated estimation of Markov source parameters from sparse data , 1980 .

[36]  Chin-Hui Lee,et al.  Acoustic modeling for large vocabulary speech recognition , 1990 .

[37]  Michael Picheny,et al.  Multonic Markov word models for large vocabulary continuous speech recognition , 1993, IEEE Trans. Speech Audio Process..

[38]  Biing-Hwang Juang,et al.  Minimum error rate training based on N-best string models , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[39]  Salvatore D. Morgera,et al.  An improved MMIE training algorithm for speaker-independent, small vocabulary, continuous speech recognition , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[40]  Chin-Hui Lee,et al.  Segmental GPD training of HMM based speech recognizer , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[41]  Yifan Gong,et al.  Stochastic trajectory modeling for speech recognition , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[42]  Xuedong Huang,et al.  Semi-continuous hidden Markov models for speech signals , 1990 .

[43]  Chiu-yu Tseng,et al.  Golden Mandarin (I)-A real-time Mandarin speech dictation machine for Chinese language with very large vocabulary , 1993, IEEE Trans. Speech Audio Process..

[44]  Biing-Hwang Juang,et al.  Discriminative feature extraction for speech recognition , 1993, Neural Networks for Signal Processing III - Proceedings of the 1993 IEEE-SP Workshop.

[45]  Douglas B. Paul,et al.  Algorithms for an Optimal A* Search and Linearizing the Search in the Stack Decoder* , 1991, HLT.

[46]  Biing-Hwang Juang,et al.  A study on task-independent subword selection and modeling for speech recognition , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[47]  Stan Davis,et al.  Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Se , 1980 .

[48]  Oded Ghitza,et al.  Auditory neural feedback as a basis for speech processing , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[49]  Chin-Hui Lee,et al.  Automatic recognition of keywords in unconstrained speech using hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..

[50]  Sadaoki Furui,et al.  Speaker-independent isolated word recognition using dynamic features of speech spectrum , 1986, IEEE Trans. Acoust. Speech Signal Process..

[51]  Biing-Hwang Juang,et al.  A vocabulary independent discriminatively trained method for rejection of non-keywords in sub word based speech recognition , 1995, EUROSPEECH.

[52]  Y R Wang,et al.  Tone recognition of continuous Mandarin speech assisted with prosodic information. , 1994, The Journal of the Acoustical Society of America.

[53]  H. Bourlard,et al.  Links Between Markov Models and Multilayer Perceptrons , 1990, IEEE Trans. Pattern Anal. Mach. Intell..

[54]  Lin-Shan Lee,et al.  Tone recognition for continuous Mandarin speech with limited training data using selected context-dependent hidden Markov models , 1994 .

[55]  S. J. Young,et al.  Tree-based state tying for high accuracy acoustic modelling , 1994 .

[56]  Frederick Jelinek,et al.  The development of an experimental discrete dictation recognizer , 1985, Proceedings of the IEEE.

[57]  Lin-Shan Lee,et al.  A new framework for recognition of Mandarin syllables with tones using sub-syllabic units , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[58]  Chin-Hui Lee,et al.  Speech recognition using weighted HMM and subspace projection approaches , 1994, IEEE Trans. Speech Audio Process..

[59]  B.-H. Juang,et al.  Maximum-likelihood estimation for mixture multivariate stochastic observations of Markov chains , 1985, AT&T Technical Journal.

[60]  Michael Riley,et al.  A statistical model for generating pronunciation networks , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[61]  Sarangarajan Parthasarathy,et al.  On automatic estimation of articulatory parameters in a text-to-speech system , 1992 .

[62]  Chin-Hui Lee,et al.  Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains , 1994, IEEE Trans. Speech Audio Process..

[63]  Louis A. Liporace,et al.  Maximum likelihood estimation for multivariate observations of Markov sources , 1982, IEEE Trans. Inf. Theory.

[64]  H. Ney,et al.  Improvements in beam search for 10000-word continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[65]  Aaron E. Rosenberg,et al.  Improved acoustic modeling for large vocabulary continuous speech recognition , 1992 .

[66]  Douglas A. Reynolds,et al.  Integrated models of signal and background with application to speaker identification in noise , 1994, IEEE Trans. Speech Audio Process..

[67]  Vassilios Digalakis,et al.  Speaker adaptation using constrained estimation of Gaussian mixtures , 1995, IEEE Trans. Speech Audio Process..

[68]  Jerome R. Bellegarda,et al.  Tied mixture continuous parameter modeling for speech recognition , 1990, IEEE Trans. Acoust. Speech Signal Process..

[69]  Hervé Bourlard,et al.  Connectionist Speech Recognition: A Hybrid Approach , 1993 .

[70]  Biing-Hwang Juang,et al.  Signal bias removal for robust telephone based speech recognition in adverse environments , 1994, Proceedings of ICASSP '94. IEEE International Conference on Acoustics, Speech and Signal Processing.

[71]  D. O'Shaughnessy,et al.  A*-admissible heuristics for rapid lexical access , 1991, [Proceedings] ICASSP 91: 1991 International Conference on Acoustics, Speech, and Signal Processing.

[72]  Steve Young,et al.  Parallel model combination for speech recognition in noise , 1993 .

[73]  Biing-Hwang Juang,et al.  Discriminative learning for minimum error classification [pattern recognition] , 1992, IEEE Trans. Signal Process..

[74]  Philip C. Woodland,et al.  Speaker adaptation of continuous density HMMs using multivariate linear regression , 1994, ICSLP.

[75]  Chin-Hui Lee,et al.  A maximum-likelihood approach to stochastic matching for robust speech recognition , 1996, IEEE Trans. Speech Audio Process..

[76]  Chin-Hui Lee,et al.  Simultaneous ANN feature and HMM recognizer design using string-based minimum classification error (MCE) training , 1996, Proceeding of Fourth International Conference on Spoken Language Processing. ICSLP '96.

[77]  Lalit R. Bahl,et al.  A fast approximate acoustic match for large vocabulary speech recognition , 1989, IEEE Trans. Speech Audio Process..

[78]  Richard M. Stern,et al.  Environmental robustness in automatic speech recognition , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[79]  Mitch Weintraub,et al.  Large-vocabulary dictation using SRI's DECIPHER speech recognition system: progressive search techniques , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[80]  Hermann Ney,et al.  Dynamic programming parsing for context-free grammars in continuous speech recognition , 1991, IEEE Trans. Signal Process..

[81]  R. Schwartz,et al.  Rapid speaker adaptation using a probabilistic spectral mapping , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[82]  Hynek Hermansky,et al.  Evaluation and optimization of perceptually-based ASR front-end , 1993, IEEE Trans. Speech Audio Process..

[83]  Michael Picheny,et al.  The metamorphic algorithm: a speaker mapping approach to data augmentation , 1994, IEEE Trans. Speech Audio Process..

[84]  Biing-Hwang Juang,et al.  New discriminative training algorithms based on the generalized probabilistic descent method , 1991, Neural Networks for Signal Processing Proceedings of the 1991 IEEE Workshop.

[85]  Li Deng,et al.  A stochastic model of speech incorporating hierarchical nonstationarity , 1993, IEEE Trans. Speech Audio Process..

[86]  S.E. Levinson,et al.  Structural methods in automatic speech recognition , 1985, Proceedings of the IEEE.

[87]  Chin-Hui Lee,et al.  Acoustic modeling of context dependent units, for large vocabulary speech recognition in Spanish , 1995, EUROSPEECH.

[88]  Victor Zue,et al.  The MIT SUMMIT Speech Recognition System: A Progress Report , 1989, HLT.