Hidden Markov Models for Speech Recognition — Strengths and Limitations

The use of hidden Markov models for speech recognition has become predominant for the last several years, as evidenced by the number of published papers and talks at major speech conferences. The reasons why this method has become so popular are the inherent statistical (mathematically precise) framework, the ease and availability of training algorithms for estimating the parameters of the models from finite training sets of speech data, the flexibility of the resulting recognition system where one can easily change the size, type, or architecture of the models to suit particular words, sounds etc., and the ease of implementation of the overall recognition system. However, although hidden Markov model technology has brought speech recognition system performance to new high levels for a variety of applications, there remain some fundamental areas where aspects of the theory are either inadequate for speech, or for which the assumptions that are made do not apply. Examples of such areas range from the fundamental modeling assumption, i.e. that a maximum likelihood estimate of the model parameters provides the best system performance, to issues involved with inadequate training data which leads to the concepts of parameter tying across states, deleted interpolation and other smoothing methods, etc. Other aspects of the basic hidden Markov modeling methodology which are still not well understood include; ways of integrating new features (e.g. prosodic versus spectral features) into the framework in a consistent and meaningful way; the way to properly model sound durations (both within a state and across states of a model); the way to properly use the information in state transitions; and finally the way in which models can be split or clustered as warranted by the training data. It is the purpose of this paper to examine each of these strengths and limitations and discuss how they affect overall performance of a typical speech recognition system.

[1]  Biing-Hwang Juang,et al.  The segmental K-means algorithm for estimating parameters of hidden Markov models , 1990, IEEE Trans. Acoust. Speech Signal Process..

[2]  Louis A. Liporace,et al.  Maximum likelihood estimation for multivariate observations of Markov sources , 1982, IEEE Trans. Inf. Theory.

[3]  Stephen E. Levinson,et al.  Continuously variable duration hidden Markov models for automatic speech recognition , 1986 .

[4]  D. B. Paul Training of HMM recognizers by simulated annealing , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[5]  Jr. G. Forney,et al.  The viterbi algorithm , 1973 .

[6]  Wayne A. Lea,et al.  Trends in Speech Recognition , 1980 .

[7]  James A. Cadzow ARMA Modeling of Time Series , 1982, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[8]  Biing-Hwang Juang,et al.  HMM clustering for connected word recognition , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[9]  Robert M. Gray,et al.  An Algorithm for Vector Quantizer Design , 1980, IEEE Trans. Commun..

[10]  George R. Doddington,et al.  Frame-specific statistical features for speaker independent speech recognition , 1986, IEEE Trans. Acoust. Speech Signal Process..

[11]  L. Baum,et al.  A Maximization Technique Occurring in the Statistical Analysis of Probabilistic Functions of Markov Chains , 1970 .

[12]  A. Poritz,et al.  On hidden Markov models in isolated word recognition , 1986, ICASSP '86. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  John E. Markel,et al.  Linear Prediction of Speech , 1976, Communication and Cybernetics.

[14]  L. R. Rabiner,et al.  On the application of vector quantization and hidden Markov models to speaker-independent, isolated word recognition , 1983, The Bell System Technical Journal.

[15]  F. Jelinek,et al.  Continuous speech recognition by statistical methods , 1976, Proceedings of the IEEE.

[16]  L. Baum,et al.  An inequality and associated maximization technique in statistical estimation of probabilistic functions of a Markov process , 1972 .

[17]  Bruce Lowerre,et al.  The Harpy speech understanding system , 1990 .

[18]  Leo Breiman,et al.  Classification and Regression Trees , 1984 .

[19]  Chin-Hui Lee,et al.  Acoustic modeling for large vocabulary speech recognition , 1990 .

[20]  B. Atal,et al.  Speech analysis and synthesis by linear prediction of the speech wave. , 1971, The Journal of the Acoustical Society of America.

[21]  L. R. Rabiner,et al.  An introduction to the application of the theory of probabilistic functions of a Markov process to automatic speech recognition , 1983, The Bell System Technical Journal.

[22]  S. Furui Speaker-Independent Isolated Word Recognition Based on Dynamics-Emphasized Cepstrum , 1986 .

[23]  L. R. Rabiner,et al.  On the application of energy contours to the recognition of connected word sequences , 1984, AT&T Bell Laboratories Technical Journal.

[24]  F. Jelinek Fast sequential decoding algorithm using a stack , 1969 .

[25]  Lawrence R. Rabiner,et al.  A tutorial on hidden Markov models and selected applications in speech recognition , 1989, Proc. IEEE.

[26]  Michael Picheny,et al.  On a model-robust training method for speech recognition , 1988, IEEE Trans. Acoust. Speech Signal Process..

[27]  J.B. Allen,et al.  A unified approach to short-time Fourier analysis and synthesis , 1977, Proceedings of the IEEE.

[28]  Vishwa Gupta,et al.  Integration of acoustic information in a large vocabulary word recognizer , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[29]  Lalit R. Bahl,et al.  A Maximum Likelihood Approach to Continuous Speech Recognition , 1983, IEEE Transactions on Pattern Analysis and Machine Intelligence.

[30]  J. Makhoul,et al.  Linear prediction: A tutorial review , 1975, Proceedings of the IEEE.

[31]  Lalit R. Bahl,et al.  A new algorithm for the estimation of hidden Markov model parameters , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[32]  C. J. Wellekens,et al.  Explicit time correlation in hidden Markov models for speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[33]  Xuedong Huang,et al.  Unified techniques for vector quantization and hidden Markov modeling using semi-continuous models , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[34]  John Makhoul,et al.  BYBLOS: The BBN continuous speech recognition system , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[35]  R. Bakis Continuous speech recognition via centisecond acoustic states , 1976 .