Context-Dependent Connectionist Probability Estimation in a Hybrid HMM-Neural Net Speech Recognition System

In this paper we present a training method and a network architecture for the estimation of context-dependent observation probabilities in the framework of a hybrid Hidden Markov Model (HMM) / Multi-Layer Perceptron (MLP) speaker-independent continuous speech recognition system. The context-dependent modeling approach we present here computes the HMM context-dependent observation probabilities using a Bayesian factorization in terms of scaled posterior phone probabilities, which are computed with a set of MLPs, one for every relevant context. The proposed network architecture shares the input-to-hidden layer among the set of context-dependent MLPs in order to reduce the number of independent parameters. Multiple states for phone models, with different context dependence for each state, are used to model the different context effects at the beginning and end of phonetic segments. A new training procedure that "smooths" networks with different degrees of context dependence is proposed in order to obtain a robust estimate of the context-dependent probabilities. We have used this new architecture to model generalized biphone phonetic contexts. Tests with the speaker-independent DARPA Resource Management database have shown average reductions in word error rates of 20% in the word-pair grammar case, and 11% in the no-grammar case, compared to our earlier context-independent HMM/MLP hybrid.
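The Bayesian factorization described above replaces the HMM emission probability with a scaled posterior: by Bayes' rule, p(x | q, c) ∝ P(q | x, c) / P(q | c), where the frame probability p(x) is dropped because it cancels in decoding. The sketch below, assuming hypothetical posterior and prior arrays rather than the authors' actual networks and data, illustrates that factorization and a simple linear smoothing between context-dependent and context-independent estimates (the fixed weight `lam` is a placeholder; the paper trains the smoothing rather than fixing it by hand).

```python
import numpy as np

# Hedged sketch (not the authors' code): scaled-likelihood computation
# for a hybrid HMM/MLP system.
# Symbols: q = phone class, c = context class, x = one acoustic frame.
# p(x | q, c) ∝ P(q | x, c) / P(q | c); p(x) cancels during decoding.

rng = np.random.default_rng(0)
n_phones, n_contexts = 5, 3

# Stand-in MLP outputs for one frame: one posterior vector P(q | x, c)
# per context-dependent network; rows sum to 1 like softmax outputs.
posteriors = rng.random((n_contexts, n_phones))
posteriors /= posteriors.sum(axis=1, keepdims=True)

# Stand-in class priors P(q | c), in practice estimated from the
# training-set alignments.
priors = rng.random((n_contexts, n_phones))
priors /= priors.sum(axis=1, keepdims=True)

# Scaled likelihoods used in place of HMM observation probabilities.
scaled_likelihoods = posteriors / priors

# Simple linear smoothing of context-dependent posteriors toward a
# context-independent estimate, for robustness on rare contexts.
ci_posterior = posteriors.mean(axis=0)  # stand-in CI network output
lam = 0.7                               # placeholder smoothing weight
smoothed = lam * posteriors + (1 - lam) * ci_posterior
```

Because the smoothed estimate is a convex combination of probability distributions, each smoothed row still sums to one and can be divided by the priors in the same way.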
