Combining TDNN and HMM in a hybrid system for improved continuous-speech recognition

The paper presents a hybrid continuous-speech recognition system that leads to improved results on the speaker dependent DARPA Resource Management task. This hybrid system, called the combined system, is based on a combination of normalized neural network output scores with hidden Markov model (HMM) emission probabilities. The neural network is trained under mean square error and the HMM is trained under maximum likelihood estimation. In theory, whatever criterion may be used, the same word error rate should be reached if enough training data is available. As this is never the case, the idea of combining two different criteria, each of them extracting complementary characteristics of the feature is interesting. A state-of-the-art HMM system will be combined with a time delay neural network (TDNN) integrated in a Viterbi framework. A hierarchical TDNN structure is described that splits training into subtasks corresponding to subsets of phonemes. This structure makes training of TDNNs on large-vocabulary tasks manageable on workstations. It will be shown that the combined system, despite the low accuracy of the hierarchical TDNN, achieves a word error rate reduction of 15% with respect to our state-of-the-art HMM system. This reduction is obtained with a 10% increase only in the number of parameters. >

[1]  L. R. Rabiner,et al.  Recognition of isolated digits using hidden Markov models with continuous mixture densities , 1985, AT&T Technical Journal.

[2]  Andreas Noll,et al.  A data-driven organization of the dynamic programming beam search for continuous speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[3]  John S. Bridle,et al.  Training Stochastic Model Recognition Algorithms as Networks can Lead to Maximum Mutual Information Estimation of Parameters , 1989, NIPS.

[4]  Geoffrey E. Hinton,et al.  Phoneme recognition using time-delay neural networks , 1989, IEEE Trans. Acoust. Speech Signal Process..

[5]  Alex Waibel,et al.  Consonant recognition by modular construction of large phonemic time-delay neural networks , 1989, International Conference on Acoustics, Speech, and Signal Processing,.

[6]  Pierre Priouret,et al.  Adaptive Algorithms and Stochastic Approximations , 1990, Applications of Mathematics.

[7]  Hervé Bourlard,et al.  Connectionist Approaches to the Use of Markov Models for Speech Recognition , 1990, NIPS.

[8]  A. Waibel,et al.  Connectionist Viterbi training: a new hybrid method for continuous speech recognition , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[9]  Hermann Dr Ney Experiments on mixture-density phoneme-modelling for the speaker-independent 1000-word speech recognition DARPA task , 1990, International Conference on Acoustics, Speech, and Signal Processing.

[10]  Christian Dugast,et al.  Comparison of continuous mixture densities and TDNN in a viterbi-framework: experiments on speaker dependent DARPA RM1+ , 1991, EUROSPEECH.

[11]  Yoshua Bengio,et al.  Artificial neural networks and their application to sequence recognition , 1991 .

[12]  Steve Renals,et al.  Connectionist probability estimation in the DECIPHER speech recognition system , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[13]  Christian Dugast,et al.  Incorporating acoustic-phonetic knowledge in hybrid TDNN/HMM frameworks , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[14]  P. Haffner,et al.  Connectionist word-level classification in speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[15]  H. Ney,et al.  Improvements in beam search for 10000-word continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[16]  Hervé Bourlard,et al.  CDNN: a context dependent neural network for continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[17]  H. Ney,et al.  Linear discriminant analysis for improved large vocabulary continuous speech recognition , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[18]  Steve Austin,et al.  Speech recognition using segmental neural nets , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[19]  Tony Robinson,et al.  A real-time recurrent error propagation network word recognition system , 1992, [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  Hermann Ney,et al.  Continuous mixture densities and linear discriminant analysis for improved context-dependent acoustic models , 1993, 1993 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[21]  Hermann Ney,et al.  Improvements in beam search , 1994, ICSLP.