Word set probability boosting for improved spontaneous dialog recognition

Based on the observation that the unpredictable nature of conversational speech makes it almost impossible to reliably model sequential word constraints, the notion of word set error criteria is proposed for improved recognition of spontaneous dialogs. The single-pass adaptive boosting (AB) algorithm enables the language model weights to be tuned using the word set error criteria. In the two-pass version of the algorithm, the basic idea is to predict a set of words based on some a priori information and then perform a rescoring pass in which the probabilities of the words in the predicted word set are amplified, or boosted. An adaptive gradient descent procedure for tuning the word boosting factor is formulated; it incrementally adjusts the boost factors to maximize the accuracy of the speech recognition system outputs on held-out training data under the word set error criteria. Two novel models that predict the required word sets are presented: (i) utterance triggers, which capture within-utterance long-distance word interdependencies, and (ii) dialog triggers, which capture local temporal dialog-oriented word relations. The proposed trigger and adaptive boosting (TAB) algorithm and the single-pass adaptive boosting (AB) algorithm are experimentally tested on a subset of the TRAINS-93 spontaneous dialogs and the TRAINS-95 semispontaneous corpus, and the results are summarized.
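The two-pass idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the first pass yields an N-best list with per-word log probabilities, that boosting multiplies the probability of each predicted word by a single factor without renormalization, and that a word set error counts missed and spurious words while ignoring order. The names `boost_rescore` and `word_set_error` are hypothetical.

```python
import math


def boost_rescore(nbest, predicted_words, boost):
    """Re-rank first-pass hypotheses after boosting predicted words.

    nbest: list of (words, word_log_probs) pairs (assumed first-pass output).
    predicted_words: set of words predicted from a priori information.
    boost: multiplicative factor (> 0) applied to each predicted word's
           probability; scores are left unnormalized in this sketch.
    Returns a list of (score, words) sorted from best to worst.
    """
    log_boost = math.log(boost)
    rescored = []
    for words, log_probs in nbest:
        score = sum(
            lp + (log_boost if w in predicted_words else 0.0)
            for w, lp in zip(words, log_probs)
        )
        rescored.append((score, words))
    rescored.sort(key=lambda item: item[0], reverse=True)
    return rescored


def word_set_error(hypothesis, reference):
    """Order-insensitive error: missed plus spurious words over reference size.

    This definition is an assumption made for illustration only.
    """
    hyp_set, ref_set = set(hypothesis), set(reference)
    missed = ref_set - hyp_set
    spurious = hyp_set - ref_set
    return (len(missed) + len(spurious)) / max(len(ref_set), 1)


# Toy N-best list: the acoustically confusable "rain" narrowly wins at first.
nbest = [
    (["take", "the", "train"], [-1.0, -0.5, -2.0]),
    (["take", "the", "rain"], [-1.0, -0.5, -1.8]),
]
reference = ["take", "the", "train"]

for factor in (1.0, 2.0):
    best = boost_rescore(nbest, {"train"}, factor)[0][1]
    print(factor, best, word_set_error(best, reference))
```

In this toy case a boost factor of 1.0 leaves the first-pass ranking unchanged, while a factor of 2.0 is enough to promote the hypothesis containing the predicted word. A tuning procedure such as the gradient descent mentioned in the abstract would adjust the factor against held-out data; that step is not shown here.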
