Discriminative utterance verification for connected digits recognition

Utterance verification is an important technology in the design of user-friendly speech recognition systems. It involves recognizing keyword strings and rejecting nonkeyword strings. This paper describes a hidden Markov model-based (HMM-based) utterance verification system built on the framework of statistical hypothesis testing. Two major issues are addressed: the design of keyword scoring criteria and the design of string scoring criteria. For keyword verification, several alternative hypotheses are proposed based on the scores of antikeyword models and a general acoustic filler model. For string verification, several measures are proposed with the objective of detecting nonvocabulary word strings and possibly erroneous strings (so-called putative errors). This paper also motivates the need for discriminative hypothesis testing in verification, and one such approach, based on minimum classification error training, is investigated in detail. When the proposed verification technique was integrated into a state-of-the-art connected digit recognition system, the string error rate for valid digit strings decreased by 57% at a rejection rate of 5%. Furthermore, the system correctly rejected over 99.9% of nonvocabulary word strings.
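The hypothesis-testing view of keyword and string verification can be illustrated with a minimal sketch. It is not the scoring used in the paper: the function names, the choice of averaging the antikeyword and filler likelihoods to form the alternative hypothesis, the per-frame normalization, and the minimum-based string decision are all assumptions made for illustration.

```python
import math


def keyword_llr(log_p_keyword, log_p_anti, log_p_filler, num_frames):
    """Duration-normalized log-likelihood ratio for one keyword segment.

    Assumption: the alternative-hypothesis likelihood is the average of the
    antikeyword and filler likelihoods. All inputs are total log-likelihoods
    over the segment; num_frames is the segment length in frames.
    """
    # Log of the averaged alternative likelihood, computed stably
    # with the log-sum-exp trick.
    m = max(log_p_anti, log_p_filler)
    log_alt = (
        m
        + math.log(math.exp(log_p_anti - m) + math.exp(log_p_filler - m))
        - math.log(2.0)
    )
    # Null hypothesis (keyword present) against the alternative, per frame.
    return (log_p_keyword - log_alt) / num_frames


def verify_string(keyword_scores, threshold):
    """Accept a string only if its weakest keyword clears the threshold.

    Assumption: a minimum-based string measure, one of many possible choices.
    """
    return min(keyword_scores) >= threshold


# Hypothetical scores for a two-digit string.
scores = [
    keyword_llr(-100.0, -120.0, -120.0, 10),
    keyword_llr(-95.0, -100.0, -98.0, 12),
]
accepted = verify_string(scores, threshold=1.0)
```

In this sketch, raising the threshold rejects more strings, which trades false acceptances of nonvocabulary input against false rejections of valid digit strings. The discriminative training mentioned in the abstract would adjust the model parameters behind these scores rather than only the threshold.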
