Using spectro-temporal features to improve AFE feature extraction for ASR

Previous work has shown that spectro-temporal features reduce word error rate (WER) for automatic speech recognition (ASR) under noisy conditions. The spectro-temporal framework, however, is not the only way to process features to reduce errors caused by noise in the signal: the two-stage mel-warped Wiener filtering used in the ETSI "Advanced Front End" (AFE), now a standard front end for robust recognition, is another. Since the spectro-temporal approach can be applied to a noise-reduced spectrum, we explore whether spectro-temporal features can improve the performance of the AFE for ASR. We show that computing spectro-temporal features after AFE processing yields a 45% relative improvement over the AFE alone in clean conditions, and relative improvements of 6% to 30% in noisy conditions, on the Aurora2 clean-training setup.
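To make the pipeline concrete, the sketch below shows one plausible way to compute spectro-temporal features on top of an already noise-reduced log-mel spectrogram, using a small bank of 2D Gabor filters over the channel and time axes. This is only an illustration, not the paper's exact front end: the function names (`gabor_kernel`, `spectro_temporal_features`), the kernel size, and the modulation rates and scales are assumed for the example, and the AFE denoising itself is taken as given (the input `log_mel` array stands in for the AFE-cleaned mel spectrum).

```python
import numpy as np
from scipy.signal import fftconvolve


def gabor_kernel(rate_hz, scale_cyc_per_chan, frame_rate_hz=100.0, size=(9, 25)):
    """2D Gabor kernel over (mel channel, time frame) axes.

    rate_hz: temporal modulation frequency in Hz (assumed example values)
    scale_cyc_per_chan: spectral modulation in cycles per mel channel
    """
    n_chan, n_frames = size
    t = (np.arange(n_frames) - n_frames // 2) / frame_rate_hz   # seconds
    f = np.arange(n_chan) - n_chan // 2                          # mel channels
    F, T = np.meshgrid(f, t, indexing="ij")
    carrier = np.cos(2 * np.pi * (rate_hz * T + scale_cyc_per_chan * F))
    envelope = np.hanning(n_chan)[:, None] * np.hanning(n_frames)[None, :]
    kernel = carrier * envelope
    return kernel - kernel.mean()   # zero-mean: remove the DC response


def spectro_temporal_features(log_mel, rates_hz=(2, 4, 8, 16), scales=(0.06, 0.12, 0.25)):
    """Filter a (channels x frames) log-mel spectrogram with a bank of Gabor
    kernels; returns one response plane per (rate, scale) pair."""
    feats = []
    for rate in rates_hz:
        for scale in scales:
            k = gabor_kernel(rate, scale)
            feats.append(fftconvolve(log_mel, k, mode="same"))
    return np.stack(feats)          # shape: (n_filters, channels, frames)


# Hypothetical usage: `afe_mel_spectrum` would be the AFE-denoised mel spectrum
# (e.g. 23 channels at a 10 ms frame rate); it is not computed here.
# log_mel = np.log(afe_mel_spectrum + 1e-8)
# features = spectro_temporal_features(log_mel)
```

In a Tandem-style setup, responses like these would typically be fed to an MLP whose posteriors are then appended to, or combined with, the AFE cepstral features before HMM decoding; the specific combination scheme is outside the scope of this sketch.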
