Speech analysis and cognition using category-dependent features in a model of the central auditory system

It is well known that machines perform far worse than humans at recognizing speech and audio, especially in noisy environments. One way to address this lack of robustness is to study physiological models of the human auditory system and to adopt some of its characteristics in computers. As a first step in studying the potential benefits of an elaborate computational model of the primary auditory cortex (A1) in the central auditory system, we qualitatively and quantitatively validate the model under existing speech processing and recognition methodology. Next, we develop new insights into how to interpret the model, and reveal some advantages of its dimension expansion that could be used to improve existing speech processing and recognition methods. This is done by statistically analyzing the neural responses to various classes of speech signals and forming empirical conjectures on how cognitive information is encoded in a category-dependent manner. We also establish a theoretical framework that shows how noise and signal can be separated in the dimension-expanded cortical space. Finally, we develop new feature selection and pattern recognition methods to exploit the category-dependent encoding of noise-robust cognitive information in the cortical response. Category-dependent features are proposed as features that "specialize" in discriminating specific sets of classes. As a natural way of incorporating them into a Bayesian decision framework, we propose methods to construct hierarchical classifiers that make decisions in a two-stage process. Phoneme classification tasks using the TIMIT speech database are performed to quantitatively validate all developments in this work, and the results encourage future work in exploiting high-dimensional data with category-dependent (or class-dependent) features for improved classification or detection.
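The two-stage decision process with category-dependent features can be illustrated with a minimal sketch. This is not the method developed in this work: the nearest-mean classifier is a simple stand-in (it equals a Bayesian MAP decision only under the assumption of equal priors and identity-covariance Gaussian classes), the data are synthetic rather than cortical responses, and all names (`NearestMean`, `TwoStageClassifier`, `make_toy_data`, the feature indices) are hypothetical.

```python
import numpy as np


class NearestMean:
    """Stand-in classifier: MAP rule assuming equal priors and
    identity-covariance Gaussian classes (an assumption of this sketch)."""

    def fit(self, X, y):
        self.labels_ = np.unique(y)
        self.means_ = np.stack([X[y == c].mean(axis=0) for c in self.labels_])
        return self

    def predict(self, X):
        # Squared Euclidean distance from every sample to every class mean.
        d = ((X[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.labels_[d.argmin(axis=1)]


class TwoStageClassifier:
    """Stage 1 picks a broad category; stage 2 picks the class inside that
    category using only the features assigned to that category."""

    def __init__(self, class_to_category, category_features):
        self.class_to_category = class_to_category   # class label -> category
        self.category_features = category_features   # category -> feature indices

    def fit(self, X, y):
        cats = np.array([self.class_to_category[int(c)] for c in y])
        self.stage1_ = NearestMean().fit(X, cats)
        self.stage2_ = {}
        for cat, idx in self.category_features.items():
            in_cat = cats == cat
            self.stage2_[cat] = NearestMean().fit(X[in_cat][:, idx], y[in_cat])
        return self

    def predict(self, X):
        cats = self.stage1_.predict(X)
        out = np.empty(len(X), dtype=int)
        for cat, idx in self.category_features.items():
            in_cat = cats == cat
            if in_cat.any():
                out[in_cat] = self.stage2_[cat].predict(X[in_cat][:, idx])
        return out


def make_toy_data(n_per_class=50, seed=0):
    """Synthetic data: feature 0 separates the two categories, feature 1
    separates classes 0/1, and feature 2 separates classes 2/3."""
    rng = np.random.default_rng(seed)
    centers = np.array([
        [-5.0, -2.0, 0.0],   # class 0, category 0
        [-5.0, 2.0, 0.0],    # class 1, category 0
        [5.0, 0.0, -2.0],    # class 2, category 1
        [5.0, 0.0, 2.0],     # class 3, category 1
    ])
    X = np.concatenate(
        [c + 0.2 * rng.standard_normal((n_per_class, 3)) for c in centers]
    )
    y = np.repeat(np.arange(4), n_per_class)
    return X, y


X, y = make_toy_data()
clf = TwoStageClassifier(
    class_to_category={0: 0, 1: 0, 2: 1, 3: 1},
    category_features={0: [1], 1: [2]},
).fit(X, y)
print("training accuracy:", (clf.predict(X) == y).mean())
```

In this toy setup each second-stage classifier sees only the feature that discriminates the classes of its own category, which is the sense in which the features "specialize". A hard first-stage decision is used for brevity; errors made at stage 1 cannot be corrected at stage 2.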
