Time-frequency analysis and auditory modeling for automatic recognition of speech

Modern speech processing research may be categorized into three broad areas: statistical, physiological, and perceptual. Statistical research investigates the nature of the variability of the speech waveform from a signal processing viewpoint. This approach is concerned with processing speech to obtain measurements of speech characteristics whose variability remains manageable across a wide range of talkers, in the presence of noise, competing speakers, and the channel through which the speech is transmitted, and under the inherent interaction of the information content of speech itself (i.e., the contextual factor). Physiological research aims at constructing accurate models of the articulatory and auditory processes, which help to limit the signal space for speech processing. In the perceptual realm, work focuses on understanding the psychoacoustic, and possibly the psycholinguistic, aspects of the speech communication process that humans carry out so effortlessly. By studying this working analysis/recognition system, insights may be gained that lead to improved methods of speech processing. Conversely, by studying the limitations of this system, particularly how it reduces the information rate of the received signal through, for example, masking and adaptation, the efficiency of speech coding schemes may be improved without impairing the quality of the reconstructed speech. Thus, comprehension of speech production and perception shapes methods of speech processing, and vice versa. This paper articulates such a position, focusing on how modern time-frequency signal analysis methods could help expedite needed advances in these areas.
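
As a point of reference for the time-frequency analysis methods the abstract centers on, the following is a minimal sketch of the most common such representation, the short-time Fourier (spectrogram) analysis of a speech-like signal, written in plain NumPy. The sampling rate, frame length, and hop size are illustrative assumptions, not parameters taken from the paper.

    # Minimal short-time Fourier (spectrogram) analysis sketch.
    # Frame length, hop size, and sampling rate are illustrative assumptions.
    import numpy as np

    def spectrogram(x, fs=8000, frame_len=256, hop=64):
        """Return (times, freqs, power) for a log-power spectrogram of x."""
        window = np.hanning(frame_len)
        n_frames = 1 + (len(x) - frame_len) // hop
        frames = np.stack([x[i * hop : i * hop + frame_len] * window
                           for i in range(n_frames)])
        spectra = np.fft.rfft(frames, axis=1)            # one FFT per frame
        power = 10.0 * np.log10(np.abs(spectra) ** 2 + 1e-12)
        times = (np.arange(n_frames) * hop + frame_len / 2) / fs
        freqs = np.fft.rfftfreq(frame_len, d=1.0 / fs)
        return times, freqs, power

    # Example: one second of a synthetic two-tone, vowel-like signal.
    fs = 8000
    t = np.arange(fs) / fs
    x = np.sin(2 * np.pi * 120 * t) + 0.5 * np.sin(2 * np.pi * 720 * t)
    times, freqs, power = spectrogram(x, fs)
    print(power.shape)   # (n_frames, n_freq_bins)

The fixed window length is exactly the time-frequency resolution trade-off that motivates the more general distributions discussed in the paper.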
