Psycho-acoustics and Speech Perception

Computational models of speech pattern processing stand to benefit considerably from what is known about human sound and speech perception. Psycho-acoustics has given us insight into the limits and capabilities of peripheral hearing, mainly for simple stationary sounds. Threshold phenomena and the temporal and spectral resolution measured with such stimuli give a first indication of how the front end of a recognizer should be modeled and of the level of precision required in rule synthesis. Much less is known about the ear's sensitivity to dynamic events in complex signals, such as formant-like transitions. Once the signal becomes a syllable or a meaningful word or sentence, the ear's behavior and the brain's interpretation become more complex still. A good example is our perception of stressed and unstressed syllables, including schwas. I will claim that vowel reduction manifests itself as contextual assimilation rather than as a form of centralization, which again has implications for the phone and word models in ASR and for the coarticulation rules in a synthesizer.
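
To make the link between psycho-acoustic resolution and recognizer front ends concrete, the sketch below is a minimal, hypothetical example (Python with NumPy; all function names are my own, not taken from the work above): a mel-scaled triangular filterbank whose bandwidths widen with frequency, roughly following the ear's decreasing spectral resolution, followed by log compression as a crude stand-in for intensity scaling. It is an illustrative assumption about what "perceptually motivated" can mean in practice, not the method argued for here.

```python
# Minimal sketch of a perceptually motivated ASR front end (assumption:
# mel-scaled filterbank + log compression; illustrative only).
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale: wider at high
    frequencies, mimicking the ear's coarser spectral resolution there."""
    low, high = hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0)
    hz_points = mel_to_hz(np.linspace(low, high, n_filters + 2))
    bins = np.floor((n_fft + 1) * hz_points / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for k in range(l, c):
            fbank[i - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fbank[i - 1, k] = (r - k) / max(r - c, 1)
    return fbank

def frame_features(frame, sample_rate=16000, n_fft=512, n_filters=24):
    """Log mel-filterbank energies for one windowed frame of speech."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    fbank = mel_filterbank(n_filters, n_fft, sample_rate)
    energies = fbank @ spectrum
    return np.log(energies + 1e-10)  # log compression ~ intensity scaling

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    frame = rng.standard_normal(400)    # 25 ms at 16 kHz, dummy signal
    print(frame_features(frame).shape)  # -> (24,)
```

Note that such a front end reflects only the quasi-stationary spectral resolution mentioned above; the dynamic sensitivity to formant-like transitions discussed in the abstract is not captured by it.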
