Neural network models of sensory integration for improved vowel recognition

It is demonstrated that multiple sources of speech information can be integrated at a subsymbolic level to improve vowel recognition. Feedforward and recurrent neural networks are trained to estimate the acoustic characteristics of the vocal tract from images of the speaker's mouth. These estimates are then combined with the noise-degraded acoustic information, effectively increasing the signal-to-noise ratio and improving the recognition of the noise-degraded signals. Alternative symbolic strategies, such as direct categorization of the visual signals into vowels, are also presented. The performance of these neural networks compares favorably with human performance and with other pattern-matching and estimation techniques.
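The core fusion idea described above can be illustrated with a minimal sketch: two independent, noisy estimates of the same vocal-tract features (one acoustic, one derived from the visual channel) are averaged, which reduces the noise variance and so raises the effective signal-to-noise ratio. This is an illustrative toy, not the paper's networks; the feature values, noise levels, and helper names (`noisy`, `mean_error`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative "true" vocal-tract features (rough F1/F2 for /a/, in Hz).
true_features = np.array([730.0, 1090.0])

def noisy(x, sigma, n=2000):
    """Simulate n independent noisy observations of the feature vector x."""
    return x + rng.normal(0.0, sigma, size=(n, x.size))

acoustic = noisy(true_features, sigma=200.0)  # noise-degraded acoustic estimate
visual = noisy(true_features, sigma=200.0)    # estimate recovered from lip images

# Fusing two independent estimates by averaging halves the noise variance,
# i.e. the noise standard deviation shrinks by a factor of sqrt(2).
fused = 0.5 * (acoustic + visual)

def mean_error(est):
    """Mean Euclidean distance from the true feature vector."""
    return float(np.mean(np.linalg.norm(est - true_features, axis=1)))

print("acoustic-only error:", mean_error(acoustic))
print("fused error:        ", mean_error(fused))
```

Running the sketch shows the fused estimate landing markedly closer to the true features than either single-channel estimate, which is the statistical effect the subsymbolic integration exploits.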
