Phoneme recognition using time-delay neural networks

The authors present a time-delay neural network (TDNN) approach to phoneme recognition which is characterized by two important properties: (1) using a three-layer arrangement of simple computing units, a hierarchy can be constructed that allows for the formation of arbitrary nonlinear decision surfaces, which the TDNN learns automatically using error backpropagation; and (2) the time-delay arrangement enables the network to discover acoustic-phonetic features and the temporal relationships between them independently of position in time and therefore not blurred by temporal shifts in the input. As a recognition task, the speaker-dependent recognition of the phonemes B, D, and G in varying phonetic contexts was chosen. For comparison, several discrete hidden Markov models (HMM) were trained to perform the same task. Performance evaluation over 1946 testing tokens from three speakers showed that the TDNN achieves a recognition rate of 98.5% correct while the rate obtained by the best of the HMMs was only 93.7%. >

[1]  James K. Baker,et al.  Stochastic modeling as a means of automatic speech recognition. , 1975 .

[2]  F. Jelinek,et al.  Continuous speech recognition by statistical methods , 1976, Proceedings of the IEEE.

[3]  S. Blumstein,et al.  Perceptual invariance and onset spectra for stop consonants in different vowel environments , 1976 .

[4]  S. Blumstein,et al.  Acoustic invariance in speech production: evidence from measurements of the spectral characteristics of stop consonants. , 1979, The Journal of the Acoustical Society of America.

[5]  S. Blumstein,et al.  Perceptual invariance and onset spectra for stop consonants in different vowel environments. , 1980, The Journal of the Acoustical Society of America.

[6]  K. Shikano,et al.  LPC peak weighted spectral matching measures , 1981 .

[7]  D Kewley-Port,et al.  Time-varying features as correlates of place of articulation in stop consonants. , 1983, The Journal of the Acoustical Society of America.

[8]  Takayuki Ito,et al.  Neocognitron: A neural network model for a mechanism of visual pattern recognition , 1983, IEEE Transactions on Systems, Man, and Cybernetics.

[9]  Alex Waibel,et al.  Comparative study of nonlinear time warping techniques in isolated word speech recognition systems , 1983 .

[10]  Michael Picheny,et al.  Some experiments with large-vocabulary isolated-word sentence recognition , 1984, ICASSP.

[11]  John Makhoul,et al.  Context-dependent modeling for acoustic-phonetic recognition of continuous speech , 1985, ICASSP '85. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[12]  L. R. Rabiner,et al.  Recognition of isolated digits using hidden Markov models with continuous mixture densities , 1985, AT&T Technical Journal.

[13]  Geoffrey E. Hinton,et al.  Learning representations by back-propagating errors , 1986, Nature.

[14]  Geoffrey E. Hinton,et al.  Learning internal representations by error propagation , 1986 .

[15]  Shozo Makino,et al.  Recognition of phonemes using time-spectrum pattern , 1986, Speech Commun..

[16]  T. D. Harrison,et al.  Boltzmann machines for speech recognition , 1986 .

[17]  Jeffrey L. Elman,et al.  Interactive processes in speech perception: the TRACE model , 1986 .

[18]  Peter F. Brown,et al.  The acoustic-modeling problem in automatic speech recognition , 1987 .

[19]  Anne-Marie Derouault,et al.  Context-dependent phonetic Markov models for large vocabulary speech recognition , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[20]  R. Lippmann,et al.  An introduction to computing with neural nets , 1987, IEEE ASSP Magazine.

[21]  John Makhoul,et al.  BYBLOS: The BBN continuous speech recognition system , 1987, ICASSP '87. IEEE International Conference on Acoustics, Speech, and Signal Processing.

[22]  J J Hopfield,et al.  Neural computation by concentrating information in time. , 1987, Proceedings of the National Academy of Sciences of the United States of America.

[23]  Lokendra Shastri,et al.  Learning Phonetic Features Using Connectionist Networks , 1987, IJCAI.

[24]  Lokendra Shastri,et al.  Learned phonetic discrimination using connectionist networks , 1990, ECST.

[25]  D. Lubensky,et al.  Learning spectral-temporal dependencies using connectionist networks , 1988, ICASSP-88., International Conference on Acoustics, Speech, and Signal Processing.

[26]  Terrence J. Sejnowski,et al.  NETtalk: a parallel network that learns to read aloud , 1988 .

[27]  Kai-Fu Lee,et al.  Speaker‐independent phoneme recognition using hidden Markov models , 1988 .

[28]  D Zipser,et al.  Learning the hidden structure of speech. , 1988, The Journal of the Acoustical Society of America.

[29]  Geoffrey E. Hinton 20 – CONNECTIONIST LEARNING PROCEDURES1 , 1990 .