Abstract

As is well known, the acquisition of speech skills by humans involves the “simultaneous” learning of speech perception and speech production in an environment of speakers who have already acquired these skills. By contrast, in speech processing by machines, speech recognition and speech synthesis are studied and implemented separately, and different methodologies have been developed for each. The present paper puts forward a structure for the acquisition of speech by machines (ASM), in which both recognition and synthesis are trained “simultaneously” from human training speech. The structure consists of two chains: a synthesis chain, in which a synthesiser is driven by a trainable neural network controller from a synthesis state vector, and a recognition chain, comprising a trainable neural network recogniser which produces a recogniser state vector. The recogniser alternately receives training speech from a human speaker and speech from the synthesiser. A coupled minimisation is set up which trains the recogniser network, together with the synthesiser state and network, so as to classify or recognise human input speech and to produce synthetic speech which is recognised to be of the same class as the human speech. The algorithm is demonstrated for the acquisition of steady-state vowels and simple isolated words.
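
The alternating scheme described above can be illustrated with a deliberately small numerical sketch. It is not the paper's implementation. Everything in it is an assumption made for illustration: the "human speech" is three Gaussian clusters of 2-D feature vectors, the recogniser is a single softmax layer, the controller is a single linear layer, the synthesiser is a fixed linear map `A`, the synthesis state vectors are fixed one-hot codes (in the paper the synthesiser state is trained as well), and plain gradient descent stands in for whatever minimisation the paper uses. The function name `train_asm_toy` is hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def train_asm_toy(steps=500, seed=0):
    rng = np.random.default_rng(seed)
    K, D = 3, 2  # number of classes, feature dimension (toy values)

    # Stand-in for human training speech: one Gaussian cluster per class.
    centres = np.array([[2.0, 0.0], [-1.0, 2.0], [-1.0, -2.0]])
    labels = np.repeat(np.arange(K), 30)
    human = centres[labels] + 0.3 * rng.standard_normal((labels.size, D))
    targets = np.eye(K)[labels]

    W, b = np.zeros((K, D)), np.zeros(K)    # recogniser (softmax layer)
    C = np.zeros((D, K))                    # controller (linear layer)
    S = np.eye(K)                           # synthesis state vectors (fixed here)
    A = np.array([[1.0, 0.5], [0.0, 1.0]])  # fixed synthesiser stand-in

    for _ in range(steps):
        # Recognition chain: one gradient step on human speech.
        p = softmax(human @ W.T + b)
        err = (p - targets) / labels.size
        W -= 0.2 * err.T @ human
        b -= 0.2 * err.sum(axis=0)

        # Synthesis chain: synthetic speech is passed to the recogniser,
        # and the controller is updated so that each class's output is
        # recognised as that class (cross-entropy through the recogniser).
        synth = (A @ C @ S).T               # row k: synthetic frame for class k
        q = softmax(synth @ W.T + b)
        grad_x = (q - np.eye(K)) @ W        # dLoss/dframe, one row per class
        C -= 0.05 * (A.T @ grad_x.T) @ S.T  # chain rule back to the controller

    human_acc = float((softmax(human @ W.T + b).argmax(axis=1) == labels).mean())
    synth_classes = ((A @ C @ S).T @ W.T + b).argmax(axis=1).tolist()
    return human_acc, synth_classes

human_acc, synth_classes = train_asm_toy()
print(human_acc, synth_classes)
```

The two updates inside the loop mirror the alternation in the abstract: the recogniser first sees human frames, then sees frames produced by the synthesis chain. The coupling enters through the second update, where the recogniser's own weights define the error signal that drives the controller.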