On the acquisition of speech by machines, asm

Abstract

As is well known, humans acquire speech skills by learning speech perception and speech production “simultaneously”, in an environment of speakers who have already acquired these skills. By contrast, in speech processing by machines, speech recognition and speech synthesis are studied and implemented separately, and different methodologies have been developed for each. The present paper puts forward a structure for the acquisition of speech by machines, asm, in which both recognition and synthesis are trained “simultaneously” from human training speech. The structure consists of two chains: a synthesis chain, in which a synthesiser is driven by a trainable neural network controller from a synthesis state vector, and a recognition chain, comprising a trainable neural network recogniser which produces a recogniser state vector. The recogniser alternately receives training speech from a human speaker and speech from the synthesiser. A coupled minimisation is set up which trains the recogniser network, together with the synthesiser state and network, so that human input speech is classified or recognised and the synthetic speech produced is recognised to be of the same class as the human speech. The algorithm is demonstrated for the acquisition of steady state vowels and simple isolated words.
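
To make the idea of alternating, coupled training concrete, the following is a minimal toy sketch and not the algorithm of the paper. Every detail is an assumption made for illustration: the "speech" is a two-dimensional feature vector, the recogniser is a linear softmax classifier, the synthesiser is a fixed `tanh` map, the controller holds one row of control parameters per one-hot synthesis state, and the names `recognise`, `synthesise` and `train` are hypothetical. The recogniser is updated on human samples only, while the controller is updated so that the synthetic output is recognised as the same class as the human sample.

```python
import math
import random


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def recognise(W, b, x):
    """Toy recogniser (assumption): linear layer plus softmax over classes."""
    z = [sum(W[k][j] * x[j] for j in range(len(x))) + b[k]
         for k in range(len(b))]
    return softmax(z)


def synthesise(u):
    """Toy fixed synthesiser (assumption): control parameters -> features."""
    return [math.tanh(v) for v in u]


def train(human, n_classes=2, dim=2, epochs=200, lr=0.5, seed=0):
    """Alternate a recogniser step on human data with a controller step
    on synthetic data, both minimising cross-entropy."""
    rng = random.Random(seed)
    W = [[0.0] * dim for _ in range(n_classes)]
    b = [0.0] * n_classes
    # Controller: with a one-hot synthesis state, its output is simply
    # the row of control parameters belonging to that class.
    C = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
         for _ in range(n_classes)]
    for _ in range(epochs):
        for x, label in human:
            # Recognition step: gradient descent on human "speech".
            p = recognise(W, b, x)
            for k in range(n_classes):
                g = p[k] - (1.0 if k == label else 0.0)
                for j in range(dim):
                    W[k][j] -= lr * g * x[j]
                b[k] -= lr * g
            # Synthesis step: pass the synthetic output through the
            # recogniser and back-propagate the error to the controller
            # only (the recogniser is held fixed in this step).
            y = synthesise(C[label])
            q = recognise(W, b, y)
            for j in range(dim):
                dl_dy = sum((q[k] - (1.0 if k == label else 0.0)) * W[k][j]
                            for k in range(n_classes))
                C[label][j] -= lr * dl_dy * (1.0 - y[j] ** 2)
    return W, b, C


def argmax(p):
    return max(range(len(p)), key=lambda k: p[k])


# Hypothetical "human training speech": two classes of 2-D features.
human = [([1.0, 0.1], 0), ([0.9, 0.0], 0), ([0.1, 1.0], 1), ([0.0, 0.9], 1)]
W, b, C = train(human)
for k in range(2):
    print("class", k, "synthetic output recognised as",
          argmax(recognise(W, b, synthesise(C[k]))))
```

In this sketch the coupling is one-directional: the synthesis chain learns through the recogniser, whereas the structure described in the abstract also feeds synthetic speech to the recogniser during its own training.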