Synthesis of natural sounding pitch contours in isolated utterances using hidden Markov models

A novel technique for characterizing prosodic structure is introduced and applied to speech synthesis. The technique models a set of observations as a probabilistic function of a hidden Markov chain. Mixtures of continuous Gaussian probability density functions represent the essential, perceptually relevant structure of intonation, obtained by observing movements of fundamental frequency in monosyllabic words of varying phonetic structure. High-quality speech synthesis with multipulse excitation demonstrates the power of the hidden Markov model (HMM) to preserve the naturalness of intonational meaning, which is conveyed by variation in fundamental frequency and duration. Fundamental frequency contours are generated from the models using a random number generator and are imposed on a synthesized prototype word that had the intonation of a low fall. The resulting monosyllabic words with imposed synthesized fundamental frequency contours show a high level of naturalness and are found to be perceptually indistinguishable from the original recordings with the same intonation. The results show the high potential of hidden Markov models as a mechanism for representing prosodic structure by naturally capturing its essentials.
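
The generation step described above, drawing a fundamental frequency contour from an HMM with Gaussian mixture observation densities by means of a random number generator, can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not taken from this work: the three-state left-to-right topology, the transition probabilities, the mixture weights, means and standard deviations (in Hz), and the function names. The falling state means only loosely mimic a low fall.

```python
import random

# Hypothetical 3-state left-to-right HMM. All numbers are illustrative
# assumptions, not parameters estimated in the work described above.
TRANSITIONS = [
    [0.8, 0.2, 0.0],
    [0.0, 0.8, 0.2],
    [0.0, 0.0, 1.0],
]

# One Gaussian mixture per state over F0 in Hz: (weight, mean, std_dev).
MIXTURES = [
    [(0.6, 180.0, 5.0), (0.4, 170.0, 8.0)],
    [(0.5, 150.0, 6.0), (0.5, 140.0, 6.0)],
    [(1.0, 110.0, 4.0)],
]


def sample_index(weights, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def sample_f0_contour(num_frames, seed=None):
    """Sample one F0 value per frame, returning (contour, state_path)."""
    rng = random.Random(seed)
    state = 0  # a left-to-right model starts in its first state
    contour, states = [], []
    for _ in range(num_frames):
        # Pick a mixture component for the current state, then draw from it.
        weights = [w for w, _, _ in MIXTURES[state]]
        _, mean, std = MIXTURES[state][sample_index(weights, rng)]
        contour.append(rng.gauss(mean, std))
        states.append(state)
        # Move to the next state according to the transition row.
        state = sample_index(TRANSITIONS[state], rng)
    return contour, states


if __name__ == "__main__":
    f0, path = sample_f0_contour(30, seed=0)
    for s, value in zip(path, f0):
        print(f"state {s}: {value:.1f} Hz")
```

In this sketch the time spent in each state follows from the self-transition probabilities, so contour duration varies from draw to draw. Frame-by-frame sampling like this produces a noisy contour; any smoothing or duration control applied before imposing a contour on a prototype word is outside what the sketch covers.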