MikeTalk: a talking facial display based on morphing visemes

We present MikeTalk, a text-to-audiovisual speech synthesizer that converts input text into an audiovisual speech stream. MikeTalk is built from visemes, a set of images spanning a large range of mouth shapes. The visemes are acquired from a recorded visual corpus of a human subject, which is specifically designed to elicit one instantiation of each viseme. Correspondence from every viseme to every other viseme is computed automatically using optical flow methods. Morphing along this correspondence generates a smooth transition between any two viseme images, and a complete visual utterance is constructed by concatenating such viseme transitions. Finally, phoneme and timing information extracted from a text-to-speech synthesizer determines which viseme transitions to use and the rate at which each morph should proceed. In this manner, we synchronize the visual speech stream with the audio speech stream and hence give the impression of a photorealistic talking face.
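To make the timing step concrete, the following is a minimal sketch of how phoneme timings from a text-to-speech system might be turned into a per-frame morphing schedule. It is not the authors' implementation. The phoneme-to-viseme table, the `Frame` type, the `schedule_frames` function, and the choice to let each transition span exactly the duration of its source phoneme are all assumptions made for illustration; the optical flow and image warping steps are omitted.

```python
from dataclasses import dataclass

# Hypothetical many-to-one phoneme-to-viseme table (illustrative only,
# not the mapping used by MikeTalk).
PHONEME_TO_VISEME = {
    "sil": "sil",
    "p": "p_b_m", "b": "p_b_m", "m": "p_b_m",
    "aa": "aa",
    "iy": "iy",
}


@dataclass
class Frame:
    src: str      # viseme the morph starts from
    dst: str      # viseme the morph moves toward
    alpha: float  # morph position: 0.0 = src image, 1.0 = dst image


def schedule_frames(phones, fps=30.0):
    """Build a per-frame morph schedule.

    phones: list of (phoneme, start_sec, end_sec), assumed contiguous and
    sorted, as a TTS front end might report them.
    Assumption: the transition out of phoneme i lasts as long as phoneme i.
    """
    visemes = [PHONEME_TO_VISEME[p] for p, _, _ in phones]
    n_frames = round(phones[-1][2] * fps)
    frames = []
    i = 0
    for k in range(n_frames):
        t = k / fps
        # Advance to the phoneme whose interval contains time t.
        while i < len(phones) - 1 and t >= phones[i][2]:
            i += 1
        _, start, end = phones[i]
        alpha = min(max((t - start) / (end - start), 0.0), 1.0)
        # The last phoneme has no successor, so it holds its own viseme.
        dst = visemes[i + 1] if i + 1 < len(phones) else visemes[i]
        frames.append(Frame(visemes[i], dst, alpha))
    return frames


if __name__ == "__main__":
    timing = [("sil", 0.0, 0.1), ("m", 0.1, 0.2), ("aa", 0.2, 0.4)]
    for frame in schedule_frames(timing, fps=20.0):
        print(frame)
```

Each `Frame` names the pair of viseme images to morph between and how far along the morph that video frame sits, so a renderer could look up the precomputed correspondence for `(src, dst)` and warp and blend accordingly. A shorter phoneme yields fewer frames per transition and therefore a faster morph, which is what keeps the video aligned with the audio.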
