Audiovisual Speech Synthesis

This paper surveys the main approaches used to synthesize talking faces and provides greater detail on a selection of them. We distinguish between facial synthesis itself (i.e. the manner in which facial movements are rendered on a computer screen) and the way these movements may be controlled and predicted from phonetic input. The two main synthesis techniques, model-based and image-based, are contrasted and illustrated by brief descriptions of the most representative existing systems. Finally, the challenging issues of evaluation, data acquisition, and modeling that may drive future models are discussed and illustrated by our current work at ICP.