Center for Biological and Computational Learning, Massachusetts Institute of Technology

We describe how to create a generative, videorealistic speech animation module using machine learning techniques. A human subject is first recorded with a video camera while uttering a predetermined speech corpus. After the corpus is processed automatically, a visual speech module is learned from the data; it is capable of synthesizing the subject's mouth uttering entirely novel utterances that were not recorded in the original video. The synthesized utterance is re-composited onto a background sequence containing natural head and eye movement. The final output is videorealistic in the sense that it looks like a video-camera recording of the subject. At run time, the input to the system can be either real audio sequences or synthetic audio produced by a text-to-speech system, as long as they have been phonetically aligned. The two key contributions of this paper are 1) a variant of the multidimensional morphable model (MMM), which synthesizes new, previously unseen mouth configurations from a small set of mouth image prototypes; and 2) a trajectory synthesis technique based on regularization, which is trained automatically from the recorded video corpus and is capable of synthesizing trajectories in MMM space corresponding to any desired utterance.
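To give a flavor of the second contribution, the sketch below shows one common form of regularization-based trajectory synthesis: given per-frame targets in a parameter space (here standing in for MMM coordinates), solve for a trajectory that balances closeness to the targets against smoothness of the trajectory's second differences. This is a minimal, purely illustrative sketch under our own assumptions (the function name, the quadratic objective, and the second-difference penalty are not taken from the paper, which trains its regularizer from the recorded corpus):

```python
import numpy as np

def synthesize_trajectory(targets, lam=10.0):
    """Minimize ||y - targets||^2 + lam * ||D y||^2, where D is the
    second-difference (acceleration) operator. The closed-form solution
    is the linear system (I + lam * D^T D) y = targets, giving a smooth
    trajectory that stays close to the per-frame targets.

    targets: array of shape (T, K), T frames of K-dimensional parameters.
    """
    T, _ = targets.shape
    # Build the second-difference operator, shape (T-2, T).
    D = np.zeros((T - 2, T))
    for i in range(T - 2):
        D[i, i:i + 3] = [1.0, -2.0, 1.0]
    A = np.eye(T) + lam * (D.T @ D)
    return np.linalg.solve(A, targets)

# Example: a single spike target gets spread into a smooth bump.
targets = np.zeros((10, 2))
targets[5] = 1.0
smooth = synthesize_trajectory(targets, lam=5.0)
```

With `lam=0` the system reduces to the identity and the targets are reproduced exactly; increasing `lam` trades fidelity for smoothness, which is the same trade-off any regularized trajectory synthesizer must make.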
