A study of lip movements during spontaneous dialog and its application to voice activity detection.

This paper presents a comprehensive quantitative study of the lip movements of a given speaker in different speech/nonspeech contexts, with a particular focus on silences (i.e., intervals during which the speaker produces no sound). The aim is to characterize the relationship between "lip activity" and "speech activity" and then to use visual speech information as a voice activity detector (VAD). To this end, an original audiovisual corpus was recorded with two speakers engaged in a face-to-face spontaneous dialog while sitting in separate rooms. Each speaker communicated with the other through a microphone, a camera, a screen, and headphones. This setup captured a separate audio signal for each speaker while synchronously recording that speaker's lip movements. A comprehensive analysis was carried out on the lip shapes and lip movements in both silence and nonsilence (i.e., speech plus other audible events). A single visual parameter, defined to characterize the lip movements, was shown to be efficient for detecting silence sections. The result is a visual VAD that can be used in any kind of environmental noise, including intricate and highly nonstationary noise, e.g., multiple and/or moving noise sources or competing speech signals.
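The abstract does not give the exact definition of the visual parameter, but the general idea of thresholding smoothed lip-movement energy to flag silence can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual formulation: the parameter definition (absolute frame-to-frame variation of lip width and height), the smoothing window, and the threshold value are all assumptions introduced here.

```python
import numpy as np

def visual_vad(lip_width, lip_height, fps=50, win_s=0.2, threshold=0.1):
    """Toy visual voice activity detector (illustrative sketch).

    Frames where smoothed lip-movement energy stays below `threshold`
    are labeled silence (0); the rest are labeled activity (1). The
    parameter and threshold are hypothetical, not the paper's values.
    """
    w = np.asarray(lip_width, dtype=float)
    h = np.asarray(lip_height, dtype=float)
    # Frame-to-frame variation of the two lip shape parameters.
    motion = np.abs(np.diff(w, prepend=w[0])) + np.abs(np.diff(h, prepend=h[0]))
    # Moving-average smoothing bridges brief articulatory pauses so that
    # short holds inside a speech stretch are not flagged as silence.
    win = max(1, int(win_s * fps))
    smoothed = np.convolve(motion, np.ones(win) / win, mode="same")
    return (smoothed > threshold).astype(int)
```

Because the decision uses only the video stream, such a detector is unaffected by the acoustic environment, which is the property the abstract highlights for nonstationary or competing-speaker noise.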
