Development and comparison of two approaches for visual speech analysis with application to voice activity detection

In this paper we present two novel methods for visual voice activity detection (V-VAD) which exploit the bimodality of speech (i.e., the coherence between a speaker’s lips and the resulting speech). The first method uses appearance parameters of a speaker’s lips, obtained from an active appearance model (AAM); a hidden Markov model (HMM) then models how the appearance changes over time. The second method applies a retinal filter to the lip region to extract the required parameter. Each method is applied in turn to a corpus from a single speaker and used to classify voice activity as speech or non-speech. The performance of each method is evaluated individually using receiver operating characteristic (ROC) curves, and the two are then compared and discussed. Both methods achieve a high correct silence detection rate for a small false detection rate.
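To make the receiver operating characteristic evaluation concrete, the following is a minimal sketch of how such a curve could be traced for a speech/non-speech detector. It is not the authors' implementation. The function name `roc_points`, the per-frame scores, and the convention that a low score indicates silence are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def roc_points(
    scores: Sequence[float], is_silence: Sequence[bool]
) -> List[Tuple[float, float]]:
    """Return (false detection rate, correct silence detection rate) pairs.

    Assumption: a frame is declared silent when its score is <= threshold,
    e.g. a score that measures how much the lip region is changing.
    One point is produced per distinct threshold, plus the origin.
    """
    n_silence = sum(is_silence)
    n_speech = len(is_silence) - n_silence
    if n_silence == 0 or n_speech == 0:
        raise ValueError("need at least one silence frame and one speech frame")

    points = [(0.0, 0.0)]  # threshold below every score: nothing flagged
    for threshold in sorted(set(scores)):
        # Silence frames correctly flagged as silent at this threshold.
        correct = sum(
            1 for s, sil in zip(scores, is_silence) if sil and s <= threshold
        )
        # Speech frames wrongly flagged as silent at this threshold.
        false = sum(
            1 for s, sil in zip(scores, is_silence) if not sil and s <= threshold
        )
        points.append((false / n_speech, correct / n_silence))
    return points


if __name__ == "__main__":
    # Hypothetical per-frame scores and ground-truth labels.
    demo_scores = [0.1, 0.2, 0.15, 0.9, 0.8, 0.3]
    demo_labels = [True, True, True, False, False, False]
    for fdr, csd in roc_points(demo_scores, demo_labels):
        print(f"false detection rate={fdr:.2f}  correct silence detection={csd:.2f}")
```

Each point corresponds to one decision threshold. A detector with a high correct silence detection rate at a small false detection rate, as reported for both methods, would have points close to the top-left corner of the plot.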
