Developing an audio-visual speech source separation algorithm

Abstract Looking at a speaker's face helps listeners hear a speech signal better and extract it from competing sources before identification. This observation motivates new speech enhancement and extraction techniques that exploit the audio-visual coherence of speech stimuli. In this paper, we present a novel algorithm that plugs audio-visual coherence, estimated with statistical tools, into classical blind source separation algorithms, and we describe its assessment. We show, in the case of additive mixtures, that this algorithm outperforms classical blind tools both when there are as many sensors as sources and when there are fewer sensors than sources. Audio-visual coherence makes it possible to focus on the speech source to be extracted. It can also be applied at the output of a classical source separation algorithm, to select the "best" sensor with respect to a target source.
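Since the abstract stays high level, here is a minimal sketch of the selection idea it describes, assuming Python with NumPy and scikit-learn: a classical blind tool (FastICA) separates an additive mixture, and a simple audio-visual coherence score, here the correlation between each output's short-term energy envelope and a video feature such as the speaker's lip opening, picks the output matching the target speaker. The function names, frame length, and correlation-based coherence measure are illustrative assumptions, not the statistical model estimated in the paper.

```python
# Hedged sketch: blind separation followed by audio-visual selection.
# The coherence score is an illustrative stand-in for the paper's
# statistically estimated audio-visual coherence.
import numpy as np
from sklearn.decomposition import FastICA


def envelope(signal, frame_len=160):
    """Short-term energy envelope, one value per non-overlapping frame."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    return np.sqrt((frames ** 2).mean(axis=1))


def av_coherence(audio, video_feature, frame_len=160):
    """Absolute correlation between the audio envelope and a video
    feature (e.g., lip opening) sampled at the same frame rate."""
    env = envelope(audio, frame_len)
    n = min(len(env), len(video_feature))
    return abs(np.corrcoef(env[:n], video_feature[:n])[0, 1])


def extract_av_source(mixtures, video_feature, frame_len=160):
    """Blindly separate the mixtures, then select the output whose
    envelope is most coherent with the target speaker's video feature.

    mixtures: array of shape (n_sensors, n_samples), additive mixture.
    """
    ica = FastICA(n_components=mixtures.shape[0], random_state=0)
    outputs = ica.fit_transform(mixtures.T).T  # (n_sources, n_samples)
    scores = [av_coherence(s, video_feature, frame_len) for s in outputs]
    return outputs[int(np.argmax(scores))], scores
```

A call might look like `best, scores = extract_av_source(mixtures, lip_opening)`, where `mixtures` is an (n_sensors, n_samples) array and `lip_opening` is a hypothetical per-frame video feature for the target speaker; both inputs are assumptions for illustration.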
