Audio-visual speech source separation: a new approach exploiting the audio-visual coherence of speech stimuli

We present a new approach to the source separation problem for multiple speech signals. The method is based on automatic lipreading: the objective is to extract an acoustic speech signal from a mixture of acoustic signals by exploiting its coherence with the speaker’s lip movements. We show that, once a statistical model of the joint probability of visual and spectral audio inputs has been learnt to quantify this audio-visual coherence, separation can be achieved by maximising that probability. We then present separation results on a corpus of vowel-plosive-vowel sequences uttered by a single speaker and embedded in a mixture of other voices.
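As a rough illustration of the criterion only (the abstract does not specify the model form or mixing assumptions, so all notation here is introduced for the sketch): if $a_t(B)$ denotes the spectral features of the speech estimate obtained with candidate separation parameters $B$ at frame $t$, $v_t$ the corresponding lip-shape parameters, and $p(a, v)$ the learnt joint audio-visual model, then the separation stage amounts to choosing the parameters that maximise the joint audio-visual likelihood of the extracted signal,
\[
\hat{B} \;=\; \arg\max_{B} \; \sum_{t=1}^{T} \log p\bigl(a_t(B),\, v_t\bigr),
\]
i.e. the estimate is driven towards acoustic spectra that are coherent with the observed lip movements.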