Audio-visual Segmentation and "The Cocktail Party Effect"

Audio-based interfaces usually suffer when noise or other acoustic sources are present in the environment. For robust audio recognition, a single source must first be isolated. Existing solutions to this problem generally require special microphone configurations, and often assume prior knowledge of the spurious sources. We have developed new algorithms for segmenting streams of audio-visual information into their constituent sources by exploiting the mutual information present between audio and visual tracks. Automatic face recognition and image motion analysis methods are used to generate visual features for a particular user; empirically these features have high mutual information with audio recorded from that user. We show how audio utterances from several speakers recorded with a single microphone can be separated into constituent streams; we also show how the method can help reduce the effect of noise in automatic speech recognition.
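As a rough illustration of the idea of associating audio with the visual track that shares the most information with it, the following minimal sketch scores candidate visual feature tracks against a per-frame audio feature. It is not the authors' algorithm: it assumes the signals are jointly Gaussian, so that mutual information reduces to a function of the correlation coefficient, and it uses synthetic data. The names `gaussian_mutual_information`, `pick_speaker`, `audio_energy`, `track_a`, and `track_b` are hypothetical and chosen only for this example.

```python
import numpy as np


def gaussian_mutual_information(x, y):
    """Mutual information (in nats) between two 1-D signals.

    Assumption: x and y are jointly Gaussian, so I(x; y) = -0.5 * log(1 - rho^2),
    where rho is their correlation coefficient. This is a simplification,
    not a general-purpose estimator.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rho = np.corrcoef(x, y)[0, 1]
    rho2 = min(rho ** 2, 1.0 - 1e-12)  # guard against log(0) for perfectly correlated inputs
    return -0.5 * np.log(1.0 - rho2)


def pick_speaker(audio_feature, visual_tracks):
    """Return the index of the visual track with the highest MI to the audio, plus all scores."""
    scores = [gaussian_mutual_information(audio_feature, v) for v in visual_tracks]
    return int(np.argmax(scores)), scores


# Synthetic data (an assumption for illustration only): one visual track
# co-varies with the audio, the other is unrelated.
rng = np.random.default_rng(0)
n_frames = 500
speech = np.abs(rng.standard_normal(n_frames))                 # latent speech activity
audio_energy = speech + 0.3 * rng.standard_normal(n_frames)    # per-frame audio feature
track_a = rng.standard_normal(n_frames)                        # motion of a silent face
track_b = 0.8 * speech + 0.3 * rng.standard_normal(n_frames)   # motion of the talking face

best, scores = pick_speaker(audio_energy, [track_a, track_b])
print("selected track:", best)
print("MI scores (nats):", scores)
```

The Gaussian form is used here only because it has a closed-form estimate. Real audio-visual features are unlikely to be jointly Gaussian, so a practical system would need a more general estimator.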
