Boosting-Based Multimodal Speaker Detection for Distributed Meetings

Speaker detection is a very important task in distributed meeting applications. This paper discusses a number of challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and proposes a boosting-based multimodal speaker detection (BMSD) algorithm. Instead of performing sound source localization (SSL) and multi-person detection (MPD) separately and subsequently fusing their individual results, the proposed algorithm uses boosting to select features from a combined pool of both audio and visual features simultaneously. The result is a very accurate speaker detector with extremely high efficiency. The algorithm reduces the error rate of SSL-only approach by 47%, and the SSL and MPD fusion approach by 27%

[1]  A. Adjoudani,et al.  On the Integration of Auditory and Visual Parameters in an HMM-based ASR , 1996 .

[2]  Alex Pentland,et al.  Pfinder: real-time tracking of the human body , 1996, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition.

[3]  Hong Wang,et al.  Voice source localization for automatic camera pointing system in videoconferencing , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[4]  Alex Pentland,et al.  Coupled hidden Markov models for complex action recognition , 1997, Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition.

[5]  Alex Pentland,et al.  Pfinder: Real-Time Tracking of the Human Body , 1997, IEEE Trans. Pattern Anal. Mach. Intell..

[6]  Steven George Goodridge Multimedia sensor fusion for intelligent camera control and human-computer interaction , 1997 .

[7]  Michael S. Brandstein,et al.  A robust method for speech signal time-delay estimation in reverberant rooms , 1997, 1997 IEEE International Conference on Acoustics, Speech, and Signal Processing.

[8]  Yoram Singer,et al.  Improved Boosting Algorithms Using Confidence-rated Predictions , 1998, COLT' 98.

[9]  Javier R. Movellan,et al.  Audio Vision: Using Audio-Visual Synchrony to Locate Sounds , 1999, NIPS.

[10]  Kentaro Toyama,et al.  Wallflower: principles and practice of background maintenance , 1999, Proceedings of the Seventh IEEE International Conference on Computer Vision.

[11]  Takeo Kanade,et al.  A System for Video Surveillance and Monitoring , 2000 .

[12]  Vladimir Pavlovic,et al.  Multimodal speaker detection using error feedback dynamic Bayesian networks , 2000, Proceedings IEEE Conference on Computer Vision and Pattern Recognition. CVPR 2000 (Cat. No.PR00662).

[13]  Trevor Darrell,et al.  Learning Joint Statistical Models for Audio-Visual Fusion and Segregation , 2000, NIPS.

[14]  Giridharan Iyengar,et al.  Speaker change detection using joint audio-visual statistics , 2000, RIAO.

[15]  J. Friedman Special Invited Paper-Additive logistic regression: A statistical view of boosting , 2000 .

[16]  Larry S. Davis,et al.  Look who's talking: speaker detection using video and audio correlation , 2000, 2000 IEEE International Conference on Multimedia and Expo. ICME2000. Proceedings. Latest Advances in the Fast Changing World of Multimedia (Cat. No.00TH8532).

[17]  Juergen Luettin,et al.  Audio-Visual Speech Modeling for Continuous Speech Recognition , 2000, IEEE Trans. Multim..

[18]  Patrick Pérez,et al.  Sequential Monte Carlo Fusion of Sound and Vision for Speaker Tracking , 2001, ICCV.

[19]  Milind R. Naphade,et al.  Duration dependent input output markov models for audio-visual event detection , 2001, IEEE International Conference on Multimedia and Expo, 2001. ICME 2001..

[20]  B.D. Rao,et al.  Source localization in reverberant environments: performance bounds and ML estimation , 2001, Conference Record of Thirty-Fifth Asilomar Conference on Signals, Systems and Computers (Cat.No.01CH37256).

[21]  Gopal Sarma Pingali,et al.  A multimodal speaker detection and tracking system for teleconferencing , 2002, MULTIMEDIA '02.

[22]  Anoop Gupta,et al.  Distributed meetings: a meeting capture and broadcasting system , 2002, MULTIMEDIA '02.

[23]  Nebojsa Jojic,et al.  Audio-Video Sensor Fusion with Probabilistic Graphical Models , 2002, ECCV.

[24]  Larry S. Davis,et al.  Joint Audio-Visual Tracking Using Particle Filters , 2002, EURASIP J. Adv. Signal Process..

[25]  Harriet J. Nock,et al.  Speaker Localisation Using Audio-Visual Synchrony: An Empirical Study , 2003, CIVR.

[26]  Shih-Fu Chang,et al.  Discovery and fusion of salient multimodal features toward news story segmentation , 2003, IS&T/SPIE Electronic Imaging.

[27]  Michael R. M. Jenkin,et al.  Audiovisual localization of multiple speakers in a video teleconferencing setting , 2003, Int. J. Imaging Syst. Technol..

[28]  Yong Rui,et al.  Real-time speaker tracking using particle filter sensor fusion , 2004, Proceedings of the IEEE.

[29]  Paul A. Viola,et al.  Robust Real-Time Face Detection , 2001, International Journal of Computer Vision.

[30]  Yoram Singer,et al.  Logistic Regression, AdaBoost and Bregman Distances , 2000, Machine Learning.

[31]  Paul A. Viola,et al.  Detecting Pedestrians Using Patterns of Motion and Appearance , 2005, International Journal of Computer Vision.

[32]  Yong Rui,et al.  Sound source localization for circular arrays of directional microphones , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[33]  Carlos Busso,et al.  Smart room: participant and speaker localization and identification , 2005, Proceedings. (ICASSP '05). IEEE International Conference on Acoustics, Speech, and Signal Processing, 2005..

[34]  John W. McDonough,et al.  A joint particle filter for audio-visual speaker tracking , 2005, ICMI '05.

[35]  Bill Triggs,et al.  Histograms of oriented gradients for human detection , 2005, 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05).

[36]  Murat Kunt,et al.  School of Engineering -sti Signal Processing Institute Information Theoretic Optimization of Audio Features for Multimodal Speaker Detection Information Theoretic Optimization of Audio Features for Multimodal Speaker Detection , 2022 .

[37]  Zhengyou Zhang,et al.  Maximum Likelihood Sound Source Localization for Multiple Directional Microphones , 2007, 2007 IEEE International Conference on Acoustics, Speech and Signal Processing - ICASSP '07.

[38]  Paul A. Viola,et al.  Multiple-Instance Pruning For Learning Efficient Cascade Detectors , 2007, NIPS.

[39]  Peter L. Bartlett,et al.  Boosting Algorithms as Gradient Descent in Function Space , 2007 .