Paper information - Boosting-Based Multimodal Speaker Detection for Distributed Meetings

Speaker detection is a very important task in distributed meeting applications. This paper discusses a number of challenges we met while designing a speaker detector for the Microsoft RoundTable distributed meeting device, and proposes a boosting-based multimodal speaker detection (BMSD) algorithm. Instead of performing sound source localization (SSL) and multi-person detection (MPD) separately and then fusing their individual results, the proposed algorithm uses boosting to select features simultaneously from a combined pool of audio and visual features. The result is a very accurate speaker detector with extremely high efficiency. The algorithm reduces the error rate of the SSL-only approach by 47% and that of the SSL and MPD fusion approach by 27%.
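The idea of letting boosting choose from one combined pool of audio and visual features can be illustrated with a minimal sketch. It uses plain discrete AdaBoost with decision stumps as a stand-in; the boosting variant, the features, and the data used in the paper are not described on this page, so everything below (the synthetic data, the feature names such as `audio_ssl` and `visual_motion`, and all function names) is a hypothetical illustration rather than the authors' method.

```python
import math
import random

# Hypothetical feature pool: two audio and two visual features per candidate window.
FEATURE_NAMES = ["audio_ssl", "audio_noise", "visual_motion", "visual_noise"]

def make_synthetic_windows(n, seed=0):
    """Toy data (an assumption): one informative and one noise feature per modality."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.choice([1, -1])  # +1 = speaker window, -1 = non-speaker window
        X.append([
            rng.gauss(0.8 * label, 1.0),  # audio_ssl: weakly informative
            rng.gauss(0.0, 1.0),          # audio_noise: uninformative
            rng.gauss(0.8 * label, 1.0),  # visual_motion: weakly informative
            rng.gauss(0.0, 1.0),          # visual_noise: uninformative
        ])
        y.append(label)
    return X, y

def stump_predict(row, j, thr, polarity):
    """Decision stump: vote `polarity` if feature j exceeds thr, else the opposite."""
    return polarity if row[j] > thr else -polarity

def best_stump(X, y, w):
    """Search every feature in the combined pool for the lowest weighted error."""
    best = (float("inf"), 0, 0.0, 1)
    for j in range(len(X[0])):
        for thr in sorted(set(row[j] for row in X)):
            for polarity in (1, -1):
                err = sum(wi for row, yi, wi in zip(X, y, w)
                          if stump_predict(row, j, thr, polarity) != yi)
                if err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def train_adaboost(X, y, rounds=10):
    """Discrete AdaBoost: each round picks one stump, audio or visual."""
    n = len(X)
    w = [1.0 / n] * n
    ensemble = []
    for _ in range(rounds):
        err, j, thr, pol = best_stump(X, y, w)
        err = min(max(err, 1e-10), 1.0 - 1e-10)  # guard against log(0)
        alpha = 0.5 * math.log((1.0 - err) / err)
        ensemble.append((alpha, j, thr, pol))
        # Increase the weight of misclassified windows, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, pol))
             for row, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def predict(ensemble, row):
    """Sign of the weighted vote over all selected stumps."""
    score = sum(a * stump_predict(row, j, thr, pol) for a, j, thr, pol in ensemble)
    return 1 if score >= 0 else -1

def error_rate(ensemble, X, y):
    wrong = sum(1 for row, yi in zip(X, y) if predict(ensemble, row) != yi)
    return wrong / len(X)

if __name__ == "__main__":
    X, y = make_synthetic_windows(200)
    model = train_adaboost(X, y, rounds=10)
    print("selected features:", [FEATURE_NAMES[j] for _, j, _, _ in model])
    print("training error:", error_rate(model, X, y))
```

Because the stump search runs over audio and visual features together, each boosting round is free to pick whichever modality best corrects the current errors, which is the contrast with training separate detectors and fusing their outputs afterwards.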
