Single sensor audiovisual speech source separation

Kernel Additive Modeling (KAM) is a recent and promising framework for separating underdetermined convolutive mixtures of audio signals. The principle of the method is to estimate the short-term Power Spectral Densities (PSDs) of the sources directly from the mixture by exploiting redundancies in each source's PSD, such as periodicity or smoothness. The separation itself is then performed with a generalized Wiener filter. This preliminary study evaluates the benefit of using video of the speaker's face to detect such redundancies in the speech directly, so that they can be exploited within the KAM framework to extract the speech signal.
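The final filtering step can be illustrated with a minimal sketch, assuming a single-channel mixture and PSD estimates that are already available. The function name `wiener_separate`, the array shapes, and the `eps` regularizer are illustrative assumptions, not taken from the study; the PSD estimation itself (the KAM part) is not shown.

```python
import numpy as np

def wiener_separate(mix_stft, source_psds, eps=1e-12):
    """Split a single-channel mixture STFT into per-source STFT estimates.

    mix_stft:    complex array of shape (F, T), the mixture STFT
    source_psds: nonnegative array of shape (J, F, T), one PSD estimate per source
    eps:         small constant (assumption) to avoid division by zero
    """
    source_psds = np.asarray(source_psds, dtype=float)
    # Total estimated power in each time-frequency bin.
    total = source_psds.sum(axis=0) + eps
    # Each source receives the fraction of the bin equal to its share of the power.
    masks = source_psds / total
    return masks * mix_stft[np.newaxis, :, :]
```

Because the masks sum to (almost exactly) one in every time-frequency bin, the source estimates add back up to the mixture. Each estimate would then be inverted with an inverse STFT to obtain a time-domain signal.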
