Audio Source Separation with Discriminative Scattering Networks

Many monaural signal decomposition techniques proposed in the literature operate on a feature space consisting of a time-frequency representation of the input data. A challenge faced by these approaches is to effectively exploit the temporal dependencies of the signals at scales larger than the duration of a time frame. In this work we propose to tackle this problem by modeling the signals using a time-frequency representation with multiple temporal resolutions. To this end, we use a signal representation that consists of a pyramid of wavelet scattering operators, which generalizes Constant Q Transforms (CQT) with extra layers of convolution and complex modulus. We first show that learning standard models in this multi-resolution setting improves source separation results over fixed-resolution methods. As a case study, we use Non-Negative Matrix Factorization (NMF), which has been widely used in many audio applications. Then, we investigate the inclusion of the proposed multi-resolution setting into a discriminative training regime. We discuss several alternatives using different deep neural network architectures, and our preliminary experiments suggest that in this task, finite impulse, multi-resolution Convolutional Networks are a competitive baseline compared to recurrent alternatives.
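To make the NMF case study more concrete, the following is a minimal sketch of generic supervised NMF separation, not the authors' implementation. It assumes a Euclidean cost with standard multiplicative updates, synthetic non-negative matrices standing in for time-frequency (or scattering) features, and an additive mixing model in the feature domain. The function names `nmf` and `separate` and all sizes are hypothetical and chosen for illustration only.

```python
import numpy as np

EPS = 1e-9  # avoids division by zero in the multiplicative updates


def nmf(V, rank, n_iter=200, W=None, seed=0):
    """Euclidean NMF, V ~ W @ H, via multiplicative updates.

    If a dictionary W is supplied it is kept fixed and only the
    activations H are estimated (rank is then taken from W).
    """
    rng = np.random.default_rng(seed)
    learn_W = W is None
    if learn_W:
        W = rng.random((V.shape[0], rank)) + EPS
    H = rng.random((W.shape[1], V.shape[1])) + EPS
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)
        if learn_W:
            W *= (V @ H.T) / (W @ H @ H.T + EPS)
    return W, H


def separate(V_mix, W1, W2, n_iter=200):
    """Split a mixture using two pre-trained dictionaries and a soft mask."""
    W = np.hstack([W1, W2])
    _, H = nmf(V_mix, W.shape[1], n_iter=n_iter, W=W)
    k = W1.shape[1]
    S1 = W1 @ H[:k]           # model of source 1
    S2 = W2 @ H[k:]           # model of source 2
    mask1 = S1 / (S1 + S2 + EPS)
    return mask1 * V_mix, (1.0 - mask1) * V_mix


# Synthetic stand-ins for the features of two sources (assumption).
rng = np.random.default_rng(1)
F, T, K = 64, 100, 5
W1_true, W2_true = rng.random((F, K)), rng.random((F, K))
V1_train = W1_true @ rng.random((K, T))
V2_train = W2_true @ rng.random((K, T))

# Training: one dictionary per source, learned in isolation.
W1, _ = nmf(V1_train, K, seed=0)
W2, _ = nmf(V2_train, K, seed=1)

# Testing: decompose an unseen mixture with both dictionaries fixed.
V_mix = W1_true @ rng.random((K, T)) + W2_true @ rng.random((K, T))
est1, est2 = separate(V_mix, W1, W2)
```

Each column of `V_mix` is decomposed independently of its neighbours here, which is the limitation the abstract points at: temporal structure longer than one frame is ignored unless the features themselves, such as a multi-resolution scattering pyramid, carry it.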
