Informed monaural source separation of music based on convolutional sparse coding

Monaural source separation is a challenging problem with many important applications in music information retrieval. In this paper, we focus on the score-informed variant of this problem. While non-negative matrix factorization (NMF) and some other approaches have been shown to be effective, few existing approaches properly take phase information into account. As a result, the separated signals often sound unnatural, because the phase of each source signal is assumed to be the same as the phase of the mixture. To remedy this, we propose to perform source separation directly in the time domain using a convolutional sparse coding (CSC) approach. Evaluation on the Bach10 dataset shows that, when the instrument, pitch, and onset/offset times are given, the source-to-distortion ratio (SDR) of the separation result reaches 8.59 dB, which is 2.02 dB higher than that of a state-of-the-art system called Soundprism.
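To illustrate the general idea of time-domain separation with CSC, the following is a minimal sketch and not the system evaluated in the paper. It assumes a fixed dictionary of two toy atoms (Hann-windowed sinusoids standing in for instrument- and pitch-specific templates), models the mixture as a sum of convolutions of each atom with a sparse activation signal, and estimates the activations with plain ISTA. The function names `synthesize`, `csc_ista`, and `separate`, the labels, and all parameter values are hypothetical choices made for this example. No score information is used here, and the solver is a generic one chosen for brevity.

```python
import numpy as np

def synthesize(atoms, codes):
    """Sum of full convolutions d_k * x_k (all atoms share one length)."""
    return sum(np.convolve(d, x) for d, x in zip(atoms, codes))

def csc_ista(mix, atoms, lam=0.05, n_iter=300):
    """ISTA on 0.5*||sum_k d_k * x_k - mix||^2 + lam * sum_k ||x_k||_1."""
    atom_len = len(atoms[0])
    code_len = len(mix) - atom_len + 1
    codes = [np.zeros(code_len) for _ in atoms]
    # Conservative step size: sum_k ||d_k||_1^2 upper-bounds the
    # Lipschitz constant of the gradient of the quadratic term.
    step = 1.0 / sum(np.abs(d).sum() ** 2 for d in atoms)
    for _ in range(n_iter):
        residual = synthesize(atoms, codes) - mix
        for k, d in enumerate(atoms):
            # Gradient w.r.t. x_k is the correlation of the residual with d_k.
            grad = np.correlate(residual, d, mode="valid")
            z = codes[k] - step * grad
            # Soft-thresholding enforces sparsity of the activations.
            codes[k] = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return codes

def separate(atoms, codes, labels):
    """Resynthesize one time-domain signal per label from its own atoms."""
    out = {}
    for d, x, lab in zip(atoms, codes, labels):
        out[lab] = out.get(lab, 0) + np.convolve(d, x)
    return out

# Toy setup (assumed values): two unit-norm atoms, one per hypothetical instrument.
sr, atom_len, n = 8000, 256, 2048
t = np.arange(atom_len) / sr
window = np.hanning(atom_len)
atoms = []
for f in (440.0, 660.0):
    d = window * np.sin(2 * np.pi * f * t)
    atoms.append(d / np.linalg.norm(d))
labels = ["instrument_a", "instrument_b"]

# Ground-truth sparse activations, used only to build the toy mixture.
true_codes = [np.zeros(n - atom_len + 1) for _ in atoms]
true_codes[0][[100, 900]] = 1.0
true_codes[1][[400, 1300]] = 0.8
mix = synthesize(atoms, true_codes)

codes = csc_ista(mix, atoms)
sources = separate(atoms, codes, labels)
residual = mix - sum(sources.values())
```

Because each source estimate is resynthesized directly as a waveform from its own atoms, no mixture phase is borrowed, which is the property the abstract contrasts with magnitude-spectrogram methods. In a score-informed setting, one would presumably restrict which atoms may be active at which times according to the given pitch and onset/offset annotations; that step is omitted in this sketch.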
