Piano music transcription with fast convolutional sparse coding

Automatic music transcription (AMT) is the process of converting an acoustic musical signal into a symbolic musical representation, such as a MIDI file, which contains the pitches, onsets, and offsets of the notes and, possibly, their dynamics and sources (i.e., instruments). Most existing algorithms for AMT operate in the frequency domain, which introduces the well-known time/frequency resolution trade-off of the Short-Time Fourier Transform and its variants. In this paper, we propose a time-domain transcription algorithm based on an efficient convolutional sparse coding algorithm in an instrument-specific scenario, i.e., the dictionary is trained and tested on the same piano. The proposed method outperforms a current state-of-the-art AMT method by over 26% in F-measure, achieving a median F-measure of 93.6%, and substantially improves both time and frequency resolution, especially for the lowest octaves of the piano keyboard.
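To make the core idea concrete, the sketch below illustrates how convolutional sparse coding can recover note activations directly in the time domain: the recorded signal is modeled as a sum of per-note waveform templates (dictionary atoms) convolved with sparse activation maps, and the non-zero activations indicate when, and how strongly, each note is played. This is only a minimal illustration using a plain ISTA iteration with circular convolution, not the efficient ADMM-based solver the paper relies on; the function names and parameters (e.g., conv_sparse_code_ista, lam, n_iter) are illustrative assumptions, not part of the paper.

```python
import numpy as np

def soft_threshold(x, t):
    """Element-wise soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def conv_sparse_code_ista(signal, dictionary, lam=0.1, step=None, n_iter=200):
    """Minimal sketch of convolutional sparse coding (convolutional BPDN) via ISTA.

    Approximately solves
        min_x  0.5 * || sum_k d_k (*) x_k - s ||_2^2 + lam * sum_k ||x_k||_1,
    where (*) denotes circular 1-D convolution. Each activation map x_k marks
    where in time atom d_k (e.g., a waveform template for one piano note) is
    active, and with what amplitude.
    """
    signal = np.asarray(signal, dtype=float)
    n = len(signal)
    K = len(dictionary)

    # Zero-pad atoms to the signal length so convolutions can be done via the FFT.
    # Circular wrap-around is ignored here; this is a sketch, not a production solver.
    D = np.zeros((K, n))
    for k, d in enumerate(dictionary):
        D[k, :len(d)] = d
    Df = np.fft.rfft(D, axis=1)
    Sf = np.fft.rfft(signal)

    # Step size 1/L, where L = max_f sum_k |D_k(f)|^2 bounds the Lipschitz
    # constant of the gradient of the data-fidelity term.
    if step is None:
        step = 1.0 / np.max(np.sum(np.abs(Df) ** 2, axis=0))

    X = np.zeros((K, n))
    for _ in range(n_iter):
        Xf = np.fft.rfft(X, axis=1)
        residual_f = np.sum(Df * Xf, axis=0) - Sf
        # Gradient of the data term: correlation of each atom with the residual,
        # computed as conj(D_k) * R in the frequency domain.
        grad = np.fft.irfft(np.conj(Df) * residual_f, n=n, axis=1)
        X = soft_threshold(X - step * grad, step * lam)
    return X
```

In the instrument-specific setting described in the abstract, the dictionary would hold one or more waveform templates per piano key, recorded from the same piano that is later transcribed; thresholding the resulting activation maps then yields candidate note onsets at sample-level time resolution.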
