Unsupervised Learning of Sparse Features for Scalable Audio Classification

In this work we present a system to automatically learn features from audio in an unsupervised manner. Our method first learns an overcomplete dictionary which can be used to sparsely decompose log-scaled spectrograms. It then trains an efficient encoder which quickly maps new inputs to approximations of their sparse representations using the learned dictionary. This avoids expensive iterative procedures usually required to infer sparse codes. We then use these sparse codes as inputs for a linear Support Vector Machine (SVM). Our system achieves 83.4% accuracy in predicting genres on the GTZAN dataset, which is competitive with current state-of-the-art approaches. Furthermore, the use of a simple linear classifier combined with a fast feature extraction system allows our approach to scale well to large datasets.

[1]  Stéphane Mallat,et al.  Matching pursuits with time-frequency dictionaries , 1993, IEEE Trans. Signal Process..

[2]  David J. Field,et al.  Emergence of simple-cell receptive field properties by learning a sparse code for natural images , 1996, Nature.

[3]  Michael A. Saunders,et al.  Atomic Decomposition by Basis Pursuit , 1998, SIAM J. Sci. Comput..

[4]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[5]  Michael Elad,et al.  Optimally sparse representation in general (nonorthogonal) dictionaries via ℓ1 minimization , 2003, Proceedings of the National Academy of Sciences of the United States of America.

[6]  I. Daubechies,et al.  An iterative thresholding algorithm for linear inverse problems with a sparsity constraint , 2003, math/0307152.

[7]  Tao Li,et al.  Factors in automatic musical genre classification of audio signals , 2003, 2003 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (IEEE Cat. No.03TH8684).

[8]  Tao Li,et al.  A comparative study on content-based music genre classification , 2003, SIGIR.

[9]  Perfecto Herrera-Boyer,et al.  Automatic Classification of Musical Instrument Sounds , 2003 .

[10]  Douglas Eck,et al.  Aggregate features and ADABOOST for music classification , 2006, Machine Learning.

[11]  Michael S. Lewicki,et al.  Efficient auditory coding , 2006, Nature.

[12]  Rajat Raina,et al.  Efficient sparse coding algorithms , 2006, NIPS.

[13]  Thierry Bertin-Mahieux,et al.  On the Use of Sparce Time Relative Auditory Codes for Music , 2008, ISMIR.

[14]  Yihong Gong,et al.  Linear spatial pyramid matching using sparse coding for image classification , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[15]  Mohammad Norouzi,et al.  Stacks of convolutional Restricted Boltzmann Machines for shift-invariant feature learning , 2009, 2009 IEEE Conference on Computer Vision and Pattern Recognition.

[16]  Constantine Kotropoulos,et al.  Music Genre Classification Using Locality Preserving Non-Negative Tensor Factorization and Sparse Representations , 2009, ISMIR.

[17]  Marc Teboulle,et al.  A fast Iterative Shrinkage-Thresholding Algorithm with application to wavelet-based image deblurring , 2009, 2009 IEEE International Conference on Acoustics, Speech and Signal Processing.

[18]  S. Osher,et al.  Coordinate descent optimization for l 1 minimization with application to compressed sensing; a greedy algorithm , 2009 .

[19]  Mohammad Norouzi,et al.  Stacks of convolutional Restricted Boltzmann Machines for shift-invariant feature learning , 2009, CVPR.

[20]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[21]  Christian Schörkhuber CONSTANT-Q TRANSFORM TOOLBOX FOR MUSIC PROCESSING , 2010 .

[22]  Jyh-Shing Roger Jang,et al.  Music Genre Classification via Compressive Sampling , 2010, ISMIR.

[23]  Douglas Eck,et al.  Learning Features from Music Audio with Deep Belief Networks , 2010, ISMIR.

[24]  Yann LeCun,et al.  Convolutional networks and applications in vision , 2010, Proceedings of 2010 IEEE International Symposium on Circuits and Systems.

[25]  Marc'Aurelio Ranzato,et al.  Fast Inference in Sparse Coding Algorithms with Applications to Object Recognition , 2010, ArXiv.

[26]  Yann LeCun,et al.  Learning Fast Approximations of Sparse Coding , 2010, ICML.

[27]  Douglas Eck,et al.  Scalable Genre and Tag Prediction with Spectral Covariance , 2010, ISMIR.

[28]  Chih-Jen Lin,et al.  LIBSVM: A library for support vector machines , 2011, TIST.

[29]  Thierry Bertin-Mahieux,et al.  The Million Song Dataset , 2011, ISMIR.

[30]  Yoshua Bengio,et al.  A Spike and Slab Restricted Boltzmann Machine , 2011, AISTATS.