Unsupervised Learning of Local Features for Music Classification

In this work we investigate the applicability of unsupervised feature learning methods to the task of automatic genre prediction of music pieces. More specifically we evaluate a framework that recently has been successfully used to recognize objects in images. We first extract local patches from the time-frequency transformed audio signal, which are then pre-processed and used for unsupervised learning of an overcomplete dictionary of local features. For learning we either use a bootstrapped k-means clustering approach or select features randomly. We further extract feature responses in a convolutional manner and train a linear SVM for classification. We extensively evaluate the approach on the GTZAN dataset, emphasizing the influence of important design choices such as dimensionality reduction, pooling and patch dimension on the classification accuracy. We show that convolutional extraction of local feature responses is crucial to reach high performance. Furthermore we find that using this approach, simple and fast learning techniques such as k-means or randomly selected features are competitive with previously published results which also learn features from audio signals.

[1]  Erkki Oja,et al.  Independent component analysis: algorithms and applications , 2000, Neural Networks.

[2]  George Tzanetakis,et al.  Musical genre classification of audio signals , 2002, IEEE Trans. Speech Audio Process..

[3]  Klaus Seyerlehner,et al.  FRAME LEVEL AUDIO SIMILARITY - A CODEBOOK APPROACH , 2008 .

[4]  Constantine Kotropoulos,et al.  Music Genre Classification Using Locality Preserving Non-Negative Tensor Factorization and Sparse Representations , 2009, ISMIR.

[5]  Honglak Lee,et al.  Unsupervised feature learning for audio classification using convolutional deep belief networks , 2009, NIPS.

[6]  Christian Schörkhuber CONSTANT-Q TRANSFORM TOOLBOX FOR MUSIC PROCESSING , 2010 .

[7]  Jyh-Shing Roger Jang,et al.  Music Genre Classification via Compressive Sampling , 2010, ISMIR.

[8]  Douglas Eck,et al.  Learning Features from Music Audio with Deep Belief Networks , 2010, ISMIR.

[9]  Zhenghao Chen,et al.  On Random Weights and Unsupervised Feature Learning , 2011, ICML.

[10]  Yann LeCun,et al.  Unsupervised Learning of Sparse Features for Scalable Audio Classification , 2011, ISMIR.

[11]  Honglak Lee,et al.  An Analysis of Single-Layer Networks in Unsupervised Feature Learning , 2011, AISTATS.

[12]  Douglas Eck,et al.  Temporal Pooling and Multiscale Learning for Automatic Annotation and Ranking of Music Audio , 2011, ISMIR.

[13]  Christian Osendorfer,et al.  Music Similarity Estimation with the Mean-Covariance Restricted Boltzmann Machine , 2011, 2011 10th International Conference on Machine Learning and Applications and Workshops.

[14]  Martin A. Riedmiller,et al.  A learned feature descriptor for object recognition in RGB-D data , 2012, 2012 IEEE International Conference on Robotics and Automation.