On Evaluation Validity in Music Autotagging

Music autotagging, an established problem in Music Information Retrieval, aims to reduce the human cost of manually annotating collections of recorded music with textual labels by automating the process. Many autotagging systems have been proposed and evaluated with procedures and datasets that are now standard (used in MIREX, for instance). Very little work, however, has been dedicated to determining what these evaluations really say about an autotagging system, or about a comparison of two systems, for the problem of annotating music in the real world. In this article, we are concerned with explaining the figure of merit of an autotagging system evaluated with a standard approach. Specifically, does the figure of merit, or a comparison of figures of merit, warrant a conclusion about how well autotagging systems have learned to describe music with a specific vocabulary? The main contributions of this paper are a formalization of the notion of validity in autotagging evaluation, and a general method for testing it. We demonstrate the practical use of our method in experiments with three specific state-of-the-art autotagging systems, all of which are reproducible using the linked code and data. Our experiments show, for these specific systems in a simple and objective two-class task, that the standard evaluation approach does not provide valid indicators of their performance.
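The abstract refers to the "figure of merit" produced by a standard evaluation approach. As a hedged illustration only (the abstract does not name the metric, so the choice of a per-tag F-score is an assumption, though it is a common figure of merit in autotagging evaluation), such a score on a two-class task might be computed as follows:

```python
# Hypothetical sketch: a per-tag F-score as a figure of merit.
# The specific metric is an assumption, not taken from the abstract.

def f_score(y_true, y_pred):
    """Binary F-score for one tag: harmonic mean of precision and recall."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Two-class task: each recording is labeled 1 ("tag applies") or 0.
ground_truth = [1, 1, 0, 0, 1, 0]
predictions  = [1, 0, 0, 1, 1, 0]
print(round(f_score(ground_truth, predictions), 3))  # → 0.667
```

A high value of such a score is precisely the kind of evidence whose validity the paper questions: the number alone does not establish that the system has learned to describe music with the intended vocabulary rather than exploiting confounded characteristics of the dataset.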
