Significance tests or confidence intervals: which are preferable for the comparison of classifiers?

Null hypothesis significance tests and their p-values currently dominate the statistical evaluation of classifiers in machine learning. Here, we discuss fundamental problems with this research practice. We focus on the problem of comparing multiple fully specified classifiers on a small-sample test set. Building on the method of Quesenberry and Hurst, we derive confidence intervals for the effect size, i.e., the difference in true classification performance. These confidence intervals disentangle the effect size from its uncertainty and thereby provide information beyond the p-value. This additional information can drastically change how classification results are interpreted, published and acted upon. We illustrate how our conclusions can change depending on whether we focus on p-values or on confidence intervals. We argue that the conclusions of comparative classification studies should be based primarily on effect size estimation with confidence intervals, not on significance tests and p-values.

[1] F. Schmidt. Statistical Significance Testing and Cumulative Knowledge in Psychology: Implications for Training of Researchers, 1996.

[2] Robert E. McGrath et al. Alternatives to null hypothesis significance testing, 2011.

[3] S. Goodman et al. p values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate, 1993, American Journal of Epidemiology.

[4] G. Cumming. Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, 2011.

[5] Pat Langley et al. The changing science of machine learning, 2011, Machine Learning.

[6] Chris Drummond et al. Machine Learning as an Experimental Science (Revisited), 2006.

[7] C. Drummond. Finding a Balance between Anarchy and Orthodoxy, 2008.

[8] Janez Demsar et al. Statistical Comparisons of Classifiers over Multiple Data Sets, 2006, J. Mach. Learn. Res.

[9] D. C. Hurst et al. Large Sample Simultaneous Confidence Intervals for Multinomial Proportions, 1964.

[10] Michael J. Marks et al. The null hypothesis significance-testing debate and its implications for personality research, 2007.

[11] Thomas G. Dietterich. Approximate Statistical Tests for Comparing Supervised Classification Learning Algorithms, 1998, Neural Computation.

[12] K. Manly et al. Genomics, prior probability, and statistical tests of multiple hypotheses, 2004, Genome Research.

[13] S. Holm. A Simple Sequentially Rejective Multiple Test Procedure, 1979.

[14] Peter Dixon. Why scientists value p values, 1998.

[15] Charles Dugas et al. Pointwise exact bootstrap distributions of ROC curves, 2009, Machine Learning.

[16] S. Goodman. Toward Evidence-Based Medical Statistics. 1: The P Value Fallacy, 1999, Annals of Internal Medicine.

[17] S. García et al. An Extension on "Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons, 2008.

[18] Gavin C. Cawley et al. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation, 2010, J. Mach. Learn. Res.

[19] Maliha S. Nash et al. Handbook of Parametric and Nonparametric Statistical Procedures, 2001, Technometrics.

[20] C. Poole. Low P-Values or Narrow Confidence Intervals: Which Are More Durable?, 2001, Epidemiology.

[21] Q. McNemar. Note on the sampling error of the difference between correlated proportions or percentages, 1947, Psychometrika.

[22] Gemma C. Garriga et al. Permutation Tests for Studying Classifier Performance, 2009, Ninth IEEE International Conference on Data Mining.

[23] Werner Dubitzky et al. Avoiding model selection bias in small-sample genomic datasets, 2006, Bioinformatics.

[24] M. J. van de Vijver et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis, 2006, Journal of the National Cancer Institute.

[25] Pat Langley et al. Machine learning as an experimental science, 2004, Machine Learning.

[26] Douglas H. Johnson. The Insignificance of Statistical Significance Testing, 1999.

[27] Trevor Hastie. The Elements of Statistical Learning, 2001.

[28] K. J. Rothman. A show of confidence, 1978, The New England Journal of Medicine.

[29] J. Jackson Barnette. The Data Analysis Dilemma: Ban or Abandon. A Review of Null Hypothesis Significance Testing, 1998.

[30] Eibe Frank. Evaluating the Replicability of Significance Tests for Comparing Learning Algorithms, 2004, PAKDD.

[31] Lois Ann Colaianni et al. Uniform Requirements for Manuscripts Submitted to Biomedical Journals, 1991, The Medical Journal of Australia.

[32] Daniel H. Robinson. Further Reflections on Hypothesis Testing and Editorial Policy for Primary Research Journals, 1999.

[33] Yoshua Bengio. Inference for the Generalization Error, 1999, Machine Learning.

[34] Nathalie Japkowicz. Warning: statistical benchmarking is addictive. Kicking the habit in machine learning, 2010, J. Exp. Theor. Artif. Intell.

[35] David J. Hand. Classifier Technology and the Illusion of Progress, 2006, math/0606441.

[36] G. K. Robinson. On the necessity of Bayesian inference and the construction of measures of nearness to Bayesian form, 1978.

[37] Leo Breiman. Random Forests, 2001, Machine Learning.

[38] W. D. Johnson. Confidence intervals for differences in correlated binary proportions, 1997, Statistics in Medicine.

[39] S. Goodman. A dirty dozen: twelve p-value misconceptions, 2008, Seminars in Hematology.

[40] R Core Team. R: A language and environment for statistical computing, 2014.

[41] James O. Berger. Statistical Analysis and the Illusion of Objectivity, 1988.