Train on Validation: Squeezing the Data Lemon

Model selection on validation data is an essential step in machine learning. Although mixing training and validation data is considered taboo, practitioners often break this rule to increase performance. Here, we offer a simple, practical method for using part of the validation set for training, which allows a continuous, controlled trade-off between performance and overfitting of the model selection process. We define the notion of an on-average-validation-stable algorithm, one for which training on small portions of the validation data does not overfit the model selection process. We then prove that stable algorithms are also validation stable. Finally, we demonstrate our method on the MNIST and CIFAR-10 datasets using stable algorithms as well as state-of-the-art neural networks. Our results show a significant increase in test performance at the cost of only a minor bias admitted to the model selection process.
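To make the idea concrete, the following is a minimal sketch of the general train-on-validation scheme described above, not the paper's exact procedure. It assumes scikit-learn-style estimators; the function name `train_on_validation` and the parameter `alpha` (the fraction of validation data moved into training, which governs the performance/overfitting trade-off) are illustrative choices.

```python
# Sketch only: moves a fraction `alpha` of the validation set into training,
# then performs model selection on the remaining, untouched validation data.
import numpy as np

def train_on_validation(models, X_train, y_train, X_val, y_val,
                        alpha=0.1, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X_val))
    n_move = int(alpha * len(X_val))
    moved, kept = idx[:n_move], idx[n_move:]

    # Augment the training set with a small slice of validation data.
    X_aug = np.concatenate([X_train, X_val[moved]])
    y_aug = np.concatenate([y_train, y_val[moved]])

    # Model selection still uses the held-back validation points, so the
    # bias introduced into selection grows only with alpha.
    best_model, best_score = None, -np.inf
    for model in models:
        model.fit(X_aug, y_aug)
        score = model.score(X_val[kept], y_val[kept])
        if score > best_score:
            best_model, best_score = model, score
    return best_model, best_score
```

Setting `alpha = 0` recovers the standard train/validation split, while larger values trade validation fidelity for more training data; the paper's stability notion characterizes when this trade is safe.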
