Train faster, generalize better: Stability of stochastic gradient descent

We show that parametric models trained by a stochastic gradient method (SGM) with few iterations have vanishing generalization error. We prove our results by arguing that SGM is algorithmically stable in the sense of Bousquet and Elisseeff. Our analysis employs only elementary tools from convex and continuous optimization. We derive stability bounds for both convex and non-convex optimization under standard Lipschitz and smoothness assumptions. Applying our results to the convex case, we provide new insights into why multiple epochs of stochastic gradient methods generalize well in practice. In the non-convex case, we give a new interpretation of common practices in neural networks, and formally show that popular techniques for training large deep models are indeed stability-promoting. Our findings conceptually underscore the importance of reducing training time beyond its obvious benefit.
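As a minimal sketch of the stability argument in LaTeX (the notation is ours: $f(w;z)$ denotes the loss of model $w$ on example $z$, $w_S$ the output of SGM trained on an $n$-example sample $S$, and $R$, $R_S$ the population and empirical risks; the final bound assumes a convex, $L$-Lipschitz, $\beta$-smooth loss and step sizes $\alpha_t \le 2/\beta$, assumptions stated here rather than quoted from the abstract):

\[
\sup_{z}\;\mathbb{E}\big[\,f(w_S;z) - f(w_{S'};z)\,\big] \;\le\; \varepsilon_{\mathrm{stab}}
\quad\text{for any two samples } S, S' \text{ of size } n \text{ differing in a single example,}
\]
\[
\big|\,\mathbb{E}\big[\,R[w_S] - R_S[w_S]\,\big]\,\big| \;\le\; \varepsilon_{\mathrm{stab}},
\qquad\text{and}\qquad
\varepsilon_{\mathrm{stab}} \;\le\; \frac{2L^2}{n}\sum_{t=1}^{T}\alpha_t .
\]

Under these assumptions the stability parameter grows only with the accumulated step sizes, so running SGM for fewer iterations $T$ (or with smaller steps) directly tightens the expected generalization gap, which is the sense in which training faster generalizes better.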

[1] R. Rockafellar. Monotone Operators and the Proximal Point Algorithm, 1976.

[2] Luc Devroye et al. Distribution-free performance bounds for potential function rules, 1979, IEEE Trans. Inf. Theory.

[3] John Darzentas et al. Problem Complexity and Method Efficiency in Optimization, 1983.

[4] Geoffrey E. Hinton et al. Learning internal representations by error propagation, 1986.

[5] Pierre Priouret et al. Adaptive Algorithms and Stochastic Approximations, 1990, Applications of Mathematics.

[6] Anders Krogh et al. A Simple Weight Decay Can Improve Generalization, 1991, NIPS.

[7] Beatrice Santorini et al. Building a Large Annotated Corpus of English: The Penn Treebank, 1993, CL.

[8] William A. Sethares et al. Weak convergence and local stability properties of fixed step size recursive algorithms, 1993, IEEE Trans. Inf. Theory.

[9] O. Nelles et al. An Introduction to Optimization, 1996, IEEE Antennas and Propagation Magazine.

[10] Yoav Freund et al. Large Margin Classification Using the Perceptron Algorithm, 1998, COLT '98.

[11] Dana Ron et al. Algorithmic Stability and Sanity-Check Bounds for Leave-One-Out Cross-Validation, 1997, Neural Computation.

[12] M. Kearns et al. Algorithmic stability and sanity-check bounds for leave-one-out cross-validation, 1999.

[13] A. Föhrenbach et al. SIMPLE++, 2000, OR Spectr.

[14] Olivier Bousquet and André Elisseeff. Stability and Generalization, 2002, J. Mach. Learn. Res.

[15] H. Kushner et al. Stochastic Approximation and Recursive Algorithms and Applications, 2003.

[16] Yurii Nesterov. Introductory Lectures on Convex Optimization - A Basic Course, 2014, Applied Optimization.

[17] Patrick L. Combettes et al. Signal Recovery by Proximal Forward-Backward Splitting, 2005, Multiscale Model. Simul.

[18] Massimiliano Pontil et al. Stability of Randomized Learning Algorithms, 2005, J. Mach. Learn. Res.

[19] Sayan Mukherjee et al. Learning theory: stability is sufficient for generalization and necessary and sufficient for consistency of empirical risk minimization, 2006, Adv. Comput. Math.

[20] Léon Bottou et al. The Tradeoffs of Large Scale Learning, 2007, NIPS.

[21] Elad Hazan et al. Logarithmic regret algorithms for online convex optimization, 2006, Machine Learning.

[22] Alexander Shapiro et al. Stochastic Approximation approach to Stochastic Programming, 2013.

[23] Ohad Shamir et al. Learnability, Stability and Uniform Convergence, 2010, J. Mach. Learn. Res.

[24] Elad Hazan et al. An optimal algorithm for stochastic strongly-convex optimization, 2010, arXiv:1006.2425.

[25] Guanghui Lan. An optimal method for stochastic composite optimization, 2011, Mathematical Programming.

[26] Geoffrey E. Hinton et al. ImageNet classification with deep convolutional neural networks, 2012, Commun. ACM.

[27] Ohad Shamir et al. Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization, 2011, ICML.

[28] Oren Somekh et al. Almost Optimal Exploration in Multi-Armed Bandits, 2013, ICML.

[29] Saeed Ghadimi et al. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming, 2013, SIAM J. Optim.

[30] Nitish Srivastava et al. Dropout: a simple way to prevent neural networks from overfitting, 2014, J. Mach. Learn. Res.

[31] Roi Livni et al. On the Computational Efficiency of Training Neural Networks, 2014, NIPS.

[32] Stephen P. Boyd et al. Proximal Algorithms, 2013, Found. Trends Optim.

[33] Wojciech Zaremba et al. Recurrent Neural Network Regularization, 2014, arXiv.

[34] Yurii Nesterov et al. First-order methods of smooth convex optimization with inexact oracle, 2013, Mathematical Programming.

[35] Ryota Tomioka et al. In Search of the Real Inductive Bias: On the Role of Implicit Regularization in Deep Learning, 2014, ICLR.

[36] Kobbi Nissim et al. On the Generalization Properties of Differential Privacy, 2015, arXiv.

[37] Sanjeev Arora et al. Simple, Efficient, and Neural Algorithms for Sparse Coding, 2015, COLT.

[38] Sham M. Kakade et al. Competing with the Empirical Risk Minimizer in a Single Pass, 2014, COLT.

[39] Anima Anandkumar et al. Generalization Bounds for Neural Networks through Tensor Factorization, 2015, arXiv.

[40] Lorenzo Rosasco et al. Learning with Incremental Iterative Regularization, 2014, NIPS.

[41] Dumitru Erhan et al. Going deeper with convolutions, 2014, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[42] Benjamin Recht et al. Analysis and Design of Optimization Algorithms via Integral Quadratic Constraints, 2014, SIAM J. Optim.

[43] Ameet Talwalkar et al. Non-stochastic Best Arm Identification and Hyperparameter Optimization, 2015, AISTATS.

[44] Anima Anandkumar et al. Beating the Perils of Non-Convexity: Guaranteed Training of Neural Networks using Tensor Methods, 2017.