On Learning Rates and Schrödinger Operators

The learning rate is perhaps the single most important parameter in the training of neural networks and, more broadly, in stochastic (nonconvex) optimization. Accordingly, there are numerous effective, but poorly understood, techniques for tuning the learning rate, including learning rate decay, which starts with a large initial learning rate that is gradually decreased. In this paper, we present a general theoretical analysis of the effect of the learning rate in stochastic gradient descent (SGD). Our analysis is based on a learning-rate-dependent stochastic differential equation (lr-dependent SDE) that serves as a surrogate for SGD. For a broad class of objective functions, we establish a linear rate of convergence for this continuous-time formulation of SGD, highlighting the fundamental importance of the learning rate in SGD, in contrast to gradient descent and stochastic gradient Langevin dynamics. Moreover, we obtain an explicit expression for the optimal linear rate by analyzing the spectrum of the Witten Laplacian, a special case of the Schrödinger operator associated with the lr-dependent SDE. Strikingly, this expression clearly reveals the dependence of the linear convergence rate on the learning rate: the linear rate decreases rapidly to zero as the learning rate tends to zero for a broad class of nonconvex functions, whereas it stays constant for strongly convex functions. Based on this sharp distinction between nonconvex and convex problems, we provide a mathematical interpretation of the benefits of using learning rate decay for nonconvex optimization.
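As a concrete sketch of the objects described above (the notation below, with $s$ the learning rate, $H_f$ a barrier depth, and $\mu$ a strong-convexity constant, is assumed here for illustration and is not quoted from the paper), an lr-dependent SDE surrogate of SGD is typically of the form

$$ dX_t = -\nabla f(X_t)\,dt + \sqrt{s}\,dB_t, $$

whose Fokker-Planck operator, after a ground-state transformation by the square root of the stationary density (proportional to $e^{-2f/s}$), becomes a Schrödinger operator of Witten Laplacian type,

$$ H_s \;=\; -\tfrac{s}{2}\,\Delta \;+\; \tfrac{1}{2s}\,|\nabla f|^2 \;-\; \tfrac{1}{2}\,\Delta f. $$

The smallest nonzero eigenvalue $\lambda_s(f)$ of this operator governs the linear rate, in the sense that $\|\rho_t - \rho_\infty\| \lesssim e^{-\lambda_s(f)\,t}$ in a suitable weighted norm. Standard Eyring-Kramers-type asymptotics then separate the two regimes described in the abstract: for a nonconvex $f$ whose deepest spurious well has depth $H_f$, one expects $\lambda_s(f) \asymp e^{-2H_f/s} \to 0$ as $s \to 0$, whereas for $\mu$-strongly convex $f$ the Bakry-Émery criterion gives $\lambda_s(f) \ge \mu$ uniformly in $s$.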

[1]  V. Arnold Mathematical Methods of Classical Mechanics , 1974 .

[2]  D. Talay,et al.  The law of the Euler scheme for stochastic differential equations , 1996 .

[3]  Léon Bottou,et al.  Large-Scale Machine Learning with Stochastic Gradient Descent , 2010, COMPSTAT.

[4]  Michael I. Jordan,et al.  Understanding the acceleration phenomenon via high-resolution differential equations , 2018, Mathematical Programming.

[5]  G. N. Mil’shtejn Approximate Integration of Stochastic Differential Equations , 1975 .

[6]  M. Ledoux,et al.  Analysis and Geometry of Markov Diffusion Operators , 2013 .

[7]  Andre Wibisono,et al.  A variational perspective on accelerated methods in optimization , 2016, Proceedings of the National Academy of Sciences.

[8]  Levent Sagun,et al.  A Tail-Index Analysis of Stochastic Gradient Noise in Deep Neural Networks , 2019, ICML.

[9]  Matthew D. Zeiler ADADELTA: An Adaptive Learning Rate Method , 2012, ArXiv.

[10]  Jimmy Ba,et al.  Adam: A Method for Stochastic Optimization , 2014, ICLR.

[11]  Alex Krizhevsky,et al.  Learning Multiple Layers of Features from Tiny Images , 2009 .

[12]  Stefano Soatto,et al.  Stochastic Gradient Descent Performs Variational Inference, Converges to Limit Cycles for Deep Networks , 2017, 2018 Information Theory and Applications Workshop (ITA).

[13]  M. Freidlin,et al.  Random Perturbations of Dynamical Systems , 1984 .

[14]  Samy Bengio,et al.  Understanding deep learning requires rethinking generalization , 2016, ICLR.

[15]  Jian Sun,et al.  Deep Residual Learning for Image Recognition , 2015, 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR).

[16]  Denis Talay,et al.  The law of the Euler scheme for stochastic differential equations , 1996, Monte Carlo Methods Appl..

[17]  Peter L. Bartlett,et al.  Acceleration and Averaging in Stochastic Descent Dynamics , 2017, NIPS.

[18]  Matus Telgarsky,et al.  Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis , 2017, COLT.

[19]  J. Marsden,et al.  A mathematical introduction to fluid mechanics , 1979 .

[20]  Michael I. Jordan,et al.  Gradient Descent Only Converges to Minimizers , 2016, COLT.

[21]  Sanjeev Arora,et al.  An Exponential Learning Rate Schedule for Deep Learning , 2020, ICLR.

[22]  Michael I. Jordan,et al.  Generalized Momentum-Based Methods: A Hamiltonian Perspective , 2019, SIAM J. Optim..

[23]  Stephen P. Boyd,et al.  A Differential Equation for Modeling Nesterov's Accelerated Gradient Method: Theory and Insights , 2014, J. Mach. Learn. Res..

[24]  F. Nier,et al.  Hypoelliptic Estimates and Spectral Theory for Fokker-Planck Operators and Witten Laplacians , 2005 .

[25]  David M. Blei,et al.  A Variational Analysis of Stochastic Gradient Algorithms , 2016, ICML.

[26]  Michael Hitrik,et al.  Tunnel effect and symmetries for Kramers–Fokker–Planck type operators , 2010, Journal of the Institute of Mathematics of Jussieu.

[27]  Michael I. Jordan DYNAMICAL, SYMPLECTIC AND STOCHASTIC PERSPECTIVES ON GRADIENT-BASED OPTIMIZATION , 2019, Proceedings of the International Congress of Mathematicians (ICM 2018).

[28]  D. Talay,et al.  Discretization and simulation of stochastic differential equations , 1985 .

[29]  Michael I. Jordan,et al.  How Does Learning Rate Decay Help Modern Neural Networks , 2019 .

[30]  F. Nier Quantitative analysis of metastability in reversible diffusion processes via a Witten complex approach. , 2004 .

[31]  Yoshua Bengio,et al.  Three Factors Influencing Minima in SGD , 2017, ArXiv.

[32]  S. Varadhan,et al.  Large deviations , 2019, Graduate Studies in Mathematics.

[33]  P. Kloeden,et al.  The approximation of multiple stochastic integrals , 1992 .

[34]  Yoshua Bengio,et al.  Practical Recommendations for Gradient-Based Training of Deep Architectures , 2012, Neural Networks: Tricks of the Trade.

[35]  Ruoyu Sun,et al.  Optimization for deep learning: theory and algorithms , 2019, ArXiv.

[36]  Nathan Srebro,et al.  Characterizing Implicit Bias in Terms of Optimization Geometry , 2018, ICML.

[37]  P. Lions Generalized Solutions of Hamilton-Jacobi Equations , 1982 .

[38]  Leslie N. Smith,et al.  Cyclical Learning Rates for Training Neural Networks , 2015, 2017 IEEE Winter Conference on Applications of Computer Vision (WACV).

[39]  Jessika Eichel,et al.  Partial Differential Equations Second Edition , 2016 .

[40]  Dmitriy Drusvyatskiy,et al.  Stochastic algorithms with geometric step decay converge linearly on sharp functions , 2019, Mathematical Programming.

[41]  P. Cannarsa,et al.  Semiconcave Functions, Hamilton-Jacobi Equations, and Optimal Control , 2004 .

[42]  Yoshua Bengio,et al.  On the Relation Between the Sharpest Directions of DNN Loss and the SGD Step Length , 2018, ICLR.

[43]  A. S. Kronfeld,et al.  Dynamics of Langevin simulations , 1992, hep-lat/9205008.

[44]  Vladimir Igorevich Arnold,et al.  Geometrical Methods in the Theory of Ordinary Differential Equations , 1983 .

[45]  Michael I. Jordan,et al.  How to Escape Saddle Points Efficiently , 2017, ICML.

[46]  Hermano Frid,et al.  Vanishing Viscosity Limit for Initial-Boundary Value Problems for Conservation Laws , 1999 .

[47]  V. Arnold,et al.  Topological methods in hydrodynamics , 1998 .

[48]  S. Zienau Quantum Physics , 1969, Nature.

[49]  Jorge Nocedal,et al.  Optimization Methods for Large-Scale Machine Learning , 2016, SIAM Rev..

[50]  Colin Wei,et al.  Towards Explaining the Regularization Effect of Initial Large Learning Rate in Training Neural Networks , 2019, NeurIPS.

[51]  Stefano Soatto,et al.  Deep relaxation: partial differential equations for optimizing deep neural networks , 2017, Research in the Mathematical Sciences.

[52]  E Weinan,et al.  Stochastic Modified Equations and Adaptive Stochastic Gradient Algorithms , 2015, ICML.

[53]  Cédric Villani,et al.  Hypocoercive Diffusion Operators , 2006 .

[54]  Kenneth F. Caluya,et al.  Gradient Flow Algorithms for Density Propagation in Stochastic Systems , 2019, IEEE Transactions on Automatic Control.

[55]  H. Kushner,et al.  Stochastic Approximation and Recursive Algorithms and Applications , 2003 .

[56]  P. Lions,et al.  Some Properties of Viscosity Solutions of Hamilton-Jacobi Equations. , 1984 .

[57]  Quoc V. Le,et al.  Don't Decay the Learning Rate, Increase the Batch Size , 2017, ICLR.

[58]  Yuchen Zhang,et al.  A Hitting Time Analysis of Stochastic Gradient Langevin Dynamics , 2017, COLT.

[59]  C. Hwang Laplace's Method Revisited: Weak Convergence of Probability Measures , 1980 .

[60]  Denis Talay,et al.  Efficient numerical schemes for the approximation of expectations of functionals of the solution of a S.D.E., and applications , 1984 .

[61]  Israel Michael Sigal,et al.  Introduction to Spectral Theory: With Applications to Schrödinger Operators , 1995 .

[62]  Weijie Su,et al.  Robust Learning Rate Selection for Stochastic Optimization via Splitting Diagnostic , 2019, ArXiv.

[63]  C. Villani,et al.  ON THE TREND TO EQUILIBRIUM FOR THE FOKKER-PLANCK EQUATION : AN INTERPLAY BETWEEN PHYSICS AND FUNCTIONAL ANALYSIS , 2004 .

[64]  Peter L. Bartlett,et al.  Adaptive Online Gradient Descent , 2007, NIPS.

[65]  A. Bovier,et al.  Metastability in reversible diffusion processes II. Precise asymptotics for small eigenvalues , 2005 .

[66]  L. Evans On solving certain nonlinear partial differential equations by accretive operator methods , 1980 .

[67]  G. Mil’shtein Weak Approximation of Solutions of Systems of Stochastic Differential Equations , 1986 .

[68]  Hangfeng He,et al.  The Local Elasticity of Neural Networks , 2020, ICLR.

[69]  Frank Hutter,et al.  SGDR: Stochastic Gradient Descent with Warm Restarts , 2016, ICLR.

[70]  S. Zagatti On viscosity solutions of Hamilton-Jacobi equations , 2008 .

[71]  Jorge Nocedal,et al.  On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima , 2016, ICLR.

[72]  Simon M. J. Lyons Introduction to stochastic differential equations , 2011 .

[73]  Laurent Michel,et al.  About small eigenvalues of the Witten Laplacian , 2017, Pure and Applied Analysis.

[74]  F. Bach,et al.  Bridging the gap between constant step size stochastic gradient descent and Markov chains , 2017, The Annals of Statistics.

[75]  Denis Talay,et al.  Resolution trajectorielle et analyse numerique des equations differentielles stochastiques , 1983 .

[76]  P. K. Kundu,et al.  Fluid Mechanics: Fourth Edition , 2008 .