Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

We provide a theoretical explanation for the fast convergence of gradient clipping and adaptively scaled gradient methods commonly used in neural network training. Our analysis is based on a novel relaxation of the gradient smoothness condition that is weaker than the commonly used Lipschitz smoothness assumption. We validate the new smoothness condition experimentally on large-scale neural network training tasks where adaptively scaled methods have been empirically shown to outperform standard gradient-based algorithms. Under this new smoothness condition, we prove that two popular adaptively scaled methods, \emph{gradient clipping} and \emph{normalized gradient}, converge at rates faster than the theoretical lower bound for fixed-step gradient descent. We verify this fast convergence empirically in neural network training for language modeling and image classification.
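For concreteness, the relaxed smoothness condition studied in this line of work is typically of the form $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, which lets the local smoothness constant grow with the gradient norm instead of being bounded by a single Lipschitz constant. The sketch below (ours, not the authors' code) illustrates the two adaptively scaled update rules, gradient clipping and normalized gradient descent, on a toy one-dimensional objective; the objective, step sizes, and clipping threshold are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (not the authors' implementation): clipped and normalized
# gradient descent on a toy objective whose curvature grows with |x|, so a
# single global Lipschitz smoothness constant does not exist.
import numpy as np

def grad_f(x):
    # Gradient of the toy objective f(x) = x^4 / 4 (illustrative choice).
    return x ** 3

def clipped_gd(x0, eta=0.1, gamma=1.0, steps=100):
    # Gradient clipping: the effective step size shrinks when the gradient is large,
    # so the update norm never exceeds gamma.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        step = min(eta, gamma / (np.linalg.norm(g) + 1e-12))
        x = x - step * g
    return x

def normalized_gd(x0, eta=0.1, beta=1e-8, steps=100):
    # Normalized gradient: take a fixed-length step in the gradient direction.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        x = x - eta * g / (np.linalg.norm(g) + beta)
    return x

if __name__ == "__main__":
    x0 = np.array([5.0])
    print("clipped GD:   ", clipped_gd(x0))
    print("normalized GD:", normalized_gd(x0))
```

With a fixed step size, plain gradient descent on this objective would have to use a very small step to avoid divergence from large initializations; both adaptive rules above sidestep that by scaling the step with the gradient norm, which is the behavior the paper's analysis explains.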
