Analysis of Gradient Clipping and Adaptive Scaling with a Relaxed Smoothness Condition

We provide a theoretical explanation for the fast convergence of gradient clipping and adaptively scaled gradient methods commonly used in neural network training. Our analysis is based on a novel relaxation of the gradient smoothness condition that is weaker than the commonly used Lipschitz smoothness assumption. We validate the new smoothness condition experimentally on large-scale neural network training tasks where adaptively scaled methods have been empirically shown to outperform standard gradient-based algorithms. Under this new smoothness condition, we prove that two popular adaptively scaled methods, \emph{gradient clipping} and \emph{normalized gradient}, converge at rates faster than the theoretical lower bound for fixed-step gradient descent. We verify this fast convergence empirically in neural network training for language modeling and image classification.
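For concreteness, the relaxed smoothness condition studied in this line of work is typically of the form $\|\nabla^2 f(x)\| \le L_0 + L_1 \|\nabla f(x)\|$, which lets the local smoothness constant grow with the gradient norm instead of being bounded by a single Lipschitz constant. The sketch below (ours, not the authors' code) illustrates the two adaptively scaled update rules, gradient clipping and normalized gradient descent, on a toy one-dimensional objective; the objective, step sizes, and clipping threshold are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (not the authors' implementation): clipped and normalized
# gradient descent on a toy objective whose curvature grows with |x|, so a
# single global Lipschitz smoothness constant does not exist.
import numpy as np

def grad_f(x):
    # Gradient of the toy objective f(x) = x^4 / 4 (illustrative choice).
    return x ** 3

def clipped_gd(x0, eta=0.1, gamma=1.0, steps=100):
    # Gradient clipping: the effective step size shrinks when the gradient is large,
    # so the update norm never exceeds gamma.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        step = min(eta, gamma / (np.linalg.norm(g) + 1e-12))
        x = x - step * g
    return x

def normalized_gd(x0, eta=0.1, beta=1e-8, steps=100):
    # Normalized gradient: take a fixed-length step in the gradient direction.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        g = grad_f(x)
        x = x - eta * g / (np.linalg.norm(g) + beta)
    return x

if __name__ == "__main__":
    x0 = np.array([5.0])
    print("clipped GD:   ", clipped_gd(x0))
    print("normalized GD:", normalized_gd(x0))
```

With a fixed step size, plain gradient descent on this objective would have to use a very small step to avoid divergence from large initializations; both adaptive rules above sidestep that by scaling the step with the gradient norm, which is the behavior the paper's analysis explains.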
