Why Gradient Clipping Accelerates Training: A Theoretical Justification for Adaptivity
Jingzhao Zhang | Tianxing He | Suvrit Sra | Ali Jadbabaie