Momentum Improves Normalized SGD

We provide an improved analysis of normalized SGD showing that adding momentum provably removes the need for large batch sizes on non-convex objectives. Then, we consider objectives with bounded second derivative and show that, in this case, a small tweak to the momentum formula allows normalized SGD with momentum to find an $\epsilon$-critical point in $O(1/\epsilon^{3.5})$ iterations, matching the best-known rate without incurring any logarithmic factors or dependence on dimension. We also provide an adaptive method that automatically improves convergence rates when the variance in the gradients is small. Finally, we show that our method is effective when employed on popular large-scale tasks such as ResNet-50 and BERT pretraining, matching the performance of the disparate methods used to obtain state-of-the-art results on both tasks.
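
For intuition, here is a minimal sketch of the normalized SGD with momentum update the abstract refers to, written in PyTorch: the momentum buffer is an exponential average of stochastic gradients, and the parameter step is the momentum direction normalized to unit length and scaled by the learning rate. This is an illustrative sketch under those assumptions only; the function name and hyperparameter defaults are placeholders, and the specific momentum tweak analyzed for the $O(1/\epsilon^{3.5})$ rate is not reproduced here.

```python
import torch

def normalized_sgd_momentum_step(params, grads, momenta, lr=0.01, beta=0.9):
    """One illustrative step of normalized SGD with momentum (not the paper's exact variant).

    Momentum update:   m <- beta * m + (1 - beta) * g
    Parameter update:  x <- x - lr * m / ||m||
    """
    with torch.no_grad():
        # Exponential moving average of the stochastic gradients.
        for g, m in zip(grads, momenta):
            m.mul_(beta).add_(g, alpha=1.0 - beta)
        # Normalize by the global momentum norm so the step length is exactly lr.
        total_norm = torch.sqrt(sum(m.pow(2).sum() for m in momenta))
        for p, m in zip(params, momenta):
            p.add_(m, alpha=-lr / (float(total_norm) + 1e-12))
```

Because the update direction has unit norm, the step length is controlled entirely by the learning rate rather than by the gradient scale; in this scheme the momentum average, rather than a large batch, plays the variance-control role.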
