Stochastic Optimization with Non-stationary Noise

We investigate stochastic optimization problems under relaxed assumptions on the noise distribution, motivated by empirical observations in neural network training. Standard results on optimal convergence rates for stochastic optimization assume either a uniform bound on the moments of the gradient noise or that the noise decays as the algorithm progresses. These assumptions do not match the empirical behavior of optimization algorithms used in neural network training, where the noise level in stochastic gradients can even increase with time. We address this behavior by studying convergence rates of stochastic gradient methods when the second moment (or variance) of the stochastic oracle changes as the iterations progress. When the variation in the noise is known, we show that it is always beneficial to adapt the step size and exploit the noise variability. When the noise statistics are unknown, we obtain similar improvements by developing an online estimator of the noise level, thereby recovering close variants of RMSProp. Consequently, our results reveal an important scenario in which adaptive step-size methods outperform SGD.
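To make the idea concrete, the sketch below shows an update of the kind the abstract alludes to: an exponential moving average of squared stochastic gradients serves as an online estimate of the noise level, and the step size shrinks when that estimate grows, recovering an RMSProp-like rule. This is a minimal illustration under assumed choices (the toy quadratic objective, the time-growing noise model, and hyperparameters such as `beta`, `base_lr`, and `eps` are not from the paper).

```python
import numpy as np

def stochastic_grad(x, t, rng):
    """Gradient of f(x) = 0.5 * ||x||^2 plus noise whose scale grows with t (assumed toy model)."""
    noise_scale = 1.0 + 0.01 * t  # non-stationary: noise level increases over time
    return x + noise_scale * rng.standard_normal(x.shape)

def rmsprop_like(x0, steps=1000, base_lr=0.5, beta=0.99, eps=1e-8, seed=0):
    """RMSProp-style SGD: step size adapts to an online estimate of the gradient's second moment."""
    rng = np.random.default_rng(seed)
    x = x0.astype(float).copy()
    v = np.zeros_like(x)  # running estimate of E[g^2], acting as the noise-level estimator
    for t in range(steps):
        g = stochastic_grad(x, t, rng)
        v = beta * v + (1.0 - beta) * g**2       # update the online second-moment estimate
        x -= base_lr * g / (np.sqrt(v) + eps)    # larger estimated noise -> smaller effective step
    return x

print(np.linalg.norm(rmsprop_like(np.ones(10))))
```

With known noise statistics, the `v` estimator would be replaced by the true (time-varying) variance schedule; the abstract's point is that either way, scaling the step size by the noise level improves over a fixed-step SGD baseline.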
