Acceleration and Averaging in Stochastic Descent Dynamics

We formulate and study a general family of (continuous-time) stochastic dynamics for accelerated first-order minimization of smooth convex functions. Building on an averaging formulation of accelerated mirror descent, we propose a stochastic variant in which the gradient is contaminated by noise, and study the resulting stochastic differential equation. We prove a bound on the rate of change of an energy function associated with the problem, then use it to derive estimates of convergence rates of the function values (almost surely and in expectation), both for persistent and asymptotically vanishing noise. We discuss the interaction between the parameters of the dynamics (learning rate and averaging rates) and the covariation of the noise process. In particular, we show how the asymptotic rate of covariation affects the choice of parameters and, ultimately, the convergence rate.
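As a rough illustration of the kind of dynamics the abstract refers to, the averaging formulation of accelerated mirror descent with a noisy gradient can be written as a coupled SDE. The equations, parameter names (a(t), eta(t), sigma(t), r(t)), and energy function below are a hedged sketch in the spirit of prior work on averaging dynamics, not the paper's exact formulation:

\[
  dZ_t = -\eta(t)\,\nabla f(X_t)\,dt + \eta(t)\,\sigma(t)\,dB_t,
  \qquad
  dX_t = a(t)\,\big(\nabla\psi^*(Z_t) - X_t\big)\,dt,
\]

where Z_t is the dual (averaged-gradient) variable, \psi^* is the convex conjugate of the mirror map \psi, B_t is a Brownian motion modeling the gradient noise with instantaneous covariation \sigma(t)^2, a(t) is the averaging rate, and \eta(t) is the learning rate. A candidate energy function, in the spirit of Lyapunov analyses of accelerated dynamics, is

\[
  E_t = r(t)\,\big(f(X_t) - f(x^\star)\big) + D_{\psi^*}\!\big(Z_t,\,\nabla\psi(x^\star)\big),
\]

for a weight function r(t) tied to a(t) and \eta(t). Bounding the rate of change of E_t via Itô's formula introduces a term driven by the quadratic covariation of the noise, of the order of \int_0^t \eta(s)^2\,\sigma(s)^2\,ds; the admissible growth of r(t), and hence the resulting convergence rate of f(X_t) - f(x^\star), is then dictated by how quickly this covariation grows or vanishes.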
