A variational perspective on accelerated methods in optimization

Significance

Optimization problems arise naturally in statistical machine learning and other fields concerned with data analysis. The rapid growth in the scale and complexity of modern datasets has led to a focus on gradient-based methods and also on the class of accelerated methods, first proposed by Nesterov in 1983. Accelerated methods achieve faster convergence rates than gradient methods and indeed, under certain conditions, they achieve optimal rates. However, accelerated methods are not descent methods and remain a conceptual mystery. We propose a variational, continuous-time framework for understanding accelerated methods. We provide a systematic methodology for converting accelerated higher-order methods from continuous time to discrete time. Our work illuminates a class of dynamics that may be useful for designing better algorithms for optimization.

Abstract

Accelerated gradient methods play a central role in optimization, achieving optimal rates in many settings. Although many generalizations and extensions of Nesterov’s original acceleration method have been proposed, it is not yet clear what the natural scope of the acceleration concept is. In this paper, we study accelerated methods from a continuous-time perspective. We show that there is a Lagrangian functional that we call the Bregman Lagrangian, which generates a large class of accelerated methods in continuous time, including (but not limited to) accelerated gradient descent, its non-Euclidean extension, and accelerated higher-order gradient methods. We show that the continuous-time limit of all of these methods corresponds to traveling the same curve in spacetime at different speeds. From this perspective, Nesterov’s technique and many of its generalizations can be viewed as a systematic way to go from the continuous-time curves generated by the Bregman Lagrangian to a family of discrete-time accelerated algorithms.
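The statement that accelerated methods converge faster than plain gradient methods can be checked numerically on a small example. The sketch below is illustrative only and is not taken from the paper: the ill-conditioned diagonal quadratic, the function names, and the momentum coefficient (k - 1)/(k + 2), one common form of Nesterov's scheme for smooth convex objectives, are all assumptions chosen for this demonstration.

```python
# Illustrative sketch (assumed test problem, hypothetical function names).
# Objective: f(x) = 0.5 * sum_i lams[i] * x[i]**2, minimized at x = 0 with f = 0.

def f(x, lams):
    return 0.5 * sum(l * xi * xi for l, xi in zip(lams, x))

def grad(x, lams):
    return [l * xi for l, xi in zip(lams, x)]

def gradient_descent(x0, lams, step, iters):
    """Plain gradient descent with a fixed step size."""
    x = list(x0)
    for _ in range(iters):
        g = grad(x, lams)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x

def accelerated_gradient(x0, lams, step, iters):
    """Nesterov-style acceleration: a gradient step taken at an
    extrapolated point y, followed by a momentum update of y."""
    x_prev = list(x0)
    y = list(x0)
    for k in range(1, iters + 1):
        g = grad(y, lams)
        x = [yi - step * gi for yi, gi in zip(y, g)]  # gradient step from y
        momentum = (k - 1) / (k + 2)                  # assumed momentum schedule
        y = [xi + momentum * (xi - xpi) for xi, xpi in zip(x, x_prev)]
        x_prev = x
    return x_prev

# Assumed problem data: eigenvalues spread over three orders of magnitude.
lams = [1e-3, 1e-2, 1e-1, 1.0]
x0 = [1.0, 1.0, 1.0, 1.0]
step = 1.0 / max(lams)  # step size 1/L, where L is the largest eigenvalue
iters = 100

gd_gap = f(gradient_descent(x0, lams, step, iters), lams)
agd_gap = f(accelerated_gradient(x0, lams, step, iters), lams)
print("gradient descent gap:    ", gd_gap)
print("accelerated gradient gap:", agd_gap)
```

With the same step size and iteration budget, the accelerated variant should end with a smaller objective gap on this problem. Its objective values are not guaranteed to decrease at every iteration, which is the sense in which such methods are not descent methods.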
