Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives

We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis encompasses discrete time and continuous time, as well as time-invariant and time-variant formulations, and is not limited to a convex or Euclidean setting. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and provides a characterization of algorithms that exhibit accelerated convergence.
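To make the setting concrete, the following is a minimal sketch (not taken from the article) of two momentum-based iterations on a toy quadratic: Polyak's heavy-ball method and a semi-implicit, conformal-symplectic Euler discretization of the damped second-order dynamics ẍ + γẋ + ∇f(x) = 0. The step size h, damping γ, momentum β, and the ill-conditioned test objective are illustrative assumptions, not parameters or results from the paper.

```python
import numpy as np

def grad_f(x):
    """Gradient of the illustrative quadratic f(x) = 0.5 * x^T A x (assumed test problem)."""
    A = np.diag([1.0, 10.0])  # mildly ill-conditioned, chosen for illustration only
    return A @ x

def heavy_ball(x0, alpha=0.05, beta=0.9, iters=200):
    """Polyak's heavy-ball iteration: x_{k+1} = x_k - alpha * grad f(x_k) + beta * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x, x_prev = x - alpha * grad_f(x) + beta * (x - x_prev), x
    return x

def conformal_symplectic(x0, h=0.1, gamma=1.0, iters=200):
    """Semi-implicit Euler step for the damped dynamics  x' = p,  p' = -gamma * p - grad f(x).

    The momentum is damped and updated with the gradient first; the position is then
    advanced with the *new* momentum. This splitting mirrors the conformal symplectic
    integrators discussed in the symplectic-optimization literature.
    """
    x, p = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        p = np.exp(-gamma * h) * p - h * grad_f(x)  # damped momentum update
        x = x + h * p                                # position update with the new momentum
    return x

if __name__ == "__main__":
    x0 = np.array([1.0, 1.0])
    print("heavy ball:          ", heavy_ball(x0))
    print("conformal symplectic:", conformal_symplectic(x0))
```

Both iterations drive the iterate toward the minimizer of the quadratic; the second one can be read as a structure-preserving discretization of the continuous-time momentum dynamics, which is the kind of connection the article analyzes.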
