Optimization with Momentum: Dynamical, Control-Theoretic, and Symplectic Perspectives

We analyze the convergence rate of various momentum-based optimization algorithms from a dynamical systems point of view. Our analysis exploits fundamental topological properties, such as the continuous dependence of iterates on their initial conditions, to provide a simple characterization of convergence rates. In many cases, closed-form expressions are obtained that relate algorithm parameters to the convergence rate. The analysis covers discrete-time and continuous-time formulations, both time-invariant and time-varying, and is not restricted to convex or Euclidean settings. In addition, the article rigorously establishes why symplectic discretization schemes are important for momentum-based optimization algorithms, and it provides a characterization of algorithms that exhibit accelerated convergence.
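
To make the class of algorithms concrete, the sketch below contrasts Polyak's heavy-ball iteration with a semi-implicit (symplectic) Euler discretization of the momentum dynamics x'' + d*x' + grad f(x) = 0. This is a minimal illustration, not the article's method: the quadratic test function, step size, damping coefficient, and function names are assumptions made for this sketch only.

```python
import numpy as np


def grad_f(x):
    """Gradient of an illustrative strongly convex quadratic f(x) = 0.5 * x^T A x."""
    A = np.diag([1.0, 10.0])
    return A @ x


def heavy_ball(x0, step=0.05, momentum=0.8, iters=200):
    """Polyak's heavy-ball method:
    x_{k+1} = x_k - step * grad f(x_k) + momentum * (x_k - x_{k-1})."""
    x_prev, x = x0.copy(), x0.copy()
    for _ in range(iters):
        x_next = x - step * grad_f(x) + momentum * (x - x_prev)
        x_prev, x = x, x_next
    return x


def symplectic_euler(x0, step=0.05, damping=2.0, iters=200):
    """Semi-implicit (symplectic) Euler discretization of the momentum dynamics
    dx/dt = v,  dv/dt = -damping * v - grad f(x):
    the velocity is updated first, and the fresh velocity is used to update x."""
    x, v = x0.copy(), np.zeros_like(x0)
    for _ in range(iters):
        v = v + step * (-damping * v - grad_f(x))  # velocity update (explicit in x)
        x = x + step * v                           # position update (uses the new v)
    return x


if __name__ == "__main__":
    x0 = np.array([5.0, -3.0])
    print("heavy ball:      ", heavy_ball(x0))
    print("symplectic Euler:", symplectic_euler(x0))
```

The design choice being illustrated is the update order in the discretization: updating the velocity before the position (rather than both from the old state, as in explicit Euler) is what gives the scheme its structure-preserving character, which is the kind of property the article connects to the behavior of momentum methods.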
