First-order methods almost always avoid saddle points: The case of vanishing step-sizes

We establish that first-order methods avoid saddle points for almost all initializations. Our results apply to a wide variety of first-order methods, including gradient descent, block coordinate descent, mirror descent, and variants thereof. The connecting thread is that such algorithms can be studied from a dynamical systems perspective in which appropriate instantiations of the Stable Manifold Theorem allow for a global stability analysis. Thus, neither access to second-order derivative information nor randomness beyond initialization is necessary to provably avoid saddle points.

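As a concrete illustration of the claim (a minimal sketch, not taken from the paper), the snippet below runs plain gradient descent on a toy two-dimensional objective with a strict saddle at the origin and two minima; the objective, step size, and iteration count are arbitrary choices made for illustration only.

```python
import numpy as np

# Toy objective with a strict saddle at the origin and minima at (0, +1) and (0, -1):
#   f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2
# The Hessian at (0, 0) is diag(1, -1), so the origin is a strict saddle, and the
# set of initializations attracted to it under gradient descent is the line y = 0,
# which has Lebesgue measure zero.

def grad(p):
    x, y = p
    return np.array([x, y**3 - y])

def gradient_descent(p0, step=0.05, iters=2000):
    # Iterate the gradient-descent map g(p) = p - step * grad_f(p).
    p = np.array(p0, dtype=float)
    for _ in range(iters):
        p = p - step * grad(p)
    return p

rng = np.random.default_rng(0)
limits = [gradient_descent(rng.uniform(-2.0, 2.0, size=2)) for _ in range(1000)]

# Every random initialization ends up near a minimum (0, +1) or (0, -1), never at
# the saddle: the second coordinate of the limit stays bounded away from zero.
print(min(abs(p[1]) for p in limits))

# Initializing exactly on the stable manifold (y = 0) does converge to the saddle,
# but such initializations form a measure-zero set.
print(gradient_descent([2.0, 0.0]))
```

The point of the experiment is the mechanism the paper formalizes: the initializations that converge to a strict saddle are exactly the stable manifold of that saddle for the gradient-descent map, and the Stable Manifold Theorem guarantees this set has measure zero, so no second-order information or added noise is needed.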