On Complexity of Finding Stationary Points of Nonsmooth Nonconvex Functions

We provide the first non-asymptotic analysis for finding stationary points of nonsmooth, nonconvex functions. In particular, we study the class of Hadamard semi-differentiable functions, perhaps the largest class of nonsmooth functions for which the chain rule of calculus holds. This class includes ReLU neural networks and other networks with non-differentiable activation functions. We first show that finding an $\epsilon$-stationary point with first-order methods is impossible in finite time. We then introduce the notion of $(\delta, \epsilon)$-stationarity, which allows an $\epsilon$-approximate gradient to be a convex combination of generalized gradients evaluated at points within distance $\delta$ of the solution. We propose a series of randomized first-order methods and analyze their complexity for finding a $(\delta, \epsilon)$-stationary point. Furthermore, we provide a lower bound and show that our stochastic algorithm has min-max optimal dependence on $\delta$. Empirically, our methods perform well for training ReLU neural networks.
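The notion of $(\delta, \epsilon)$-stationarity described above admits a compact formalization via the Goldstein $\delta$-subdifferential; the display below is a sketch consistent with the abstract's wording rather than a verbatim statement from the paper. A point $x$ is $(\delta, \epsilon)$-stationary for a locally Lipschitz function $f$ if
$$
\min\bigl\{ \|g\| : g \in \partial_{\delta} f(x) \bigr\} \le \epsilon,
\qquad
\partial_{\delta} f(x) := \mathrm{conv}\Bigl( \textstyle\bigcup_{y \in \mathbb{B}(x,\delta)} \partial f(y) \Bigr),
$$
where $\partial f(y)$ is the Clarke generalized gradient and $\mathbb{B}(x,\delta)$ is the closed ball of radius $\delta$ centered at $x$. In words, some convex combination of generalized gradients taken at points within distance $\delta$ of $x$ has norm at most $\epsilon$.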
