Minimizing Nonconvex Population Risk from Rough Empirical Risk

Population risk, the expectation of the loss over the sampling mechanism, is of primary interest in machine learning. However, learning algorithms have access only to the empirical risk, the average loss over the training examples. Although the two risks are typically guaranteed to be pointwise close, for applications with nonconvex, nonsmooth losses (such as modern deep networks) the effects of sampling can transform a well-behaved population risk into an empirical risk whose landscape is problematic for optimization. The empirical risk can be nonsmooth, and it may have many additional local minima. This paper considers a general optimization framework that aims to find approximate local minima of a smooth nonconvex function $F$ (population risk) given access only to function values of another function $f$ (empirical risk) that is pointwise close to $F$ (i.e., $\|F-f\|_{\infty} \le \nu$). We propose a simple algorithm based on stochastic gradient descent (SGD) on a smoothed version of $f$ that is guaranteed to find an $\epsilon$-second-order stationary point of $F$ whenever $\nu \le O(\epsilon^{1.5}/d)$, thus escaping all saddle points of $F$ and all the additional local minima introduced by $f$. We also provide an almost matching lower bound showing that our SGD-based approach achieves the optimal trade-off between $\nu$ and $\epsilon$, as well as the optimal dependence on the problem dimension $d$, among all algorithms making a polynomial number of queries. As a concrete example, we show that our results can be directly used to give sample complexities for learning a ReLU unit, whose empirical risk is nonsmooth.
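The core idea is to run SGD on a Gaussian smoothing of $f$, estimating gradients of the smoothed objective from function values of $f$ alone. The sketch below is a minimal illustration of that idea under stated assumptions, not the paper's exact procedure (which prescribes the smoothing radius, mini-batch size, and perturbation schedule needed for the $\nu \le O(\epsilon^{1.5}/d)$ guarantee); the function `smoothed_sgd` and the parameters `sigma`, `eta`, `batch`, and `iters` are illustrative choices.

```python
import numpy as np

def smoothed_sgd(f, x0, sigma=0.1, eta=0.01, batch=100, iters=2000, seed=0):
    """Zeroth-order SGD on the Gaussian smoothing of f (illustrative sketch).

    The smoothed objective is F_sigma(x) = E_{z ~ N(0, sigma^2 I)}[f(x + z)],
    whose gradient equals E_z[z * (f(x + z) - f(x))] / sigma^2, so it can be
    estimated from function values of f alone.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    d = x.size
    for _ in range(iters):
        z = rng.normal(scale=sigma, size=(batch, d))        # Gaussian perturbations
        fx = f(x)
        diffs = np.array([f(x + zi) for zi in z]) - fx      # f(x+z) - f(x), shape (batch,)
        g = (diffs[:, None] * z).mean(axis=0) / sigma**2    # gradient estimate of F_sigma
        x -= eta * g                                        # plain SGD step
    return x

# Toy usage: a smooth "population risk" F and a rough, nonsmooth "empirical risk" f
# that is pointwise close to F; smoothing averages out the high-frequency wiggles of f.
if __name__ == "__main__":
    F = lambda x: 0.5 * np.sum(x ** 2)                          # smooth population risk
    f = lambda x: F(x) + 0.01 * np.sum(np.abs(np.sin(50 * x)))  # pointwise-close, nonsmooth f
    x_hat = smoothed_sgd(f, x0=np.ones(5))
    print(F(x_hat))  # close to the population optimum F(0) = 0
```

Subtracting the baseline value $f(x)$ inside the estimator does not change its expectation, since the perturbation has mean zero, but it substantially reduces the variance of the gradient estimate.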
