Small nonlinearities in activation functions create bad local minima in neural networks

We investigate the loss surface of neural networks. We prove that even for one-hidden-layer networks with the "slightest" nonlinearity, the empirical risk has spurious local minima in most cases. Our results thus indicate that, in general, "no spurious local minima" is a property limited to deep linear networks, and insights obtained from linear networks may not transfer to their nonlinear counterparts. Specifically, for ReLU(-like) networks we constructively prove that for almost all practical datasets there exist infinitely many spurious local minima. We also present a counterexample for more general activations (sigmoid, tanh, arctan, ReLU, etc.) for which a bad local minimum exists. Our results make the least restrictive assumptions among existing results on spurious local optima in neural networks. We complete the discussion with a comprehensive characterization of global optimality for deep linear networks, which unifies other results on this topic.
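To make the phenomenon concrete, below is a minimal numerical sketch, not the construction from the paper: it uses a tiny hand-picked 1-D dataset, a one-hidden-layer ReLU network with two hidden units, and a candidate parameter point at which both hidden units are inactive on every training example. Small random perturbations never decrease the empirical risk, so the point behaves as a spurious local minimum even though the same network can fit the data exactly. The dataset, network width, and probing procedure are all illustrative assumptions.

import numpy as np

# Toy 1-D dataset (illustrative assumption, not the paper's construction):
# the target |x| is exactly representable by a 2-unit ReLU network.
X = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
Y = np.abs(X)

def relu(z):
    return np.maximum(z, 0.0)

def empirical_risk(params):
    # Squared loss of y_hat = v1*relu(w1*x + b1) + v2*relu(w2*x + b2) + c.
    w1, b1, w2, b2, v1, v2, c = params
    preds = v1 * relu(w1 * X + b1) + v2 * relu(w2 * X + b2) + c
    return 0.5 * np.mean((preds - Y) ** 2)

# Candidate point: both ReLU units are dead (pre-activations strictly negative
# on all inputs), so the network outputs the constant c = mean(Y), which is the
# best constant fit. Nearby parameters keep the units dead, hence no descent.
candidate = np.array([0.0, -1.0, 0.0, -1.0, 1.0, 1.0, Y.mean()])
base = empirical_risk(candidate)

# The global minimum is 0: relu(x) + relu(-x) = |x| fits Y exactly.
global_params = np.array([1.0, 0.0, -1.0, 0.0, 1.0, 1.0, 0.0])

# Probe random small perturbations; none should decrease the risk.
rng = np.random.default_rng(0)
no_descent = all(
    empirical_risk(candidate + 1e-3 * rng.standard_normal(7)) >= base - 1e-12
    for _ in range(5000)
)
print(f"risk at candidate: {base:.4f}")
print(f"risk at global fit: {empirical_risk(global_params):.4f}")
print(f"local perturbations never improve: {no_descent}")

This dead-unit construction is only a classical illustration of a spurious local minimum; the paper's results are much stronger, covering almost all practical datasets and more general activations.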
