Spurious local minima in neural networks: a critical view

We investigate the loss surface of nonlinear neural networks. We prove that even for networks with one hidden layer and the "slightest" nonlinearity, there can be spurious local minima. Our results thus indicate that, in general, "no spurious local minima" is a property limited to deep linear networks. Specifically, for ReLU(-like) networks we prove that, for almost all practical datasets (in contrast to previous results), there exist infinitely many local minima. We also present a counterexample for more general activation functions (e.g., sigmoid, tanh, arctan, ReLU), for which there exists a local minimum strictly inferior to the global minimum. Our results make the least restrictive assumptions compared with existing results on local optimality in neural networks. We complete our discussion by presenting a comprehensive characterization of global optimality for deep linear networks, which unifies and subsumes other results on this topic.
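The results above are theoretical, but their flavor can be seen empirically. The following is a minimal illustrative sketch (not any construction from the paper): it trains a small one-hidden-layer ReLU network from several random initializations on a tiny dataset of our own choosing and reports the final losses. Runs that settle at visibly different loss values are consistent with, though of course not a proof of, the existence of spurious local minima; the dataset, network width, and hyperparameters below are illustrative assumptions.

```python
# Illustrative sketch: one-hidden-layer ReLU regression trained by plain
# gradient descent from several seeds. Different final losses across seeds
# are consistent with (but do not prove) spurious local minima.
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def loss(params, X, y):
    W1, b1, w2, b2 = params
    h = relu(X @ W1 + b1)      # hidden activations, shape (n, k)
    pred = h @ w2 + b2         # scalar predictions, shape (n,)
    return 0.5 * np.mean((pred - y) ** 2)

def grads(params, X, y):
    # Backpropagation for the squared loss above.
    W1, b1, w2, b2 = params
    n = X.shape[0]
    z = X @ W1 + b1
    h = relu(z)
    pred = h @ w2 + b2
    err = (pred - y) / n       # dL/dpred
    dw2 = h.T @ err
    db2 = err.sum()
    dh = np.outer(err, w2)
    dz = dh * (z > 0)          # ReLU derivative (taken as 0 at the kink)
    dW1 = X.T @ dz
    db1 = dz.sum(axis=0)
    return dW1, db1, dw2, db2

def train(seed, X, y, hidden=2, lr=0.05, steps=20000):
    rng = np.random.default_rng(seed)
    params = [
        rng.normal(size=(X.shape[1], hidden)),  # W1
        rng.normal(size=hidden),                # b1
        rng.normal(size=hidden),                # w2
        rng.normal(),                           # b2
    ]
    for _ in range(steps):
        params = [p - lr * g for p, g in zip(params, grads(params, X, y))]
    return loss(params, X, y)

if __name__ == "__main__":
    # A tiny 1-D dataset that a width-2 ReLU network cannot interpolate
    # (fitting it would require more than two slope changes).
    X = np.array([[-2.0], [-1.0], [0.0], [1.0], [2.0]])
    y = np.array([1.0, 0.0, 1.0, 0.0, 1.0])
    for seed in range(8):
        print(f"seed {seed}: final loss = {train(seed, X, y):.4f}")
```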