Theory of Deep Learning III: the non-overfitting puzzle

A main puzzle of deep networks is the apparent absence of overfitting, understood as robustness of the expected error against overparametrization, despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated with gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to a gradient system in a quadratic potential with a degenerate Hessian (for the square loss) or an almost degenerate Hessian (for the logistic or cross-entropy loss). The proposition rests on the qualitative theory of dynamical systems and is supported by numerical results. The result extends to deep nonlinear networks two key properties of gradient descent for linear networks, which have recently been recognized (1) to provide a form of implicit regularization:

1. For classification, which is the main application of today's deep networks, gradient descent converges asymptotically to the maximum-margin solution when minimizing loss functions such as the logistic, cross-entropy, and exponential losses (see the first sketch after this list). The maximum-margin solution guarantees good classification error for "low noise" datasets, and this property holds independently of the initial conditions. Because of it, our proposition guarantees a maximum-margin solution also for deep nonlinear networks.

2. Gradient descent enforces a form of implicit regularization controlled by the number of iterations, and for appropriate initial conditions it converges asymptotically to the minimum-norm solution (see the second sketch below). This implies that there is usually an optimal early stopping point that avoids overfitting of the expected risk. This property, valid for the square loss and many other loss functions, is especially relevant for regression. For deep nonlinear networks, however, the solution is not expected to be strictly minimum norm, unlike in the linear case.

The robustness to overparametrization has suggestive implications for the robustness of deep convolutional network architectures with respect to the curse of dimensionality.
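
As an illustration of property 1, here is a minimal sketch (not from the paper; the toy data, learning rate, and iteration counts are arbitrary choices): plain gradient descent on the logistic loss for a linear classifier over separable data. The weight norm diverges, but the normalized direction w/||w|| stabilizes and the normalized margin increases, which is the sense in which the iterates approach the maximum-margin separator, independently of the (zero) initialization.

```python
# Minimal sketch (illustrative only): implicit bias of gradient descent on the
# logistic loss for a linear classifier over linearly separable toy data.
# ||w|| grows without bound, but w/||w|| stabilizes and the normalized margin
# increases, i.e. the iterates approach the maximum-margin separator.
import numpy as np

rng = np.random.default_rng(0)

# Two separable Gaussian blobs with labels in {-1, +1}.
X = np.vstack([rng.normal([+2, +2], 0.5, size=(50, 2)),
               rng.normal([-2, -2], 0.5, size=(50, 2))])
y = np.concatenate([np.ones(50), -np.ones(50)])

w = np.zeros(2)
lr = 0.1
for t in range(1, 200001):
    margins = y * (X @ w)
    # Gradient of the average logistic loss  log(1 + exp(-y_i w.x_i)).
    grad = -(X * (y / (1.0 + np.exp(margins)))[:, None]).mean(axis=0)
    w -= lr * grad
    if t % 50000 == 0:
        norm = np.linalg.norm(w)
        print(f"iter {t:6d}  ||w|| = {norm:7.2f}  direction = {w / norm}  "
              f"normalized margin = {margins.min() / norm:.4f}")
```

Running longer keeps increasing ||w|| (roughly logarithmically in the number of iterations) while the printed direction barely changes: the classifier, not the loss value, is what converges.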

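Property 2 can be seen in the simplest overparametrized setting, linear least squares with more parameters than samples. The sketch below (again only illustrative; dimensions and step size are arbitrary) runs gradient descent from zero initialization: the iterates stay in the row space of the data matrix and converge to the minimum-norm interpolant given by the pseudoinverse, while the norm of an early-stopped iterate is much smaller, so the number of iterations acts as an implicit regularization parameter. The Hessian of this quadratic loss, X^T X, has rank at most n < d, the simplest instance of the degenerate Hessian mentioned above.

```python
# Minimal sketch (illustrative only): gradient descent on the square loss in an
# overparameterized linear model, started from w = 0, converges to the
# minimum-norm interpolating solution (the pseudoinverse solution); the number
# of iterations controls ||w|| and thus acts as an implicit regularizer.
import numpy as np

rng = np.random.default_rng(1)

n, d = 20, 100                              # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w_pinv = np.linalg.pinv(X) @ y              # minimum-norm interpolant

w = np.zeros(d)
lr = 1.0 / np.linalg.norm(X, 2) ** 2        # step 1/sigma_max(X)^2, stable range
for t in range(1, 2001):
    grad = X.T @ (X @ w - y)                # gradient of 0.5 * ||Xw - y||^2
    w -= lr * grad
    if t in (1, 10, 100, 2000):
        print(f"iter {t:5d}  train MSE = {np.mean((X @ w - y) ** 2):.2e}  "
              f"||w|| = {np.linalg.norm(w):.3f}  "
              f"||w - w_pinv|| = {np.linalg.norm(w - w_pinv):.2e}")

# The Hessian X^T X is degenerate: its rank is at most n < d.
print("rank of Hessian X^T X:", np.linalg.matrix_rank(X.T @ X), "of", d)
```

Stopping at t = 10 or t = 100 gives an estimate with much smaller norm than the interpolant, which is the sense in which the iteration count plays the role of a regularization parameter and an optimal early stopping point can exist.
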
[1] J. Czipszer, et al. Sur l'approximation d'une fonction périodique et de ses dérivées successives par un polynome trigonométrique et par ses dérivées successives, 1958.

[2] Kurt Hornik, et al. Neural networks and principal component analysis: Learning from examples without local minima, 1989, Neural Networks.

[3] Hrushikesh Narhar Mhaskar, et al. Approximation properties of a multilayered feedforward artificial neural network, 1993, Adv. Comput. Math.

[4] Peter L. Bartlett, et al. Neural Network Learning: Theoretical Foundations, 1999.

[5] B. Aulbach, et al. The Hartman-Grobman theorem for Carathéodory-type differential equations in Banach spaces, 2000.

[6] Hee-ra Kim. The Structure of Hope in Waiting for Godot, 2003.

[7] Y. Yao, et al. On Early Stopping in Gradient Descent Learning, 2007.

[8] Shie Mannor, et al. Robustness and Regularization of Support Vector Machines, 2008, J. Mach. Learn. Res.

[9] Surya Ganguli, et al. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013, ICLR.

[10] Lorenzo Rosasco, et al. Learning with Incremental Iterative Regularization, 2014, NIPS.

[11] Kenji Kawaguchi, et al. Deep Learning without Poor Local Minima, 2016, NIPS.

[12] Yoram Singer, et al. Train faster, generalize better: Stability of stochastic gradient descent, 2015, ICML.

[13] Yann LeCun, et al. Singularity of the Hessian in Deep Learning, 2016, ArXiv.

[14] T. Poggio, et al. Deep vs. shallow networks: An approximation theory perspective, 2016, ArXiv.

[15] Lorenzo Rosasco, et al. Optimal Rates for Multi-pass Stochastic Gradient Methods, 2016, J. Mach. Learn. Res.

[16] Samy Bengio, et al. Understanding deep learning requires rethinking generalization, 2016, ICLR.

[17] Razvan Pascanu, et al. Sharp Minima Can Generalize For Deep Nets, 2017, ICML.

[18] Tomaso A. Poggio, et al. Theory II: Landscape of the Empirical Risk in Deep Learning, 2017, ArXiv.

[19] Tengyu Ma, et al. Identity Matters in Deep Learning, 2016, ICLR.

[20] Noah Golowich, et al. Musings on Deep Learning: Properties of SGD, 2017.

[21] Lorenzo Rosasco, et al. Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2016, International Journal of Automation and Computing.

[22] Nathan Srebro, et al. The Implicit Bias of Gradient Descent on Separable Data, 2017, J. Mach. Learn. Res.

[23] Tomaso A. Poggio, et al. Theory of Deep Learning IIb: Optimization Properties of SGD, 2018, ArXiv.

[24] Tomaso A. Poggio, et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks, 2017, AISTATS.