Memo No. 067, June 27, 2017. Theory of Deep Learning III: Generalization Properties of SGD

In Theory III we characterize, with a mix of theory and experiments, the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent (SGD) in classification tasks. A currently perceived puzzle is that deep networks show good predictive performance even though classical learning theory seems to suggest overfitting. We show that these empirical results can be explained by the classical theory together with the following new results on SGD:

1. SGD concentrates in probability, like the classical Langevin equation, on large-volume, “flat” minima, selecting possibly degenerate minimizers that are also global minimizers.

2. Minimization under a constraint of maximum volume (usually corresponding to flatness) yields large-margin classification in the case of separable data (zero empirical error under the classification loss).

3. SGD for linear degenerate networks converges to the minimum-norm solution (see the numerical sketch below). Here we consider the polynomial f(x) describing the function computed by the network when each ReLU is replaced by a univariate polynomial approximant. In this case f(x) is linear in the vector X comprising all the relevant monomials, implying that SGD converges to the minimum-norm solution in X. In particular, this implies that the expected error does not change under overparametrization, going from W = N to W > N, where N is the number of training data and W is the number of weights, assuming that the training error is zero (i.e., the target function is in the hypothesis space of the network).

Thus SGD maximizes flatness and margin, performing implicit regularization (we show these properties rigorously in the case of linear networks). This qualitatively explains the puzzling findings that deep networks can fit randomly labeled data while still performing well on naturally labeled data. It also explains why overparametrization does not result in overfitting. This is version 2; the first version was released on 04/04/2017 at https://dspace.mit.edu/handle/1721.1/107841.
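
The minimum-norm claim in point 3 can be illustrated with a small numerical experiment. The sketch below (Python with NumPy, written for this note rather than taken from the paper) sets up an overparametrized linear regression with W > N, runs gradient descent from a zero initialization, and compares the result with the pseudoinverse solution. The dimensions, step size, and iteration count are illustrative assumptions, and full-batch gradient descent stands in for SGD for simplicity.

import numpy as np

rng = np.random.default_rng(0)
N, W = 20, 50                                # N training points, W > N weights
X = rng.standard_normal((N, W))              # data matrix (rows = examples)
y = rng.standard_normal(N)                   # targets; interpolation is possible since W > N

w = np.zeros(W)                              # start at zero, i.e. in the row space of X
lr, steps = 0.05, 20000
for _ in range(steps):
    grad = X.T @ (X @ w - y) / N             # gradient of (1/2N) * ||X w - y||^2
    w -= lr * grad

w_min_norm = np.linalg.pinv(X) @ y           # explicit minimum-norm interpolant
print("training residual:", np.linalg.norm(X @ w - y))                   # ~ 0 (zero empirical error)
print("distance to min-norm solution:", np.linalg.norm(w - w_min_norm))  # ~ 0

Because the iterates never leave the row space of X, the interpolating solution they reach is the minimum-norm one, matching the behavior that point 3 attributes to SGD on linear degenerate networks.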
