Memo No. 067, December 26, 2017. Musings on Deep Learning: Properties of SGD

We ruminate, with a mix of theory and experiments, on the optimization and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent (SGD) in classification tasks. A currently perceived puzzle is that deep networks show good predictive performance even though their overparametrization relative to the number of training data suggests overfitting. We dream an explanation of these empirical results in terms of the following new results on SGD:

1. SGD concentrates in probability, like the classical Langevin equation, on large-volume, "flat" minima, selecting flat minimizers which are, with very high probability, also global minimizers.

2. Minimization by GD or SGD on flat minima can be approximated well by minimization of a linear function of the weights, suggesting pseudoinverse solutions.

3. Pseudoinverse solutions are known to be intrinsically regularized, with a regularization parameter λ that decreases as 1/T, where T is the number of iterations. This can qualitatively explain all the generalization properties empirically observed for deep networks.

4. GD and SGD are closely connected to robust optimization, which provides an alternative way to show that GD and SGD perform implicit regularization.

These results explain the puzzling finding that deep networks can fit randomly labeled data and yet perform well on naturally labeled data. They also explain why overparametrization does not result in overfitting. Quantitative, non-vacuous bounds are still missing, as has almost always been the case for most practical applications of machine learning. This is version 3, which differs from version 2 only in the title. The first version was released on 04/04/2017 at https://dspace.mit.edu/handle/1721.1/107841.
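
As a toy illustration of point 1, the sketch below is not from the paper: the one-dimensional loss, step size, and temperature are assumptions chosen purely for illustration. It runs Langevin-style noisy gradient descent on a loss with one sharp and one wide, flat global minimum and measures where the iterates spend their time; the large-volume basin dominates.

```python
import numpy as np

# Toy 1-D loss with a sharp minimum at x = -1 and a wide, flat minimum at x = +1.
# Both minima reach zero loss, but the flat basin has ~10x the volume.
def loss(x):
    return np.minimum(50.0 * (x + 1.0) ** 2, 0.5 * (x - 1.0) ** 2)

def grad(x, eps=1e-5):
    # Numerical gradient; good enough for a scalar toy problem.
    return (loss(x + eps) - loss(x - eps)) / (2.0 * eps)

rng = np.random.default_rng(0)
x = rng.uniform(-3.0, 3.0)
lr, temperature, steps = 1e-2, 0.4, 200_000
trajectory = np.empty(steps)

for t in range(steps):
    # Langevin-style update: a gradient step plus Gaussian noise, a crude
    # stand-in for the noise that mini-batch sampling injects into SGD.
    noise = rng.normal(0.0, np.sqrt(2.0 * lr * temperature))
    x = x - lr * grad(x) + noise
    trajectory[t] = x

# Fraction of time spent near each minimum: the wide, flat basin dominates.
near_sharp = np.mean(np.abs(trajectory + 1.0) < 0.3)
near_flat = np.mean(np.abs(trajectory - 1.0) < 1.0)
print(f"time near sharp minimum: {near_sharp:.1%}, time near flat minimum: {near_flat:.1%}")
```

The noise term induces a stationary distribution proportional to exp(-loss/temperature), so the probability mass of each basin scales with its volume, which is the sense in which flat minima are selected.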
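
For points 2 and 3, a minimal sketch under a strong simplifying assumption: the network is replaced by a linear least-squares model with more parameters than data, and all names and constants below are illustrative rather than the paper's. Gradient descent started at zero converges to the minimum-norm interpolant given by the pseudoinverse, and stopping after t iterations behaves roughly like ridge regression with a regularization parameter of order 1/(lr·t).

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                       # fewer samples than parameters: overparametrized
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

def gd(num_steps, lr=1e-3):
    """Gradient descent on the square loss (1/2n)||Xw - y||^2, started at w = 0."""
    w = np.zeros(d)
    for _ in range(num_steps):
        w -= lr * X.T @ (X @ w - y) / n
    return w

# 1) Run long enough to converge: GD from zero reaches the minimum-norm
#    interpolant, i.e. the pseudoinverse solution.
w_gd = gd(200_000)
w_pinv = np.linalg.pinv(X) @ y
print("||w_gd - w_pinv|| =", np.linalg.norm(w_gd - w_pinv))

# 2) Stop early after t steps: the iterate roughly matches ridge regression
#    with a regularization parameter lambda that shrinks like 1/(lr * t).
lr, t_stop = 1e-3, 500
w_early = gd(t_stop, lr=lr)
lam = 1.0 / (lr * t_stop)
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print("||w_early - w_ridge|| =", np.linalg.norm(w_early - w_ridge),
      "  (||w_ridge|| =", np.linalg.norm(w_ridge), ")")
```

Starting at zero matters: the iterates never leave the row space of X, which is why gradient descent picks out the minimum-norm solution among the infinitely many interpolants.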
