Memo No. 90, September 8, 2019

Theory III: Dynamics and Generalization in Deep Networks*

Andrzej Banburski^1, Qianli Liao^1, Brando Miranda^1, Tomaso Poggio^1, Lorenzo Rosasco^1, Fernanda De La Torre^1, and Jack Hidary^2

^1 Center for Brains, Minds and Machines, MIT
^1 CSAIL, MIT
^2 Alphabet (Google) X

Abstract

The key to generalization is controlling the complexity of the network. However, there is no obvious control of complexity – such as an explicit regularization term – in the training of deep networks. We show that a classical form of norm control, albeit a hidden one, is responsible for generalization in deep networks trained with gradient descent techniques. In particular, gradient descent induces a dynamics of the normalized weights that converges to a hyperbolic equilibrium. Our approach extends some of the results of Srebro from linear networks to deep networks and provides a new perspective on the implicit bias of gradient descent. The elusive complexity control we describe is responsible, at least in part, for the puzzling empirical finding of good generalization despite overparametrization by deep networks.

* This replaces previous versions of Theory IIIa and Theory IIIb.

This material is based upon work supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
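To make the implicit bias mentioned in the abstract concrete, the following is a minimal numerical sketch (an illustrative assumption, not the paper's own construction): a linear classifier trained by plain gradient descent on an exponential loss over a toy linearly separable dataset. The unnormalized weight norm ||w|| keeps growing, while the normalized weights w/||w|| settle into a fixed (max-margin) direction. The dataset, learning rate, and iteration counts are arbitrary illustrative choices.

    # Minimal sketch of the implicit bias of gradient descent on separable data:
    # with an exponential loss, ||w|| grows without bound while w/||w|| converges.
    import numpy as np

    rng = np.random.default_rng(0)

    # Toy separable data: 50 points around (+1, +1) with label +1,
    # 50 points around (-1, -1) with label -1.
    X = np.vstack([rng.normal(loc=(+1.0, +1.0), scale=0.3, size=(50, 2)),
                   rng.normal(loc=(-1.0, -1.0), scale=0.3, size=(50, 2))])
    y = np.concatenate([np.ones(50), -np.ones(50)])

    w = 0.01 * rng.normal(size=2)   # small random initialization
    lr = 0.01                       # illustrative step size

    prev_dir = w / np.linalg.norm(w)
    for t in range(1, 20001):
        margins = y * (X @ w)
        # Gradient of L(w) = sum_n exp(-y_n <w, x_n>).
        grad = -(np.exp(-margins) * y) @ X
        w = w - lr * grad
        if t % 5000 == 0:
            direction = w / np.linalg.norm(w)
            drift = np.linalg.norm(direction - prev_dir)
            prev_dir = direction
            print(f"step {t:6d}  ||w|| = {np.linalg.norm(w):7.3f}  "
                  f"drift of w/||w|| over last 5000 steps = {drift:.2e}")

    # Expected behaviour: ||w|| grows slowly (roughly logarithmically in t),
    # while the drift of the normalized direction shrinks toward zero: the
    # normalized weights, not the raw ones, are what stabilize.

The exact numbers depend on the seed and step size; the qualitative pattern is what matters. It is this behavior of the normalized weights, rather than of the raw weights, that the paper argues provides the hidden norm control behind generalization in deep networks.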

References

[1] Kaifeng Lyu et al. Gradient Descent Maximizes the Margin of Homogeneous Neural Networks. ICLR, 2019.
[2] Mikhail Belkin et al. Two models of double descent for weak features. SIAM J. Math. Data Sci., 2019.
[3] Nathan Srebro et al. Lexicographic and Depth-Sensitive Margins in Homogeneous and Non-Homogeneous Deep Models. ICML, 2019.
[4] Lorenzo Rosasco et al. Theory III: Dynamics and Generalization in Deep Networks. arXiv, 2019.
[5] Phan-Minh Nguyen et al. Mean Field Limit of the Learning Dynamics of Multilayer Neural Networks. arXiv, 2019.
[6] Ruosong Wang et al. Fine-Grained Analysis of Optimization and Generalization for Overparameterized Two-Layer Neural Networks. ICML, 2019.
[7] Alexander Rakhlin et al. Consistency of Interpolation with Laplace Kernels is a High-Dimensional Phenomenon. COLT, 2018.
[8] Yuan Cao et al. Stochastic Gradient Descent Optimizes Over-parameterized Deep ReLU Networks. arXiv, 2018.
[9] Yuanzhi Li et al. Learning and Generalization in Overparameterized Neural Networks, Going Beyond Two Layers. NeurIPS, 2018.
[10] Liwei Wang et al. Gradient Descent Finds Global Minima of Deep Neural Networks. ICML, 2018.
[11] Barnabás Póczos et al. Gradient Descent Provably Optimizes Over-parameterized Neural Networks. ICLR, 2018.
[12] Xiao Zhang et al. Learning One-hidden-layer ReLU Networks via Gradient Descent. AISTATS, 2018.
[13] Tomaso A. Poggio et al. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. AISTATS, 2017.
[14] Adel Javanmard et al. Theoretical Insights Into the Optimization Landscape of Over-Parameterized Shallow Neural Networks. IEEE Transactions on Information Theory, 2017.
[15] Qiang Liu et al. On the Margin Theory of Feedforward Neural Networks. arXiv, 2018.
[16] Yuanzhi Li et al. Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data. NeurIPS, 2018.
[17] Tengyuan Liang et al. Just Interpolate: Kernel "Ridgeless" Regression Can Generalize. The Annals of Statistics, 2018.
[18] Wei Hu et al. Algorithmic Regularization in Learning Deep Homogeneous Models: Layers are Automatically Balanced. NeurIPS, 2018.
[19] Aleksander Madry et al. How Does Batch Normalization Help Optimization? (No, It Is Not About Internal Covariate Shift). NeurIPS, 2018.
[20] Andrea Montanari et al. A mean field view of the landscape of two-layer neural networks. Proceedings of the National Academy of Sciences, 2018.
[21] Tomaso A. Poggio et al. Theory of Deep Learning IIb: Optimization Properties of SGD. arXiv, 2018.
[22] Yuandong Tian et al. Gradient Descent Learns One-hidden-layer CNN: Don't be Afraid of Spurious Local Minima. ICML, 2017.
[23] Nathan Srebro et al. The Implicit Bias of Gradient Descent on Separable Data. J. Mach. Learn. Res., 2017.
[24] Yuandong Tian et al. When is a Convolutional Filter Easy To Learn? ICLR, 2017.
[25] Inderjit S. Dhillon et al. Learning Non-overlapping Convolutional Neural Networks with Multiple Kernels. arXiv, 2017.
[26] Guillermo Sapiro et al. Robust Large Margin Deep Neural Networks. IEEE Transactions on Signal Processing, 2017.
[27] Nathan Srebro et al. Exploring Generalization in Deep Learning. NIPS, 2017.
[28] Matus Telgarsky et al. Spectrally-normalized margin bounds for neural networks. NIPS, 2017.
[29] Inderjit S. Dhillon et al. Recovery Guarantees for One-hidden-layer Neural Networks. ICML, 2017.
[30] Yuanzhi Li et al. Convergence Analysis of Two-layer Neural Networks with ReLU Activation. NIPS, 2017.
[31] Noah Golowich et al. Musings on Deep Learning: Properties of SGD. 2017.
[32] Tomaso A. Poggio et al. Theory II: Landscape of the Empirical Risk in Deep Learning. arXiv, 2017.
[33] Michael I. Jordan et al. How to Escape Saddle Points Efficiently. ICML, 2017.
[34] Yuandong Tian et al. An Analytical Formula of Population Gradient for two-layered ReLU network and its Applications in Convergence and Critical Point Analysis. ICML, 2017.
[35] Amit Daniely et al. SGD Learns the Conjugate Kernel Class of the Network. NIPS, 2017.
[36] Amir Globerson et al. Globally Optimal Gradient Descent for a ConvNet with Gaussian Inputs. ICML, 2017.
[37] Matus Telgarsky et al. Non-convex learning via Stochastic Gradient Langevin Dynamics: a nonasymptotic analysis. COLT, 2017.
[38] Michael I. Jordan et al. Gradient Descent Only Converges to Minimizers. COLT, 2016.
[39] Tim Salimans et al. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. NIPS, 2016.
[40] Furong Huang et al. Escaping From Saddle Points - Online Stochastic Gradient for Tensor Decomposition. COLT, 2015.
[41] Sergey Ioffe et al. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. ICML, 2015.
[42] Michael I. Jordan et al. Convexity, Classification, and Risk Bounds. 2006.
[43] Gábor Lugosi et al. Introduction to Statistical Learning Theory. Advanced Lectures on Machine Learning, 2004.
[44] Ji Zhu et al. Margin Maximizing Loss Functions. NIPS, 2003.
[45] Sun-Yuan Kung et al. On gradient adaptation with unit-norm constraints. IEEE Trans. Signal Process., 2000.
[46] Aleksej F. Filippov et al. Differential Equations with Discontinuous Righthand Sides. Mathematics and Its Applications, 1988.