Theoretical issues in deep networks

While deep learning has been successful in many applications, it is not yet well understood theoretically. A theoretical characterization of deep learning should answer questions about the approximation power of deep networks, the dynamics of optimization, and their good out-of-sample performance despite overparameterization and the absence of explicit regularization. We review our recent results toward this goal. In approximation theory, both shallow and deep networks are known to approximate any continuous function, but at a cost that is, in general, exponential in the dimension of the input. However, we proved that for certain types of compositional functions, deep networks of the convolutional type (even without weight sharing) can avoid the curse of dimensionality. In characterizing the minimization of the empirical exponential loss, we consider the gradient flow of the weight directions rather than of the weights themselves, since the function relevant for classification corresponds to normalized networks. The dynamics of the normalized weights turn out to be equivalent to those of the constrained problem of minimizing the loss subject to a unit-norm constraint. In particular, the dynamics of typical gradient descent have the same critical points as the constrained problem. Thus there is implicit regularization in training deep networks under exponential-type loss functions with gradient flow, and the critical points correspond to minimum-norm infima of the loss. This result is especially relevant because it has recently been shown that, for overparameterized models, selecting a minimum-norm solution optimizes cross-validation leave-one-out stability and thereby the expected error. Our results therefore imply that gradient descent in deep networks minimizes the expected error.
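As an informal illustration of the compositional case (the particular function and exponents below are a standard example of this type of result, stated loosely rather than quoted): consider a function of d = 8 variables with binary-tree structure,

    f(x_1,\dots,x_8) = h_3\big(h_{21}(h_{11}(x_1,x_2),\, h_{12}(x_3,x_4)),\ h_{22}(h_{13}(x_5,x_6),\, h_{14}(x_7,x_8))\big),

where every constituent h is a smooth function of only two variables. Approximating a generic d-variable function of smoothness m to accuracy \varepsilon requires on the order of \varepsilon^{-d/m} units for a shallow network, whereas a deep network whose graph matches the composition needs only on the order of (d-1)\,\varepsilon^{-2/m} units, since each node has to approximate a bivariate function: the exponential dependence on d disappears.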
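To make the normalized-weight formulation concrete, here is a minimal sketch in standard notation (the symbols L, W_k, \rho_k, V_k are illustrative choices, not quotations). For a binary classification dataset \{(x_n, y_n)\} with y_n \in \{-1,+1\}, the empirical exponential loss of a network f(w;\cdot) is

    L(w) = \sum_n \exp\big(-y_n f(w; x_n)\big).

Writing each layer's weight matrix as W_k = \rho_k V_k with \|V_k\| = 1, the homogeneity of ReLU networks gives f(w; x) = \big(\prod_k \rho_k\big)\, f(v; x), and the constrained problem referred to above is

    \min_v \ \sum_n \exp\Big(-y_n \big(\textstyle\prod_k \rho_k\big) f(v; x_n)\Big)
    \quad \text{subject to} \quad \|V_k\| = 1 \ \text{for every layer } k;

the gradient flow of the weight directions has the same critical points as this constrained problem.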
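The difference between the dynamics of the weights and of their directions can be seen in a toy computation. The sketch below (a hypothetical illustration in Python, not code accompanying the paper) runs plain gradient descent on the exponential loss for a linear classifier on separable data: the weight norm grows without bound while the normalized direction w/\|w\| converges, which is why the analysis focuses on the directions.

    import numpy as np

    # Toy illustration: gradient descent on the empirical exponential loss
    # L(w) = sum_n exp(-y_n w.x_n) for a linear classifier on separable data.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(40, 2))                 # inputs x_n in R^2
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])         # separable labels y_n in {-1, +1}
    y[y == 0] = 1.0

    w = 0.01 * rng.normal(size=2)                # unnormalized weights
    lr = 0.01

    for step in range(20001):
        margins = y * (X @ w)                    # y_n f(w; x_n)
        grad = -(np.exp(-margins) * y) @ X       # gradient of sum_n exp(-y_n w.x_n)
        w -= lr * grad
        if step % 5000 == 0:
            direction = w / np.linalg.norm(w)
            print(f"step {step:6d}  ||w|| = {np.linalg.norm(w):7.3f}  "
                  f"w/||w|| = {direction}")

    # ||w|| keeps growing (the exponential loss has no finite minimizer on
    # separable data), while the direction w/||w|| settles down: the relevant
    # dynamics are those of the normalized weights.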
