Deep vs. shallow networks: An approximation theory perspective

The paper briefly reviews several recent results on hierarchical architectures for learning from examples that may formally explain the conditions under which Deep Convolutional Neural Networks perform much better at function approximation than shallow, one-hidden-layer architectures. The paper announces new results for a non-smooth activation function, the ReLU used in present-day neural networks, as well as for Gaussian networks. We propose a new definition of relative dimension that encapsulates different notions of sparsity of a function class, sparsity that deep networks, but not shallow ones, can exploit to drastically reduce the complexity required for approximation and learning.
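
As a schematic illustration of the trade-off the abstract refers to (a hedged restatement; the precise function spaces, activation assumptions, and constants are those of the results reviewed in the paper), consider a target function of $n$ variables with smoothness $m$. A shallow, one-hidden-layer network generically requires a number of units of the order
\[
N_{\mathrm{shallow}} = O\!\left(\epsilon^{-n/m}\right)
\]
to reach uniform approximation error $\epsilon$, i.e. it suffers the curse of dimensionality. If the target is compositional, for example built from two-variable constituent functions of the same smoothness arranged in a binary tree, a deep network whose architecture matches that tree needs only on the order of
\[
N_{\mathrm{deep}} = O\!\left((n-1)\,\epsilon^{-2/m}\right)
\]
units: the exponent is governed by the effective (relative) dimension of the constituent functions rather than by the ambient dimension $n$.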
