Why and when can deep – but not shallow – networks avoid the curse of dimensionality: A review

The paper reviews and extends an emerging body of theoretical results on deep learning, including the conditions under which it can be exponentially better than shallow learning. A class of deep convolutional networks represents an important special case of these conditions, though weight sharing is not the main reason for their exponential advantage. Implications of a few key theorems are discussed, together with new results, open problems, and conjectures.
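As a rough sketch of the separation the reviewed theorems establish (stated informally here; the precise hypotheses on the network units, norms, and constants are given in the theorems themselves): approximating a generic n-variable function of smoothness m to accuracy ε with a one-hidden-layer network requires a number of units that grows exponentially with n, whereas a deep hierarchical network matched to a compositional target built from bivariate constituent functions of the same smoothness needs only a number of units linear in n,

\[
N_{\text{shallow}} = O\!\left(\varepsilon^{-n/m}\right),
\qquad
N_{\text{deep}} = O\!\left((n-1)\,\varepsilon^{-2/m}\right).
\]

Here N is the number of network units. The first bound exhibits the curse of dimensionality; the second avoids it because the architecture mirrors the binary-tree compositional structure of the target function, which is the sense in which depth, rather than weight sharing, drives the exponential advantage.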
