Memo No. 35, August 5, 2015. Deep Convolutional Networks are Hierarchical Kernel Machines

Fabio Anselmi1,2, Lorenzo Rosasco1,2,3, Cheston Tan4, and Tomaso Poggio1,2,4

1Center for Brains, Minds and Machines, Massachusetts Institute of Technology, Cambridge, MA 02139. 2Laboratory for Computational Learning, Istituto Italiano di Tecnologia and Massachusetts Institute of Technology. 3DIBRIS, Università degli Studi di Genova, Italy, 16146. 4Institute for Infocomm Research, Singapore, 138632.

In i-theory, a typical layer of a hierarchical architecture consists of HW modules that pool the dot products of the layer's inputs with the transformations of a few templates under a group. Such layers include as special cases the convolutional layers of Deep Convolutional Networks (DCNs) as well as the non-convolutional layers (when the group contains only the identity). Rectifying nonlinearities, which are used by present-day DCNs, are one of several nonlinearities admitted by i-theory for the HW module. We discuss here the equivalence between group averages of linear combinations of rectifying nonlinearities and an associated kernel. This property implies that present-day DCNs can be exactly equivalent to a hierarchy of kernel machines with pooling and non-pooling layers. Finally, we describe a conjecture for theoretically understanding hierarchies of such modules. A main consequence of the conjecture is that hierarchies of trained HW modules minimize memory requirements while computing a selective and invariant representation.

This work was supported by the Center for Brains, Minds and Machines (CBMM), funded by NSF STC award CCF-1231216.
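To make the HW-module description concrete, below is a minimal NumPy sketch (function and parameter names are illustrative, not taken from the memo). It computes the dot products of an input with the group-transformed copies of a single template, applies thresholded rectifiers, one of the nonlinearities admitted by i-theory, and pools over the group orbit, roughly mu_n(x) = (1/|G|) sum_{g in G} max(0, <x, g t> - b_n).

```python
import numpy as np

def hw_module(x, templates, thresholds=None, pooling="mean"):
    """Sketch of an HW module: pooled rectified dot products with a template orbit.

    x          : 1-D input patch, shape (d,)
    templates  : group-transformed copies g.t of one template, shape (n_transforms, d)
    thresholds : offsets b_n for the rectifying units (illustrative default)
    pooling    : "mean" (group average) or "max"

    Returns one pooled value per threshold.
    """
    if thresholds is None:
        thresholds = np.linspace(-1.0, 1.0, 5)
    thresholds = np.asarray(thresholds)

    # Dot products of the input with each transformed template.
    dots = templates @ x                                   # shape (n_transforms,)

    # Thresholded rectifier (ReLU), one row per threshold.
    rectified = np.maximum(dots[None, :] - thresholds[:, None], 0.0)

    # Pooling over the group orbit.
    if pooling == "mean":
        return rectified.mean(axis=1)
    return rectified.max(axis=1)


# Example: for the (circular) translation group, the transformed templates are
# shifted copies of one filter, so the module reduces to convolution -> ReLU -> pooling.
rng = np.random.default_rng(0)
t = rng.standard_normal(8)                                  # one template
orbit = np.stack([np.roll(t, s) for s in range(8)])         # its translation orbit
x = rng.standard_normal(8)                                  # input patch
print(hw_module(x, orbit))                                  # pooled, rectified responses
```

When the group contains only the identity, the orbit has a single element and the module is an ordinary (non-pooling) rectified unit, matching the non-convolutional case described above.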
