Neural Network with Unbounded Activation Functions is Universal Approximator

This paper investigates the approximation properties of neural networks with unbounded activation functions, such as the rectified linear unit (ReLU), which is the new de facto standard of deep learning. The ReLU network can be analyzed by the ridgelet transform with respect to Lizorkin distributions, which is introduced in this paper. By deriving two reconstruction formulas via the Fourier slice theorem and the Radon transform, it is shown that a neural network with unbounded activation functions still has the universal approximation property. As an additional consequence, the ridgelet transform, or the backprojection filter in the Radon domain, is what the network will have learned after backpropagation. Subject to a constructive admissibility condition, the trained network can be obtained by simply discretizing the ridgelet transform, without backpropagation. Numerical examples not only support the consistency of the admissibility condition but also imply that some nonadmissible cases result in low-pass filtering.
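To make the "discretize an integral representation instead of backpropagating" idea concrete, here is a minimal one-dimensional sketch in Python. It does not implement the paper's ridgelet transform; it relies instead on the elementary identity f(x) = ∫ f''(b) ReLU(x − b) db, which holds for smooth f with f, f' → 0 at −∞ (integrate by parts twice). The target function, grid, and step sizes below are illustrative choices, not taken from the paper.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

# Target: a smooth, rapidly decaying function (effectively compactly supported).
def f(x):
    return np.exp(-x**2) * np.sin(3 * x)

# Second derivative by central differences.
def fpp(b, h=1e-4):
    return (f(b + h) - 2 * f(b) + f(b - h)) / h**2

# Hidden units: biases b_j on a grid. The output weights come from the
# integral representation f(x) = \int f''(b) relu(x - b) db, so no
# backpropagation is needed -- each coefficient is just f''(b_j) * db.
b = np.linspace(-6.0, 6.0, 2000)
db = b[1] - b[0]
c = fpp(b) * db

def network(x):
    # One-hidden-layer ReLU network with prescribed weights.
    return relu(x[:, None] - b[None, :]) @ c

x = np.linspace(-3.0, 3.0, 400)
err = np.max(np.abs(network(x) - f(x)))
print(f"max |network - f| on [-3, 3]: {err:.2e}")  # small, roughly 1e-4
```

The output weights are read off from f'' directly, so no gradient descent is involved; refining the grid drives the error down, mirroring the paper's claim that the trained network can be obtained by discretizing an integral representation.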
