Regularization Theory and Neural Networks Architectures

We have previously shown that regularization principles lead to approximation schemes that are equivalent to networks with one layer of hidden units, called regularization networks. In particular, standard smoothness functionals lead to a subclass of regularization networks, the well-known radial basis function approximation schemes. This paper shows that regularization networks encompass a much broader range of approximation schemes, including many of the popular general additive models and some neural networks. In particular, we introduce new classes of smoothness functionals that lead to different classes of basis functions. Additive splines, as well as some tensor product splines, can be obtained from appropriate classes of smoothness functionals. Furthermore, the same generalization that extends radial basis functions (RBF) to hyper basis functions (HBF) also leads from additive models to ridge approximation models, which contain as special cases Breiman's hinge functions, some forms of projection pursuit regression, and several types of neural networks. We propose the term generalized regularization networks for this broad class of approximation schemes that follow from an extension of regularization. In the probabilistic interpretation of regularization, the different classes of basis functions correspond to different classes of prior probabilities on the approximating function spaces, and therefore to different types of smoothness assumptions. In summary, different multilayer networks with one hidden layer, which we collectively call generalized regularization networks, correspond to different classes of priors and associated smoothness functionals in a classical regularization principle.
The three broad classes are (1) radial basis functions, which can be generalized to hyper basis functions, (2) some tensor product splines, and (3) additive splines, which can be generalized to ridge approximation schemes, hinge functions, and several perceptron-like neural networks with one hidden layer.
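As a concrete illustration of the first class, the sketch below fits a regularization network with a Gaussian radial basis function centered on every data point. It assumes the standard form of the solution for a positive definite basis function, f(x) = sum_i c_i G(x - x_i), with coefficients obtained from the linear system (G + lambda I) c = y. The Gaussian width `sigma`, the regularization parameter `lam`, and all function names are assumptions chosen for illustration, not notation taken from the paper. A second helper shows, for contrast only, the hidden-unit outputs h(w . x + b) of a ridge approximation scheme, with `tanh` as an arbitrary choice of h.

```python
import numpy as np

def gaussian_kernel(X, Z, sigma=1.0):
    """Gaussian basis function evaluated between all rows of X and Z."""
    # Pairwise squared Euclidean distances, clipped at zero for round-off.
    d2 = (np.sum(X**2, axis=1)[:, None]
          + np.sum(Z**2, axis=1)[None, :]
          - 2.0 * X @ Z.T)
    return np.exp(-np.maximum(d2, 0.0) / (2.0 * sigma**2))

def fit_regularization_network(X, y, lam=1e-3, sigma=1.0):
    """Solve (G + lam * I) c = y for the coefficients c (one per data point)."""
    G = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(G + lam * np.eye(len(X)), y)

def predict(X_train, coeffs, X_new, sigma=1.0):
    """Evaluate f(x) = sum_i c_i G(x - x_i) at the rows of X_new."""
    return gaussian_kernel(X_new, X_train, sigma) @ coeffs

def ridge_units(X, W, b, h=np.tanh):
    """Hidden-unit outputs h(w_a . x + b_a) of a ridge approximation scheme.

    W has one row per hidden unit; h is an illustrative choice, not the
    paper's. Each unit depends on x only through the projection w_a . x.
    """
    return h(X @ W.T + b)

# Toy one-dimensional regression problem (synthetic data for illustration).
X = np.linspace(0.0, 2.0 * np.pi, 20)[:, None]
y = np.sin(X[:, 0])
lam = 1e-3
c = fit_regularization_network(X, y, lam=lam)
y_hat = predict(X, c, X)
```

Increasing `lam` shrinks the coefficients and yields a smoother fit, which is the role the smoothness functional plays in the regularization principle. The radial units respond to the distance from a center, whereas each ridge unit responds to a one-dimensional projection of the input.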
