Learning in Artificial Neural Networks: A Statistical Perspective

The premise of this article is that learning procedures used to train artificial neural networks are inherently statistical techniques. It follows that statistical theory can provide considerable insight into the properties, advantages, and disadvantages of different network learning methods. We review concepts and analytical results from the literatures of mathematical statistics, econometrics, systems identification, and optimization theory relevant to the analysis of learning in artificial neural networks. Because of the considerable variety of available learning procedures and necessary limitations of space, we cannot provide a comprehensive treatment. Our focus is primarily on learning procedures for feedforward networks. However, many of the concepts and issues arising in this framework are also quite broadly relevant to other network learning paradigms. In addition to providing useful insights, the material reviewed here suggests some potentially useful new training methods for artificial neural networks.
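To make the statistical framing concrete, the following is a minimal illustrative sketch, not taken from the article: a single hidden-layer feedforward network trained by online gradient descent on squared error. Read statistically, this is recursive nonlinear least-squares estimation of a regression function, with the decreasing step size playing the role of a Robbins-Monro stochastic approximation gain. All names, the synthetic data, and the step-size schedule are hypothetical choices made for illustration.

```python
# Hypothetical sketch: online least-squares training of a single
# hidden-layer feedforward network, viewed as stochastic approximation
# for the parameters of a nonlinear regression model.
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = g(x) + noise, so the mean-squared-error
# optimal predictor is the conditional mean E[y | x].
def target(x):
    return np.sin(2.0 * x) + 0.5 * x

n, hidden = 2000, 10
x = rng.uniform(-2.0, 2.0, size=(n, 1))
y = target(x) + 0.1 * rng.standard_normal((n, 1))

# Network parameters: output(x) = b + sum_j beta_j * tanh(a_j * x + c_j)
A = rng.normal(scale=0.5, size=(1, hidden))     # input-to-hidden weights
c = np.zeros(hidden)                            # hidden biases
beta = rng.normal(scale=0.5, size=(hidden, 1))  # hidden-to-output weights
b = 0.0                                         # output bias

def forward(xi):
    h = np.tanh(xi @ A + c)       # hidden-layer activations
    return h, h @ beta + b        # activations and network output

# Online (recursive) estimation: one gradient step per observation,
# with a slowly decreasing learning rate (Robbins-Monro gain).
for t in range(n):
    xi, yi = x[t:t + 1], y[t:t + 1]
    h, pred = forward(xi)
    err = pred - yi                      # residual for this observation
    lr = 0.5 / (1.0 + 0.01 * t)          # decreasing step size
    # Gradients of 0.5 * err^2 with respect to each parameter group.
    grad_beta = h.T @ err
    grad_b = err.item()
    grad_hidden = (err @ beta.T) * (1.0 - h ** 2)
    grad_A = xi.T @ grad_hidden
    grad_c = grad_hidden.ravel()
    beta -= lr * grad_beta
    b -= lr * grad_b
    A -= lr * grad_A
    c -= lr * grad_c

# In-sample mean squared error: the sample analogue of the expected
# squared prediction error that, on the statistical view, the learning
# procedure is (approximately) minimizing.
_, fitted = forward(x)
print("in-sample MSE:", float(np.mean((fitted - y) ** 2)))
```

Under this reading, the network weights are parameter estimates in a nonlinear regression model, the trained network approximates the conditional mean E[y | x], and the large-sample behavior of the estimates can be studied with the statistical and stochastic-approximation tools surveyed in the article.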
