First- and Second-Order Methods for Learning: Between Steepest Descent and Newton's Method

On-line first-order backpropagation is sufficiently fast and effective for many large-scale classification problems, but for very high-precision mappings, batch processing may be the method of choice. This paper reviews first- and second-order optimization methods for learning in feedforward neural networks. The viewpoint is that of optimization: many learning methods can be cast in the language of optimization techniques, allowing the transfer to neural networks of detailed results about computational complexity and of safety procedures that ensure convergence and avoid numerical problems. The review is not intended to deliver detailed prescriptions for the most appropriate method in a specific application, but to illustrate the main characteristics of the different methods and their mutual relations.
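The contrast between steepest descent and Newton's method that frames the review can be sketched on a toy quadratic loss. The function, learning rate, and helper names below are illustrative assumptions, not taken from the paper; they only show why curvature information speeds up convergence on ill-conditioned problems.

```python
# Minimal sketch: steepest descent vs. Newton's method on the
# 2-D quadratic f(w) = 0.5 * (a*w0^2 + b*w1^2), minimized at (0, 0).
# With b >> a the problem is ill-conditioned, which slows a fixed-rate
# gradient method but not a Newton step.

def grad(w, a=1.0, b=10.0):
    # Gradient of f: (a*w0, b*w1).
    return [a * w[0], b * w[1]]

def steepest_descent(w, lr, steps, a=1.0, b=10.0):
    # Repeated first-order updates w <- w - lr * grad(w).
    for _ in range(steps):
        g = grad(w, a, b)
        w = [w[0] - lr * g[0], w[1] - lr * g[1]]
    return w

def newton_step(w, a=1.0, b=10.0):
    # The Hessian of f is diag(a, b); a Newton step solves H d = -g,
    # which for a quadratic lands on the minimizer in one iteration.
    g = grad(w, a, b)
    return [w[0] - g[0] / a, w[1] - g[1] / b]

w0 = [1.0, 1.0]
sd = steepest_descent(w0, lr=0.05, steps=50)
nw = newton_step(w0)
# nw is exactly (0, 0); sd still has visible error in the low-curvature
# direction after 50 fixed-rate iterations.
```

The point of the sketch is the trade-off the paper surveys: the Newton step is exact here but requires (implicitly) inverting the Hessian, while each steepest-descent step is cheap but its count grows with the condition number b/a.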