Efficient BackProp

The convergence of back-propagation learning is analyzed so as to explain common phenomena observed by practitioners. Many undesirable behaviors of backprop can be avoided with tricks that are rarely exposed in serious technical publications. This paper gives some of those tricks and offers explanations of why they work. Many authors have suggested that second-order optimization methods are advantageous for neural-net training. It is shown that most “classical” second-order methods are impractical for large neural networks, and a few methods are proposed that do not have these limitations.
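As an illustration of the kind of practical trick the paper advocates, the sketch below shows input normalization (shifting each input variable to zero mean and unit variance) followed by plain stochastic gradient steps on a single linear unit. This is a minimal sketch assuming NumPy; the function names (normalize_inputs, sgd_step) and the toy data are hypothetical and are not taken from the paper.

```python
import numpy as np

# Illustrative sketch, not the paper's code: normalizing inputs to zero mean
# and unit variance improves the conditioning of the error surface, which
# tends to speed up gradient-descent convergence.

def normalize_inputs(X):
    """Shift and scale each input feature to zero mean, unit variance."""
    mean = X.mean(axis=0)
    std = X.std(axis=0) + 1e-8          # avoid division by zero
    return (X - mean) / std

def sgd_step(w, x, y, lr=0.01):
    """One stochastic gradient step for a linear unit with squared error."""
    y_hat = x @ w
    grad = (y_hat - y) * x              # gradient of 0.5 * (y_hat - y)^2 w.r.t. w
    return w - lr * grad

# Usage: normalize once, then present examples one at a time (stochastic mode).
rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=3.0, size=(100, 2))    # badly scaled raw inputs
y = X @ np.array([1.5, -2.0]) + rng.normal(0, 0.1, 100)
Xn = normalize_inputs(X)
w = np.zeros(2)
for i in range(100):
    w = sgd_step(w, Xn[i], y[i])
```

With the raw, poorly centered inputs the same learning rate would make the steps zig-zag along the dominant curvature direction; normalization makes a single global learning rate work reasonably well, which is the point of the trick.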
