Computing second derivatives in feed-forward networks: a review

The calculation of second derivatives is required by recent techniques for training and analyzing connectionist networks, such as the elimination of superfluous weights and the estimation of confidence intervals both for weights and network outputs. We review and develop exact and approximate algorithms for calculating second derivatives. For networks with |w| weights, simply writing out the full matrix of second derivatives requires O(|w|^2) operations. For networks of radial basis units or sigmoid units, exact calculation of the necessary intermediate terms requires on the order of 2h + 2 backward/forward-propagation passes, where h is the number of hidden units in the network. We also review and compare three approximations (ignoring some components of the second derivative, numerical differentiation, and scoring). The algorithms apply to arbitrary activation functions, networks, and error functions.
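As a rough illustration of two of the approximations mentioned above, the following sketch computes the Hessian of a sum-of-squares error for a small one-hidden-layer sigmoid network in two ways: by numerical differentiation (central differences of the back-propagated gradient) and by the outer-product of first derivatives, in the spirit of the scoring approximation that drops the term involving second derivatives of the network outputs. This is not the paper's algorithm; the tiny architecture, data, and all function names are illustrative assumptions.

```python
# Minimal sketch, assuming a 2 -> 3 (sigmoid) -> 1 (linear) network with
# sum-of-squares error and biases omitted for brevity.
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 20 patterns, 2 inputs, 1 target.
X = rng.normal(size=(20, 2))
t = np.sin(X[:, :1]) + 0.1 * rng.normal(size=(20, 1))
n_in, n_hid, n_out = 2, 3, 1

def unpack(w):
    """Split the flat weight vector into the two layer matrices."""
    W1 = w[:n_in * n_hid].reshape(n_in, n_hid)
    W2 = w[n_in * n_hid:].reshape(n_hid, n_out)
    return W1, W2

def forward(w, X):
    W1, W2 = unpack(w)
    h = 1.0 / (1.0 + np.exp(-X @ W1))   # sigmoid hidden layer
    return h @ W2, h                     # linear output, hidden activations

def error(w):
    y, _ = forward(w, X)
    return 0.5 * np.sum((y - t) ** 2)    # sum-of-squares error

def gradient(w):
    """First derivatives of the error by ordinary back-propagation."""
    W1, W2 = unpack(w)
    y, h = forward(w, X)
    delta_out = y - t                              # dE/dy for sum-of-squares
    delta_hid = (delta_out @ W2.T) * h * (1 - h)   # back-propagate through sigmoid
    g1 = X.T @ delta_hid
    g2 = h.T @ delta_out
    return np.concatenate([g1.ravel(), g2.ravel()])

w = rng.normal(size=n_in * n_hid + n_hid * n_out)
eps = 1e-5

# (1) Numerical differentiation: central differences of the gradient give one
# column of the Hessian per perturbed weight, i.e. O(|w|) gradient passes.
H_fd = np.zeros((w.size, w.size))
for i in range(w.size):
    e = np.zeros_like(w); e[i] = eps
    H_fd[:, i] = (gradient(w + e) - gradient(w - e)) / (2 * eps)
H_fd = 0.5 * (H_fd + H_fd.T)  # symmetrize away rounding noise

# (2) Outer-product ("scoring"-style) approximation: sum the outer products of
# the per-pattern output Jacobians, ignoring the term that involves second
# derivatives of the network outputs.
def output_jacobian(w, x):
    """Derivative of the scalar network output w.r.t. w, by finite differences."""
    def out(wv):
        y, _ = forward(wv, x[None, :])
        return y[0, 0]
    J = np.zeros_like(w)
    for i in range(w.size):
        e = np.zeros_like(w); e[i] = eps
        J[i] = (out(w + e) - out(w - e)) / (2 * eps)
    return J

H_op = sum(np.outer(output_jacobian(w, x), output_jacobian(w, x)) for x in X)

print("E(w) =", error(w))
print("max |H_fd - H_op| =", np.abs(H_fd - H_op).max())
```

The gap between the two matrices reflects exactly the neglected curvature term, which shrinks as the residuals become small; this is why the outer-product approximation is attractive near a good minimum but can be poor early in training.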
