Fast Exact Multiplication by the Hessian

Just storing the Hessian $H$ (the matrix of second derivatives $\partial^2 E/\partial w_i \partial w_j$ of the error $E$ with respect to each pair of weights) of a large neural network is difficult. Since a common use of a large matrix like $H$ is to compute its product with various vectors, we derive a technique that directly calculates $Hv$, where $v$ is an arbitrary vector. To calculate $Hv$, we first define a differential operator $R_v\{f(w)\} = (\partial/\partial r) f(w + rv)|_{r=0}$, note that $R_v\{\nabla_w\} = Hv$ and $R_v\{w\} = v$, and then apply $R_v\{\cdot\}$ to the equations used to compute $\nabla_w$. The result is an exact and numerically stable procedure for computing $Hv$, which takes about as much computation, and is about as local, as a gradient evaluation. We then apply the technique to a one pass gradient calculation algorithm (backpropagation), a relaxation gradient calculation algorithm (recurrent backpropagation), and two stochastic gradient calculation algorithms (Boltzmann machines and weight perturbation). Finally, we show that this technique can be used at the heart of many iterative techniques for computing various properties of $H$, obviating any need to calculate the full Hessian.
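As an illustration of applying the operator to the equations of a gradient computation, the following is a minimal sketch and not the paper's own code. It assumes a toy model chosen only for brevity: a single tanh unit with squared error, E = 0.5 (tanh(w . x) - t)^2. The names `grad_and_hvp` and `finite_difference_hvp`, and all numeric values, are hypothetical. Each forward and backward quantity is paired with its R-image, starting from R{w} = v, and the result is compared against a central finite difference of the gradient.

```python
import math

def grad_and_hvp(w, v, x, t):
    """Return (gradient, H v) for the toy error E = 0.5 * (tanh(w . x) - t)**2.

    Every quantity in the gradient computation is paired with its R-image,
    obtained by differentiating that line in the direction v (R{w} = v).
    """
    # Forward pass and its R-pass.
    a = sum(wi * xi for wi, xi in zip(w, x))
    r_a = sum(vi * xi for vi, xi in zip(v, x))   # R{a}, since R{w} = v
    y = math.tanh(a)
    dy = 1.0 - y * y                             # tanh'(a)
    r_y = dy * r_a                               # R{y}

    # Backward pass and its R-pass.
    dE_dy = y - t
    r_dE_dy = r_y                                # the target t does not depend on w
    dE_da = dE_dy * dy
    r_dy = -2.0 * y * r_y                        # R{1 - y^2}
    r_dE_da = r_dE_dy * dy + dE_dy * r_dy        # product rule

    grad = [dE_da * xi for xi in x]
    hv = [r_dE_da * xi for xi in x]              # the input x does not depend on w
    return grad, hv

def finite_difference_hvp(w, v, x, t, eps=1e-5):
    """Approximate H v by a central difference of the gradient (for checking only)."""
    w_plus = [wi + eps * vi for wi, vi in zip(w, v)]
    w_minus = [wi - eps * vi for wi, vi in zip(w, v)]
    g_plus, _ = grad_and_hvp(w_plus, v, x, t)
    g_minus, _ = grad_and_hvp(w_minus, v, x, t)
    return [(p - m) / (2.0 * eps) for p, m in zip(g_plus, g_minus)]

# Hypothetical values, chosen only to exercise the code.
w = [0.3, -0.2, 0.5]
v = [1.0, 0.5, -1.0]
x = [0.7, -1.1, 0.4]
t = 0.25

grad, hv = grad_and_hvp(w, v, x, t)
hv_fd = finite_difference_hvp(w, v, x, t)
max_abs_diff = max(abs(p - q) for p, q in zip(hv, hv_fd))
print("exact Hv:         ", hv)
print("finite difference:", hv_fd)
print("max abs diff:     ", max_abs_diff)
```

In this sketch the R-pass adds only a constant number of extra operations per line of the gradient computation, which matches the abstract's statement that the product costs about as much as a gradient evaluation. The finite-difference version needs two gradient evaluations and its accuracy depends on the choice of `eps`.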
