Fast Curvature Matrix-Vector Products for Second-Order Gradient Descent

We propose a generic method for iteratively approximating various second-order gradient steps (Newton, Gauss-Newton, Levenberg-Marquardt, and natural gradient) in linear time per iteration, using special curvature matrix-vector products that can be computed in O(n) time. Two recent acceleration techniques for on-line learning, matrix momentum and stochastic meta-descent (SMD), implement this approach. Because the two were originally derived by very different routes, this common view offers fresh insight into their operation, resulting in further improvements to SMD.
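To make the central idea concrete, the following is a minimal sketch, not the paper's algorithm. It assumes the simplest possible setting: a linear least-squares model, for which the Gauss-Newton matrix is G = X^T X. The function names (gauss_newton_vec, approx_newton_step), the step size mu, and the toy data are all hypothetical choices made for illustration. The product G v is computed with one forward and one backward pass over the data, so the matrix itself is never formed, and a plain fixed-point iteration then approximates the step G^{-1} g using only such products.

```python
# Illustrative sketch under stated assumptions; not the method from the paper.

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def gauss_newton_vec(X, v):
    """Return G v with G = X^T X, without ever forming G."""
    Jv = [dot(row, v) for row in X]        # forward pass: J v, one scalar per sample
    out = [0.0] * len(v)
    for row, s in zip(X, Jv):              # backward pass: J^T (J v)
        for j, x in enumerate(row):
            out[j] += x * s
    return out

def gradient(X, y, w):
    """Gradient of 0.5 * sum((X w - y)^2) with respect to w."""
    residuals = [dot(row, w) - t for row, t in zip(X, y)]
    g = [0.0] * len(w)
    for row, r in zip(X, residuals):
        for j, x in enumerate(row):
            g[j] += x * r
    return g

def approx_newton_step(X, g, mu=0.2, iters=200):
    """Iterate d <- d + mu * (g - G d), which approaches G^{-1} g
    when mu is below 2 / (largest eigenvalue of G)."""
    d = [0.0] * len(g)
    for _ in range(iters):
        Gd = gauss_newton_vec(X, d)
        d = [di + mu * (gi - Gdi) for di, gi, Gdi in zip(d, g, Gd)]
    return d

# Toy data generated by y = 1 + 2 x (first column is the bias input).
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
y = [1.0, 3.0, 5.0]
w = [0.0, 0.0]

step = approx_newton_step(X, gradient(X, y, w))
w_new = [wi - si for wi, si in zip(w, step)]
```

Each product costs time linear in the number of parameters per sample, which is the property the abstract relies on. For nonlinear models the same forward-then-backward structure would be obtained by automatic differentiation rather than by the explicit loops used here.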
