Online Local Gain Adaptation for Multi-Layer Perceptrons

We introduce a new method for adapting the step size of each individual weight in a multi-layer perceptron trained by stochastic gradient descent. Our technique derives from the K1 algorithm for linear systems (Sutton, 1992), which in turn is based on a diagonalized Kalman filter. We expand upon Sutton's work in two regards: K1 is a) extended to nonlinear systems, and b) made more efficient by linearizing an exponentiation operation. The resulting ELK1 (extended, linearized K1) algorithm is computationally only slightly more expensive than alternative proposals (Zimmermann, 1994; Almeida et al., 1997, 1998), and does not require an arbitrary smoothing parameter. On a first benchmark problem ELK1 clearly outperforms these alternatives, as well as stochastic gradient descent with momentum, even when the number of floating-point operations required per weight update is taken into account. Unlike the method of Almeida et al., ELK1 does not require statistical independence between successive training patterns, and handles large initial learning rates well.
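
The abstract only summarizes ELK1; its exact update equations are not reproduced here. As a rough illustration of the two ingredients it names, per-weight step sizes and a linearized exponential gain update, the sketch below applies an IDBD/K1-style rule to a single linear unit. Everything in it (the names meta_rate, min_factor, h, the clipping constant, and the toy usage) is an illustrative assumption, not the ELK1 algorithm for multi-layer perceptrons.

```python
import numpy as np

def local_gain_step(w, p, h, x, target, meta_rate=0.05, min_factor=0.5):
    """One IDBD/K1-style update for a single linear unit (illustrative only).

    w : weight vector          p : per-weight step sizes (gains)
    h : per-weight trace of recent weighted gradients
    The gain update uses the linearization exp(u) ~ 1 + u, clipped from
    below so gains stay positive -- a stand-in for the "linearized
    exponentiation" mentioned in the abstract, not the ELK1 equations.
    """
    error = target - w @ x               # scalar prediction error
    grad = error * x                     # per-weight gradient of squared error
    p = p * np.maximum(min_factor, 1.0 + meta_rate * grad * h)  # local gain update
    w = w + p * grad                     # weight update with individual step sizes
    h = h * np.maximum(0.0, 1.0 - p * x * x) + p * grad         # decayed gradient trace
    return w, p, h

# Toy usage: fit a fixed linear target from noisy samples.
rng = np.random.default_rng(0)
n = 5
w_true = rng.normal(size=n)
w, p, h = np.zeros(n), np.full(n, 0.1), np.zeros(n)
for t in range(1000):
    x = rng.normal(size=n)
    y = w_true @ x + 0.01 * rng.normal()
    w, p, h = local_gain_step(w, p, h, x, y)
print("final error norm:", np.linalg.norm(w - w_true))
```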

[1] Thibault Langlois et al. Parameter adaptation in stochastic optimization, 1999.

[2] Robert A. Jacobs et al. Increased rates of convergence through learning rate adaptation, 1987, Neural Networks.

[3] William H. Press et al. The Art of Scientific Computing, Second Edition, 1998.

[4] Barak A. Pearlmutter et al. Automatic Learning Rate Maximization in Large Adaptive Machines, 1992, NIPS.

[5] Richard S. Sutton et al. Adapting Bias by Gradient Descent: An Incremental Version of Delta-Bar-Delta, 1992, AAAI.

[6] Nicol N. Schraudolph et al. A Fast, Compact Approximation of the Exponential Function, 1999, Neural Computation.

[7] Guo-An Chen et al. Acceleration of backpropagation learning using optimised learning rate and momentum, 1993.

[8] Terrence J. Sejnowski et al. Tempering Backpropagation Networks: Not All Weights are Created Equal, 1995, NIPS.

[9] Luís B. Almeida et al. Speeding up Backpropagation, 1990.

[10] Alan S. Lapedes et al. A self-optimizing, nonsymmetrical neural net for content addressable memory and pattern recognition, 1986.

[11] Andreas Ziehe et al. Adaptive On-line Learning in Changing Environments, 1996, NIPS.

[12] Roberto Battiti et al. Accelerated Backpropagation Learning: Two Optimization Methods, 1989, Complex Systems.

[13] Lee A. Feldkamp et al. Decoupled extended Kalman filter training of feedforward layered networks, 1991, IJCNN-91-Seattle International Joint Conference on Neural Networks.

[14] F. A. Seiler et al. Numerical Recipes in C: The Art of Scientific Computing, 1989.

[15] E. S. Plumer. Training neural networks using sequential extended Kalman filtering, 1995.

[16] Francesco Palmieri et al. Optimal filtering algorithms for fast learning in feedforward neural networks, 1992, Neural Networks.