Gradient Descent: Second Order Momentum and Saturating Error

Batch gradient descent, Δw(t) = -ν dE/dw(t), converges to a minimum of a quadratic form with a time constant no better than (1/4)(λmax/λmin), where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = -ν dE/dw(t) + αΔw(t - 1), improves this to (1/4)√(λmax/λmin), although only in the batch case. Here we show that second-order momentum, Δw(t) = -ν dE/dw(t) + αΔw(t - 1) + βΔw(t - 2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a nonquadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
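
The gap between the two time constants can be illustrated numerically. The following is a minimal sketch, not the paper's own experiment: it assumes a diagonal quadratic E(w) = (1/2) Σ λi wi², so that dE/dwi = λi wi, and it uses the standard textbook step size and momentum settings for a quadratic, which are not stated in the abstract. The function name, the eigenvalues, and the tolerance are hypothetical choices made for illustration. The β argument implements the second-order momentum rule but is left at zero here; the sketch does not attempt to verify the claim about β.

```python
import math


def steps_to_converge(lams, nu, alpha=0.0, beta=0.0, tol=1e-6, max_steps=100000):
    """Iterate dw(t) = -nu * dE/dw(t) + alpha * dw(t-1) + beta * dw(t-2)
    on E(w) = 0.5 * sum(lam_i * w_i**2) and return the number of steps
    taken until E(w) <= tol * E(w(0)), or max_steps if that never happens."""
    w = [1.0] * len(lams)      # arbitrary starting point (assumption)
    dw1 = [0.0] * len(lams)    # dw(t-1)
    dw2 = [0.0] * len(lams)    # dw(t-2)

    def error(weights):
        return 0.5 * sum(lam * x * x for lam, x in zip(lams, weights))

    e0 = error(w)
    for t in range(1, max_steps + 1):
        # For this diagonal quadratic the gradient component is lam_i * w_i.
        dw = [-nu * lam * x + alpha * d1 + beta * d2
              for lam, x, d1, d2 in zip(lams, w, dw1, dw2)]
        w = [x + d for x, d in zip(w, dw)]
        dw1, dw2 = dw, dw1
        if error(w) <= tol * e0:
            return t
    return max_steps


# Hypothetical Hessian eigenvalues with lam_max / lam_min = 100.
lams = [1.0, 100.0]
lam_min, lam_max = min(lams), max(lams)

# Plain gradient descent with the usual best fixed step for a quadratic.
nu_plain = 2.0 / (lam_max + lam_min)
plain_steps = steps_to_converge(lams, nu_plain)

# First-order momentum with the usual best (nu, alpha) pair for a quadratic.
s_min, s_max = math.sqrt(lam_min), math.sqrt(lam_max)
nu_mom = 4.0 / (s_max + s_min) ** 2
alpha_mom = ((s_max - s_min) / (s_max + s_min)) ** 2
momentum_steps = steps_to_converge(lams, nu_mom, alpha=alpha_mom)

print("plain gradient descent steps:", plain_steps)
print("momentum steps:", momentum_steps)
```

Under these assumptions the momentum run should need far fewer steps than the plain run, in line with the change from a time constant proportional to λmax/λmin to one proportional to √(λmax/λmin).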
