Paper Information - Gradient Descent: Second Order Momentum and Saturating Error

Gradient Descent: Second Order Momentum and Saturating Error

Batch gradient descent, Δw(t) = -ν dE/dw(t), converges to a minimum of quadratic form with a time constant no better than (1/4)(λmax/λmin), where λmin and λmax are the minimum and maximum eigenvalues of the Hessian matrix of E with respect to w. It was recently shown that adding a momentum term, Δw(t) = -ν dE/dw(t) + αΔw(t - 1), improves this to (1/4)√(λmax/λmin), although only in the batch case. Here we show that second-order momentum, Δw(t) = -ν dE/dw(t) + αΔw(t - 1) + βΔw(t - 2), can lower this no further. We then regard gradient descent with momentum as a dynamic system and explore a non-quadratic error surface, showing that saturation of the error accounts for a variety of effects observed in simulations and justifies some popular heuristics.
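The update rules in the abstract can be tried on a toy problem. The following is a minimal sketch, not the paper's own experiment: it assumes a diagonal quadratic error E(w) = 0.5 Σ λi wi², a hypothetical starting point w = (1, 1), and the standard heavy-ball step size and momentum settings for a quadratic rather than any values taken from the paper. The function name `descend` and its parameters are illustrative only.

```python
def descend(eigs, nu, alpha=0.0, beta=0.0, steps=200):
    """Minimise E(w) = 0.5 * sum(lam_i * w_i**2) and return the final error.

    Update rule (as written in the abstract):
        dw(t) = -nu * dE/dw(t) + alpha * dw(t-1) + beta * dw(t-2)
    """
    w = [1.0] * len(eigs)      # assumed starting point, not from the paper
    dw1 = [0.0] * len(eigs)    # dw(t-1)
    dw2 = [0.0] * len(eigs)    # dw(t-2)
    for _ in range(steps):
        # For a diagonal quadratic, dE/dw_i = lam_i * w_i.
        grad = [lam * wi for lam, wi in zip(eigs, w)]
        dw = [-nu * g + alpha * d1 + beta * d2
              for g, d1, d2 in zip(grad, dw1, dw2)]
        w = [wi + d for wi, d in zip(w, dw)]
        dw1, dw2 = dw, dw1     # shift the history of weight changes
    return 0.5 * sum(lam * wi * wi for lam, wi in zip(eigs, w))


# Assumed Hessian eigenvalues, giving lambda_max / lambda_min = 100.
eigs = [1.0, 100.0]
lam_min, lam_max = min(eigs), max(eigs)

# Plain gradient descent with the usual best fixed step for a quadratic.
plain_error = descend(eigs, nu=2.0 / (lam_max + lam_min))

# First-order momentum with the standard heavy-ball settings.
root_min, root_max = lam_min ** 0.5, lam_max ** 0.5
momentum_error = descend(
    eigs,
    nu=4.0 / (root_max + root_min) ** 2,
    alpha=((root_max - root_min) / (root_max + root_min)) ** 2,
)

print("plain:", plain_error, "momentum:", momentum_error)
```

With these assumed settings, the momentum run ends at a far smaller error after the same number of steps, in line with the move from a λmax/λmin to a √(λmax/λmin) dependence. Passing a nonzero `beta` exercises the second-order rule, which the abstract states cannot lower the time constant further.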

Barak A. Pearlmutter

[1] B. Widrow, et al. Stationary and nonstationary learning characteristics of the LMS adaptive filter, 1976, Proceedings of the IEEE.

[2] Geoffrey E. Hinton, et al. Learning internal representations by error propagation, 1986.

[3] S. Thomas Alexander, et al. Adaptive Signal Processing, 1986, Texts and Monographs in Computer Science.

[4] S. T. Alexander, et al. Adaptive Signal Processing: Theory and Applications, 1986.

[5] J. Shynk, et al. The LMS algorithm with momentum updating, 1988, IEEE International Symposium on Circuits and Systems.

[6] Robert A. Jacobs, et al. Increased rates of convergence through learning rate adaptation, 1987, Neural Networks.

[7] M. Tugay, et al. Properties of the momentum LMS algorithm, 1989, Proceedings. Electrotechnical Conference Integrating Research, Industry and Education in Energy and Communication Engineering.

[8] Terrence J. Sejnowski, et al. Faster Learning for Dynamic Recurrent Backpropagation, 1990, Neural Computation.

[9] Luís B. Almeida, et al. Acceleration Techniques for the Backpropagation Algorithm, 1990, EURASIP Workshop.

[10] Tom Tollenaere, et al. SuperSAB: Fast adaptive back propagation with good scaling properties, 1990, Neural Networks.

[11] H. Dabis, et al. Least mean squares as a control system, 1991.