Optimal Stochastic Search and Adaptive Momentum

Stochastic optimization algorithms typically use learning rate schedules that behave asymptotically as µ(t) = µ0/t. The ensemble dynamics (Leen and Moody, 1993) for such algorithms provide an easy path to results on mean squared weight error and asymptotic normality. We apply this approach to stochastic gradient algorithms with momentum. We show that at late times, learning is governed by an effective learning rate µeff = µ0/(1 - β), where β is the momentum parameter. We describe the behavior of the asymptotic weight error and give conditions on µeff that ensure optimal convergence speed. Finally, we use the results to develop an adaptive form of momentum that achieves optimal convergence speed independent of µ0.
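The effective learning rate can be illustrated with a small numerical sketch. The code below is not the authors' experiment: it assumes a one-dimensional quadratic cost with unit curvature, additive Gaussian gradient noise, and the common heavy-ball form of the momentum update. The function names (`effective_learning_rate`, `run_sgd`, `mean_squared_error`) and all parameter values are hypothetical choices made for illustration. It compares momentum with µ0 = 1, β = 0.5 against plain stochastic gradient descent run at µ0 = 2, the value µeff predicts.

```python
import random


def effective_learning_rate(mu0, beta):
    """Late-time effective learning rate mu_eff = mu0 / (1 - beta)."""
    return mu0 / (1.0 - beta)


def run_sgd(mu0, beta, steps, seed, curvature=1.0, noise_std=1.0, w0=1.0):
    """Run SGD with momentum on an assumed 1-D quadratic cost.

    The schedule is mu(t) = mu0 / t and the optimum is at w = 0, so the
    returned value is the final weight error.
    """
    rng = random.Random(seed)
    w, velocity = w0, 0.0
    for t in range(1, steps + 1):
        mu_t = mu0 / t
        # Noisy gradient of 0.5 * curvature * w**2 (illustrative noise model).
        grad = curvature * w + rng.gauss(0.0, noise_std)
        # Heavy-ball momentum: the previous step is carried over with weight beta.
        velocity = beta * velocity - mu_t * grad
        w += velocity
    return w


def mean_squared_error(mu0, beta, steps=2000, trials=100):
    """Average the squared final weight error over independent runs."""
    total = 0.0
    for seed in range(trials):
        total += run_sgd(mu0, beta, steps, seed) ** 2
    return total / trials


if __name__ == "__main__":
    mu0, beta = 1.0, 0.5
    mu_eff = effective_learning_rate(mu0, beta)
    print("mu_eff:", mu_eff)
    print("MSE with momentum:", mean_squared_error(mu0, beta))
    print("MSE, plain SGD at mu_eff:", mean_squared_error(mu_eff, 0.0))
```

Under these assumptions the two mean squared errors should come out at a similar magnitude, which is the sense in which momentum acts like a rescaled learning rate at late times. The agreement is statistical, so the numbers vary with the number of trials and steps.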

[1]  J. H. Venter An extension of the Robbins-Monro procedure , 1967 .

[2]  D. Bedeaux,et al.  On the Relation between Master Equations and Random Walks and Their Solutions , 1971 .

[3]  Harold J. Kushner,et al.  Stochastic approximation methods for constrained and unconstrained systems , 1978 .

[4]  J. Shynk,et al.  The LMS algorithm with momentum updating , 1988, IEEE International Symposium on Circuits and Systems.

[5]  Halbert White,et al.  Learning in Artificial Neural Networks: A Statistical Perspective , 1989, Neural Computation.

[6]  M. Tugay,et al.  Properties of the momentum LMS algorithm , 1989, Proceedings. Electrotechnical Conference Integrating Research, Industry and Education in Energy and Communication Engineering.

[7]  John E. Moody,et al.  Towards Faster Stochastic Gradient Search , 1991, NIPS.

[8]  John E. Moody,et al.  Weight Space Probability Densities in Stochastic Learning: I. Dynamics and Equilibria , 1992, NIPS.

[9]  Heskes,et al.  Learning in neural networks with local minima. , 1992, Physical review. A, Atomic, molecular, and optical physics.

[10]  Todd K. Leen,et al.  Weight Space Probability Densities in Stochastic Learning: II. Transients and Basin Hopping Times , 1992, NIPS.

[11]  G. Orr,et al.  Momentum and optimal stochastic search , 1993 .