The performance of stochastic gradient descent (SGD) depends critically on how learning rates are tuned and decreased over time. We propose a method to automatically adjust multiple learning rates so as to minimize the expected error at any one time. The method relies on local gradient variations across samples. In our approach, learning rates can increase as well as decrease, making it suitable for non-stationary problems. Using a number of convex and non-convex learning tasks, we show that the resulting algorithm matches the performance of SGD or other adaptive approaches with their best settings obtained through systematic search, and effectively removes the need for learning rate tuning.
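The abstract leaves the update rule implicit. Below is a minimal sketch of the underlying idea, scaling each parameter's learning rate by how consistent its stochastic gradients are across samples, so that rates grow when gradients agree and shrink when they conflict. This is an illustration under assumptions, not the paper's exact algorithm (which also involves curvature estimates); the base rate eta0, the smoothing factor rho, and the function name adaptive_rate_step are hypothetical.

```python
import numpy as np

def adaptive_rate_step(theta, g, state, eta0=0.1, rho=0.1, eps=1e-8):
    # Running averages of the gradient and of the squared gradient;
    # the ratio g_bar**2 / v_bar lies in [0, 1] (by Jensen's inequality)
    # and measures how consistent per-sample gradients are, per parameter.
    g_bar = state.get("g_bar", np.zeros_like(theta))
    v_bar = state.get("v_bar", np.ones_like(theta))
    g_bar = (1 - rho) * g_bar + rho * g
    v_bar = (1 - rho) * v_bar + rho * g * g

    # Per-parameter rates: large when gradients agree across samples
    # (low variance relative to the mean), small when they conflict.
    # Because the averages track the data, rates can increase as well
    # as decrease, which is what suits non-stationary problems.
    eta = eta0 * g_bar ** 2 / (v_bar + eps)
    state["g_bar"], state["v_bar"] = g_bar, v_bar
    return theta - eta * g, state

# Usage on a noisy 1-D quadratic: minimize E[(theta - x)^2], x ~ N(1, 0.5).
theta, state = np.array([5.0]), {}
for _ in range(2000):
    x = np.random.normal(1.0, 0.5)
    theta, state = adaptive_rate_step(theta, 2.0 * (theta - x), state)
```

Note the design choice in the sketch: the gradient-consistency ratio acts as an automatic annealing schedule, so no hand-tuned decay of the learning rate is needed.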